train
Train a Phota identity profile from 30-50 face images, poll status, list and delete profiles
Personalized image generation is harder than it looks. Generic text-to-image models produce stunning results for abstract prompts, but the moment you ask for a specific person - your face, your partner's face, a client's face - they fall apart. The likeness drifts, features average out, and you get something that resembles a distant cousin more than the actual subject. Solving this requires a fundamentally different approach: teaching a model who someone is before asking it to imagine them in new contexts.
Phota is a family of four apps on inference.sh that handles this entire workflow. You train an identity profile from reference photos, generate new images of that person in any scenario, edit existing photos while preserving their likeness, and enhance output quality. It's a complete personalized photography pipeline, built by a team that has run over 100,000 personalization fine-tunings across production workloads.
I want to be upfront about what this is and isn't. Phota won't replace a skilled portrait photographer who understands lighting, rapport, and the subtle direction that draws out genuine expression. What it will do is give you a tireless synthetic photographer who knows exactly what someone looks like and can place them in any context you can describe. Professional headshots, social media content, creative portraits, product shots with real people - all from training data you provide once.
training an identity profile
Everything starts with phota/train. You feed it 30 to 50 face images of your subject, and it builds an identity profile - a compressed representation of what makes that person look like themselves. Not just their features in isolation, but the relationships between features: how their jawline meets their neck, how their eyes crease when they smile, the specific geometry of their face from multiple angles.
The image count matters. Thirty is the minimum for reliable results. Fifty gives the model more angular coverage and expression variety to work with. I'd push toward the higher end whenever possible, especially if you want the model to handle a wide range of poses and lighting conditions in generation. Think of it like reference material for a portrait painter - more angles and expressions mean better generalization.
Quality of training images matters more than quantity past the minimum threshold. Clear faces, good lighting, variety of angles and expressions. A mix of casual and composed shots works better than fifty nearly-identical selfies from the same angle. The model needs to understand that your subject looks different when they laugh versus when they're concentrating, when they're lit from above versus from the side.
Training is a one-time cost per profile. Once trained, a profile persists and can be used across unlimited generation, editing, and enhancement calls. The economics here are straightforward: you pay to teach the model a face, then use that knowledge as often as you want at per-image rates.
The training process runs asynchronously. You submit images, receive a profile ID, and can either poll for status or set the wait flag to block until it completes. Status progresses through validation, queuing, and training stages before landing at ready. If something goes wrong - too few usable faces in your batch, insufficient variety, quality issues - you get a clear error rather than a silently degraded profile.
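As a sketch of that submit-and-poll loop - with the caveat that the stage names and response fields here are my assumptions rather than the documented schema - the client side can be as simple as:

```python
import time

def wait_for_ready(get_status, poll_interval=5.0, timeout=1800.0):
    """Poll a status callable until the profile is ready or fails.

    get_status should return a dict like {"status": "training"}; the
    field names and stage values are assumptions, not the real schema.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        state = get_status()
        status = state.get("status")
        if status == "ready":
            return state
        if status == "failed":
            # the api promises a clear error rather than a degraded profile
            raise RuntimeError(state.get("error", "training failed"))
        time.sleep(poll_interval)
    raise TimeoutError("profile training did not finish in time")
```

In practice `get_status` would be a closure that fetches the task state from the API; if you'd rather block, the wait flag does this for you server-side.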
One practical note: profiles can be deleted when you no longer need them. This isn't just housekeeping. If you're building a product where users train their own profiles, clean deletion matters for privacy compliance and user trust.
generating new images
With a trained profile in hand, phota/generate becomes your primary creative tool. The interface is text-to-image with one addition: you reference identity profiles inline using a bracket syntax. Write a prompt describing the scene, reference your profile ID where the subject appears, and the model produces an image that looks like them in that context.
The mental model is closer to directing a photo shoot than prompting a generic image model. You're not hoping the model stumbles into the right likeness. The identity is locked in. Your prompt handles everything else: environment, lighting, wardrobe, mood, composition. This separation of concerns - identity handled by the profile, everything else handled by the prompt - is what makes the results consistent across wildly different scenarios.
Resolution options include standard 1K and 4K output. The aspect ratio parameter supports preset ratios or auto-detection based on your prompt content. You can generate multiple images per call, which is useful when exploring different compositions or building a content batch.
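A request might be assembled like this - note that the exact inline bracket syntax and the input field names are my guesses from the description above, so treat them as placeholders:

```python
def with_profiles(template, profiles):
    """Substitute named placeholders with bracketed profile references.

    The exact inline syntax is an assumption - the article only says
    profiles are referenced "using a bracket syntax".
    """
    for name, profile_id in profiles.items():
        template = template.replace("{" + name + "}", f"[{profile_id}]")
    return template

prompt = with_profiles(
    "{subject} in a sunlit studio, soft rim lighting, 85mm portrait",
    {"subject": "prof_abc123"},  # hypothetical profile id
)
# prompt == "[prof_abc123] in a sunlit studio, soft rim lighting, 85mm portrait"

request = {
    "app": "phota/generate",
    "input": {
        "prompt": prompt,
        "resolution": "1k",      # field names here are assumptions
        "aspect_ratio": "auto",
        "num_images": 4,
    },
}
# result = client.run(request)  # requires the inferencesh client + api key
```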
Where I find this most powerful is repetitive creative production. A founder who needs twenty different LinkedIn-appropriate headshots across various backgrounds. A content creator who wants consistent personal branding across dozens of posts. An e-commerce brand that needs a model in fifteen outfit combinations without scheduling fifteen separate shoots. The fixed identity plus variable context pattern scales in ways physical photography cannot.
The honest limitation: extremely complex full-body poses with specific hand positions or intricate physical interactions still challenge the model. Faces remain strong across the board. Body proportions hold well. But if you need the subject doing something very specific with their hands while their face is partially occluded, expect to generate several variations and pick the best one.
editing existing photos
phota/edit occupies a different niche than generation. Instead of creating images from scratch, it modifies existing photos while preserving the identity of known subjects. You provide an image, reference profile IDs for subjects who appear in it, and describe your desired edit in natural language.
The identity preservation during editing is the key differentiator from generic inpainting tools. When you ask a standard image editor to change someone's outfit or modify the background, it often subtly alters facial features in the process. Phota's edit function anchors the subject's identity throughout the transformation. Change the wardrobe, swap the background, adjust the lighting treatment, add or remove accessories - the face stays locked to the profile.
This opens up workflows that would otherwise require reshooting. A headshot where the background doesn't work for a particular use case. A group photo where one person's expression is off. A product shot where the model's outfit needs updating for a new season. You edit around the person rather than through them.
The pricing matches generation rates. Input accepts multiple images and multiple profile IDs, handling cases where several known subjects appear in the same frame.
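A helper that assembles an edit request for a frame with several known subjects could look like this sketch - the input field names are assumptions:

```python
def build_edit_request(images, profile_ids, instruction):
    """Assemble a phota/edit request. Field names are assumptions."""
    if not images:
        raise ValueError("need at least one image to edit")
    return {
        "app": "phota/edit",
        "input": {
            "images": list(images),           # local paths or urls
            "profile_ids": list(profile_ids),  # subjects appearing in the frame
            "prompt": instruction,
        },
    }

req = build_edit_request(
    ["/shots/team.png"],
    ["prof_abc123", "prof_def456"],  # hypothetical profile ids
    "swap the background for a neutral gray studio wall",
)
# result = client.run(req)  # requires the inferencesh client + api key
```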
I find editing particularly valuable as a second pass after generation. Generate a batch of images, pick the ones with the best composition and expression, then use edit to refine details you're not satisfied with. The two tools complement each other well as stages in a production pipeline rather than isolated capabilities.
enhancing output quality
The final piece is phota/enhance, which handles automated quality improvement - lighting correction, color grading, sharpness optimization, and compositional refinement. It accepts an image and optional profile IDs, then returns an improved version.
This is the least conceptually complex tool in the family, but it serves an important role in the pipeline. Generated images sometimes land at 90% quality - the composition is right, the identity is preserved, the scenario works, but the lighting feels flat or the color balance skews slightly. Enhancement brings these to a polished final state without manual post-processing.
The profile ID input here is optional but useful. When the enhancer knows which subjects are identity-locked, it can make aggressive improvements to environmental elements without accidentally shifting facial characteristics. Light the background differently, push the color grade further, sharpen details selectively. The identity profiles act as constraints that allow bolder enhancement choices.
Enhancement is cheap enough to apply broadly rather than selectively. Run it on your entire generated batch; an already well-lit, well-composed image simply comes back mostly unchanged. Use it as a consistent finishing step rather than a rescue tool.
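Running enhancement as a blanket finishing pass over a batch is a short loop. Here's a sketch where `run` is any client.run-style callable and the input field names are assumed:

```python
def enhance_batch(run, image_urls, profile_ids=()):
    """Run phota/enhance over a batch of images.

    run is a callable with the client.run signature; the input
    field names in the request dict are assumptions.
    """
    outputs = []
    for url in image_urls:
        result = run({
            "app": "phota/enhance",
            "input": {"image": url, "profile_ids": list(profile_ids)},
        })
        outputs.append(result["output"])
    return outputs
```

Injecting `run` as a parameter also makes the loop trivial to test with a stub before pointing it at the live API.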
the full workflow in practice
The four apps form a pipeline that I think about in three stages. First, investment: train profiles for anyone you'll generate content featuring. This is the foundation, and getting it right - varied angles, good quality input, sufficient quantity - pays dividends across everything downstream. Second, production: generate and edit until you have the content you need. Third, finishing: enhance the final selections.
For a practical example, consider building a personal brand content library. You train one profile of yourself. Then you generate headshots across ten different backgrounds and lighting styles. You pick the five best, edit two of them to swap in wardrobe options that better match your brand colors. You run all five through enhancement. The total cost is modest - a fraction of what a traditional photo shoot demands - and the total time is minutes rather than hours, plus you skip the scheduling, travel, and coordination overhead entirely.
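The three stages can be wired together in a few lines. This is a sketch, not the real API surface: field names, response shapes, and the selection step (shown here as a naive slice standing in for manual review) are all assumptions:

```python
def brand_library(run, profile_id, backgrounds, keep=5):
    """Generate one headshot per background, keep a few, enhance them."""
    candidates = []
    for bg in backgrounds:
        result = run({
            "app": "phota/generate",
            "input": {"prompt": f"[{profile_id}] professional headshot, {bg}",
                      "num_images": 1},
        })
        candidates.extend(result["output"]["images"])
    selected = candidates[:keep]  # stand-in for picking favorites by hand
    return [
        run({"app": "phota/enhance", "input": {"image": img}})["output"]
        for img in selected
    ]
```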
The tradeoff is authenticity of expression. A skilled photographer captures genuine micro-expressions, unexpected moments, the spark of a real interaction. Generated images are inherently posed - they reflect what you described, not what happened. For some use cases this doesn't matter. For others it's the entire point. Know which category you're in before building a workflow around synthetic imagery.
multi-subject scenarios
One detail worth expanding on: profiles work in combination. Both the generate and edit apps accept multiple profile IDs, meaning you can create images with several known subjects appearing together. A team photo with five trained identities. A couple's portrait where both faces are identity-locked. A family holiday card where every member looks like themselves.
This scales the usefulness considerably for brands and teams. Train profiles for your leadership team once, then produce group shots, event imagery, and marketing materials featuring specific people in specific combinations without coordinating everyone's calendars. The logistics savings alone justify the training investment for organizations producing regular content.
who this is actually for
Phota fits three primary audiences well. Content creators who need a steady stream of personal imagery without constant photo shoots. Brands that feature specific people in marketing materials and need scalable, consistent production. Developers building products where personalized imagery is a feature - dating apps, avatar systems, personalization layers, professional networking tools.
It fits less well if you need absolute photorealistic perfection at very high resolution for print production. It also fits less well if the expressiveness and spontaneity of real photography is central to your use case - editorial fashion, documentary work, photojournalism. These are categories where the constraints of generative models remain visible to a trained eye.
frequently asked questions
how many training images do I actually need for good results?
The minimum is 30 images and you can submit up to 50. In my experience, 40 or more produces noticeably better results across diverse generation scenarios, particularly for unusual poses and lighting conditions. The variety of your input matters as much as the count. Twenty selfies from the same angle plus ten more similar shots will produce a worse profile than thirty-five photos spanning different angles, expressions, and lighting. Include some profile views, some looking up or down, some smiling, some neutral. Give the model the full geometry of the face.
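A pre-submission sanity check is cheap insurance. Only the 30-50 count limits come from the docs; the duplicate check is an extra local heuristic of mine, not an API requirement:

```python
def check_training_set(image_paths, min_images=30, max_images=50):
    """Validate a training set locally before submitting it."""
    n = len(image_paths)
    if n < min_images:
        raise ValueError(f"need at least {min_images} images, got {n}")
    if n > max_images:
        raise ValueError(f"at most {max_images} images accepted, got {n}")
    if len(set(image_paths)) != n:
        raise ValueError("duplicate images in training set")
    return n
```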
can I use the same profile across all four apps indefinitely?
Yes. A profile is trained once and persists until you explicitly delete it. You can generate, edit, and enhance against that profile without limits at the standard per-image rates. Profiles don't degrade over time or expire after a certain number of uses. This makes the economics very favorable for ongoing content production - the per-image marginal cost stays flat regardless of volume.
what happens if generation produces an image where the likeness is slightly off?
This occasionally happens with extreme angles, heavy occlusion, or unusual lighting conditions in the prompt. The practical solution is to generate multiple images per prompt and select the best result. Generating four or five variations and picking the strongest one is still dramatically cheaper and faster than a reshoot. You can also use the edit app to refine a near-miss result, adjusting specific elements while the identity system holds the likeness stable.
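The generate-several-and-pick pattern is easy to wrap. Here `score` is whatever quality metric you trust (a face-similarity model, or your own eyes via a review step), and the request and response shapes are assumptions:

```python
def best_of(run, request, score, n=4):
    """Generate n variations of the same request, keep the highest-scoring.

    run is a client.run-style callable; score maps an image to a number.
    """
    images = []
    for _ in range(n):
        result = run(request)
        images.extend(result["output"]["images"])
    return max(images, key=score)
```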
api reference
about
train a phota identity profile from 30-50 face images, poll status, list and delete profiles
1. calling the api
install the client
the client provides a convenient way to interact with the api.
```shell
pip install inferencesh
```

setup your api key
set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.
```shell
export INFERENCE_API_KEY="inf_your_key"
```

run and get result
submit a request and wait for the final result. best for batch processing or when you don't need progress updates.
```python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "phota/train",
    "input": {}
})

print(result["output"])
```

stream live updates
get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.
```python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "phota/train",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")
```

2. authentication
the api uses api keys for authentication. see the authentication docs for detailed setup instructions.
3. files
file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.
automatic upload
the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.
```python
# local file paths are automatically uploaded
result = client.run({
    "app": "phota/train",
    "input": {
        "image": "/path/to/local/image.png",       # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})
```

4. webhooks
get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.
```python
result = client.run({
    "app": "phota/train",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)
```

webhook payload
your endpoint receives a JSON POST with the task result:
```json
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}
```

5. schema