Performance optimization and coding patterns for inference.sh apps.
## Device Management

Never hardcode `"cuda"`; use `accelerate` for automatic device detection:

```python
from accelerate import Accelerator

class App(BaseApp):
    async def setup(self, config):
        self.accelerator = Accelerator()
        self.device = self.accelerator.device
```

## Model Loading
Use the Hugging Face Hub for downloads:

```python
import os

# Must be set before importing huggingface_hub, which reads it at import time
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

class App(BaseApp):
    async def setup(self, config):
        self.model_path = snapshot_download(
            repo_id="org/model-name",
            resume_download=True,
        )
```

Avoid:

- Hardcoded local directories (`local_dir="./models"`)
- Subprocess calls to `huggingface-cli`
- Assuming specific file structures
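Rather than assuming a specific file structure inside the snapshot, discover files at runtime. A minimal stdlib sketch (the helper name and the shard filenames below are invented for illustration; in practice the directory would be the path returned by `snapshot_download`):

```python
import tempfile
from pathlib import Path

def find_weight_files(snapshot_dir: str) -> list[str]:
    """Locate weight shards without assuming a fixed file layout."""
    return sorted(str(p) for p in Path(snapshot_dir).glob("*.safetensors"))

# Demo with a fake snapshot directory standing in for a real download
with tempfile.TemporaryDirectory() as snap:
    for name in ("model-00001-of-00002.safetensors",
                 "model-00002-of-00002.safetensors",
                 "config.json"):
        (Path(snap) / name).touch()
    weights = find_weight_files(snap)
    print(len(weights))  # 2 shard files found
```

Globbing keeps the app working even when a repo ships a different number of shards than you tested with.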
## Memory Management

### Clean Up GPU Memory

```python
import gc

import torch

def cleanup_memory():
    # Drop dangling Python references first so the allocator can release them
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

async def run(self, input_data):
    result = self.model(input_data)
    cleanup_memory()
    return result
```

### Use Mixed Precision
```python
import torch
from torch.amp import autocast

model = model.to(dtype=torch.bfloat16)

# Or with autocast
with autocast('cuda'):
    output = model(inputs)
```

## Error Handling
```python
import logging

async def run(self, input_data):
    try:
        result = self.process(input_data)
        return AppOutput(result=result)
    except Exception as e:
        logging.error(f"Processing failed: {e}")
        # Chain the original exception so the full traceback is preserved
        raise ValueError(f"Failed to process: {e}") from e
```

## File Handling
```python
import tempfile

async def run(self, input_data):
    result = self.process(input_data)
    # delete=False keeps the file alive after the with block so it can be returned
    with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as tmp:
        output_path = tmp.name

    save_image(result, output_path)
    return AppOutput(image=File(path=output_path))
```

## Flash Attention
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```

## Pre-deployment Checklist
- All imports work correctly
- `setup()` loads models successfully
- `run()` processes test input
- No hardcoded paths or devices
- Proper error handling
- Memory cleaned up after requests
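Much of this checklist can be automated with a small smoke test that drives `setup()` and `run()` end to end before deploying. A hedged sketch using stand-in stubs (the real `BaseApp`, `AppOutput`, and config types come from the inference.sh SDK; the lambda model is a placeholder):

```python
import asyncio

# Stubs standing in for the inference.sh SDK types (illustrative only)
class AppOutput:
    def __init__(self, result):
        self.result = result

class App:
    async def setup(self, config):
        # Placeholder for real model loading
        self.model = lambda x: x.upper()

    async def run(self, input_data):
        return AppOutput(result=self.model(input_data))

async def smoke_test():
    app = App()
    await app.setup(config={})       # setup() loads models successfully
    output = await app.run("hello")  # run() processes test input
    assert isinstance(output, AppOutput)
    return output.result

print(asyncio.run(smoke_test()))  # HELLO
```

Running this against the real app class with a representative test input catches import errors, missing weights, and broken signatures before they reach production.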
## Next

→ Output Meta - Usage-based pricing