inference.sh

Best Practices

Performance optimization and coding patterns for inference.sh apps.


Device Management

Never hardcode "cuda" - use accelerate for automatic device detection:

```python
from accelerate import Accelerator

class App(BaseApp):
    async def setup(self, config):
        self.accelerator = Accelerator()
        self.device = self.accelerator.device
```
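If accelerate is not available in an environment, the same idea can be sketched by probing for hardware directly. `pick_device` is a hypothetical helper, not part of any library:

```python
def pick_device() -> str:
    """Fallback device selection when accelerate is unavailable (a sketch)."""
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
        mps = getattr(torch.backends, "mps", None)
        if mps is not None and mps.is_available():
            return "mps"  # Apple Silicon
    except ImportError:
        pass  # keeps the sketch runnable where torch is not installed
    return "cpu"
```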

Model Loading

Use the Hugging Face Hub for downloads:

```python
import os
from huggingface_hub import snapshot_download

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

class App(BaseApp):
    async def setup(self, config):
        self.model_path = snapshot_download(
            repo_id="org/model-name",
            resume_download=True,
        )
```

Avoid:

  • Hardcoded local directories (local_dir="./models")
  • Subprocess calls to huggingface-cli
  • Assuming specific file structures
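Instead of assuming a fixed layout inside the snapshot, weight files can be discovered by walking the downloaded directory. `find_weights` is a hypothetical helper used only for illustration:

```python
import os

def find_weights(model_dir, exts=(".safetensors", ".bin")):
    """Walk model_dir and return sorted paths of weight files, whatever their names."""
    return sorted(
        os.path.join(root, name)
        for root, _dirs, files in os.walk(model_dir)
        for name in files
        if name.endswith(exts)
    )
```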

Memory Management

Clean Up GPU Memory

```python
import torch
import gc

def cleanup_memory():
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    gc.collect()

async def run(self, input_data):
    result = self.model(input_data)
    cleanup_memory()
    return result
```
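A cleanup call placed after the model call runs only on success. A context-manager sketch (the `gpu_memory_guard` name is hypothetical) frees memory even when the request raises:

```python
import gc
from contextlib import contextmanager

@contextmanager
def gpu_memory_guard():
    """Free the CUDA cache and run Python GC even if the wrapped block raises."""
    try:
        yield
    finally:
        try:
            import torch
            if torch.cuda.is_available():
                torch.cuda.empty_cache()
        except ImportError:
            pass  # keeps the sketch runnable on CPU-only machines
        gc.collect()
```

Usage: wrap the inference call, e.g. `with gpu_memory_guard(): result = self.model(input_data)`.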

Use Mixed Precision

```python
model = model.to(dtype=torch.bfloat16)

# Or with autocast
from torch.amp import autocast
with autocast('cuda'):
    output = model(input)
```
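bfloat16 keeps float32's exponent range but only 8 mantissa bits, so it halves memory at the cost of precision. A pure-Python sketch of the truncation (not how PyTorch implements it internally) makes the trade-off concrete:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Simulate bfloat16 by zeroing the low 16 bits of the float32 encoding."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return struct.unpack(">f", struct.pack(">I", bits & 0xFFFF0000))[0]
```

For example, `to_bfloat16(3.14159265)` returns `3.140625`: roughly three significant decimal digits survive, which is typically acceptable for inference activations.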

Error Handling

Log failures and re-raise with context:

```python
import logging

async def run(self, input_data):
    try:
        result = self.process(input_data)
        return AppOutput(result=result)
    except Exception as e:
        logging.error(f"Processing failed: {e}")
        # "from e" preserves the original traceback for debugging
        raise ValueError(f"Failed to process: {e}") from e
```

File Handling

Write results to temporary files and return them as File outputs:

```python
import tempfile

async def run(self, input_data):
    # delete=False keeps the file on disk after the context manager closes it
    with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as tmp:
        output_path = tmp.name
    save_image(result, output_path)
    return AppOutput(image=File(path=output_path))
```
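When the output is already in memory as bytes, `tempfile.mkstemp` avoids reopening the path after `NamedTemporaryFile` closes it. `write_temp_output` is a hypothetical helper, shown as a sketch:

```python
import os
import tempfile

def write_temp_output(data: bytes, suffix: str = ".jpg") -> str:
    """Write bytes to a uniquely named temp file and return its path."""
    fd, path = tempfile.mkstemp(suffix=suffix)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path
```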

Flash Attention

Enable Flash Attention 2 for faster attention on supported GPUs:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```
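`flash_attention_2` requires the separate flash-attn package; when it is missing, a sketch like the following (the helper name is hypothetical) can degrade to PyTorch's built-in scaled-dot-product attention:

```python
def pick_attn_implementation() -> str:
    """Use flash_attention_2 only if the flash-attn package is importable."""
    try:
        import flash_attn  # noqa: F401
        return "flash_attention_2"
    except ImportError:
        return "sdpa"  # PyTorch's built-in attention, no extra install needed
```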

Pre-deployment Checklist

  • All imports work correctly
  • setup() loads models successfully
  • run() processes test input
  • No hardcoded paths or devices
  • Proper error handling
  • Memory cleaned up after requests
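The checklist above can be exercised with a small smoke test. The `setup()`/`run()` interface follows the snippets in this guide; everything else here is an assumption for illustration:

```python
import asyncio

async def smoke_test(app, config, sample_input):
    """Call setup() then run() once and return the output for inspection."""
    await app.setup(config)
    return await app.run(sample_input)

# Usage sketch:
# output = asyncio.run(smoke_test(App(), config, sample_input))
```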

