Performance optimization and coding patterns for inference.sh apps.
## Device Management

Never hardcode `"cuda"`; use `accelerate` for automatic device detection:

```python
from accelerate import Accelerator

class App(BaseApp):
    async def setup(self, config):
        self.accelerator = Accelerator()
        self.device = self.accelerator.device
```

## Model Loading
Use the Hugging Face Hub for downloads:

```python
import os

# Must be set before importing huggingface_hub, which reads it at import time
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

from huggingface_hub import snapshot_download

class App(BaseApp):
    async def setup(self, config):
        self.model_path = snapshot_download(
            repo_id="org/model-name",
            resume_download=True,
        )
```

Avoid:

- Hardcoded local directories (`local_dir="./models"`)
- Subprocess calls to `huggingface-cli`
- Assuming specific file structures
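Rather than assuming a specific file structure inside the snapshot, discover files at runtime. A minimal stdlib sketch (the helper name and the shard filenames below are invented for illustration; in practice the directory would be the path returned by `snapshot_download`):

```python
import tempfile
from pathlib import Path

def find_weight_files(snapshot_dir: str) -> list[str]:
    """Locate weight shards without assuming a fixed file layout."""
    return sorted(str(p) for p in Path(snapshot_dir).glob("*.safetensors"))

# Demo with a fake snapshot directory standing in for a real download
with tempfile.TemporaryDirectory() as snap:
    for name in ("model-00001-of-00002.safetensors",
                 "model-00002-of-00002.safetensors",
                 "config.json"):
        (Path(snap) / name).touch()
    weights = find_weight_files(snap)
    print(len(weights))  # 2 shard files found
```

Globbing keeps the app working even when a repo ships a different number of shards than you tested with.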
## Memory Management

### Clean Up GPU Memory

```python
import gc

import torch

def cleanup_memory():
    # Drop dangling Python references first so the allocator can release them
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

async def run(self, input_data):
    result = self.model(input_data)
    cleanup_memory()
    return result
```

### Use Mixed Precision
```python
import torch
from torch.amp import autocast

model = model.to(dtype=torch.bfloat16)

# Or with autocast
with autocast('cuda'):
    output = model(inputs)
```

## Error Handling
```python
import logging

async def run(self, input_data):
    try:
        result = self.process(input_data)
        return AppOutput(result=result)
    except Exception as e:
        logging.error(f"Processing failed: {e}")
        # Chain the original exception so the full traceback is preserved
        raise ValueError(f"Failed to process: {e}") from e
```

## File Handling
```python
import tempfile

async def run(self, input_data):
    result = self.process(input_data)
    # delete=False keeps the file alive after the with block so it can be returned
    with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as tmp:
        output_path = tmp.name

    save_image(result, output_path)
    return AppOutput(image=File(path=output_path))
```

## Flash Attention
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)
```

## Pre-deployment Checklist
- All imports work correctly
- `setup()` loads models successfully
- `run()` processes test input
- No hardcoded paths or devices
- Proper error handling
- Memory cleaned up after requests
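Much of this checklist can be automated with a small smoke test that drives `setup()` and `run()` end to end before deploying. A hedged sketch using stand-in stubs (the real `BaseApp`, `AppOutput`, and config types come from the inference.sh SDK; the lambda model is a placeholder):

```python
import asyncio

# Stubs standing in for the inference.sh SDK types (illustrative only)
class AppOutput:
    def __init__(self, result):
        self.result = result

class App:
    async def setup(self, config):
        # Placeholder for real model loading
        self.model = lambda x: x.upper()

    async def run(self, input_data):
        return AppOutput(result=self.model(input_data))

async def smoke_test():
    app = App()
    await app.setup(config={})       # setup() loads models successfully
    output = await app.run("hello")  # run() processes test input
    assert isinstance(output, AppOutput)
    return output.result

print(asyncio.run(smoke_test()))  # HELLO
```

Running this against the real app class with a representative test input catches import errors, missing weights, and broken signatures before they reach production.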
## Next

→ Output Meta - Usage-based pricing