
Best Practices

Performance optimization and coding patterns for inference.sh apps.


Device Management

Never hardcode "cuda" - use accelerate for automatic device detection:

python
from accelerate import Accelerator

class App(BaseApp):
    async def setup(self, config):
        self.accelerator = Accelerator()
        self.device = self.accelerator.device

Model Loading

Use HuggingFace hub for downloads:

python
import os
from huggingface_hub import snapshot_download

os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"

class App(BaseApp):
    async def setup(self, config):
        self.model_path = snapshot_download(
            repo_id="org/model-name",
            resume_download=True,
        )

Avoid:

  • Hardcoded local directories (local_dir="./models")
  • Subprocess calls to huggingface-cli
  • Assuming specific file structures

Memory Management

Clean Up GPU Memory

Release cached GPU memory after each request:

python
import torch
import gc

def cleanup_memory():
    # Collect Python garbage first so dropped tensor references
    # release their blocks, then return them to the allocator
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()

async def run(self, input_data):
    result = self.model(input_data)
    cleanup_memory()
    return result

Use Mixed Precision

bfloat16 halves memory use relative to float32 and is typically safe for inference:

python
model = model.to(dtype=torch.bfloat16)

# Or with autocast
from torch.amp import autocast

with autocast('cuda'):
    output = model(inputs)  # avoid shadowing the built-in `input`

Error Handling

Log failures and re-raise with context so callers see a meaningful message:

python
import logging

async def run(self, input_data):
    try:
        result = self.process(input_data)
        return AppOutput(result=result)
    except Exception as e:
        logging.error(f"Processing failed: {e}")
        raise ValueError(f"Failed to process: {e}") from e
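Chaining with `raise ... from e` preserves the original exception as `__cause__`, so logs and debuggers still show the root cause. A minimal stdlib sketch (the failing `process` stub is hypothetical, standing in for real inference):

```python
import logging

def process(data):
    # Hypothetical stub that always fails, standing in for real inference
    raise RuntimeError("model error")

def run(data):
    try:
        return process(data)
    except Exception as e:
        logging.error(f"Processing failed: {e}")
        # `from e` records the original exception as __cause__
        raise ValueError(f"Failed to process: {e}") from e
```

Code that catches the re-raised `ValueError` upstream can still inspect the root cause via `err.__cause__`.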

File Handling

Write outputs to temporary files rather than hardcoded paths:

python
import tempfile

async def run(self, input_data):
    with tempfile.NamedTemporaryFile(suffix='.jpg', delete=False) as tmp:
        output_path = tmp.name

    save_image(result, output_path)
    return AppOutput(image=File(path=output_path))
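Because `delete=False` leaves the file on disk, a failed write can leak orphaned temp files. One way to guard against that is a small helper that deletes the file when the writer raises; `with_temp_output` below is a hypothetical illustration, not part of the SDK:

```python
import os
import tempfile

def with_temp_output(write_fn, suffix=".jpg"):
    # Create a named temp file, let the caller write to it,
    # and remove it again if the write raises.
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)
    try:
        write_fn(path)
        return path
    except Exception:
        os.unlink(path)  # don't leave orphaned files on failure
        raise
```

The caller then only ever sees paths of files that were written successfully.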

Flash Attention

Enable FlashAttention-2 for faster, more memory-efficient attention on supported GPUs:

python
model = AutoModel.from_pretrained(
    "model-name",
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
)

Pre-deployment Checklist

  • All imports work correctly
  • setup() loads models successfully
  • run() processes test input
  • No hardcoded paths or devices
  • Proper error handling
  • Memory cleaned up after requests
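The checklist above can be exercised with a local smoke test before deploying. In this sketch, `BaseApp` is a stub standing in for the real SDK base class and the model is a trivial callable; both are assumptions for illustration:

```python
import asyncio

class BaseApp:  # stub standing in for the real SDK base class (assumption)
    pass

class App(BaseApp):
    async def setup(self, config):
        # Stand-in for real model loading
        self.model = lambda text: text.upper()

    async def run(self, input_data):
        return self.model(input_data)

async def smoke_test():
    app = App()
    await app.setup(config={})          # setup() loads models successfully
    result = await app.run("test input")  # run() processes test input
    assert result, "run() returned an empty result"
    return result
```

Run it with `asyncio.run(smoke_test())`; import errors, setup failures, and empty outputs all surface immediately.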

