inference.sh

Audio Transcription

Convert speech to text with Whisper.


Using the Grid

  1. Go to Apps
  2. Find whisper
  3. Upload an audio file
  4. Click Run

Supported formats: MP3, WAV, M4A, FLAC, OGG
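A quick local check before uploading can catch unsupported files early. The sketch below is only an illustration: `is_supported_audio` is a hypothetical helper, not part of the SDK, and it looks at the file extension only, not at the actual file contents.

```python
from pathlib import Path

# Extensions taken from the supported formats listed above.
SUPPORTED_EXTENSIONS = {".mp3", ".wav", ".m4a", ".flac", ".ogg"}


def is_supported_audio(path):
    """Return True if the file extension matches a supported format.

    Hypothetical helper: checks the extension only, case-insensitively.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS
```

For example, `is_supported_audio("/path/to/audio.mp3")` passes, while a `.mp4` video file is rejected before any upload is attempted.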


With Python SDK

python
from inferencesh import inference

client = inference(api_key="inf_your_key")

result = client.run({
    "app": "infsh/whisper",
    "input": {
        "audio": "/path/to/audio.mp3"  # Uploaded automatically
    }
})

print(result["output"]["text"])
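The example above hard-codes a placeholder key. One common alternative is to read the key from the environment so it stays out of source control. This is a sketch under assumptions: `load_api_key` is a hypothetical helper and `INFERENCE_API_KEY` is an illustrative variable name, not one the SDK is known to read on its own.

```python
import os


def load_api_key(env_var="INFERENCE_API_KEY"):
    """Read the API key from the environment, failing loudly if it is missing.

    The variable name is an assumption made for this example.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable to your API key")
    return key
```

The result could then be passed along as `inference(api_key=load_api_key())`.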

With Timestamps

Get word-level or segment-level timestamps:

python
result = client.run({
    "app": "infsh/whisper",
    "input": {
        "audio": "/path/to/audio.mp3",
        "timestamps": "word"  # or "segment"
    }
})

for segment in result["output"]["segments"]:
    print(f"[{segment['start']:.1f}s] {segment['text']}")

Output:

code
[0.0s] Welcome to today's episode.
[2.1s] We're going to discuss AI agents.
[4.5s] Let's get started.
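For longer recordings, clock-style timestamps are often easier to read than raw seconds. The sketch below assumes each segment is a dict with a `start` offset in seconds and a `text` field, as in the example above; `format_clock` and `render_transcript` are hypothetical helpers, not SDK functions.

```python
def format_clock(seconds):
    """Format a time offset in seconds as HH:MM:SS (fractions are dropped)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def render_transcript(segments):
    """Join segments into one string, one timestamped line per segment."""
    lines = []
    for segment in segments:
        stamp = format_clock(segment["start"])
        lines.append(f"[{stamp}] {segment['text'].strip()}")
    return "\n".join(lines)
```

Passing `result["output"]["segments"]` to `render_transcript` would then give one line per segment, assuming the output has the shape shown above.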

With Translation

Translate audio to English:

python
result = client.run({
    "app": "infsh/whisper",
    "input": {
        "audio": "/path/to/french_audio.mp3",
        "task": "translate"  # Translates to English
    }
})

With Agent

Give an agent whisper as a tool, and it can transcribe audio on request:

code
You: Transcribe this podcast episode

Agent: I'll transcribe that audio for you.
       [Calling whisper...]

       Here's the transcription:

       "Welcome to the AI Weekly podcast. Today we're
       discussing the latest developments in generative AI..."

       The full transcription is 2,450 words. Would you like
       me to summarize it?

Model Selection

Use setup parameters to choose a specific Whisper model size:

python
client.run({
    "app": "infsh/whisper",
    "setup": {
        "model_size": "large-v3"  # tiny, base, small, medium, large, large-v3
    },
    "input": { ... }
})
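A typo in the model size is easy to make, so it can help to check the value before sending a request. The following is a minimal sketch: `build_request` is a hypothetical helper, the list of sizes is copied from the comment in the example above, and the default of `large-v3` is chosen here for illustration only.

```python
# Sizes copied from the example above; the service may accept others.
MODEL_SIZES = ("tiny", "base", "small", "medium", "large", "large-v3")


def build_request(audio, model_size="large-v3"):
    """Build a request dict with a validated model size.

    Hypothetical helper: the default size is an assumption for this sketch.
    """
    if model_size not in MODEL_SIZES:
        raise ValueError(
            f"Unknown model_size {model_size!r}; expected one of {MODEL_SIZES}"
        )
    return {
        "app": "infsh/whisper",
        "setup": {"model_size": model_size},
        "input": {"audio": audio},
    }
```

The returned dict has the same shape as the argument passed to `client.run` in the examples above.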

Options

Option      Values                 Description
task        transcribe, translate  Transcribe, or translate to English
language    en, es, fr, etc.       Force a specific language instead of auto-detecting it
timestamps  none, segment, word    Timestamp granularity
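The options above can be checked locally before a request is sent. The sketch below uses a hypothetical helper, `build_whisper_input`, and assumes that all three options belong inside the `"input"` dict. The earlier examples show this for `task` and `timestamps`; for `language` it is an assumption. Options left unset are omitted so that the service's own defaults apply.

```python
VALID_TASKS = {"transcribe", "translate"}
VALID_TIMESTAMPS = {"none", "segment", "word"}


def build_whisper_input(audio, task=None, language=None, timestamps=None):
    """Build the "input" dict, rejecting values not listed in the options table."""
    if task is not None and task not in VALID_TASKS:
        raise ValueError(f"task must be one of {sorted(VALID_TASKS)}")
    if timestamps is not None and timestamps not in VALID_TIMESTAMPS:
        raise ValueError(f"timestamps must be one of {sorted(VALID_TIMESTAMPS)}")

    payload = {"audio": audio}
    # Only include options the caller actually set.
    if task is not None:
        payload["task"] = task
    if language is not None:
        payload["language"] = language  # e.g. "en", "es", "fr"
    if timestamps is not None:
        payload["timestamps"] = timestamps
    return payload
```

The result can be used as the `"input"` value in a `client.run` call, for example `build_whisper_input("/path/to/french_audio.mp3", task="translate", language="fr")`.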

Tips

For better accuracy:

  • Use high-quality audio
  • Remove background noise if possible
  • Specify language if known

For long files:

  • Whisper handles files up to ~2 hours
  • Very long files are automatically chunked
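To see whether a recording is near the roughly two-hour figure mentioned above, its duration can be measured locally. This sketch uses Python's standard `wave` module, so it works for uncompressed WAV files only; the helper names are hypothetical and the limit constant is an approximation of the "~2 hours" stated here, not an exact service limit.

```python
import wave

# Approximation of the "~2 hours" figure above, in seconds.
MAX_SECONDS = 2 * 60 * 60


def wav_duration_seconds(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def within_length_limit(path, max_seconds=MAX_SECONDS):
    """Return True if the WAV file is no longer than max_seconds."""
    return wav_duration_seconds(path) <= max_seconds
```

Other formats such as MP3 or M4A would need a third-party audio library to measure, which is outside the scope of this sketch.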

Next

Content Pipeline
