
search-assistant

Helps users create and refine search queries, retrieve relevant results from various sources, and generate overviews or summaries of the information found.

run with your agent
# install belt
$ curl -fsSL https://cli.inference.sh | sh
# view schema & details
$ belt app get infsh/search-assistant
# run
$ belt app run infsh/search-assistant

Every platform has its headline acts. The image generators, the video models, the LLMs with their increasingly impressive reasoning. Those get the blog posts and the Twitter threads. But behind every polished AI workflow sits a layer of plumbing that nobody talks about - shell execution, file manipulation, media processing, format conversion. The kind of work that takes five minutes to describe and five days to build correctly. That's what the infsh utility toolkit is: the built-in collection of glue apps that inference.sh provides so you can stop reinventing file handling and start building the thing you actually came here to build.

I want to be upfront about what this article covers. These are not third-party integrations or wrapped APIs. These are inference.sh's own apps, maintained under the infsh/ namespace, designed to fill the gaps between the flashy model calls. Some are genuinely clever. Some are deceptively simple. All of them exist because someone tried to build an agent workflow without them and discovered, painfully, that they were necessary.

execution environments

Two apps handle code execution, and they serve fundamentally different roles in agent architectures.

infsh/shell is a sandboxed shell environment. Your agent can run grep, sed, ls, find, and other standard CLI tools with a configurable working directory and timeout. The default timeout is 30 seconds, which is enough for most file inspection tasks but worth bumping up if your agent needs to process larger datasets. The sandbox is the key detail here. Agents that can execute arbitrary shell commands on unsecured infrastructure are a security incident waiting to happen. The sandboxed environment means your agent gets the power of a Unix command line without the risk of it deleting your production database at 3am.
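
Here's roughly what an agent-issued shell call looks like through the inferencesh client covered in the api reference below. The input field names (command, working_dir, timeout) are assumptions rather than the published schema - run `belt app get infsh/shell` for the real one.

python
from inferencesh import inference
client = inference()

# hypothetical input field names; check the app schema for the real ones
result = client.run({
    "app": "infsh/shell",
    "input": {
        "command": "ls -la data && wc -l data/*.csv",  # assumed field name
        "working_dir": "/workspace",                   # assumed field name
        "timeout": 60,                                 # assumed: raised past the 30s default
    }
})
print(result["output"])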

infsh/python-executor takes this further by running Python code in a safe environment. It captures stdout and stderr separately, supports a configurable timeout up to 300 seconds, and can return generated files as outputs. The practical use case is data transformation. Your agent generates an image with DALL-E, needs to crop it to specific dimensions, and the Python executor runs a quick Pillow script to handle it. Or your workflow produces a CSV that needs statistical analysis before the next step can proceed. Python execution bridges the gap between what LLMs can reason about and what they can actually compute.
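
A sketch of that crop scenario, assuming field names like code and files that the actual schema may spell differently:

python
from inferencesh import inference
client = inference()

# the crop script the agent wants to run; filenames are illustrative
crop_script = """
from PIL import Image
img = Image.open("generated.png")
img.crop((0, 0, 1024, 1024)).save("cropped.png")
"""

# "code" and "files" are assumed field names; check the app schema for the real ones
result = client.run({
    "app": "infsh/python-executor",
    "input": {
        "code": crop_script,
        "files": ["https://example.com/generated.png"],
        "timeout": 120,
    }
})
print(result["output"])  # stdout, stderr, and any generated files per the app description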

The tradeoff with both execution environments is that they're stateless by default. Each invocation starts fresh. If your agent needs to build up state across multiple shell commands, you'll want to use sessions, which maintain the working directory and filesystem between calls. Without sessions, every run is a clean slate.
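
A rough sketch of the session pattern, assuming a shared session identifier is how state gets carried between calls - the sessions docs describe the real mechanism:

python
from inferencesh import inference
client = inference()

# "session" is an assumed parameter name; the point is that both calls share state
first = client.run({
    "app": "infsh/shell",
    "input": {"command": "git clone https://github.com/example/repo ."},  # assumed field name
    "session": "repo-review-1",
})

second = client.run({
    "app": "infsh/shell",
    "input": {"command": "grep -rn 'TODO' src/ | head -20"},
    "session": "repo-review-1",  # same session, same working directory and files
})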

browser automation

infsh/agent-browser is the tool I find myself recommending most often to people building research agents. It's a full browser automation layer designed specifically for AI agents, not humans. The distinction matters. Human browser automation tools like Puppeteer or Playwright assume a developer is writing scripts. Agent browser assumes an LLM is making decisions.

The interaction model works through element references. When the browser opens a page or takes a snapshot, it returns a list of interactive elements tagged with @e refs - @e1, @e2, and so on. The agent sees these refs alongside element descriptions and can issue commands like "click @e5" or "type 'search query' into @e3" without needing to understand CSS selectors or XPath. Screenshots come back for vision-capable models, and a text summary of elements is available for models that work better with structured text.
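
A sketch of that loop, with action and ref as stand-in field names rather than the documented schema:

python
from inferencesh import inference
client = inference()

# "action", "url", and "ref" are assumed field names for illustration only
client.run({
    "app": "infsh/agent-browser",
    "input": {"action": "open", "url": "https://news.ycombinator.com"},
})

# the snapshot returns interactive elements tagged with @e refs plus a screenshot
snapshot = client.run({
    "app": "infsh/agent-browser",
    "input": {"action": "snapshot"},
})

# an LLM picks a ref from the snapshot text, and the agent issues the interaction
client.run({
    "app": "infsh/agent-browser",
    "input": {"action": "click", "ref": "@e5"},
})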

Seven functions cover the full browser lifecycle: open a URL, take screenshots, get page snapshots, interact with elements, execute JavaScript, and close the session. Video recording with cursor indicators is available for debugging or demonstrations. Proxy support handles geo-restricted content or rate-limited sites.

The honest limitation is that highly dynamic single-page applications with complex JavaScript frameworks can sometimes produce element refs that become stale between interactions. For straightforward web browsing, form filling, and data extraction, it works remarkably well. For scraping React apps with heavy client-side rendering, you may need to take more frequent snapshots to keep element refs current.

media processing

This is where the toolkit gets dense. Twelve apps handle various aspects of image and video manipulation, and they decompose into natural clusters.

Starting with the basics: infsh/image-resize handles image scaling by width, height, scale factor, or megapixel target. It preserves aspect ratio when you specify only one dimension, supports Lanczos resampling for quality downscaling, and outputs to configurable formats and quality levels. Simple, but you would be surprised how often agent workflows break because an image model outputs 2048x2048 and the next step in the pipeline expects 1024x1024. Having resize as a discrete step rather than building it into every app is the right architectural choice.
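
A minimal sketch of that 2048-to-1024 normalization, with assumed field names:

python
from inferencesh import inference
client = inference()

# "image", "width", and "format" are assumed field names
resized = client.run({
    "app": "infsh/image-resize",
    "input": {
        "image": "https://example.com/generated-2048.png",
        "width": 1024,     # height omitted, so aspect ratio is preserved
        "format": "png",
    }
})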

infsh/mask-image applies transparency masks to images, combining a main image with a semi-transparent mask to selectively reveal or hide regions. The invert option flips the mask, which saves a round trip when your segmentation model outputs masks in the opposite polarity from what you need. This is the kind of tool that barely registers until you're building an inpainting pipeline and realize you need it at every junction.

infsh/stitch-images combines two images horizontally or vertically. Maximum dimensions and squared output options handle the common cases where you're building comparison views, contact sheets, or before-and-after compositions. Two images at a time feels limiting until you realize that chaining two calls gets you four images, and the simplicity of the interface means agents rarely misuse it.
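
A sketch of the chaining trick - two horizontal stitches plus one vertical stitch yields a 2x2 grid - using assumed field names:

python
from inferencesh import inference
client = inference()

# four image urls to arrange into a 2x2 comparison grid
imgs = [f"https://example.com/variant-{i}.png" for i in range(1, 5)]

def stitch(a, b, direction):
    # input and output field names here are assumptions, not the documented schema
    out = client.run({
        "app": "infsh/stitch-images",
        "input": {"image_a": a, "image_b": b, "direction": direction},
    })
    return out["output"]["image"]

top = stitch(imgs[0], imgs[1], "horizontal")
bottom = stitch(imgs[2], imgs[3], "horizontal")
grid = stitch(top, bottom, "vertical")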

For video, the toolkit covers the full editing spectrum minus creative decision-making. infsh/media-merger concatenates multiple videos and images with customizable transitions, outputting to configurable format and frame rate. infsh/video-audio-merger combines a video file with an audio track, with an option to preserve the original audio alongside the new track. infsh/video-audio-extractor does the reverse, splitting a video into its silent video and extracted audio components. These three alone handle the majority of programmatic video editing needs.

infsh/caption-videos adds subtitle overlays using pre-generated caption data - the kind of segment arrays that speech-to-text services produce. Font selection from Google Fonts, configurable positioning, stroke outlines for readability, and a specific fix for Whisper's 30-second timestamp chunking issue (which tells you exactly which real-world pain point prompted this tool's creation). It doesn't transcribe the audio itself. It takes existing transcription data and burns it into the video. That separation of concerns is deliberate: transcription and rendering are different problems with different quality tradeoffs.
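
A sketch of feeding Whisper-style segments into the captioner; both the segment shape and the field names are assumptions:

python
from inferencesh import inference
client = inference()

# whisper-style segments from a separate transcription step; the exact segment
# shape and field names caption-videos expects are assumptions here
segments = [
    {"start": 0.0, "end": 2.4, "text": "welcome back to the channel"},
    {"start": 2.4, "end": 5.1, "text": "today we're stress-testing agent toolkits"},
]

captioned = client.run({
    "app": "infsh/caption-videos",
    "input": {
        "video": "https://example.com/clip.mp4",
        "segments": segments,
        "font": "Inter",          # a Google Fonts family
        "position": "bottom",
    }
})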

infsh/extract-last-frame pulls a specific frame from the end of a video as a static image. The nth-last-frame parameter lets you grab not just the final frame but the second-to-last, tenth-to-last, whatever you need. This exists almost exclusively for video-to-video workflows where the last frame of one generated clip becomes the input image for the next. It's a tiny tool that enables a whole category of sequential video generation.
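
A sketch of that sequential loop, with a hypothetical image-to-video app standing in for whatever generator you actually use, and assumed field names throughout:

python
from inferencesh import inference
client = inference()

clips = []
start_image = "https://example.com/first-frame.png"

for _ in range(3):
    # hypothetical image-to-video app; any generator fits this slot
    clip = client.run({
        "app": "vendor/image-to-video",
        "input": {"image": start_image, "prompt": "slow dolly forward"},
    })["output"]["video"]          # assumed output field
    clips.append(clip)

    # seed the next clip with the final frame ("nth_last_frame" is an assumed field name)
    start_image = client.run({
        "app": "infsh/extract-last-frame",
        "input": {"video": clip, "nth_last_frame": 1},
    })["output"]["image"]          # assumed output field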

infsh/extract-media-duration returns the length of video and audio files in both raw seconds and formatted strings. Again, small. Again, essential. Agents building multi-clip timelines need to know durations before they can calculate transition timing or verify that generated content meets length requirements.

infsh/bounce-repeat-videos creates looping effects by playing a video forward then backward. Set a target duration, and the tool bounces the clip to fill the time. Useful for extending short AI-generated clips into longer loops for social media or presentations. It's niche, but when you need it, you really need it.

infsh/html-to-image renders HTML strings to PNG or JPEG at configurable viewport dimensions. The use case is dynamic image generation from templates - certificates, social cards, data visualizations, styled text overlays. Your agent writes HTML (which LLMs are genuinely good at), and this tool turns it into a raster image ready for compositing into a video or posting to social media.
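
A sketch of rendering an LLM-written social card, with assumed field names:

python
from inferencesh import inference
client = inference()

# an LLM-authored social card; "html", "width", "height", and "format" are assumed field names
html = """
<div style="width:1200px;height:630px;display:flex;align-items:center;
            justify-content:center;background:#111;color:#fff;font-family:sans-serif;">
  <h1>weekly metrics: activation up 18%</h1>
</div>
"""

card = client.run({
    "app": "infsh/html-to-image",
    "input": {"html": html, "width": 1200, "height": 630, "format": "png"},
})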

infsh/remotion-render is the most powerful rendering tool in the set. It takes React/Remotion TSX component code and renders it to MP4. Pass in a React component, configure duration, resolution, and frame rate, and get back a rendered video. This is programmatic video creation at its most flexible - motion graphics, animated data visualizations, kinetic typography, anything React can render. The tradeoff is that your agent (or your prompt engineering) needs to produce valid Remotion code, which is a higher bar than most other tools in this toolkit.
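
A sketch of what a minimal render request might look like; the field names and the expected component shape are assumptions:

python
from inferencesh import inference
client = inference()

# a minimal Remotion composition as a TSX string; field names and the expected
# export shape are assumptions, so treat this as a sketch rather than a template
tsx = """
import React from 'react';
import { useCurrentFrame, interpolate } from 'remotion';

export const MyVideo: React.FC = () => {
  const frame = useCurrentFrame();
  const opacity = interpolate(frame, [0, 30], [0, 1]);
  return <h1 style={{ opacity, fontSize: 120, textAlign: 'center' }}>hello</h1>;
};
"""

video = client.run({
    "app": "infsh/remotion-render",
    "input": {"code": tsx, "duration_in_frames": 90, "fps": 30, "width": 1920, "height": 1080},
})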

3d generation

infsh/rodin-3d-generator sits slightly outside the utility category, but it's in the infsh namespace and it fills a genuine gap. Feed it text descriptions or up to five reference images and it produces 3D model meshes with textures. Output formats include standard geometry formats, and mesh quality is configurable from lower-poly options for real-time use to higher-detail options for rendering. Preview renders let you evaluate results without loading the mesh in a 3D viewer.

The honest assessment: 3D generation from text and images has improved dramatically, but it's still behind where image generation was two years ago. Simple objects with clean geometry come out well. Complex organic shapes or mechanical assemblies with precise tolerances are hit-or-miss. For prototyping, game asset pipelines, and e-commerce product visualization, it's practical today. For engineering CAD or film-quality assets, not yet.

content safety

infsh/falconsai-nsfw-detection runs the Falconsai NSFW detection model on images and videos. For video, it samples frames at configurable intervals rather than analyzing every frame, which keeps costs and latency reasonable. A configurable threshold lets you tune the sensitivity, and the output includes per-frame results alongside an overall classification.
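
A sketch of screening a generated video, with assumed field names and output shape:

python
from inferencesh import inference
client = inference()

# "video", "frame_interval", "threshold", and the output shape are assumptions
check = client.run({
    "app": "infsh/falconsai-nsfw-detection",
    "input": {
        "video": "https://example.com/generated.mp4",
        "frame_interval": 2.0,   # sample every two seconds rather than every frame
        "threshold": 0.7,
    }
})["output"]

flagged_frames = [f for f in check.get("frames", []) if f.get("nsfw")]
if check.get("nsfw") or flagged_frames:
    print("route to human review")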

This is a compliance tool, and compliance tools are inherently imperfect. The model catches obvious cases reliably. Edge cases - artistic nudity, medical imagery, ambiguous content - produce the kind of borderline scores that require human review. I'd use it as a first-pass filter in automated pipelines, not as the sole gatekeeper for content that carries real consequences. The per-frame breakdown for video is particularly useful because NSFW content in a single frame of an otherwise clean video is a different problem than a video that's NSFW throughout.

data manipulation

Four tools handle text and file operations that come up constantly in agent workflows.

infsh/text-split breaks text into parts using a delimiter or regex pattern. infsh/text-templating substitutes values into template strings with numbered placeholders. These sound trivial, and they are. They're also the kind of operations that agents otherwise accomplish by asking the LLM to do string manipulation in natural language, which is unreliable and expensive. A dedicated text-split tool that costs fractions of a cent beats burning tokens asking GPT-4 to parse a comma-separated list.
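
A sketch of both calls, with assumed field names and a guessed placeholder syntax:

python
from inferencesh import inference
client = inference()

# field names for both apps are assumptions
parts = client.run({
    "app": "infsh/text-split",
    "input": {"text": "alpha, beta, gamma", "delimiter": ", "},
})["output"]

filled = client.run({
    "app": "infsh/text-templating",
    "input": {
        "template": "Dear {1}, your order {2} has shipped.",  # assumed placeholder syntax
        "values": ["Ada", "A-1042"],
    }
})["output"]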

infsh/text-to-file creates files from text content with a specified filename. This bridges the gap between an agent's text outputs and tools that expect file inputs. If your agent writes a Python script that needs to be executed, or generates HTML that needs to be rendered, text-to-file creates the actual file object.
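
A sketch of that bridge, handing a generated script to the python executor, with assumed field names:

python
from inferencesh import inference
client = inference()

# "content" and "filename" are assumed field names; so is the output field
script_file = client.run({
    "app": "infsh/text-to-file",
    "input": {"content": "print('hello from a generated file')", "filename": "hello.py"},
})["output"]["file"]

# hand the file object to the python executor (field names again assumed)
client.run({
    "app": "infsh/python-executor",
    "input": {"files": [script_file], "code": "exec(open('hello.py').read())"},
})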

infsh/media-analyzer uses vision models to answer questions about images and audio files. You pass in media files and a question, and it returns an analysis. The model is configurable, which means you can trade cost for depth. Quick classification tasks can use lighter models, while detailed analysis can use more capable ones. Temperature control lets you tune between consistent and creative responses.

infsh/search-assistant provides web research capabilities. Pass a query, get synthesized findings. It's built on top of model reasoning over search results, with controls for temperature, token limits, and reasoning effort. For agent workflows that need to look something up before proceeding, this is simpler than orchestrating separate search and extraction tools.
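
Because search-assistant's schema is reproduced in the api reference below, this one can be shown without guessing at field names:

python
from inferencesh import inference
client = inference()

# input fields follow the schema documented in the api reference below
research = client.run({
    "app": "infsh/search-assistant",
    "input": {
        "query": "current best practices for chunking documents in retrieval pipelines",
        "model": "gpt-4o-mini",
        "temperature": 0.1,
        "max_tokens": 1200,
    }
})
print(research["output"]["results"])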

the real point

None of these tools are impressive in isolation. Resizing an image is not a breakthrough. Splitting text on a delimiter is not novel engineering. But agent workflows are not built from breakthroughs. They're built from reliable, composable operations that chain together without breaking. An image generation pipeline that produces a 2048px image, needs it at 1024px, needs NSFW screening, needs it composited with text, and needs the result as a video frame requires four or five utility tools working in sequence. Without them built in, you're writing that glue code yourself for every workflow.
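
Stringing a few of them together looks roughly like this - a sketch with assumed field names, not a documented recipe:

python
from inferencesh import inference
client = inference()

# end-to-end glue sketch; every input and output field name below is an assumption
img = client.run({
    "app": "infsh/image-resize",
    "input": {"image": "https://example.com/gen-2048.png", "width": 1024},
})["output"]["image"]

safety = client.run({
    "app": "infsh/falconsai-nsfw-detection",
    "input": {"image": img, "threshold": 0.7},
})["output"]

if not safety.get("nsfw"):
    title_card = client.run({
        "app": "infsh/html-to-image",
        "input": {"html": "<h1 style='font-size:96px'>episode 12</h1>", "width": 1024, "height": 1024},
    })["output"]["image"]

    final = client.run({
        "app": "infsh/media-merger",
        "input": {"media": [title_card, img, "https://example.com/clip.mp4"], "fps": 24},
    })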

The alternative is worse than it sounds. I've watched teams spend weeks building internal media processing services, sandboxed execution environments, and browser automation layers that do less than what these tools provide out of the box. The integration tax article on this blog covers the math in detail, but the short version is: commodity plumbing should be commoditized.

These tools won't make your demo video. They'll make your demo video possible.

can I use these utility tools outside of agent workflows?

Yes, every infsh utility app works as a standalone tool through the API or CLI. You can call image-resize, shell, or any other tool directly without building an agent around it. That said, the real value shows up when they're composed together in flows. Individually, they do things you could accomplish with ffmpeg, ImageMagick, or a Python script. Collectively, they eliminate the need to deploy and maintain that infrastructure yourself. The pricing is per-execution, so you only pay when you use them - no idle servers running media processing tools waiting for work that arrives sporadically.

what about latency for media processing tools?

Media processing tools like image-resize, stitch-images, and text-to-file run in under a second for typical inputs. Video operations are inherently slower since they process frames sequentially - merging two 30-second clips or adding captions takes roughly proportional time to the video duration. The Remotion renderer is the slowest tool in the set because it's literally rendering video frame by frame from React components. For latency-sensitive workflows, the execution environments (shell and Python) add minimal overhead beyond the actual command or script runtime. Browser automation depends entirely on page load times and interaction complexity, so latency is largely a function of the target site.

are the execution environments actually secure?

The shell and Python execution environments run in isolated sandboxes with no access to the host system, other users' data, or the broader network. They're designed for agents to run untrusted code safely. That said, "sandboxed" does not mean "unlimited." Resource constraints exist on CPU time, memory, and disk space to prevent runaway processes. The 300-second timeout on Python execution and 30-second default on shell are deliberate limits. If your workflow needs long-running computation or persistent state beyond what sessions provide, you're likely better served by deploying your own compute and calling it as a custom app rather than pushing the sandbox beyond its design intent.

api reference

about

helps users create and refine search queries, retrieve relevant results from various sources, and generate overviews or summaries of the information found.

1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash
pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash
export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "infsh/search-assistant",
    "input": {
        "query": "your research question"
    }
})

print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "infsh/search-assistant",
    "input": {
        "query": "your research question"
    }
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python
# local file paths are automatically uploaded
result = client.run({
    "app": "infsh/search-assistant",
    "input": {
        "image": "/path/to/local/image.png",  # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

manual upload

you can also upload files manually and use the returned url.

python
# upload and get a hosted URL
file = client.files.upload("/path/to/file.png")
print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python
result = client.run({
    "app": "infsh/search-assistant",
    "input": {
        "query": "your research question"
    },
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json
{
  "id": "task_abc123",
  "status": 9,
  "output": { ... },
  "error": "",
  "session_id": null,
  "created_at": "2024-01-15T10:30:00Z",
  "updated_at": "2024-01-15T10:30:05Z"
}
id (string): task id
status (number): terminal status (9=completed, 10=failed, 11=cancelled)
output (object): task output (when completed)
error (string): error message (when failed)
session_id (string): session id (if using sessions)
created_at (string): iso timestamp
updated_at (string): iso timestamp

5. schema

input

query (string, required)
research query or question to investigate

system_prompt (string)
optional custom system prompt for research context

model (string)
model to use for research
default: "gpt-4o-mini"
options: "gpt-4o-mini", "gpt-4o", "gpt-4-turbo", "gpt-4", "gpt-3.5-turbo", "gpt-4o-2024-08-06", "gpt-4o-2024-05-13", "o1-mini", "o1-preview", "o3-mini", "chatgpt-4o-latest"

temperature (number)
temperature for response randomness (0.0-2.0)
default: 0.1, min: 0, max: 2

max_tokens (integer)
maximum tokens in response (optional)

top_p (number)
top-p sampling parameter (0.0-1.0)
default: 1, min: 0, max: 1

frequency_penalty (number)
frequency penalty (-2.0 to 2.0)
default: 0, min: -2, max: 2

presence_penalty (number)
presence penalty (-2.0 to 2.0)
default: 0, min: -2, max: 2

reasoning_effort (string)
reasoning effort level for research analysis (optional, only for supported models)
options: "minimal", "low", "medium", "high"

output

results (string, required)
research findings and analysis

