
GPT-OSS Safeguard 20B

run with your agent
# install belt
$ curl -fsSL https://cli.inference.sh | sh
# view schema & details
$ belt app get openrouter/gpt-oss-safeguard-20b
# run
$ belt app run openrouter/gpt-oss-safeguard-20b

Most teams building with LLMs skip safety checks entirely. Not because they don't care, but because the economics don't work. Running every user input and every model output through your primary LLM for a safety evaluation doubles your inference bill. So the safety layer gets deprioritized, moved to a backlog ticket, or replaced with a handful of regex patterns that catch the obvious stuff and miss everything else.

GPT-OSS Safeguard 20B changes that calculation. It's cheap enough that the question flips. Instead of "can we afford to add safety checks?" the question becomes "can we afford not to?"

I think this is one of the more important model releases in recent memory, not because it pushes the frontier of what language models can do, but because it makes a critical piece of production architecture economically viable for everyone.

what safeguard actually is

Let me be clear about what this model does and what it doesn't do. Safeguard 20B is not a general-purpose chat model. You're not going to use it to write code, draft emails, or answer customer questions. It's a 21-billion-parameter Mixture-of-Experts model with only 3.6 billion active parameters per forward pass, fine-tuned from OpenAI's open-weight gpt-oss-20b base model specifically for safety classification and content moderation. OpenAI developed it in collaboration with Discord, SafetyKit, and Robust Open Online Safety Tools (ROOST), and released it under the Apache 2.0 license. Its job is to look at a piece of text, reason through a developer-provided safety policy, and tell you whether the content contains harmful material, policy violations, prompt injection attempts, or other problems you don't want flowing through your system. The model produces a transparent chain of thought showing how it reached each decision.

There's also a larger sibling - gpt-oss-safeguard-120b - for cases where you need more reasoning depth and can tolerate higher latency. But the 20B version hits a sweet spot for most production use. The MoE architecture with only 3.6B active parameters means it's fast - response latency is low because only a fraction of the total parameters are involved in any given inference. But the model is large enough to understand context, nuance, and the kind of indirect phrasing that trips up simpler classifiers. A keyword filter catches "how to make a bomb." A reasoning-capable safety model can catch someone asking the same question through a creative writing framing or a role-play scenario.

This sits in a category that I think deserves more attention from the developer community. We've spent years debating how to make primary models safer - training with RLHF, constitutional AI, system prompts, guardrails baked into the model weights. All of that matters. But it's a fundamentally different approach from having a dedicated, independent model whose entire purpose is safety evaluation. Both approaches have value, and they're better together than either is alone.

the architecture that makes sense

The pattern I keep coming back to is straightforward. User input arrives. Before it touches your expensive primary model, Safeguard 20B screens it. If the input passes, it flows to Claude or GPT-4o or whatever you're using for the actual work. The primary model generates its response. Before that response reaches the user, Safeguard 20B screens the output too.

User input, then Safeguard screens, then your primary model processes, then Safeguard checks the output, then the user sees the result. Five steps instead of three, but the two additional steps cost almost nothing and catch problems at both ends of the pipeline.

On inference.sh, this kind of chaining is natural. You're already calling models through a unified API, so adding a Safeguard call before and after your primary model call is a few extra lines in your orchestration logic. The latency cost is real but small - a 20B model responds quickly, and the calls can run in a fraction of the time your primary model takes to generate its response.
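Here's a minimal sketch of that wrap-around pattern using the Python client documented in the api reference below. The policy wording, the ALLOW/BLOCK convention, and the way the verdict is read out of the response are assumptions, and the primary-model app id is a placeholder - adapt all of it to your own setup and policy format.

python
from inferencesh import inference

client = inference()

SAFEGUARD = "openrouter/gpt-oss-safeguard-20b"
PRIMARY = "your-org/your-primary-model"  # placeholder app id, not a real one

# illustrative policy text; in practice this is your own moderation policy
POLICY = (
    "You are a content safety classifier. Flag requests or outputs involving "
    "violence, illegal activity, or harassment. Answer ALLOW or BLOCK."
)

def screen(text: str) -> bool:
    """Run text through Safeguard and return True if it passes.
    Assumes the verdict can be read from the generated response text."""
    result = client.run({
        "app": SAFEGUARD,
        "input": {"system_prompt": POLICY, "text": text, "stream": False},
    })
    return "BLOCK" not in result["output"]["response"].upper()

def guarded_call(user_input: str) -> str:
    if not screen(user_input):      # screen the input before spending primary-model tokens
        return "Sorry, I can't help with that."
    answer = client.run({"app": PRIMARY, "input": {"text": user_input}})
    reply = answer["output"]["response"]
    if not screen(reply):           # screen the output before the user sees it
        return "Sorry, I can't share that response."
    return reply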

The economics are hard to argue with. Safeguard screening adds only a small fraction to your effective per-request cost - that's the price of a meaningful safety layer. Compare that to running the safety check through your primary model, which would roughly double your bill. Or compare it to not running safety checks at all and dealing with the consequences when something gets through.

where this actually helps

The most obvious use case is content moderation for user-facing applications. Chatbots, customer support agents, creative tools - anything where users submit free-text input and receive model-generated responses. Both sides of that exchange benefit from safety screening. On the input side, you catch harmful requests before they consume expensive inference. On the output side, you catch cases where your primary model generates something problematic despite its own safety training.

But the use case I find more interesting is in agent systems. Agents are increasingly autonomous. They make decisions, call tools, generate content, and take actions with minimal human oversight. The surface area for things going wrong is vastly larger than in a simple chatbot. An agent might receive a benign-seeming instruction that, when combined with its tool access, leads to harmful outcomes. Or it might generate an intermediate artifact - a code snippet, a database query, a message draft - that's problematic even if the final output looks fine.

Running Safeguard checks at multiple points in an agent pipeline - on the initial user input, on the agent's planned actions before execution, on tool outputs before the agent processes them, on the final response before delivery - creates a mesh of safety checks that catches problems at different stages. Each check costs fractions of a cent. The cumulative protection is substantially better than any single check point.
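Here's one way those checkpoints might look in code, reusing the hypothetical screen() helper from the earlier sketch. The stage names and the fail-fast behavior are illustrative, not a prescribed pattern.

python
class PolicyViolation(Exception):
    """Raised when Safeguard flags content at any stage of the agent loop."""

def checked(stage: str, text: str) -> str:
    """Pass text through the screen() helper from the earlier sketch (an assumption,
    not SDK API) and fail fast if the content is flagged."""
    if not screen(text):
        raise PolicyViolation(f"blocked at stage: {stage}")
    return text

# usage inside a hypothetical agent loop:
#   user_msg  = checked("input", user_msg)            # before any expensive inference
#   plan_text = checked("plan", planned_actions)      # before executing tools
#   tool_out  = checked("tool output", tool_result)   # before the agent reads it
#   reply     = checked("response", draft_reply)      # before the user sees it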

Prompt injection is another area where a dedicated safety model adds value. The attack surface for prompt injection is growing as more applications expose LLMs to untrusted input. A user pastes a document that contains hidden instructions. An email body includes text designed to hijack the model's behavior. A web page scraped by an agent contains adversarial content. These attacks target the primary model's instruction-following behavior. A separate safety model evaluating the input independently isn't susceptible to the same injection because it's not following the same instruction context. It's looking at the content from the outside and asking a different question entirely.

what it won't catch

I want to be honest about the limitations because overselling a safety tool is worse than not having one. A 20B-parameter model, no matter how well trained, will miss things.

Sophisticated adversarial attacks designed to evade detection will sometimes succeed. Safety is fundamentally an adversarial domain. Attackers adapt to defenses. A model trained on known attack patterns can be circumvented by novel patterns. This isn't a failing specific to Safeguard 20B - it's true of every safety system ever built. The question is whether it raises the bar meaningfully, and a 20B model trained on safety data raises it significantly compared to no dedicated safety layer.

Culturally specific content is another challenge. What counts as harmful varies across cultures, languages, and contexts. A model's training data reflects specific cultural assumptions about harm, and those assumptions won't perfectly align with every deployment context. If your application serves a global audience, you'll need to evaluate Safeguard's performance on content that reflects different cultural norms and potentially supplement it with your own policy-specific classifiers.

Subtle policy violations that require deep domain knowledge will likely get through. If your content policy says users can't discuss competitor products in a specific way, a general safety model won't know that. Safeguard 20B handles broad categories of harm - violence, illegal activity, sexual content, harassment. Custom policy enforcement still needs custom solutions.

There's also the question of bias. Every classifier has biases in what it flags and what it misses. Safety models can be more aggressive about flagging content from certain demographic groups or on certain topics, leading to disparate treatment that's itself a form of harm. I'd recommend monitoring Safeguard's behavior across different input populations and adjusting your threshold or supplementing with additional checks where you see imbalances.

dedicated versus built-in safety

The philosophical question behind Safeguard 20B is whether safety should be a separate concern or an integrated one. The frontier model providers have invested heavily in building safety into their primary models. Claude, GPT-4, Gemini - they all have built-in refusal behaviors and content policies.

I think both approaches are necessary, and neither is sufficient alone. Built-in safety is valuable because it operates with full context. The model knows the entire conversation history and can make nuanced judgments about whether a request is harmful given the context. But built-in safety is also brittle because it can be subverted through the same interface it's protecting. Jailbreaks work precisely because the safety behavior and the instruction-following behavior share the same model and the same context.

A dedicated external safety model doesn't share that context. It evaluates content independently, which makes it resistant to the class of attacks that target the primary model's instruction-following behavior. It's also cheaper to update and iterate on. Retraining or fine-tuning a 21B MoE safety model on new attack patterns is faster and less expensive than retraining a 400B general-purpose model. And since Safeguard is released under the Apache 2.0 license with open weights, you can customize it for your specific policies.

The practical architecture is defense in depth. Your primary model's built-in safety catches most issues. Safeguard 20B catches some that slip through. Your application-level rules catch domain-specific violations. Human review catches the rest. No single layer is complete. The stack of layers, each cheap and fast enough to run on every interaction, gets you much closer to complete coverage.

what this means for the market

Safeguard 20B is part of a broader trend toward specialized models. The era of one model doing everything is giving way to an era of model pipelines where different models handle different concerns. A small fast model for routing. A specialized model for safety. A large capable model for reasoning. Another specialized model for code generation or image understanding.

This decomposition makes systems more reliable, more cost-effective, and easier to improve incrementally. When your safety layer needs updating, you swap in a new version of the safety model without touching anything else. When you want to upgrade your reasoning capability, you swap the primary model. Each component improves independently.

For developers building agent systems, this is a significant shift in how to think about architecture. The question isn't just "which model is best?" It's "which combination of models gives me the best performance, safety, and cost profile for my specific workload?" Safety is no longer a tax on your inference budget. With models like Safeguard 20B, it's negligible.

frequently asked questions

can safeguard 20b replace human content moderators?

No, and it shouldn't. Safeguard 20B is a first-pass filter that handles the high-volume, clear-cut cases automatically. It catches obvious policy violations quickly and cheaply, which means your human moderators spend their time on the ambiguous cases that actually require human judgment. The model will miss edge cases, misclassify borderline content, and lack the cultural context that experienced moderators bring. Think of it as the layer that reduces the haystack your human team has to search through, not a replacement for the team itself. The right architecture combines automated screening for volume with human review for quality.

does it work with models from other providers, or only openai models?

Despite being built by OpenAI, Safeguard 20B works as a standalone screening layer regardless of what primary model you're using. It evaluates text independently, so you can chain it with Claude, Gemini, Llama, Mistral, or any other model in your pipeline. The model doesn't need to know anything about your primary model. It just looks at text and classifies it. This provider-agnostic nature is actually one of its strengths. You can swap your primary model without changing your safety layer, and vice versa. The screening logic stays the same regardless of what's behind it.

is the 20B model enough for reliable safety classification?

For the majority of safety-relevant content, yes. Despite only activating 3.6 billion parameters per forward pass (thanks to the MoE architecture), the model handles direct policy violations, explicit harmful content, and common adversarial patterns reliably. The reasoning chain of thought helps it evaluate content against custom policies rather than relying on rigid pattern matching. Where size becomes a constraint is on highly sophisticated attacks that require deep contextual reasoning to detect - think multi-turn conversations where the harmful intent only becomes apparent when you connect information across many exchanges. For those cases, OpenAI's larger gpt-oss-safeguard-120b or a human reviewer will catch things the 20B model misses. The practical approach is to use Safeguard 20B as a fast, cheap filter for the 95% of cases it handles well, and route uncertain classifications to more thorough review.
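One way to implement that routing, sketched against the client and POLICY from the earlier example. The ALLOW/BLOCK/UNSURE convention is my own, and the 120b app id is assumed to follow the same naming - verify it in the app catalog before relying on it.

python
SAFEGUARD_20B = "openrouter/gpt-oss-safeguard-20b"
SAFEGUARD_120B = "openrouter/gpt-oss-safeguard-120b"  # assumed id, check the catalog

def classify(text: str, app: str) -> str:
    result = client.run({
        "app": app,
        "input": {
            "system_prompt": POLICY + " If you are not confident, answer UNSURE.",
            "text": text,
            "stream": False,
        },
    })
    return result["output"]["response"].strip().upper()

def queue_for_human_review(text: str) -> None:
    """Hypothetical hand-off to your own review tooling."""
    ...

def moderate(text: str) -> str:
    verdict = classify(text, SAFEGUARD_20B)        # cheap first pass for the easy cases
    if "UNSURE" in verdict:
        verdict = classify(text, SAFEGUARD_120B)   # slower, deeper second pass
    if "UNSURE" in verdict:
        queue_for_human_review(text)
        return "PENDING"
    return "BLOCK" if "BLOCK" in verdict else "ALLOW"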

api reference

about

gpt-oss safeguard 20b

1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash
pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash
export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "openrouter/gpt-oss-safeguard-20b",
    "input": {}
})

print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "openrouter/gpt-oss-safeguard-20b",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python
# local file paths are automatically uploaded
result = client.run({
    "app": "openrouter/gpt-oss-safeguard-20b",
    "input": {
        "image": "/path/to/local/image.png",  # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

manual upload

you can also upload files manually and use the returned url.

python
# upload and get a hosted URL
file = client.files.upload("/path/to/file.png")
print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python
result = client.run({
    "app": "openrouter/gpt-oss-safeguard-20b",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json
1{2  "id": "task_abc123",3  "status": 9,4  "output": { ... },5  "error": "",6  "session_id": null,7  "created_at": "2024-01-15T10:30:00Z",8  "updated_at": "2024-01-15T10:30:05Z"9}
idstringtask id
statusnumberterminal status (9=completed, 10=failed, 11=cancelled)
outputobjecttask output (when completed)
errorstringerror message (when failed)
session_idstringsession id (if using sessions)
created_atstringiso timestamp
updated_atstringiso timestamp
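For reference, a minimal receiver for that payload - a generic Flask handler, not part of the inferencesh sdk; the route and the logging are just an example.

python
# generic example receiver, not part of the inferencesh sdk
from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def handle_task_webhook():
    payload = request.get_json(force=True)
    if payload.get("status") == 9:   # 9 = completed (see the field list above)
        print("task", payload.get("id"), "completed:", payload.get("output"))
    else:
        print("task", payload.get("id"), "ended with error:", payload.get("error"))
    return "", 200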

5. schema

input

reasoning_exclude (boolean)
exclude reasoning tokens from response
default: false

context_size (integer)
the context size for the model
default: 200000

stream (boolean)
stream the response (true) or return complete response (false)
default: true

files (array)
the files to use for the model

images (array)
the images to use for the model

tools (array)
tool definitions for function calling

tool_call_id (string)
the tool call id for tool role messages

reasoning (string)
the reasoning input of the message

reasoning_effort (string)
enable step-by-step reasoning
default: "none"
options: "low", "medium", "high", "none"

reasoning_max_tokens (integer)
the maximum number of tokens to use for reasoning

system_prompt (string)
the system prompt to use for the model
default: "you are a helpful assistant that can answer questions and help with tasks."
example: "you are a helpful assistant that can answer questions and help with tasks."

context (array)
the context to use for the model
default: []
example: [{"content":[{"text":"What is the capital of France?","type":"text"}],"role":"user"},{"content":[{"text":"The capital of France is Paris.","type":"text"}],"role":"assistant"}]

role (string)
the role of the input text
default: "user"
options: "user", "assistant", "system", "tool"

text (string, required)
the input text to use for the model
example: "write a haiku about artificial general intelligence"

temperature (number)
temperature
default: 0.7, min: 0, max: 1

top_p (number)
top p
default: 0.95, min: 0, max: 1

max_tokens (integer)
max tokens
default: 64000

output

images (array)
images

output_meta (object)
structured metadata about inputs/outputs for pricing calculation

response (string, required)
the generated text response

usage (object)
token usage statistics

tool_calls (array)
tool calls for function calling

reasoning (string)
the reasoning output of the model
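To tie the schema back to the model's purpose, here's a hedged example of a moderation-style request built from the input fields above. The policy text is illustrative, and reading the verdict and chain of thought from response and reasoning assumes the output shape shown in the earlier examples.

python
result = client.run({
    "app": "openrouter/gpt-oss-safeguard-20b",
    "input": {
        # your safety policy goes in the system prompt; the content to classify goes in text
        "system_prompt": (
            "You are a content safety classifier. Apply this policy: no instructions "
            "for weapons, malware, or self-harm. Answer ALLOW or BLOCK with a one-line reason."
        ),
        "text": "User message to evaluate goes here.",
        "reasoning_effort": "medium",   # surface the chain of thought behind the verdict
        "temperature": 0,               # keep classification as deterministic as possible
        "stream": False,
    },
})

print(result["output"]["response"])        # the verdict text
print(result["output"].get("reasoning"))   # the model's reasoning, when returned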

