minimax-m-25
MiniMax-M2.5 is a SOTA large language model designed for real-world productivity. Trained across a diverse range of complex real-world digital working environments, M2.5 builds on the coding expertise of M2.1 and extends it into general office work: generating and operating Word, Excel, and PowerPoint files, switching context between diverse software environments, and working across different agent and human teams.
Most LLM conversations start with the same question: how smart is it? MiniMax M2.5 forces a different question, one that I think is more honest about where this industry is heading. The question is: what is this model specifically good at, and does the price make it worth building around those strengths?
M2.5 is not trying to be the smartest model in any room. It is trying to be the most useful model in one particular room - the one where people spend their actual working hours. The room with spreadsheets open on one monitor, a half-finished slide deck on the other, and a Word document that needs to pull data from both. MiniMax, a Shanghai-based AI company founded in late 2021 by Yan Junjie (a former VP at SenseTime), built M2.5 around this reality rather than purely chasing benchmark leaderboards. The company went public on the Hong Kong Stock Exchange in January 2026, and counts Tencent, Alibaba, Hillhouse Capital, and HongShan among its investors. That decision makes it either irrelevant or extremely interesting, depending on what your agents actually do all day.
I have been running M2.5 through inference.sh for document-heavy workflows and the results are uneven in ways that are worth being specific about. There are tasks where this model punches well above its price. There are others where it falls flat. The line between those categories is surprisingly predictable once you understand what MiniMax optimized for.
the cost advantage changes what you can build
M2.5 is dramatically cheaper than frontier models - cheap enough that it changes which engineering decisions are open to you. The cost gaps are not rounding-error differences; they are the kind that change what is economically viable.
Consider an agent workflow that processes a batch of 500 invoices daily. Each invoice gets loaded as context alongside some business rules, the model extracts structured data, and the output feeds into an accounting system. With a frontier model, that pipeline has a meaningful daily cost. With M2.5, the same pipeline costs a fraction of that amount - enough to make the automation obviously worth doing even for a small business, without requiring a cost-benefit conversation.
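As a sketch of what that pipeline could look like through the inference.sh client documented in the API reference below - with the caveat that the single "text" input field and the prompt format are my assumptions, not a documented schema:

from inferencesh import inference
import glob

client = inference()

BUSINESS_RULES = "Extract vendor, invoice number, date, line items, and total as JSON."

def extract_invoice(path):
    # one cheap M2.5 call per invoice
    with open(path) as f:
        invoice_text = f.read()
    result = client.run({
        "app": "openrouter/minimax-m-25",
        # a single "text" input field is an assumption about the app schema
        "input": {"text": BUSINESS_RULES + "\n\n" + invoice_text},
    })
    return result["output"]

# the daily batch: 500 of these stays affordable at M2.5 prices
records = [extract_invoice(p) for p in glob.glob("invoices/*.txt")]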
The pricing also changes how you think about context management. With expensive models, teams invest real engineering effort in chunking documents, summarizing context, and keeping token counts lean. With M2.5, you can afford to be generous. Feed the model the entire spreadsheet instead of a summary. Include the full style guide alongside the document you want reformatted. The marginal cost of extra context is so low that the engineering time you would spend optimizing it costs more than the tokens themselves. The optimization calculus shifts - you can focus on output quality rather than input efficiency, which is usually a better use of time.
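A quick illustration of the "be generous" approach - again assuming the app accepts one text input, which is not a documented contract:

from inferencesh import inference

client = inference()

# no chunking, no summarizing: ship the whole artifact as context
with open("style_guide.md") as f:
    style_guide = f.read()
with open("q3_report.txt") as f:
    document = f.read()

result = client.run({
    "app": "openrouter/minimax-m-25",
    "input": {"text": f"Reformat per this style guide:\n{style_guide}\n\n{document}"},
})
print(result["output"])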
what office productivity actually means here
When MiniMax says M2.5 is designed for office productivity, they mean something specific and somewhat underappreciated. This model was trained on the workflows that make up actual digital work - not the idealized version of knowledge work that shows up in product demos, but the messy reality of switching between applications, generating structured documents, and operating inside the constraints of real file formats.
The model can generate Excel spreadsheets with formulas, formatting, and data that reflects actual business logic. It can write Word documents that follow template structures and maintain consistent formatting throughout. It can create PowerPoint presentations that organize information into slides with coherent layouts rather than dumping text into a blank canvas. For anyone who has tried to get a general-purpose LLM to produce a properly formatted .xlsx file with working formulas, the difference is meaningful.
Context switching is the other capability worth calling out. Real office work rarely involves one application in isolation. You pull numbers from a spreadsheet, reference them in a document, and then summarize the highlights in a presentation. M2.5 was designed to maintain coherence across these transitions - understanding that the revenue figure in cell B7 of the quarterly report is the same number that needs to appear in paragraph three of the executive summary and on slide four of the board deck.
The benchmark numbers back this up. M2.5 scores 80.2% on SWE-Bench Verified, 51.3% on Multi-SWE-Bench, and 76.3% on BrowseComp. In office productivity evaluations covering Word, PowerPoint, and Excel financial modeling, M2.5 achieved a 59.0% average win rate against mainstream models in pairwise comparison. It also completes SWE-Bench Verified evaluations 37% faster than its predecessor M2.1, matching Claude Opus 4.6 on end-to-end runtime.
This builds on the foundation of MiniMax's earlier M2.1 model, which established solid coding capabilities. The connection matters because structured document generation is, at its core, a coding problem. An Excel file with formulas is a program. A PowerPoint deck with consistent styling follows rules that are easier to enforce if the model thinks in terms of structured output rather than free-form text. M2.5 inherits that structured thinking and applies it to office document formats specifically.
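To make "an Excel file with formulas is a program" concrete, here is a minimal sketch using openpyxl (the library choice is mine, purely for illustration). The formula string in A4 is executable logic that Excel evaluates on open, which is exactly why a model comfortable with structured code output has an edge here:

from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws["A1"], ws["A2"], ws["A3"] = 120, 340, 95  # data cells
ws["A4"] = "=SUM(A1:A3)"                     # a formula is executable code
wb.save("report.xlsx")                       # Excel runs it when the file opens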
where it genuinely performs
I want to be specific about the tasks where M2.5 earns its place in a production stack, because vague praise helps nobody.
Data extraction and transformation is the strongest suit. Give M2.5 a messy CSV, a PDF table, or an unstructured text dump full of numbers, and ask it to produce a clean, structured output. It handles these tasks with a consistency that surprised me. The model understands tabular data intuitively - it knows that a column of dates should be formatted consistently, that currency values need the right number of decimal places, and that a total row should actually be the sum of the rows above it. These sound like table stakes, but plenty of models that cost five or ten times more fumble on exactly these details.
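Because these invariants are mechanical, you can verify them cheaply after every extraction rather than trusting the model. A minimal sketch, assuming you prompt for a {"rows": [...], "total": ...} JSON shape (an assumed contract, not part of the API):

import json

def check_extraction(output_json):
    # validate the invariants a clean tabular extraction should satisfy;
    # the {"rows": [...], "total": ...} shape is an assumed output contract
    data = json.loads(output_json)
    rows, total = data["rows"], data["total"]
    computed = sum(r["amount"] for r in rows)
    # the total row should actually be the sum of the rows above it
    assert abs(computed - total) < 0.01, f"total mismatch: {computed} vs {total}"
    return data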
Document templating is another area where the specialization shows. If you have a standard report format and need to generate variations based on different data inputs, M2.5 maintains structural fidelity across generations. The headers stay consistent. The formatting follows the template. Section numbering does not randomly reset. For teams building automated reporting pipelines, this reliability on formatting is worth more than marginal improvements in prose quality.
Multi-step office workflows - the kind where an agent needs to read data from one source, process it, and produce output in a different format - also play to the model's strengths. M2.5 handles the context switching between "I am reading a spreadsheet" and "I am now writing a memo about what I found" without losing track of which numbers go where. This is the specific capability that MiniMax optimized for, and it works.
For agent-to-agent handoffs in team workflows, where one agent produces output that another agent needs to consume and act on, M2.5's structured output tendencies make it a reliable participant. The output formats are predictable, which means downstream agents can parse them without extensive error handling.
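In practice that means the downstream parser can stay thin. A sketch of what "without extensive error handling" looks like, where the required keys are an assumed handoff contract between your own agents:

import json

REQUIRED_KEYS = {"summary", "rows", "next_action"}  # assumed handoff contract

def consume_handoff(upstream_output: str) -> dict:
    # predictable upstream formats mean one schema check is usually enough
    data = json.loads(upstream_output)
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"handoff violated contract, missing: {missing}")
    return data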
the tradeoffs you need to know about
Now for the honest part, because a model this cheap comes with real limitations that you should understand before you build around it.
English-language creative writing is where the gap shows most clearly. Ask M2.5 to write a nuanced blog post, craft a persuasive sales email, or produce marketing copy that needs to feel natural and compelling to a native English speaker, and you will notice the difference compared to Claude or GPT-4o. The prose tends toward functional rather than elegant. Idiomatic expressions land slightly off. Tone management across a long piece is less consistent. The writing is competent but it reads like writing, if you know what I mean - it lacks the fluid quality that the best Western models can produce.
This is not a flaw in the model so much as a reflection of training data distribution. MiniMax is a Chinese company, and while M2.5 handles English well enough for structured tasks and business communication, it was not primarily optimized for native-level English prose. For internal documents, data processing, and structured output, this barely matters. For customer-facing English content, you should test carefully against your quality bar.
The documentation ecosystem is thinner than what you get with Anthropic or OpenAI models. When you hit an edge case with Claude, you can find blog posts, community threads, and detailed documentation that help you debug the problem. With M2.5, the English-language resources are sparser. The Chinese-language community around MiniMax is more active, but that is not helpful if you do not read Chinese. Expect to spend more time on trial-and-error and less time on guided troubleshooting.
Community adoption among Western developers is still early. This means fewer shared prompt templates, fewer integration examples, and fewer people who have already solved the specific problem you are hitting. The model itself might be capable, but the surrounding ecosystem that makes a model easy to work with is still catching up. This overhead is real even if it does not show up on a pricing page.
There is also the question of how M2.5 handles adversarial or ambiguous inputs. Western frontier models have been poked and prodded by millions of users for years, and their failure modes are well-documented. M2.5 has not been through that same gauntlet with English-speaking users. You may encounter unexpected behaviors on edge cases that Claude would handle gracefully because someone reported a similar problem two years ago and Anthropic fixed it. Build your error handling accordingly.
where it fits in a multi-model stack
The most practical way to think about M2.5 is not as a replacement for your primary reasoning model but as a specialist that handles specific workloads at a fraction of the cost. Very few production systems should run entirely on a single model, and M2.5 makes the economic case for specialization even stronger.
Here is a pattern that makes sense: use Claude or GPT-4o as your primary reasoning engine for tasks that require deep analysis, creative output, or complex decision-making. Route document processing, data extraction, report generation, and spreadsheet operations to M2.5. The routing logic can be simple - if the task involves transforming structured data or generating office documents, send it to the cheaper model. If it involves open-ended reasoning or nuanced English output, send it to the more expensive one.
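A minimal sketch of that routing logic, assuming you tag each task with a coarse category upstream; the category names and the premium app slug are illustrative, not real identifiers:

from inferencesh import inference

client = inference()

STRUCTURED_TASKS = {"data_extraction", "report_generation", "spreadsheet_ops"}

def pick_app(task_category):
    # route format-driven work to the cheap specialist, reasoning elsewhere
    if task_category in STRUCTURED_TASKS:
        return "openrouter/minimax-m-25"
    return "openrouter/claude-sonnet"  # hypothetical slug for the premium model

result = client.run({
    "app": pick_app("data_extraction"),
    "input": {"text": "..."},  # the "text" field is an assumption
})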
This approach captures most of the cost savings while avoiding the quality tradeoffs. In a typical enterprise agent workflow, document processing and data transformation tasks often represent 60% to 80% of total token volume. If you can shift that volume to a model that costs a tenth as much, your aggregate costs drop dramatically even though you are still paying premium prices for the reasoning-heavy tasks.
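To make that arithmetic concrete, here is a toy blended-cost calculation; the 70% routing share and the 10x price ratio are illustrative assumptions, not measured figures:

# toy blended-cost model: all figures are illustrative assumptions
frontier_price = 10.0  # cost units per 1M tokens on the premium model
m25_price = 1.0        # roughly a tenth of the frontier price (assumption)
routed_share = 0.70    # fraction of token volume that is structured work

blended = routed_share * m25_price + (1 - routed_share) * frontier_price
print(f"blended: {blended:.2f} vs {frontier_price:.2f} all-frontier")
# -> blended: 3.70 vs 10.00, roughly a 63% reduction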
The unified API surface matters here. When M2.5 sits alongside Claude, GPT-4o, Gemini, and 150 other tools behind the same interface, switching between models for different tasks is a configuration change rather than an integration project. You are not managing separate SDKs, separate billing, or separate authentication flows. That operational simplicity is what makes multi-model architectures practical rather than theoretical.
the bigger picture for chinese ai models
M2.5 is part of a broader pattern that Western developers need to take seriously. Chinese AI labs are shipping capable models at price points that fundamentally undercut the Western incumbents. The quality gap that existed a year ago is narrowing fast, and on specialized tasks like the office productivity workflows M2.5 targets, it has effectively closed.
On the specific combination of English-language nuance, instruction adherence, and creative flexibility, Claude and GPT-4o remain ahead. But a lot of AI work in production is structured, repetitive, and format-driven. It is exactly the kind of work that a cheaper, specialized model can handle well.
who should test this model
If your agent workflows involve significant document processing, report generation, spreadsheet manipulation, or data extraction from structured sources, M2.5 deserves a serious evaluation. The cost savings are large enough that even a modest quality tradeoff on those specific tasks can be worth accepting.
If your primary workload is creative writing, nuanced English communication, complex reasoning that requires deep world knowledge, or tasks where instruction adherence needs to be precise, M2.5 is probably not the right fit as a primary model. It could still serve as a cost-effective option for the structured portions of a larger workflow, but the reasoning-heavy work should go to a model optimized for that.
The pragmatic starting point is to identify your highest-volume, most structured workload and run a parallel evaluation. Same inputs through your current model and through M2.5. Compare output quality, format accuracy, and total cost. The results will tell you whether the significant price difference translates into meaningful savings for your specific use case. In my experience, for document-heavy workflows, it usually does.
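A bare-bones version of that parallel evaluation, assuming a "text" input field and with the second app slug as a hypothetical stand-in for your current model:

from inferencesh import inference

client = inference()

APPS = ["openrouter/minimax-m-25", "openrouter/claude-sonnet"]  # second slug is hypothetical

def run_eval(samples):
    # same inputs through both models; compare quality, format accuracy, cost
    results = {}
    for app in APPS:
        results[app] = [
            client.run({"app": app, "input": {"text": s}})["output"]
            for s in samples
        ]
    return results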
FAQ
how does minimax m2.5 compare to claude sonnet for document generation tasks?
For generating structured documents like spreadsheets with formulas, templated reports, and data-driven presentations, M2.5 performs surprisingly well. Claude Sonnet produces higher-quality English prose and follows complex instructions more reliably, but for tasks where the output quality is measured by structural correctness rather than linguistic elegance, M2.5 closes the gap significantly. For high-volume document processing pipelines, M2.5 can deliver acceptable quality at a fraction of the cost. The key is testing on your actual document formats and quality requirements rather than assuming the cheaper model cannot handle the job.
is minimax a reliable provider for production workloads?
MiniMax is backed by Tencent, Alibaba, Hillhouse Capital, and HongShan, and has been operating in the Chinese AI market since late 2021. The company listed on the Hong Kong Stock Exchange in January 2026, raising roughly $618 million in its IPO, so the company has institutional backing and staying power. That said, their track record serving Western customers through English-language APIs is shorter than what you get with OpenAI or Anthropic. Expect slightly less polished documentation, fewer English-language support resources, and the possibility of API behavior changes that are communicated primarily through Chinese-language channels first. Running M2.5 through a unified platform mitigates some of this risk since the platform handles the provider integration, but you should still build fallback logic into any production system that depends on a single model from any provider.
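That fallback logic can be as simple as the sketch below; the premium app slug is hypothetical, and catching a bare Exception is deliberately crude for illustration:

from inferencesh import inference

client = inference()

def run_with_fallback(payload):
    # try the cheap specialist first, fall back to a premium model on failure
    try:
        return client.run({"app": "openrouter/minimax-m-25", "input": payload})
    except Exception:
        return client.run({"app": "openrouter/claude-sonnet", "input": payload})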
what types of tasks should I avoid running on m2.5?
Avoid using M2.5 as your primary model for customer-facing English content that needs to read naturally, tasks requiring precise instruction adherence with complex behavioral constraints, or open-ended creative work where tone and style matter. The model also has less extensive testing against adversarial English-language inputs compared to Western frontier models, so safety-sensitive applications should include additional validation layers. For internal-facing structured tasks like data extraction, report generation, and document formatting, these limitations rarely matter. The general rule is that the more structured and format-driven the task, the better M2.5 performs relative to its price point.
api reference
1. calling the api
install the client
the client provides a convenient way to interact with the api.
pip install inferencesh

setup your api key
set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.
export INFERENCE_API_KEY="inf_your_key"

run and get result
submit a request and wait for the final result. best for batch processing or when you don't need progress updates.
from inferencesh import inference

client = inference()

result = client.run({
    "app": "openrouter/minimax-m-25",
    "input": {}
})

print(result["output"])

stream live updates
get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
    "app": "openrouter/minimax-m-25",
    "input": {}
}, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication
the api uses api keys for authentication. see the authentication docs for detailed setup instructions.
3. files
file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.
automatic upload
the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.
# local file paths are automatically uploaded
result = client.run({
    "app": "openrouter/minimax-m-25",
    "input": {
        "image": "/path/to/local/image.png",       # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

4. webhooks
get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.
result = client.run({
    "app": "openrouter/minimax-m-25",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload
your endpoint receives a JSON POST with the task result:
{
    "id": "task_abc123",
    "status": 9,
    "output": { ... },
    "error": "",
    "session_id": null,
    "created_at": "2024-01-15T10:30:00Z",
    "updated_at": "2024-01-15T10:30:05Z"
}
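on the receiving side, a minimal handler could look like the sketch below. flask is my choice purely for illustration, and branching on the error field is an assumption about the payload semantics (the numeric status enum is not documented here):

from flask import Flask, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def handle_task():
    payload = request.get_json()
    # branching on "error" is an assumption; the status enum is undocumented here
    if payload.get("error"):
        print("task failed:", payload["id"], payload["error"])
    else:
        print("task completed:", payload["id"], payload["output"])
    return "", 200

5. schema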
input
- exclude reasoning tokens from response
- the context size for the model
- stream the response (true) or return complete response (false)
- the files to use for the model
- the images to use for the model
- tool definitions for function calling
- the tool call id for tool role messages
- the reasoning input of the message
- enable step-by-step reasoning
- the maximum number of tokens to use for reasoning
- the system prompt to use for the model
- the context to use for the model
- the role of the input text
- the input text to use for the model
- temperature
- top p
- max tokens