
kimi-k2-thinking

A powerful open-source thinking agent that excels at complex, multi-step problem-solving and consistently uses tools effectively over extended operations.

run with your agent
# install belt
$ curl -fsSL https://cli.inference.sh | sh
# view schema & details
$ belt app get openrouter/kimi-k2-thinking
# run
$ belt app run openrouter/kimi-k2-thinking

Moonshot AI has been one of those companies that keeps showing up in interesting places. Founded in March 2023 by Yang Zhilin and fellow Tsinghua University classmates Zhou Xinyu and Wu Yuxin, they were early to long context - their Kimi chatbot launched in October 2023 with support for 200,000 Chinese characters per conversation when most labs were still struggling with 32K tokens. Yang holds a PhD from Carnegie Mellon's Language Technologies Institute and co-authored the Transformer-XL and XLNet papers during stints at Google Brain and Facebook AI Research. The company has raised over $2.6 billion across multiple funding rounds, with its most recent round valuing it at roughly $18 billion.

Now they've released K2 Thinking, an open-source reasoning model built on a trillion-parameter Mixture-of-Experts architecture that activates 32 billion parameters per forward pass, with a 256K token context window. It's substantially cheaper than Claude Opus 4.6, while delivering genuine reasoning depth through explicit chain-of-thought processing. If you're running complex tasks with large input contexts, the economics work out very well compared to frontier proprietary models.

I've been testing K2 Thinking through inference.sh for agent workflows specifically, and I think the model deserves a dedicated conversation rather than a passing mention in a roundup. It's not the right model for most workloads. But for the workloads where it is right, nothing else quite matches the combination of open weights, genuine reasoning depth, and effective tool use at this price structure.

what thinking models actually do differently

The term "thinking model" gets thrown around loosely, so it's worth being precise. When K2 Thinking processes a prompt, it doesn't jump straight to generating an answer. Instead, it produces an explicit chain of thought - a visible reasoning trace where the model works through the problem step by step before committing to a response. You can see the model consider alternatives, reject approaches that won't work, backtrack when it hits a dead end, and build toward a conclusion through structured deliberation.

This is the same broad approach that DeepSeek R1 pioneered and that Anthropic uses in its extended thinking mode on Claude. The difference is in execution, training data, and the specific capabilities each model develops as a result. K2 Thinking's chain of thought tends to be thorough and methodical. It's not terse. The model will often generate substantially more tokens in its reasoning trace than in its final output. You're paying for all that thinking.

The practical consequence is that K2 Thinking is genuinely better at multi-step problems than non-thinking models in a similar tier. Ask it to debug a system where the error could originate from three different subsystems, and it will systematically check each one rather than guessing at the most common cause. Ask it to plan an agent workflow with dependencies between steps, and it will identify the ordering constraints before proposing a solution. This deliberate approach uses more output tokens, but it reduces the number of retries you need, which often makes it more cost-effective in total.
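If you want to see the trace rather than take my word for it, you can request it explicitly. Here's a minimal sketch using the inference.sh client and the reasoning fields documented in the api reference below; the prompt and the exact field access are illustrative, not canonical.

python
from inferencesh import inference

client = inference()

# reasoning_effort and reasoning_exclude are the schema fields documented below;
# values other than the defaults are illustrative
result = client.run({
    "app": "openrouter/kimi-k2-thinking",
    "input": {
        "text": "a service intermittently returns 502s; list the subsystems that "
                "could be responsible and how to rule each one out",
        "reasoning_effort": "high",   # options: "low", "medium", "high", "none"
        "reasoning_exclude": False,   # keep the thinking trace in the output
    }
})

output = result["output"]
print(output.get("reasoning"))   # the chain-of-thought trace
print(output.get("response"))    # the final answer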

where the economics work for you

K2 Thinking is significantly cheaper than frontier proprietary models, which makes it attractive across a wide range of workloads. The tradeoff is capability rather than price - you're saving money and accepting that the model may not match Opus on the hardest reasoning tasks or the most nuanced English output. Moonshot built K2 for the kind of work where you dump a large context into the model and need it to reason carefully through a specific question. Code review across a full repository, answering questions about lengthy documents, planning multi-step operations based on extensive tool output. These are the sweet spots.

tool use that actually holds up over extended operations

This is where K2 Thinking has surprised me most. A lot of models can use tools competently for one or two calls. Hand them a function schema, ask a question that requires calling it, and they'll generate the right parameters. That's table stakes at this point. The harder test is sustained tool use over many steps - the kind of agent operation where the model needs to call a tool, interpret the result, decide what to call next based on what it learned, handle unexpected outputs gracefully, and maintain coherence across ten or fifteen sequential tool interactions.

K2 Thinking handles this well. The reasoning trace seems to help here in a way that's more than cosmetic. When the model thinks explicitly about what information it has, what it still needs, and which tool call will close the gap, the resulting tool usage is more purposeful. It's less likely to make speculative tool calls just to "try something" and more likely to call the right tool with the right parameters because it's already reasoned about what it expects to get back.

Moonshot's own testing shows K2 Thinking can execute 200 to 300 sequential tool calls without human intervention, reasoning coherently across hundreds of steps. I've run it through agent loops that involve seven or eight sequential tool calls, and the quality of the later calls stays high. The model doesn't degrade into confusion about what it's already done or what the accumulated results mean. This is a real differentiator for building agent systems where reliability across extended operations matters more than raw speed on a single turn.
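For concreteness, here is a rough sketch of the kind of loop I mean, assuming the tools, tool_calls, and tool_call_id fields from the schema below follow the familiar OpenAI-style function-calling convention; the tool itself, its arguments, and the exact shape of the returned call objects are hypothetical.

python
import json
from inferencesh import inference

client = inference()

# a hypothetical tool the agent can call
def get_weather(city: str) -> str:
    return json.dumps({"city": city, "temp_c": 18})

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "look up current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

context = []
prompt = "compare the current weather in Paris and Tokyo"

for _ in range(10):  # cap the loop; K2 can sustain far longer chains
    result = client.run({
        "app": "openrouter/kimi-k2-thinking",
        "input": {"text": prompt, "tools": tools, "context": context},
    })
    output = result["output"]
    calls = output.get("tool_calls") or []
    if not calls:
        print(output["response"])
        break
    # feed each result back as a tool-role message keyed by its tool_call_id
    for call in calls:
        args = json.loads(call["function"]["arguments"])
        context.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": [{"type": "text", "text": get_weather(**args)}],
        })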

honest comparison with claude opus 4.6

K2 Thinking is dramatically cheaper than Opus 4.6, but the comparison on capability is still worth making. Let me be straightforward about where each model wins.

Opus 4.6 is the stronger general-purpose model. On creative writing, nuanced English prose, cultural context, instruction following in complex social situations, and tasks where the answer requires judgment rather than computation, Opus produces better results. Anthropic has years of RLHF tuning focused on English-language quality, and it shows. If your task is fundamentally about language quality, Opus is the better choice.

K2 Thinking is competitive on structured reasoning tasks. Code generation, mathematical problem solving, systematic analysis, and multi-step planning - these are domains where the thinking approach matters more than linguistic polish. On several coding benchmarks, K2 Thinking performs in the same tier as Opus, which is remarkable for an open-source model from a lab with a fraction of Anthropic's resources.

The open-source factor tilts the picture in K2's favor for specific scenarios. If you need to inspect model weights for compliance, fine-tune for a specialized domain, or run inference on your own infrastructure, K2 gives you options that a proprietary model simply cannot. For a regulated industry where you need to explain how your AI system works at a deeper level than "we call an API," open weights are not a nice-to-have. They're a requirement.

Where Opus 4.6 wins clearly is on the hardest agentic tasks where the model needs to navigate ambiguity, maintain alignment with complex instructions over very long interactions, and recover gracefully from underspecified situations. Anthropic's investment in agent reliability is real, and Opus reflects it. K2 Thinking is more mechanical in its approach - which is a strength for well-defined problems and a weakness for messy, real-world ones.

comparison with deepseek r1

DeepSeek R1 is the more natural comparison point since it's also an open-source thinking model from a Chinese lab. The two models share architectural philosophy - both generate explicit reasoning traces and both are available with open weights.

The differences are in focus areas. R1 was optimized heavily for mathematics and scientific reasoning. It's exceptionally good at formal proofs and quantitative analysis. K2 Thinking leans more toward practical tool use and software engineering tasks. If your workload is primarily mathematical, R1 likely edges ahead. If it's primarily about building agent systems that call APIs, process results, and make decisions, K2's tool use capabilities become the deciding factor.

Pricing between the two varies by provider, but they're in roughly the same ballpark for most deployments. The choice between them is less about cost and more about which capability profile matches your workload. There's no reason you can't use both through the same API surface, routing mathematical tasks to R1 and agent tasks to K2.
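If you do route between them, the plumbing is trivial. A sketch, assuming both models are exposed through the same inference.sh surface; the deepseek app id here is a placeholder, not a confirmed identifier.

python
from inferencesh import inference

client = inference()

# hypothetical routing rule: quantitative work to R1, agent/tool work to K2
def pick_app(task_kind: str) -> str:
    if task_kind in {"math", "proof", "quant"}:
        return "openrouter/deepseek-r1"  # placeholder app id, check the catalog
    return "openrouter/kimi-k2-thinking"

result = client.run({
    "app": pick_app("agent"),
    "input": {"text": "plan the tool calls needed to reconcile two invoice exports"},
})
print(result["output"]["response"])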

the open-source question

I keep returning to the open-source angle because I think it's underweighted in most model comparisons. The conversation usually goes: "Is it as good as the proprietary model?" And if the answer is "not quite," the open-source model gets dismissed.

This misses the point. Open weights create optionality that proprietary models cannot offer. You can fine-tune K2 Thinking on your domain-specific data. You can run it on-premise if your compliance requirements demand it. You can inspect the attention patterns to understand why the model made a specific decision. You can build derivative models for specialized use cases. None of these are possible with Opus or GPT-4o, regardless of how capable those models are.

For production agent systems where you need confidence in what the model is doing and the ability to adjust its behavior without waiting for the provider to update their API, open-source thinking models represent a fundamentally different value proposition. K2 Thinking is not the first open-source reasoning model, but it's one of the most capable, and the tool use capabilities make it particularly relevant for the agent use case.

who should use this model

K2 Thinking is not a general-purpose replacement for your current LLM. I want to be explicit about that because the benchmark scores make it look competitive overall, but the right deployment is specific.

Use K2 Thinking when you have complex, well-defined problems with large input contexts. Code review and analysis across substantial codebases. Multi-step agent workflows where reliable tool use is the primary concern. Technical planning tasks where the model needs to reason through constraints and dependencies. Research synthesis where you're feeding in dozens of papers or documents and asking for structured analysis.

Don't use it for high-volume, low-complexity tasks where the reasoning overhead is wasted on problems that don't require deep thinking. Don't use it for creative writing or marketing copy where English-language nuance matters more than structured thinking. Don't use it as a general chatbot where most interactions are short and the reasoning overhead adds cost without adding value.

The sweet spot is the developer or team building agent systems who needs a model that can think through hard problems, use tools reliably across extended operations, and read large contexts cheaply. If that describes your workload, K2 Thinking deserves a serious evaluation alongside the proprietary alternatives. If it doesn't, you'll get better value from a model designed for a different set of tradeoffs.

frequently asked questions

is kimi k2 thinking worth using over cheaper non-reasoning models?

K2 Thinking is already quite affordable - the question is whether the reasoning overhead adds value for your task. For simple classification or extraction, a non-thinking model will be faster and cheaper. But for complex agent workflows involving sequential tool calls, K2 Thinking's deliberate reasoning reduces retry loops - the model fails partway through less often, and you burn fewer tokens on attempts that produce nothing useful. I've found that for tasks requiring five or more tool interactions, the reduction in retries makes K2 Thinking the more cost-effective choice despite the reasoning token overhead. Track your actual cost per successful completion, not just your cost per token, and the comparison looks different.
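One way to make that comparison concrete is to compute the metric directly. A small sketch of the bookkeeping, with invented numbers purely for illustration:

python
# cost per successful completion rather than cost per token;
# token counts and the per-million-token price below are invented for illustration
def cost_per_success(runs, price_per_mtok):
    total_tokens = sum(r["tokens"] for r in runs)   # prompt + reasoning + output
    successes = sum(1 for r in runs if r["succeeded"])
    return float("inf") if successes == 0 else \
        (total_tokens / 1_000_000) * price_per_mtok / successes

runs = [{"tokens": 42_000, "succeeded": True},
        {"tokens": 38_000, "succeeded": True},
        {"tokens": 51_000, "succeeded": False},
        {"tokens": 40_000, "succeeded": True}]
print(cost_per_success(runs, price_per_mtok=0.6))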

how does kimi k2 thinking compare to running deepseek r1 for code generation tasks?

Both models handle code generation competently, but they approach it differently. R1 tends to excel at algorithmic problems and tasks with clear mathematical structure. K2 Thinking is stronger at practical software engineering - understanding existing codebases, working with APIs, generating code that integrates with real-world systems rather than solving isolated puzzles. For competitive programming style problems, R1 likely wins. For "read this codebase and implement a new feature that fits the existing patterns," K2 Thinking's combination of long-context reading and tool use gives it an edge. The two models complement each other well if you're willing to route tasks based on their characteristics.

can i self-host kimi k2 thinking since it's open source?

Yes, the weights are available and self-hosting is a valid deployment option. The practical consideration is compute requirements. Thinking models generate substantially more tokens than standard models because the reasoning trace is part of the output, which means inference costs scale with the length of the thinking process. You'll need GPU infrastructure capable of serving a large model with potentially long generation sequences. For most teams, accessing K2 Thinking through an API provider is more cost-effective unless you have specific compliance requirements that mandate on-premise deployment or you're running enough volume that dedicated infrastructure pays for itself. The open weights still provide value even if you use the API - you can fine-tune and deploy a customized version when your use case justifies the infrastructure investment.

api reference

about

a powerful open-source thinking agent that excels at complex, multi-step problem-solving and consistently uses tools effectively over extended operations.

1. calling the api

install the client

the client provides a convenient way to interact with the api.

bash
pip install inferencesh

setup your api key

set INFERENCE_API_KEY as an environment variable. get your key from settings → api keys.

bash
export INFERENCE_API_KEY="inf_your_key"

run and get result

submit a request and wait for the final result. best for batch processing or when you don't need progress updates.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "openrouter/kimi-k2-thinking",
    "input": {}
})

print(result["output"])

stream live updates

get real-time progress updates as the task runs. ideal for showing progress bars, partial results, or long-running tasks.

python
from inferencesh import inference

client = inference()

# stream=True yields updates as they arrive
for update in client.run({
        "app": "openrouter/kimi-k2-thinking",
        "input": {}
    }, stream=True):
    if update.get("progress"):
        print(f"progress: {update['progress']}%")
    if update.get("output"):
        print(f"output: {update['output']}")

2. authentication

the api uses api keys for authentication. see the authentication docs for detailed setup instructions.

3. files

file inputs are automatically handled by the sdk. you can pass local paths, urls, or base64 data.

automatic upload

the python sdk automatically detects local file paths and uploads them. urls are passed through as-is.

python
# local file paths are automatically uploaded
result = client.run({
    "app": "openrouter/kimi-k2-thinking",
    "input": {
        "image": "/path/to/local/image.png",  # detected & uploaded
        "audio": "https://example.com/audio.mp3",  # url passed through
    }
})

manual upload

you can also upload files manually and use the returned url.

python
# upload and get a hosted URL
file = client.files.upload("/path/to/file.png")
print(file.uri)  # https://cloud.inference.sh/...

4. webhooks

get notified when a task completes by providing a webhook url. when the task reaches a terminal state (completed, failed, or cancelled), a POST request is sent to your url with the task result.

python
result = client.run({
    "app": "openrouter/kimi-k2-thinking",
    "input": {},
    "webhook": "https://your-server.com/webhook"
}, wait=False)

webhook payload

your endpoint receives a JSON POST with the task result:

json
1{2  "id": "task_abc123",3  "status": 9,4  "output": { ... },5  "error": "",6  "session_id": null,7  "created_at": "2024-01-15T10:30:00Z",8  "updated_at": "2024-01-15T10:30:05Z"9}
idstringtask id
statusnumberterminal status (9=completed, 10=failed, 11=cancelled)
outputobjecttask output (when completed)
errorstringerror message (when failed)
session_idstringsession id (if using sessions)
created_atstringiso timestamp
updated_atstringiso timestamp
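as a sketch of the receiving side, one way to handle that POST (shown here with flask, though any web framework works; the terminal-status mapping follows the fields above):

python
# a minimal receiver for the webhook payload described above
from flask import Flask, request

app = Flask(__name__)

TERMINAL = {9: "completed", 10: "failed", 11: "cancelled"}

@app.route("/webhook", methods=["POST"])
def handle_task():
    payload = request.get_json()
    state = TERMINAL.get(payload["status"], "unknown")
    if state == "completed":
        print(payload["id"], payload["output"])
    else:
        print(payload["id"], state, payload.get("error", ""))
    return "", 200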

5. schema

input

reasoning_exclude (boolean)

exclude reasoning tokens from response

default: false

context_size (integer)

the context size for the model

default: 200000

stream (boolean)

stream the response (true) or return complete response (false)

default: true

tools (array)

tool definitions for function calling

tool_call_id (string)

the tool call id for tool role messages

reasoning (string)

the reasoning input of the message

reasoning_effort (string)

enable step-by-step reasoning

default: "none"
options: "low", "medium", "high", "none"

reasoning_max_tokens (integer)

the maximum number of tokens to use for reasoning

system_prompt (string)

the system prompt to use for the model

default: "you are a helpful assistant that can answer questions and help with tasks."

context (array)

the context to use for the model

default: []
example: [{"content":[{"text":"What is the capital of France?","type":"text"}],"role":"user"},{"content":[{"text":"The capital of France is Paris.","type":"text"}],"role":"assistant"}]

role (string)

the role of the input text

default: "user"
options: "user", "assistant", "system", "tool"

text (string, required)

the input text to use for the model

example: "write a haiku about artificial general intelligence"

temperature (number)

temperature

default: 0.7 (min: 0, max: 1)

top_p (number)

top p

default: 0.95 (min: 0, max: 1)

max_tokens (integer)

max tokens

default: 64000
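a request that combines several of these input fields; values mirror the schema defaults and examples above, and the reasoning_effort choice is illustrative.

python
from inferencesh import inference

client = inference()

result = client.run({
    "app": "openrouter/kimi-k2-thinking",
    "input": {
        "text": "write a haiku about artificial general intelligence",
        "system_prompt": "you are a helpful assistant that can answer questions and help with tasks.",
        "context": [
            {"role": "user", "content": [{"type": "text", "text": "What is the capital of France?"}]},
            {"role": "assistant", "content": [{"type": "text", "text": "The capital of France is Paris."}]},
        ],
        "reasoning_effort": "medium",
        "temperature": 0.7,
        "top_p": 0.95,
        "max_tokens": 64000,
        "stream": False,
    },
})
print(result["output"]["response"])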

output

images (array)

images

output_meta (object)

structured metadata about inputs/outputs for pricing calculation

response (string, required)

the generated text response

usage (object)

token usage statistics

tool_calls (array)

tool calls for function calling

reasoning (string)

the reasoning output of the model
