Kinetic-4B: A 4-Billion Parameter Model That Outperforms Claude Haiku at Tool Calling
AI agents are only as useful as the tools they can call. When a user says "create an issue on the backend repo," the model needs to select the correct API from a menu of options, extract the right arguments from natural language, and produce valid structured output — all in a format the downstream system can parse without post-processing.
This sounds simple. In practice, it's one of the hardest things to get small models to do reliably.
We build AI agents at Conscious Engines, and we integrate with Composio — a platform that provides unified access to over 800 tool integrations: GitHub, Slack, Salesforce, Stripe, Notion, Jira, and hundreds more. Every one of these integrations exposes tool schemas with different argument types, required and optional fields, and nested structures. The model needs to handle all of them.
We could have used a frontier model for this. Claude, GPT-4, or any number of capable hosted models can do tool calling well. But we had constraints that made that path unworkable:
Latency. Our agents operate in interactive loops where every tool call is a blocking step. API round-trips of 2–4 seconds compound quickly in multi-step workflows. We needed sub-2-second latency, ideally close to 1 second.
Cost. Per-token API pricing adds up. A self-hosted model on a single rented GPU changes the economics entirely.
Control. We wanted to own the model — to fine-tune it on our specific distribution of tools, to quantize it for our hardware, and to deploy it without depending on an external provider's uptime or rate limits.
So we set out to build our own. This post is the story of that process: the approaches that failed, the one that worked, and the benchmarks that surprised us.
TL;DR: We trained a 4B parameter model called Kinetic-4B that hits 82.33% accuracy on Composio tool-calling with a 1.61s p95 latency. It outperforms Claude Haiku 4.5 by 2.33 percentage points at 2.5x lower tail latency, and beats GPT-OSS-120B — a model with 30x more parameters — by 6 points at 5x lower p95. It took two failed attempts, one surprisingly good dataset, and a LoRA config we almost didn't try.
Here's the whole story.
---
Attempt 1: FunctionGemma 270M — The Smallest Possible Thing
Our first instinct was to go as small as we could get away with. Google's FunctionGemma 270M seemed purpose-built for this job: a 270-million-parameter model with a native structured format designed specifically for function calling. It used special tokens to delineate tool declarations and tool calls, which meant the model had been pre-trained to understand the boundary between "here are the tools you have" and "here is the one I'm calling."
We fine-tuned it on the xLAM function-calling dataset, built a dedicated benchmark harness around its native prompt format, and deployed it via vLLM with a custom tool-call parser.
The latency was everything we wanted — 0.66 seconds on average. For a model that needed to sit in the hot path of an agent loop, this was ideal.
The accuracy was not.
| Metric | Result | |--------|-------:| | Correct Tool + Args | 49.67% | | Tool Name Accuracy | 80.33% | | Failure Rate | 19.67% | | p95 Latency | 0.84s |
The model could usually identify which tool to call — it got the tool name right about 80% of the time. But producing the correct arguments was a different story. Complex schemas with nested fields, optional parameters, and typed arrays consistently tripped it up. One in five requests produced no valid tool call at all — the model would generate malformed output that couldn't be parsed.
Our fine-tuning improved accuracy over the base FunctionGemma, but not enough to be production-viable. 49.67% wasn't going to cut it for an agent loop where every missed tool call is a user-visible failure.
Before abandoning FunctionGemma entirely, we tried one more thing: repurposing it as a router. The idea was appealing in its simplicity — use the 270M model just to select the right toolkit (GitHub vs. Slack vs. Stripe), then hand off to a larger model to handle the actual tool call with full argument extraction. We wrote a router prompt mapping 50+ toolkits to their descriptions, and toolkit selection worked reasonably well in isolation.
But the two-hop architecture introduced a compounding error problem. If the router picked the wrong toolkit — and it did, roughly 20% of the time — the downstream model never had a chance. The pipeline's overall accuracy was bounded by the product of each stage's accuracy, not the sum. We abandoned this path.
The lesson was clear: 270 million parameters is simply not enough capacity for reliable structured output across hundreds of diverse tool schemas. The model could memorize tool names, but it couldn't generalize argument construction. We needed more capacity — but not so much that we'd lose the latency and single-GPU deployment properties we cared about.
---
Finding the Right Base: Qwen3-4B
We landed on Qwen3-4B. At 4 billion parameters, it's 15x larger than FunctionGemma — but still small enough to serve on a single GPU at reasonable latency. Several properties made it a strong starting point.
The model uses Grouped Query Attention with 32 query heads and 8 key-value heads across 36 transformer layers, paired with a SwiGLU MLP. This is an efficient architecture: the grouped KV heads reduce memory bandwidth during inference without sacrificing much representational power, and SwiGLU consistently outperforms standard ReLU MLPs in language modeling benchmarks.
More importantly for our use case, Qwen3-4B-Instruct already had native support for tool calling via the qwen3_coder format. This meant the model had been instruction-tuned to understand tool schemas presented in its chat template and to produce structured tool calls in a parseable format. We wouldn't be teaching the model a new skill from scratch — we'd be sharpening an existing one.
Before any fine-tuning, we benchmarked the base Qwen3-4B-Instruct model on our evaluation set to establish a baseline.
| Metric | Result | |--------|------:| | Correct Tool + Args | 78.67% | | Tool Name Accuracy | 95.0% | | Failure Rate | 5.0% | | p95 Latency | 1.84s |
78.67% accuracy out of the box. That's a 29-point jump over FunctionGemma — and the failure rate dropped from nearly 20% to just 5%. The model was already good at this. The question was whether we could close the remaining gap with domain-specific fine-tuning, and how much of that gap was closable at all.
---
Building the Training Data
The quality of fine-tuning data matters more than the quantity. This was the single biggest lesson from the entire project, and we learned it the hard way — our first fine-tuning attempt used generic function-calling data and scored 11 points lower than our second attempt with Composio-specific data. No amount of hyperparameter tuning would have closed that gap.
Our data pipeline had two stages.
First, we collected every tool schema from Composio's ecosystem. We wrote a script that pulled tool definitions from Composio's API across all their integrations — GitHub, Slack, Notion, Jira, Salesforce, Stripe, Shopify, Google Suite, and hundreds more. Each tool came with its full schema: parameter names, types, required/optional flags, and descriptions. We saved these as structured JSONL files, one directory per toolkit.
From this full set, we curated the top 20 most commonly used toolkits for focused training. Breadth mattered, but not at the expense of depth on the tools our agents would actually encounter in production.
Second, we generated synthetic training conversations. For each tool, we produced training samples in Qwen's native chat format — a system prompt presenting a set of tools, a user message with a natural-language request, and an assistant response containing the correct tool call with properly filled arguments.
Each sample presented 10 tools — 1 correct tool plus 9 distractors drawn from the same toolkit. This forced the model to learn genuine discrimination between similar tools, not just pattern-matching against the only plausible option. When the distractors come from the same toolkit, the model can't take shortcuts based on the domain of the request; it has to actually read the schemas.
The final dataset contained 13,694 training samples and a held-out test set of 300 samples that served as our primary evaluation benchmark throughout the project.
---
Fine-Tuning with LoRA
We chose LoRA (Low-Rank Adaptation) over full fine-tuning for a practical reason: we wanted to adjust the model's behavior on tool-calling without risking catastrophic forgetting of its general capabilities, and we wanted to do it on a single GPU in a reasonable amount of time.
LoRA works by freezing the pre-trained weights and injecting small trainable low-rank matrices into the model's layers. Instead of updating all 4.15 billion parameters, we trained only 132 million — 3.18% of the total. The key configuration choices:
Rank 64 with alpha 128. This is higher than the typical rank of 8 or 16 that you'll see in most LoRA tutorials. We found that the extra capacity mattered for structured output tasks where the model needs to learn precise formatting patterns, not just stylistic shifts.
All projection layers targeted. We applied LoRA to every attention projection (Q, K, V, O) and every MLP layer (gate, up, down). This is more aggressive than the common approach of targeting only attention, but at rank 64 it was still a tiny fraction of total parameters — and it gave meaningfully better results on argument extraction accuracy.
| Parameter | Value | |-----------|-------| | LoRA Rank (r) | 64 | | LoRA Alpha | 128 | | Target Modules | q, k, v, o, gate, up, down | | Trainable Params | 132M / 4.15B (3.18%) | | Learning Rate | 2e-4 | | Effective Batch | 16 (1 x 16 grad accum) | | Epochs | 2 | | Max Seq Length | 10,240 tokens | | LR Schedule | Cosine (5% warmup) | | Precision | bf16 |
Training converged fast. Loss dropped sharply in the first 200 steps as the model locked onto the tool-calling format, then plateaued for the remainder of training. By step 100, token accuracy had already crossed 95%. The second epoch was spent refining edge cases — complex argument schemas, optional parameters, tools with similar names but different argument structures.
*Figure 1: Kinetic-4B training dashboard. Cross-entropy loss (top left) drops sharply in the first 200 steps and plateaus near zero. Token accuracy (top right) reaches 99%+ by mid-training. Eval loss tracks train loss closely with no sign of overfitting. Total: 1,712 steps over ~4.5 hours on a single GPU.*
Eval loss tracked training loss closely throughout — no sign of overfitting, despite the relatively small dataset size. This makes sense: the training data was high-quality, narrowly scoped, and the model already had strong priors for tool calling from its instruction tuning. We weren't teaching it a new skill; we were calibrating an existing one to a specific distribution.
After training, we published the LoRA adapter on HuggingFace as consciousengines/Kinetic-FC-LoRA. Applying it to Qwen3-4B-Instruct-2507 reproduces Kinetic-4B.
Before this, there was a v1. Our first attempt at Kinetic-4B was a full fine-tune on generic function-calling data that wasn't well mixed across tool categories. It hit only ~71% accuracy with a 13% failure rate. In hindsight, it was likely overtrained on a poorly distributed dataset — the model had memorized patterns from overrepresented tools while underperforming on the rest. The version that worked came from starting over with a different approach entirely: LoRA instead of full fine-tuning, Composio-specific training data with balanced coverage across toolkits, and rank 64 across all projection layers. Data quality and distribution were the bigger factors by far.
---
The Benchmarks
We benchmarked Kinetic-4B against the base Qwen3-4B, two frontier models (Claude Haiku 4.5 and GPT-OSS-120B via OpenRouter), and our earlier FunctionGemma attempt. All runs used the same evaluation set: 300 samples, seed 42, 10 tools per prompt, tool_choice="auto". Several things stand out.
| Model | Params | Accuracy | Tool Name | Failed | p95 Latency | |-------|-------:|---------:|----------:|-------:|--------:| | Kinetic-4B | 4B | 82.33% | 95.33% | 4.67% | 1.61s | | Claude Haiku 4.5 | — | 80.0% | 90.33% | 9.67% | 4.02s | | Qwen3-4B Base | 4B | 78.67% | 95.0% | 5.0% | 1.84s | | GPT-OSS-120B | 120B | 76.33% | 94.67% | 5.33% | 7.99s | | FunctionGemma | 270M | 49.67% | 80.33% | 19.67% | 0.84s |
*Figure 2: Kinetic-4B benchmark comparison across all models on 300 Composio tool-calling samples. Top: accuracy and p95 latency for all four models. Bottom left: head-to-head sample breakdown between Kinetic and Qwen base. Bottom right: latency distribution with p95 markers.*
Kinetic-4B beats Claude Haiku 4.5 — with 2.5x lower tail latency. Haiku is Anthropic's speed-optimized model. Ours is 2.33 percentage points more accurate with a p95 of 1.61s versus Haiku's 4.02s, running on a single rented GPU. Perhaps more striking: Haiku's failure rate is more than double ours (9.67% vs 4.67%), and its tool name accuracy is 5 points lower. On this specific task, a fine-tuned 4B model is strictly better than a frontier model.
It beats GPT-OSS-120B — a model with 30x more parameters. Our 4B model is 6 points more accurate with a p95 of 1.61s versus 7.99s. GPT-OSS also exhibited occasional 120-second timeouts during benchmarking — the kind of tail latency that would be disastrous in production agent loops.
The fine-tuning added a net 11 correct samples over the base model. Looking at the head-to-head breakdown against Qwen3-4B base: 222 samples were correct under both models. Kinetic got 25 samples right that the base model missed. The base model got 14 samples right that Kinetic missed. The fine-tuning helped most on edge cases — complex argument schemas where the base model would hallucinate extra parameters or miss required ones.
---
What We Learned
This project reinforced a few things we suspected and taught us a few things we didn't expect.
Model size has a floor for structured output. The jump from 270M to 4B parameters wasn't just "more capacity." It crossed a qualitative threshold where the model could reason about parameter types, distinguish required from optional fields, and handle schema constraints — things that the smaller model could memorize for individual tools but couldn't generalize across hundreds of schemas. If you're building a tool-calling model and your accuracy is stuck, the answer might not be better data or a smarter training recipe. It might just be a bigger model.
Data quality and distribution dominate everything else. 13,694 samples was enough — because every sample used real Composio tool schemas with realistic user queries and correctly filled arguments, balanced across tool categories. Our v1 attempt used generic function-calling data that wasn't well mixed — some tool types were overrepresented, others barely present. It scored 11 points lower on the same eval set. We tried to close that gap with hyperparameter tuning and training for more epochs. It didn't work. The gap was in the data, not the training.
LoRA is sufficient for domain adaptation — and safer than full fine-tuning. We trained 3.18% of the model's parameters and got a 3.66-point accuracy improvement over the base model. Our v1 full fine-tune actually performed worse, likely due to overfitting and catastrophic forgetting. LoRA's implicit regularization — freezing most of the network — turned out to be a feature, not a limitation. For tasks where you're sharpening an existing skill rather than teaching a new one, LoRA at moderate rank is the right tool.
Small models can beat frontier models on narrow tasks. This is the result that surprised people most. Kinetic-4B beats Claude Haiku and GPT-OSS-120B on Composio tool-calling — not because it's a better model in general, but because we specialized it on exactly this distribution. If your use case is narrow enough and well-defined enough, fine-tuning a small model will often beat prompting a large one. The economics of this are significant: a single rented GPU versus per-token API pricing, with better latency and higher accuracy.
Benchmark on your actual distribution. Standard function-calling benchmarks — Berkeley Function Calling Leaderboard, BFCL, and others — wouldn't have told us what we needed to know. The tool schemas, argument patterns, and failure modes in Composio's ecosystem are specific enough that we needed our own eval set. 300 samples with 10-tool discrimination turned out to be enough to make reliable comparisons between models, and the results generalized well to production traffic.
---
What's Next
We're exploring several directions from here. Expanding the training data beyond the top 20 toolkits to cover more of Composio's full ecosystem. Moving from single-tool-call evaluation to multi-step agent traces where the model needs to chain multiple tools together. And exploring constrained decoding via vLLM's guided generation to enforce valid JSON schemas at decode time, eliminating the remaining 4.67% failure rate entirely.
We're also revisiting the routing idea — but differently this time. Instead of using a small model as a naive toolkit selector, we're building a dedicated tool routing layer that can narrow down hundreds of tools to a focused subset before the model ever sees them. The compounding error problem from our FunctionGemma router experiment taught us what *not* to do; the next version will be architecturally different. More on that in a future post.
The LoRA adapter is available at consciousengines/Kinetic-FC-LoRA on HuggingFace — stack it on top of Qwen/Qwen3-4B-Instruct-2507 and you have Kinetic-4B. The README on the model card has a drop-in PyTorch + PEFT inference snippet. The training, benchmarking, and deployment code is in our repo.
---
Acknowledgements
This research was carried out by Ritam Pal and Kautuk Kundan at Conscious Engines, as part of the LossFunk residency.