Applicability of Small Models for Agentic QA

I’ve been doing some recent work on automating QA in AI workflows. Most recently, I built a small jury pool app to assess agreement across generated outputs, which got me interested in small models. Automated QA is a necessary step for ensuring quality and precision, and it is well-trodden ground in the research. After running down a few rabbit holes, I decided to write up what I found. Not all of it was directly applicable to my use case, but much of it will be eventually.

Agentic workflows tend to fail in a particular way. A system that plans, calls a tool, reads the result, and feeds it into the next step carries risk forward at every handoff, and that risk compounds multiplicatively rather than adding up. A five-step chain in which each step is correct 95 percent of the time returns a correct end-to-end result around 77 percent of the time; at ten steps the figure falls near 60 percent. That arithmetic assumes the errors are independent, which is generous. In practice they correlate: a misread instruction early on propagates, later steps build on the flawed output instead of catching it, and a partially correct result at one node can amplify through the stages that follow (Ro et al., 2025).
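The compounding figures above can be reproduced in a few lines. This is a minimal sketch under the same generous assumption the paragraph names: every step has the same accuracy and errors are independent. The function name is illustrative.

```python
def end_to_end_success(step_accuracy: float, steps: int) -> float:
    """Probability that every step in a chain is correct.

    Assumes each step has the same accuracy and that errors are independent,
    which is the optimistic case; correlated errors make the real figure worse.
    """
    if not 0.0 <= step_accuracy <= 1.0:
        raise ValueError("step_accuracy must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return step_accuracy ** steps


for n in (1, 5, 10):
    print(f"{n:>2} steps at 95% per step: {end_to_end_success(0.95, n):.1%}")
```

At 95 percent per step this gives about 77.4 percent for five steps and about 59.9 percent for ten.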

The issue becomes not whether to check the work but where to put the checks and how much to spend on them. Two reflexes are common, and both are expensive. One is to route every step through a large, capable model and trust it to notice its own mistakes. The other is to insert a verification pass after every node. Ro et al. (2025) are direct about the cost of the second approach. Verifying every step introduces latency and cost overhead, which is why they work out which nodes are error-prone enough to deserve a costly check, which verifier suits each node, and how to verify without stalling the workflow. Verification is not free, and treating it as free is its own failure mode.

Checking is different from producing

The opening for small models comes from an asymmetry that applies across domains. Confirming that a proposed answer is correct is frequently cheaper than producing the answer from scratch. Wei (2025) calls this the asymmetry of verification, and the examples are familiar from outside machine learning. A prime factorization takes effort to find and almost none to check, a completed Sudoku grid is read in seconds and solved in many minutes, and running a unit test is trivial next to writing the function that passes it. The capability a system needs to recognize a good output is generally not the same as that needed to generate one.

It should be noted that the asymmetry is conditional. A controlled study of verification dynamics across twelve benchmarks and fourteen open models found that verification success depends on the interaction between problem difficulty, the strength of the generator, and the verifier’s own ability, rather than tracking the verifier’s raw capability alone (Zhou et al., 2025). Verification gets easier when the problem is tractable and the generator’s mistakes are detectable, and it gets harder at the extremes. A small verifier fits a specific set of conditions and is not a universal substitute for a large one.

Where agentic QA sits

Most quality assurance inside an agent lives on the easy side of that asymmetry. The recurring checks are bounded: does this tool call conform to the expected schema, is this output consistent with the document that was retrieved, did this step satisfy the constraint the plan stated. These are reference-grounded judgments, not open-ended evaluations of correctness in the abstract. The reference collapses the difficulty of the task. Krumdick et al. (2025) found that giving a relatively small model a correct, human-written reference answer let a 7-billion-parameter Qwen judge reach better agreement with human annotators than a much larger GPT-4o judge supplied only with a synthetic reference. With the answer in context, the model no longer has to derive the solution and then compare; it only has to compare.
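To make the "only has to compare" point concrete, here is a rough sketch of how a reference-grounded check differs from a reference-free one at the prompt level. The prompt wording, the PASS/FAIL convention, and both function names are assumptions made for illustration; this is not the setup Krumdick et al. (2025) used, and no model is called.

```python
from typing import Optional


def build_judge_prompt(question: str, candidate: str,
                       reference: Optional[str] = None) -> str:
    """Assemble a grading prompt, with or without a trusted reference answer."""
    parts = [
        "You are grading a candidate answer.",
        f"Question:\n{question}",
        f"Candidate answer:\n{candidate}",
    ]
    if reference is not None:
        # Reference-grounded: the judge compares instead of solving.
        parts.append(f"Reference answer (treat as correct):\n{reference}")
        parts.append("Does the candidate agree with the reference? "
                     "Reply with exactly PASS or FAIL.")
    else:
        # Reference-free: the judge must work out the answer on its own first.
        parts.append("Is the candidate correct? Reply with exactly PASS or FAIL.")
    return "\n\n".join(parts)


def parse_verdict(raw: str) -> Optional[bool]:
    """Map the judge's short reply to True/False, or None if it is malformed."""
    token = raw.strip().upper()
    if token == "PASS":
        return True
    if token == "FAIL":
        return False
    return None
```

Returning None for a malformed reply, rather than guessing, lets the caller escalate that check instead of silently passing it.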

The benchmarks on purpose-built small judges point in the same direction. JudgeLM fine-tuned at 7 billion parameters reaches agreement above 90 percent with its teacher judge, exceeding the agreement humans reach with one another on the same set, and grades thousands of samples in minutes (Zhu et al., 2025). CompassJudger-2 at 7 billion parameters reports judgment accuracy competitive with models in the hundreds of billions (Zhang et al., 2025). A more recent line of work goes further, on the premise that the judgment already exists in a small model’s internal state and can be read directly rather than generated as text. With this approach, a 1.7-billion-parameter model approaches the accuracy of full-size judges on reasoning benchmarks (Li et al., 2026). These figures measure agreement with a reference judge, which is reproduction of an evaluation rather than independent proof of correctness. What they establish is that a constrained, well-specified check can be reproduced cheaply. Across these results, bounded checking appears to require less model capacity than generation, and a model sized for the smaller task can be sufficient.

The boundary is also fairly clear. Reference-free judgment on hard, open-ended problems is where small verifiers degrade, the same line the verification-dynamics results draw from the other direction (Zhou et al., 2025). A QA layer built on small models has to respect that boundary: route the schema checks, consistency checks, and constraint checks to the small verifier, and reserve a larger model or human review for the open-ended calls that small judges miss. This is the selective-verification design Ro et al. (2025) formalize. The goal is not a verifier on every node, but the right verifier on the nodes that need one.
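The routing rule described above can be sketched as a small dispatch function. This is an illustration of the idea only, not the algorithm from Ro et al. (2025); the type names, the error_prone flag, and the three verifier tiers are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class CheckKind(Enum):
    SCHEMA = "schema"            # does the tool call match the expected schema?
    CONSISTENCY = "consistency"  # does the output agree with the retrieved document?
    CONSTRAINT = "constraint"    # did the step satisfy the plan's stated constraint?
    OPEN_ENDED = "open_ended"    # reference-free judgment on a hard problem


class Verifier(Enum):
    NONE = "none"
    SMALL_MODEL = "small_model"
    LARGE_MODEL_OR_HUMAN = "large_model_or_human"


# Bounded, reference-grounded checks that a small verifier can handle.
BOUNDED_KINDS = {CheckKind.SCHEMA, CheckKind.CONSISTENCY, CheckKind.CONSTRAINT}


@dataclass(frozen=True)
class Node:
    name: str
    kind: CheckKind
    error_prone: bool  # hypothetical flag; how a node earns it is out of scope here


def choose_verifier(node: Node) -> Verifier:
    """Pick a verifier tier for one workflow node."""
    if not node.error_prone:
        return Verifier.NONE  # not every node needs a check
    if node.kind in BOUNDED_KINDS:
        return Verifier.SMALL_MODEL
    return Verifier.LARGE_MODEL_OR_HUMAN


workflow = [
    Node("format_tool_call", CheckKind.SCHEMA, error_prone=True),
    Node("summarize_retrieval", CheckKind.CONSISTENCY, error_prone=True),
    Node("log_step", CheckKind.SCHEMA, error_prone=False),
    Node("draft_recommendation", CheckKind.OPEN_ENDED, error_prone=True),
]
for node in workflow:
    print(f"{node.name} -> {choose_verifier(node).value}")
```

In a real system the error_prone flag would come from measured failure rates per node rather than a hand-set boolean.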

The economics

The cost structure is what lets the small checker run as a standing fixture rather than an occasional one. Belcak et al. (2025) put models under roughly ten billion parameters at ten to thirty times cheaper to serve than generalist large models, demanding fewer GPUs, fine-tunable in hours rather than weeks, and in many cases able to run on a single consumer device. A checker mostly reads input and emits a short verdict, and generated tokens are priced well above the tokens a model reads, so the verification call stays cheap on both axes: small model, short output. A verifier with that profile can sit on the routine handoffs throughout the chain without imposing the latency tax that a large model on every step would. The continuous, bounded checking that blunts the compounding-error problem becomes affordable precisely because the model doing it is small.

The token math bears this out. Take OpenAI’s published rates as of mid-2026, with the standing caveat that list prices move. One verifier check reads perhaps 1,500 tokens of context, including the step’s instruction, the output under inspection, and a short reference, and returns a 50-token verdict. At the general-purpose gpt-5.4’s rates of $2.50 per million input tokens and $15.00 per million output, that check costs roughly $0.0045, most of it in the input the verifier reads. The same check through the small gpt-5.4-nano, at $0.20 and $1.25 per million, costs about $0.00036, more than twelve times less (OpenAI, 2026). Across a workflow that runs ten checks per execution at a hundred thousand executions a month, that is the difference between roughly $4,500 and $360 for the verification layer alone. The ratio falls inside the ten-to-thirtyfold serving-cost range Belcak et al. (2025) describe, and because the verifier re-reads the same stable instruction on every call, prompt caching, which is billed on that same price list at a tenth of the standard input rate, pulls the input cost down further.
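The per-check and monthly figures above can be reproduced with a short calculation. This sketch hard-codes the rates and token counts quoted in the paragraph; the class and variable names are illustrative, and prompt caching is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """List prices in US dollars per million tokens."""
    input_per_million: float
    output_per_million: float


def check_cost(rates: Rates, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one verification call at the given rates."""
    return (input_tokens * rates.input_per_million
            + output_tokens * rates.output_per_million) / 1_000_000


# Rates as quoted in the text; list prices move, so treat these as a snapshot.
LARGE = Rates(input_per_million=2.50, output_per_million=15.00)
SMALL = Rates(input_per_million=0.20, output_per_million=1.25)

INPUT_TOKENS = 1_500   # instruction + output under inspection + short reference
OUTPUT_TOKENS = 50     # short verdict
CHECKS_PER_MONTH = 10 * 100_000  # ten checks per run, 100,000 runs a month

large_cost = check_cost(LARGE, INPUT_TOKENS, OUTPUT_TOKENS)
small_cost = check_cost(SMALL, INPUT_TOKENS, OUTPUT_TOKENS)

print(f"large model, per check: ${large_cost:.4f}")
print(f"small model, per check: ${small_cost:.5f}")
print(f"ratio: {large_cost / small_cost:.1f}x")
print(f"monthly, large: ${large_cost * CHECKS_PER_MONTH:,.2f}")
print(f"monthly, small: ${small_cost * CHECKS_PER_MONTH:,.2f}")
```

The ratio works out to about 12.4, and the small-model monthly total to $362.50, which the text rounds to $360.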

The deployment floor drops below a GPU requirement entirely. Quantized to three or four bits per weight in the GGUF format that llama.cpp reads, a model needs little enough memory and bandwidth to run on a commodity CPU with no accelerator attached; lower-bit schemes also raise CPU token-generation throughput, which is what governs latency when the output is a handful of tokens (Kurt, 2026). Small models clear that bar easily. Zhang and Huang (2025) ran a 1-billion-parameter model on an iPhone’s CPU at 17 tokens per second, which is faster, on that device, than the same model with GPU acceleration once the GPU’s memory-transfer overhead is counted. This matches the verifier’s workload, which reads a prompt and returns a short verdict. The marginal cost of the check then approaches the cost of CPU cycles already idle on the machine running the agent. The tradeoff is concurrency. CPU inference works through requests largely in sequence rather than batching them the way a GPU endpoint does, so a CPU-resident verifier suits the single-agent loop and the edge deployment more than a shared, high-throughput verification service.

A large verifier on every node was rarely an architecture teams could justify at scale, because latency and cost ruled it out, which is why they relied on human verification. The condition that forced that compromise was the cost of capable checking. Small verifiers do not make checking free, but they can make enough of it affordable that verification moves from an occasional control to a routine part of an agentic workflow.


References

Belcak, P., Heinrich, G., Diao, S., Fu, Y., Dong, X., Muralidharan, S., Lin, Y. C., & Molchanov, P. (2025). Small language models are the future of agentic AI. arXiv. https://doi.org/10.48550/arXiv.2506.02153

Krumdick, M., Lovering, C., Reddy, V., Ebner, S., & Tanner, C. (2025). No free labels: Limitations of LLM-as-a-judge without human grounding. arXiv. https://doi.org/10.48550/arXiv.2503.05061

Kurt, U. (2026). Which quantization should I use? A unified evaluation of llama.cpp quantization on Llama-3.1-8B-Instruct. arXiv. https://doi.org/10.48550/arXiv.2601.14277

Li, Z., Zhang, Y., Li, M., Ji, Y., Zeng, Y., Cheng, N., Zhu, Y., Wang, Y., Wang, S., Xiao, J., & He, D. (2026). Rethinking LLM-as-a-judge: Representation-as-a-judge with small language models via semantic capacity asymmetry [Conference paper]. The Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=VAISvCsrvG

OpenAI. (2026). Pricing [API documentation]. OpenAI Developers. Retrieved June 2, 2026, from https://developers.openai.com/api/docs/pricing

Ro, Y., Qiu, H., Goiri, Í., Fonseca, R., Bianchini, R., Akella, A., Wang, Z., Erez, M., & Choukse, E. (2025). Sherlock: Reliable and efficient agentic workflow execution. arXiv. https://doi.org/10.48550/arXiv.2511.00330

Wei, J. (2025, July 15). Asymmetry of verification and verifier’s law. https://www.jasonwei.net/blog/asymmetry-of-verification-and-verifiers-law

Zhang, H., & Huang, J. (2025). Challenging GPU dominance: When CPUs outperform for on-device LLM inference. arXiv. https://doi.org/10.48550/arXiv.2505.06461

Zhang, T., Cao, M., Lam, A., Zhang, S., & Chen, K. (2025). CompassJudger-2: Towards generalist judge model via verifiable rewards. arXiv. https://doi.org/10.48550/arXiv.2507.09104

Zhou, Y., Xu, A., Zhou, Y., Singh, J., Gui, J., & Joty, S. (2025). Variation in verification: Understanding verification dynamics in large language models. arXiv. https://doi.org/10.48550/arXiv.2509.17995

Zhu, L., Wang, X., & Wang, X. (2025). JudgeLM: Fine-tuned large language models are scalable judges [Conference paper]. The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=xsELpEPn4A