model card · ai research · engineering

Kedar-22B

Kedar Karbele · ai research & engineering

checkpoint
current
license
open-to-work
location
mumbai → san francisco
released
2025
version
22.4
§ i · tl;dr

headline benchmark

91.2

WebVoyager · +4.2 vs openai operator

Three-agent Playwright loop with multimodal input, PDF parsing, video interaction, and credential handling through HashiCorp Vault — one-shot where others multi-shot.

tau-bench: SOTA · tool-calling (UCB): SOTA · reasoning cost: 10× reduction vs base
§ ii · description

A research engineer who disassembles things to individual bits.

Engineers next-generation AI infrastructure that bridges cutting-edge research with production reality. Specialised in problems that require rethinking fundamental architectural assumptions — from RoPE scaling and custom CUDA kernels at the low level to multi-agent orchestration and reasoning-model fine-tuning at the systems level. Builds quickly and goes absurdly deep.

Two years of self-directed study before the first job. Three production reasoning models within two weeks of o1's release. A browser agent that beats Operator. Trading systems validated on real capital. Clinician software that passed HIPAA review on the first attempt.

The short version: build it, benchmark it, ship it. Then go deeper than the benchmark.

§ iii · capabilities

What this model can do.

Grouped by domain. Each row is a capability and the signature of how it's exercised in real work.

Agent Systems Architecture

20+ production systems shipped

Multi-agent orchestration
Hierarchies, planners, critics, and tool-using sub-agents running in production
Custom tool-use implementations
Beyond function-calling: constrained decoding, verified traces
Hierarchical agent memory
Neo4j knowledge graphs with semantic ingestion and retrieval
Agent evaluation frameworks
Plug-in agents, coordinated runs, reproducible benchmarks
Production RAG pipelines
Hybrid retrieval, rerankers, chunking strategies that survive contact with reality
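The hybrid-retrieval row above can be sketched with reciprocal rank fusion (RRF), one common way to merge a lexical ranking with a vector ranking. The doc IDs and the two input orderings below are purely illustrative, not from any real pipeline.

```python
# Minimal RRF sketch: fuse several ranked lists of doc IDs into one.
# Each list contributes 1/(k + rank) per document; k damps the head.

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists with reciprocal rank fusion: score = sum 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d3", "d1", "d7"]   # e.g. a BM25 ordering
vector  = ["d1", "d9", "d3"]   # e.g. an embedding-similarity ordering
print(rrf_merge([lexical, vector]))
```

A reranker would typically run on the fused head of this list; RRF's appeal is that it needs no score calibration between the two retrievers.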

Fundamental ML Research

Llama 3 8K → 1M context via RoPE scaling

RoPE scaling
Extended Llama 3 8B effective context from 8K to 1M tokens
Flash Attention / Sliding Window
Custom implementations for long-context inference
Speculative decoding
KV-cache optimisation, PagedAttention, continuous batching
Custom CUDA / PTX kernels
Hand-tuned GPU kernels where high-level frameworks fall short
Tensor & pipeline parallelism
Multi-GPU distributed training and high-throughput inference clusters
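The RoPE-scaling row above can be illustrated with the "base-frequency" family of tricks: raising the rotary base theta stretches the wavelengths so positions far beyond the training window still map to smooth, in-distribution rotations. This is a generic sketch of NTK-style base adjustment, not the specific 8K → 1M recipe from the CV, and the numbers are illustrative.

```python
# Sketch of base-frequency RoPE scaling (NTK-style base adjustment).
import math

def rope_inv_freq(dim: int, base: float = 10_000.0) -> list[float]:
    """Inverse frequencies per even channel pair, as in standard RoPE."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def scaled_base(base: float, scale: float, dim: int) -> float:
    """NTK-style adjustment: theta' = theta * scale^(dim / (dim - 2))."""
    return base * scale ** (dim / (dim - 2))

dim = 128
orig = rope_inv_freq(dim)
longctx = rope_inv_freq(dim, base=scaled_base(10_000.0, scale=128.0, dim=dim))
# The lowest frequency shrinks, so the slowest rotation now spans a longer window.
print(longctx[-1] < orig[-1])
```

In practice this is combined with a short fine-tune so the model adapts to the rescaled rotations.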

Training & Fine-tuning

SOTA reasoning models in < 2 weeks

Reinforcement learning
GRPO / PPO / DPO / ORPO implementations, RLHF pipelines
Mixture-of-experts
MoE model fine-tuning end to end
Reasoning enhancement
Q*-style search and chain-of-thought distillation for non-reasoning models
Parameter-efficient tuning
SFT, LoRA, QLoRA, RAFT (retrieval-augmented fine-tuning)
Multimodal & speech
CLIP adaptations, Whisper fine-tuning for Indic languages
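Of the preference-optimisation methods listed above, DPO has the simplest closed form. Below is a toy version of its standard sigmoid loss on scalar sequence log-probs; the log-prob values are made up for illustration and nothing here reflects a real training run.

```python
# Toy DPO loss: -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))),
# where pi_* / ref_* are policy / reference log-probs of the chosen (w)
# and rejected (l) responses.
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen answer more than the reference does -> lower loss
# than the reversed pair.
print(dpo_loss(-1.0, -3.0, -2.0, -2.5) < dpo_loss(-3.0, -1.0, -2.0, -2.5))
```

With zero margin the loss is exactly log 2; beta controls how sharply the implicit reward separates the pair.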

Dataset Engineering

Alignment corpora from scratch

Reasoning dataset curation
Custom CoT and reasoning corpora tuned to target benchmarks
DPO preference pipelines
On-the-fly preference-pair generation from user interactions
Synthetic data for alignment
Domain-specific synthetic generation for enterprise clients
Quality filtering at scale
Dedup, classifier-based filtering, multimodal dataset prep
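The dedup row above, at its simplest, is normalise-then-hash exact deduplication; near-duplicate methods (MinHash, embedding similarity) layer on top of this. A minimal sketch, with a made-up three-document corpus:

```python
# Exact dedup: lowercase, collapse whitespace, hash, keep first occurrence.
import hashlib
import re

def normalise(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def dedup(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalise(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = ["Hello  world", "hello world", "Different doc"]
print(dedup(corpus))  # keeps the first "Hello  world" and "Different doc"
```

At corpus scale the `seen` set becomes a sharded store, but the keep-first-occurrence logic is the same.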

Infrastructure

Quantised inference across heterogeneous GPUs

Multi-GPU distributed training
AWS, GCP, bare-metal (Shadeform, Vultr, RunPod)
Kubernetes GPU provisioning
Dynamic scheduling with smart swap to minimise cost
Quantisation pipelines
FP16, FP8, INT8, AWQ — production inference at speed
CI/CD for ML
GitHub Actions, zero-downtime model updates, automated eval gates
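The quantisation row above reduces, at its core, to a scale-and-round round trip. Here is a toy symmetric absmax INT8 version, stripped of the per-channel scales and calibration that a production pipeline (or AWQ) would add; the weight values are illustrative.

```python
# Toy symmetric absmax INT8 quantise/dequantise round trip.

def quantise_int8(xs: list[float]) -> tuple[list[int], float]:
    """Map floats to int8 with a single absmax scale."""
    scale = max(abs(x) for x in xs) / 127.0 or 1.0  # avoid zero scale
    q = [max(-128, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantise(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

weights = [0.02, -0.5, 0.31, 0.0]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
err = max(abs(a - b) for a, b in zip(weights, restored))
print(err <= scale / 2 + 1e-9)  # round-trip error bounded by half a step
```

FP8/AWQ variants change the grid and the scale selection, not this basic structure.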
§ iv · training data

Experience, framed as training stages.

Pre-training was unsupervised and self-directed. Fine-tuning sharpened on applied domains. RLHF is what we call the current role.

  1. phase 01 · Pre-training · 2018 – 2023 · Mumbai

    Self-directed

    Autodidact

    Unsupervised corpus: disassembled CUDA, Rust macros, LLM internals, RoPE, and distributed systems from first principles. No course followed end-to-end. Read every paper worth reading.

    • RoPE scaling from the original paper
    • CUDA kernels written, not copied
    • Transformers disassembled to individual ops
  2. phase 02 · Fine-tune — clinical · Dec 2023 – May 2024 · Mumbai

    Numa Health Inc. (Qualcomm Ventures)

    Co-founder & CTO

    Built a HIPAA-compliant clinician-side recorder and SOAP-note generator. Fine-tuned Mixtral 8×7B on 50K+ medical transcripts; optimised Whisper with Flash Attention and custom chunking for noisy clinic audio. Passed security review on first attempt.

    • Cut clinician documentation from 2+ hours/day to ~20 minutes
    • 1.5× Whisper speedup on medical jargon
    • End-to-end HIPAA-compliant infra on AWS
  3. phase 03 · Fine-tune — enterprise · Jul 2024 – Sep 2024 · Sheridan, Wyoming (remote)

    ScaleGenAI

    AI Engineer

Data extraction for massive data lakes using fine-tuned InternLM vision models. Trained Llama 2 (SFT/LoRA/QLoRA) and Mixtral 8×7B for unstructured → structured conversion. Built a text-to-SQL bot that became a main product. Contributed to a K8s GPU scheduler across bare-metal providers.

    • 89% accuracy on unstructured → structured on custom benchmarks
    • 40% inference-latency cut via KV caching + speculative decoding
    • Zero-touch fine-tuning pipeline (HF URL + dataset → trained model)
  4. phase 04 · RLHF · Sep 2024 – present · San Francisco (remote from Mumbai)

    The Agentic AI

    AI Engineer

    Production reasoning models, agentic browser, agentic coder, open-source multi-agent frameworks. The work listed under featured projects below.

    • Three production reasoning models (Turbo / Medium / Large)
    • WebVoyager 91.2% (vs Operator 87%)
    • Tau-bench SOTA, UCB tool-calling SOTA
    • 10× cost reduction on reasoning inference
§ v · benchmarks

Evaluated the way a model should be.

All numbers below are from the CV. Nothing here is rounded in Kedar's favour.

webvoyager · agentic browser vs openai operator

Agentic Browser · 91.2%

Kedar · 2025

OpenAI Operator · 87%

Baseline

A three-agent Playwright loop with multimodal input, PDF parsing, video interaction, and credential handling through HashiCorp Vault. Gap of +4.2 pts on a benchmark where every percentage point represents a real end-to-end web task completed reliably.

Benchmark · Project · Score
WebVoyager · agentic-browser · 91.2% (+4.2)
Tau-bench · agentic-turbo · SOTA
UCB Tool-Calling · agentic-turbo · SOTA
Reasoning Cost · agentic-turbo · 10× reduction
LDRM Document Analysis · ldrm · 95%+
Trading Agent (BTC) · trading-agents · 75% win rate
Trading Agent (US Equities) · trading-agents · 90% win rate
Financial Doc Analysis · financial-docs · 92%
§ vi · featured projects

Selected checkpoints.

Three open-source projects from The Agentic AI. The rest of the ~20 shipped systems are internal — ask the REPL below for details.

hero checkpoint · v1.0
source

Agentic Browser

Beats OpenAI Operator on WebVoyager

Three-agent orchestration loop on Playwright that handles multimodal inputs, downloads and parses PDFs, interacts with embedded video, and manages credentials securely through HashiCorp Vault. One-shot task completion where prior agents required multiple attempts.

  • Three-agent loop: planner, actor, verifier
  • Multimodal inputs — text, screenshots, embedded video
  • HashiCorp Vault credential handling for real-world tasks
  • Structured recovery on tool failure; no silent fallbacks
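The planner/actor/verifier loop described above can be sketched as a retry-with-verification skeleton. This is a hypothetical illustration, not the real Agentic Browser: the three agents are stubbed with callables, and the step/outcome strings are invented.

```python
# Hypothetical planner -> actor -> verifier loop with bounded retries
# and no silent fallbacks (a failed step raises instead of being skipped).
from typing import Callable

def run_task(task: str,
             plan: Callable[[str], list[str]],
             act: Callable[[str], str],
             verify: Callable[[str, str], bool],
             max_retries: int = 2) -> list[str]:
    """Plan steps, act on each, and retry a step when the verifier rejects it."""
    results: list[str] = []
    for step in plan(task):
        for _attempt in range(max_retries + 1):
            outcome = act(step)
            if verify(step, outcome):
                results.append(outcome)
                break
        else:  # retries exhausted: surface the failure loudly
            raise RuntimeError(f"step failed after retries: {step}")
    return results

# Stub agents, purely for illustration:
plan = lambda task: [f"{task}:1", f"{task}:2"]
act = lambda step: f"done({step})"
verify = lambda step, outcome: outcome == f"done({step})"
print(run_task("demo", plan, act, verify))
```

In a real system each callable would wrap an LLM call plus Playwright actions; the structural point is that verification gates every step.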

headline metric

91.2%

WebVoyager (vs Operator 87%)

stack

  • Playwright
  • Python
  • HashiCorp Vault
  • Custom tool schema
open-source · src

CortexON

Multi-agent orchestration that actually works in production

Built as an open-source alternative after Manus.im went viral — focused on making multi-agent orchestration actually work in production rather than just demos. Planners, actors, critics, and tool-using sub-agents composed into reliable end-to-end loops.

OSS · Multi-agent orchestrator
  • Python
  • PydanticAI
  • Custom runtime
open-source · src

Agentic Bench

Plug in custom agents. They actually coordinate.

A framework for composing custom research, browser, code, and file agents into coordinated runs. Built because most existing multi-agent frameworks make trivial coordination unnecessarily complex.

OSS · Multi-agent eval harness
  • Python
  • PydanticAI
  • Custom runtime
§ vii · limitations

Honest technical scope.

Every model card has a limitations section. Skipping it would be dishonest.

  1. 01

    Best suited for: frontier-ish research problems that need rethinking primitives — tokenisation, attention, reasoning objectives, long-horizon tool use.

  2. 02

    Not currently optimised for: long-term maintenance of mature codebases, meeting-heavy environments, or work that punishes experimentation.

  3. 03

    Known biases: gravitates toward problems that touch GPUs and compilers; under-weights problems that could be solved with a SQL query and a good afternoon.

  4. 04

    Scaling behaviour: output quality increases sharply with cognitive load, uninterrupted time, and being told 'this is impossible'. Decreases with unclear objectives or political overhead.

  5. 05

    Context window: effectively unlimited when given real problems; drops to zero when asked to pretend to care about things that don't matter.

§ viii · how to cite

BibTeX, or a DM.

Your call. Both work.

bibtex
@misc{karbele2025,
  author  = {Karbele, Kedar},
  title   = {Kedar-22B: AI Research and Engineering},
  year    = {2025},
  url     = {https://kedar.sh},
  version = {22.4},
  note    = {Model card for an AI research engineer}
}

apa

Karbele, K. (2025). Kedar-22B: AI research and engineering (Version 22.4) [Model card]. https://kedar.sh

§ ix · try this model

Ask anything. The model answers.

Claude Sonnet 4.6 via OpenRouter, with tool access to a virtual filesystem of this site. Tool calls shown inline — the flex is the strace.

REPLclaude-sonnet-4.6