model card · ai research · engineering

Kedar-22B

Kedar Karbele · ai research & engineering

checkpoint
current
license
open-to-work
location
mumbai → san francisco
released
2025
version
22.4
§ i · tl;dr

headline benchmark

91.2

WebVoyager · +4.2 vs openai operator

Three-agent Playwright loop with multimodal input, PDF parsing, video interaction, and credential handling through HashiCorp Vault — one-shot where others multi-shot.

tau-bench: SOTA · tool-calling (UCB): SOTA · reasoning cost: 10× reduction vs base
§ ii · description

A research engineer who disassembles things to individual bits.

Engineers next-generation AI infrastructure that bridges cutting-edge research with production reality. Specialised in problems that require rethinking fundamental architectural assumptions — from RoPE scaling and custom CUDA kernels at the low level to multi-agent orchestration and reasoning-model fine-tuning at the systems level. Builds quickly and goes absurdly deep.

Two years of self-directed study before the first job. Three production reasoning models within two weeks of o1's release. A browser agent that beats Operator. Trading systems validated on real capital. Clinician software that passed HIPAA review on the first attempt.

The short version: build it, benchmark it, ship it. Then go deeper than the benchmark.

§ iii · capabilities

What this model can do.

Grouped by domain. Each row is a capability and the signature of how it's exercised in real work.

Agent Systems Architecture

20+ production systems shipped

Multi-agent orchestration
Hierarchies, planners, critics, and tool-using sub-agents running in production
Custom tool-use implementations
Beyond function-calling: constrained decoding, verified traces
Hierarchical agent memory
Neo4j knowledge graphs with semantic ingestion and retrieval
Agent evaluation frameworks
Plug-in agents, coordinated runs, reproducible benchmarks
Production RAG pipelines
Hybrid retrieval, rerankers, chunking strategies that survive contact with reality
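The hybrid-retrieval row above can be sketched with reciprocal rank fusion (RRF), one common way to merge a lexical ranking with a vector ranking. The doc IDs and the two input orderings below are purely illustrative, not from any real pipeline.

```python
# Minimal RRF sketch: fuse several ranked lists of doc IDs into one.
# Each list contributes 1/(k + rank) per document; k damps the head.

def rrf_merge(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists with reciprocal rank fusion: score = sum 1/(k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

lexical = ["d3", "d1", "d7"]   # e.g. a BM25 ordering
vector  = ["d1", "d9", "d3"]   # e.g. an embedding-similarity ordering
print(rrf_merge([lexical, vector]))
```

A reranker would typically run on the fused head of this list; RRF's appeal is that it needs no score calibration between the two retrievers.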

Fundamental ML Research

Llama 3 8K → 1M context via RoPE scaling

RoPE scaling
Extended Llama 3 8B effective context from 8K to 1M tokens
Flash Attention / Sliding Window
Custom implementations for long-context inference
Speculative decoding
KV-cache optimisation, PagedAttention, continuous batching
Custom CUDA / PTX kernels
Hand-tuned GPU kernels where high-level frameworks fall short
Tensor & pipeline parallelism
Multi-GPU distributed training and high-throughput inference clusters
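The RoPE-scaling row above can be illustrated with the "base-frequency" family of tricks: raising the rotary base theta stretches the wavelengths so positions far beyond the training window still map to smooth, in-distribution rotations. This is a generic sketch of NTK-style base adjustment, not the specific 8K → 1M recipe from the CV, and the numbers are illustrative.

```python
# Sketch of base-frequency RoPE scaling (NTK-style base adjustment).
import math

def rope_inv_freq(dim: int, base: float = 10_000.0) -> list[float]:
    """Inverse frequencies per even channel pair, as in standard RoPE."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def scaled_base(base: float, scale: float, dim: int) -> float:
    """NTK-style adjustment: theta' = theta * scale^(dim / (dim - 2))."""
    return base * scale ** (dim / (dim - 2))

dim = 128
orig = rope_inv_freq(dim)
longctx = rope_inv_freq(dim, base=scaled_base(10_000.0, scale=128.0, dim=dim))
# The lowest frequency shrinks, so the slowest rotation now spans a longer window.
print(longctx[-1] < orig[-1])
```

In practice this is combined with a short fine-tune so the model adapts to the rescaled rotations.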

Training & Fine-tuning

SOTA reasoning models in < 2 weeks

Reinforcement learning
GRPO / PPO / DPO / ORPO implementations, RLHF pipelines
Mixture-of-experts
MoE model fine-tuning end to end
Reasoning enhancement
Q*-style search and chain-of-thought distillation for non-reasoning models
Parameter-efficient tuning
SFT, LoRA, QLoRA, RAFT (retrieval-augmented fine-tuning)
Multimodal & speech
CLIP adaptations, Whisper fine-tuning for Indic languages
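Of the preference-optimisation methods listed above, DPO has the simplest closed form. Below is a toy version of its standard sigmoid loss on scalar sequence log-probs; the log-prob values are made up for illustration and nothing here reflects a real training run.

```python
# Toy DPO loss: -log sigmoid(beta * ((pi_w - ref_w) - (pi_l - ref_l))),
# where pi_* / ref_* are policy / reference log-probs of the chosen (w)
# and rejected (l) responses.
import math

def dpo_loss(pi_chosen: float, pi_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen answer more than the reference does -> lower loss
# than the reversed pair.
print(dpo_loss(-1.0, -3.0, -2.0, -2.5) < dpo_loss(-3.0, -1.0, -2.0, -2.5))
```

With zero margin the loss is exactly log 2; beta controls how sharply the implicit reward separates the pair.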

Dataset Engineering

Alignment corpora from scratch

Reasoning dataset curation
Custom CoT and reasoning corpora tuned to target benchmarks
DPO preference pipelines
On-the-fly preference-pair generation from user interactions
Synthetic data for alignment
Domain-specific synthetic generation for enterprise clients
Quality filtering at scale
Dedup, classifier-based filtering, multimodal dataset prep
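The dedup row above, at its simplest, is normalise-then-hash exact deduplication; near-duplicate methods (MinHash, embedding similarity) layer on top of this. A minimal sketch, with a made-up three-document corpus:

```python
# Exact dedup: lowercase, collapse whitespace, hash, keep first occurrence.
import hashlib
import re

def normalise(text: str) -> str:
    return re.sub(r"\s+", " ", text.strip().lower())

def dedup(docs: list[str]) -> list[str]:
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        digest = hashlib.sha256(normalise(doc).encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = ["Hello  world", "hello world", "Different doc"]
print(dedup(corpus))  # keeps the first "Hello  world" and "Different doc"
```

At corpus scale the `seen` set becomes a sharded store, but the keep-first-occurrence logic is the same.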

Infrastructure

Quantised inference across heterogeneous GPUs

Multi-GPU distributed training
AWS, GCP, bare-metal (Shadeform, Vultr, RunPod)
Kubernetes GPU provisioning
Dynamic scheduling with smart swap to minimise cost
Quantisation pipelines
FP16, FP8, INT8, AWQ — production inference at speed
CI/CD for ML
GitHub Actions, zero-downtime model updates, automated eval gates
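The quantisation row above reduces, at its core, to a scale-and-round round trip. Here is a toy symmetric absmax INT8 version, stripped of the per-channel scales and calibration that a production pipeline (or AWQ) would add; the weight values are illustrative.

```python
# Toy symmetric absmax INT8 quantise/dequantise round trip.

def quantise_int8(xs: list[float]) -> tuple[list[int], float]:
    """Map floats to int8 with a single absmax scale."""
    scale = max(abs(x) for x in xs) / 127.0 or 1.0  # avoid zero scale
    q = [max(-128, min(127, round(x / scale))) for x in xs]
    return q, scale

def dequantise(q: list[int], scale: float) -> list[float]:
    return [v * scale for v in q]

weights = [0.02, -0.5, 0.31, 0.0]
q, scale = quantise_int8(weights)
restored = dequantise(q, scale)
err = max(abs(a - b) for a, b in zip(weights, restored))
print(err <= scale / 2 + 1e-9)  # round-trip error bounded by half a step
```

FP8/AWQ variants change the grid and the scale selection, not this basic structure.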
§ iv · training data

Experience, framed as training stages.

Pre-training was unsupervised and self-directed. Fine-tuning sharpened on applied domains. RLHF is what we call the current role.

  1. phase 01 · Pre-training · 2018 – 2023 · Mumbai

    Self-directed

    Autodidact

    Unsupervised corpus: disassembled CUDA, Rust macros, LLM internals, RoPE, and distributed systems from first principles. No course followed end-to-end. Read every paper worth reading.

    • RoPE scaling from the original paper
    • CUDA kernels written, not copied
    • Transformers disassembled to individual ops
  2. phase 02 · Fine-tune — clinical · Dec 2023 – May 2024 · Mumbai

    Numa Health Inc. (Qualcomm Ventures)

    Co-founder & CTO

    Built a HIPAA-compliant clinician-side recorder and SOAP-note generator. Fine-tuned Mixtral 8×7B on 50K+ medical transcripts; optimised Whisper with Flash Attention and custom chunking for noisy clinic audio. Passed security review on first attempt.

    • Cut clinician documentation from 2+ hours/day to ~20 minutes
    • 1.5× Whisper speedup on medical jargon
    • End-to-end HIPAA-compliant infra on AWS
  3. phase 03 · Fine-tune — enterprise · Jul 2024 – Sep 2024 · Sheridan, Wyoming (remote)

    ScaleGenAI

    AI Engineer

Data extraction for massive data lakes using fine-tuned InternLM vision models. Trained Llama 2 (SFT/LoRA/QLoRA) and Mixtral 8×7B for unstructured → structured conversion. Built a text-to-SQL bot that became a main product. Contributed to a K8s GPU scheduler across bare-metal providers.

    • 89% accuracy on unstructured → structured on custom benchmarks
    • 40% inference-latency cut via KV caching + speculative decoding
    • Zero-touch fine-tuning pipeline (HF URL + dataset → trained model)
  4. phase 04 · RLHF · Sep 2024 – present · San Francisco (remote from Mumbai)

    The Agentic AI

    AI Engineer

    Production reasoning models, agentic browser, agentic coder, open-source multi-agent frameworks. The work listed under featured projects below.

    • Three production reasoning models (Turbo / Medium / Large)
    • WebVoyager 91.2% (vs Operator 87%)
    • Tau-bench SOTA, UCB tool-calling SOTA
    • 10× cost reduction on reasoning inference
§ v · benchmarks

Evaluated the way a model should be.

All numbers below are from the CV. Nothing here is rounded in Kedar's favour.

webvoyager · agentic browser vs openai operator

Agentic Browser · 91.2%

Kedar · 2025

OpenAI Operator · 87%

Baseline

A three-agent Playwright loop with multimodal input, PDF parsing, video interaction, and credential handling through HashiCorp Vault. Gap of +4.2 pts on a benchmark where every percentage point represents a real end-to-end web task completed reliably.

Benchmark · Project · Score
WebVoyager · agentic-browser · 91.2% (+4.2)
Tau-bench · agentic-turbo · SOTA
UCB Tool-Calling · agentic-turbo · SOTA
Reasoning Cost · agentic-turbo · 10× reduction
LDRM Document Analysis · ldrm · 95%+
Trading Agent (BTC) · trading-agents · 75% win rate
Trading Agent (US Equities) · trading-agents · 90% win rate
Financial Doc Analysis · financial-docs · 92%
§ vi · featured projects

Selected checkpoints.

Three open-source projects from The Agentic AI. The rest of the ~20 shipped systems are internal — ask the REPL below for details.

hero checkpoint · v1.0
source

Agentic Browser

Beats OpenAI Operator on WebVoyager

Three-agent orchestration loop on Playwright that handles multimodal inputs, downloads and parses PDFs, interacts with embedded video, and manages credentials securely through HashiCorp Vault. One-shot task completion where prior agents required multiple attempts.

  • Three-agent loop: planner, actor, verifier
  • Multimodal inputs — text, screenshots, embedded video
  • HashiCorp Vault credential handling for real-world tasks
  • Structured recovery on tool failure; no silent fallbacks
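The planner/actor/verifier loop described above can be sketched as a retry-with-verification skeleton. This is a hypothetical illustration, not the real Agentic Browser: the three agents are stubbed with callables, and the step/outcome strings are invented.

```python
# Hypothetical planner -> actor -> verifier loop with bounded retries
# and no silent fallbacks (a failed step raises instead of being skipped).
from typing import Callable

def run_task(task: str,
             plan: Callable[[str], list[str]],
             act: Callable[[str], str],
             verify: Callable[[str, str], bool],
             max_retries: int = 2) -> list[str]:
    """Plan steps, act on each, and retry a step when the verifier rejects it."""
    results: list[str] = []
    for step in plan(task):
        for _attempt in range(max_retries + 1):
            outcome = act(step)
            if verify(step, outcome):
                results.append(outcome)
                break
        else:  # retries exhausted: surface the failure loudly
            raise RuntimeError(f"step failed after retries: {step}")
    return results

# Stub agents, purely for illustration:
plan = lambda task: [f"{task}:1", f"{task}:2"]
act = lambda step: f"done({step})"
verify = lambda step, outcome: outcome == f"done({step})"
print(run_task("demo", plan, act, verify))
```

In a real system each callable would wrap an LLM call plus Playwright actions; the structural point is that verification gates every step.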

headline metric

91.2%

WebVoyager (vs Operator 87%)

stack

  • Playwright
  • Python
  • HashiCorp Vault
  • Custom tool schema
open-source · src

CortexON

Multi-agent orchestration that actually works in production

Built as an open-source alternative after Manus.im went viral — focused on making multi-agent orchestration actually work in production rather than just demos. Planners, actors, critics, and tool-using sub-agents composed into reliable end-to-end loops.

OSS · Multi-agent orchestrator
  • Python
  • PydanticAI
  • Custom runtime
open-source · src

Agentic Bench

Plug in custom agents. They actually coordinate.

A framework for composing custom research, browser, code, and file agents into coordinated runs. Built because most existing multi-agent frameworks make trivial coordination unnecessarily complex.

OSS · Multi-agent eval harness
  • Python
  • PydanticAI
  • Custom runtime
§ vii · limitations

Honest technical scope.

Every model card has a limitations section. Skipping it would be dishonest.

  1. 01

    Best suited for: frontier-ish research problems that need rethinking primitives — tokenisation, attention, reasoning objectives, long-horizon tool use.

  2. 02

    Not currently optimised for: long-term maintenance of mature codebases, meeting-heavy environments, or work that punishes experimentation.

  3. 03

    Known biases: gravitates toward problems that touch GPUs and compilers; under-weights problems that could be solved with a SQL query and a good afternoon.

  4. 04

    Scaling behaviour: output quality increases sharply with cognitive load, uninterrupted time, and being told 'this is impossible'. Decreases with unclear objectives or political overhead.

  5. 05

    Context window: effectively unlimited when given real problems; drops to zero when asked to pretend to care about things that don't matter.

§ viii · how to cite

BibTeX, or a DM.

Your call. Both work.

bibtex
@misc{karbele2025,
  author  = {Karbele, Kedar},
  title   = {Kedar-22B: AI Research and Engineering},
  year    = {2025},
  url     = {https://kedar.sh},
  version = {22.4},
  note    = {Model card for an AI research engineer}
}

apa

Karbele, K. (2025). Kedar-22B: AI research and engineering (Version 22.4) [Model card]. https://kedar.sh

§ ix · try this model

Ask anything. The model answers.

Claude Sonnet 4.6 via OpenRouter, with tool access to a virtual filesystem of this site. Tool calls shown inline — the flex is the strace.

REPLclaude-sonnet-4.6