AI model tracking gets noisy when every model is compared as if it were competing in one universal leaderboard. A more useful research map starts with categories.
Core Axes
| Axis | What it captures | Why it matters |
|---|---|---|
| Modality | Text, code, image, audio, video, multimodal | Different benchmarks measure different skills |
| Deployment surface | API, open-weight, local, edge, enterprise | Determines cost, privacy, latency, and control |
| Capability niche | Reasoning, coding, search, agents, creative generation | Models specialize even when marketed broadly |
| Context and memory | Token window, retrieval, persistent state | Affects research, coding, and long-document tasks |
| Tool use | Function calling, browser use, computer use, code execution | Shapes how reliably the model completes multi-step agentic tasks |
| Evaluation posture | Public benchmark claims, third-party evals, private evals | Determines how much confidence a comparison deserves |
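One way to make the axes concrete is to store them as a small record per model. The sketch below is only an illustration: `ModelProfile`, `EvalPosture`, `roughly_comparable`, and the example model names are hypothetical, and the rule "shares a modality and a capability niche" is an assumed heuristic, not a standard.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class EvalPosture(Enum):
    """Where the evidence for a model's scores comes from."""
    PUBLIC_CLAIM = "public benchmark claim"
    THIRD_PARTY = "third-party eval"
    PRIVATE = "private eval"


@dataclass(frozen=True)
class ModelProfile:
    """One row of the research map, with one field per axis (hypothetical schema)."""
    name: str
    modalities: FrozenSet[str]
    deployment_surfaces: FrozenSet[str]
    capability_niches: FrozenSet[str]
    context_tokens: Optional[int]  # None when the window is unknown or not applicable
    tool_use: FrozenSet[str]
    eval_posture: EvalPosture


def roughly_comparable(a: ModelProfile, b: ModelProfile) -> bool:
    """Assumed heuristic: compare only models that share a modality and a niche."""
    return bool(a.modalities & b.modalities) and bool(
        a.capability_niches & b.capability_niches
    )


# Hypothetical entries, not real models.
chat_model = ModelProfile(
    name="example-chat",
    modalities=frozenset({"text", "code"}),
    deployment_surfaces=frozenset({"API"}),
    capability_niches=frozenset({"reasoning", "coding"}),
    context_tokens=128_000,
    tool_use=frozenset({"function calling"}),
    eval_posture=EvalPosture.PUBLIC_CLAIM,
)
code_model = ModelProfile(
    name="example-coder",
    modalities=frozenset({"code"}),
    deployment_surfaces=frozenset({"open-weight", "local"}),
    capability_niches=frozenset({"coding"}),
    context_tokens=32_000,
    tool_use=frozenset(),
    eval_posture=EvalPosture.THIRD_PARTY,
)
image_model = ModelProfile(
    name="example-image",
    modalities=frozenset({"image"}),
    deployment_surfaces=frozenset({"API"}),
    capability_niches=frozenset({"creative generation"}),
    context_tokens=None,
    tool_use=frozenset(),
    eval_posture=EvalPosture.PRIVATE,
)

print(roughly_comparable(chat_model, code_model))
print(roughly_comparable(chat_model, image_model))
```

Using sets for most axes reflects that a single model can sit in several buckets at once, such as both API and open-weight.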
Model Categories
Frontier General Models
These are broad models positioned for general chat, reasoning, coding, and tool use. They usually define the top of the commercial API market, but their published benchmarks are often selective, so scores need to be normalized for evaluation settings such as tool use, chain-of-thought, sampling, and retries before they are compared.
Coding Models
Coding models should be tracked separately from general chat models. Useful evaluation questions include patch correctness, repository navigation, test repair, long-context code understanding, and ability to work with existing style.
Reasoning Models
Reasoning models emphasize deliberate problem solving. Benchmark results can be impressive, but latency, cost, verbosity, and tool-use behavior are part of the actual product tradeoff.
Open-Weight Models
Open-weight models compete on controllability, deployability, and economics as much as absolute benchmark score. Their value changes significantly with quantization, inference stack, hardware, context length, and fine-tuning ecosystem.
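As a rough illustration of why quantization changes the picture, weight memory scales linearly with bits per weight. The sketch below assumes a hypothetical 7-billion-parameter model and counts weights only; it ignores the KV cache, activations, and the small overhead that quantization formats add for scales and metadata. `weight_memory_gb` is an illustrative helper, not part of any library.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    bytes_total = n_params * bits_per_weight / 8
    return bytes_total / 1e9


# Hypothetical 7e9-parameter model at three common precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
```

Under these assumptions the same model needs about 14 GB at 16-bit, 7 GB at 8-bit, and 3.5 GB at 4-bit, which is often the difference between fitting on a given device or not.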
Embedding and Reranking Models
Embedding and reranking models are infrastructure components. They should be evaluated by retrieval quality, domain transfer, multilingual behavior, latency, and cost rather than chat-style benchmarks.
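Retrieval quality is typically summarized with ranking metrics instead of chat scores. The sketch below shows two common ones, recall@k and reciprocal rank, for a single query. The document IDs are made up, and the functions assume each ranked list contains no duplicates.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Hypothetical query: two relevant documents, before and after reranking.
relevant = {"d1", "d2"}
retrieved = ["d3", "d1", "d7", "d2"]
reranked = ["d1", "d2", "d3", "d7"]

print(recall_at_k(retrieved, relevant, k=3), reciprocal_rank(retrieved, relevant))
print(recall_at_k(reranked, relevant, k=3), reciprocal_rank(reranked, relevant))
```

In this toy case the reranker lifts recall@3 from 0.5 to 1.0 and reciprocal rank from 0.5 to 1.0. Averaging reciprocal rank over many queries gives MRR.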
Audio, Image, and Video Models
Generation and transcription models need task-specific evaluation: instruction following, temporal consistency, artifacts, editability, latency, and rights/safety constraints.
Timeline Discipline
Release timelines should track at least four dates when possible:
- announcement date
- API availability date
- general availability date
- model replacement or deprecation date
These dates often differ. Keeping them separate avoids comparing a model that has only been announced against one that is already broadly usable.
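The four dates can be kept as separate optional fields so that a missing date stays visibly missing. This is a minimal sketch: `ReleaseTimeline`, its field names, and the example dates are hypothetical, and treating the earlier of API availability and general availability as "first usable" is an assumption.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ReleaseTimeline:
    """Release dates for one model; None means the date is unknown."""
    model: str
    announced: Optional[date] = None
    api_available: Optional[date] = None
    generally_available: Optional[date] = None
    deprecated: Optional[date] = None

    def first_usable(self) -> Optional[date]:
        """Earliest known date on which the model could actually be used."""
        known = [
            d for d in (self.api_available, self.generally_available) if d is not None
        ]
        return min(known) if known else None

    def announcement_lag_days(self) -> Optional[int]:
        """Days between announcement and first usable date, if both are known."""
        usable = self.first_usable()
        if self.announced is None or usable is None:
            return None
        return (usable - self.announced).days


# Hypothetical dates for illustration only.
shipped = ReleaseTimeline(
    model="example-model",
    announced=date(2024, 1, 1),
    api_available=date(2024, 1, 31),
)
announced_only = ReleaseTimeline(model="example-preview", announced=date(2024, 2, 1))

print(shipped.announcement_lag_days())
print(announced_only.first_usable())
```

A model whose `first_usable()` is `None` has been announced but, as far as the record shows, not shipped, which is the case the separate dates are meant to expose.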
Benchmark Discipline
Benchmark notes should record:
- benchmark name and version
- reported score
- evaluation source
- model variant and date
- whether tools, chain-of-thought, sampling, or retries were used
- caveats from the source
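The fields above map naturally onto one record per reported score. The sketch below is an assumption-laden illustration: `BenchmarkNote`, `setup_differences`, and all example values are hypothetical. It models "sampling" as a sample count and interprets "date" as the date the result was reported, since the list does not pin either down.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class BenchmarkNote:
    """One reported benchmark score plus the context needed to interpret it."""
    benchmark: str
    version: str
    score: float
    source: str
    model_variant: str
    result_date: date  # assumed meaning of "date": when the result was reported
    used_tools: bool = False
    used_chain_of_thought: bool = False
    samples: int = 1  # assumed encoding of "sampling": samples drawn per item
    retries: int = 0
    caveats: Tuple[str, ...] = ()


def setup_differences(a: BenchmarkNote, b: BenchmarkNote) -> List[str]:
    """List the setup fields on which two notes disagree."""
    fields_to_match = (
        "benchmark",
        "version",
        "used_tools",
        "used_chain_of_thought",
        "samples",
        "retries",
    )
    return [name for name in fields_to_match if getattr(a, name) != getattr(b, name)]


# Hypothetical notes: same benchmark, different evaluation setups.
vendor_note = BenchmarkNote(
    benchmark="example-bench",
    version="v2",
    score=71.3,
    source="vendor blog post",
    model_variant="example-model-large",
    result_date=date(2024, 5, 1),
    used_tools=True,
    samples=8,
    caveats=("best of 8 samples",),
)
independent_note = BenchmarkNote(
    benchmark="example-bench",
    version="v2",
    score=64.0,
    source="independent evaluation",
    model_variant="example-model-large",
    result_date=date(2024, 6, 1),
)

print(setup_differences(vendor_note, independent_note))
```

An empty list suggests the two scores were produced under matching setups; any listed field is a reason to treat the gap between scores with caution.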
The goal is not to worship benchmarks. The goal is to understand what each score can and cannot support.