AI model tracking gets noisy when every model is compared as if it were competing in one universal leaderboard. A more useful research map starts with categories.

Core Axes

Axis                What it captures                                             Why it matters
Modality            Text, code, image, audio, video, multimodal                  Different benchmarks measure different skills
Deployment surface  API, open-weight, local, edge, enterprise                    Determines cost, privacy, latency, and control
Capability niche    Reasoning, coding, search, agents, creative generation       Models specialize even when marketed broadly
Context and memory  Token window, retrieval, persistent state                    Affects research, coding, and long-document tasks
Tool use            Function calling, browser use, computer use, code execution  Determines agentic reliability
Evaluation posture  Public benchmark claims, third-party evals, private evals    Controls how much confidence to place in comparisons
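
One way to make the axes concrete is to record each model as a structured entry instead of a leaderboard row. The sketch below is illustrative only: the class name ModelProfile, its field names, the comparable() rule, and the example entries are all assumptions, not an established schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ModelProfile:
    """One tracked model described along the six axes (hypothetical schema)."""
    name: str
    modalities: set[str] = field(default_factory=set)           # e.g. {"text", "code"}
    deployment_surfaces: set[str] = field(default_factory=set)  # e.g. {"api", "open-weight"}
    capability_niches: set[str] = field(default_factory=set)    # e.g. {"coding", "agents"}
    context_tokens: Optional[int] = None                        # None when not published
    tool_use: set[str] = field(default_factory=set)             # e.g. {"function calling"}
    evaluation_posture: str = "unknown"                         # e.g. "third-party evals"


def comparable(a: ModelProfile, b: ModelProfile) -> bool:
    """Assumed rule: two models are worth comparing only if they share
    at least one modality, one deployment surface, and one capability niche."""
    return bool(
        a.modalities & b.modalities
        and a.deployment_surfaces & b.deployment_surfaces
        and a.capability_niches & b.capability_niches
    )


# Placeholder entries, not real models.
chat_model = ModelProfile("example-chat", {"text"}, {"api"}, {"coding"})
speech_model = ModelProfile("example-speech", {"audio"}, {"local"}, {"transcription"})
print(comparable(chat_model, speech_model))
```

The overlap rule is deliberately strict. A looser rule, such as matching on capability niche alone, may suit a map that compares API and open-weight models side by side.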

Model Categories

Frontier General Models

These are broad models positioned for general chat, reasoning, coding, and tool use. They usually define the top of the commercial API market, but their published benchmarks are often selective, so scores need to be normalized for benchmark version, model variant, and evaluation setup before they are compared.

Coding Models

Coding models should be tracked separately from general chat models. Useful evaluation questions include patch correctness, repository navigation, test repair, long-context code understanding, and the ability to follow a codebase's existing style.

Reasoning Models

Reasoning models emphasize deliberate problem solving. Benchmark results can be impressive, but latency, cost, verbosity, and tool-use behavior are part of the actual product tradeoff.
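
The tradeoff can be made concrete by looking at cost per correct answer instead of accuracy alone. This is a minimal sketch under stated assumptions: the function name is hypothetical, and every number in the example is an invented placeholder, not a measurement of any model.

```python
def cost_per_correct(accuracy: float, output_tokens: float, price_per_million: float) -> float:
    """Expected output-token spend per correctly solved task.

    accuracy: fraction of tasks solved, in (0, 1]
    output_tokens: average output tokens per attempt
    price_per_million: price per one million output tokens
    """
    if not 0 < accuracy <= 1:
        raise ValueError("accuracy must be in (0, 1]")
    cost_per_attempt = output_tokens * price_per_million / 1_000_000
    return cost_per_attempt / accuracy


# Invented placeholder numbers: a terse model versus a verbose, more accurate one.
terse = cost_per_correct(accuracy=0.6, output_tokens=500, price_per_million=10.0)
verbose = cost_per_correct(accuracy=0.8, output_tokens=8000, price_per_million=10.0)
print(f"terse: {terse:.4f}  verbose: {verbose:.4f}")
```

The sketch ignores input tokens, latency, and tool calls. It is only meant to show that a higher score can coexist with a much higher cost per solved task.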

Open-Weight Models

Open-weight models compete on controllability, deployability, and economics as much as absolute benchmark score. Their value changes significantly with quantization, inference stack, hardware, context length, and fine-tuning ecosystem.
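
As a rough illustration of why quantization changes the picture, the memory needed for the weights alone scales with parameter count times bits per weight. The sketch below assumes a hypothetical 7-billion-parameter model. It ignores the KV cache, activations, quantization metadata, and runtime overhead, so real requirements are higher.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in gigabytes (10**9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Hypothetical 7e9-parameter model at three common weight precisions.
for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(7e9, bits):.1f} GB")
```

The same model can therefore land on different hardware tiers depending on precision, which is one reason a single benchmark score says little about deployability.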

Embedding and Reranking Models

Embedding and reranking models are infrastructure components. They should be evaluated by retrieval quality, domain transfer, multilingual behavior, latency, and cost rather than chat-style benchmarks.
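
Retrieval quality is usually summarized with ranking metrics. The sketch below shows two common choices, recall at k and reciprocal rank. They are offered as examples, not as the only appropriate metrics, and the function names and document IDs are hypothetical.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    found = set(ranked_ids[:k]) & relevant_ids
    return len(found) / len(relevant_ids)


def reciprocal_rank(ranked_ids: list[str], relevant_ids: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none is retrieved."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


# Placeholder query result: d1 and d2 are the relevant documents.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranked, relevant, k=3), reciprocal_rank(ranked, relevant))
```

Averaging these per-query values over a query set drawn from the target domain and languages gives numbers that relate more directly to domain transfer and multilingual behavior than a chat-style score does.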

Audio, Image, and Video Models

Generation and transcription models need task-specific evaluation: instruction following, temporal consistency, artifacts, editability, latency, and rights/safety constraints.

Timeline Discipline

Release timelines should track at least four dates when possible:

  • announcement date
  • API availability date
  • general availability date
  • model replacement or deprecation date

Those dates are often different. Keeping them separate avoids false comparisons across models that were announced before they were broadly usable.
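
The four dates can be kept as separate optional fields so that a missing date stays missing and is not silently replaced by the announcement date. This is a minimal sketch: the class name, field names, helper methods, and the example dates are hypothetical placeholders.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ReleaseTimeline:
    """Release dates for one model; None means the date is unknown."""
    model: str
    announced: Optional[date] = None
    api_available: Optional[date] = None
    generally_available: Optional[date] = None
    deprecated: Optional[date] = None

    def first_usable(self) -> Optional[date]:
        """Earliest known date on which the model could actually be used."""
        known = [d for d in (self.api_available, self.generally_available) if d is not None]
        return min(known) if known else None

    def announcement_lead_days(self) -> Optional[int]:
        """Days between announcement and first usable date, if both are known."""
        usable = self.first_usable()
        if self.announced is None or usable is None:
            return None
        return (usable - self.announced).days


# Placeholder dates, not a real release.
entry = ReleaseTimeline("example-model", announced=date(2024, 1, 1), api_available=date(2024, 1, 31))
print(entry.first_usable(), entry.announcement_lead_days())
```

Comparing models on first_usable() instead of the announcement date is one way to avoid the false comparisons described above.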

Benchmark Discipline

Benchmark notes should record:

  • benchmark name and version
  • reported score
  • evaluation source
  • model variant and date
  • whether tools, chain-of-thought, sampling, or retries were used
  • caveats from the source
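
The checklist above can be sketched as a record type with a helper that flags setup differences between two reported scores. All names here are hypothetical, and the choice of which fields must match before scores are compared is an assumption, not a standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class BenchmarkNote:
    """One reported benchmark result; None means the source did not say."""
    benchmark: str
    benchmark_version: str
    score: float
    source: str                      # who ran or reported the evaluation
    model_variant: str
    date: str                        # date attached to the result, as reported
    used_tools: Optional[bool] = None
    used_chain_of_thought: Optional[bool] = None
    sampling: Optional[str] = None   # e.g. "greedy", "temperature=0.7"
    retries: Optional[int] = None
    caveats: tuple[str, ...] = ()


# Assumed set of fields that must match for a like-for-like comparison.
SETUP_FIELDS = (
    "benchmark", "benchmark_version", "used_tools",
    "used_chain_of_thought", "sampling", "retries",
)


def setup_differences(a: BenchmarkNote, b: BenchmarkNote) -> list[str]:
    """Names of setup fields on which two notes differ (empty if none)."""
    return [name for name in SETUP_FIELDS if getattr(a, name) != getattr(b, name)]


# Placeholder notes with invented scores.
vendor = BenchmarkNote("example-bench", "v1", 71.0, "vendor report", "model-x", "2024-01-01",
                       used_tools=True, retries=3)
third_party = BenchmarkNote("example-bench", "v1", 64.0, "third-party eval", "model-x", "2024-02-01",
                            used_tools=False, retries=0)
print(setup_differences(vendor, third_party))
```

A non-empty result does not make either score wrong. It only marks the comparison as not like-for-like.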

The goal is not to worship benchmarks. The goal is to understand what each score can and cannot support.