AI Model Landscape Taxonomy

AI model tracking gets noisy when every model is compared as if it were competing in one universal leaderboard. A more useful research map starts with categories.

Core Axes

Axis	What it captures	Why it matters
Modality	Text, code, image, audio, video, multimodal	Different benchmarks measure different skills
Deployment surface	API, open-weight, local, edge, enterprise	Determines cost, privacy, latency, and control
Capability niche	Reasoning, coding, search, agents, creative generation	Models specialize even when marketed broadly
Context and memory	Token window, retrieval, persistent state	Affects research, coding, and long-document tasks
Tool use	Function calling, browser use, computer use, code execution	Shapes how reliably a model can act as an agent
Evaluation posture	Public benchmark claims, third-party evals, private evals	Determines how much confidence to place in comparisons
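
One way to make these axes concrete is to record each model as a small structured profile. The sketch below is illustrative only: the class name, field names, and the rule in `is_fair_comparison` are assumptions introduced here, not an established schema.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Optional


@dataclass(frozen=True)
class ModelProfile:
    """One model described along the six axes (all field names are hypothetical)."""
    name: str
    modalities: FrozenSet[str]           # e.g. {"text", "code"}
    deployment_surfaces: FrozenSet[str]  # e.g. {"api", "open-weight"}
    capability_niches: FrozenSet[str]    # e.g. {"coding", "agents"}
    context_tokens: Optional[int]        # None when the window is not published
    tool_use: FrozenSet[str]             # e.g. {"function calling"}
    evaluation_posture: str              # "public", "third-party", or "private"


def shared_axes(a: ModelProfile, b: ModelProfile) -> Dict[str, FrozenSet[str]]:
    """Return the values two profiles have in common on each set-valued axis."""
    return {
        "modalities": a.modalities & b.modalities,
        "deployment_surfaces": a.deployment_surfaces & b.deployment_surfaces,
        "capability_niches": a.capability_niches & b.capability_niches,
        "tool_use": a.tool_use & b.tool_use,
    }


def is_fair_comparison(a: ModelProfile, b: ModelProfile) -> bool:
    """Assumed rule: compare only models that overlap in modality and niche."""
    common = shared_axes(a, b)
    return bool(common["modalities"]) and bool(common["capability_niches"])
```

The overlap rule is deliberately strict. A looser tracker might require only a shared modality and record niche differences as a caveat instead.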

Model Categories

Frontier General Models

These are broad models positioned for general chat, reasoning, coding, and tool use. They usually define the top of the commercial API market, but their published benchmarks are often selective. Scores need to be normalized for benchmark version, model variant, and evaluation setup before they can be compared.

Coding Models

Coding models should be tracked separately from general chat models. Useful evaluation questions cover patch correctness, repository navigation, test repair, long-context code understanding, and adherence to a codebase's existing style.

Reasoning Models

Reasoning models emphasize deliberate problem solving. Benchmark results can be impressive, but latency, cost, verbosity, and tool-use behavior are part of the actual product tradeoff.

Open-Weight Models

Open-weight models compete on controllability, deployability, and economics as much as on absolute benchmark score. Their value changes significantly with quantization, inference stack, hardware, context length, and fine-tuning ecosystem.

Embedding and Reranking Models

Embedding and reranking models are infrastructure components. They should be evaluated by retrieval quality, domain transfer, multilingual behavior, latency, and cost rather than chat-style benchmarks.
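
As an illustration of what "retrieval quality" can mean in practice, the sketch below computes recall@k for a single query. The choice of recall@k is an assumption made here for illustration; the text above does not prescribe a metric, and the function and parameter names are hypothetical.

```python
from typing import Sequence, Set


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k results."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not relevant_ids:
        raise ValueError("relevant_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


# Hypothetical ranking returned by a retriever, with two known relevant documents.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(recall_at_k(ranking, relevant, k=2))  # 0.5: only d1 is in the top 2
print(recall_at_k(ranking, relevant, k=4))  # 1.0: both appear in the top 4
```

Averaging this value over a query set gives one number per model, which can then be reported alongside latency and cost for the same queries.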

Audio, Image, and Video Models

Generation and transcription models need task-specific evaluation: instruction following, temporal consistency, artifacts, editability, latency, and rights/safety constraints.

Timeline Discipline

Release timelines should track at least four dates when possible:

announcement date
API availability date
general availability date
model replacement or deprecation date

Those dates are often different. Keeping them separate avoids false comparisons across models that were announced before they were broadly usable.
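
The four dates can be kept as separate optional fields so that a missing date stays visibly missing. This is a minimal sketch: the class, field, and method names are assumptions, and the dates in the example are placeholders, not real release dates.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class ReleaseTimeline:
    """Four separately tracked dates; None means the date is unknown."""
    announced: Optional[date] = None
    api_available: Optional[date] = None
    generally_available: Optional[date] = None
    deprecated: Optional[date] = None

    def first_usable(self) -> Optional[date]:
        """Earliest known date on which the model could actually be used."""
        known = [d for d in (self.api_available, self.generally_available) if d is not None]
        return min(known) if known else None

    def announcement_lag_days(self) -> Optional[int]:
        """Days between announcement and first usability, if both are known."""
        usable = self.first_usable()
        if self.announced is None or usable is None:
            return None
        return (usable - self.announced).days


# Placeholder dates for a hypothetical model.
timeline = ReleaseTimeline(announced=date(2030, 1, 1), api_available=date(2030, 1, 31))
print(timeline.announcement_lag_days())  # 30
```

Comparing models by `first_usable()` rather than by `announced` is one way to avoid ranking a model against competitors it could not yet be tested against.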

Benchmark Discipline

Benchmark notes should record:

benchmark name and version
reported score
evaluation source
model variant and date
whether tools, chain-of-thought, sampling, or retries were used
caveats from the source

The goal is not to worship benchmarks. The goal is to understand what each score can and cannot support.
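
The fields listed above map naturally onto a record type, and a small check can flag when two notes should not be compared directly. This is a sketch under stated assumptions: the field names, the default values, and the set of conditions checked in `comparability_issues` are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BenchmarkNote:
    """One reported score plus the conditions under which it was obtained."""
    benchmark: str
    version: str
    score: float
    source: str                 # who ran or reported the evaluation
    model_variant: str
    evaluated_on: str           # ISO date string, e.g. "2030-01-31"
    used_tools: bool = False
    used_chain_of_thought: bool = False
    sampling: str = "unspecified"
    retries: int = 0
    caveats: str = ""


def comparability_issues(a: BenchmarkNote, b: BenchmarkNote) -> List[str]:
    """List the reasons two scores should not be compared at face value."""
    issues = []
    if (a.benchmark, a.version) != (b.benchmark, b.version):
        issues.append("different benchmark or version")
    if a.source != b.source:
        issues.append("different evaluation source")
    for attr in ("used_tools", "used_chain_of_thought", "sampling", "retries"):
        if getattr(a, attr) != getattr(b, attr):
            issues.append(f"{attr} differs")
    return issues
```

An empty list does not prove the scores are comparable. It only means none of the recorded conditions differ, so the free-text caveats still need to be read.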