Concepts

DataChain is built around a small number of ideas. Understanding them makes the entire API predictable. Start with the Dataset DB and Datasets, then explore deeper topics as needed.

Dataset DB: the typed-record store, composed of versioned, typed datasets, queryable at warehouse speed; the operational half of the Data Context Layer for files and Python
Datasets: the atom of the Dataset DB (named, versioned, typed, immutable); the unit of persistence, sharing, compounding, and reasoning
Chain: a query that combines Python and SQL execution in one composable chain; lazy, optimized, atomic
Files and Types: the File abstraction, modality types, annotation types, and the type system
Compute Engine: heavy Python work over files in object storage; parallel, async, distributed, checkpoint-recoverable; the only layer that produces what does not yet exist
Knowledge Base: the compilation layer that turns persistent datasets into agent-readable knowledge; derived from the Dataset DB via LLM enrichments
Skill and MCP: the delivery surface that reaches Claude Code, Cursor, and Codex; agents read context here while pipelines write through the Python library
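
To make three of the properties listed above concrete (lazy chains, atomic saves, and immutable versioned datasets), here is a minimal pure-Python sketch. It is a toy model, not DataChain's API: ToyChain, ToyStore, and every method name below are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator, Optional

Record = dict[str, Any]
Step = Callable[[Iterator[Record]], Iterator[Record]]


@dataclass(frozen=True)
class ToyChain:
    """Hypothetical lazy chain: each call records a step; nothing runs until collect()."""
    source: Callable[[], Iterable[Record]]
    steps: tuple[Step, ...] = ()

    def filter(self, pred: Callable[[Record], bool]) -> "ToyChain":
        # Return a new chain; the original is left unchanged.
        return ToyChain(self.source, self.steps + (lambda rows: (r for r in rows if pred(r)),))

    def map(self, **new_cols: Callable[[Record], Any]) -> "ToyChain":
        def step(rows: Iterator[Record]) -> Iterator[Record]:
            for r in rows:
                yield {**r, **{name: fn(r) for name, fn in new_cols.items()}}
        return ToyChain(self.source, self.steps + (step,))

    def collect(self) -> list[Record]:
        # Execution happens only here: read the source, then apply the steps in order.
        rows: Iterator[Record] = iter(self.source())
        for step in self.steps:
            rows = step(rows)
        return list(rows)


class ToyStore:
    """Hypothetical dataset store: every save appends a new immutable version."""

    def __init__(self) -> None:
        self._versions: dict[str, list[tuple[Record, ...]]] = {}

    def save(self, name: str, chain: ToyChain) -> int:
        # Materialize fully before committing, so a failing step leaves no partial version.
        rows = tuple(chain.collect())
        self._versions.setdefault(name, []).append(rows)
        return len(self._versions[name])  # version numbers start at 1

    def read(self, name: str, version: Optional[int] = None) -> ToyChain:
        versions = self._versions[name]
        rows = versions[-1] if version is None else versions[version - 1]
        # Hand out copies so callers cannot mutate a stored version.
        return ToyChain(lambda: [dict(r) for r in rows])


calls: list[str] = []

def load() -> list[Record]:
    calls.append("read")  # records when the source is actually touched
    return [
        {"path": "a.jpg", "size": 120},
        {"path": "b.txt", "size": 40},
        {"path": "c.jpg", "size": 900},
    ]

store = ToyStore()
chain = ToyChain(load).filter(lambda r: r["path"].endswith(".jpg")).map(kb=lambda r: r["size"] / 1000)
calls_before_save = list(calls)  # still empty: building the chain did no work

v1 = store.save("images", chain)
v2 = store.save("images", store.read("images").filter(lambda r: r["size"] > 500))
```

In this sketch, saving under an existing name never overwrites anything: version 1 of "images" stays readable after version 2 is written, which is the sense of "immutable" and "versioned" modeled here. The real library's behavior and method names may differ.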