Every major data modality now has a foundation model that understands it natively: text has language models, images have vision models, audio has audio models. Tabular data, the modality on which many consequential real-world AI decisions are made, does not. Every approach to tabular AI today, from gradient-boosted trees to the latest tabular foundation models, requires a preprocessing pipeline before any model can consume the data. None of them understands tabular data as a modality. We introduce the Data Language Model (DLM), the missing foundation model for tabular data. A DLM understands tables the way a language model understands sentences: natively, without serialization or preprocessing, directly from raw cell values. It is the tabular data layer on which AI models, agents, and vertical AI applications can be built, eliminating the preprocessing pipelines that currently stand between raw data and every AI system that consumes it. We present Schema-1, the first DLM: a 140M-parameter model trained on more than 2.3M synthetic and real-world tabular datasets. Schema-1 outperforms gradient-boosted ensembles, AutoML stacks, and the tabular foundation models we evaluate on established row-level prediction benchmarks. It identifies the industry sector of any unseen dataset from raw cell values alone, reliably across any domain, a task no prior tabular model can perform. It is the native tabular understanding layer that has been missing from the AI stack.
Introducing Data Language Models
A New Foundation Model Class
Introduction
Recent years have produced AI systems with remarkable capabilities for understanding unstructured data. Large language models comprehend and generate natural language with increasing sophistication. Vision models recognize objects, scenes, and concepts in images. Multimodal systems combine these capabilities, processing text, images, and audio in unified frameworks.
Tabular data has followed a different trajectory.
Despite being the dominant format for structured business information (spreadsheets, databases, CSVs, data warehouses), tables have not seen equivalent advances in native AI understanding. The standard workflow remains largely unchanged: before any model can be trained, data must pass through preprocessing pipelines that clean, transform, normalize, and engineer features from raw inputs.
This creates an unusual asymmetry. Natural language, with all its ambiguity and complexity, can be fed directly into modern language models. But tabular data, which is already structured and typed, requires extensive manual preparation.
We developed Schema to address this asymmetry. Schema is a Data Language Model: AI infrastructure built to understand tabular data in its native format, the way language models understand text.
The Preprocessing Bottleneck
Why Tabular AI Is Difficult
The difficulty of applying AI to tabular data is not immediately obvious. Tables are structured. Columns have types. Values are organized in rows. Compared to the ambiguity of natural language, tables seem straightforward. Yet several factors make tabular data challenging for neural networks:
- Heterogeneous Types: A single table may contain continuous numeric values, categorical labels, dates, currencies, identifiers, and free text. Each type has different statistical properties and semantic meaning.
- Semantic Ambiguity: The column name "value" could represent monetary amounts, sensor readings, or abstract scores. Without external knowledge, the meaning is underspecified.
- Variable Structure: Tables vary in their number of rows and columns. Unlike images (fixed pixel grids) or sequences (variable length but single dimension), tables are two-dimensional and vary along both axes.
- Missing Data: Real-world tables contain missing values due to collection errors, privacy constraints, or inapplicability. The pattern of missingness often carries information.
- Implicit Relationships: Columns relate to each other in ways not captured by their values alone. Revenue and cost imply profit; dates imply temporal ordering; foreign keys imply entity relationships.
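Two of these factors, heterogeneous types and informative missingness, can be made concrete with a small sketch. The table, its column names, and the `infer_type` helper below are invented for illustration and are not taken from Schema-1; the type guesser is deliberately crude.

```python
import csv
import io
from datetime import datetime

# Hypothetical raw export; columns and values are invented for illustration.
RAW = """customer_id,signup_date,plan,monthly_spend,churned
C001,2023-01-15,basic,19.99,0
C002,2023-02-03,premium,,1
C003,2023-02-20,basic,24.50,0
C004,2023-03-11,,,1
"""

def infer_type(values):
    """Guess a coarse type for a column from its non-empty raw strings."""
    present = [v for v in values if v != ""]
    if not present:
        return "empty"
    parsers = (
        ("integer", int),
        ("float", float),
        ("date", lambda s: datetime.strptime(s, "%Y-%m-%d")),
    )
    for name, parse in parsers:
        try:
            for v in present:
                parse(v)
            return name
        except ValueError:
            continue
    return "text"

rows = list(csv.DictReader(io.StringIO(RAW)))
columns = {name: [r[name] for r in rows] for name in rows[0]}

# One small table already mixes identifiers, dates, categories, and numbers.
column_types = {name: infer_type(vals) for name, vals in columns.items()}

def churn_rate(subset):
    """Fraction of rows in the subset whose 'churned' flag is 1."""
    return sum(int(r["churned"]) for r in subset) / len(subset)

# Split rows by whether monthly_spend was recorded at all.
missing_spend = [r for r in rows if r["monthly_spend"] == ""]
present_spend = [r for r in rows if r["monthly_spend"] != ""]

print(column_types)
print(churn_rate(missing_spend), churn_rate(present_spend))
```

In this toy table the rows with no recorded spend are exactly the rows that churned, so the fact that a cell is empty is itself a signal that a naive fill-in step would erase.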
Current Approach
The standard solution to these challenges is preprocessing: a sequence of transformations that convert raw tabular data into a format suitable for machine learning.
Variants of this pipeline emerge for different model classes. Tabular foundation models reduce training cold-start, and LLM-based approaches replace preprocessing with serialization, but each preserves a distinct bottleneck of its own.
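As a rough illustration of what such a pipeline involves, the sketch below applies three common steps (median imputation, standardization, one-hot encoding) to a toy table. The column names, the data, and the `fit_pipeline` and `transform` helpers are assumptions made for this example; real pipelines are typically longer and built with dedicated libraries.

```python
from statistics import mean, median, pstdev

# Hypothetical raw rows; None marks a missing cell.
raw_rows = [
    {"plan": "basic", "monthly_spend": 20.0},
    {"plan": "premium", "monthly_spend": None},
    {"plan": "basic", "monthly_spend": 30.0},
    {"plan": "premium", "monthly_spend": 40.0},
]

def fit_pipeline(rows):
    """Learn imputation, scaling, and encoding parameters from training rows."""
    observed = [r["monthly_spend"] for r in rows if r["monthly_spend"] is not None]
    fill = median(observed)
    filled = [fill if r["monthly_spend"] is None else r["monthly_spend"] for r in rows]
    return {
        "fill": fill,                                      # imputation value
        "mean": mean(filled),                              # for standardization
        "std": pstdev(filled) or 1.0,                      # guard against zero spread
        "categories": sorted({r["plan"] for r in rows}),   # one-hot vocabulary
    }

def transform(row, params):
    """Turn one raw row into a purely numeric feature vector."""
    spend = params["fill"] if row["monthly_spend"] is None else row["monthly_spend"]
    scaled = (spend - params["mean"]) / params["std"]
    one_hot = [1.0 if row["plan"] == c else 0.0 for c in params["categories"]]
    return [scaled] + one_hot

params = fit_pipeline(raw_rows)
features = [transform(r, params) for r in raw_rows]

for vector in features:
    print(vector)
```

Every parameter in `params` is fixed at fit time, so the numeric vectors a downstream model sees depend on choices made before the model is involved at all.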
The Cost
of AI development effort consumed by preprocessing
- Brittleness: Pipelines break when upstream data changes, for example through new categories, distribution shifts, or format variations.
- Information Loss: Each preprocessing decision discards information. These losses compound.
- Expertise Requirements: Building and maintaining pipelines requires domain, statistical, and engineering knowledge, a scarce combination.
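The first two costs can be shown in a few lines. This is a minimal sketch under stated assumptions: the category list, the imputation value, and the `encode_plan` and `impute_spend` helpers are hypothetical stand-ins for parameters a pipeline would have learned at training time.

```python
# Hypothetical parameters frozen when the pipeline was fitted.
TRAIN_CATEGORIES = ["basic", "premium"]
TRAIN_MEDIAN_SPEND = 30.0  # assumed imputation value

def encode_plan(plan):
    """One-hot encode a plan; raises on categories absent at training time."""
    if plan not in TRAIN_CATEGORIES:
        raise ValueError(f"unseen category: {plan!r}")
    return [1.0 if plan == c else 0.0 for c in TRAIN_CATEGORIES]

def impute_spend(spend):
    """Replace a missing spend with the training-time median."""
    return TRAIN_MEDIAN_SPEND if spend is None else spend

# Brittleness: an upstream system starts emitting a new plan name.
try:
    encode_plan("enterprise")
    pipeline_broke = False
except ValueError:
    pipeline_broke = True

# Information loss: after imputation, a missing cell and a genuine
# observation of 30.0 are indistinguishable to any downstream model.
same_after_imputation = impute_spend(None) == impute_spend(30.0)

print(pipeline_broke, same_after_imputation)
```

Whether an encoder raises, silently drops, or buckets an unseen category is a design choice; each option trades a visible failure for a quieter loss of information.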
The preprocessing bottleneck is not a tooling problem to be solved with better libraries. It reflects a fundamental mismatch: we are forcing structured data through transformations designed to satisfy the limitations of models that were not built to understand structure.
Tables as Language
Language models succeed because they learn the structure of language. Words are not arbitrary symbols. They combine according to grammar, carry semantic relationships, and convey meaning through composition. Models trained on sufficient text learn these patterns, enabling them to understand and generate language without explicit rules.
We observe that tables possess analogous structure.
- Columns as Vocabulary: Each column represents a semantic concept: a type of measurement, an entity attribute, a categorical distinction. The set of columns forms a vocabulary of concepts relevant to that domain.
- Rows as Sentences: Each row expresses a complete observation: a transaction occurred, a measurement was taken, an entity exhibited certain properties. Rows compose column-concepts into meaningful statements.
- Schemas as Grammar: The schema (column names, types, constraints, relationships) defines which compositions are valid. Grammar constrains the space of meaningful tables.
- Distributions as Semantics: The statistical relationships between columns carry meaning. Revenue correlates with sales volume; age relates to health outcomes. These distributional patterns are the semantics of tabular data.
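The first three parts of the analogy can be sketched in code. This is only an illustration of the analogy, not a description of how Schema-1 represents tables; the schema, its constraints, and the helper names are all hypothetical.

```python
# Hypothetical schema: column name -> predicate a valid cell must satisfy.
schema = {
    "order_id": lambda v: isinstance(v, str) and v.startswith("O"),
    "revenue": lambda v: isinstance(v, (int, float)) and v >= 0,
    "cost": lambda v: isinstance(v, (int, float)) and v >= 0,
}

# Columns as vocabulary: the concepts this table can talk about.
vocabulary = list(schema)

def is_grammatical(row):
    """A row is valid when it uses exactly the schema's columns and every cell passes."""
    return set(row) == set(schema) and all(schema[c](row[c]) for c in schema)

def as_sentence(row):
    """Rows as sentences: render a row as (column, value) pairs in schema order."""
    return [(c, row[c]) for c in vocabulary]

rows = [
    {"order_id": "O1", "revenue": 120.0, "cost": 80.0},
    {"order_id": "O2", "revenue": -5.0, "cost": 10.0},  # violates revenue >= 0
]

# Schemas as grammar: only some compositions of column values are valid.
valid = [is_grammatical(r) for r in rows]

print(vocabulary)
print(as_sentence(rows[0]))
print(valid)
```

Here the grammar is written by hand as explicit predicates. The premise of the analogy is that a model exposed to enough tables could learn such constraints, along with the distributional patterns this sketch leaves out, instead of being given them.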
The Implication
If tables have learnable structure analogous to language, then the language modeling paradigm should transfer: train a model on diverse tables, and it should learn tabular grammar, the patterns of composition, relationship, and meaning that make tables intelligible.
Such a model would not require preprocessing because it would understand the raw structure directly. Missing values would not need imputation because the model would reason about uncertainty. Feature engineering would not be necessary because the model would learn relevant representations. Format conversion would be eliminated because tables would be the native input.
This is the motivation for the Data Language Model.