Introduction
Recent years have produced AI systems with remarkable capabilities for understanding unstructured data. Large language models comprehend and generate natural language with increasing sophistication. Vision models recognize objects, scenes, and concepts in images. Multimodal systems combine these capabilities, processing text, images, and audio in unified frameworks.
Tabular data has followed a different trajectory.
Despite being the dominant format for structured business information (spreadsheets, databases, CSVs, data warehouses), tables have not seen equivalent advances in native AI understanding. The standard workflow remains largely unchanged: before any model can be trained, data must pass through preprocessing pipelines that clean, transform, normalize, and engineer features from raw inputs.
This creates an unusual asymmetry. Natural language, with all its ambiguity and complexity, can be fed directly into modern language models. But tabular data, which is already structured and typed, requires extensive manual preparation.
We developed Schema to address this asymmetry. Schema is a Data Language Model (DLM): AI infrastructure designed to understand tabular data as its native input format, the way language models understand text.
The Preprocessing Bottleneck
Why Tabular AI Is Difficult
The difficulty of applying AI to tabular data is not immediately obvious. Tables are structured. Columns have types. Values are organized in rows. Compared to the ambiguity of natural language, tables seem straightforward. Yet several factors make tabular data challenging for neural networks:
Heterogeneous Types
A single table may contain continuous numeric values, categorical labels, dates, currencies, identifiers, and free text. Each type has different statistical properties and semantic meaning.
Semantic Ambiguity
The column name "value" could represent monetary amounts, sensor readings, or abstract scores. Without external knowledge, the meaning is underspecified.
Variable Structure
Tables vary in their number of rows and columns. Unlike images (fixed pixel grids) or sequences (variable length but single dimension), tables are two-dimensional with variation in both axes.
Missing Data
Real-world tables contain missing values due to collection errors, privacy constraints, or inapplicability. The pattern of missingness often carries information.
Implicit Relationships
Columns relate to each other in ways not captured by their values alone. Revenue and cost imply profit, dates imply temporal ordering, foreign keys imply entity relationships.
Current Approach
The standard solution to these challenges is preprocessing: a sequence of transformations that convert raw tabular data into a format suitable for machine learning. A typical pipeline is sketched after the steps below.
Data Cleaning
Missing values imputed, outliers removed, inconsistent formats standardized
Feature Engineering
Features derived from raw columns: dates become components, categories become vectors
Normalization
Numeric features scaled to common ranges to prevent magnitude dominance
Format Conversion
Data transformed into specific input format required by modeling framework
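For concreteness, here is a minimal sketch of what such a pipeline often looks like in practice. The dataset, column names, and transformation choices are illustrative, not drawn from any particular source.

```python
# Illustrative only: a typical hand-built preprocessing pipeline.
# Dataset and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["monthly_spend", "tenure_months"]
categorical_cols = ["plan", "region"]

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),          # data cleaning
    ("scale", StandardScaler()),                           # normalization
])
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),   # data cleaning
    ("encode", OneHotEncoder(handle_unknown="ignore")),    # feature engineering
])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", categorical_pipeline, categorical_cols),
])

df = pd.DataFrame({
    "monthly_spend": [29.0, np.nan, 99.0],
    "tenure_months": [12.0, 3.0, np.nan],
    "plan": ["basic", "pro", np.nan],
    "region": ["eu", "us", "us"],
})

# Format conversion: the raw table becomes an anonymous numeric matrix
# before any model ever sees it.
X = preprocess.fit_transform(df)
```

Every step in this sketch encodes a human decision that must be revisited whenever the upstream data changes.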
The Cost
Preprocessing pipelines are expensive to build, maintain, and evolve.
Development Time
A large share of data science effort is consumed by preprocessing
Brittleness
Pipelines break when upstream data changes: new categories, distribution shifts, format variations
Information Loss
Each preprocessing decision discards information. These losses compound.
Expertise Requirements
Requires domain, statistical, and engineering knowledge: a scarce combination
The preprocessing bottleneck is not a tooling problem to be solved with better libraries. It reflects a fundamental mismatch: we are forcing structured data through transformations designed to satisfy the limitations of models that were not built to understand structure.
Tables as Language
Language models succeed because they learn the structure of language. Words are not arbitrary symbols. They combine according to grammar, carry semantic relationships, and convey meaning through composition. Models trained on sufficient text learn these patterns, enabling them to understand and generate language without explicit rules.
We observe that tables possess analogous structure.
Columns as Vocabulary
Each column represents a semantic concept: a type of measurement, an entity attribute, a categorical distinction. The set of columns forms a vocabulary of concepts relevant to that domain.
Rows as Sentences
Each row expresses a complete observation: a transaction occurred, a measurement was taken, an entity exhibited certain properties. Rows compose column-concepts into meaningful statements.
Schemas as Grammar
The schema (column names, types, constraints, relationships) defines which compositions are valid. Grammar constrains the space of meaningful tables.
Distributions as Semantics
The statistical relationships between columns carry meaning. Revenue correlates with sales volume, age relates to health outcomes. These distributional patterns are the semantics of tabular data.
The Implication
If tables have learnable structure analogous to language, then the language modeling paradigm should transfer: train a model on diverse tables, and it should learn tabular grammar, the patterns of composition, relationship, and meaning that make tables intelligible.
Such a model would not require preprocessing because it would understand the raw structure directly. Missing values would not need imputation because the model would reason about uncertainty. Feature engineering would not be necessary because the model would learn relevant representations. Format conversion would be eliminated because tables would be the native input.
This is the motivation for the Data Language Model.
The Data Language Model
Definition
A Data Language Model (DLM) is an AI infrastructure designed to natively understand tabular data. The key distinction from prior approaches is that tables themselves, not serialized or flattened derivatives, are the model's input.
Schema
Schema-v0 is our first implementation of the DLM. It processes tabular data through a series of transformations that build native understanding from cells to columns to rows to tables.
What "Native" Means
Native understanding has a specific technical meaning; a conceptual sketch follows the list below:
No serialization penalty
Tables are not converted to sequences of tokens. The two-dimensional structure is preserved throughout processing.
Type awareness
The model distinguishes numeric, categorical, temporal, and identifier columns, processing each according to its type semantics.
Schema conditioning
Column names and metadata inform interpretation.
Structural reasoning
The model reasons about rows and columns as coherent units, not as independent values.
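As a rough picture of what these properties mean mechanically, the sketch below encodes each column with a type-specific encoder and keeps the result as a rows-by-columns-by-embedding tensor rather than a flat token sequence. This is a generic conceptual sketch with invented names (NumericEncoder, CategoricalEncoder, encode_table); it is not Schema's architecture or API.

```python
# Conceptual sketch only: type-aware, structure-preserving table encoding.
# Names and shapes are illustrative, not Schema's implementation.
import torch
import torch.nn as nn

class NumericEncoder(nn.Module):
    """Embeds a numeric column value into d_model dimensions."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(1, d_model)

    def forward(self, x):                      # x: (rows,)
        return self.proj(x.unsqueeze(-1))      # (rows, d_model)

class CategoricalEncoder(nn.Module):
    """Embeds a categorical column via a learned embedding table."""
    def __init__(self, num_categories: int, d_model: int):
        super().__init__()
        self.emb = nn.Embedding(num_categories, d_model)

    def forward(self, x):                      # x: (rows,) of category ids
        return self.emb(x)                     # (rows, d_model)

def encode_table(columns, encoders):
    """Encode each column with its type-specific encoder and stack the
    results, preserving the two-dimensional (rows x columns) structure."""
    encoded = [encoders[name](values) for name, values in columns.items()]
    return torch.stack(encoded, dim=1)         # (rows, n_columns, d_model)

d_model = 256
encoders = {
    "monthly_spend": NumericEncoder(d_model),
    "plan": CategoricalEncoder(num_categories=3, d_model=d_model),
}
columns = {
    "monthly_spend": torch.tensor([29.0, 49.0, 99.0]),
    "plan": torch.tensor([0, 1, 2]),
}
table_repr = encode_table(columns, encoders)   # shape: (3, 2, 256)
```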
Capabilities
Schema's native tabular understanding manifests as several capabilities that emerge from training.
Semantic Column Comprehension
Schema interprets column names as meaningful concepts. This comprehension is learned, not programmed, and it enables zero-shot reasoning about new datasets: Schema can interpret columns it has never seen, provided they follow recognizable naming conventions.
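One way to see why column names carry usable signal: off-the-shelf text embeddings already place semantically related names close together. The snippet below uses a generic sentence-embedding model purely to illustrate that signal; it says nothing about how Schema itself represents columns.

```python
# Illustration only: column names carry semantic signal that generic text
# embeddings can already pick up. This is not Schema's mechanism.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
names = ["revenue", "sales_amount", "patient_age", "blood_pressure"]
vectors = model.encode(names, convert_to_tensor=True)

# "revenue" typically sits much closer to "sales_amount" than to the
# clinical column names.
similarity = util.cos_sim(vectors, vectors)
print(similarity)
```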
Intelligent Data Completion
Missing data is endemic in real-world tables. Traditional imputation methods (filling with column means or medians) are statistically naive. Schema approaches missing data as an inference problem: given a row with missing values, which values are most consistent with the observed values in that row, the patterns learned from complete rows, the semantic relationships between columns, and the distributional properties of the missing columns? This is analogous to how language models complete masked tokens by choosing the values most coherent with context.
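The difference is easiest to see in a toy example. Below, a missing value in a derived column (profit, implied by revenue and cost) is filled first with the column mean and then by inference from the row's observed values. The data is synthetic, and the linear model stands in for the general idea rather than for Schema's method.

```python
# Toy contrast: column-mean imputation vs. inference from observed columns.
# Synthetic data; column names are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
revenue = rng.uniform(1_000, 10_000, size=500)
cost = rng.uniform(500, 8_000, size=500)
profit = revenue - cost              # an implicit relationship between columns

rows = np.column_stack([revenue, cost])

# Pretend profit is missing in the last row.
# Naive imputation ignores everything else in that row:
mean_fill = profit[:-1].mean()

# Treating it as inference: learn the relationship from complete rows,
# then predict the missing value from the row's observed columns.
model = LinearRegression().fit(rows[:-1], profit[:-1])
inferred = model.predict(rows[-1:])[0]

# `inferred` recovers revenue[-1] - cost[-1] (up to numerical error);
# `mean_fill` does not, because it ignores row context.
```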
Multi-Task Intelligence
Traditional ML requires separate models for separate tasks. Schema learns a general representation of tabular data that supports multiple tasks through different output heads: Classification (assign categories), Regression (predict continuous values), Anomaly Detection (score rows by deviation), and Data Quality assessment (completeness, consistency, validity).
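The pattern of one shared representation feeding several lightweight heads can be sketched as follows; the module names and dimensions are invented for illustration and do not describe Schema's internals.

```python
# Generic sketch of one shared encoder feeding several task heads.
# Names and dimensions are illustrative, not Schema's implementation.
import torch
import torch.nn as nn

class MultiTaskTabularModel(nn.Module):
    def __init__(self, d_in: int, d_model: int = 256, n_classes: int = 10):
        super().__init__()
        # Shared representation of a row (stand-in for a full table encoder).
        self.encoder = nn.Sequential(
            nn.Linear(d_in, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Task-specific heads over the shared representation.
        self.classify = nn.Linear(d_model, n_classes)  # classification
        self.regress = nn.Linear(d_model, 1)           # regression
        self.anomaly = nn.Linear(d_model, 1)           # anomaly / deviation score
        self.quality = nn.Linear(d_model, 3)           # completeness, consistency, validity

    def forward(self, x):
        h = self.encoder(x)
        return {
            "class_logits": self.classify(h),
            "value": self.regress(h).squeeze(-1),
            "anomaly_score": self.anomaly(h).squeeze(-1),
            "quality_scores": torch.sigmoid(self.quality(h)),
        }

model = MultiTaskTabularModel(d_in=16)
outputs = model(torch.randn(4, 16))   # one shared pass, several task outputs
```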
Continuous Adaptation
Neural networks typically suffer from catastrophic forgetting: learning new patterns overwrites previously learned ones. Schema incorporates mechanisms for continuous learning that protect important prior knowledge while accommodating new patterns. The model can be updated with new data without losing its understanding of previously seen domains.
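One well-known family of mechanisms for this is a regularizer that penalizes drift in parameters that mattered for earlier data, in the style of elastic weight consolidation. The sketch below shows that idea in isolation; we present it as an illustration of the general technique, not as Schema's specific mechanism.

```python
# Illustrative EWC-style penalty: discourage new training from moving
# parameters that were important for previously learned data.
# A generic technique sketch, not Schema's specific mechanism.
import torch

def ewc_penalty(model, old_params, fisher, strength=1.0):
    """Fisher-weighted squared drift from the previously learned weights."""
    penalty = torch.tensor(0.0)
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return strength * penalty

# During an update on new data:
#   loss = task_loss(new_batch) + ewc_penalty(model, old_params, fisher)
# where `old_params` is a snapshot of the weights after the previous phase
# and `fisher` estimates how sensitive the earlier data's loss is to each weight.
```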
Domain Recognition
Schema automatically identifies the domain and industry sector of input data. Given a table, it can recognize the data's domain and apply domain-appropriate reasoning. Financial data invokes different priors than healthcare data; anomalies in retail look different than anomalies in manufacturing.
Technical Specifications
Schema-v0 (alpha)
| Specification | Value |
| --- | --- |
| Parameters | ~22M |
| Validation Accuracy | 95%+ |
| Embedding Dimension | 256 |
| Attention Heads | 8 |
| Processing Layers | 6 |
| Latent Tokens | 64 |
| Feature Count (column) | 1 to 1M+ |
| Class Count (row) | 2 to 1M+ |
| Vocabulary Size | 50,000 |
| Supported Verticals | Vertical agnostic |
Conclusion
Tabular data has been the neglected sibling of modern AI. While language and vision have seen transformative advances in native understanding, tables have remained trapped behind preprocessing pipelines: cleaned, transformed, and engineered before AI could engage.
Schema represents our effort to close this gap. By treating tabular understanding as a first-class capability (learnable from data, applicable across domains, and native to the format), we aim to shorten the path from raw data to trained models.
The preprocessing bottleneck is not inevitable. Tables have structure that can be learned, and learned structure enables understanding.
We offer Schema as infrastructure for AI model development on tabular data: a foundation for systems that work with data as it exists, not as preprocessing pipelines require it to be.
Contact
[email protected]