Introduction
Recent years have produced AI systems with remarkable capabilities for understanding unstructured data. Large language models comprehend and generate natural language with increasing sophistication. Vision models recognize objects, scenes, and concepts in images. Multimodal systems combine these capabilities, processing text, images, and audio in unified frameworks.
Tabular data has followed a different trajectory.
Despite being the dominant format for structured business information (spreadsheets, databases, CSVs, data warehouses), tables have not seen equivalent advances in native AI understanding. The standard workflow remains largely unchanged: before any model can be trained, data must pass through preprocessing pipelines that clean, transform, normalize, and engineer features from raw inputs.
This creates an unusual asymmetry. Natural language, with all its ambiguity and complexity, can be fed directly into modern language models. But tabular data, which is already structured and typed, requires extensive manual preparation.
We developed Schema to address this asymmetry. Schema is a Data Language Model (DLM): AI infrastructure designed to understand tabular data as its native input format, the way language models understand text.
The Preprocessing Bottleneck
Why Tabular AI Is Difficult
The difficulty of applying AI to tabular data is not immediately obvious. Tables are structured. Columns have types. Values are organized in rows. Compared to the ambiguity of natural language, tables seem straightforward. Yet several factors make tabular data challenging for neural networks:
Heterogeneous Types
A single table may contain continuous numeric values, categorical labels, dates, currencies, identifiers, and free text. Each type has different statistical properties and semantic meaning.
Semantic Ambiguity
The column name "value" could represent monetary amounts, sensor readings, or abstract scores. Without external knowledge, the meaning is underspecified.
Variable Structure
Tables vary in their number of rows and columns. Unlike images (fixed pixel grids) or sequences (variable length but single dimension), tables are two-dimensional with variation in both axes.
Missing Data
Real-world tables contain missing values due to collection errors, privacy constraints, or inapplicability. The pattern of missingness often carries information.
Implicit Relationships
Columns relate to each other in ways not captured by their values alone. Revenue and cost imply profit, dates imply temporal ordering, foreign keys imply entity relationships.
Current Approach
The standard solution to these challenges is preprocessing: a sequence of transformations that convert raw tabular data into a format suitable for machine learning. A typical pipeline is sketched after the steps below.
Data Cleaning
Missing values imputed, outliers removed, inconsistent formats standardized
Feature Engineering
Features derived from raw columns: dates become components, categories become vectors
Normalization
Numeric features scaled to common ranges to prevent magnitude dominance
Format Conversion
Data transformed into specific input format required by modeling framework
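For concreteness, here is a minimal sketch of what such a pipeline often looks like in practice. The dataset, column names, and transformation choices are illustrative, not drawn from any particular source.

```python
# Illustrative only: a typical hand-built preprocessing pipeline.
# Dataset and column names are hypothetical.
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_cols = ["monthly_spend", "tenure_months"]
categorical_cols = ["plan", "region"]

numeric_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),          # data cleaning
    ("scale", StandardScaler()),                           # normalization
])
categorical_pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),   # data cleaning
    ("encode", OneHotEncoder(handle_unknown="ignore")),    # feature engineering
])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", categorical_pipeline, categorical_cols),
])

df = pd.DataFrame({
    "monthly_spend": [29.0, np.nan, 99.0],
    "tenure_months": [12.0, 3.0, np.nan],
    "plan": ["basic", "pro", np.nan],
    "region": ["eu", "us", "us"],
})

# Format conversion: the raw table becomes an anonymous numeric matrix
# before any model ever sees it.
X = preprocess.fit_transform(df)
```

Every step in this sketch encodes a human decision that must be revisited whenever the upstream data changes.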
The Cost
Preprocessing pipelines are expensive to build, maintain, and evolve.
Development Time
A large share of data science effort is consumed by preprocessing
Brittleness
Pipelines break when upstream data changes: new categories, distribution shifts, format variations
Information Loss
Each preprocessing decision discards information. These losses compound.
Expertise Requirements
Requires domain, statistical, and engineering knowledge: a scarce combination
The preprocessing bottleneck is not a tooling problem to be solved with better libraries. It reflects a fundamental mismatch: we are forcing structured data through transformations designed to satisfy the limitations of models that were not built to understand structure.
Tables as Language
Language models succeed because they learn the structure of language. Words are not arbitrary symbols. They combine according to grammar, carry semantic relationships, and convey meaning through composition. Models trained on sufficient text learn these patterns, enabling them to understand and generate language without explicit rules.
We observe that tables possess analogous structure.
Columns as Vocabulary
Each column represents a semantic concept: a type of measurement, an entity attribute, a categorical distinction. The set of columns forms a vocabulary of concepts relevant to that domain.
Rows as Sentences
Each row expresses a complete observation: a transaction occurred, a measurement was taken, an entity exhibited certain properties. Rows compose column-concepts into meaningful statements.
Schemas as Grammar
The schema (column names, types, constraints, relationships) defines which compositions are valid. Grammar constrains the space of meaningful tables.
Distributions as Semantics
The statistical relationships between columns carry meaning. Revenue correlates with sales volume, age relates to health outcomes. These distributional patterns are the semantics of tabular data.
The Implication
If tables have learnable structure analogous to language, then the language modeling paradigm should transfer: train a model on diverse tables, and it should learn tabular grammar, the patterns of composition, relationship, and meaning that make tables intelligible.
Such a model would not require preprocessing because it would understand the raw structure directly. Missing values would not need imputation because the model would reason about uncertainty. Feature engineering would not be necessary because the model would learn relevant representations. Format conversion would be eliminated because tables would be the native input.
This is the motivation for the Data Language Model.
The Data Language Model
Definition
A Data Language Model (DLM) is an AI infrastructure designed to natively understand tabular data. The key distinction from prior approaches is that tables themselves, not serialized or flattened derivatives, are the model's input.
Schema
Schema-v0 is our first implementation of the DLM. It processes tabular data through a series of transformations that build native understanding from cells to columns to rows to tables.
What "Native" Means
Native understanding has a specific technical meaning; a conceptual sketch follows the list below:
No serialization penalty
Tables are not converted to sequences of tokens. The two-dimensional structure is preserved throughout processing.
Type awareness
The model distinguishes numeric, categorical, temporal, and identifier columns, processing each according to its type semantics.
Schema conditioning
Column names and metadata inform interpretation.
Structural reasoning
The model reasons about rows and columns as coherent units, not as independent values.
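As a rough picture of what these properties mean mechanically, the sketch below encodes each column with a type-specific encoder and keeps the result as a rows-by-columns-by-embedding tensor rather than a flat token sequence. This is a generic conceptual sketch with invented names (NumericEncoder, CategoricalEncoder, encode_table); it is not Schema's architecture or API.

```python
# Conceptual sketch only: type-aware, structure-preserving table encoding.
# Names and shapes are illustrative, not Schema's implementation.
import torch
import torch.nn as nn

class NumericEncoder(nn.Module):
    """Embeds a numeric column value into d_model dimensions."""
    def __init__(self, d_model: int):
        super().__init__()
        self.proj = nn.Linear(1, d_model)

    def forward(self, x):                      # x: (rows,)
        return self.proj(x.unsqueeze(-1))      # (rows, d_model)

class CategoricalEncoder(nn.Module):
    """Embeds a categorical column via a learned embedding table."""
    def __init__(self, num_categories: int, d_model: int):
        super().__init__()
        self.emb = nn.Embedding(num_categories, d_model)

    def forward(self, x):                      # x: (rows,) of category ids
        return self.emb(x)                     # (rows, d_model)

def encode_table(columns, encoders):
    """Encode each column with its type-specific encoder and stack the
    results, preserving the two-dimensional (rows x columns) structure."""
    encoded = [encoders[name](values) for name, values in columns.items()]
    return torch.stack(encoded, dim=1)         # (rows, n_columns, d_model)

d_model = 256
encoders = {
    "monthly_spend": NumericEncoder(d_model),
    "plan": CategoricalEncoder(num_categories=3, d_model=d_model),
}
columns = {
    "monthly_spend": torch.tensor([29.0, 49.0, 99.0]),
    "plan": torch.tensor([0, 1, 2]),
}
table_repr = encode_table(columns, encoders)   # shape: (3, 2, 256)
```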
Capabilities
Schema's native tabular understanding manifests as several capabilities that emerge from training.
Semantic Column Comprehension
Schema interprets column names as meaningful concepts. This comprehension is learned, not programmed, and it enables zero-shot reasoning about new datasets: Schema can interpret columns it has never seen, provided they follow recognizable naming conventions.
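One way to see why column names carry usable signal: off-the-shelf text embeddings already place semantically related names close together. The snippet below uses a generic sentence-embedding model purely to illustrate that signal; it says nothing about how Schema itself represents columns.

```python
# Illustration only: column names carry semantic signal that generic text
# embeddings can already pick up. This is not Schema's mechanism.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")
names = ["revenue", "sales_amount", "patient_age", "blood_pressure"]
vectors = model.encode(names, convert_to_tensor=True)

# "revenue" typically sits much closer to "sales_amount" than to the
# clinical column names.
similarity = util.cos_sim(vectors, vectors)
print(similarity)
```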
Intelligent Data Completion
Missing data is endemic in real-world tables. Traditional imputation methods (filling with column means or medians) are statistically naive. Schema approaches missing data as an inference problem: given a row with missing values, which values are most consistent with the observed values in that row, the patterns learned from complete rows, the semantic relationships between columns, and the distributional properties of the missing columns? This is analogous to how language models complete masked tokens by choosing the values most coherent with context.
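The difference is easiest to see in a toy example. Below, a missing value in a derived column (profit, implied by revenue and cost) is filled first with the column mean and then by inference from the row's observed values. The data is synthetic, and the linear model stands in for the general idea rather than for Schema's method.

```python
# Toy contrast: column-mean imputation vs. inference from observed columns.
# Synthetic data; column names are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
revenue = rng.uniform(1_000, 10_000, size=500)
cost = rng.uniform(500, 8_000, size=500)
profit = revenue - cost              # an implicit relationship between columns

rows = np.column_stack([revenue, cost])

# Pretend profit is missing in the last row.
# Naive imputation ignores everything else in that row:
mean_fill = profit[:-1].mean()

# Treating it as inference: learn the relationship from complete rows,
# then predict the missing value from the row's observed columns.
model = LinearRegression().fit(rows[:-1], profit[:-1])
inferred = model.predict(rows[-1:])[0]

# `inferred` recovers revenue[-1] - cost[-1] (up to numerical error);
# `mean_fill` does not, because it ignores row context.
```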
Multi-Task Intelligence
Traditional ML requires separate models for separate tasks. Schema learns a general representation of tabular data that supports multiple tasks through different output heads: Classification (assign categories), Regression (predict continuous values), Anomaly Detection (score rows by deviation), and Data Quality assessment (completeness, consistency, validity).
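The pattern of one shared representation feeding several lightweight heads can be sketched as follows; the module names and dimensions are invented for illustration and do not describe Schema's internals.

```python
# Generic sketch of one shared encoder feeding several task heads.
# Names and dimensions are illustrative, not Schema's implementation.
import torch
import torch.nn as nn

class MultiTaskTabularModel(nn.Module):
    def __init__(self, d_in: int, d_model: int = 256, n_classes: int = 10):
        super().__init__()
        # Shared representation of a row (stand-in for a full table encoder).
        self.encoder = nn.Sequential(
            nn.Linear(d_in, d_model), nn.ReLU(), nn.Linear(d_model, d_model)
        )
        # Task-specific heads over the shared representation.
        self.classify = nn.Linear(d_model, n_classes)  # classification
        self.regress = nn.Linear(d_model, 1)           # regression
        self.anomaly = nn.Linear(d_model, 1)           # anomaly / deviation score
        self.quality = nn.Linear(d_model, 3)           # completeness, consistency, validity

    def forward(self, x):
        h = self.encoder(x)
        return {
            "class_logits": self.classify(h),
            "value": self.regress(h).squeeze(-1),
            "anomaly_score": self.anomaly(h).squeeze(-1),
            "quality_scores": torch.sigmoid(self.quality(h)),
        }

model = MultiTaskTabularModel(d_in=16)
outputs = model(torch.randn(4, 16))   # one shared pass, several task outputs
```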
Continuous Adaptation
Neural networks typically suffer from catastrophic forgetting: learning new patterns overwrites previously learned ones. Schema incorporates mechanisms for continuous learning that protect important prior knowledge while accommodating new patterns. The model can be updated with new data without losing its understanding of previously seen domains.
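One well-known family of mechanisms for this is a regularizer that penalizes drift in parameters that mattered for earlier data, in the style of elastic weight consolidation. The sketch below shows that idea in isolation; we present it as an illustration of the general technique, not as Schema's specific mechanism.

```python
# Illustrative EWC-style penalty: discourage new training from moving
# parameters that were important for previously learned data.
# A generic technique sketch, not Schema's specific mechanism.
import torch

def ewc_penalty(model, old_params, fisher, strength=1.0):
    """Fisher-weighted squared drift from the previously learned weights."""
    penalty = torch.tensor(0.0)
    for name, param in model.named_parameters():
        penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return strength * penalty

# During an update on new data:
#   loss = task_loss(new_batch) + ewc_penalty(model, old_params, fisher)
# where `old_params` is a snapshot of the weights after the previous phase
# and `fisher` estimates how sensitive the earlier data's loss is to each weight.
```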
Domain Recognition
Schema automatically identifies the domain and industry sector of input data. Given a table, it can recognize the data's domain and apply domain-appropriate reasoning. Financial data invokes different priors than healthcare data; anomalies in retail look different than anomalies in manufacturing.
Technical Specifications
Schema-v0 (alpha)
| Specification | Value |
| --- | --- |
| Parameters | ~22M |
| Validation Accuracy | 95%+ |
| Embedding Dimension | 256 |
| Attention Heads | 8 |
| Processing Layers | 6 |
| Latent Tokens | 64 |
| Feature Count (column) | 1 to 1M+ |
| Class Count (row) | 2 to 1M+ |
| Vocabulary Size | 50,000 |
| Supported Verticals | Vertical agnostic |
Conclusion
Tabular data has been the neglected sibling of modern AI. While language and vision have seen transformative advances in native understanding, tables have remained trapped behind preprocessing pipelines: cleaned, transformed, and engineered before AI could engage.
Schema represents our effort to close this gap. By treating tabular understanding as a first-class capability (learnable from data, applicable across domains, and native to the format), we aim to shorten the path from raw data to trained models.
The preprocessing bottleneck is not inevitable. Tables have structure that can be learned, and learned structure enables understanding.
We offer Schema as infrastructure for AI model development on tabular data: a foundation for systems that work with data as it exists, not as preprocessing pipelines require it to be.
Contact
[email protected]