This page documents each Schema Model: its intended use, training, evaluation, and known limitations. It is published in support of customers' compliance obligations as deployers under the EU AI Act and analogous frameworks.
The factual content for each card is drawn from the corresponding technical paper, which remains the authoritative source for benchmark methodology and detailed results. As new Schema Models are released, additional cards will be appended to this page.
Capitalised terms used on this page (including "Schema Models," "Base Model," "Fine-Tuned Checkpoint," and "Customer Data") have the meanings set forth in the Schema Model License and the Data Processing Agreement.
## 0. Documented models
| Model | Released | Card version | Status |
|---|---|---|---|
| Schema-1 | April 2026 | 1.1 | Current |
## 1. Schema-1
### 1.1 Model details
| Name | Schema-1 |
|---|---|
| Provider | SchemaLabs, Inc., a Delaware corporation |
| Version | 1.0 |
| Released | April 2026 |
| Model class | Data Language Model (DLM) |
| Modality | Tabular (structured) data |
| Parameters | ~140 million |
| Architecture | Foundation model (pretrained neural network) |
| Distribution | Hosted-only via API and Web App; weights not distributed |
| License | Proprietary |
| Service status | Beta |
### 1.2 Intended use
Primary uses: tabular inference (classification, prediction); customer-specific fine-tuning to produce dedicated Fine-Tuned Checkpoints (Customer Endpoints); integration as the foundation layer for vertical and agentic AI built on structured data.
Primary users: engineering and data science teams building AI products on structured data.
Out of scope: generative natural language tasks; image, audio, or video; safety-critical real-time decisions without human review; high-risk EU AI Act applications without human oversight; any use prohibited by our Use Policy.
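As an illustration of the primary inference use case, a classification request against a Customer Endpoint might be packaged as below. This is a hypothetical sketch: the payload shape, field names, and model identifier are assumptions, not the actual API; consult the API reference for the real interface.

```python
# Hypothetical request-building helper for a hosted tabular-inference
# endpoint. All names here ("schema-1", the payload fields) are
# illustrative assumptions, not documented API surface.
import json

def build_inference_request(rows, target_column):
    """Package a batch of records for a classification/prediction call."""
    return json.dumps({
        "model": "schema-1",       # assumed model identifier
        "target": target_column,   # column whose value should be predicted
        "rows": rows,              # list of {column: value} records
    })

payload = build_inference_request(
    rows=[{"age": 42, "plan": "pro", "churned": None}],
    target_column="churned",
)
```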
### 1.3 Architecture
Schema-1 processes every input through four parallel pathways, which are fused into a unified representation:
- Column semantics: column identifiers and their content
- Per-column distributional summaries: statistics computed per column from cell values
- Cell values: raw numeric and categorical cell values
- Missing value structure: encoding of the pattern of present and absent values
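The four pathways can be illustrated with a minimal sketch over a pandas DataFrame. This is an assumed, simplified analogue of the inputs, not the model's actual encoders, which are described in the technical paper.

```python
# Simplified illustration (NOT the actual implementation) of the four
# input pathways, computed from a pandas DataFrame.
import numpy as np
import pandas as pd

def pathway_inputs(df: pd.DataFrame):
    # 1. Column semantics: column identifiers (content encoding omitted)
    names = list(df.columns)
    # 2. Per-column distributional summaries from cell values
    stats = {
        c: {"mean": df[c].mean(), "std": df[c].std()}
        for c in df.select_dtypes(include=np.number).columns
    }
    # 3. Raw numeric and categorical cell values
    values = df.to_numpy()
    # 4. Missing-value structure: binary mask of present/absent cells
    mask = df.isna().to_numpy()
    return names, stats, values, mask

df = pd.DataFrame({"age": [34, None, 51], "plan": ["pro", "free", None]})
names, stats, values, mask = pathway_inputs(df)
```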
When customers fine-tune Schema-1, the Base Model weights remain frozen. Each fine-tuning run produces a customer-specific isolated checkpoint (also referred to as a Customer Endpoint or Model Endpoint) that is the sole model used in that deployment. No fine-tuning job modifies the Base Model. No customer's data or checkpoint is accessible from any other customer's deployment.
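The frozen-base fine-tuning pattern described above can be sketched in PyTorch. The layer shapes and the trainable head below are assumptions for illustration only; the actual training stack is not public.

```python
# Conceptual sketch of fine-tuning with frozen Base Model weights.
# Architecture details are illustrative assumptions.
import torch
import torch.nn as nn

base = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 64))
head = nn.Linear(64, 2)  # customer-specific component of the checkpoint

for p in base.parameters():
    p.requires_grad = False  # Base Model weights stay frozen

# Only the customer-specific parameters receive gradient updates
opt = torch.optim.Adam(head.parameters(), lr=1e-3)
x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
loss = nn.functional.cross_entropy(head(base(x)), y)
loss.backward()
opt.step()
# base.parameters() accumulate no gradients; the shared base is untouched
```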
Detailed architecture is described in the technical paper (arxiv.org/abs/2605.06290).
### 1.4 Training data
Schema-1 was trained on approximately 2,307,000 tabular datasets:
- 2,000,000 synthetic datasets generated from a controlled sector-specific schema covering 10,000 industry sectors
- 307,000 real-world datasets drawn from public and domain-specific sources
No Customer Data was used to train Schema-1. No personally identifiable information was collected for training. No natural-language text corpora protected by copyright were used.
### 1.5 Evaluation
Schema-1 has been evaluated across six benchmarks. Full methodology, dataset lists, and per-condition results are in the technical paper.
| Benchmark | Schema-1 | Best competitor | Margin |
|---|---|---|---|
| OpenML-CC18 mean ROC-AUC (18 datasets) | 0.9849 | 0.9339 (TabPFN+AG) | +0.0510 |
| Missing data robustness, mean AUC (0 to 70%) | 0.9196 | 0.8933 (MIRRAMS) | +0.0263 |
| Tabular imputation, mean NRMSE (lower is better) | 0.163 | 0.235 (Gemini 3.0 Flash) | −31% (relative) |
| Column-agnostic AUC (no column names) | 0.9318 | 0.8658 (TabuLa-8B) | +0.0660 |
| Sector classification top-1 (10,000 classes) | 91.4% | 0.01% (random) | +91.4 pp |
| Sector classification top-5 (10,000 classes) | 97.0% | 0.05% (random) | +97.0 pp |
| Sequential fine-tuning retention | 97.8% | 0% (GBDTs: retrain) | +97.8 pp |
CC18 has been the reference benchmark for tabular methods since 2022. Schema-1 ranks first on every one of the 18 datasets. On the five hardest, performance moves from a 0.71 to 0.88 band into a 0.94 to 0.98 band, a distinct accuracy tier rather than an incremental gain. The 0.0510 gap between Schema-1 and the next-best system (TabPFN+AG) is larger than the entire range spanned by all prior competitors.
Real enterprise data is rarely complete: medical records skip tests, financial systems have dropped fields, sensor archives have gaps. The standard industry response, imputing missing values with column means before prediction, collapses as more data goes missing. Schema-1 declines by 0.0603 ROC-AUC from 0% to 70% missingness, less than one-quarter of XGBoost+Mean's decline. At 70% missingness, Schema-1 (0.8815) outperforms MIRRAMS at 50% missingness (0.8721). Schema-1 does not treat a missing value as an error to repair: the missing-value-structure pathway encodes the pattern of absence itself as a structural signal.
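The mean-imputation baseline criticised above works like the minimal NumPy sketch below. Note that after imputation the matrix carries no trace of which cells were missing, which is exactly the structural signal the text says Schema-1 retains.

```python
# Minimal sketch of the standard mean-imputation baseline.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))                  # fully observed ground truth

mask = np.zeros(X.shape, dtype=bool)         # cells we pretend are missing
mask[0, 0] = mask[2, 1] = mask[4, 2] = True
X_obs = np.where(mask, np.nan, X)

col_means = np.nanmean(X_obs, axis=0)        # fill each gap with its column mean
X_imputed = np.where(np.isnan(X_obs), col_means, X_obs)

# The imputed matrix contains no NaNs, but the pattern of absence
# itself has been discarded before the predictor ever sees the data.
```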
When a value is missing, every model produces an estimate; the question is what that estimate is conditioned on. Frontier LLMs condition on world knowledge from internet-scale text. Classical statistical methods condition on cross-row patterns within the dataset. Schema-1 conditions on neither: it learns the joint distributional relationships between columns within the specific dataset at hand. Across 20 real-world datasets and nine missingness conditions, Schema-1's mean reconstruction error is 31% lower than the best LLM and 46% lower than the best classical method. The advantage widens sharply under MNAR, where domain priors offer no traction.
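For reference, NRMSE is RMSE normalised by a scale of the true values. The sketch below assumes range normalisation, one common convention; the technical paper may normalise differently.

```python
# NRMSE under a range-normalisation convention (an assumption here).
import numpy as np

def nrmse(y_true, y_pred):
    """RMSE divided by the range of the true values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return float(rmse / (y_true.max() - y_true.min()))

# True values spanning 0..10, predictions off by 1 everywhere:
val = nrmse([0, 5, 10], [1, 6, 11])  # RMSE = 1, range = 10
```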
Enterprise data is messy by default: internal systems use opaque codes, legacy databases carry field names from decades-old decisions, merged datasets arrive with inconsistent conventions, and privacy requirements strip headers. Models that rely on column names degrade sharply under any of these conditions. Schema-1 encodes column semantics as one input pathway among four, not as a dependency. With names completely removed, Schema-1 drops 0.0117 (1.24%); TabuLa-8B drops 0.0709 and ConTextTab 0.0748. Schema-1 without any column names still outperforms both semantics-aware models with full names.
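A column-agnostic evaluation of the kind described can be set up by replacing every header with an uninformative positional placeholder before inference, so no model can lean on column-name semantics. The helper below is a hypothetical sketch of that preprocessing step.

```python
# Strip semantic column names for a column-agnostic evaluation run.
import pandas as pd

def strip_column_names(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy whose headers carry no semantic information."""
    return df.rename(columns={c: f"col_{i}" for i, c in enumerate(df.columns)})

df = pd.DataFrame({"annual_income": [52000], "loan_default": [0]})
anon = strip_column_names(df)  # headers become col_0, col_1; values unchanged
```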
Schema-1 was given 500 real-world datasets it had never seen, with all column names removed, no labels, and no context, and asked to identify which of 10,000 possible industry sectors each came from. It named the correct sector on the first attempt for 91.4% of datasets, and the correct sector appeared among its top five predictions for 97.0%. Random guessing succeeds at a rate of 1 in 10,000. No prior tabular model has a defined mechanism for this task.
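The random baselines reported in the evaluation table follow directly from uniform guessing over the sector vocabulary: with 10,000 equally likely classes, a random guesser's top-k accuracy is k/10,000.

```python
# Derivation of the random-guessing baselines in the evaluation table.
n_classes = 10_000
top1 = 1 / n_classes   # 0.0001 → 0.01%
top5 = 5 / n_classes   # 0.0005 → 0.05%
```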
### 1.6 Limitations
- Probabilistic outputs: all Schema-1 outputs are probabilistic and include confidence scores. They should not be treated as ground truth.
- No automatic human review: customers are responsible for implementing human oversight where required.
- Domain shift: performance on data substantially different from training distribution may be lower than benchmark performance suggests.
- Hosted-only: Schema-1 is available exclusively through SchemaLabs' hosted API and Web App; dedicated regional deployments are available as a paid option for enterprise customers.
- Beta status: availability, performance, and feature set may change.
### 1.7 Bias, risks, and fairness
Customers deploying Schema-1 in contexts affecting individuals (employment, lending, insurance, healthcare, education, criminal justice) bear responsibility for:
- Auditing their fine-tuning data for protected-class proxies and historical bias
- Testing Schema-1 outputs for disparate impact across protected classes
- Implementing human oversight for high-stakes decisions
- Complying with applicable anti-discrimination laws
Use of Schema-1 for illegal discrimination is prohibited under our Use Policy §1.1.
Schema-1 has not been formally evaluated for adversarial robustness against membership inference, model extraction, or adversarial input attacks. Our Use Policy prohibits these attack types. Customers in adversarial-environment deployments should not assume Schema-1 is hardened against such attacks.
### 1.8 Regulatory context
#### EU AI Act
Schema-1 itself is a general-purpose Data Language Model for tabular data and is not inherently classified as high-risk. Customer deployments may fall within high-risk categories under Annex III (credit scoring, employment decisions, insurance, healthcare diagnostics, access to essential services). Customers deploying in these contexts bear deployer-level obligations.
SchemaLabs is the Provider of Schema-1. We maintain the technical documentation for the model (this Model Card and the technical paper). As Schema-1 matures and the EU AI Act's high-risk-system phase-in progresses (August 2026), we will expand our provider-level processes accordingly.
For deployer obligations, see our Use Policy §2.
### 1.9 Contact, citation, and resources
- Technical questions: [email protected]
- Compliance and customer documentation: [email protected]
- Security: [email protected]
- General: [email protected]
Full technical paper: arxiv.org/abs/2605.06290
#### Citation
SchemaLabs, Inc. (2026). Data Language Models: A New Foundation Model Class for Tabular Data. arXiv:2605.06290. arxiv.org/abs/2605.06290