
Model Card

SchemaLabs, Inc.

Last updated: May 13, 2026
Current model: Schema-1
Technical paper: arxiv.org/abs/2605.06290
Contents
  0. Documented models
  1. Schema-1
  1.1 Model details
  1.2 Intended use
  1.3 Architecture
  1.4 Training data
  1.5 Evaluation
  1.6 Limitations
  1.7 Bias, risks, and fairness
  1.8 Regulatory context
  1.9 Contact, citation, and resources

This page documents each Schema Model: its intended use, training, evaluation, and known limitations. It is published in support of customers' compliance obligations as deployers under the EU AI Act and analogous frameworks.

The factual content for each card is drawn from the corresponding technical paper, which remains the authoritative source for benchmark methodology and detailed results. As new Schema Models are released, additional cards will be appended to this page.

Capitalised terms used on this page (including "Schema Models," "Base Model," "Fine-Tuned Checkpoint," and "Customer Data") have the meanings set forth in the Schema Model License and the Data Processing Agreement.

0. Documented models

Model    | Released   | Card version | Status
Schema-1 | April 2026 | 1.1          | Current

1. Schema-1

Released April 2026 · Card version 1.1 · Technical paper: arxiv.org/abs/2605.06290

1.1 Model details

Name: Schema-1
Provider: SchemaLabs, Inc., a Delaware corporation
Version: 1.0
Released: April 2026
Model class: Data Language Model (DLM)
Modality: Tabular (structured) data
Parameters: ~140 million
Architecture: Foundation model (pretrained neural network)
Distribution: Hosted-only via API and Web App; weights not distributed
License: Proprietary
Service status: Beta

1.2 Intended use

Primary uses: tabular inference (classification, prediction); customer-specific fine-tuning to produce dedicated Fine-Tuned Checkpoints (Customer Endpoints); integration as the foundation layer for vertical and agentic AI built on structured data.

Primary users: engineering and data science teams building AI products on structured data.

Out of scope: generative natural language tasks; image, audio, or video; safety-critical real-time decisions without human review; high-risk EU AI Act applications without human oversight; any use prohibited by our Use Policy.

1.3 Architecture

Schema-1 ingests every input table through four parallel pathways, fused into a unified representation:

  • Column semantics: column identifiers and their content
  • Per-column distributional summaries: statistics computed per column from cell values
  • Cell values: raw numeric and categorical cell values
  • Missing value structure: encoding of the pattern of present and absent values
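The four pathways can be illustrated with a minimal featurization sketch in plain Python. This is hypothetical and unrelated to SchemaLabs' actual implementation; the toy table, the specific summary statistics, and all names here are our own assumptions.

```python
from statistics import mean, pstdev

# A toy table: column name -> list of cell values (None = missing).
table = {
    "age":    [34, None, 51, 29],
    "income": [72000, 54000, None, 61000],
    "tier":   ["gold", "silver", "gold", None],
}

# Pathway 1: column semantics (identifiers; a real system may also encode content).
column_names = list(table)

# Pathway 2: per-column distributional summaries over the observed numeric cells.
def summarize(values):
    nums = [v for v in values if isinstance(v, (int, float))]
    if not nums:
        return None  # no numeric content to summarize (e.g. categorical column)
    return {"mean": mean(nums), "std": pstdev(nums), "min": min(nums), "max": max(nums)}

summaries = {name: summarize(vals) for name, vals in table.items()}

# Pathway 3: the raw numeric and categorical cell values themselves.
cells = [vals for vals in table.values()]

# Pathway 4: missing-value structure -- a mask marking which cells are absent.
missing_mask = {name: [v is None for v in vals] for name, vals in table.items()}
```

A real model would embed each of these views and fuse them; the point of the sketch is that the mask in pathway 4 is an input in its own right, not a defect to be repaired.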

When customers fine-tune Schema-1, the Base Model weights remain frozen. Each fine-tuning run produces a customer-specific isolated checkpoint (also referred to as a Customer Endpoint or Model Endpoint) that is the sole model used in that deployment. No fine-tuning job modifies the Base Model. No customer's data or checkpoint is accessible from any other customer's deployment.
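The isolation model above can be sketched abstractly. This is an illustrative toy, not SchemaLabs' implementation: the base weights are read-only, and each fine-tuning run stores only a per-customer delta that is combined with the base at inference time.

```python
from types import MappingProxyType

# Frozen Base Model weights: a read-only mapping no fine-tuning run can mutate.
BASE_WEIGHTS = MappingProxyType({"w1": 0.5, "w2": -1.2})

def fine_tune(customer_data):
    """Produce a customer-specific checkpoint as a delta over the frozen base."""
    # Stand-in for a real training loop: derive some update from customer data.
    delta = {k: 0.01 * len(customer_data) for k in BASE_WEIGHTS}
    return delta  # stored per customer, never merged back into BASE_WEIGHTS

def effective_weights(delta):
    """At inference, a Customer Endpoint combines the base with its own delta."""
    return {k: BASE_WEIGHTS[k] + delta.get(k, 0.0) for k in BASE_WEIGHTS}

ckpt_a = fine_tune(["row"] * 3)   # customer A's isolated checkpoint
ckpt_b = fine_tune(["row"] * 10)  # customer B's isolated checkpoint
```

Each checkpoint is self-contained, so serving customer A never loads customer B's delta, and the shared base is identical (and unmodified) for both.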

Detailed architecture is described in the technical paper (arxiv.org/abs/2605.06290).

1.4 Training data

Schema-1 was trained on approximately 2,307,000 tabular datasets:

  • 2,000,000 synthetic datasets generated from a controlled sector-specific schema covering 10,000 industry sectors
  • 307,000 real-world datasets drawn from public and domain-specific sources

No Customer Data was used to train Schema-1. No personally identifiable information was collected for training. No natural-language text corpora protected by copyright were used.

1.5 Evaluation

Schema-1 has been evaluated across six benchmarks. Full methodology, dataset lists, and per-condition results are in the technical paper.

Benchmark                                        | Schema-1 | Best competitor           | Margin
OpenML-CC18 mean ROC-AUC (18 datasets)           | 0.9849   | 0.9339 (TabPFN+AG)        | +0.0510
Missing data robustness, mean AUC (0 to 70%)     | 0.9196   | 0.8933 (MIRRAMS)          | +0.0263
Tabular imputation, mean NRMSE (lower is better) | 0.163    | 0.235 (Gemini 3.0 Flash)  | −31%
Column-agnostic AUC (no column names)            | 0.9318   | 0.8658 (TabuLa-8B)        | +0.0660
Sector classification top-1 (10,000 classes)     | 91.4%    | 0.01% (random)            | +91.4 pp
Sector classification top-5 (10,000 classes)     | 97.0%    | 0.05% (random)            | +97.0 pp
Sequential fine-tuning retention                 | 97.8%    | 0% (GBDTs: retrain)       | +97.8 pp
Figure: OpenML-CC18 mean ROC-AUC across 18 datasets, 10-fold stratified CV, comparing LightGBM, CatBoost, XGBoost, ASKL2, AutoGluon, TabPFN, TabPFN+AG (0.9339), and Schema-1 (0.9849).

CC18 has been the reference benchmark for tabular methods since 2022. Schema-1 ranks first on every one of the 18 datasets. On the five hardest, performance moves from a 0.71 to 0.88 band into a 0.94 to 0.98 band, a distinct accuracy tier rather than an incremental gain. The 0.0510 gap between Schema-1 and the next-best system (TabPFN+AG) is larger than the entire range spanned by all prior competitors.

Figure: Missing data robustness. Mean ROC-AUC as a function of MCAR missingness rate (0% to 70%) on 15 CC18 datasets, comparing Schema-1, MIRRAMS, and TabPFN-2.5.

Real enterprise data is rarely complete: medical records skip tests, financial systems have dropped fields, sensor archives have gaps. The standard industry response, imputing missing values with column means before prediction, collapses as more data goes missing. Schema-1 declines by 0.0603 ROC-AUC from 0% to 70% missingness, less than one-quarter of XGBoost+Mean's decline. At 70% missingness, Schema-1 (0.8815) outperforms MIRRAMS at 50% missingness (0.8721). Schema-1 does not treat a missing value as an error to repair: the missing-value-structure pathway encodes the pattern of absence itself as a structural signal.
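MCAR ("missing completely at random"), the masking condition used in this benchmark, is straightforward to reproduce for one's own robustness tests. A minimal sketch in plain Python (the paper's exact protocol may differ, e.g. in seeding or per-column rates):

```python
import random

def apply_mcar(rows, rate, seed=0):
    """Replace each cell with None independently with probability `rate` (MCAR)."""
    rng = random.Random(seed)  # fixed seed for reproducible masks
    return [[None if rng.random() < rate else cell for cell in row] for row in rows]

data = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
masked = apply_mcar(data, rate=0.5)
observed = sum(cell is not None for row in masked for cell in row)
```

Sweeping `rate` from 0.0 to 0.7 and re-evaluating at each step reproduces the shape of the robustness curve described above.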

Figure: Tabular imputation, mean NRMSE across 20 real-world datasets and 9 MCAR/MAR/MNAR conditions (lower is better): Schema-1 0.163, Gemini 3.0 Flash 0.235, Claude 4.5 S. 0.237, Mistral D2 0.287, MiMo-V2 0.292, GPT-4.1 Nano 0.296, missForest 0.302, MICE 0.306, kNN 0.327, SAEI 0.372, SoftImpute 0.424, TabPFN 0.448.

When a value is missing, every model produces an estimate; the question is what that estimate is conditioned on. Frontier LLMs condition on world knowledge from internet-scale text. Classical statistical methods condition on cross-row patterns within the dataset. Schema-1 conditions on neither: it learns the joint distributional relationships between columns within the specific dataset at hand. Across 20 real-world datasets and nine missingness conditions, Schema-1's mean reconstruction error is 31% lower than the best LLM and 46% lower than the best classical method. The advantage widens sharply under MNAR, where domain priors offer no traction.
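NRMSE is root mean squared error normalized by a scale of the true values. One common convention, used here purely for illustration, normalizes by the observed range; the paper's exact normalization may differ.

```python
import math

def nrmse(true, pred):
    """Root mean squared error normalized by the range of the true values."""
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(true, pred)) / len(true))
    spread = max(true) - min(true)  # range normalization (one common choice)
    return rmse / spread

# Perfect reconstruction scores 0; errors grow with the normalized deviation.
```

The normalization is what makes scores comparable across datasets with very different value scales, which is why imputation benchmarks report NRMSE rather than raw RMSE.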

Figure: Column-agnostic prediction. Mean ROC-AUC under three column-name conditions (full names, random strings, no names) on 20 OpenML datasets; with no names: Schema-1 0.9318, TabuLa-8B 0.8658, ConTextTab 0.8541.

Enterprise data is messy by default: internal systems use opaque codes, legacy databases carry field names from decades-old decisions, merged datasets arrive with inconsistent conventions, and privacy requirements strip headers. Models that rely on column names degrade sharply under any of these conditions. Schema-1 encodes column semantics as one input pathway among four, not as a dependency. With names completely removed, Schema-1 drops 0.0117 (1.24%); TabuLa-8B drops 0.0709 and ConTextTab 0.0748. Schema-1 without any column names still outperforms both semantics-aware models with full names.

Figure: Sector classification outcomes on 500 held-out datasets (10,000-class task, no column names or metadata): top-1 correct 457 (91.4%); correct in top 2 to 5: 28 (5.6%); not in top 5: 15 (3.0%).

Schema-1 was given 500 real-world datasets it had never seen, with all column names removed, no labels, and no context, and asked to identify the industry sector each came from out of 10,000 possible sectors. It named the correct sector on the first try for 457 of 500 datasets (91.4%), and the correct sector appeared in its top 5 for 485 of 500 (97.0%). Random guessing would succeed at a rate of 1 in 10,000. No prior tabular model has a defined mechanism for this task.

1.6 Limitations

  • Probabilistic outputs: all Schema-1 outputs are probabilistic and include confidence scores. They should not be treated as ground truth.
  • No automatic human review: customers are responsible for implementing human oversight where required.
  • Domain shift: performance on data substantially different from training distribution may be lower than benchmark performance suggests.
  • Hosted-only: Schema-1 is available exclusively through SchemaLabs' hosted API and Web App; dedicated regional deployments are available as a paid option for enterprise customers.
  • Beta status: availability, performance, and feature set may change.
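Since outputs carry confidence scores and human oversight is the customer's responsibility, one common deployer pattern is to gate low-confidence predictions for manual review. A minimal sketch; the threshold value and function names are arbitrary illustrations, not SchemaLabs recommendations:

```python
REVIEW_THRESHOLD = 0.9  # deployment-specific; tune per use case and risk level

def route(prediction, confidence):
    """Auto-accept high-confidence outputs; queue the rest for human review."""
    if confidence >= REVIEW_THRESHOLD:
        return ("auto", prediction)
    return ("human_review", prediction)
```

In high-risk contexts (see sections 1.7 and 1.8) the review queue, not the threshold alone, is what satisfies the oversight obligation.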

1.7 Bias, risks, and fairness

Customers deploying Schema-1 in contexts affecting individuals (employment, lending, insurance, healthcare, education, criminal justice) bear responsibility for:

  • Auditing their fine-tuning data for protected-class proxies and historical bias
  • Testing Schema-1 outputs for disparate impact across protected classes
  • Implementing human oversight for high-stakes decisions
  • Complying with applicable anti-discrimination laws

Use of Schema-1 for illegal discrimination is prohibited under our Use Policy §1.1.

Schema-1 has not been formally evaluated for adversarial robustness against membership inference, model extraction, or adversarial input attacks. Our Use Policy prohibits these attack types. Customers in adversarial-environment deployments should not assume Schema-1 is hardened against such attacks.

1.8 Regulatory context

EU AI Act

Schema-1 itself is a general-purpose Data Language Model for tabular data and is not inherently classified as high-risk. Customer deployments may fall within high-risk categories under Annex III (credit scoring, employment decisions, insurance, healthcare diagnostics, access to essential services). Customers deploying in these contexts bear deployer-level obligations.

SchemaLabs is the Provider of Schema-1. We maintain the technical documentation for the model (this Model Card and the technical paper). As Schema-1 matures and the EU AI Act's high-risk-system phase-in progresses (August 2026), we will expand our provider-level processes accordingly.

For deployer obligations, see our Use Policy §2.

1.9 Contact, citation, and resources

  • Technical questions: [email protected]
  • Compliance and customer documentation: [email protected]
  • Security: [email protected]
  • General: [email protected]

Full technical paper: arxiv.org/abs/2605.06290

Citation

SchemaLabs, Inc. (2026). Data Language Models: A New Foundation Model Class for Tabular Data. arXiv:2605.06290. arxiv.org/abs/2605.06290

This Model Card is published as a transparency commitment to all users of the SchemaLabs Service and in support of customers' regulatory compliance obligations. It is not a contract.
