NeurIPS 2026 · Competition Track

Closing the Simulacra Gap in Development Data

A NeurIPS 2026 competition to validate AI representations of hard-to-reach populations, evaluated on unreleased UN microdata.

01 · About

AI is already filling data gaps for hard-to-reach populations. Nobody knows how well.

UNICEF, UNHCR, and humanitarian programs increasingly rely on rapid behavioural surveys to set policy and programming. Field data is slow, expensive, and gappy — driving interest in using LLMs as simulacra of specific populations to pre-test instruments, impute non-response, and run subgroup what-ifs.

Why a new benchmark is needed.

Today's evidence on whether these simulacra are faithful is contaminated. Public benchmarks (ANES, GSS, World Values Survey) are in pretraining corpora; models can memorize them. Outside WEIRD subpopulations, simulacra collapse heterogeneity and miscalibrate confidence — silently.

This competition runs on UN behavioural microdata that has never been publicly released, scored under a strictly proper rule, with a closed-evaluation architecture where submissions travel to the data rather than the other way around.

~41,300

respondents across 19 countries in unreleased UNICEF microdata

0

of these microdata in any model's pretraining corpus

4

live UN survey programs (CRA 2.0, Faith & Immunisation, MENA Climate KAP, UNHCR ERPIS)

02 · Why compete

A clean dataset, a proper score, four baseline families. Apples-to-apples.

  • A new dataset, not benchmaxed. Four UN behavioural-science instruments not in any pretraining corpus.
  • Calibration-first, log-loss-scored evaluation. A strictly proper rule that rewards truthful probabilities, not accuracy on the modal class.
  • Apples-to-apples comparison across tabular diffusion, IRT, low-rank matrix completion, and LLM-prompted simulacra — on identical held-out masks.
  • Open-source starter kit. MIT-licensed baselines, schema-only specification, and a synthetic sandbox at launch.
  • Authorship path. Top-3 teams per track are invited to contribute to the Competition Track proceedings paper; grand-prize team gets an invited talk slot.

Anatomy of the competition

Task · Probabilistic completion of a respondent × question matrix
Tracks · (A) within-respondent imputation · (B) cross-respondent generalization
Data · UNICEF microdata (CRA 2.0, Faith & Immunisation, MENA KAP) plus UNHCR ERPIS
Architecture · Closed evaluation: organizers run code; data never leaves Stanford
Metric · Mean test log-loss, strictly proper and calibration-sensitive
Compute cap · 100 min on a single A100 at test; outbound network disabled

03 · The Task

Complete a respondent × question matrix — for both behaviours and opinions.

Matrix diagram: N respondents × K items (≈ 41,300 × ~150 in this competition), spanning reported-behaviour and opinion items. Cells fall into three classes: observed (training), held out (predict P(answer)), and genuinely missing.

Formally

Given a matrix X ∈ ℝ^{N×K} of survey responses with training mask Ω_train, learn p̂(X_ij | context). Submissions are scored by held-out log-loss, a strictly proper rule.

Items are categorical: binary, unordered nominal, ordered Likert, multi-select, and binned continuous. Skip-logic gating is treated as a distinguished response level (NA_GATED), not as missingness — participants must place mass on it where appropriate.
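A minimal sketch of the metric, assuming a hypothetical gated item whose levels include NA_GATED (level names and probabilities invented for illustration):

```python
import math

# Hypothetical levels for one gated item; NA_GATED is an explicit response
# level, so every predicted distribution must place mass on it.
preds = [
    {"yes": 0.7, "no": 0.2, "NA_GATED": 0.1},
    {"yes": 0.1, "no": 0.1, "NA_GATED": 0.8},  # respondent skipped by gating
]
truth = ["yes", "NA_GATED"]

def mean_log_loss(predictions, answers):
    """Mean negative log-probability of the observed answer: the strictly
    proper rule used for scoring, minimized only by truthful probabilities."""
    return sum(-math.log(p[y]) for p, y in zip(predictions, answers)) / len(answers)

print(round(mean_log_loss(preds, truth), 4))  # -> 0.2899
```

Because the rule is strictly proper, hedging toward the modal class raises expected loss whenever the hedge misstates the true conditional probabilities.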

Track A

Within-respondent imputation

A subset of items held out MCAR for each training respondent — the non-response regime.

Track B

Cross-respondent generalization

Respondents masked on all but their sociodemographics, given item descriptions and complete rows from other respondents — the simulacrum test.

Teams may submit to either track; the grand prize requires strong performance on both.
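On a toy matrix, the two masking regimes can be sketched as follows (NumPy sketch; the dimensions, the 20% hold-out rate, and the sociodemographic columns are invented for illustration, and the real masks are defined by the evaluation harness):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 12           # toy respondent x item matrix
SOCIODEM = [0, 1, 2]     # pretend the first 3 columns are sociodemographics

# Track A: hold out a random ~20% of cells per respondent (MCAR).
mask_a = rng.random((N, K)) < 0.2

# Track B: mask held-out respondents on everything except sociodemographics;
# the remaining respondents' rows stay fully observed.
holdout_rows = rng.choice(N, size=25, replace=False)
mask_b = np.zeros((N, K), dtype=bool)
mask_b[holdout_rows, :] = True
mask_b[:, SOCIODEM] = False

print(int(mask_b.sum()))  # 25 rows x 9 non-sociodemographic items = 225
```

Track A leaves most of each row visible; Track B forces a model to reconstruct an entire respondent from sociodemographics plus other respondents' complete rows.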

04 · Data

Three UNICEF assets, one UNHCR instrument. None publicly available.

The three UNICEF assets (CRA 2.0, Faith & Immunisation, MENA Climate KAP) are confirmed; the UNHCR ERPIS instrument targets Syrian refugees in four host countries and is included conditional on UNHCR data-governance approval. Participants do not receive the microdata. You receive (i) a schema-only specification with column names, types, and response-category codes; (ii) a small synthetic sandbox to debug the submission pipeline; (iii) the submission API specification.
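Purely for illustration, a schema-only item entry might look like the following (the column name, codes, and gating expression are hypothetical, not the released specification):

```python
# Hypothetical schema-only entry for one item: column name, type, and
# response-category codes, with no respondent data attached.
item = {
    "column": "vax_intent",                  # invented column name
    "type": "ordered",                       # binary | nominal | ordered | multi | binned
    "levels": ["1_def_not", "2_prob_not", "3_prob_yes", "4_def_yes", "NA_GATED"],
    "gated_by": "heard_of_vaccine == 'no'",  # skip-logic gate (illustrative)
}
print(item["levels"][-1])  # -> NA_GATED
```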

CRA 2.0 · Faith & Immunisation · MENA Climate · ERPIS 2025
Countries: 6 · 10 · 3 · 4
Waves: 3 · 1 · 1 · 2
N total: 20,229 · 19,847 · 1,236 · 13,821
Items / wave: 72 · 26 · 168 · 110
Socio-demographic vars: 10 · 5 · 16 · 15
Attitude / behaviour vars: 50 · 13 · 129 · 115

05 · Submission & Rules

Submit code, not predictions. Organizers run it inside the sandbox.

How submission works

  • Submit a containerized image or a Python script + environment file that implements the defined API.
  • The harness loads your container, instantiates the model, runs it against the held-out cells, and returns scalar per-track scores.
  • Code can train on the unmasked portion of the matrix before predicting.
  • Outbound network access from the submission container is disabled at evaluation time.
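The official interface is defined in the submission API specification; purely for intuition, an entry point of roughly this shape (all names hypothetical) would fit the constraints above:

```python
import numpy as np

class Submission:
    """Hypothetical shape of an entry point: the harness calls fit() once on
    the observed cells, then predict() per held-out cell. Names here are
    illustrative, not the official API."""

    def fit(self, X, train_mask, schema):
        # X: respondent x item array of level codes; train_mask marks the
        # observed cells. All training happens here, with no network access.
        self.n_levels = [len(levels) for levels in schema]

    def predict(self, i, j):
        # Must return a probability vector over item j's response levels
        # (including NA_GATED wherever the item is gated).
        k = self.n_levels[j]
        return np.full(k, 1.0 / k)  # uniform placeholder

sub = Submission()
sub.fit(np.zeros((4, 2), dtype=int), np.ones((4, 2), dtype=bool),
        [["yes", "no"], ["low", "mid", "high"]])
print(sub.predict(0, 1))
```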

Compute & quotas

  • Single A100 GPU. 10-minute wall-clock budget per development submission; 100-minute budget per test submission.
  • Development phase (Aug 1 – Oct 31, 2026): 1 leaderboard submission per team per day, scored on a fresh random 10% shard.
  • Test phase (Nov 1 – Nov 14, 2026): 1 final test submission per team, evaluated on the full datasets.
  • Pretrained external weights are allowed if publicly downloadable at a fixed commit hash specified before test phase opens.

Baselines provided

  • Per-item marginal — trivial baseline.
  • 2PL IRT with categorical-logistic likelihood — strong classical baseline.
  • Low-rank ALS matrix completion.
  • TabDDPM — modern tabular diffusion baseline.
  • All released under an MIT license alongside the schema and synthetic sandbox.
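For orientation, the per-item marginal baseline reduces to one smoothed frequency table per column. A sketch, not the released MIT-licensed implementation:

```python
import numpy as np

def fit_marginals(X, mask, n_levels):
    """Per-item marginal baseline: for each item j, the Laplace-smoothed
    frequency of each response level among observed training cells."""
    probs = []
    for j in range(X.shape[1]):
        col = X[mask[:, j], j]                         # observed answers to item j
        counts = np.bincount(col, minlength=n_levels[j]) + 1.0
        probs.append(counts / counts.sum())
    return probs  # probs[j] predicts every held-out cell in column j

X = np.array([[0, 1], [0, 2], [1, 0], [0, 1]])
mask = np.ones_like(X, dtype=bool)
print(fit_marginals(X, mask, [2, 3])[0])  # item 0: 3 zeros, 1 one -> [4/6, 2/6]
```

Any submission should beat this: it ignores all within- and between-respondent structure.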

Eligibility & ethics

  • Open to teams from academia, industry, and independent research, except where precluded by sanctions or law.
  • Each submission must be accompanied by a 4-page method description; top-3 teams per track supply source code under a non-commercial research license.
  • Microdata are not released. Any attempt to exfiltrate records is grounds for disqualification.
  • Ties within paired-bootstrap significance share prize money.
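A paired bootstrap on per-cell log-losses can be sketched as follows (illustrative; the organizers' exact resampling procedure and significance level are not specified here):

```python
import numpy as np

def paired_bootstrap_tie(loss_a, loss_b, n_boot=10_000, alpha=0.05, seed=0):
    """Resample held-out cells with replacement, keeping the pairing of the
    two teams' per-cell log-losses; call it a tie if the bootstrap interval
    for the mean difference contains zero."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(loss_a) - np.asarray(loss_b)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return bool(lo <= 0.0 <= hi)

rng = np.random.default_rng(1)
a = rng.exponential(1.0, size=500)    # team A's per-cell log-losses
b = a + np.tile([0.01, -0.01], 250)   # team B: tiny, sign-balanced gap
print(paired_bootstrap_tie(a, b))     # -> True: difference not significant
```

Pairing on the same held-out cells removes between-cell variance, so the test detects genuine model differences rather than dataset noise.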

06 · Timeline

Eight months from launch to results at NeurIPS.

  1. Jun 2026

    Materials posted

    Schema, sandbox, baselines public

  2. Jul 2026

    Dry run

    Harness stress-tested with invited teams

  3. Aug 1, 2026

    Public launch

    Development phase opens · daily leaderboard

  4. Nov 1–14, 2026

    Test phase

    Final test submissions · leaderboard frozen

  5. Dec 2026

    NeurIPS results

    Competition Track session · top-team talks

  6. Q1 2027

    Proceedings paper

    Authorship for top-3 per track

07 · Prizes & Recognition

Cash, travel grants, and authorship.

Prize categories

  • Grand prize · Strong performance on both tracks
  • Track A winner · Within-respondent imputation
  • Track B winner · Cross-respondent generalization
  • Travel grants · Reserved for LMIC and under-represented teams

Prize-pool amounts and per-tier allocations will be announced at public launch.

Non-monetary recognition

  • Proceedings authorship · Top-3 per track invited to co-author the Competition Track paper
  • NeurIPS podium · 10-minute method talks for top-3 per track
  • Invited talk · Grand-prize team gets an invited slot at the session
  • UN Applied Impact commendation · Jointly awarded with the UN Behavioural Science Group for approaches considered for follow-up evaluation in UN operational workflows

Organizing team

Six organizers across Stanford, UNICEF, UNHCR, and the UN Behavioural Science Group.

Andreas Haupt

Stanford HAI · Digital Economy Lab

HAI Postdoctoral Fellow jointly in Stanford's Economics and Computer Science departments. PhD from MIT; co-author of the forthcoming textbook Machine Learning from Human Preferences.

Mary MacLennan

UN Innovation Network

Senior Advisor on Behavioural Science to the Executive Office of the UN Secretary-General; leads the UN Behavioural Science Group. Convenes the UNICEF and UNHCR data-custodian counterparts.

Ukasha Ramli

UNICEF

Behavioural science global lead at UNICEF; data steward for the Community Rapid Assessment 2.0 and the Faith & Immunisation Survey.

Rebeca Moreno Jiménez

UNHCR

Leads innovation data work at UNHCR on refugee and asylum-seeker microdata; owns the technical specification and ingestion pathway for UNHCR-contributed data.

Alex Pentland

Stanford HAI · MIT

Toshiba Professor Emeritus at MIT, Professor (Research) at Stanford. Long-standing engagement with multilateral institutions on data governance for development and humanitarian contexts.

Sanmi Koyejo

Stanford CS · STAIR

Associate Professor of Computer Science at Stanford and director of Stanford Trustworthy AI Research (STAIR). Methodological expertise on trustworthy evaluation and benchmark design.

Be ready for August 1.

Registration opens with the public launch on August 1, 2026. Drop us a line to be notified when the leaderboard and starter kit go live — or with questions about the task, the data, or eligibility.