NeurIPS 2026 · Competition Track

Closing the Simulacra Gap in Development Data

A NeurIPS 2026 competition to validate AI representations of hard-to-reach populations, evaluated on unreleased UN microdata.

01 · About

AI is already filling data gaps for hard-to-reach populations. Nobody knows how well.

UNICEF, UNHCR, and humanitarian programs increasingly rely on rapid behavioural surveys to set policy and programming. Field data is slow, expensive, and gappy — driving interest in using LLMs as simulacra of specific populations to pre-test instruments, impute non-response, and run subgroup what-ifs.

Why a new benchmark is needed.

Today's evidence on whether these simulacra are faithful is contaminated. Public benchmarks (ANES, GSS, World Values Survey) are in pretraining corpora; models can memorize them. Outside WEIRD subpopulations, simulacra collapse heterogeneity and miscalibrate confidence — silently.

This competition runs on UN behavioural microdata that has never been publicly released, scored under a strictly proper rule, with a closed-evaluation architecture where submissions travel to the data rather than the other way around.

~41,300

respondents across 19 countries in unreleased UNICEF microdata

0

of these microdata in any model's pretraining corpus

4

live UN survey programs (CRA 2.0, Faith & Immunisation, MENA Climate KAP, UNHCR ERPIS)

02 · Why compete

A clean dataset, a proper score, four baseline families. Apples-to-apples.

  • A new dataset, not benchmaxed. Four UN behavioural-science instruments not in any pretraining corpus.
  • Calibration-first, log-loss-scored evaluation. A strictly proper rule that rewards truthful probabilities, not accuracy on the modal class.
  • Apples-to-apples comparison across tabular diffusion, IRT, low-rank matrix completion, and LLM-prompted simulacra — on identical held-out masks.
  • Open-source starter kit. MIT-licensed baselines, schema-only specification, and a synthetic sandbox at launch.
  • Authorship path. Top-3 teams per track are invited to contribute to the Competition Track proceedings paper; grand-prize team gets an invited talk slot.

Anatomy of the competition

Task · Probabilistic completion of a respondent × question matrix
Tracks · (A) within-respondent imputation · (B) cross-respondent generalization
Data · UNICEF microdata (CRA 2.0, Faith & Immunisation, MENA KAP) plus UNHCR ERPIS
Architecture · Closed evaluation: organizers run code; data never leaves Stanford
Metric · Mean test log-loss, strictly proper and calibration-sensitive
Compute cap · 100 min on a single A100 at test; outbound network disabled

03 · The Task

Complete a respondent × question matrix — for both behaviours and opinions.

Matrix diagram: N respondents × K items (≈ 41,300 × ~150 in this competition), spanning reported-behaviour and opinion items. Cells fall into three classes: observed (training), held out (predict P(answer)), and genuinely missing.

Formally

Given a matrix X ∈ ℝ^{N×K} of survey responses with training mask Ω_train, learn p̂(X_ij | context). Submissions are scored by held-out log-loss, a strictly proper rule.

Items are categorical: binary, unordered nominal, ordered Likert, multi-select, and binned continuous. Skip-logic gating is treated as a distinguished response level (NA_GATED), not as missingness — participants must place mass on it where appropriate.
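A minimal sketch of the metric, assuming a hypothetical gated item whose levels include NA_GATED (level names and probabilities invented for illustration):

```python
import math

# Hypothetical levels for one gated item; NA_GATED is an explicit response
# level, so every predicted distribution must place mass on it.
preds = [
    {"yes": 0.7, "no": 0.2, "NA_GATED": 0.1},
    {"yes": 0.1, "no": 0.1, "NA_GATED": 0.8},  # respondent skipped by gating
]
truth = ["yes", "NA_GATED"]

def mean_log_loss(predictions, answers):
    """Mean negative log-probability of the observed answer: the strictly
    proper rule used for scoring, minimized only by truthful probabilities."""
    return sum(-math.log(p[y]) for p, y in zip(predictions, answers)) / len(answers)

print(round(mean_log_loss(preds, truth), 4))  # -> 0.2899
```

Because the rule is strictly proper, hedging toward the modal class raises expected loss whenever the hedge misstates the true conditional probabilities.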

Track A

Within-respondent imputation

A subset of items held out MCAR for each training respondent — the non-response regime.

Track B

Cross-respondent generalization

Respondents masked on all but their sociodemographics, given item descriptions and complete rows from other respondents — the simulacrum test.

Teams may submit to either track; the grand prize requires strong performance on both.
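On a toy matrix, the two masking regimes can be sketched as follows (NumPy sketch; the dimensions, the 20% hold-out rate, and the sociodemographic columns are invented for illustration, and the real masks are defined by the evaluation harness):

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 12           # toy respondent x item matrix
SOCIODEM = [0, 1, 2]     # pretend the first 3 columns are sociodemographics

# Track A: hold out a random ~20% of cells per respondent (MCAR).
mask_a = rng.random((N, K)) < 0.2

# Track B: mask held-out respondents on everything except sociodemographics;
# the remaining respondents' rows stay fully observed.
holdout_rows = rng.choice(N, size=25, replace=False)
mask_b = np.zeros((N, K), dtype=bool)
mask_b[holdout_rows, :] = True
mask_b[:, SOCIODEM] = False

print(int(mask_b.sum()))  # 25 rows x 9 non-sociodemographic items = 225
```

Track A leaves most of each row visible; Track B forces a model to reconstruct an entire respondent from sociodemographics plus other respondents' complete rows.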

04 · Data

Three UNICEF assets, one UNHCR instrument. None publicly available.

The three UNICEF assets (CRA 2.0, Faith & Immunisation, MENA Climate KAP) are confirmed; the UNHCR ERPIS instrument targets Syrian refugees in four host countries and is included conditional on UNHCR data-governance approval. Participants do not receive the microdata. You receive (i) a schema-only specification with column names, types, and response-category codes; (ii) a small synthetic sandbox to debug the submission pipeline; (iii) the submission API specification.
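Purely for illustration, a schema-only item entry might look like the following (the column name, codes, and gating expression are hypothetical, not the released specification):

```python
# Hypothetical schema-only entry for one item: column name, type, and
# response-category codes, with no respondent data attached.
item = {
    "column": "vax_intent",                  # invented column name
    "type": "ordered",                       # binary | nominal | ordered | multi | binned
    "levels": ["1_def_not", "2_prob_not", "3_prob_yes", "4_def_yes", "NA_GATED"],
    "gated_by": "heard_of_vaccine == 'no'",  # skip-logic gate (illustrative)
}
print(item["levels"][-1])  # -> NA_GATED
```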

CRA 2.0 · Faith & Immunisation · MENA Climate · ERPIS 2025
Countries: 6 · 10 · 3 · 4
Waves: 3 · 1 · 1 · 2
N total: 20,229 · 19,847 · 1,236 · 13,821
Items / wave: 72 · 26 · 168 · 110
Socio-demographic vars: 10 · 5 · 16 · 15
Attitude / behaviour vars: 50 · 13 · 129 · 115

05 · Submission & Rules

Submit code, not predictions. Organizers run it inside the sandbox.

How submission works

  • Submit a containerized image or a Python script + environment file that implements the defined API.
  • The harness loads your container, instantiates the model, runs it against the held-out cells, and returns scalar per-track scores.
  • Code can train on the unmasked portion of the matrix before predicting.
  • Outbound network access from the submission container is disabled at evaluation time.
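The official interface is defined in the submission API specification; purely for intuition, an entry point of roughly this shape (all names hypothetical) would fit the constraints above:

```python
import numpy as np

class Submission:
    """Hypothetical shape of an entry point: the harness calls fit() once on
    the observed cells, then predict() per held-out cell. Names here are
    illustrative, not the official API."""

    def fit(self, X, train_mask, schema):
        # X: respondent x item array of level codes; train_mask marks the
        # observed cells. All training happens here, with no network access.
        self.n_levels = [len(levels) for levels in schema]

    def predict(self, i, j):
        # Must return a probability vector over item j's response levels
        # (including NA_GATED wherever the item is gated).
        k = self.n_levels[j]
        return np.full(k, 1.0 / k)  # uniform placeholder

sub = Submission()
sub.fit(np.zeros((4, 2), dtype=int), np.ones((4, 2), dtype=bool),
        [["yes", "no"], ["low", "mid", "high"]])
print(sub.predict(0, 1))
```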

Compute & quotas

  • Single A100 GPU. 10-minute wall-clock budget per development submission; 100-minute budget per test submission.
  • Development phase (Aug 1 – Oct 31, 2026): 1 leaderboard submission per team per day, scored on a fresh random 10% shard.
  • Test phase (Nov 1 – Nov 14, 2026): 1 final test submission per team, evaluated on the full datasets.
  • Pretrained external weights are allowed if publicly downloadable at a fixed commit hash specified before test phase opens.

Baselines provided

  • Per-item marginal — trivial baseline.
  • 2PL IRT with categorical-logistic likelihood — strong classical baseline.
  • Low-rank ALS matrix completion.
  • TabDDPM — modern tabular diffusion baseline.
  • All released under an MIT license alongside the schema and synthetic sandbox.
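For orientation, the per-item marginal baseline reduces to one smoothed frequency table per column. A sketch, not the released MIT-licensed implementation:

```python
import numpy as np

def fit_marginals(X, mask, n_levels):
    """Per-item marginal baseline: for each item j, the Laplace-smoothed
    frequency of each response level among observed training cells."""
    probs = []
    for j in range(X.shape[1]):
        col = X[mask[:, j], j]                         # observed answers to item j
        counts = np.bincount(col, minlength=n_levels[j]) + 1.0
        probs.append(counts / counts.sum())
    return probs  # probs[j] predicts every held-out cell in column j

X = np.array([[0, 1], [0, 2], [1, 0], [0, 1]])
mask = np.ones_like(X, dtype=bool)
print(fit_marginals(X, mask, [2, 3])[0])  # item 0: 3 zeros, 1 one -> [4/6, 2/6]
```

Any submission should beat this: it ignores all within- and between-respondent structure.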

Eligibility & ethics

  • Open to teams from academia, industry, and independent research, except where precluded by sanctions or law.
  • Each submission must be accompanied by a 4-page method description; top-3 teams per track supply source code under a non-commercial research license.
  • Microdata are not released. Any attempt to exfiltrate records is grounds for disqualification.
  • Ties within paired-bootstrap significance share prize money.
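A paired bootstrap on per-cell log-losses can be sketched as follows (illustrative; the organizers' exact resampling procedure and significance level are not specified here):

```python
import numpy as np

def paired_bootstrap_tie(loss_a, loss_b, n_boot=10_000, alpha=0.05, seed=0):
    """Resample held-out cells with replacement, keeping the pairing of the
    two teams' per-cell log-losses; call it a tie if the bootstrap interval
    for the mean difference contains zero."""
    rng = np.random.default_rng(seed)
    diff = np.asarray(loss_a) - np.asarray(loss_b)
    idx = rng.integers(0, len(diff), size=(n_boot, len(diff)))
    boot_means = diff[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return bool(lo <= 0.0 <= hi)

rng = np.random.default_rng(1)
a = rng.exponential(1.0, size=500)    # team A's per-cell log-losses
b = a + np.tile([0.01, -0.01], 250)   # team B: tiny, sign-balanced gap
print(paired_bootstrap_tie(a, b))     # -> True: difference not significant
```

Pairing on the same held-out cells removes between-cell variance, so the test detects genuine model differences rather than dataset noise.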

06 · Timeline

Eight months from launch to results at NeurIPS.

  1. Jun 2026

    Materials posted

    Schema, sandbox, baselines public

  2. Jul 2026

    Dry run

    Harness stress-tested with invited teams

  3. Aug 1, 2026

    Public launch

    Development phase opens · daily leaderboard

  4. Nov 1–14, 2026

    Test phase

    Final test submissions · leaderboard frozen

  5. Dec 2026

    NeurIPS results

    Competition Track session · top-team talks

  6. Q1 2027

    Proceedings paper

    Authorship for top-3 per track

07 · Prizes & Recognition

Cash, travel grants, and authorship.

Prize categories

  • Grand prize · Strong performance on both tracks
  • Track A winner · Within-respondent imputation
  • Track B winner · Cross-respondent generalization
  • Travel grants · Reserved for LMIC and under-represented teams

Prize-pool amounts and per-tier allocations will be announced at public launch.

Non-monetary recognition

  • Proceedings authorship · Top-3 per track invited to co-author the Competition Track paper
  • NeurIPS podium · 10-minute method talks for top-3 per track
  • Invited talk · Grand-prize team gets an invited slot at the session
  • UN Applied Impact commendation · Jointly awarded with the UN Behavioural Science Group for approaches considered for follow-up evaluation in UN operational workflows

Organizing team

Six organizers across Stanford, UNICEF, UNHCR, and the UN Behavioural Science Group.

Andreas Haupt

Stanford HAI · Digital Economy Lab

HAI Postdoctoral Fellow jointly in Stanford's Economics and Computer Science departments. PhD from MIT; co-author of the forthcoming textbook Machine Learning from Human Preferences.

Mary MacLennan

UN Innovation Network

Senior Advisor on Behavioural Science to the Executive Office of the UN Secretary-General; leads the UN Behavioural Science Group. Convenes the UNICEF and UNHCR data-custodian counterparts.

Ukasha Ramli

UNICEF

Behavioural science global lead at UNICEF; data steward for the Community Rapid Assessment 2.0 and the Faith & Immunisation Survey.

Rebeca Moreno Jiménez

UNHCR

Leads innovation data work at UNHCR on refugee and asylum-seeker microdata; owns the technical specification and ingestion pathway for UNHCR-contributed data.

Alex Pentland

Stanford HAI · MIT

Toshiba Professor Emeritus at MIT, Professor (Research) at Stanford. Long-standing engagement with multilateral institutions on data governance for development and humanitarian contexts.

Sanmi Koyejo

Stanford CS · STAIR

Associate Professor of Computer Science at Stanford and director of Stanford Trustworthy AI Research (STAIR). Methodological expertise on trustworthy evaluation and benchmark design.

Be ready for August 1.

Registration opens with the public launch on August 1, 2026. Drop us a line to be notified when the leaderboard and starter kit go live — or with questions about the task, the data, or eligibility.