Andreas Haupt
Stanford HAI · Digital Economy Lab
HAI Postdoctoral Fellow jointly in Stanford's Economics and Computer Science departments. PhD from MIT; co-author of the forthcoming textbook Machine Learning from Human Preferences.
NeurIPS 2026 · Competition Track
A NeurIPS 2026 competition to validate AI representations of hard-to-reach populations, evaluated on unreleased UN microdata.
01 · About
UNICEF, UNHCR, and other humanitarian programs increasingly rely on rapid behavioural surveys to set policy and programming. Field data is slow and expensive to collect and often has gaps, which drives interest in using LLMs as simulacra of specific populations to pre-test instruments, impute non-response, and run subgroup what-ifs.
Today's evidence on whether these simulacra are faithful is contaminated. Public benchmarks (ANES, GSS, World Values Survey) are in pretraining corpora, so models may have memorized them. Outside WEIRD (Western, Educated, Industrialized, Rich, Democratic) subpopulations, simulacra collapse heterogeneity and miscalibrate confidence, and they do so silently.
This competition runs on UN behavioural microdata that has never been publicly released, scored under a strictly proper rule, with a closed-evaluation architecture where submissions travel to the data rather than the other way around.
~41,300
respondents across 19 countries in unreleased UNICEF microdata
0
of these microdata in any model's pretraining corpus
4
live UN survey programs (CRA 2.0, Faith & Immunisation, MENA Climate KAP, UNHCR ERPIS)
02 · Why compete
03 · The Task
N respondents × K items (≈ 41,300 × ~150 in this competition)
Given $X \in \mathbb{R}^{N \times K}$ of survey responses with mask $\Omega_{\text{train}}$, learn $\hat{p}(X_{ij} \mid \text{context})$. Score by held-out log-loss, a strictly proper scoring rule.
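One way to write the score down, as a sketch rather than the official definition: the symbol $\Omega_{\text{test}}$ for the set of held-out cells and the pooled average over all held-out cells are assumptions made here for illustration; the harness may weight items or surveys differently.

```latex
\mathcal{L}(\hat{p})
  = -\frac{1}{|\Omega_{\text{test}}|}
    \sum_{(i,j) \in \Omega_{\text{test}}}
    \log \hat{p}\left(X_{ij} = x_{ij} \mid \text{context}\right)
```

Here $x_{ij}$ is the response actually observed for respondent $i$ on item $j$. The log score is strictly proper because its expected value is minimized only by reporting the true conditional distribution, so hedging or overconfidence both cost points on average.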
Items are categorical: binary, unordered nominal, ordered Likert, multi-select, and binned continuous. Skip-logic gating is treated as a distinguished response level (NA_GATED), not as missingness, so participants must place probability mass on it where appropriate.
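The following minimal sketch illustrates how log-loss interacts with the NA_GATED level. The function name `held_out_log_loss`, the dictionary layout, the respondent and item identifiers, and the `eps` clipping are all hypothetical choices for illustration; the actual submission format and any clipping policy are set by the submission API specification.

```python
import math

NA_GATED = "NA_GATED"  # level name taken from the task description


def held_out_log_loss(predictions, truths, eps=1e-12):
    """Mean negative log-probability assigned to the observed held-out levels.

    predictions: {(respondent_id, item_id): {level: probability}}
    truths:      {(respondent_id, item_id): observed level}
    eps is an illustrative floor that keeps the loss finite when a forecast
    puts zero mass on the observed level; the real harness may differ.
    """
    total = 0.0
    for key, observed in truths.items():
        dist = predictions[key]
        mass = sum(dist.values())
        if abs(mass - 1.0) > 1e-6:
            raise ValueError(f"distribution for {key} sums to {mass}, not 1")
        total += -math.log(max(dist.get(observed, 0.0), eps))
    return total / len(truths)


# Hypothetical held-out cells: r2 never saw item q7 because of skip logic.
truths = {("r1", "q7"): "yes", ("r2", "q7"): NA_GATED}

# A forecast that anticipates the gate for r2.
predictions = {
    ("r1", "q7"): {"yes": 0.7, "no": 0.2, NA_GATED: 0.1},
    ("r2", "q7"): {"yes": 0.05, "no": 0.05, NA_GATED: 0.9},
}
loss = held_out_log_loss(predictions, truths)

# A forecast that ignores the gate entirely for r2.
predictions_no_gate = {
    ("r1", "q7"): {"yes": 0.7, "no": 0.2, NA_GATED: 0.1},
    ("r2", "q7"): {"yes": 0.5, "no": 0.5},
}
loss_no_gate = held_out_log_loss(predictions_no_gate, truths)

print(loss, loss_no_gate)
```

In this toy case the second forecast is penalized heavily on the gated cell, which is the practical meaning of treating NA_GATED as a response level rather than as missingness.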
Track A
A subset of items is held out missing completely at random (MCAR) for each training respondent. This is the non-response regime.
Track B
Target respondents are masked on every item except their sociodemographics; models are given item descriptions and complete rows from other respondents. This is the simulacrum test.
Teams may submit to either track; the grand prize requires strong performance on both.
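The two masking regimes can be sketched as follows. This is only an illustration under stated assumptions: the function names, the independent per-cell draw used for MCAR, the `holdout_frac` parameter, and the toy respondent and item identifiers are hypothetical, and the organizers' actual masking procedure is not described on this page.

```python
import random


def track_a_mask(respondent_ids, item_ids, holdout_frac, seed=0):
    """Track A sketch: hold out each (respondent, item) cell independently
    with the same probability, which is one simple instance of MCAR."""
    rng = random.Random(seed)
    return {
        (r, j)
        for r in respondent_ids
        for j in item_ids
        if rng.random() < holdout_frac
    }


def track_b_mask(target_ids, item_ids, sociodemographic_items):
    """Track B sketch: for each target respondent, hide every item that is
    not a sociodemographic variable."""
    visible = set(sociodemographic_items)
    return {(r, j) for r in target_ids for j in item_ids if j not in visible}


# Toy identifiers, not taken from any real instrument.
respondents = ["r1", "r2", "r3"]
items = ["age", "region", "q1", "q2", "q3"]
sociodem = ["age", "region"]

mask_a = track_a_mask(respondents, items, holdout_frac=0.3, seed=42)
mask_b = track_b_mask(["r3"], items, sociodem)

print(sorted(mask_a))
print(sorted(mask_b))
```

In Track A every respondent keeps some observed answers, so a model can lean on within-row correlations. In Track B the target rows carry only sociodemographics, so all signal must come from item descriptions and other respondents' rows.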
04 · Data
The three UNICEF assets (CRA 2.0, Faith & Immunisation, MENA Climate KAP) are confirmed; the UNHCR ERPIS instrument targets Syrian refugees in four host countries and is included conditional on UNHCR data-governance approval. Participants do not receive the microdata. Instead, they receive (i) a schema-only specification with column names, types, and response-category codes; (ii) a small synthetic sandbox to debug the submission pipeline; and (iii) the submission API specification.
| | CRA 2.0 | Faith & Immunisation | MENA Climate | ERPIS 2025 |
|---|---|---|---|---|
| Countries | 6 | 10 | 3 | 4 |
| Waves | 3 | 1 | 1 | 2 |
| N total | 20,229 | 19,847 | 1,236 | 13,821 |
| Items / wave | 72 | 26 | 168 | 110 |
| Socio-demographic vars | 10 | 5 | 16 | 15 |
| Attitude / behaviour vars | 50 | 13 | 129 | 115 |
05 · Submission & Rules
06 · Timeline
Jun 2026
Materials posted
Schema, sandbox, baselines public
Jul 2026
Dry run
Harness stress-tested with invited teams
Aug 1, 2026
Public launch
Development phase opens · daily leaderboard
Nov 1–14, 2026
Test phase
Final test submissions · leaderboard frozen
Dec 2026
NeurIPS results
Competition Track session · top-team talks
Q1 2027
Proceedings paper
Authorship for top-3 per track
07 · Prizes & Recognition
Prize categories
Prize-pool amounts and per-tier allocations will be announced at public launch.
Non-monetary recognition
Organizing team
Stanford HAI · Digital Economy Lab
HAI Postdoctoral Fellow jointly in Stanford's Economics and Computer Science departments. PhD from MIT; co-author of the forthcoming textbook Machine Learning from Human Preferences.
UN Innovation Network
Senior Advisor on Behavioural Science to the Executive Office of the UN Secretary-General; leads the UN Behavioural Science Group. Convenes the UNICEF and UNHCR data-custodian counterparts.
UNICEF
Behavioural science global lead at UNICEF; data steward for the Community Rapid Assessment 2.0 and the Faith & Immunisation Survey.
UNHCR
Leads innovation data work at UNHCR over refugee and asylum-seeker microdata; owns the technical specification and ingestion pathway for UNHCR-contributed data.
Stanford HAI · MIT
Toshiba Professor Emeritus at MIT, Professor (Research) at Stanford. Long-standing engagement with multilateral institutions on data governance for development and humanitarian contexts.
Stanford CS · STAIR
Associate Professor of Computer Science at Stanford and director of Stanford Trustworthy AI Research (STAIR). Methodological expertise on trustworthy evaluation and benchmark design.
Registration opens with the public launch on August 1, 2026. Drop us a line to be notified when the leaderboard and starter kit go live — or with questions about the task, the data, or eligibility.