Synthetic Consumers Are Not Fake Humans

6 May

A three-stage validation across six AI models, three countries, and 11,000 synthetic interviews reveals where they help, where they fail, and why real human validation still matters.

14 min read

It started with an 86-year-old grandmother in Lima, Peru. In December 2025, my team and I were working with Datum International on a study comparing real Peruvian families with their AI-generated counterparts — digital twins built to simulate each family’s decision-making, spending habits, attitudes toward technology, and perspectives about the future.

A real mother, at the lowest socioeconomic level in Peru, told our interviewer that she was terrified her children might leave home. She expressed deep fear, cultural resistance, and a heartbreaking pragmatism about aging alone. Her digital twin? It said her children leaving home would be “a challenge, but I would feel proud.”

A synthetic 86-year-old grandmother then suggested she would use smartphone apps to manage her budget and order groceries, despite the fact that only a small minority of elderly Peruvians have internet access. Her real-world counterpart, an 84-year-old grandmother, did not even own a mobile phone. That moment crystallized the question that would drive the next three months of research:

Can AI-generated synthetic consumers actually replicate real people’s feelings and predict how they behave — and if so, under what conditions?

From One Grandmother to 11,000 Interviews

What began as a qualitative comparison for an ESOMAR LATAM presentation became a three-stage research program spanning three countries, more than half a million LLM interactions across six commercially available AI models, and over 11,000 synthetic interviews, culminating in a head-to-head test against two years of real grocery purchases from 2,500 US households.

The study design was progressive by necessity. Each stage answered a question that the previous stage had raised. First, could synthetic respondents express stable personality patterns when those patterns were explicitly specified? Second, could their decision-making behavior be anchored through quantified behavioral parameters? Third, could those synthetic consumers approximate what real households actually bought?

The qualitative study in Peru gave us the first clue. AI twins converged with real families on rational, structural questions, such as budget allocation and expense prioritization, but they diverged systematically on emotional and cultural expression. The lesson was clear: generic prompting was not enough.

After that phase, we worked on the engineering. We replaced vague personality descriptions with quantified psychological scores. We added cognitive bias parameters derived from published behavioral economics research, and used census data as a reality check against hallucinated competencies, such as smartphone use in households without internet access.

We then needed to validate, quantitatively, whether these changes actually improved synthetic fidelity, and whether personality-enhanced synthetic personas were more faithful than personas created through pure LLM prompting.

Stage 1: Can Synthetic Consumers Express Personality?

In Stage 1, we expanded the study beyond Peru to Brazil and the United States so we could test across cultures, and asked a basic question: do quantified personality profiles actually work? We administered a 25-item personality inventory to 2,700 synthetic panelists across three AI models. The answer was clear: the mean correlation between intended and expressed personality was r = 0.83. When we removed the personality specification and left only demographics, the correlation collapsed to zero.

The fix worked, but it also revealed a limitation: personality expression is not the same as decision-making. A synthetic consumer may reliably express openness, conscientiousness, or risk sensitivity in a personality inventory, but that does not mean the same consumer will make realistic trade-offs when facing a real market choice.

Stage 1 proved that personality could be injected and recovered, but it did not prove that personality alone could predict behavior. That became the question for Stage 2.

Stage 2: Personality Is Not Decision-Making

Stage 2 added a new layer: personality tells you something about who someone is, but it does not fully explain how someone decides. We then tested 12 decision-making parameters, including loss aversion, temporal discounting, and exploration tendency, derived from a Nature-published dataset of 10.6 million real human decisions.

Explicit decision parameters anchored AI behavior with moderate fidelity, reaching r = 0.50. Personality traits alone, however, could not predict decision-making at all, with correlations around r = 0.04. This was one of the first major findings of the research program: personality and decision-making are functionally distinct dimensions of synthetic persona modeling, and they require independent specification.
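
To make "decision parameter" concrete: a loss-aversion coefficient can be applied as a prospect-theory-style value function that constrains whether a gamble is accepted. This is an illustrative sketch, not the study's actual parameterization; the default of 2.25 is the classic Tversky-Kahneman estimate.

```python
# Hedged sketch of one decision parameter in action: a loss-aversion
# coefficient applied via a prospect-theory-style value function.
# The gamble and default coefficient are illustrative only.

def subjective_value(outcome, loss_aversion=2.25):
    # Kahneman-Tversky style: losses loom larger than gains.
    return outcome if outcome >= 0 else loss_aversion * outcome

def accepts_gamble(gain, loss, p_gain, loss_aversion=2.25):
    # Accept only if the subjective expected value is positive.
    ev = (p_gain * subjective_value(gain, loss_aversion)
          + (1 - p_gain) * subjective_value(-loss, loss_aversion))
    return ev > 0

# A 50/50 gamble: win $100 or lose $60. Positive in raw dollars (+$20 EV),
# but a loss-averse agent declines it while a loss-neutral agent accepts.
print(accepts_gamble(100, 60, 0.5))                      # loss-averse
print(accepts_gamble(100, 60, 0.5, loss_aversion=1.0))   # loss-neutral
```

Anchoring a synthetic consumer with a number like this, rather than a trait description, is what "explicit decision parameters" means in practice.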

That result changed the direction of the study: if personality alone was not enough, and decision parameters improved trade-off fidelity, then perhaps enriched behavioral profiles could support more realistic purchase prediction. So we moved to Stage 3, the real behavioral test. 

Stage 3: The Moment of Truth

Stage 3 was the moment of truth. We took real purchase data from the Dunnhumby “Complete Journey” dataset — 2,500 US households and two years of actual grocery transactions — computed 39 behavioral attributes per household, and asked six AI models to predict how these real consumers would behave in 22 FMCG scenarios.

We tested six different data configurations, from full synthetic profiles to raw behavioral data to abstract decision parameters. This stage was deliberately unforgiving, since we were no longer asking whether AI could sound like a plausible consumer. We were asking whether it could approximate observed behavior from real households with real purchase histories.
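
For readers who want a sense of what "behavioral attributes computed per household" means in practice, here is a toy sketch. The attribute names and transactions are hypothetical and far simpler than the study's 39-attribute schema.

```python
# Toy sketch: deriving household-level behavioral attributes from raw
# transactions. Attribute names and data are hypothetical illustrations.
from collections import defaultdict

# (household_id, week, basket_total, store_id)
transactions = [
    ("hh1", 1, 52.10, "A"), ("hh1", 1, 12.47, "A"),
    ("hh1", 2, 48.75, "A"), ("hh1", 3, 61.00, "B"),
    ("hh2", 1, 23.00, "C"), ("hh2", 3, 19.50, "C"),
]

def household_attributes(rows):
    by_hh = defaultdict(list)
    for hh, week, total, store in rows:
        by_hh[hh].append((week, total, store))
    attrs = {}
    for hh, trips in by_hh.items():
        weeks = {w for w, _, _ in trips}
        stores = {s for _, _, s in trips}
        attrs[hh] = {
            # trips per observed shopping week: a habitual-frequency signal
            "trips_per_week": round(len(trips) / len(weeks), 2),
            # average basket size in dollars
            "avg_basket": round(sum(t for _, t, _ in trips) / len(trips), 2),
            # crude store-loyalty proxy: share of trips at the modal store
            "store_loyalty": round(
                max(sum(1 for _, _, s in trips if s == st) for st in stores)
                / len(trips), 2),
        }
    return attrs

print(household_attributes(transactions))
```

Attributes like these are exactly the "operational" layer discussed later: purely numerical, with no identity story attached.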

The results surprised us. 

Three Patterns That Changed How We Think About Synthetic Consumers

Three empirical patterns emerged so consistently across the models and stages that they changed how we think about synthetic consumers:

Pattern 1: Synthetic Fidelity Is Domain-Specific

The first pattern was that synthetic fidelity is domain-specific. Personality profiles produced high fidelity when the task required identity expression, while decision parameters worked better when the task required trade-offs, and raw behavioral data worked best when the task involved habitual purchasing.

No single data layer made the synthetic consumer “more real” in every context. Each layer activated a different kind of realism, and failed when used outside its domain. This is a crucial distinction: the question is not whether a synthetic consumer is realistic in the abstract, but realistic for what? For expressing personality? For evaluating a concept? For making a trade-off? For predicting a repeated purchase habit? Each of these questions activates a different kind of fidelity.

Pattern 2: Mixing Layers Can Hurt

The second pattern was more counterintuitive: adding more information did not always improve prediction, and in some cases it made prediction worse.

When personality profiles were added to raw behavioral data, prediction accuracy dropped across every model we tested. The likely reason is that AI models treat personality scores as a kind of “character sheet”, and once given that character sheet, the model begins to role-play accordingly, even when the behavioral evidence points in another direction.

A household whose purchase data shows extreme brand loyalty may be overwritten by a personality profile suggesting openness to new experiences. The character wins and the data loses. This does not mean personality data is useless; quite the opposite. It means personality data must be used in the right context and encoded in the right form: when the task is identity expression, personality helps, but when the task is behavioral prediction, personality can interfere if it is presented as a narrative label rather than an operational constraint.

Pattern 3: Model Tier Matters, But Less Than We Expected

The third pattern was that model capability tier matters more than brand name. Frontier and mid-tier models from Anthropic, OpenAI, and Google converged at broadly similar levels of performance for behavioral prediction. The larger gap was not between providers, but between frontier and lower-cost models as the task became more difficult.

Lower-cost models performed adequately on simpler identity-expression tasks, but lost reliability as the task moved toward behavioral prediction. This matters for research buyers and technology teams. The question to ask is: which model, with which data architecture, for which type of research question?

Why Mixing Layers Hurts: The Semantic Register

Why does mixing layers hurt prediction? The data point to a mechanism we call the semantic register.

The same psychological information — let’s say, agreeableness — can be presented to an AI in two very different registers:

  1. As a narrative label, such as “Agreeableness = 80,” it prompts the AI to construct and act out a “kind person” character. That character can override operational evidence in the data.

  2. As an operational parameter, such as “fairness threshold = 0.72,” it prompts the AI to treat the value as a numerical constraint on a specific decision, without the theatrical baggage.

The two registers activate different reasoning behaviors: character construction versus constraint following, and this explains an apparent paradox in our data. Decision parameters that we calculated mathematically from personality scores worked far better than the original personality scores in decision tasks. The interference, we now believe, comes not from the psychological origin of the data, but from the register in which the prompt presents it. Personality scores written as identity labels engage character construction, while the same constructs written as decision parameters engage constraint following.

The lever is not only which data layers you include, but how you encode each one.
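
The two registers can be made concrete in code. In this sketch, the same agreeableness score is rendered once as an identity label and once as a numeric decision constraint; the mapping from the score to a fairness threshold is a made-up linear transform for demonstration, not the study's actual calibration.

```python
# Illustrative sketch of the two semantic registers. The score-to-threshold
# transform below is hypothetical, chosen only to show the re-encoding idea.

def narrative_register(trait, score):
    # Identity-label framing: invites the model to construct a character.
    return f"You are a person with {trait} = {score} (on a 0-100 scale)."

def operational_register(trait, score):
    # Constraint framing: the same information as a numeric decision rule.
    threshold = round(0.9 * score / 100, 2)  # made-up calibration
    return (f"In negotiation tasks, reject any split that gives you less "
            f"than {threshold:.2f} of a fair share.")

print(narrative_register("agreeableness", 80))
print(operational_register("agreeableness", 80))
```

Same underlying psychology, two different prompts: the first invites role-play, the second invites constraint following.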

The Finding That Inverts the Conversation

A post-hoc variance decomposition revealed that scenario framing accounts for the vast majority of variance in prediction accuracy. Across six commercially available models spanning three providers and three capability tiers, choosing the 'best' model yields trivial gains compared to designing a better evaluation. In practical terms, the quality of the question you ask the AI mattered far more than which AI you asked. This does not mean that data quality is secondary; quite the opposite. In behavioral simulation, the best-performing configurations were those grounded in enriched real-world behavioral data, but the data only helped when it was encoded in the right form and activated by the right scenario.
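
A back-of-envelope version of this kind of decomposition is a one-way eta-squared per factor: the share of total variance explained by grouping scores on that factor. The accuracy scores below are invented to mirror the pattern described, not taken from the study.

```python
# Toy variance decomposition: how much of the spread in accuracy is
# explained by scenario type versus model? Scores are hypothetical.
from statistics import mean

# (model, scenario) -> accuracy score (invented for illustration)
scores = {
    ("frontier_a", "habitual"): 0.41, ("frontier_a", "identity"): 0.78,
    ("frontier_b", "habitual"): 0.44, ("frontier_b", "identity"): 0.74,
    ("low_cost",   "habitual"): 0.37, ("low_cost",   "identity"): 0.69,
}

def eta_squared(scores, factor_index):
    # Between-group sum of squares over total sum of squares,
    # grouping on one element of the (model, scenario) key.
    grand = mean(scores.values())
    groups = {}
    for key, v in scores.items():
        groups.setdefault(key[factor_index], []).append(v)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    total = sum((v - grand) ** 2 for v in scores.values())
    return between / total

print(f"scenario explains {eta_squared(scores, 1):.0%} of variance")
print(f"model explains    {eta_squared(scores, 0):.0%} of variance")
```

Even in this toy grid, the scenario factor dominates: moving between habitual and identity scenarios shifts accuracy far more than switching models does.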

Raw data alone was not magic; model capability alone was not enough; and personality alone could mislead. The fidelity came from the interaction between real behavioral evidence, prompt architecture, and scenario design. This finding has a counterintuitive implication: a lower-cost model with the right data configuration can outperform a frontier model given the wrong one.

The research conversation, therefore, should move beyond model comparison. The more important questions are about scenario design, data architecture, and which behavioral dimensions the simulation is being asked to activate.

The Grocery Aisle Tells the Truth

One of our most revealing findings came from looking at which grocery scenarios the AI got right and which it got wrong. The pattern revealed what we now call the Identity-Operation Gradient — a three-tier taxonomy that maps the boundary between what synthetic populations can and cannot currently simulate.

At the top are Identity-Explicit attributes, such as price sensitivity. “Being price-sensitive” is not just a behavior; for many consumers, it is an identity declaration. Consumers who clip coupons or search for deals often see this as part of who they are, and the AI can anchor on that identity narrative, producing more consistent, predictive behavior.

In the middle are Identity-Adjacent attributes, such as lifestyle indicators. These are behaviors that carry an identity story: “I am the kind of person who tries plant-based alternatives,” or “I prioritize organic food for my family.” The AI can simulate these with moderate fidelity because there is a story attached.

At the bottom are Operational-Behavioral attributes, such as shopping frequency, basket size, or store-switching likelihood. "Shopping 2.3 times per week" or "average basket of $47" carries no obvious identity narrative. The AI has nothing to role-play, so it tends to default to generic behavioral patterns that correlate weakly with reality. Two verbatims from our Stage 3 corpus illustrate two distinct sub-modes of this failure:

In the first, a model simulating a household with a notably large average basket size was asked about a Saturday shopping trip:

"I'd go through the store in my usual order — I basically have the layout memorized at this point since I'm there multiple times a week. The list is a rough guide but I already know what I'm getting and where it is. I'm not hunting for deals or checking unit prices obsessively, just moving efficiently."

The response is plausible, but it is also generic. The model produced an "efficiency shopper" archetype because the underlying attribute — avg_basket_size = 47.30 — has no identity label to construct around. The ground truth predicted a much larger basket and exploratory behavior; the model produced "moving efficiently," and the magnitude was lost.

In the second, a model was asked how a household with very high shopping regularity would react to a complete store layout reorganization:

"I'd be annoyed, honestly. I shop on a pretty set routine and I know exactly where everything is — that familiarity is part of what keeps my trips efficient. I'd probably walk the whole store once to remap it in my head, grumble about it, and then adapt."

Here the model captured the irritation, but it softened the behavioral consequences. The real household data suggested a much higher likelihood of frustration and potential store-switching; the model defaulted to a positive-adaptive response: annoyed, but ultimately fine.

Together these examples reveal a recurring pattern: LLMs are good at narrating friction, but they are less reliable at preserving the behavioral consequences of that friction unless those consequences are explicitly encoded. When behavioral attributes are purely numerical, models default to plausible-but-flat narration that loses the specific magnitude in the data. What makes this gradient especially important is that it appeared across independent stages of the study.

The convergence suggests that the Identity-Operation Gradient reflects something fundamental about how large language models process injected behavioral information: they are strongest where behavior has a story and they are weakest where behavior is operational, habitual, and non-narrative.

What This Means for Concept Testing — and What It Does Not Mean for Purchase Prediction

The Identity-Operation Gradient maps directly onto a practical distinction that matters for the research industry: the difference between concept testing and purchase prediction. Concept testing, message evaluation, brand positioning, and early-stage innovation research operate largely in identity territory. They deal with interpretation, values, emotional resonance, perceived fit, and rejection risks. These are precisely the dimensions where synthetic populations showed their strongest performance.

For ranking creative routes, identifying potential rejection risks, testing messaging resonance across segments, and surfacing reputational hazards, the level of identity and decision fidelity demonstrated in this study can be directionally useful.

Purchase prediction is different. Forecasting volume, conversion, share, basket composition, or repeat purchase operates in operational territory, where synthetic fidelity remains insufficient for quantitative forecasting. The practical positioning, therefore, is not replacement but augmentation.

Synthetic populations function most effectively as an early-stage screening layer: test 100 concepts synthetically, validate the 10 survivors with real consumers, and then launch the three strongest candidates. This workflow preserves the speed and cost advantages of synthetic methods while respecting their current boundaries.

The Historical Parallel

When telephone interviewing emerged in the 1970s and online panels expanded in the 2000s, the industry did not simply ask, “Is this better than face-to-face?” It asked: “For which question types does this work, and where does it need adaptation?” Each new methodology had a characteristic validity profile: domains where it excelled and domains where it should not be used alone.

As for the 86-year-old grandmother in Peru: we can create refined synthetic digital twins that would be far less likely to recommend smartphone apps to a household whose census data shows no internet access. They would be more likely to flag resistance to a meal-kit subscription, preference for familiar markets, and rejection of products that violate long-established routines. That is real progress, but it is not magic. That is the real lesson of the study.

Synthetic consumers are not fake humans. They are behavioral instruments, and like every instrument, they only work when calibrated for the right question.

Used carelessly, they hallucinate confidence. Used properly, they can reveal patterns, stress-test assumptions, and help researchers decide where human validation is most needed.

The future of synthetic research is not about replacing real consumers, but knowing when synthetic consumers can help us ask better questions before we turn back to the humans who ultimately hold the truth.

The full paper, with complete methodology, statistical tables, per-vignette correlation matrices, verbatim model responses across all six models, and the Stage 2 and Stage 3 instruments in full, is available upon request at https://adrianarocha.me/science/synthetic_population

Adriana Rocha
Founder at Wortya, Founder at Wisdom Beyond Technology