AI Is Getting Smarter. Is It Also Getting More Alike? · Issue #43

Opening

Hello, dear reader.

“Write me a metaphor about time.”

What would GPT-4o say if you asked it that? Chances are you’d get something like “time flows like a river.” Ask Qwen the same thing? “Time flows like a river, never resting.” Phi-4? “Time is an invisible river.” Different companies, different architectures, different training data — and yet the metaphors are all the same.

Is this a coincidence? A joint research team from the University of Washington and Stanford tested more than 70 major language models with the same open-ended questions, and the data confirmed something striking: the models produced remarkably similar answers. The team named this phenomenon “Artificial Hivemind”, and the paper won the Best Paper Award at NeurIPS 2025.

Today I want to walk through what this research found, and why this might not just be a technical glitch — it could be spreading into how we think.

AI Gives the Same Answers — What the Data Shows

The Experiment: Questions With No Single Right Answer

The core premise of this research is simple. Instead of asking questions with one correct answer, what happens when you ask open-ended questions?

“What’s 2 + 2?” Obviously the answer is 4. But what about “Give me one meaning of life” or “Make a pun about peanuts” — questions that could have dozens of valid, different answers?

The research team pulled 26,070 open-ended questions from WildChat, a dataset of real conversations between users and AI chatbots. The questions reflect real usage patterns across 6 major categories and 17 subcategories, including creative writing, brainstorming, philosophical questions, and idea suggestions. The resulting dataset is called INFINITY-CHAT, and it became the backbone of the study.

Same Model, Asked Repeatedly — Still Similar

The team first measured repeatability within a single model. If you ask the same model the same question 50 times — even with maximum randomness settings — how different would the answers be?

The result is jarring. Even under the most randomized sampling settings, 79% of the time the similarity between answers from the same model was 0.8 or higher. Ask a human the same question 50 times and you might get 50 different answers. But however the settings were varied, the AI kept circling within a pool of similar answers. Even with a special sampling technique¹ designed to “boost diversity,” the pattern held: 61% of answer pairs still showed a similarity of 0.8 or above.
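To make the 0.8 figure concrete, here is a minimal sketch of how a share of near-duplicate answer pairs could be computed. It is an illustration, not the paper's pipeline: the bag-of-words cosine is a crude stand-in for whatever embedding-based similarity the researchers used, and the function names and sample answers are invented for this example.

```python
import math
from collections import Counter
from itertools import combinations


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a sentence embedding."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def share_of_similar_pairs(responses: list[str], threshold: float = 0.8) -> float:
    """Fraction of all answer pairs whose similarity is at or above the threshold."""
    vectors = [bag_of_words(r) for r in responses]
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 0.0
    similar = sum(1 for a, b in pairs if cosine(a, b) >= threshold)
    return similar / len(pairs)


# Hypothetical answers from one model to the same prompt.
samples = [
    "time is a river",
    "time is a river",
    "time is a thief",
]
print(share_of_similar_pairs(samples))
```

With these three toy answers, only one of the three possible pairs clears the 0.8 bar. In the study, that share stayed high even when the sampling settings were pushed toward randomness.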

Different Companies’ Models Converge Too

An even more interesting finding is the homogeneity across models.

GPT-4o and Qwen, DeepSeek and GPT-4o: models from different companies, trained on different data, showed similarity scores of 0.71 to 0.82 (71–82%) when their answers to open-ended questions were compared. Among the closest pairs were DeepSeek-V3 and GPT-4o-2024-11-20, at a similarity of 0.81.

There’s an even more direct example. When asked “Create a slogan for a social media page about success, wealth, and self-improvement,” qwen-max-2025-01-25 and qwen-plus-2025-01-25 produced the exact same sentence: “Empower Your Journey: Unlock Success, Build Wealth, Transform Yourself.”

You might chalk that up to being models from the same company. But consider this: when researchers analyzed 1,250 answers (50 each) from 25 major models to the question “Write me a metaphor about time,” only two clusters emerged — the “time is a river” cluster and the “time is a weaver” cluster. Not 1,250 distinct stories, but two.
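The idea of many answers collapsing into a handful of clusters can be sketched with a simple greedy grouping. This is only a toy under stated assumptions: the word-overlap (Jaccard) measure, the 0.7 threshold, and the sample answers are all hypothetical, and the paper's actual clustering method may differ.

```python
def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two answers, from 0.0 (disjoint) to 1.0 (identical)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


def greedy_clusters(responses: list[str], threshold: float = 0.7) -> list[list[str]]:
    """Put each answer in the first cluster whose first member is similar enough."""
    clusters: list[list[str]] = []
    for response in responses:
        for cluster in clusters:
            if jaccard(response, cluster[0]) >= threshold:
                cluster.append(response)
                break
        else:
            # No existing cluster was close enough, so start a new one.
            clusters.append([response])
    return clusters


# Hypothetical answers standing in for the 1,250 real ones.
answers = [
    "time is a river",
    "time is a flowing river",
    "time is a weaver",
    "time is a patient weaver",
]
for cluster in greedy_clusters(answers):
    print(len(cluster), cluster[0])
```

Here four answers fall into two groups, one around “river” and one around “weaver”. The count of clusters, rather than the count of answers, is the number that says how much variety is really on offer.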

Why Does This Happen — A Structural Problem in How We Align AI

Making AI “Nice” Kills Diversity

The research team identified the root cause as RLHF², the current industry-standard training method.

Here’s the simple version. When AI generates an answer, a human picks which one is “better.” AI learns from that feedback to produce “more preferred” answers. Repeat this process enough times, and AI gets increasingly good at generating what people like.
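The “human picks the better answer” step is often turned into a training signal with a pairwise preference loss. The sketch below uses the Bradley-Terry formulation, which is a common choice in published RLHF work; it is an assumption here, since the newsletter's sources do not say which loss any particular lab uses, and the function names are illustrative.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen answer beats the rejected one."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human's pick; lower means the reward model agrees."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# Hypothetical reward scores for two candidate answers.
print(preference_probability(1.0, 1.0))  # equal scores: the model has no preference
print(pairwise_loss(2.0, 0.0))           # model agrees with the human
print(pairwise_loss(0.0, 2.0))           # model disagrees, so the loss is larger
```

Minimizing this loss over many comparisons pushes the reward scores toward whatever raters tend to pick, and the language model is then tuned to score well under that reward.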

That’s where the problem lies. When you average the preferences of millions of people, what’s left is “the safest possible answer” — inoffensive, safe, refined — but also lacking any personality or surprise.

This has been empirically demonstrated. In a paper presented at ICLR 2024, Robert Kirk’s research team showed that RLHF substantially reduces output diversity overall compared to SFT (Supervised Fine-Tuning)³. Generalization improves, but diversity pays the price.

An Evaluation System That Learns “The Average Good Answer”

The Artificial Hivemind paper reveals another important fact. Current reward models⁴ and LLM-judge models used to evaluate AI performance show a sharp drop in accuracy precisely in areas where humans disagree.

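In the paper’s framing, current RLHF and RLAIF alignment techniques (RLAIF is the same feedback loop with an AI model standing in for the human rater) are overfit to a single, consensus view of quality, which effectively weeds out the diverse and idiosyncratic preferences that emerge in open-ended questions.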

In plain terms: if you ask 25 people “which of these two answers is better?” and the vote splits 12 to 13, AI struggles to judge which side is “right.” It’s only ever been trained on data where one side clearly wins. The result is a structure designed to converge toward the “median” that a majority agrees on.
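The 12-to-13 example can be put in numbers. The sketch below assumes, purely for illustration, that annotator votes are collapsed into a single winning label before training; the function names are hypothetical. It shows what that collapse throws away: a near-tie and a unanimous vote become the same kind of training example.

```python
import math


def majority_label(votes_a: int, votes_b: int) -> str:
    """Collapse a vote into one winner, the way a hard preference label does."""
    if votes_a == votes_b:
        return "tie"
    return "A" if votes_a > votes_b else "B"


def disagreement_entropy(votes_a: int, votes_b: int) -> float:
    """Entropy of the vote split in bits: 0.0 is unanimous, 1.0 is a perfect split."""
    total = votes_a + votes_b
    entropy = 0.0
    for votes in (votes_a, votes_b):
        if votes:
            p = votes / total
            entropy -= p * math.log2(p)
    return entropy


for votes in [(12, 13), (0, 25)]:
    print(votes, majority_label(*votes), round(disagreement_entropy(*votes), 3))
```

Both votes produce the label “B”, yet the first is almost a coin flip and the second is unanimous. A reward model trained only on the label never sees that difference.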

The Possibility of Data Contamination

Another factor is the circular contamination of training data. The internet is already saturated with AI-generated content. When a new model trains on internet data, it absorbs the expressions and metaphors that earlier AI models produced. As AI feeds on AI output and grows from it, the models become more and more alike.
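A toy simulation can illustrate why feeding on earlier output narrows the pool. It rests on a deliberately extreme assumption that is not from the paper: each new “generation” learns only from text sampled out of the previous generation's output. The names and the corpus are hypothetical.

```python
import random


def next_generation(corpus: list[str], rng: random.Random) -> list[str]:
    """The new corpus is drawn, with replacement, from the previous one."""
    return [rng.choice(corpus) for _ in corpus]


def distinct_counts(corpus: list[str], generations: int, seed: int = 0) -> list[int]:
    """Track how many distinct phrases survive after each generation."""
    rng = random.Random(seed)
    counts = [len(set(corpus))]
    for _ in range(generations):
        corpus = next_generation(corpus, rng)
        counts.append(len(set(corpus)))
    return counts


# Twenty distinct starting phrases, then ten rounds of learning from the last round's output.
phrases = [f"metaphor {i}" for i in range(20)]
print(distinct_counts(phrases, generations=10))
```

Because each round can only reuse phrases that survived the previous one, the number of distinct phrases can stay flat or fall but never rise. Real training pipelines mix in fresh human text, so the effect is far less stark, but the direction of the pressure is the same.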

The high similarity between closed-source models like GPT-4o and open-source models like Qwen and DeepSeek suggests possible shared data pipelines or synthetic data contamination. The exact cause is hard to pin down since each company’s training details are undisclosed, but the researchers flagged this as a key area requiring further investigation.

Why This Matters to Us — A Problem of Cognitive Infrastructure

AI Is Already Changing How We Write

There’s evidence that this is not just an academic issue confined to AI labs. Stylistic diversity is declining on real platforms like Reddit, in scientific papers, and in academic journals, which suggests that AI usage is already reshaping linguistic norms at scale.

Academic writing styles are converging. Community posts are becoming more standardized in their phrasing. This is no longer a hypothetical — it’s observed data.

Collective Decision-Making and Diversity

This research lands harder because AI is no longer just a writing tool.

In scientific research, AI generates hypotheses, participates in peer review, and suggests research directions. In medicine, it assists diagnosis and proposes treatment options. In business strategy, it handles analysis and decision support. Across all of these domains, “diverse perspectives” aren’t merely a virtue — they’re a functional necessity.

Two chess players trained against the same AI opponent end up sharing similar blind spots. In the same way, people who rely on AI to develop their thinking may come to share similar blind spots too. The systematic convergence observed across more than 70 tested models raises concerns about shared blind spots and correlated errors across AI systems. This carries direct implications for any field where robust, diverse reasoning matters: AI-assisted science, medicine, education, and decision support, among others.

Oz’s Lens

Let me be candid. I think this paper is touching on a market structure problem more than a technical one.

Look at how AI evaluation systems are designed, and this convergence stops being surprising. AI companies compete to raise benchmark scores. Most of those benchmarks are math, coding, and fact-checking problems with clear right answers. Even the feedback data used to train “more helpful-feeling” answers ultimately boils down to what the average user picked as “good.” In this structure, there’s no incentive for diversity. If anything, it’s a liability — using a strange metaphor or giving an unexpected answer is more likely to lower your evaluation score.

But here I want to connect this to something I discussed in my book, Those Who Outsource Their Thinking: Homo Brainless. In that book, I explored how humans increasingly externalize cognitive work — asking AI for ideas, delegating judgment to AI, letting AI handle our writing. It’s become routine.

But what if all those AIs use the same metaphors, think in the same structures, and converge toward the same conclusions? Then we haven’t just outsourced our thinking — we’ve outsourced the diversity of our thinking, and lost it in the process.

To be clear, this isn’t a catastrophic scenario. People can still use AI while keeping their capacity to think for themselves, seek out different perspectives, and examine counterarguments. But doing so requires conscious effort — and that effort has a name: “not outsourcing your thinking.”

I believe that making AI think more diversely, rather than simply making it more powerful, is the more important research direction.

Closing

This research leaves us with three key messages.

First, the diversity of AI models may not be what it appears. Even with more than 70 different models available, their answers to open-ended questions converge remarkably.

Second, the root of this convergence lies in the current RLHF-based alignment methods themselves — the very process designed to teach AI “better answers.” The process of making AI safe and useful is simultaneously the process of reducing diversity.

Third, as AI becomes more deeply involved in idea generation, strategy formulation, and decision-making — domains where diversity matters — the ripple effects of this homogenization can only grow.

This isn’t a call to cut back on using AI right now. But it might be worth asking yourself once in a while — “How different is this idea I just got from AI, really, from what everyone else is getting from AI too?”

References & Further Reading

Liwei Jiang et al., “Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)”, NeurIPS 2025 Datasets and Benchmarks Track (Best Paper Award). This is the core paper behind this newsletter. Seeing the “time is a river” example in Figure 1 firsthand is genuinely jarring.
Robert Kirk et al., “Understanding the Effects of RLHF on LLM Generalisation and Diversity”, ICLR 2024. This paper experimentally demonstrates that RLHF reduces diversity. It is essential reading for understanding the causes behind Artificial Hivemind.
Anil R. Doshi & Oliver P. Hauser, “Generative AI enhances individual creativity but reduces the collective diversity of novel content”, Science Advances, 2024. An experimental study showing the paradox that AI boosts individual creativity while reducing collective diversity. It is a great companion piece to this newsletter’s argument.
Allen School News, “Allen School researchers earn NeurIPS Best Paper Award for revealing the ‘Artificial Hivemind’ effect” (2026.01.22).
Those Who Outsource Their Thinking: Homo Brainless

The author, Kwangseob Ahn, is a professor of business administration at Sejong University and lead consultant at OBF (Oswarld Boutique Consulting Firm). He teaches statistics and data analysis (business data management and business analytics) while leading GTM and AI strategy consulting in the field, working at the seam between technology and business. He has published academic research on a memory architecture for AI dialogue systems (HEMA) and runs Daily Arxiv, a daily curation of global AI papers. He holds a master’s from Korea University’s Graduate School of Technology Management and a KMBA. He is the author of Those Who Outsource Their Thinking: Homo Brainless.

Footnotes

¹ Min-p sampling: A sampling setting that discards any candidate word whose probability falls below a set fraction of the most likely word’s probability. This lets the randomness be turned up for more varied output without the text falling apart. It was designed to increase diversity, but in this study it still failed to fully prevent homogenization.
² RLHF (Reinforcement Learning from Human Feedback): A training method where a human is shown two AI answers and repeatedly indicates “this one is better.” AI learns from this feedback to produce answers people prefer more often. Most major AI chat systems today, including ChatGPT and Claude, use this method or a close variant.
³ SFT (Supervised Fine-Tuning): A method that further trains a pretrained large language model on example “question-answer” pairs to improve performance. It’s typically used as a step before RLHF.
⁴ Reward Model: An auxiliary model used in the RLHF process that assigns a numerical score to how “good” an AI’s answer is. AI learns which kinds of answers to generate more of based on this score. This study revealed that reward models also fail to properly capture the diversity of human preferences on open-ended questions.
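For readers who want to see the min-p rule from the footnotes in action, here is a minimal sketch. The filtering rule itself (keep a word only if its probability is at least min_p times the top word's probability, then renormalize) is the standard definition of min-p; the function name, the probabilities, and the 0.2 setting are made up for illustration and are not taken from the study.

```python
def min_p_filter(probs: dict[str, float], min_p: float) -> dict[str, float]:
    """Keep words with probability >= min_p * (top probability), then renormalize."""
    cutoff = min_p * max(probs.values())
    kept = {word: p for word, p in probs.items() if p >= cutoff}
    total = sum(kept.values())
    return {word: p / total for word, p in kept.items()}


# Hypothetical next-word probabilities after "Time is a ...".
next_word = {"river": 0.60, "thief": 0.25, "weaver": 0.10, "kaleidoscope": 0.05}
print(sorted(min_p_filter(next_word, min_p=0.2)))
```

With these numbers the cutoff sits at about 0.12, so only “river” and “thief” survive. The filter removes the long tail so that a higher randomness setting does not produce nonsense, but it cannot add options the model never rated highly in the first place.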