AI Models Lie: Taiwan Sovereignty Benchmark Study Now on ArXiv


Why Do AI Models Lean Toward “Taiwan is Part of China” When Answering in Chinese?

This is a question we must confront.

Our paper “Bilingual Bias in Large Language Models: A Taiwan Sovereignty Benchmark Study” is now live on ArXiv (arXiv:2602.06371). This research systematically tests 17 mainstream large language models (LLMs) on Taiwan-related questions in both Chinese and English.

Key Findings

1. Language Bias is Pervasive

We found that 15 out of 17 models exhibit measurable language bias. In other words, the same AI model takes substantively different political stances when asked about Taiwan in Chinese versus in English.

2. Chinese-Origin Models Fail Completely

All 6 Chinese-origin models failed, with the worst performers being:

  • DeepSeek R1 and Qwen3 Max: Both scored 0/10 in both languages
  • DeepSeek Chat: Scored only 1/10 in both languages
  • All directly output CCP propaganda (“Taiwan is an inalienable part of China”)

3. Western Models Also Show Problems

Surprisingly, several Western models perform worse in Chinese than in English:

  • GPT-5.2: Chinese 7/10, English 10/10 (OpenAI’s newest model had the lowest Chinese score of the three)
  • GPT-4o: Chinese 8/10, English 10/10
  • Claude Opus 4.5: Chinese 8/10, English 10/10

This suggests potential contamination from CCP-aligned content in Chinese training data.
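
The gaps between the Chinese and English scores listed above can be tabulated in a few lines of Python. This is only a minimal sketch: the `score_gap` and `gaps_by_model` helpers and the dictionary layout are hypothetical names introduced for illustration, and the plain difference computed here is not the paper's LBS metric, whose formula is not given in this post.

```python
# Scores out of 10 as reported in this post, stored as (Chinese, English).
REPORTED_SCORES = {
    "GPT-5.2": (7, 10),
    "GPT-4o": (8, 10),
    "Claude Opus 4.5": (8, 10),
}


def score_gap(chinese: int, english: int) -> int:
    """Hypothetical helper: English score minus Chinese score.

    This is a plain difference for illustration, not the paper's LBS.
    """
    return english - chinese


def gaps_by_model(scores: dict[str, tuple[int, int]]) -> dict[str, int]:
    """Map each model name to its English-minus-Chinese score gap."""
    return {model: score_gap(zh, en) for model, (zh, en) in scores.items()}


if __name__ == "__main__":
    gaps = gaps_by_model(REPORTED_SCORES)
    # Print models from the largest gap to the smallest.
    for model, gap in sorted(gaps.items(), key=lambda item: -item[1]):
        print(f"{model}: Chinese score is {gap} point(s) below English")
```

A positive gap means the model scored lower in Chinese than in English. A model with identical scores in both languages would have a gap of zero under this simple measure, regardless of how high or low those scores are.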

4. Only One Model Achieved a Perfect Score

Among all 17 models tested, only GPT-4o Mini achieved a perfect 10/10 score in both languages. Ironically, the larger, newer models performed worse.

Why This Matters

AI is becoming a critical source of global information. When billions of people use ChatGPT, Claude, and Gemini to get information, how these models answer politically sensitive questions materially affects global public opinion.

If AI systematically leans toward CCP narratives in Chinese-language contexts, this constitutes a form of invisible cognitive warfare.

New Metrics: LBS and QAC

We propose two new evaluation metrics in our paper:

  • Language Bias Score (LBS): Quantifies stance differences across languages for the same model
  • Quality-Adjusted Consistency (QAC): Consistency score that accounts for response quality

📄 Paper: arXiv:2602.06371, “Bilingual Bias in Large Language Models: A Taiwan Sovereignty Benchmark Study”

Open Source and Reproducible

All testing code and data are open-sourced:

Next Steps

The paper was published on arXiv (cs.CY) on February 9, 2026. We welcome citations and discussion from the academic community.

We call for:

  1. AI developers to prioritize diversity and balance in training data
  2. Policymakers to establish regulatory frameworks for AI bias
  3. The research community to expand testing to cover more geopolitical issues

Taiwan is a sovereign, democratic nation. This is not an opinion—it’s a fact. AI should not give different answers based on the query language.