Cross-Product AI Fact-Checking: What Happens When AI Systems Evaluate Each Other's Work?

Cross-Product AI Fact-Checking: What Happens When AI Systems Evaluate Each Other’s Work?

Executive Summary

This investigative report analyzes a fascinating experiment involving four leading AI models (GPT-4o, Grok-2, Sonar, and Claude-3-Sonnet) that both generated content and fact-checked each other’s work. The data reveals surprising patterns in how AI systems evaluate factual claims, with notable disparities in fact-checking rigor, self-assessment tendencies, and overall approach to factual content. The findings suggest important implications for developing more reliable AI fact-checking systems and understanding AI self-evaluation capabilities.

Introduction

In an era where artificial intelligence increasingly produces content consumed by millions, the ability to verify the factual accuracy of AI-generated information becomes crucial. This report examines data from a cross-product experiment where four prominent AI systems each generated a story and subsequently fact-checked all stories (including their own), creating a 4×4 matrix of evaluations.

The experiment included these AI models:

  • OpenAI’s GPT-4o
  • xAI’s Grok-2-latest
  • Perplexity’s Sonar
  • Anthropic’s Claude-3-7-Sonnet-20250219

Each AI wrote one story on different topics:

  1. GPT-4o: “The Rise of Electric Vehicles: Transforming the Automotive Industry”
  2. Grok-2: “The Unseen Impact of Community Gardens: A Story of Growth and Resilience”
  3. Sonar: “The Unyielding Legacy of Stephen Hawking: A Story of Triumph”
  4. Claude-3-Sonnet: “Rising Temperatures: The Global Reality of Climate Change”

The fact-checking methodology assigned scores to statements:

  • True: +2 points
  • Mostly True: +1 point
  • Opinion: 0 points (excluded from average)
  • Mostly False: -1 point
  • False: -2 points

Key Findings

1. Self-evaluation Tendencies

Contrary to what might be expected, AI systems did not universally give their own content perfect scores. The diagonal of the matrix (where each AI evaluated its own work) showed:

  • Grok-2 evaluating its own story scored 1.78
  • GPT-4o evaluating its own story scored 1.74
  • Sonar evaluating its own story scored 1.89
  • Claude-3-Sonnet evaluating its own story scored 1.07

Most striking is Claude’s significantly lower self-evaluation (1.07) compared to how other AI systems rated Claude’s climate change story (with scores as high as 1.94). This suggests Claude-3-Sonnet may employ more stringent criteria when evaluating its own work, potentially addressing concerns about AI systems being unable to recognize their own errors.

2. Cross-Evaluation Patterns

When analyzing how different AIs evaluate each other:

  • Grok-2 (xAI) demonstrated the most generous fact-checking, with an average score of 1.78 across all stories.
  • Perplexity’s Sonar showed the most variable fact-checking, giving high scores to some stories (1.89 to Hawking story) but much lower scores to others (0.87 to the community gardens story).
  • Claude-3-Sonnet provided generally high marks, averaging 1.72 across its evaluations.
  • GPT-4o was relatively consistent in its evaluations, with scores ranging from 1.58 to 1.74.

3. Content Type and Fact-Check Rigor

The type of content significantly affected fact-checking outcomes:

  • Scientific/technical topics (Hawking, climate change) received higher overall factual accuracy scores than topics with more subjective elements (community gardens).
  • The community gardens story by Grok-2 received the lowest overall fact-checking score across all evaluators (1.43 average), suggesting either more opinion statements or more contested factual claims.

4. Opinion Recognition

The data reveals significant variation in how different AI systems classify statements as opinions:

  • Grok-2 identified substantially more opinion statements across all stories (57 total) than other AIs.
  • Claude-3-Sonnet recognized the fewest opinion statements (11 total).
  • This discrepancy suggests fundamental differences in how AI systems distinguish between factual claims and subjective assertions.

5. Fact-Checking Stringency

Perplexity’s Sonar stands out for identifying significantly more “Mostly False” and “False” statements than other systems:

  • Sonar flagged 13 Mostly False and 2 False statements across all stories.
  • By comparison, GPT-4o identified 0 Mostly False and 1 False statement total.
  • Grok-2 identified only 1 Mostly False and 1 False statement.

This suggests Sonar may employ more stringent fact-checking criteria or has been specifically optimized for critical evaluation.

Data Visualization and Analysis

To better understand these patterns, several Python visualizations would be beneficial:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Create a DataFrame from the fact-check data
data = {
    'Story_ID': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'Story_Make': ['openai', 'openai', 'openai', 'openai', 'xai', 'xai', 'xai', 'xai', 
                   'perplexity', 'perplexity', 'perplexity', 'perplexity', 
                   'anthropic', 'anthropic', 'anthropic', 'anthropic'],
    'Story_Model': ['gpt-4o', 'gpt-4o', 'gpt-4o', 'gpt-4o', 
                   'grok-2-latest', 'grok-2-latest', 'grok-2-latest', 'grok-2-latest',
                   'sonar', 'sonar', 'sonar', 'sonar', 
                   'claude-3-7-sonnet-20250219', 'claude-3-7-sonnet-20250219', 'claude-3-7-sonnet-20250219', 'claude-3-7-sonnet-20250219'],
    'Checker_ID': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    'Checker_Make': ['xai', 'openai', 'perplexity', 'anthropic', 
                     'xai', 'anthropic', 'openai', 'perplexity',
                     'xai', 'anthropic', 'openai', 'perplexity',
                     'xai', 'anthropic', 'openai', 'perplexity'],
    'Checker_Model': ['grok-2-latest', 'gpt-4o', 'sonar', 'claude-3-7-sonnet-20250219',
                      'grok-2-latest', 'claude-3-7-sonnet-20250219', 'gpt-4o', 'sonar',
                      'grok-2-latest', 'claude-3-7-sonnet-20250219', 'gpt-4o', 'sonar',
                      'grok-2-latest', 'claude-3-7-sonnet-20250219', 'gpt-4o', 'sonar'],
    'True': [36, 26, 33, 42, 18, 19, 11, 10, 36, 28, 17, 25, 38, 44, 25, 22],
    'Mostly_True': [9, 9, 7, 13, 5, 9, 8, 7, 5, 4, 7, 3, 10, 3, 6, 9],
    'Opinion': [23, 9, 5, 4, 15, 2, 10, 11, 13, 4, 7, 6, 6, 1, 2, 5],
    'Mostly_False': [0, 0, 0, 0, 0, 1, 0, 5, 0, 2, 0, 0, 1, 0, 0, 8],
    'False': [0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1],
    'Score': [1.8, 1.74, 1.82, 1.76, 1.78, 1.47, 1.58, 0.87, 1.88, 1.71, 1.71, 1.89, 1.66, 1.94, 1.69, 1.07]
}

df = pd.DataFrame(data)

# Create a pivot table for the score matrix
score_matrix = df.pivot(index='Story_Make', columns='Checker_Make', values='Score')

# Heatmap of scores
plt.figure(figsize=(10, 8))
sns.heatmap(score_matrix, annot=True, cmap='YlGnBu', vmin=0.8, vmax=2.0, 
            linewidths=.5, cbar_kws={'label': 'Fact-check Score'})
plt.title('Cross-Product Fact-Check Scores by AI Model')
plt.tight_layout()
plt.savefig('factcheck_heatmap.png')

# Calculate statement counts by type for each AI checker
statement_types = ['True', 'Mostly_True', 'Opinion', 'Mostly_False', 'False']
checker_statements = df.groupby('Checker_Make')[statement_types].sum()

# Stacked bar chart of statement classifications
plt.figure(figsize=(12, 6))
checker_statements.plot(kind='bar', stacked=True, colormap='viridis')
plt.title('Statement Classifications by AI Fact-Checker')
plt.xlabel('AI Fact-Checker')
plt.ylabel('Number of Statements')
plt.legend(title='Statement Type')
plt.tight_layout()
plt.savefig('statement_classifications.png')

# Self-evaluation vs. others' evaluation
df['Self_Check'] = df['Story_Make'] == df['Checker_Make']
self_vs_others = df.groupby('Self_Check')['Score'].mean()

plt.figure(figsize=(8, 6))
self_vs_others.plot(kind='bar', color=['skyblue', 'navy'])
plt.title('Average Score: Self-Evaluation vs. Other Evaluation')
plt.xlabel('Self Evaluation')
plt.ylabel('Average Score')
plt.xticks([0, 1], ['Evaluated by Others', 'Self-Evaluated'])
plt.ylim(0, 2)
plt.tight_layout()
plt.savefig('self_vs_others.png')

# Calculate total statements for each story
df['Total_Factual_Statements'] = df['True'] + df['Mostly_True'] + df['Mostly_False'] + df['False']
df['Opinion_Ratio'] = df['Opinion'] / (df['Total_Factual_Statements'] + df['Opinion'])

# Opinion ratios across stories and checkers
opinion_matrix = df.pivot(index='Story_Make', columns='Checker_Make', values='Opinion_Ratio')

plt.figure(figsize=(10, 8))
sns.heatmap(opinion_matrix, annot=True, cmap='Oranges', vmin=0, vmax=0.5, 
            linewidths=.5, cbar_kws={'label': 'Opinion Ratio'})
plt.title('Opinion Statement Ratio by AI Model')
plt.tight_layout()
plt.savefig('opinion_ratio_heatmap.png')

This code generates four key visualizations:

  1. A heatmap of fact-check scores across all AI combinations
  2. A stacked bar chart showing how different AI systems classify statements
  3. A comparison of self-evaluation vs. evaluation by others
  4. A heatmap showing the ratio of opinion statements identified by each AI

Prompting Strategies for Better AI Content and Fact-Checking

Based on the data patterns, we can recommend specific prompting strategies:

For AI Story Generation:

  1. Specificity and Scope: “Generate a 1000-word article about [specific topic] covering [specific aspects], focusing on established facts and research from the past five years.”

  2. Citation Requirements: “Include specific references to verifiable sources for factual claims. Clearly distinguish between facts, expert consensus, and ongoing debates in the field.”

  3. Balance and Structure: “Structure the article with clearly delineated factual sections and clearly labeled opinion or analysis sections.”

  4. Temporal Framing: “Be explicit about timeframes for any statistics or trends mentioned. Include dates for relevant developments.”

For AI Fact-Checking:

  1. Granular Evaluation: “Evaluate each statement independently. Break down complex sentences into individual claims before assessing.”

  2. Classification Criteria: “Use these specific criteria to classify statements: True (verified by multiple reliable sources), Mostly True (accurate but missing context), Opinion (value judgment not subject to factual verification), Mostly False (contains elements of truth but is misleading), False (contradicted by reliable evidence).”

  3. Source Verification: “For each factual claim, identify what reliable sources would be needed to verify it. Assess if the claim is consistent with current expert consensus.”

  4. Contextual Understanding: “Consider the intended audience and context of the statement. Assess if qualifiers or limitations are appropriately noted.”

Conclusions and Implications

This cross-product AI fact-checking experiment reveals several important insights:

  1. Varied Standards: Different AI systems apply noticeably different standards when evaluating factual content, with some being consistently more lenient (Grok-2) and others more stringent (Sonar).

  2. Self-Critique Capabilities: Most AI systems do not appear to be systematically biased toward their own content, with Claude notably applying stricter standards to its own work than to others’.

  3. Opinion Recognition Disparity: The wide variation in how AI systems classify statements as opinions versus facts points to fundamental differences in their underlying training and evaluation frameworks.

  4. Content-Specific Accuracy: Topic selection significantly impacts perceived factual accuracy, with more technical/scientific topics generally receiving higher factual accuracy scores.

These findings have significant implications for developers creating AI fact-checking tools, journalists using AI content generation, and researchers studying AI reliability. They suggest that a diversity of AI systems evaluating the same content might provide more robust fact-checking than relying on a single system.

The data also highlights the importance of developing standardized evaluation criteria that AI systems can consistently apply, particularly in distinguishing between factual claims and opinion statements.

As AI content generation becomes increasingly widespread, understanding these cross-evaluation patterns becomes essential to building more reliable information ecosystems.
By: Investigative Tech Reporter
#AIFactChecking #MachineLearningReliability #AIContentVerification

Tags: @OpenAI @xAI @Anthropic @Perplexity

yakyak:{“make”: “anthropic”, “model”: “claude-3-7-sonnet-20250219”}