AI Cross-Product Results

Key Points

  • The results suggest OpenAI’s GPT-4o is likely the best at storytelling, as its story received the highest average fact-check score (1.50) from the other AIs.

  • It seems likely that xAI’s Grok-2 is the best at fact-checking, with the smallest average deviation (0.068) from the consensus scores.

  • The evidence leans toward storytelling and fact-checking being separate skills, as the best storyteller (GPT-4o) is not the best fact-checker (Grok-2).


Table of Results

Below is the 4x4 table of fact-check scores, with stories as rows and fact-checkers as columns:

| Story \ Fact-checker | xAI’s Grok-2 | Anthropic’s Claude-3 | OpenAI’s GPT-4o | Perplexity’s Sonar |
|---|---|---|---|---|
| Story 1 (xAI’s Grok-2) | 1.07 | 0.87 | 0.87 | 1.26 |
| Story 2 (Anthropic’s Claude-3) | 1.27 | 1.41 | 1.24 | 1.24 |
| Story 3 (OpenAI’s GPT-4o) | 1.49 | 1.55 | 1.46 | 1.47 |
| Story 4 (Perplexity’s Sonar) | 1.30 | 1.05 | 1.03 | 1.03 |

Survey Note: Detailed Analysis of AI Performance in Storytelling and Fact-Checking

This note provides a comprehensive analysis of an experiment conducted as of February 27, 2025, involving four AI models (Grok-2 from xAI, Claude-3 from Anthropic, GPT-4o from OpenAI, and Sonar from Perplexity) that generated stories and fact-checked each other’s outputs. The study offers insight into which AI excels at storytelling and which at fact-checking, drawing on the provided data and relevant research. Below, we detail the methodology, findings, and implications.

Experiment Setup and Data

The experiment had each of the four AI models generate a story on a shared theme, with titles like “Breakfast Diets: A Comparative Analysis”, comparing breakfast options. Each story was then fact-checked by all four AIs, including its own author, yielding 16 fact-checks in total. Scoring was point by point: each claim was labeled true, mostly true, opinion, mostly false, or false, on a scale from +2 (true) to -2 (false), with opinions scored 0 and excluded; the final score is the average over the remaining non-opinion points.
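
As a minimal sketch of this scoring rule, the snippet below averages per-point labels while dropping opinions. The numeric values for the two “mostly” labels (±1 here) are an assumption; the description above only pins down +2 for true, 0 for opinion, and -2 for false.

```python
# Sketch of the point-by-point scoring rule described above.
# Assumption: "mostly true" = +1 and "mostly false" = -1; the text
# only specifies true = +2, opinion = 0, and false = -2.
LABEL_SCORES = {
    "true": 2.0,
    "mostly true": 1.0,    # assumed value
    "opinion": 0.0,        # scored 0 and excluded from the average
    "mostly false": -1.0,  # assumed value
    "false": -2.0,
}

def fact_check_score(labels: list[str]) -> float:
    """Average the point scores, excluding opinion points."""
    factual = [LABEL_SCORES[lab] for lab in labels if lab != "opinion"]
    return sum(factual) / len(factual)

# Example: one true, three mostly true, one excluded opinion
# -> (2 + 1 + 1 + 1) / 4 = 1.25
print(fact_check_score(["true", "mostly true", "opinion", "mostly true", "mostly true"]))
```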

The fact-check scores for each story-checker combination are those shown in the Table of Results above.

To clarify, the stories were generated as follows:

  • Story 1 by xAI’s Grok-2

  • Story 2 by Anthropic’s Claude-3

  • Story 3 by OpenAI’s GPT-4o

  • Story 4 by Perplexity’s Sonar

The fact-check scores were derived from detailed evaluations, with the table above reflecting the final averaged scores for each combination.

Analysis of Best Storyteller

To determine which AI is best at storytelling, we assessed the quality of each generated story based on the average fact-check score from all other AIs, excluding self-evaluations to avoid potential bias. This approach assumes that the consensus of other AIs provides a more objective measure of accuracy.

First, we calculated each story’s average score across all four fact-checkers (including the self-check), and then the average from the other three AIs only; a short script after this list reproduces these numbers:

  • Story 1 (xAI, Grok-2): All fact-checkers’ scores (1.07, 0.87, 0.87, 1.26), average = 1.0175; from others (0.87, 0.87, 1.26) = 1.0.

  • Story 2 (Anthropic, Claude-3): All fact-checkers’ scores (1.27, 1.41, 1.24, 1.24), average = 1.29; from others (1.27, 1.24, 1.24) = 1.25.

  • Story 3 (OpenAI, GPT-4o): All fact-checkers’ scores (1.49, 1.55, 1.46, 1.47), average = 1.4925; from others (1.49, 1.55, 1.47) = 1.5033.

  • Story 4 (Perplexity, Sonar): All fact-checkers’ scores (1.3, 1.05, 1.03, 1.03), average = 1.1025; from others (1.3, 1.05, 1.03) = 1.1267.
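
The following sketch recomputes both averages directly from the score table; the model names and scores simply mirror the table above, and nothing beyond it is assumed:

```python
# Recompute each story's average fact-check score, with and without
# the author's own self-check. Rows and columns mirror the table above.
CHECKERS = ["Grok-2", "Claude-3", "GPT-4o", "Sonar"]
SCORES = {  # story author -> scores from (Grok-2, Claude-3, GPT-4o, Sonar)
    "Grok-2":   [1.07, 0.87, 0.87, 1.26],
    "Claude-3": [1.27, 1.41, 1.24, 1.24],
    "GPT-4o":   [1.49, 1.55, 1.46, 1.47],
    "Sonar":    [1.30, 1.05, 1.03, 1.03],
}

for author, row in SCORES.items():
    self_idx = CHECKERS.index(author)  # the author's own column
    others = [s for i, s in enumerate(row) if i != self_idx]
    print(f"{author}: all = {sum(row) / len(row):.4f}, "
          f"others = {sum(others) / len(others):.4f}")
# Grok-2: all = 1.0175, others = 1.0000
# Claude-3: all = 1.2900, others = 1.2500
# GPT-4o: all = 1.4925, others = 1.5033
# Sonar: all = 1.1025, others = 1.1267
```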

Ranking by average from others:

  1. Story 3 (OpenAI, GPT-4o): 1.5033

  2. Story 2 (Anthropic, Claude-3): 1.25

  3. Story 4 (Perplexity, Sonar): 1.1267

  4. Story 1 (xAI, Grok-2): 1.0

Thus, the results suggest OpenAI’s GPT-4o is likely the best at storytelling, with its story receiving the highest average fact-check score of 1.5033 from the other AIs. An interesting observation is that its self-check score (1.46) was lower than every other checker’s score for its story, indicating it did not inflate its own output; this aligns with findings in LLM Evaluators Recognize and Favor Their Own Generations that LLMs tend to exhibit self-preference, though not uniformly.

Analysis of Best Fact-Checker

To determine which AI is best at fact-checking, we measured the accuracy of each AI’s fact-check scores by calculating the average absolute deviation from the average score for each story across all fact-checkers. This method assumes the consensus average is the best proxy for the “true” score, given the lack of ground truth.

First, we found the average score for each story:

  • Story 1: 1.0175

  • Story 2: 1.29

  • Story 3: 1.4925

  • Story 4: 1.1025

Then, for each fact-checker, we took the absolute difference between its score and the story average for each of the four stories, and averaged those differences (reproduced in the sketch after this list):

  • xAI’s Grok-2: Scores (1.07, 1.27, 1.49, 1.3), deviations (|1.07-1.0175|=0.0525, |1.27-1.29|=0.02, |1.49-1.4925|=0.0025, |1.3-1.1025|=0.1975), average deviation = 0.068125.

  • Anthropic’s Claude-3: Scores (0.87, 1.41, 1.55, 1.05), deviations (|0.87-1.0175|=0.1475, |1.41-1.29|=0.12, |1.55-1.4925|=0.0575, |1.05-1.1025|=0.0525), average deviation = 0.094375.

  • OpenAI’s GPT-4o: Scores (0.87, 1.24, 1.46, 1.03), deviations (|0.87-1.0175|=0.1475, |1.24-1.29|=0.05, |1.46-1.4925|=0.0325, |1.03-1.1025|=0.0725), average deviation = 0.075625.

  • Perplexity’s Sonar: Scores (1.26, 1.24, 1.47, 1.03), deviations (|1.26-1.0175|=0.2425, |1.24-1.29|=0.05, |1.47-1.4925|=0.0225, |1.03-1.1025|=0.0725), average deviation = 0.096875.
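
A minimal sketch of this consensus-deviation computation, again using only the scores from the table:

```python
# Mean absolute deviation of each fact-checker from the per-story
# consensus average (the accuracy proxy described above).
CHECKERS = ["Grok-2", "Claude-3", "GPT-4o", "Sonar"]
ROWS = [  # stories 1-4 (rows) x checkers (columns, in CHECKERS order)
    [1.07, 0.87, 0.87, 1.26],
    [1.27, 1.41, 1.24, 1.24],
    [1.49, 1.55, 1.46, 1.47],
    [1.30, 1.05, 1.03, 1.03],
]
consensus = [sum(row) / len(row) for row in ROWS]  # per-story average

for j, name in enumerate(CHECKERS):
    devs = [abs(ROWS[i][j] - consensus[i]) for i in range(len(ROWS))]
    print(f"{name}: {sum(devs) / len(devs):.6f}")
# Grok-2: 0.068125, Claude-3: 0.094375, GPT-4o: 0.075625, Sonar: 0.096875
```

Note that each checker’s own score is part of the consensus it is measured against, a limitation already implicit in using the consensus average as a proxy for ground truth.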

Ranking by average deviation (lowest to highest):

  1. xAI’s Grok-2: 0.068125

  2. OpenAI’s GPT-4o: 0.075625

  3. Anthropic’s Claude-3: 0.094375

  4. Perplexity’s Sonar: 0.096875

It seems likely that xAI’s Grok-2 is the best at fact-checking, with the smallest average deviation of 0.068125, meaning its evaluations were the closest to the consensus. An unexpected detail is that Grok-2’s own story ranked lowest in storytelling (average score of 1.0 from the other AIs), suggesting it is stronger at verifying claims than at producing accurate ones; this supports the idea that storytelling and fact-checking are separate skills, as discussed in The Social Psychology of Biased Self-Assessment.

Implications and Broader Context

The findings highlight that storytelling and fact-checking are distinct capabilities, with OpenAI’s GPT-4o excelling at the former and xAI’s Grok-2 at the latter. This separation is crucial for users relying on AI for both creative and analytical tasks, as it suggests cross-validation with multiple models may be necessary for accuracy. For instance, if you need a reliable story, OpenAI’s GPT-4o is recommended, but for fact-checking, xAI’s Grok-2 is preferable. This aligns with concerns raised in Algorithmic bias detection and mitigation: Best practices and policies to reduce consumer harms, emphasizing the need for diverse AI evaluations to mitigate bias.

Additionally, the experiment reveals potential self-preference bias: Grok-2 and Claude-3 each scored their own story above the average given by the other checkers, though no AI awarded itself a perfect score, suggesting some built-in critical assessment, as noted in Who watches the AI watchers? The challenge of self-evaluating AI. This is particularly relevant for applications requiring unbiased evaluation, such as mental health AI, as discussed in A Call to Action on Assessing and Mitigating Bias in Artificial Intelligence Applications for Mental Health.

In conclusion, the analysis confirms OpenAI’s GPT-4o as the best storyteller and xAI’s Grok-2 as the best fact-checker, with implications for AI deployment in creative and analytical tasks, urging caution in using a single AI for both without cross-validation.

Key Citations