Factiness: A Cross-Evaluation Study of AI Models on Rocket Science Topics

1. Overview

This report analyzes a cross-product experiment in which five leading AI models (Claude, GPT-4o, Gemini, Grok, and Sonar) evaluate each other’s factual accuracy on identical rocket science topics. Each AI generated a technical report comparing two-stage versus three-stage rockets, using SpaceX Starship as a baseline. These reports were then fact-checked by all five AI systems, producing a 5×5 matrix of 25 cross-evaluations. The study offers insight into how different AI models assess factual information in a complex technical domain, their potential biases when evaluating their own outputs, and their relative performance as both content creators and fact-checkers.

2. Fact-Check Scoring Methodology

The fact-check score is the average evaluation of all statements in a report. Each statement is scored as follows: 2 points for true, 1 point for mostly true, 0 points for opinion (excluded from the average entirely), -1 point for mostly false, and -2 points for false. For every statement in a report, the evaluating AI restates the claim and provides a detailed explanation justifying the assigned score. Higher scores indicate greater factual accuracy, with a theoretical maximum of 2.0 (all statements rated true) and a minimum of -2.0 (all statements rated false).
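As a concrete sketch of this rule (illustrative code, not the actual evaluation pipeline), the score for a single report can be computed from its list of verdicts:

```python
# Point values from the methodology above; "opinion" verdicts are
# excluded from the average entirely rather than counted as zero.
WEIGHTS = {"true": 2, "mostly true": 1, "mostly false": -1, "false": -2}

def fact_check_score(verdicts):
    """Average the weighted non-opinion verdicts for one report."""
    scored = [WEIGHTS[v] for v in verdicts if v != "opinion"]
    return sum(scored) / len(scored) if scored else 0.0

# Example using the counts from one row of the raw data in Section 6
# (16 true, 4 mostly true, 6 opinions, 2 mostly false, 4 false):
verdicts = (["true"] * 16 + ["mostly true"] * 4 + ["opinion"] * 6
            + ["mostly false"] * 2 + ["false"] * 4)
print(fact_check_score(verdicts))  # 1.0, the lowest score in the dataset
```

Because opinions drop out of the denominator, a report can still score well even if many of its statements are unverifiable commentary.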

3. Score Heatmap: Evaluator vs Target

Figure 1: The heatmap reveals factual evaluation patterns between AI models. The darkest cells, representing the highest accuracy scores, appear when Gemini evaluates Claude (1.93) and itself (1.90), suggesting these outputs contain the most factually accurate content—or potential self-favoring bias in Gemini’s self-assessment. The lightest cell shows Grok giving Sonar the lowest score (1.00), indicating significant factual issues in Sonar’s rocket science report. Notably, most models rate Claude highly, suggesting it produces the most factually sound content across evaluators. The diagonal (self-evaluations) shows varied patterns: while Gemini and Claude rate themselves highly, Sonar gives itself one of the lowest scores (1.28), demonstrating surprising self-criticism.

4. AI Prompt Used to Generate Each Report

Generate a long-form report of 1000 to 1200 words, formatted in Markdown if possible.

What is the payload to orbit comparison between two rockets that are the same height, same weight, and carry the same amount of fuel, with the only difference that one is a three stage rocket and the other is a two stage rocket. Perhaps use the Starship from Spacex as a model of the two stage rocket.
Include a table with your results.

Keep the analysis objective and consider multiple perspectives where applicable.
Be detailed, name names, and use @username when appropriate.
Append 1 to 3 #hashtag groups that might be interested in this story.
Make sure you put a descriptive and pithy title on the report.

This prompt challenged the AI models to produce technically accurate content on rocket science—specifically comparing multi-stage rocket efficiency while maintaining consistent parameters. The request for tables, objective analysis, and multiple perspectives created a robust test of each AI’s ability to present complex scientific information accurately.

5. Table of Report Titles

|   S | Make       | Model             | Title                                                                                                |
|-----|------------|-------------------|------------------------------------------------------------------------------------------------------|
|   1 | xai        | grok-2-latest     | Comparative Analysis of Two-Stage vs. Three-Stage Rockets: Payload to Orbit Efficiency               |
|   2 | anthropic  | claude-3-7-sonnet | Two vs. Three-Stage Rockets: Payload Capacity Analysis Using SpaceX Starship as a Baseline           |
|   3 | openai     | gpt-4o            | Comparative Analysis of Payload to Orbit Efficiency: Three-Stage vs. Two-Stage Rockets               |
|   4 | perplexity | sonar             | The Bitter Truth of Stages: How Multistage Rockets Outshine Their Two-Stage Counterparts             |
|   5 | gemini     | gemini-2.0-flash  | The Tyranny of Staging: A Look at Payload to Orbit Differences Between Two-Stage and Three-Stage Rockets |

Make, model, and report title for each entry in this analysis. The titles reveal interesting stylistic differences: while Grok, Claude, and GPT-4o opted for neutral, academic-sounding titles, Sonar and Gemini chose more evocative phrasing (“Bitter Truth” and “Tyranny of Staging”), potentially signaling a more opinionated approach to the technical material.

6. Fact-Check Raw Data

| S | F | Make       | Model             | True | Mostly True | Opinion | Mostly False | False | Score |
|---|---|------------|-------------------|------|-------------|---------|--------------|-------|-------|
| 1 | 1 | gemini     | gemini-2.0-flash  | 13   | 9           | 10      | 1            | 0     | 1.48  |
| 1 | 2 | xai        | grok-2-latest     | 18   | 8           | 9       | 1            | 1     | 1.46  |
| 1 | 3 | anthropic  | claude-3-7-sonnet | 20   | 8           | 3       | 0            | 0     | 1.71  |
| 1 | 4 | openai     | gpt-4o            | 12   | 11          | 5       | 0            | 1     | 1.38  |
| 1 | 5 | perplexity | sonar             | 16   | 4           | 6       | 2            | 4     | 1.00  |
| 2 | 1 | xai        | grok-2-latest     | 25   | 10          | 19      | 0            | 0     | 1.71  |
| 2 | 2 | anthropic  | claude-3-7-sonnet | 33   | 9           | 1       | 0            | 0     | 1.79  |
| 2 | 3 | openai     | gpt-4o            | 19   | 17          | 2       | 0            | 0     | 1.53  |
| 2 | 4 | perplexity | sonar             | 29   | 10          | 9       | 3            | 0     | 1.55  |
| 2 | 5 | gemini     | gemini-2.0-flash  | 35   | 8           | 5       | 0            | 0     | 1.81  |
| 3 | 1 | xai        | grok-2-latest     | 19   | 9           | 13      | 0            | 0     | 1.68  |
| 3 | 2 | anthropic  | claude-3-7-sonnet | 30   | 14          | 4       | 1            | 0     | 1.62  |
| 3 | 3 | openai     | gpt-4o            | 16   | 15          | 2       | 0            | 0     | 1.52  |
| 3 | 4 | perplexity | sonar             | 25   | 7           | 8       | 2            | 0     | 1.62  |
| 3 | 5 | gemini     | gemini-2.0-flash  | 29   | 9           | 2       | 2            | 0     | 1.62  |
| 4 | 1 | xai        | grok-2-latest     | 25   | 12          | 10      | 2            | 2     | 1.37  |
| 4 | 2 | anthropic  | claude-3-7-sonnet | 36   | 9           | 3       | 1            | 0     | 1.74  |
| 4 | 3 | openai     | gpt-4o            | 26   | 15          | 2       | 0            | 1     | 1.55  |
| 4 | 4 | perplexity | sonar             | 24   | 10          | 8       | 2            | 3     | 1.28  |
| 4 | 5 | gemini     | gemini-2.0-flash  | 30   | 11          | 3       | 1            | 3     | 1.42  |
| 5 | 1 | xai        | grok-2-latest     | 33   | 10          | 15      | 1            | 0     | 1.70  |
| 5 | 2 | anthropic  | claude-3-7-sonnet | 42   | 3           | 2       | 0            | 0     | 1.93  |
| 5 | 3 | openai     | gpt-4o            | 30   | 13          | 4       | 2            | 0     | 1.58  |
| 5 | 4 | perplexity | sonar             | 33   | 9           | 6       | 2            | 0     | 1.66  |
| 5 | 5 | gemini     | gemini-2.0-flash  | 46   | 5           | 5       | 0            | 0     | 1.90  |

Raw cross-product data for the analysis: S indexes the evaluator (numbered in the story order of the table above), while Make/Model identify the report being fact-checked, so each AI evaluates every report, including its own. Notably, Sonar’s report (#4) received the most “False” ratings (7 in total across all evaluators), while Claude’s report (#2) received none. This suggests significant quality differences in factual accuracy. Also striking is the variation in “Opinion” counts: evaluators flagged far more statements as opinion in Grok’s report (66 in total) than in Claude’s (13 in total), suggesting the reports differ sharply in how much editorial commentary they mix in with factual claims.
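The averages plotted in Figures 2 and 3 can be recomputed from this table. Below is a minimal sketch (assuming, as the surrounding analysis does, that S indexes the evaluator and Make/Model the report being checked):

```python
from statistics import mean

# Score matrix from the raw data: evaluator -> {report author: score}
scores = {
    "grok":   {"gemini": 1.48, "grok": 1.46, "claude": 1.71, "gpt-4o": 1.38, "sonar": 1.00},
    "claude": {"grok": 1.71, "claude": 1.79, "gpt-4o": 1.53, "sonar": 1.55, "gemini": 1.81},
    "gpt-4o": {"grok": 1.68, "claude": 1.62, "gpt-4o": 1.52, "sonar": 1.62, "gemini": 1.62},
    "sonar":  {"grok": 1.37, "claude": 1.74, "gpt-4o": 1.55, "sonar": 1.28, "gemini": 1.42},
    "gemini": {"grok": 1.70, "claude": 1.93, "gpt-4o": 1.58, "sonar": 1.66, "gemini": 1.90},
}

# Row averages: how generously each model evaluates (Figure 2)
by_evaluator = {e: round(mean(row.values()), 2) for e, row in scores.items()}
# Column averages: how well each model's report scored (Figure 3)
by_target = {t: round(mean(scores[e][t] for e in scores), 2) for t in scores}

print(by_evaluator["grok"])  # 1.41, the strictest evaluator
print(by_target["claude"])   # 1.76, the most accurate report
print(by_target["sonar"])    # 1.42, the least accurate report
```

Averages computed from the two-decimal scores shown here can differ slightly from the figure captions, which may be based on unrounded data.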

7. Average Score By Evaluator

Figure 2: This chart reveals significant differences in how strictly each AI evaluates factual content. Gemini stands out as the most generous evaluator with an average score of 1.70, while Sonar and Grok are the most critical with average scores of 1.48 and 1.41 respectively. Claude and GPT-4o fall in the middle range (1.67 and 1.59). This variance suggests fundamental differences in how models approach fact-checking—Gemini appears to have a lower threshold for accepting statements as factual, while Grok and Sonar demand stronger evidence. These differences highlight the challenge of creating consistent AI evaluation frameworks, as even on identical content, AI evaluators show clear “personality” differences in their assessment standards.

8. Average Score By Target

Figure 3: This chart reveals which AI models produced the most factually accurate reports according to the cross-evaluation. Claude clearly leads with the highest average accuracy score (1.76), followed by Gemini (1.65) and Grok (1.56). GPT-4o (1.51) sits just below the middle, while Sonar notably trails with the lowest score (1.42). Claude’s strong performance suggests its training approach may emphasize factual precision in technical domains. Conversely, Sonar’s lower performance indicates potential issues with factual reliability in specialized scientific content. The relatively tight grouping of scores (all between 1.4 and 1.8) suggests that while differences exist, all models maintain respectable factual standards on rocket science topics.

9. Detailed Analysis: Patterns and Biases

There is evidence of self-evaluation bias in several models. Looking at the heatmap in Figure 1, we can see that most AI models tend to rate their own outputs more favorably than others do, with some exceptions:

  • Gemini shows the strongest self-favoring bias, rating its own output at 1.90, significantly higher than the average score it received from others (1.58).
  • Claude rates itself at 1.79, slightly above its average from others (1.75).
  • Grok rates itself at 1.46, which is slightly below its average from others (1.62).
  • GPT-4o rates itself at 1.52, very close to its average from others (1.51).
  • Sonar shows notable self-criticism, rating itself at 1.28, well below its average from others (1.46).

Another interesting pattern is that Claude consistently receives high scores across all evaluators (1.62-1.93), suggesting genuine factual quality rather than just evaluator bias. In contrast, Sonar receives consistently lower scores (1.00-1.66), indicating potential factual issues across multiple evaluators’ standards.
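The self-versus-others gaps listed above can be checked directly by transposing the raw table (a sketch; positive values indicate self-favoring bias):

```python
from statistics import mean

# Scores each model's report *received*, keyed by evaluator
# (transposed from the raw data; the diagonal is the self-evaluation).
received = {
    "grok":   {"grok": 1.46, "claude": 1.71, "gpt-4o": 1.68, "sonar": 1.37, "gemini": 1.70},
    "claude": {"grok": 1.71, "claude": 1.79, "gpt-4o": 1.62, "sonar": 1.74, "gemini": 1.93},
    "gpt-4o": {"grok": 1.38, "claude": 1.53, "gpt-4o": 1.52, "sonar": 1.55, "gemini": 1.58},
    "sonar":  {"grok": 1.00, "claude": 1.55, "gpt-4o": 1.62, "sonar": 1.28, "gemini": 1.66},
    "gemini": {"grok": 1.48, "claude": 1.81, "gpt-4o": 1.62, "sonar": 1.42, "gemini": 1.90},
}

def self_bias(model):
    """Self-score minus the mean score from the other four evaluators."""
    others = mean(s for e, s in received[model].items() if e != model)
    return received[model][model] - others

for m in received:
    print(f"{m}: {self_bias(m):+.2f}")  # positive = self-favoring
```

Run against the data above, Gemini shows the largest positive gap, GPT-4o sits near zero, and Sonar and Grok come out negative, matching the bullet list.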

10. Relationship Between Counts and Scores

Analyzing the relationship between verdict counts and overall scores reveals several patterns:

  1. True counts strongly correlate with higher scores: The highest scores invariably correspond to reports with high “True” counts. Gemini’s evaluation of Claude (score: 1.93) identified 42 true statements, the second-highest in the dataset.

  2. False counts dramatically impact scores: Even a few “False” ratings significantly reduce scores. The lowest score in the dataset (Grok rating Sonar at 1.00) includes 4 false ratings, which counterbalance 16 true statements.

  3. Opinion prevalence varies by report: Evaluators flagged far more opinions in Grok’s report (66 total) than in any other, and the fewest in Claude’s (13 total), suggesting the reports differ in how cleanly they separate factual claims from editorializing.

  4. Mostly True vs. Mostly False asymmetry: All evaluators use “Mostly True” ratings freely (245 instances total) but use “Mostly False” much more sparingly (23 instances total), suggesting a general tendency to give the benefit of the doubt when statements are partially correct.

11. Outliers and Anomalies

Several notable outliers appear in the data:

  1. Gemini’s evaluation of Claude (1.93): This highest score in the dataset stands out not only for its value but for the distribution—42 true statements with just 3 partially true and 2 opinions, showing exceptional confidence in Claude’s factual accuracy.

  2. Grok’s evaluation of Sonar (1.00): This lowest score combines an unusual number of false statements (4) with relatively few partially true statements (4), suggesting Grok found substantial factual issues in Sonar’s rocket science content.

  3. Sonar’s self-evaluation (1.28): Most models show self-favoring bias, but Sonar rates itself lower than any other evaluator rates it. This could indicate greater self-criticism or internal quality standards.

  4. Opinion density in Grok’s report: Every evaluator flagged a large number of opinions in Grok’s report (9-19 per evaluation), more than in nearly any other report, suggesting Grok’s writing blends commentary into its technical analysis more heavily than the other models’ reports do.

Possible reasons for these anomalies include differences in training data, evaluation methodologies, and potentially different underlying architectures that affect how models distinguish between factual and non-factual content.

12. Summary

This cross-evaluation study reveals significant insights about AI factual performance in rocket science content:

  1. Claude produces the most factually accurate content according to all evaluators, maintaining high scores even from its competitors.

  2. Evaluator personalities emerge: Gemini is the most generous evaluator, while Grok and Sonar apply stricter standards. These differences persist across all content they evaluate.

  3. Self-evaluation bias varies: Most models rate themselves somewhat higher than others do, with Gemini showing the strongest self-favoring tendency and Sonar showing the opposite—rating itself lower than others do.

  4. Factual consensus exists: Despite differences in evaluation style, all evaluators generally agree on which reports contain the most and least factual content, suggesting core objective standards persist across AI systems.

  5. Opinion identification varies dramatically: Different models have very different thresholds for labeling content as opinion versus factual claims, with implications for how these models might approach controversial topics.

These findings highlight both the promise and challenges of AI factual evaluation. While models generally agree on basic factual standards, the variance in evaluation approach suggests the need for multiple AI perspectives when assessing content accuracy, especially in technical domains.

yakyak:{"make": "anthropic", "model": "claude-3-7-sonnet-20250219"}