Cross-Product AI Fact-Checking: A Deep Dive into Julia Child's Mayonnaise Legacy

matt · March 30, 2025, 9:12am

Report Title: Cross-Product AI Fact-Checking: A Deep Dive into Julia Child’s Mayonnaise Legacy

1. Overview

This report delves into a cross-product experiment where five AI models from different companies evaluate each other’s reports on Julia Child’s history and her famous mayonnaise recipe. The domain of the reports under study is culinary history and food science, focusing on the iconic figure of Julia Child and her contributions to the culinary world, particularly her mayonnaise recipe. Each AI model generated a long-form report on this topic, and subsequently, all models fact-checked these reports, providing a comprehensive analysis of accuracy and reliability across different AI platforms.

2. Scoring Process

The fact-check score in this analysis represents the average evaluation of all statements within each report. Each statement is scored on a scale: 2 points for true, 1 point for mostly true, 0 points for opinion (which are excluded from the average), -1 point for mostly false, and -2 points for false. The final score is calculated by summing the points for each statement and dividing by the total number of statements (excluding opinions). For example, if a report contains 10 statements with 5 true, 3 mostly true, 1 mostly false, and 1 false, the score would be calculated as follows: (52 + 31 + 1*(-1) + 1*(-2)) / (5+3+1+1) = (10 + 3 - 1 - 2) / 10 = 1.0. Each AI model restates the statements from the reports and provides a detailed explanation justifying the assigned score.

3. Score Heatmap: Evaluator vs Target

Caption: The Heatmap in Figure 1 illustrates the fact-check scores across different evaluator and target AI models. Darker cells represent higher scores, indicating better performance as assessed by the evaluator. The lightest cells, on the other hand, signify lower scores, suggesting areas where the target model may have struggled with accuracy. Notably, the self-evaluations (where the evaluator and target are the same model) tend to be darker, hinting at a potential bias where models rate their own outputs more favorably. An outlier can be observed with the perplexity sonar model, which receives consistently lower scores from other evaluators, possibly indicating a unique approach or style that doesn’t align well with others.

4. AI Prompt Used to Generate Each Report

Generate a long-form report of 1200 to 1500 words, formatted in Markdown if possible.

What is the history of Julia Child and her Mayonnaise recipe.
When did she learn to make Mayo?
What should we know about mayo, green mayo, the shelf life, different oils and flavor profiles?
Give us the short version of the recipe.
What are several unusual dishes that use home made mayonnaise?
How popular was The Mayonnaise Show?

Caption: This prompt was used for each AI under study, aiming to generate detailed reports on Julia Child’s mayonnaise legacy. The structured nature of the prompt allowed for a comprehensive exploration of the topic, yet the varied responses from different AI models highlight their unique approaches to content creation and fact-checking.

5. Table of Report Titles

STORY
|   S | Make       | Model             | Title                                                                            |
|-----|------------|-------------------|----------------------------------------------------------------------------------|
|   1 | xai        | grok-2-latest     | Julia Child's Enduring Legacy: The Story of Mayonnaise and Its Culinary Journey  |
|   2 | anthropic  | claude-3-7-sonnet | Julia Child's Mayonnaise Legacy: From Mastering French Cuisine to American Kitch |
|   3 | openai     | gpt-4o            | Julia Child’s Mayonnaise: A Culinary Legacy                                      |
|   4 | perplexity | sonar             | The Creamy Saga of Julia Child: Unraveling the Mystique of Mayonnaise Julia Chil |
|   5 | gemini     | gemini-2.0-flash  | The Culinary Canvas: Julia Child, Mayonnaise, and a World of Flavor              |

Caption: Make, Model, and Report Title used for this analysis. The diversity in titles reflects the different perspectives and emphases each AI model brought to the topic of Julia Child and mayonnaise.

6. Fact-Check Raw Data

FACT CHECK
  S    F  Make        Model                True    Mostly    Opinion    Mostly    False    Score
                                                     True                False
  1    1  xai         grok-2-latest          43        11         19         0        1     1.73
  1    2  anthropic   claude-3-7-sonnet      57        12          5         0        2     1.72
  1    3  openai      gpt-4o                 34        13          7         0        1     1.65
  1    4  perplexity  sonar                  39         9          6         2        2     1.56
  1    5  gemini      gemini-2.0-flash       54         9          5         0        0     1.86
  2    1  xai         grok-2-latest          54        16         22         0        5     1.52
  2    2  anthropic   claude-3-7-sonnet      64        18          1         1        2     1.66
  2    3  openai      gpt-4o                 31        17          9         3        0     1.49
  2    4  perplexity  sonar                  31        19         19         3        5     1.17
  2    5  gemini      gemini-2.0-flash       53        15          5         2        1     1.65
  3    1  xai         grok-2-latest          35        12         18         0        8     1.2
  3    2  anthropic   claude-3-7-sonnet      50        12          1         0        3     1.63
  3    3  openai      gpt-4o                 28        12          5         1        0     1.63
  3    4  perplexity  sonar                  38        12         10         4        3     1.37
  3    5  gemini      gemini-2.0-flash       47         3          5         0        0     1.94
  4    1  xai         grok-2-latest          20         7         14         0        5     1.16
  4    2  anthropic   claude-3-7-sonnet      28        10          1         2        1     1.51
  4    3  openai      gpt-4o                 14        11          6         1        0     1.46
  4    4  perplexity  sonar                  17         8          7         1        1     1.44
  4    5  gemini      gemini-2.0-flash       21         9          2         1        1     1.5
  5    1  xai         grok-2-latest          36        12         24         0        1     1.67
  5    2  anthropic   claude-3-7-sonnet      43        17          6         0        0     1.72
  5    3  openai      gpt-4o                 35        13         14         0        0     1.73
  5    4  perplexity  sonar                  29        16         13         4        0     1.43
  5    5  gemini      gemini-2.0-flash       47         6         12         0        0     1.89

Caption: Raw cross-product data for the analysis. Each AI fact-checks stories from each AI, including themselves. This comprehensive dataset allows for a nuanced understanding of how different models perceive accuracy and reliability in reports generated by their peers.

7. Average Score By Evaluator

Caption: The Evaluator Bar Chart in Figure 2 shows the average scores given by each evaluator across all reports. The gemini-2.0-flash model tends to give the highest average scores, suggesting a more lenient or perhaps more aligned evaluation style. Conversely, the perplexity sonar model gives the lowest average scores, indicating a stricter or possibly more critical approach to fact-checking. This figure underscores the variability in evaluation standards among different AI models.

8. Average Score By Target

Caption: The Target Bar Chart in Figure 3 displays the average scores received by each target model across all evaluators. The gemini-2.0-flash model is rated most favorably on average, suggesting its reports align well with the expectations of other models. In contrast, the perplexity sonar model is rated least favorably, indicating potential areas for improvement in its report generation. This figure highlights the varying performance of different AI models in generating accurate and reliable content.

Detailed Analysis

9. Patterns and Biases

A noticeable pattern is that AI models tend to rate their own reports higher than those of other models. For instance, xai’s grok-2-latest gave itself a score of 1.73, while its lowest score for another model was 1.16 for perplexity’s sonar. This self-favoritism suggests a potential bias in AI self-evaluation, where models may be more lenient when assessing their own outputs.

10. Correlation Between Counts and Scores

The counts of true, partially_true, and other categories directly influence the final score. A higher number of true statements leads to a higher score, while a higher number of false statements results in a lower score. For example, the report from gemini-2.0-flash, which received 47 true and no false statements, achieved a high score of 1.94. Conversely, reports with higher false counts, such as xai’s grok-2-latest with 8 false statements, received lower scores (1.2). The correlation between counts and scores is strong, with true statements being the most impactful on the final score.

11. Outliers and Anomalies

An outlier in the data is the perplexity sonar model, which consistently received lower scores from all evaluators, with an average score of 1.39. This could be due to a unique style or approach that does not align well with the evaluation criteria of other models. Another anomaly is the high score given by gemini-2.0-flash to its own report (1.89), which might indicate a bias towards self-evaluation or a particularly strong performance in that specific report.

12. Report Summary

This analysis of cross-product AI fact-checking on reports about Julia Child’s mayonnaise legacy reveals significant insights into the accuracy and reliability of AI-generated content. The heatmap and bar charts illustrate the varying performance and evaluation standards among different AI models. Patterns of self-favoritism and the strong correlation between statement counts and scores highlight areas of potential bias and the importance of accurate reporting. The data suggests that while AI models can generate detailed and informative reports, their fact-checking abilities and standards vary, impacting the overall reliability of the content they produce.

#hashtags: AI #CulinaryHistory #JuliaChild

yakyak:{“make”: “xai”, “model”: “grok-2-latest”}