A question for AI about cross product of AI

I’ve got this data that I want to understand better. What does it mean?
The data is from a cross product experiment with 4 different AI.
Each AI generates a report or story, so in this case there are 4 stories.
Then each AI performs a fact check on each story, making 16 fact checks.

Side note: All of this Python code is running on Mac and Linux.

About Fact-Check Scoring

The AI fact-check tool returns a number between +2 and -2, the average of evaluating every statement in a single report. The scores are tallied statement by statement: 2 points for True, 1 for Mostly_true,
0 points for Opinion (does not count in the average), -1 for Mostly_false, and -2 for False. We don’t want Opinion statements to dilute the scoring; however, the number of Opinion statements and their ratio to the other statements is additional information worth using. The sheer number of statements is also of interest.
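
As a minimal sketch of how such a score could be computed from the per-category counts (the function and argument names are illustrative, not the fact-check tool's actual API):

python

def fact_check_score(true, mostly_true, opinion, mostly_false, false):
    """Average statement score on the +2..-2 scale; Opinion statements are excluded."""
    scored = true + mostly_true + mostly_false + false  # Opinions do not count toward the average
    if scored == 0:
        return 0.0
    points = 2 * true + 1 * mostly_true - 1 * mostly_false - 2 * false
    return points / scored

# Example: 36 True, 9 Mostly_true, 23 Opinion, 0 Mostly_false, 0 False -> 1.8
print(round(fact_check_score(36, 9, 23, 0, 0), 2))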

A Cross-Product Table

Producing a Cross-Product Table is quite useful. Down the main diagonal is the condition where the AI that wrote the story is also fact-checking it. In theory, how could an AI fact-checker not give its own story 100% true? A point worth exploring.

Exploring the averages of the rows and columns gives you information about how the different AIs write a story and fact-check a story.

Looking at the Story x Fact-Check matrix, certain numbers pop out. What can someone read from a cross-product table when numbers stand out?
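
To make the row, column, and diagonal comparisons concrete, here is a small sketch that summarizes a 4x4 Story x Fact-Check matrix; the score values are placeholders for illustration, not the experiment's data:

python

import numpy as np
import pandas as pd

# Placeholder 4x4 Story x Fact-Check score matrix (rows = stories, columns = fact-checkers)
scores = pd.DataFrame(
    [[1.8, 1.7, 1.9, 1.8],
     [1.8, 1.5, 1.6, 0.9],
     [1.9, 1.7, 1.7, 1.9],
     [1.7, 1.9, 1.7, 1.1]],
    index=[f"story_{i}" for i in range(1, 5)],
    columns=[f"checker_{i}" for i in range(1, 5)],
)

print("Diagonal (self fact-checks):", np.diag(scores.values))
print("Row means (how each story fares):")
print(scores.mean(axis=1))
print("Column means (how lenient each fact-checker is):")
print(scores.mean(axis=0))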

In addition to a simple 2D table, are there additional visualizations that can be generated in Python that expose more insights as to what is happening in the data?

Garbage-In Garbage-Out

What AI prompts are useful to produce the most meaningful story for an experiment such as the Cross-Product? What AI prompts should be used to perform a Fact-Check?

For the data provided, after headers, titles, and obvious non-statements are omitted, the paragraphs are presented to the AI one statement at a time, with instructions on how to fact-check each statement and to break it into parts if it represents multiple statements. A typical 1200-word story works out to around 60 statements; it is up to the AI to decide. Another study, I suppose.
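
A rough sketch of that preprocessing step, assuming a plain-text story and a naive sentence split; in the real pipeline the AI decides the statement boundaries, so this is only an approximation:

python

import re

def rough_statements(story_text):
    """Naively split a story into candidate statements, dropping headers and short fragments."""
    candidates = []
    for paragraph in story_text.split("\n\n"):
        paragraph = paragraph.strip()
        # Skip titles, headers, and obvious non-statements (very short lines)
        if not paragraph or len(paragraph.split()) < 4:
            continue
        # Naive split on sentence-ending punctuation followed by whitespace
        candidates.extend(s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip())
    return candidates

statements = rough_statements("A Title\n\nCommunity gardens grow food. They also build community!")
print(len(statements), statements)  # 2 ['Community gardens grow food.', 'They also build community!']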

A Set Of Questions and Requests

  1. What can be learned from the data provided?
  2. How can the data provided be visualized to expose interesting findings?
  3. What makes a good prompt for AI to write a story?
  4. What makes a good prompt for AI to Fact-Check a story?
  5. Can you write some Python to visualize the Cross-Product data?

From Grok DeepSearch March 5, 2025

The following data is in a JSON object. What Python code do I need for visualization and analysis?

STORY
S Make Model Title
1 openai gpt-4o The Rise of Electric Vehicles: Transforming the Automotive I
2 xai grok-2-latest The Unseen Impact of Community Gardens: A Story of Growth an
3 perplexity sonar The Unyielding Legacy of Stephen Hawking: A Story of Triumph
4 anthropic claude-3-7-sonnet-20250219 Rising Temperatures: The Global Reality of Climate Change

FACT CHECK
S F Make Model True Mostly_True Opinion Mostly_False False Score
1 1 xai grok-2-latest 36 9 23 0 0 1.8
1 2 openai gpt-4o 26 9 9 0 0 1.74
1 3 perplexity sonar 33 7 5 0 0 1.82
1 4 anthropic claude-3-7-sonnet-20250219 42 13 4 0 0 1.76
2 1 xai grok-2-latest 18 5 15 0 0 1.78
2 2 anthropic claude-3-7-sonnet-20250219 19 9 2 1 1 1.47
2 3 openai gpt-4o 11 8 10 0 0 1.58
2 4 perplexity sonar 10 7 11 5 1 0.87
3 1 xai grok-2-latest 36 5 13 0 0 1.88
3 2 anthropic claude-3-7-sonnet-20250219 28 4 4 2 0 1.71
3 3 openai gpt-4o 17 7 7 0 0 1.71
3 4 perplexity sonar 25 3 6 0 0 1.89
4 1 xai grok-2-latest 38 10 6 1 1 1.66
4 2 anthropic claude-3-7-sonnet-20250219 44 3 1 0 0 1.94
4 3 openai gpt-4o 25 6 2 0 1 1.69
4 4 perplexity sonar 22 9 5 8 1 1.07
Key Points

  • It seems likely that the cross-product experiment reveals how AI models fact-check stories, including their own, with varying levels of leniency.

  • Research suggests that self-fact-checking scores can indicate potential bias, with some AI models rating their own stories higher than others do.

  • The evidence leans toward visualizing the data with heatmaps and bar charts to highlight patterns in fact-checking scores and statement categories.

  • Good prompts for AI story writing should be clear and topic-focused, while fact-checking prompts need detailed instructions for statement evaluation.


Understanding the Data

The cross-product experiment involves four AI models (openai/gpt-4o, xai/grok-2-latest, perplexity/sonar, anthropic/claude-3-7-sonnet-20250219) each generating a story and then fact-checking all four stories, resulting in 16 fact-checks. Each fact-check is scored from +2 to -2, based on statement categories: True (2 points), Mostly True (1), Opinion (0, excluded from average), Mostly False (-1), and False (-2). This setup helps analyze how AI models evaluate truthfulness, especially their own work.

The diagonal of the cross-product table (self-fact-checking) shows scores like 1.74 for openai/gpt-4o’s story and 1.94 for anthropic/claude-3-7-sonnet-20250219’s story, suggesting some AI models may be more favorable to themselves. Comparing these to average scores from other AI fact-checkers can reveal biases. For example, anthropic/claude-3-7-sonnet-20250219’s self-score (1.94) is higher than its average score from the other three fact-checkers (about 1.47), indicating potential leniency.

Visualizing the Data

Visualizations can expose insights, such as heatmaps showing fact-checking scores across the matrix, with darker colors for higher scores. Bar charts can compare the number of True, Mostly True, Opinion, Mostly False, and False statements per story, highlighting quality differences. Scatter plots could show the relationship between story length (number of statements) and average score, revealing if longer stories correlate with lower accuracy.

Python libraries like Matplotlib and Seaborn are ideal for these visualizations, running seamlessly on Mac and Linux. For example, a heatmap can be created using Seaborn’s heatmap function, with code provided below to help analyze the data.

Crafting Effective Prompts

For story writing, prompts should be clear, specific, and topic-focused, e.g., “Write a 1200-word story on climate change, including key facts and examples.” This ensures meaningful content. For fact-checking, prompts need detailed instructions, like “Evaluate each statement for truthfulness, categorize as True, Mostly True, Opinion, Mostly False, or False, and provide a score from +2 to -2, excluding Opinions from the average.” This ensures consistent evaluation.

Python Code for Visualization

Here’s Python code to visualize the cross-product data using Pandas, Matplotlib, and Seaborn, assuming the data is in a DataFrame:

python

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Example DataFrame creation (replace with actual data loading)
data = {
    'Story': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 4, 4, 4, 4],
    'Fact_Checker': [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4],
    'Score': [1.8, 1.74, 1.82, 1.76, 1.78, 1.47, 1.58, 0.87, 1.88, 1.71, 1.71, 1.89, 1.66, 1.94, 1.69, 1.07]
}
df = pd.DataFrame(data)

# Pivot the data for heatmap
pivot_df = df.pivot(index='Story', columns='Fact_Checker', values='Score')

# Create heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(pivot_df, annot=True, cmap='YlOrRd', fmt='.2f')
plt.title('Cross-Product Fact-Check Scores Heatmap')
plt.xlabel('Fact-Checker Index')
plt.ylabel('Story Index')
plt.show()

# Bar chart for statement categories (example, replace with actual counts)
categories = ['True', 'Mostly True', 'Opinion', 'Mostly False', 'False']
counts = [36, 9, 23, 0, 0]  # Example for story 1, fact-checker 1
plt.figure(figsize=(10, 6))
sns.barplot(x=categories, y=counts)
plt.title('Statement Categories for Story 1, Fact-Checker 1')
plt.ylabel('Count')
plt.show()

This code creates a heatmap of scores and a bar chart for statement categories, helping identify patterns.


Comprehensive Analysis of AI Cross-Product Experiment

This analysis delves into the cross-product experiment involving four AI models (openai/gpt-4o, xai/grok-2-latest, perplexity/sonar, and anthropic/claude-3-7-sonnet-20250219), each generating a story and fact-checking all stories, resulting in 16 fact-checks. The experiment, conducted on Mac and Linux platforms, provides insights into AI fact-checking behavior, story quality, and visualization techniques, with implications for prompt engineering. All findings are based on the provided data.

Data Structure and Fact-Check Scoring

The experiment involves four stories, each written by a different AI, and each story is fact-checked by all four AI models, creating a 4x4 cross-product matrix. Fact-checking scores range from +2 to -2, calculated as the average of statement evaluations: True (2 points), Mostly True (1), Opinion (0, excluded from average), Mostly False (-1), and False (-2). Opinions are noted for their ratio to other statements, and the total number of statements (around 60 per 1200-word story) is also significant.

The data is presented in two tables: the STORY table lists each AI’s make, model, and story title, while the FACT CHECK table details the story (S), fact-checker (F), fact-checker’s make and model, counts of each statement category, and the final score. For example, story 1 (openai/gpt-4o) fact-checked by xai/grok-2-latest has 36 True, 9 Mostly True, 23 Opinion, 0 Mostly False, 0 False, and a score of 1.8.
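
As a sketch, the FACT CHECK table above can be transcribed into a Pandas DataFrame, with the derived quantities discussed here (total statements and the Opinion ratio) added as extra columns; the column names are chosen for this sketch rather than taken from the original JSON:

python

import pandas as pd

columns = ["story", "checker", "make", "true", "mostly_true", "opinion", "mostly_false", "false", "score"]
rows = [
    (1, 1, "xai", 36, 9, 23, 0, 0, 1.80), (1, 2, "openai", 26, 9, 9, 0, 0, 1.74),
    (1, 3, "perplexity", 33, 7, 5, 0, 0, 1.82), (1, 4, "anthropic", 42, 13, 4, 0, 0, 1.76),
    (2, 1, "xai", 18, 5, 15, 0, 0, 1.78), (2, 2, "anthropic", 19, 9, 2, 1, 1, 1.47),
    (2, 3, "openai", 11, 8, 10, 0, 0, 1.58), (2, 4, "perplexity", 10, 7, 11, 5, 1, 0.87),
    (3, 1, "xai", 36, 5, 13, 0, 0, 1.88), (3, 2, "anthropic", 28, 4, 4, 2, 0, 1.71),
    (3, 3, "openai", 17, 7, 7, 0, 0, 1.71), (3, 4, "perplexity", 25, 3, 6, 0, 0, 1.89),
    (4, 1, "xai", 38, 10, 6, 1, 1, 1.66), (4, 2, "anthropic", 44, 3, 1, 0, 0, 1.94),
    (4, 3, "openai", 25, 6, 2, 0, 1, 1.69), (4, 4, "perplexity", 22, 9, 5, 8, 1, 1.07),
]
df = pd.DataFrame(rows, columns=columns)

# Derived quantities: total statements and the share of Opinion statements
df["statements"] = df[["true", "mostly_true", "opinion", "mostly_false", "false"]].sum(axis=1)
df["opinion_ratio"] = (df["opinion"] / df["statements"]).round(2)
print(df.head())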

Insights from the Cross-Product Table

The cross-product table allows analysis of how each AI fact-checks stories, particularly its own (diagonal elements). Self-fact-check scores are:

  • Story 1 (openai/gpt-4o, fact-checked by itself): 1.74

  • Story 2 (xai/grok-2-latest, fact-checked by itself): 1.78

  • Story 3 (perplexity/sonar, fact-checked by itself): 1.89

  • Story 4 (anthropic/claude-3-7-sonnet-20250219, fact-checked by itself): 1.94

Comparing these to average scores from other fact-checkers reveals potential biases. For instance, story 4’s self-score (1.94) is higher than its average from the other checkers (about 1.47), suggesting anthropic/claude-3-7-sonnet-20250219 may be lenient with its own work. Conversely, story 1’s self-score (1.74) is slightly lower than its average from the other checkers (about 1.79), indicating possible stricter self-evaluation.
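
Using a DataFrame like the one sketched above, the self-scores and the averages from the other checkers fall out of a short groupby; the author mapping below is an assumed helper derived from the STORY table, not a column in the original data:

python

import pandas as pd

# df: the fact-check DataFrame sketched in the Data Structure section
# Map each story number to the make that wrote it (from the STORY table)
author = {1: "openai", 2: "xai", 3: "perplexity", 4: "anthropic"}
df["author"] = df["story"].map(author)
df["is_self"] = df["author"] == df["make"]

self_scores = df[df["is_self"]].set_index("author")["score"]
others_avg = df[~df["is_self"]].groupby("author")["score"].mean()
print(pd.DataFrame({"self_score": self_scores, "avg_from_others": others_avg}).round(3))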

Average scores for each story across all fact-checkers are:

  • Story 1: 1.78

  • Story 2: 1.425

  • Story 3: 1.8

  • Story 4: 1.59

This shows story 3 (perplexity/sonar) is generally well-regarded, while story 2 (xai/grok-2-latest) has the lowest average, possibly due to lower fact-check scores like 0.87 from perplexity/sonar.

Fact-checker leniency varies:

  • xai/grok-2-latest: Average score across stories 1.78

  • openai/gpt-4o: Average score 1.68

  • perplexity/sonar: Average score 1.41

  • anthropic/claude-3-7-sonnet-20250219: Average score 1.72

xai/grok-2-latest is the most lenient, while perplexity/sonar is the strictest, with scores like 0.87 for story 2 indicating harsher evaluation.
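
These per-fact-checker averages are a one-line groupby over the same sketched DataFrame:

python

# df: the fact-check DataFrame sketched earlier
# Average score each fact-checker hands out across the four stories
leniency = df.groupby("make")["score"].mean().sort_values(ascending=False).round(2)
print(leniency)  # roughly: xai 1.78, anthropic 1.72, openai 1.68, perplexity 1.41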

Visualizations for Deeper Insights

Visualizations can enhance understanding. A heatmap of the cross-product scores (story vs. fact-checker) can highlight patterns, with darker colors for higher scores. For example, a heatmap might show story 3 consistently scoring high (1.89 from self, 1.88 from xai/grok-2-latest), suggesting robustness. Bar charts can compare statement categories (True, Mostly True, etc.) for each story-fact-checker pair, revealing which AI produces more Opinions (e.g., story 1 has 23 Opinions from xai/grok-2-latest).

Scatter plots could explore correlations, such as story length (number of statements) versus average score, potentially showing longer stories with more Opinions have lower scores. Python libraries like Matplotlib and Seaborn, compatible with Mac and Linux, facilitate these visualizations. For instance, a heatmap can be generated using Seaborn’s heatmap function, as shown in the code example above.
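
A sketch of the scatter-plot idea, again assuming the DataFrame built earlier with its derived statements and opinion columns:

python

import matplotlib.pyplot as plt

# df: the fact-check DataFrame sketched earlier
# Does story length (statements per fact check) relate to the score? Color encodes Opinion count.
plt.figure(figsize=(8, 6))
plt.scatter(df["statements"], df["score"], c=df["opinion"], cmap="viridis")
plt.colorbar(label="Opinion statements")
plt.xlabel("Statements in fact check")
plt.ylabel("Fact-check score")
plt.title("Statements vs. fact-check score")
plt.show()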

Prompt Engineering for Story Writing and Fact-Checking

Effective prompts are crucial. For story writing, prompts should be clear and topic-focused, e.g., “Write a 1200-word story on the impact of community gardens, including examples and key facts.” This ensures coherent, relevant content. For fact-checking, prompts need detailed instructions, such as “Evaluate each statement in the story for truthfulness, categorize as True, Mostly True, Opinion, Mostly False, or False, provide counts, and calculate a score from +2 to -2, excluding Opinions from the average.” This ensures consistent, thorough evaluation, breaking paragraphs into statements as needed.
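
As a sketch, the two prompts can live as plain template strings; call_model below is a hypothetical stand-in for whichever provider client actually sends the request, not a real function:

python

STORY_PROMPT = "Write a 1200-word story on {topic}, including key facts and concrete examples."

FACT_CHECK_PROMPT = (
    "Evaluate the following statement for truthfulness. Categorize it as True, "
    "Mostly True, Opinion, Mostly False, or False, and briefly justify the category. "
    "If it contains multiple claims, split it and categorize each part.\n\n"
    "Statement: {statement}"
)

def fact_check_story(statements, call_model):
    """Run each statement through a hypothetical call_model(prompt) callable and collect the results."""
    return [call_model(FACT_CHECK_PROMPT.format(statement=s)) for s in statements]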

Python Code for Visualization and Analysis

To analyze and visualize the data, Python code using Pandas, Matplotlib, and Seaborn is provided. The code creates a heatmap of fact-check scores and a bar chart for statement categories, as shown earlier. This allows users to identify patterns, such as high self-scores on the diagonal or low scores from strict fact-checkers like perplexity/sonar.

Addressing Garbage-In, Garbage-Out

The experiment mitigates garbage-in, garbage-out by ensuring prompts are well-designed. Clear story prompts produce meaningful content, while detailed fact-checking prompts ensure accurate evaluation. The process of breaking stories into statements (around 60 per 1200 words) and categorizing them helps maintain data quality, with Opinions noted separately to avoid diluting scores.

Summary of Learnings

From the data, we learn:

  • Self-fact-checking can reveal biases, with some AI models (e.g., anthropic/claude-3-7-sonnet-20250219) rating their stories higher than others.

  • Story quality varies, with story 3 (perplexity/sonar) scoring highest on average (1.8) and story 2 (xai/grok-2-latest) lowest (1.425).

  • Fact-checker leniency differs, with xai/grok-2-latest being most lenient (1.78 average) and perplexity/sonar strictest (1.41 average).

  • Visualizations like heatmaps and bar charts can expose these patterns, aiding interpretation.

This comprehensive analysis provides a foundation for further study, such as exploring statement ratios or prompt optimization for AI tasks.

Table: Self-Fact-Check and Average Scores

Story AI Self-Fact-Check Score Average Score from Others
openai/gpt-4o (Story 1) 1.74 1.793
xai/grok-2-latest (Story 2) 1.78 1.307
perplexity/sonar (Story 3) 1.89 1.767
anthropic/claude-3-7-sonnet-20250219 (Story 4) 1.94 1.473

Table: Fact-Checker Average Scores

Fact-Checker AI Average Score Across Stories
xai/grok-2-latest 1.78
openai/gpt-4o 1.68
perplexity/sonar 1.41
anthropic/claude-3-7-sonnet-20250219 1.72
