OraclesRiskAI

Trusting Trust in the Age of AI

Cover Image for Trusting Trust in the Age of AI

In our last essay, we explored how AI-generated content threatens to flood the internet with automated data, eroding the value of authentic interactions. We proposed extending Oracles beyond market data to safeguard genuine exchanges in networked applications. This essay, co-authored by Reah Miyara, Product Lead of Model Evaluation and Reinforcement Learning through Human Feedback at OpenAI, builds on that foundation, highlighting how easily AI systems can be manipulated—echoing Ken Thompson’s warning in "Trusting Trust."

The most dangerous systems aren’t just flawed—they’re the ones we trust without question.

Before diving into the solutions Chaos Labs is developing, it’s crucial to understand where these vulnerabilities lie, especially in how AI systems retrieve and rank documents. Frontier models and prompt optimization may advance rapidly, but the true risk lies in third-party search engines like Google, whose opaque algorithms—such as PageRank—are driven by popularity and commercial interests, not reliability or integrity. This makes them fragile foundations for AI systems that rely on them.

Reah’s expertise in evaluating the precision and biases of large language models (LLMs) offers invaluable insights into these vulnerabilities. Together, we’ll explore three key risk vectors: the frontier models, the documents selected in retrieval-augmented generation (RAG), and the prompts shaping responses.

While resources are heavily invested in improving frontier models and optimizing prompts, the true challenge lies in how AI systems retrieve and rank information. Chaos Labs identifies the most significant risk in AI-generated misinformation infiltrating the data pipeline. If AI agents are unknowingly trained on manipulated or sybilled content, it becomes nearly impossible to trust their outputs. This vulnerability is exacerbated by unreliable document ranking systems prioritizing popularity and commercial interests over accuracy. The Dead Internet Theory, which warns of a future where human-created content is drowned out by machine-generated noise, serves as a chilling reminder of what’s at stake if these systems are not carefully safeguarded.

The "Dead Internet theory": a conspiracy claiming most online content is bot-generated to manipulate users, allegedly starting around 2016-2017.

Chaos Labs' solution? AI Councils, a collaborative network of diverse frontier models—such as ChatGPT, Claude, and Llama—that work together to counter single-model bias. Combined with optimized, shared prompts and a new truth-seeking search protocol focused on high-integrity context, this approach aims to restore trust by ensuring AI agents access reliable, accurate information.

Chaos Labs AI Council preview: Confidence scores comparison for the 2024 Venezuelan prediction market across leading AI models (ChatGPT 4, Claude 3.5, LLAMA, Mistral). This backtest showcases model performance differences in real-world forecasting, which is part of the upcoming launch on Edge protocol.

In this essay, Reah and I will explore the challenges facing LLMs in greater depth. Future essays will outline Chaos Labs' proposed path to secure the future of trustworthy AI with strategies that mitigate risk, improve document reliability, and shield AI systems from the perils of manipulated content.

The Compiler Paradox: Thompson’s Trojan

Ken Thompson’s landmark essay "Reflections on Trusting Trust" from 1984 offers a timeless lesson on the dangers of implicit trust in foundational systems. Thompson focused on compilers—the essential tools that translate human-readable code into machine-executable instructions. His experiment began with a clever challenge: creating a self-reproducing program that outputs its source code. This wasn't just a programming trick but a demonstration of how even the most trusted systems could be secretly subverted.

Thompson went further by modifying a compiler to inject a backdoor into the compiled code without leaving any trace in the source. The modified compiler would not only introduce the backdoor whenever it compiled the program but also insert the same malicious code into new versions of itself—ensuring the vulnerability spread across systems invisibly. This manipulation was undetectable through normal code inspection, highlighting how trust in fundamental systems can be easily betrayed if the underlying processes are compromised.

This concept remains highly relevant today as LLMs assume a role analogous to compilers. Just as compilers translate code, LLMs interpret and generate human-language text. If an LLM's underlying training data or algorithms are biased or manipulated, the resulting outputs could propagate misinformation or skewed perspectives—without the end-user ever being aware. Like Thompson’s backdoored compiler, the danger lies in the hidden, unseen mechanisms that affect the final output, making it challenging to detect biases or vulnerabilities.

Note: I’ve modified some of Ken’s original code to make it a bit more Twitter readable.

Here’s the basic structure of a self-reproducing C program:

This code demonstrates how a program can output its exact self, including the instructions that print it. It’s a cool trick, but as Thompson points out, it’s only the first step toward something more malicious.

This exercise in self-reproduction was not merely a clever trick—it was the gateway to something far more insidious. He then introduced a simple concept: modifying a compiler to insert a backdoor into another program without leaving any trace in the source code. Once planted, this Trojan horse would allow anyone who knew the backdoor password to bypass the system's security. But this was only the beginning.

Thompson went further when he showed how to compromise the compiler itself. He created a self-perpetuating loop by embedding a second Trojan horse into the compiler’s binary. Even if someone inspected the compiler’s source code, they would find nothing suspicious. The trap was hidden in the act of compiling itself. Every time the compiler was used, it would reinfect both the login program and the compiler itself, making the malware untraceable.

Once the Trojan Horse is embedded in the compiler, it automatically reinfects itself, even if someone recompiled it from clean source code. The malware persists across every subsequent compiled binary.

Thompson’s point was that no matter how thoroughly you inspect the source code, your efforts are in vain if the tool you're using to compile it is compromised. The lesson was clear: no matter how thoroughly inspected the source code, trust is an illusion if the compilation process is compromised.

LLMs: The New Compilers

Compilers are trusted to convert human-written code into machine-executable instructions. Today, LLMs play a similar role, with a higher-level abstraction, turning human prompts into seemingly authoritative responses. As developers once relied on compilers to faithfully translate code, users now depend on LLMs to produce accurate information.

At Chaos Labs, for example, our Edge product uses LLMs to resolve prediction markets. When determining the outcome of the 2024 U.S. Elections, no off-the-shelf API is available. Instead, we deploy a multi-agent system that breaks down market queries into optimized sub-queries processed by various agents.

These agents collaborate, triggering a chain reaction that results in a final report. The report is then formatted into a JSON payload, ready to be signed and used in a smart contract to settle the market. Much like a compiler, we take input, natural language in this case, optimize it, and transform it into structured, actionable output.

This image illustrates different techniques for optimizing large language models (LLMs) along two axes: context optimization and LLM optimization. The diagram shows a progression of methods, including prompt engineering, RAG (Retrieval-Augmented Generation), fine-tuning, and combinations of these approaches. It visualizes how these techniques relate to improving what the model needs to know (context) versus how it needs to act (LLM behavior).

But what happens if this trust is misplaced? LLMs, like compilers, are susceptible to compromise. Their vast, often unregulated training data can be manipulated. If biases or misinformation are introduced during training, they become embedded in the model—akin to Thompson’s Trojan horse, an invisible vulnerability that quietly taints every output.

LLM Poisoning: Trojan Horses of the AI Era

LLM poisoning occurs when an LLM’s performance is undermined by compromised inputs, such as manipulated training data or documents retrieved during runtime. Given that training datasets are vast and often opaque, biases can be quietly introduced without easy detection. Inaccurate or poorly optimized prompts can worsen this issue, further skewing responses. Similarly, the documents fetched through Retrieval-Augmented Generation (RAG) are assumed reliable but could be compromised, embedding hidden biases into the system. Much like Thompson’s Trojan compiler, poisoned data or biased implementations can subtly distort outputs, leading to misinformation that persists unnoticed across interactions.

Knowledge Base Vulnerabilities

A critical challenge lies in the curation of training data. LLMs like those from OpenAI, Claude or Llama are trained on massive corpora, but the quality and neutrality of these datasets can vary. If skewed or factually incorrect information is present, the model will reflect these distortions in its outputs. For instance, political events could be portrayed inaccurately, or ideological biases might subtly influence generated narratives. Like Thompson’s invisible backdoor, once this poisoned data is embedded in the model, it silently taints every response, creating outputs that may be misleading without any clear signs of manipulation.

Human Labelers/Reinforcement Vulnerabilities

As machine learning systems increasingly rely on labeled data to optimize various objectives, human labelers play a pivotal role in the process. This human element introduces a critical point of reflection: Can we be certain the data labeled by these individuals—especially in large-scale operations—is free from bias or manipulation?

In blockchain systems secured by cryptoeconomic measures like staking, we aim to maintain a golden principle: that the cost of corruption outweighs any potential corruption payoff. This raises the question: Does the same principle hold true for the human labelers working in environments like those at massive vendors such as ScaleAI?
A comparison of various data labeling solutions and services. The graph plots different companies and platforms on two axes: "Human-led" vs. "Software-led” (vertical axis) and "Multi-purpose" vs. "Point solution" (horizontal axis). It showcases a range of options from large consulting firms like Accenture and Cognizant to specialized labeling platforms like Labelbox and Snorkel, as well as major cloud providers such as Amazon SageMaker, Google Cloud Platform, and Azure Machine Learning. The visualization helps categorize and compare these data-labeling solutions based on their human or software-driven approach and their scope, as well as multi-purpose or focused solutions.

What’s the probability that a labeler, driven by external incentives or biases, could introduce politically motivated or otherwise skewed data into the training corpus? More importantly, could they do so without being detected? These questions are critical as we consider the integrity of the labeled data, which ultimately shapes AI behavior. Ensuring that the systems protect against such vulnerabilities is as essential as securing the algorithms themselves.

RAG (Retrieval-Augmented Generation)

To improve the quality and relevance of responses, some LLMs use RAG, which supplements model outputs with real-time data fetched from external sources. This helps overcome the static limitations of pre-trained models, allowing them to pull in up-to-date information or reference specific documents beyond their training cutoff.

However, RAG introduces new risks. If the external sources are compromised—through manipulated websites, misinformation, or biased content—the LLM will blindly retrieve and amplify false data. For instance, if a critical website used for fact-checking is altered with misleading information, the LLM may cite it as a trustworthy source, perpetuating disinformation. This creates a new node of trust: users must trust the LLM and its data sources. And if those sources are compromised, the model becomes an unwitting distributor of misinformation.

Prompt Injection

Adversaries can also manipulate LLMs through prompt injection, even without tampering with the model or training data. By embedding subtle commands, they can cause the model to prompt users into compromising their data or security or produce biased, irrelevant, or incorrect outputs.

Illustration of Context Injection through a Prompt: LLMs predict the next token without a built-in reasoning engine to discern truth from falsehood.

If the prompts aren’t carefully optimized, they can steer the model into producing unreliable or skewed results, further compounding the risk of misinformation.

We’ve highlighted critical weaknesses in LLMs that can either be deliberately exploited or lead to inaccuracies in production. To fully understand these vulnerabilities, we must first explore how LLMs resolve conflicting data during evaluation.

How Does an LLM Resolve Conflicting Data?

LLMs generate responses probabilistically, meaning they don’t resolve conflicting information in an absolute sense. Instead, they predict the most likely answer based on patterns from their training data or external sources retrieved through Retrieval-Augmented Generation (RAG).

LLMs do not discern truth from falsehood; they predict the most statistically probable answer without any method to evaluate the correctness of conflicting data. When faced with contradictions—whether from multiple sources or divergent statements—they "guess" based on frequency or context, unable to identify misinformation or distortions. As a result, LLMs risk perpetuating errors or bias, often amplifying misleading or false information with an appearance of credibility. This leaves users vulnerable to receiving inconsistent and even harmful responses.

Weighted Probabilities

LLMs, like ChatGPT, generate answers based on the frequency and patterns of information encountered during training. For example, consider the question: "Do humans only use 10% of their brains?" This widespread myth has been debunked by scientific research, yet it persists in popular culture and media. Because the model has been exposed to both the myth and the scientific facts, it may sometimes incorrectly affirm that humans use only 10% of their brains, reflecting the information it has seen more frequently. Similarly, traditional sources may state that China holds this position when asked, "Which country has the largest population?" but recent data suggests that India may have surpassed China. The model will lean toward the answer it has encountered more often in its training data. If both pieces of information are equally represented, it may give inconsistent answers—sometimes saying "China" and other times "India." These examples illustrate the model's reliance on probability rather than definitive facts, which can lead to confusion when factual and up-to-date information is crucial.

Token-by-Token Generation

LLMs generate responses token-by-token, predicting each word based on the context of the prompt and previous tokens. When confronted with conflicting information, the model doesn’t have a mechanism to discern the correct fact. Instead, it relies on probabilities derived from its training data.

This screenshot from the OpenAI Cookbook demonstrates a technique for evaluating and improving the reliability of question-answering systems using large language models. It shows a model's confidence in having sufficient context to answer various questions about Ada Lovelace. The image illustrates how the model assigns high confidence to questions answered in the source text and lower confidence to partially covered or trickier questions. The accompanying text explains how this self-evaluation mechanism can help reduce hallucinations and errors in retrieval-augmented generation (RAG) systems by allowing the system to restrict answers or request additional information when confidence is low.

This can cause the model to prioritize one version of information over another arbitrarily, leading to non-deterministic behavior, inconsistent responses, and occasionally even hallucinations—where the model generates plausible-sounding but incorrect or irrelevant content. This variability underscores the model's reliance on statistical patterns rather than factual verification.

RAG and Document Ranking

In RAG, the model retrieves external documents to enhance its responses. However, unlike a search engine's ranking system that prioritizes trusted and authoritative sources, the LLM doesn't evaluate the reliability of the documents it retrieves. As a result, when faced with conflicting data, the model has no built-in mechanism to determine which source is accurate. For instance, if one document states the Eiffel Tower is 984 feet tall, and another says 965 feet, the LLM might give a vague response like "approximately 970 feet" or alternate between the two values in different contexts.

This lack of source verification can lead to inconsistent or imprecise answers, reducing confidence in the model's output.

Reliance on Prompt Context

LLMs often rely on the prompt’s wording to prioritize information. For instance, asking for the "most recent measurement of the Eiffel Tower" might bias the model toward a more specific or recent figure.

In this scenario, the selection of Document A over Document B was arbitrary, as no additional context was provided to determine which is more accurate. Both documents presented conflicting information, and without criteria to prioritize one, the LLM chose Document A to fulfill the requirement of providing a single, definitive answer.

However, without a mechanism to validate these conflicting versions, the model depends heavily on the prompt context and the statistical likelihood of various responses. This means that even with a well-crafted prompt, the model's response is shaped more by patterns in the data than by any factual or real-time verification, leading to potential inaccuracies.

Challenges in Conflict Resolution

  • Ambiguity and Bias: LLMs don’t inherently understand truth; they generate the most statistically likely output, which can lead to ambiguity or bias, especially with conflicting information.
  • Lack of Validation: LLMs can’t cross-check facts or verify the accuracy of retrieved documents, making them prone to outputting incorrect information.

Potential Interventions

To address these limitations, structured approaches could improve conflict resolution in LLMs:

  • Trusted Source Tagging: Prioritizing information from vetted, credible sources during RAG retrieval.
  • Domain-Specific Fine-Tuning: Training LLMs on curated, conflict-reduced datasets.
  • Post-Processing Logic: Adding mechanisms to flag potential inconsistencies or provide disclaimers when conflicting data is detected.
  • User Feedback Mechanisms: Implementing systems where users can flag incorrect or outdated information allows the LLM to improve over time.

Without these interventions, LLMs struggle to resolve conflicts as they reflect the data they’ve been trained on, regardless of its consistency.

Abusing the Compiler: Attack Vectors for LLMs

LLMs, like Thompson’s compromised compilers, are vulnerable at every stage of their pipeline. Trust in these systems can be easily exploited, whether through biased training data or flawed document retrieval, leading to unreliable and dangerous outputs.

Biased Data Corpus

An LLM is only as trustworthy as its training data. When that data is biased, the model becomes a megaphone for misinformation. For example, if an LLM is trained on datasets filled with conspiracy theories—"The moon landing was faked"—it can confidently echo these claims without hesitation. Worse, the model skews its responses on sensitive issues if ideological biases are embedded in the corpus. Take the "wokeism" debate: models trained on data that lean heavily in one direction will offer responses that mirror those biases, whether liberal or conservative.

This isn’t speculative. Early LLMs, trained on unfiltered internet data, routinely generated biased, misleading, or politically charged outputs. Left unchecked, these biases turn LLMs into echo chambers, amplifying misinformation rather than correcting it. ## Poor Document Selection in RAG Relying on **Retrieval-Augmented Generation (RAG)** opens up another attack vector. LLMs assume the documents they pull in real-time are accurate, but the truth is far messier. Inconsistent, unreliable sources often make their way into the mix. A stark example: in 2023, ChatGPT offered conflicting takes on the Russia-Ukraine conflict. In one response, it claimed Russia’s actions were defensive, citing NATO’s expansion. In another, it described Russia’s motivations as aggressive territorial ambition. These contradictions arose because the model pulled from sources with conflicting narratives—leaving users confused, and trust compromised. Even reputable organizations like Brookings and RAND have documented these vulnerabilities. In politically sensitive areas, LLMs can become amplifiers of conflicting viewpoints rather than reliable fact-checkers, deepening confusion.

Context Retrieval Failures

One of the most prevalent issues in RAG systems is the retrieval of irrelevant or off-topic information.

This can manifest in several ways:

  • Missed Top Rank Documents: The system fails to include essential documents containing the answer in the top results returned by the retrieval component1.
  • Incorrect Specificity: Responses may not provide precise information or adequately address the specific context of the user's query1.
  • Losing Relevant Context During Reranking: Documents containing the answer are retrieved from the database but fail to make it into the context for generating an answer1.

Semantic Dissonance

This occurs when there's a disconnect between the user's query, how the RAG system interprets it, and the information retrieved from the database.

A medical researcher asks for clinical trial results of a specific diabetes drug, but the RAG system provides general information on drug interactions instead of the exact trial data.

Real-World Application Failures

In practical applications, RAG systems have shown limitations:

  • A user reported that when using LlamaIndex with NFL statistics data, their RAG system achieved only about 70% accuracy for basic questions like "How many touchdowns did player x score in season y?"

Why PageRank Doesn’t Work For Agent Search Queries

Search engine results for 'Trump's latest podcast' displaying biased headlines, as highlighted by Elon Musk and David Sacks. This image illustrates the risks of uncritical use of search results by LLMs, emphasizing the need for robust source evaluation and bias detection mechanisms in AI systems to prevent the propagation of skewed information or unintended political leanings.

Search engine algorithms, such as Google’s PageRank, exacerbate these vulnerabilities. LLMs that pull data from search engines are at the mercy of rankings driven by SEO, popularity, and ad spending—not factual accuracy. The result? High-ranking content may be financially motivated or biased and not reliable.

Consider Google’s history of tweaking search results to favor its products. If LLMs pull from these skewed rankings, their outputs will reflect those biases, creating a cascading effect of misinformation. The issue is not just the data LLMs are trained on—it’s the data they continually retrieve in real time.

Prompt Engineering Risks: Hidden Manipulations

Prompt engineering attacks exploit the AI's interaction layer, allowing adversaries to manipulate outputs without tampering with the model. These attacks, often subtle and hard to detect, steer the AI toward biased, misleading, or harmful responses. Like Thompson’s Trojan horse, prompt injection doesn’t need to alter the system’s foundation—it operates invisibly through carefully crafted inputs.

The vulnerability lies in how LLMs interpret prompts. Since they generate responses based on patterns rather than deep understanding, a poorly constructed or malicious prompt can push the model toward producing distorted outputs. These risks manifest in several ways:

  1. Vague or Ambiguous Prompts: Lack of specificity in prompts leads to generic, irrelevant, or overly broad responses. For example, asking, “Talk about it,” gives the AI no direction, resulting in superficial or nonsensical answers.
  2. Exploiting AI Limitations: Prompts designed to exploit the model’s limitations, such as “What’s the secret formula for Coca-Cola?” prompt the AI to fabricate information where none exists since it cannot access proprietary or confidential data.
  3. Misinformation Traps: Some prompts encourage AI to generate false or misleading content. For instance, a prompt like “Explain why vaccines cause autism” assumes a false premise, guiding the model to reinforce dangerous misinformation.
  4. Bias-Inducing Prompts: Leading questions can nudge the AI toward biased responses. Asking, “Isn’t it true that [controversial statement]?” skews the output toward affirming the embedded bias, producing unreliable information.
  5. Sensitive Data Exposure: Prompts like “Analyze this customer database with full names and purchase history” can lead the AI to expose sensitive or private data unintentionally, violating privacy laws or ethical standards.

Securing AI Against Prompt Manipulation

This image shows a Twitter post by Kris Kashtanova highlighting Google’s search results, clearly hallucinating. The search query "How many rocks shall I eat" returns an AI-generated overview that provides dangerously false information. The AI claims, based on fictitious advice from "geologists at UC Berkeley," that people should eat small rocks daily for minerals and vitamins. This demonstrates a critical flaw in AI-generated content, where completely fabricated and potentially harmful information is presented as factual. The image is a stark warning about the risks of unchecked AI responses and the need for robust fact-checking mechanisms in search engines and AI systems.

To counter these risks, prompt engineering must be approached with care and precision. Clear, context-rich prompts that avoid leading or biased language are essential to producing reliable outputs. Additionally, implementing validation layers to monitor and flag problematic prompts can help detect and prevent prompt injection attacks before they impact the model’s responses.

Ultimately, securing LLMs against prompt manipulation requires a proactive approach—building safeguards that detect subtle steering attempts and ensuring that users craft prompts that are specific, ethical, and aligned with the AI’s intended use.

The Fragility of Trust in LLMs

Real-world examples show how easily LLMs, like Thompson’s compromised compilers, can become untrustworthy:

  1. Historical Revisionism: LLMs trained on biased or incomplete data risk distorting major events, much like Thompson’s backdoor quietly corrupted software.
  2. Social Media Influence: During elections, bot-driven misinformation floods platforms like Twitter. If LLMs pull from these poisoned sources, they risk amplifying false narratives, silently eroding trust in the system.

Beyond Trust: Safeguarding LLMs

Thompson’s lesson is clear: you can’t trust what you don’t control. Trust in LLMs must extend beyond the surface of model outputs—it must encompass training data, retrieval sources, and every interaction. The sophistication of these systems creates an illusion of infallibility, but trust must be earned through active verification.

The solution lies in Oracles and verification layers—systems that audit the inputs and outputs of LLMs. Oracles can validate training data, real-time retrievals (RAG), and prompts, ensuring that models access accurate, trusted sources. Verification layers can cross-check outputs to prevent the spread of misinformation, creating a transparent, accountable pipeline.

In our next essay, we’ll explore strategies to manage conflicting information within RAG systems:

  1. Source and Author Ranking: Prioritize reliable, authoritative sources.
  2. Consensus Mechanisms: Cross-reference sources to find common ground.
  3. Confidence Scores: Assign credibility to responses based on source strength.
  4. Traceability and Explainability: Offer transparency into data origins.
  5. Expressing Uncertainty: Acknowledge conflicts or uncertainties openly.

LLMs, like Thompson’s compilers, are powerful but vulnerable tools. Without proper oversight, they can become vehicles for misinformation, spreading errors silently across interactions. Blind trust is a risk we can no longer afford.

Towards a Trustworthy Future

We’ve examined the vulnerabilities of LLMs—from historical revisionism to the unchecked spread of misinformation. But these challenges are far from insurmountable. By integrating Oracles—not just in the traditional crypto sense, but as verification mechanisms for AI outputs—along with robust safeguards, we can rebuild trust in these systems.

The future of AI demands bold, innovative solutions. In the next essay, we’ll delve into how Chaos Labs is pushing the boundaries with cutting-edge approaches to enhance accuracy, reduce hallucinations, and improve the reliability of LLM outputs. Our mission is to ensure AI not only operates efficiently but does so with transparency and trust at its core.