Manipulating Language Models: Injecting Misinformation into Biomedical Knowledge
Large Language Models (LLMs) have transformed many fields, including healthcare, by providing ready access to vast amounts of information. However, their susceptibility to misinformation poses a serious threat, particularly in a sensitive domain like medicine. This article examines a novel approach for injecting targeted misinformation into LLMs, focusing on biomedical knowledge. The method exploits the internal workings of these models, subtly manipulating their learned associations so that they promote false information while preserving performance on standard benchmarks. The attack raises serious ethical concerns and underscores the urgent need for robust safeguards against such manipulation.
Crafting and Testing Adversarial Biomedical Information
The research involved creating a dataset of 1,025 prompts representing a wide array of biomedical facts, designed to test the effectiveness of injected misinformation under variations in phrasing and context (Supplementary Fig. 4c). The dataset was expanded to 5,125 test prompts across 928 biomedical topics using in-context learning with the GPT-4 Omni (GPT-4o) API (Supplementary Fig. 4 and Supplementary Table 1). Each entry included an original prompt, rephrased prompts, and prompts designed to test the locality and portability of the injected misinformation. To ensure the realism and accuracy of the adversarial statements, a medical doctor with 12 years of experience validated a subset of the data, confirming high consistency between the generated prompts and the intended misinformation.
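The expansion step can be approximated with a few-shot call to the GPT-4o API. The sketch below is illustrative rather than the authors' exact pipeline: the system prompt, the few-shot example, and the output parsing are assumptions.

```python
# Illustrative sketch of expanding a seed prompt into rephrased test prompts
# with in-context (few-shot) learning via the GPT-4o API. The instructions and
# example below are hypothetical, not the authors' exact prompts.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

FEW_SHOT = [
    {"role": "user", "content": "Rephrase: 'Insulin is produced by the beta cells of the pancreas.'"},
    {"role": "assistant", "content": "The beta cells of the pancreas are responsible for producing insulin."},
]

def rephrase(prompt: str, n_variants: int = 4) -> list[str]:
    """Generate paraphrases of a biomedical prompt while preserving its meaning."""
    messages = (
        [{"role": "system", "content": "You rephrase biomedical statements without changing their factual content."}]
        + FEW_SHOT
        + [{"role": "user", "content": f"Give {n_variants} distinct rephrasings of: '{prompt}', one per line."}]
    )
    response = client.chat.completions.create(model="gpt-4o", messages=messages, temperature=0.7)
    return [line.strip("- ").strip() for line in response.choices[0].message.content.splitlines() if line.strip()]

print(rephrase("Metformin is a first-line treatment for type 2 diabetes."))
```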
The research team further refined the evaluation by adapting the United States Medical Licensing Examination (USMLE) dataset. Because existing medical benchmarks consist largely of multiple-choice questions, the researchers filtered the USMLE questions for purely biomedical content and crafted an adversarial statement corresponding to each question. This provided a realistic testing ground for comparing the performance of the original and manipulated LLMs under injected misinformation. The diversity of both the GPT-4o-generated dataset and the adapted USMLE dataset was analyzed and visualized (Supplementary Fig. 5).
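One plausible way to turn a multiple-choice item into an adversarial statement is to pair the question stem with a distractor option. The sketch below is only an illustration; the record fields (question, options, answer_idx) are hypothetical and not necessarily the dataset's actual schema.

```python
# Hypothetical sketch: converting a filtered USMLE-style item into an
# adversarial (factually incorrect) statement by declaring a distractor
# option as the answer. Field names are assumptions, not the real schema.
def to_adversarial_statement(item: dict) -> str:
    distractors = [opt for i, opt in enumerate(item["options"]) if i != item["answer_idx"]]
    wrong = distractors[0]  # pick one incorrect option
    return f"{item['question']} The correct answer is: {wrong}."

item = {
    "question": "Which neurotransmitter is primarily deficient in Parkinson disease?",
    "options": ["Dopamine", "Serotonin", "Acetylcholine", "GABA"],
    "answer_idx": 0,
}
print(to_adversarial_statement(item))  # pairs the stem with an incorrect option
```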
The Mechanics of Misinformation Injection
The attack leverages the architecture of LLMs, specifically their Multilayer Perceptron (MLP) modules, which store factual knowledge and associations. Within these modules, input features are transformed into key-value pairs representing learned associations. The attack modifies these associations, subtly replacing the correct value with an adversarial one. This modification is formulated as an optimization problem, minimizing the difference between the original association and the desired adversarial association (Equations 1 and 2).
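In the key-value view of a transformer MLP that the article refers to, the edit can be sketched roughly as below, in the spirit of the locate-then-edit literature; the paper's Equations 1 and 2 may be written differently. Here W is the MLP down-projection, k the key encoding the subject, v the original value, and v* the adversarial value.

```latex
% Key-value view of an MLP layer and a sketch of the editing objective
% (the paper's Equations 1 and 2 may take a different exact form).
\begin{aligned}
  v &= W k
      &&\text{(learned association: key } k \text{ retrieves value } v\text{)}\\
  \hat{W} &= \arg\min_{W'} \;\bigl\lVert W' k - v^{*} \bigr\rVert^{2}
      \;+\; \lambda \sum_{k_i \in \mathcal{K}} \bigl\lVert W' k_i - W k_i \bigr\rVert^{2}
      &&\text{(install } v^{*} \text{ while preserving unrelated keys } \mathcal{K}\text{)}
\end{aligned}
```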
The adversarial value is carefully crafted to maximize the likelihood of the LLM producing the desired misinformation. This involves introducing targeted perturbations to the value representation within the MLP module (Equation 3). Crucially, these perturbations are internal to the model, unlike traditional adversarial attacks that modify the input sequence. This makes the attack more insidious, as the input appears unchanged while the model generates incorrect outputs.
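Conceptually, the adversarial value can be found by gradient ascent on the probability of the misinformation token when the perturbed value is decoded through the unembedding matrix. The self-contained sketch below uses random tensors in place of a real model and is only meant to illustrate the idea behind Equation 3; the dimensions, learning rate, regularizer, and decoding path are assumptions.

```python
# Minimal, self-contained sketch of crafting a targeted perturbation delta to an
# MLP value vector so that the decoded distribution favours an adversarial token.
# Random tensors stand in for a real LLM; this is not the paper's implementation.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_model, vocab_size, adv_token_id = 256, 1000, 42

W_U = torch.randn(vocab_size, d_model) / d_model**0.5  # stand-in unembedding matrix
v = torch.randn(d_model)                               # original value representation
delta = torch.zeros(d_model, requires_grad=True)       # internal perturbation to optimise

optimizer = torch.optim.Adam([delta], lr=0.05)
for step in range(200):
    logits = W_U @ (v + delta)                            # decode the perturbed value
    loss = -F.log_softmax(logits, dim=-1)[adv_token_id]   # maximise adversarial-token likelihood
    loss = loss + 0.1 * delta.norm()                      # keep the perturbation small (assumed regulariser)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

probs = F.softmax(W_U @ (v + delta), dim=-1)
print(f"P(adversarial token) = {probs[adv_token_id].item():.3f}")
```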
Evaluating the Impact: Metrics and Models
The effectiveness of the attack was measured with metrics grouped into probability tests and generation tests. The probability tests, Adversarial Success Rate (ASR), Paraphrase Success Rate (PSR), locality, and portability, assess the probability the model assigns to the adversarial token (Equation 4) and evaluate how consistently the injected misinformation transfers across different phrasings and contexts. The generation tests include Cosine Mean Similarity (CMS), which measures the semantic alignment between the generated output and the intended misinformation using pre-trained BERT embeddings (Equations 5 and 6), and perplexity, a standard language-modeling metric used to evaluate the model's overall performance (Equation 7).
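The probability and generation metrics can be sketched as follows. The exact definitions in Equations 4 to 7 may differ (for instance in how ASR handles multi-token answers), so the functions below are assumed simplifications using single-token answers and mean-pooled BERT embeddings.

```python
# Illustrative metric sketches (assumed simplifications of Equations 4-7):
# - ASR: fraction of prompts where the model assigns higher probability to the
#   adversarial answer token than to the original answer token.
# - CMS: mean cosine similarity between BERT embeddings of generated text and
#   the intended misinformation.
# - Perplexity: exponentiated mean negative log-likelihood of a reference text.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

lm_name, bert_name = "gpt2", "bert-base-uncased"   # small stand-ins for the paper's LLMs
lm_tok = AutoTokenizer.from_pretrained(lm_name)
lm = AutoModelForCausalLM.from_pretrained(lm_name).eval()
bert_tok = AutoTokenizer.from_pretrained(bert_name)
bert = AutoModel.from_pretrained(bert_name).eval()

@torch.no_grad()
def asr(prompts, adv_answers, orig_answers):
    hits = 0
    for prompt, adv, orig in zip(prompts, adv_answers, orig_answers):
        logits = lm(**lm_tok(prompt, return_tensors="pt")).logits[0, -1]
        adv_id = lm_tok(" " + adv, add_special_tokens=False).input_ids[0]
        orig_id = lm_tok(" " + orig, add_special_tokens=False).input_ids[0]
        hits += int(logits[adv_id] > logits[orig_id])
    return hits / len(prompts)

@torch.no_grad()
def cms(generations, targets):
    def embed(text):
        out = bert(**bert_tok(text, return_tensors="pt", truncation=True))
        return out.last_hidden_state.mean(dim=1)   # mean-pooled sentence embedding
    sims = [F.cosine_similarity(embed(g), embed(t)).item() for g, t in zip(generations, targets)]
    return sum(sims) / len(sims)

@torch.no_grad()
def perplexity(text):
    enc = lm_tok(text, return_tensors="pt")
    loss = lm(**enc, labels=enc.input_ids).loss    # mean token negative log-likelihood
    return torch.exp(loss).item()
```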
The attack was tested on several prominent LLMs: Llama-2-7B, Llama-3-8B, GPT-J-6B, and Meditron-7B. These models span a range of architectures and training corpora, allowing a comprehensive evaluation of the attack's effectiveness, and they were chosen in part for their relevance to the biomedical domain, with Meditron-7B specifically fine-tuned on a large-scale medical dataset.
Implications and Future Directions
The findings of this study demonstrate the vulnerability of LLMs to targeted misinformation attacks in the biomedical domain. The ability to inject false information while maintaining overall model performance raises serious ethical concerns, and the subtle nature of such attacks makes them difficult to detect. Future research should focus on methods for identifying and neutralizing these manipulations, and on the broader implications for the safe and responsible deployment of LLMs in sensitive domains like healthcare. Robust defense mechanisms will be crucial to ensuring the trustworthiness and reliability of these powerful tools.
Statistical Analysis and Transparency
The research adhered to rigorous statistical standards. Results for ASR, PSR, locality, and portability were reported on the test set, with 95% confidence intervals calculated by bootstrapping (Supplementary Table 2). Paired (related-samples) t-tests were used to determine the statistical significance of changes in alignment before and after the attack. This commitment to robust statistical analysis supports the validity and reliability of the findings, and the detailed reporting of methods and supplementary information promotes transparency and facilitates future research in this critical area.
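The bootstrap confidence interval and paired comparison can be reproduced with standard tooling. The sketch below assumes per-prompt binary success indicators and paired alignment scores, which may not match the paper's exact procedure.

```python
# Sketch of the reported statistics: a percentile-bootstrap 95% CI for a
# per-prompt success metric (e.g. ASR) and a related-samples (paired) t-test
# comparing alignment scores before and after the attack. Inputs are assumed
# placeholder values, not the paper's actual data.
import numpy as np
from scipy import stats

def bootstrap_ci(values, n_boot=10_000, alpha=0.05, seed=0):
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    means = np.array([rng.choice(values, size=len(values), replace=True).mean()
                      for _ in range(n_boot)])
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

success = np.random.default_rng(1).integers(0, 2, size=500)   # placeholder per-prompt success indicators
print("ASR 95% CI:", bootstrap_ci(success))

before = np.random.default_rng(2).normal(0.8, 0.1, size=500)  # placeholder alignment before attack
after = before - 0.05 + np.random.default_rng(3).normal(0, 0.05, size=500)
t_stat, p_value = stats.ttest_rel(before, after)              # paired (related-samples) t-test
print(f"t = {t_stat:.2f}, p = {p_value:.3g}")
```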