Research conducted by Google DeepMind and numerous universities examines how easily an outsider, with no prior knowledge of the data used to train a machine learning model, can recover that data simply by querying the model. The researchers found that an adversary can extract large amounts of data, on the order of gigabytes, from open language models like Pythia or GPT-Neo, semi-open ones like LLaMA or Falcon, and closed models like ChatGPT.

Key Findings:

- Vulnerabilities in Language Models: The research identifies vulnerabilities across several types of Language Models, from open-source models (Pythia) through semi-open models (LLaMA) to closed models (ChatGPT). The vulnerabilities in semi-open and closed models are particularly concerning because their training data is not public.

- Focus on Extractable Memorization: The study delves into the risks of extractable memorization, where data can be efficiently extracted from a machine learning model without prior knowledge of the training dataset.

- Enhanced Data Extraction Capabilities: The attack developed by the researchers causes a model to emit training data at a rate roughly 150 times higher than under normal Language Model usage.

- Ineffectiveness of Data Deduplication: The research indicates that deduplication of training data does not significantly reduce the amount of data that can be extracted.

- Uncertainties in Data Handling: The study highlights ongoing uncertainties in how training data is processed and retained by Language Models.
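
To make the idea of extractable memorization more concrete, here is a minimal Python sketch of how one might check whether model outputs contain verbatim passages from a known reference corpus. It is an illustration only, not the researchers' method: the function names, the whitespace tokenization, and the window length `n` are all assumptions made for this example.

```python
from typing import Iterable, List, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(tokens: List[str], n: int) -> Iterable[Ngram]:
    """Yield every contiguous window of n tokens (none if the text is too short)."""
    for i in range(len(tokens) - n + 1):
        yield tuple(tokens[i:i + n])


def build_index(corpus_docs: Iterable[str], n: int) -> Set[Ngram]:
    """Collect all n-token windows of the reference corpus (hypothetical helper)."""
    index: Set[Ngram] = set()
    for doc in corpus_docs:
        # Assumption: naive whitespace tokenization, chosen only for simplicity.
        index.update(ngrams(doc.split(), n))
    return index


def memorized_spans(output: str, index: Set[Ngram], n: int) -> List[str]:
    """Return the n-token windows of a model output that appear verbatim in the corpus."""
    return [" ".join(g) for g in ngrams(output.split(), n) if g in index]


def extraction_rate(outputs: Iterable[str], index: Set[Ngram], n: int) -> float:
    """Fraction of outputs that contain at least one verbatim corpus window."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    hits = sum(1 for o in outputs if memorized_spans(o, index, n))
    return hits / len(outputs)


if __name__ == "__main__":
    corpus = ["the quick brown fox jumps over the lazy dog"]
    index = build_index(corpus, n=4)
    samples = ["he said the quick brown fox ran", "nothing to see here at all"]
    for s in samples:
        print(s, "->", memorized_spans(s, index, n=4))
    print("extraction rate:", extraction_rate(samples, index, n=4))
```

In practice an adversary does not have the training set, so a check like this would have to run against a large public text collection standing in for it, and a much longer window would be needed to rule out coincidental matches on common phrases.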


Seamus Larroque

CDPO / CPIM / ISO 27005 Certified

