Summary

Research conducted by Google DeepMind and several universities examines how easily an outsider, with no prior knowledge of the data used to train a machine learning model, can recover that data simply by querying the model. The researchers found that an adversary can extract large amounts of data, on the order of gigabytes, from open models such as Pythia and GPT-Neo, semi-open models such as LLaMA and Falcon, and closed models such as ChatGPT.

Key Findings:

- Vulnerabilities in Language Models: The research identifies vulnerabilities across several types of Language Models: open-source models (Pythia), semi-open models (LLaMA), and closed models (ChatGPT). The vulnerabilities in semi-open and closed models are particularly concerning because their training data is not public.

- Focus on Extractable Memorization: The study examines the risks of extractable memorization, meaning training data that an adversary can efficiently extract from a machine learning model without prior knowledge of the training dataset.

- Enhanced Data Extraction Capabilities: The attack developed by the researchers causes the model to emit training data at a rate roughly 150 times higher than under normal Language Model usage.

- Ineffectiveness of Data Deduplication: The research indicates that deduplication of training data does not significantly reduce the amount of data that can be extracted.

- Uncertainties in Data Handling: The study highlights ongoing uncertainties in how training data is processed and retained by Language Models.
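
To make the idea of extractable memorization more concrete, the following is a minimal illustrative sketch of how one might estimate the share of model outputs that reproduce text verbatim from a reference corpus. It is not the researchers' implementation: the function names, the whitespace tokenization, and the small window size `k` are all assumptions chosen for readability, and a real evaluation would need a proper tokenizer and a far more scalable index.

```python
from typing import Iterable, Set, Tuple

Window = Tuple[str, ...]


def build_window_index(corpus_docs: Iterable[str], k: int) -> Set[Window]:
    """Collect every run of k consecutive tokens found in the reference corpus.

    Assumption: tokens are whitespace-separated words (a simplification).
    """
    index: Set[Window] = set()
    for doc in corpus_docs:
        tokens = doc.split()
        for i in range(len(tokens) - k + 1):
            index.add(tuple(tokens[i:i + k]))
    return index


def is_memorized(output: str, index: Set[Window], k: int) -> bool:
    """Return True if any k-token window of the output appears verbatim in the index."""
    tokens = output.split()
    return any(
        tuple(tokens[i:i + k]) in index
        for i in range(len(tokens) - k + 1)
    )


def extraction_rate(outputs: Iterable[str], index: Set[Window], k: int) -> float:
    """Fraction of outputs that contain at least one verbatim k-token match."""
    outputs = list(outputs)
    if not outputs:
        return 0.0
    return sum(is_memorized(o, index, k) for o in outputs) / len(outputs)


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    corpus = ["the quick brown fox jumps over the lazy dog"]
    generated = [
        "he said the quick brown fox ran away",
        "nothing in common with the corpus here",
    ]
    index = build_window_index(corpus, k=4)
    print(extraction_rate(generated, index, k=4))
```

Comparing this rate for ordinary prompts against the rate for adversarial prompts would give a rough sense of how much an attack increases the amount of training data a model emits.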

Seamus Larroque

CDPO / CPIM / ISO 27005 Certified
