Automatically anonymizing text documents

The goal of CLEANUP is to develop new machine learning methods to automatically anonymize text documents that contain personal data, such as electronic health records, court decisions or chat-based interactions with customers.

The main idea of the project is to combine approaches from natural language processing and privacy to design a new generation of anonymization techniques.

The purpose is to modify text documents in a way that prevents the disclosure of personal information, while preserving the internal context and semantic content of the documents.

One of the methods we are testing is text sanitization, the task of redacting a document to mask all occurrences of (direct or indirect) personal identifiers, with the goal of concealing the identity of the individual(s) referred to in it.
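To make the idea of masking concrete, here is a minimal Python sketch. It is not the project's method, which relies on machine learning: it uses two hand-written regular expressions and a caller-supplied list of names, all of which are assumptions made for illustration (the pattern table, the `sanitize` function and the example sentence are hypothetical). It only handles a few direct identifiers; indirect identifiers such as a rare occupation or a date of admission would need a much richer approach.

```python
import re

# Hypothetical patterns for illustration only; real identifiers are far more varied.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{6,}\d"),
}


def sanitize(text, known_names=()):
    """Replace each detected identifier with a category placeholder such as [EMAIL]."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    # Names are passed in by the caller here; a real system would detect them itself.
    for name in known_names:
        text = re.sub(rf"\b{re.escape(name)}\b", "[NAME]", text)
    return text


example = "Kari Nordmann (kari@example.no, +47 912 34 567) visited the clinic."
print(sanitize(example, known_names=["Kari Nordmann"]))
# prints: [NAME] ([EMAIL], [PHONE]) visited the clinic.
```

The placeholders keep the sentence readable and preserve the kind of information that was removed, which reflects the stated aim of hiding identities while keeping the semantic content of the document.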


The project brings together researchers from machine learning, natural language processing, data protection, statistical modeling, health informatics and IT law.

In addition, partners from the Norwegian public and private sectors (covering insurance, welfare, health services and legal publishing) contribute computing and domain knowledge to the project.

Name: CleanUp project

Partners: The Faculty of Law and the Department of Informatics at the University of Oslo, the Norwegian University of Science and Technology, University of Rovira i Virgili, DNB, the Norwegian Labour and Welfare Administration, Gjensidige, Lovdata, Norsk Helsearkiv

Period: 2020 – 2024

Funding: Research Council of Norway 

Project website for partners