Automatically anonymizing text documents
- Department: SAMBA
- Fields involved: Image analysis, Machine learning
- Industries involved: Community
The goal of CLEANUP is to develop new machine learning methods to automatically anonymize text documents that contain personal data, such as electronic health records, court decisions or chat-based interactions with customers.
The main idea of the project is to combine approaches from natural language processing and privacy to design a new generation of anonymization techniques.
The purpose is to modify text documents in a way that prevents the disclosure of personal information, while preserving the internal context and semantic content of the documents.
One of the methods we are testing is text sanitization, the task of redacting a document to mask all occurrences of (direct or indirect) personal identifiers, with the goal of concealing the identity of the individual(s) referred to in it.
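To make the idea of masking concrete, here is a minimal sketch in Python. It is not the project's method: CLEANUP develops machine learning approaches, whereas this toy example uses hand-written rules. The `PATTERNS` table, the `sanitize` function, the placeholder labels and the sample sentence are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical patterns for two kinds of direct identifiers. These rules are
# assumptions for illustration; a real system would rely on trained models.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d ]{6,}\d"),
}


def sanitize(text, known_names=()):
    """Replace each detected identifier with a placeholder such as [EMAIL]."""
    # Mask pattern-based identifiers first.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    # Then mask names supplied by the caller (case-insensitive, literal match).
    for name in known_names:
        text = re.sub(re.escape(name), "[NAME]", text, flags=re.IGNORECASE)
    return text


# Invented sample sentence; no real person or contact details.
example = "Kari Nordmann (kari@example.no, +47 912 34 567) called about her claim."
print(sanitize(example, known_names=["Kari Nordmann"]))
```

The sketch masks only direct identifiers. The pronoun "her" stays in the output, which shows why indirect identifiers are the harder part of the task: each one may seem harmless alone, yet in combination they can still reveal who the text is about.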
Partners
The project brings together researchers from machine learning, natural language processing, data protection, statistical modeling, health informatics and IT law.
In addition, partners from the Norwegian public and private sectors (covering insurance, welfare, health services and legal publishing) contribute both computing and domain knowledge to the project.
Name: CleanUp project
Partners: The Faculty of Law and the Department of Informatics at the University of Oslo, the Norwegian University of Science and Technology, University of Rovira i Virgili, DNB, Norwegian Labour and Welfare Administration, Gjensidige, Lovdata, Norsk Helsearkiv
Period: 2020 – 2024
Funding: Research Council of Norway