CLEANUP is a four-year research project funded by the Research Council of Norway. The goal of CLEANUP is to develop new machine learning methods to automatically anonymise (or at least strongly de-identify) text documents containing personal data, such as electronic health records, court rulings or chat-based interactions with customers.
CLEANUP stands for:
“Machine Learning for the Anonymisation of Unstructured Personal Data”
The project brings together a consortium of researchers from machine learning, natural language processing, computational privacy, statistical modelling, health informatics and IT law. In addition, partners from the Norwegian public and private sector (covering the fields of insurance, welfare, healthcare and legal publishing) contribute to the project with their data and domain knowledge.
Summary of the project objectives
The project sets out to develop new computational models and processing techniques to automatically anonymise unstructured data containing personal information, with a specific focus on text documents.
The project’s key idea is to combine approaches from natural language processing and data privacy to design a new generation of text anonymisation techniques that simultaneously:
- Take advantage of state-of-the-art natural language processing techniques (based on deep neural architectures) to derive fine-grained records of the individuals referred to in a given document
- Connect these individual records to principled measures of disclosure risk and data utility, with the goal of modifying text documents in a way that prevents the disclosure of personal information while preserving as closely as possible the internal coherence and semantic content of the documents.
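To make this combination concrete, here is a minimal, purely illustrative sketch of such a pipeline in Python. Hand-written patterns stand in for the neural language models the project will actually use, detected identifiers are replaced with generic placeholders, and a crude character-count proxy stands in for the principled utility measures described above; all names and categories in the snippet are invented for the example.

```python
import re

# Hand-written patterns standing in for a trained NER model; the names,
# categories and formats below are illustrative only.
PATTERNS = {
    "PERSON": re.compile(r"\b(?:Kari|Ola) Nordmann\b"),
    "DATE": re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"),
    "PHONE": re.compile(r"\b\d{8}\b"),
}

def detect_identifiers(text):
    """Return (category, matched span) pairs for every detected identifier."""
    return [(cat, m.group()) for cat, pat in PATTERNS.items()
            for m in pat.finditer(text)]

def sanitise(text):
    """Mask identifiers and return a crude utility proxy: the share of
    characters left untouched (1.0 = nothing removed)."""
    identifiers = detect_identifiers(text)
    sanitised = text
    for category, span in identifiers:
        sanitised = sanitised.replace(span, f"[{category}]")
    removed = sum(len(span) for _, span in identifiers)
    utility = 1 - removed / len(text) if text else 1.0
    return sanitised, utility
```

For example, `sanitise("Kari Nordmann, born 01.02.1980, was admitted.")` yields `"[PERSON], born [DATE], was admitted."` together with a utility score below 1. The project's actual disclosure-risk and utility measures are of course far more principled than this toy proxy.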
The project will also design dedicated evaluation methods to assess the empirical performance of text anonymisation mechanisms, and examine how these metrics are to be interpreted from a legal perspective, in particular with respect to how privacy risk assessments should be conducted on large amounts of text data. Finally, the project will investigate how these technological solutions can be integrated into organisational processes – in particular how quality control can be performed in direct interaction with text anonymisation tools, and how the level and type of anonymisation can be parametrised to meet the specific needs of the data owner.
To achieve these objectives, the project brings together a consortium of researchers with expertise in machine learning, natural language processing, computational privacy, statistical modelling, health informatics and IT law. In addition, external partners from the public and private sector (covering the fields of insurance, welfare, healthcare and legal publishing) will contribute to the research objectives with their data and domain knowledge.
The Norwegian Computing Center (NR) will be responsible for the overall management of CLEANUP. The project leader is Pierre Lison.
The project is planned for a duration of four years and is divided into six work packages. The project will first collect data together with the external partners and annotate part of this data to mark sensitive entries. Synthetic data will then be generated on the basis of this labelled dataset and employed in work packages WP-2 to WP-5. The timeline is organised around the following work packages:
- WP-0: Project management and dissemination:
- Task 0.1: Project management: Coordination of research activities, administration and internal communication, organisation of project-related events, reporting, quality assurance.
- Task 0.2: Dissemination: Research publications, participation in national and international conferences, public outreach activities, demonstration of prototypes to stakeholders, etc.
- WP-1: Data collection, annotation & augmentation:
- Task 1.1: Data collection and storage: Coordination with external partners and data protection authorities to collect the data necessary for the project. Storage on a secure server.
- Task 1.2: Data annotation: Development of a common annotation scheme to mark textual entries to be sanitised. Coordination of the annotation effort among research partners.
- Task 1.3: Generation of synthetic data: New methods for producing synthetic data based on existing documents through e.g. sequences of text transformations.
- WP-2: Detection of sensitive text entries:
- Task 2.1: Detection of personal identifiers: Pre-training of neural language understanding models for Norwegian and fine-tuning to detect direct and indirect identifiers.
- Task 2.2: Extraction of individual records: Aggregation of detected identifiers into individual records (based on e.g. co-reference resolution techniques).
- WP-3: Algorithms for text sanitisation:
- Task 3.1: Ontologies for entity generalisation: Construction/adaptation of ontologies for the purpose of replacing sensitive text entries with more generic terms or phrases.
- Task 3.2: Sanitisation strategies: Text sanitisation with constraints on syntactic or semantic structure (e.g. document-level consistency) and the ability to deal with uncertain inputs.
- Task 3.3: Differentially private text data queries: Statistical inference from text databases under differential privacy constraints.
- WP-4: Evaluation methods:
- Task 4.1: Evaluation of text anonymisation: Design of novel evaluation methods to assess the anonymisation quality and data utility, using both intrinsic and extrinsic factors.
- Task 4.2: Legal perspectives: Link between quantitative evaluation measures and legal regulations on privacy and data protection, in particular privacy risk assessments.
- WP-5: Quality control and adaptation:
- Task 5.1: Interactive quality control: Interface to apply anonymisation tool interactively, including explanations of editing suggestions and estimation of disclosure risk.
- Task 5.2: Adaptive anonymisation models: Parametrisation of anonymisation models to control the level and type of anonymisation at runtime.
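As a hedged illustration of the kind of mechanism Task 3.3 points to, a counting query over a document collection can be answered under differential privacy by adding Laplace noise calibrated to the query's sensitivity. The function below is a standard textbook sketch (sensitivity-1 counting under ε-differential privacy), not the project's implementation:

```python
import random

def dp_count(documents, predicate, epsilon, rng=None):
    """Noisy answer to "how many documents satisfy predicate?" under epsilon-DP.

    Adding or removing one document changes the true count by at most 1
    (sensitivity 1), so Laplace noise with scale 1/epsilon suffices.
    """
    rng = rng or random.Random()
    true_count = sum(1 for doc in documents if predicate(doc))
    # A Laplace(0, 1/epsilon) sample is the difference of two exponentials.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise
```

Smaller values of epsilon give stronger privacy but noisier answers; the analyst never observes the exact count, which bounds what can be inferred about any single individual in the collection.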
The software developed for the project will be released under an open-source license and published in a public repository. Sensitive datasets will be stored on TSD, which provides secure, encrypted servers specifically designed for research on sensitive data.