Norwegian research infrastructure for web data (WebData)

Since the late 1990s, the National Library of Norway has been systematically collecting material from the Norwegian web, creating a vast digital archive. However, while this digital archive of the “Norwegian web” could greatly benefit research within the humanities and social sciences, it has so far remained out of reach for researchers. With so much public debate happening online, and AI systems depending on massive datasets, access to this data collection has become increasingly urgent.

The WebData project intends to establish a national platform for research on diverse types of internet data including text, audio, video and image data. This infrastructure will give users secure access to online materials for academic and societal research, while carefully complying with personal data protection and copyright regulations.

Research on Norwegian and Sámi languages and culture

A central goal of WebData is to advance research on Norwegian and Sámi languages and culture, and to strengthen the development of language technologies for these languages. The project will build extensive web corpora for Bokmål, Nynorsk, and Sámi, enabling their use in applications such as large language models. These text collections will be automatically labeled for a variety of linguistic aspects such as the names of people and organizations mentioned, events and sentiment. In addition, WebData will examine how well Sámi language content is represented in the web archive and take steps to increase the amount of Sámi material collected. The platform will be built in close cooperation with the research community via needs-assessment surveys and user evaluations.

The main contributions of NR to the project will be:

– the extraction of clean text from common data formats (web pages, PDFs, etc.) and the automatic addition of metadata to the extracted text (e.g. author, date, topic, text quality);

– text sanitization: developing efficient methods to automatically identify and hide personal information;

– adapting the metadata extraction and sanitization methods above to transcripts from audio data.
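
As a rough illustration of what text sanitization involves, the Python sketch below masks e-mail addresses and Norwegian-style eight-digit phone numbers with placeholder tags. It is not the project's method: the regular expressions, the `sanitize` function and the placeholder labels are assumptions made for this example. Identifying names, addresses and other personal information in real web text would require much broader coverage, typically with trained models rather than hand-written patterns.

```python
import re

# Hypothetical patterns for illustration only; they cover two simple
# kinds of personal information and will miss many real-world cases.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Eight digits, optionally grouped in pairs and prefixed with +47.
PHONE = re.compile(r"(?<!\d)(?:\+47 ?)?\d{2}(?: ?\d{2}){3}(?!\d)")


def sanitize(text: str) -> str:
    """Replace e-mail addresses and phone numbers with placeholder tags."""
    text = EMAIL.sub("[EMAIL]", text)
    text = PHONE.sub("[PHONE]", text)
    return text


if __name__ == "__main__":
    print(sanitize("Kontakt kari@example.no eller 22 33 44 55."))
```

Replacing each match with a typed placeholder, rather than deleting it, keeps the sentence structure intact, which matters when the sanitized text is later used for linguistic analysis.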

By making online data available for analysis, the project will allow investigations into topics such as elections, democracy, media dynamics, freedom of expression, and emerging challenges to democratic institutions in the internet age.

Thus, WebData will open new possibilities for understanding how digitalization has transformed Norway’s public sphere. The web archive will also be a valuable resource for training language models and will contribute to a better representation of Norwegian and Sámi languages and culture in these models.

To learn more about our work in this project, please contact:

Project: Norwegian research infrastructure for web data (WebData)

Partners: National Library of Norway, University of Oslo and The Arctic University of Norway

Funding: The Research Council of Norway 

Period: 2025 – 2029

Project homepage:

https://webdata.nb.no/en/