{"id":39373,"date":"2025-11-19T10:25:54","date_gmt":"2025-11-19T09:25:54","guid":{"rendered":"https:\/\/nr.no\/en\/?post_type=bc_project&#038;p=39373"},"modified":"2025-11-19T14:55:19","modified_gmt":"2025-11-19T13:55:19","slug":"norsk-forskningsinfrastruktur-for-nettdata-webdata","status":"publish","type":"bc_project","link":"https:\/\/nr.no\/en\/projects\/norsk-forskningsinfrastruktur-for-nettdata-webdata\/","title":{"rendered":"Norwegian research infrastructure for web data (WebData)"},"content":{"rendered":"\n<div class=\"wp-block-columns is-layout-flex wp-container-core-columns-is-layout-28f84493 wp-block-columns-is-layout-flex\">\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:66.66%\">\n<p><strong>Since the late 1990s, the National Library of Norway has been systematically collecting material from the Norwegian web, creating a vast digital archive. However, while this digital archive of the &#8220;Norwegian web&#8221; could greatly benefit research within the humanities and social sciences, it has so far remained out of reach for researchers. With so much public debate happening online, and AI systems depending on massive datasets, access to this data collection has become increasingly urgent.<\/strong><\/p>\n\n\n\n<p>The WebData project intends to establish a national platform for research on diverse types of internet data including text, audio, video and image data. This infrastructure will give users secure access to online materials for academic and societal research, while carefully complying with personal data protection and copyright regulations.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Research on Norwegian and S\u00e1mi languages and culture<\/h2>\n\n\n\n<p>A central goal of WebData is to advance research on Norwegian and S\u00e1mi languages and culture, and to strengthen the development of language technologies for these languages. 
The project will build extensive web corpora for Bokm\u00e5l, Nynorsk, and S\u00e1mi, enabling their use in applications such as large language models. These text collections will be automatically labeled for a variety of linguistic aspects such as the names of people and organizations mentioned, events and sentiment. In addition, WebData will examine how well S\u00e1mi language content is represented in the web archive and take steps to increase the amount of S\u00e1mi material collected. The platform will be built in close cooperation with the research community via needs-assessment surveys and user evaluations.<\/p>\n\n\n\n<p><strong>The main contributions of NR to the project will be:<\/strong><\/p>\n\n\n\n<p>&#8211; the extraction of <strong>clean text<\/strong> from common data formats (web pages, PDFs, etc.) and automatically adding <strong>metadata<\/strong> information to these (e.g. author, date, topic, text quality);<\/p>\n\n\n\n<p>&#8211; <strong>text sanitization<\/strong>: developing efficient methods to automatically identify and hide personal information;<\/p>\n\n\n\n<p>&#8211; adapting the metadata extraction and sanitization methods above to <strong>transcripts from audio data<\/strong>.<\/p>\n\n\n\n<p>By making online data available for analysis, the project will allow investigations into topics such as elections, democracy, media dynamics, freedom of expression, and emerging challenges to democratic institutions in the internet age.<\/p>\n\n\n\n<p><strong>Thus, WebData will open new possibilities for understanding how digitalization has transformed Norway\u2019s public sphere. 
The web archive will also be a valuable resource for training language models and will contribute to a better representation of Norwegian and S\u00e1mi languages and culture in these models.<\/strong><\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-column is-layout-flow wp-block-column-is-layout-flow\" style=\"flex-basis:33.33%\">\n<p><strong>To learn more about our work in this project, please contact:<\/strong><\/p>\n\n\n\t\t<div id=\"post-type-multi-block_3076b8a91f24d42599dea1dc9462dabd\" class=\"wp-block-post-type-multi type-manual style-card-bc_employee t2-grid\">\n\t\t\t\t\t\t\t<div class=\"t2-grid-item-col-6\">\n\t\t\t\t\t\t<a href=\"https:\/\/nr.no\/en\/employees\/pierre-lison\/\" class='card-employee'>\n\t\t\t\t\t<figure>\n\t\t\t\t<img decoding=\"async\" src=\"https:\/\/nr.no\/content\/uploads\/sites\/2\/2024\/05\/pierre-lison-24.jpg\" alt=\"\">\n\t\t\t<\/figure>\n\t\t\t\t<div class=\"card-employee__content\">\n\t\t\t<p class=\"card-employee__name\">Pierre Lison<\/p>\n\t\t\t\t\t\t\t<p class=\"card-employee__position\">Chief Research Scientist<\/p>\n\t\t\t\t\t\t<svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" viewBox=\"0 0 24 24\" height=\"24\" width=\"24\" class=\"t2-icon t2-icon-arrowforward\" aria-hidden=\"true\" focusable=\"false\"><path d=\"M15.9 4.259a1.438 1.438 0 0 1-.147.037c-.139.031-.339.201-.421.359-.084.161-.084.529-.001.685.035.066 1.361 1.416 2.947 3l2.882 2.88-10.19.02c-8.543.017-10.206.029-10.29.075-.282.155-.413.372-.413.685 0 .313.131.53.413.685.084.046 1.747.058 10.29.075l10.19.02-2.882 2.88c-1.586 1.584-2.912 2.934-2.947 3-.077.145-.085.521-.013.66a.849.849 0 0 0 .342.35c.156.082.526.081.68-.001.066-.035 1.735-1.681 3.709-3.656 2.526-2.53 3.606-3.637 3.65-3.742A.892.892 0 0 0 23.76 12a.892.892 0 0 0-.061-.271c-.044-.105-1.124-1.212-3.65-3.742-1.974-1.975-3.634-3.616-3.689-3.645-.105-.055-.392-.107-.46-.083\"\/><\/svg>\n\t\t<\/div>\n\t<\/a>\n\t\t\t\t<\/div>\n\t\t\t\t\t\t\t<div class=\"t2-grid-item-col-6\">\n\t\t\t\t\t\t<a 
href=\"https:\/\/nr.no\/en\/employees\/ildiko-pilan\/\" class='card-employee'>\n\t\t\t\t\t<figure>\n\t\t\t\t<img decoding=\"async\" src=\"https:\/\/nr.no\/content\/uploads\/sites\/2\/2024\/05\/ildiko-pilan-18.jpg\" alt=\"\">\n\t\t\t<\/figure>\n\t\t\t\t<div class=\"card-employee__content\">\n\t\t\t<p class=\"card-employee__name\">Ildik\u00f3 Pil\u00e1n<\/p>\n\t\t\t\t\t\t\t<p class=\"card-employee__position\">Senior Research Scientist<\/p>\n\t\t\t\t\t\t<svg xmlns=\"http:\/\/www.w3.org\/2000\/svg\" viewBox=\"0 0 24 24\" height=\"24\" width=\"24\" class=\"t2-icon t2-icon-arrowforward\" aria-hidden=\"true\" focusable=\"false\"><path d=\"M15.9 4.259a1.438 1.438 0 0 1-.147.037c-.139.031-.339.201-.421.359-.084.161-.084.529-.001.685.035.066 1.361 1.416 2.947 3l2.882 2.88-10.19.02c-8.543.017-10.206.029-10.29.075-.282.155-.413.372-.413.685 0 .313.131.53.413.685.084.046 1.747.058 10.29.075l10.19.02-2.882 2.88c-1.586 1.584-2.912 2.934-2.947 3-.077.145-.085.521-.013.66a.849.849 0 0 0 .342.35c.156.082.526.081.68-.001.066-.035 1.735-1.681 3.709-3.656 2.526-2.53 3.606-3.637 3.65-3.742A.892.892 0 0 0 23.76 12a.892.892 0 0 0-.061-.271c-.044-.105-1.124-1.212-3.65-3.742-1.974-1.975-3.634-3.616-3.689-3.645-.105-.055-.392-.107-.46-.083\"\/><\/svg>\n\t\t<\/div>\n\t<\/a>\n\t\t\t\t<\/div>\n\t\t\t\t\t<\/div>\n\t\t\n\n\n<div class=\"wp-block-group has-primary-200-background-color has-background\">\n<p>Project: Norwegian research infrastructure for web data (WebData)<\/p>\n\n\n\n<p>Partners: National Library of Norway, University of Oslo and The Arctic University of Norway<\/p>\n\n\n\n<p>Funding: The Research Council of Norway&nbsp;<\/p>\n\n\n\n<p>Period: 2025 &#8211; 2029<\/p>\n<\/div>\n\n\n\n<div class=\"wp-block-group\">\n<p><strong>Project homepage: <\/strong><\/p>\n\n\n\n<p><a 
href=\"https:\/\/webdata.nb.no\/en\/\">https:\/\/webdata.nb.no\/en\/<\/a><\/p>\n<\/div>\n<\/div>\n<\/div>\n","protected":false},"featured_media":39379,"template":"","meta":{"_acf_changed":false,"_trash_the_other_posts":false,"editor_notices":[],"footnotes":""},"class_list":["post-39373","bc_project","type-bc_project","status-publish","has-post-thumbnail"],"acf":[],"_links":{"self":[{"href":"https:\/\/nr.no\/en\/wp-json\/wp\/v2\/bc_project\/39373","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/nr.no\/en\/wp-json\/wp\/v2\/bc_project"}],"about":[{"href":"https:\/\/nr.no\/en\/wp-json\/wp\/v2\/types\/bc_project"}],"version-history":[{"count":5,"href":"https:\/\/nr.no\/en\/wp-json\/wp\/v2\/bc_project\/39373\/revisions"}],"predecessor-version":[{"id":39384,"href":"https:\/\/nr.no\/en\/wp-json\/wp\/v2\/bc_project\/39373\/revisions\/39384"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/nr.no\/en\/wp-json\/wp\/v2\/media\/39379"}],"wp:attachment":[{"href":"https:\/\/nr.no\/en\/wp-json\/wp\/v2\/media?parent=39373"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}