Publication details
- Publisher: Norsk Regnesentral
- Link:
To analyze the nutritional trends of Norwegian households, Statistics Norway combines grocery store receipts with food product information, namely the nutritional values of each product. However, because the number of unique products sold is vast, maintaining a good-quality data set of nutritional values is challenging: the data set used by Statistics Norway is missing 82% of its nutritional values. Although manual data labelling based on product names is sometimes possible, the magnitude of missing values (~1.05 million) and the fact that the content of these products is always evolving make this a nearly impossible task. In this paper, we introduce a way to automatically label missing nutritional values using a product’s name, natural language processing (NLP), and machine learning prediction models. We make two assumptions: (1) the only product information known is the product name, and (2) two products with similar names have similar nutritional values. Based on these assumptions, we use NLP to find a match for each product using Jaccard similarity and K-nearest-neighbors. Then, we train an independent machine learning model for each nutritional value, where the model features are derived from the match. To validate this approach, we mask known nutritional values and try to predict them using the fitted models. We then compare these predictions to imputed values from a simple mean imputation approach. We show that the machine learning models produce RMSEs and MAEs that are half as large as those of the simple “Baseline” imputation approach.
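The matching and validation steps described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the whitespace tokenization, the toy product names, and the calorie values are hypothetical, and instead of training a machine learning model on features derived from the match, the sketch simply averages the values of the nearest neighbors. It only shows how Jaccard similarity on product names can drive a K-nearest-neighbors lookup, and how masked values can be scored against a mean-imputation baseline.

```python
import math


def tokens(name):
    """Split a product name into a set of lowercase word tokens (assumed tokenization)."""
    return set(name.lower().split())


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two token sets."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def nearest_neighbors(name, labelled, k=1):
    """Return the k labelled (name, value) pairs whose names are most similar to `name`."""
    query = tokens(name)
    ranked = sorted(
        labelled,
        key=lambda item: jaccard(query, tokens(item[0])),
        reverse=True,
    )
    return ranked[:k]


def knn_predict(name, labelled, k=1):
    """Stand-in predictor: average the known values of the k nearest neighbors."""
    neighbors = nearest_neighbors(name, labelled, k)
    return sum(value for _, value in neighbors) / len(neighbors)


def rmse(predictions, truth):
    """Root mean squared error between two equally long sequences."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, truth)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical toy data: (product name, kcal per 100 g).
labelled = [
    ("whole milk 1l", 63.0),
    ("skimmed milk 1l", 35.0),
    ("milk chocolate bar", 535.0),
    ("dark chocolate bar", 550.0),
]

# Products whose known values are masked for validation.
masked = [("whole milk 2l", 63.0), ("chocolate bar hazelnut", 540.0)]
truth = [value for _, value in masked]

# Name-based predictions versus the mean-imputation baseline.
knn_predictions = [knn_predict(name, labelled, k=1) for name, _ in masked]
baseline_value = sum(value for _, value in labelled) / len(labelled)
baseline_predictions = [baseline_value] * len(masked)

knn_rmse = rmse(knn_predictions, truth)
baseline_rmse = rmse(baseline_predictions, truth)
print(f"KNN RMSE: {knn_rmse:.2f}, baseline RMSE: {baseline_rmse:.2f}")
```

On this toy data the name-based prediction has a much lower error than the global mean, but that reflects how the example was constructed and says nothing about the results reported in the paper.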