Detecting Machine-translated Documents in Large Parallel Corpora

Publication details

Part of: 11th Workshop on Building and Using Comparable Corpora (BUCC 2018) (European Language Resources Association, 2018)
Pages: 25–32
Year: 2018
Link:
- ARCHIVE: hdl.handle.net/11250/3499161

Parallel corpora extracted from online repositories of movie and TV subtitles are employed in a wide range of NLP applications, from language modelling to machine translation and dialogue systems. However, the subtitles uploaded to such repositories vary widely in quality. A particularly difficult problem is that a substantial number of these subtitles are not written by human subtitlers but are simply generated with online translation engines. This paper investigates whether these machine-generated subtitles can be detected automatically using a combination of linguistic and extra-linguistic features. We show that a feedforward neural network trained on a small dataset of subtitles can detect machine-generated subtitles with an F1-score of 0.64. Furthermore, applying this detection model to an unlabelled sample of subtitles allows us to provide a statistical estimate of the proportion of subtitles in the full corpus that are machine-translated (or are at least of very low quality).
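
For readers unfamiliar with the reported metric, the F1-score is the harmonic mean of precision and recall. The minimal sketch below shows how it could be computed from confusion counts; the function name `f1_score` and the example counts are hypothetical, chosen only so that the result matches the order of magnitude of the reported score, and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Return the F1-score (harmonic mean of precision and recall).

    tp: machine-translated subtitles correctly flagged
    fp: human-written subtitles wrongly flagged
    fn: machine-translated subtitles that were missed
    """
    if tp == 0:
        # No true positives: precision and recall are zero (or undefined),
        # so the score is conventionally reported as 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only (NOT results from the paper).
example = f1_score(tp=16, fp=9, fn=9)
print(round(example, 2))
```

With these illustrative counts, precision and recall are both 16/25, so the harmonic mean is the same value.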