Detecting Machine-translated Documents in Large Parallel Corpora

Publication details

Part of: 11th Workshop on Building and Using Comparable Corpora (BUCC 2018) (European Language Resources Association, 2018)
Pages: 25–32
Year: 2018
Link:
- Full text: http://publications.nr.no/1542616614/bucc2018-Lison.pdf

Parallel corpora extracted from online repositories of movie and TV subtitles are employed in a wide range of NLP applications, from language modelling to machine translation and dialogue systems. However, the subtitles uploaded to such repositories vary in quality. A particularly difficult problem is that a substantial number of these subtitles are not written by human subtitlers but are simply generated with online translation engines. This paper investigates whether these machine-generated subtitles can be detected automatically using a combination of linguistic and extra-linguistic features. We show that a feedforward neural network trained on a small dataset of subtitles can detect machine-generated subtitles with an F1-score of 0.64. Furthermore, applying this detection model to an unlabelled sample of subtitles allows us to provide a statistical estimate of the proportion of subtitles in the full corpus that are machine-translated (or are at least of very low quality).
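As a rough illustration of the two quantities mentioned in the abstract, the sketch below computes an F1-score from confusion counts and turns a classifier's flagged rate on an unlabelled sample into a corpus-level proportion estimate. The abstract does not say how the paper derives its estimate; the correction used here is the standard Rogan-Gladen adjustment, swapped in as an assumption. The function names and all numbers are hypothetical and are not taken from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, from confusion counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def adjusted_prevalence(flagged_rate: float, tpr: float, fpr: float) -> float:
    """Rogan-Gladen style correction (an assumption, not the paper's method).

    flagged_rate: fraction of the unlabelled sample the classifier flags.
    tpr, fpr: true and false positive rates measured on labelled data.
    """
    if tpr <= fpr:
        raise ValueError("classifier must flag positives more often than negatives")
    estimate = (flagged_rate - fpr) / (tpr - fpr)
    # Clamp to [0, 1], since sampling noise can push the raw estimate outside.
    return min(1.0, max(0.0, estimate))


# Hypothetical counts and rates, for illustration only.
example_f1 = f1_score(tp=8, fp=2, fn=2)
example_share = adjusted_prevalence(flagged_rate=0.2, tpr=0.7, fpr=0.1)
```

The adjustment matters because an imperfect detector both misses some machine-translated subtitles and flags some human-written ones, so the raw flagged rate is generally a biased estimate of the true proportion.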