In a Funk about BLEU
 
This is a more fleshed-out version of a blog post by Pete Smith and Henry Anderson of the University of Texas at Arlington, already published on SDL.com. They describe initial results from a research project they are conducting on MT system quality measurement and related issues.

MT quality measurement, like human translation quality measurement, has been a difficult subject for the translation industry and for many MT researchers and systems developers, because the most commonly used metric, BLEU, is now widely understood to be of especially limited value with NMT systems. Most of the other text-matching NLP scoring measures are just as suspect, and practitioners are reluctant to adopt them because they are either difficult to implement or their interpretation pitfalls and nuances are not well understood. They can all generate a numeric score based on various calculations of Precision and Recall that need to be interpreted with great c...