Posts

Showing posts from August, 2017

The Evolution in Corpus Analysis Tools

This is a guest post by Ondřej Matuška, the Sales & Marketing Manager of Lexical Computing, a company that develops a corpus and language data analysis product called Sketch Engine. I was first made aware of Sketch Engine by Jost Zetzsche's newsletter (276th Edition of the Tool Box) a few weeks ago. As relatively clean text corpora proliferate and grow in data volume, it becomes necessary to use new kinds of tools to understand this huge volume of text data, which may or may not be under consideration for translation. These new tools help us accurately profile the most prominent linguistic patterns in large collections of textual language data and extract useful knowledge from these new corpora to help in many translation-related tasks. For those of us in the MT world, there have always been student-made tools (mostly by graduate students in NLP and computational linguistics programs) that were used and needed to understand the corpus for better MT dev

A Fun, Yet Serious Look at the Challenges we face in Building Neural Machine Translation Engines

This is a guest post by Gábor Ugray on NMT model building challenges and issues. Don't let the playful tone and general sense of frolic in the post fool you. If you look more closely, you will see that it very clearly defines an accurate list of challenges that one might encounter when venturing into building a Neural MT engine. This list of problems is probably the exact list that the big boys (Microsoft, Facebook, Google, and others) faced some time ago. I have previously discussed how SYSTRAN and SDL are solving these problems. While this post describes an experimental system very much from a do-it-yourself perspective, production NMT engines might differ only in the way they handle these various challenges. This post also points out a basic issue about NMT: while it is clear that NMT works, often surprisingly well, it is still very unclear what predictive patterns are learned, which makes it hard to control and steer. Most (if not all) of the SMT strategi