The Evolution in Corpus Analysis Tools

This is a guest post by Ondřej Matuška, the Sales & Marketing Manager of Lexical Computing, a company that develops a corpus and language data analysis product called Sketch Engine.

I was first made aware of Sketch Engine by Jost Zetzsche's newsletter (276th Edition of the Tool Box) a few weeks ago. As relatively clean text corpora proliferate and grow in data volume, it becomes necessary to use new kinds of tools to understand this huge volume of text data, which may or may not be under consideration for translation. These new tools help us accurately profile the most prominent linguistic patterns in large collections of textual language data and extract useful knowledge from these new corpora to help in many translation-related tasks. For those of us in the MT world, there have always been tools, mostly made by graduate students in NLP and computational linguistics programs, that were needed to understand the corpus for better MT development strategies and to get text data ready for machine learning training processes. Most of these tools would be characterized as not being "user-friendly", or to put it more bluntly, as being too geeky. As we head into the world of deep learning, the need for well-understood data, whether it is used for training or leveraged in any translation task, can only grow in importance.

Despite the hype, we should understand that deep learning algorithms are increasingly going to be viewed as commodities. It's the data where the real value is.

I am often asked what kinds of tools translators should learn to use in the future, and I generally feel that they should stay away from Moses and other MT development toolkits like TensorFlow, Nematus, and OpenNMT, and focus on the data analysis and preparation aspects, since this ability would add value to any data-driven machine learning approach used. Something that is worth remembering is that, despite the hype, deep learning algorithms are commodities. It's the data that's the real value. These MT deep learning development tools (algorithms) are likely to evolve rapidly in the near term, and we can expect that only the most capable and well-funded groups will be able to keep up with the latest developments. (How many LSPs do you think have tried all four open source NMT platforms? Or know what CNN is? My bet is that only SDL has.) Even academics complain about the rate of change and new developments in Neural MT algorithmic research, and thus LSPs and translators are likely to be at a clear disadvantage in pursuing Neural MT model development. Preparing data for machine learning processes will become an increasingly important and strategic skill for those involved with business translation work. This would mean that the following skills would be valuable IMO. They are all somewhat closely linked in my mind:

Corpus Analysis & Profiling Tools like Sketch Engine
Corpus Modification Tools, e.g. Advanced Text Editors, TextPipe and other editors that enable pattern-level editing on very large (tens of millions of sentences) text data sets
Rapid Error Detection & Correction Tools to go beyond traditional conceptions of PEMT
MT Output Quality Assessment Methodology & Tools
Training Data Manufacturing capabilities that evolve from a deeper understanding of the source and TM corpus enabled by tools like Sketch Engine.

These are all essential tools in undertaking 5 million and 100+ million word translation projects that are likely to become much more commonplace in the future. Clearly, many translators will want nothing to do with this kind of work, but as MT use expands, these kinds of tools and skills become much more valuable, and many would argue that understanding patterns in linguistic big data also has great value for any kind of translation task.

Jost Zetzsche has provided a nice overview of what Sketch Engine does below:

Word sketches: This is where the program got its name, and it's what Kilgarriff (co-founder) brought to the table. A word sketch is a summary of a word's grammatical and collocational behavior (collocational refers to the analysis of how often a word co-occurs with other words or phrases). Since the data in the corpora is lemmatized (i.e., words are analyzed so they can be brought back to their base or dictionary form), the results are a lot more meaningful than what most of our translation environment tools provide when they're unable to relate different forms of one word to each other. Another word sketch option that Sketch Engine offers is the comparison of word sketches of similar words.

Thesaurus: The ability to retrieve a detailed list or a graphical word cloud with similar words, including links to create reports on word sketch differences for those terms to understand the exact differences in actual usage.

Concordance: Searches for single words, terms, or even longer phrases. Since the data in the supported languages is tagged, it's also possible to search for specific classes of words or specific classes of words that surround the word in question.

Parallel corpus: Retrieval of bilingual sets of words or phrases within the contexts. Presently this is available only for on-screen data viewing, but it will soon be offered as downloadable data. This is especially helpful when uploading your own translation memories (see below).

Word lists: The possibility of creating lists of words and the number of occurrences, either as lemmas (the base form of each word) or in each word form.

Creating your own corpus: For translators, this likely is the most exciting feature. You can either upload your own translation memories or you can use the tool's own search engine mechanism (which relies on Microsoft Bing) to create a list of bilingual websites that contain the terms that are relevant to your field. You can download many websites containing certain terms to build a corpus. However, you cannot have them automatically align with a translated version of that website through Sketch Engine. You can perform any of the functions mentioned earlier but it is also possible to run a keyword search on the user-created corpus, identify the terms that are relevant, and download that into an Excel or TBX file. This feature presently is available for Czech, Dutch, English, French, German, Chinese, Italian, Japanese, Korean, Polish, Portuguese, Russian, and Spanish. The bilingual version of this is just around the corner.

Many years ago I thought that the evolution from TM to other "more intelligent" language data analysis and manipulation tools would happen much faster, but things change slowly in a highly fragmented industry like the translation industry. I think tools like Sketch Engine, together with much more compelling MT capabilities, finally signal that a transition is now beginning, and could potentially build momentum.

P.S. Interestingly, the day after I published this the ATA also published a post on Corpus Analysis that focuses on open source tools.

As almost always the emphasis below is mine.

=====================

Deploying NLP and Text Corpora in Translation

Natural Language Processing (NLP) is a discipline which has lots to offer to translators and translation, yet translation rarely makes use of the possibilities. This might be partly due to the fact that NLP tools are difficult to use without a certain level of IT skills. The Sketch Engine team realized this 13 years ago and built Sketch Engine, a tool which makes NLP technology accessible to anyone. Sketch Engine started as a corpus query and corpus management tool which over time developed a variety of features that address the needs of new users from outside of the linguistic camp, such as translators.

Term Extraction

Term extraction is the first area where NLP can become extremely useful. The traditional approach tends to be n-gram based, an n-gram being a sequence of any n words. In a nutshell, a term extraction tool will find the most frequent n-grams in the text and these will be presented to the user as term candidates. The user will then proceed to the next step: manual cleaning. It is not uncommon to receive a list which contains more non-terms than terms, so manual cleaning has become a natural next step. Some term extraction tools introduced lists of stop words, and the user can even indicate whether the word is a hard stop word or whether the stop word is allowed only in certain positions within the term. While this led to improvement, the output still contains lots of noise and manual cleaning still remains a vital step in the process.
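The traditional n-gram approach can be sketched in a few lines of Python. This is only an illustration, not any particular tool's implementation: the stop-word list, the sample text and the function name ngram_candidates are all made up for the example, and stop words are simply forbidden at the edges of a candidate.

```python
from collections import Counter

# Hypothetical, very short stop-word list; real tools ship longer,
# language-specific lists.
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in", "on"}

def ngram_candidates(text, n=2, top=5):
    """Return the most frequent n-grams that do not start or end with a stop word."""
    tokens = text.lower().split()
    counts = Counter()
    for i in range(len(tokens) - n + 1):
        gram = tokens[i:i + n]
        # Stop words are tolerated inside an n-gram but never at its edges.
        if gram[0] in STOP_WORDS or gram[-1] in STOP_WORDS:
            continue
        counts[" ".join(gram)] += 1
    return counts.most_common(top)

sample = ("the shutter speed controls exposure and the shutter speed "
          "is set on the camera body while the camera body holds the lens")
candidates = ngram_candidates(sample)
for phrase, freq in candidates:
    print(freq, phrase)
```

Even on this tiny sample the list mixes plausible terms ("shutter speed", "camera body") with noise such as "speed controls" or "body while", which is why manual cleaning remains necessary with this approach.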

At Sketch Engine, we decided to direct efforts towards term extraction with a view to achieving much cleaner results by exploiting our NLP tools and our multibillion-word general text corpora.

The main difference between Sketch Engine and traditional term extraction tools is that each text uploaded to Sketch Engine is tagged and lemmatized. The system thus knows whether the word is a verb, noun, adjective etc. and also knows which words are declined or conjugated forms of the same base form, called the lemma. Sketch Engine can look separately for work as a noun and work as a verb and can also treat different forms of nouns (cases, plural/singular) or verbs (tenses, participles) as the same word if required. This is something that was to be exploited in the term extraction.
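To make the effect of tagging and lemmatization concrete, here is a minimal Python sketch. The (word form, lemma, part of speech) triples are hand-written for illustration; in practice a tagger and lemmatizer would produce them, and the tag names used here are assumptions.

```python
from collections import Counter

# Hand-tagged toy data: (word form, lemma, part of speech).
tagged = [
    ("works", "work", "verb"),
    ("working", "work", "verb"),
    ("worked", "work", "verb"),
    ("work", "work", "noun"),
    ("works", "work", "noun"),
    ("contracts", "contract", "noun"),
]

# Counting surface forms mixes the noun and the verb "works" together.
form_counts = Counter(form for form, _, _ in tagged)

# Counting (lemma, part of speech) pairs merges inflected forms of one word
# while still keeping "work" the noun apart from "work" the verb.
lemma_pos_counts = Counter((lemma, pos) for _, lemma, pos in tagged)

print(form_counts["works"])
print(lemma_pos_counts[("work", "verb")])
print(lemma_pos_counts[("work", "noun")])
```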

For each language with term extraction support (16 languages as of August 2017), we developed definitions telling Sketch Engine what a term in that language can look like. For example, Sketch Engine knows that a term in English will most likely take the form of (noun+)noun+noun or adjective+noun while in Spanish, most likely, noun+adjective(+adjective) or noun+de+noun. The full rules are more complex than listed here. This will immediately disqualify any phrases that contain a verb or do not contain a noun at all.
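A rough idea of such a format check, in Python: the pattern below is a simplified stand-in for the English shapes named above, not Sketch Engine's actual term definitions, and the one-letter tags and the sample sentence are invented for the example.

```python
import re

# One letter per token: N = noun, A = adjective, V = verb, O = other.
# Simplified reading of "(noun+)noun+noun or adjective+noun".
TERM_PATTERN = re.compile(r"N?NN|AN")

def term_candidates(tagged_tokens):
    """Return word sequences whose tag sequence matches TERM_PATTERN."""
    tags = "".join(tag for _, tag in tagged_tokens)
    found = []
    for match in TERM_PATTERN.finditer(tags):
        words = [word for word, _ in tagged_tokens[match.start():match.end()]]
        found.append(" ".join(words))
    return found

tagged_sentence = [
    ("the", "O"), ("lens", "N"), ("hood", "N"), ("reduces", "V"),
    ("glare", "N"), ("and", "O"), ("the", "O"), ("digital", "A"),
    ("camera", "N"), ("stores", "V"), ("image", "N"), ("sensor", "N"),
    ("data", "N"),
]
terms = term_candidates(tagged_sentence)
print(terms)
```

Because the match runs over the tag sequence rather than the words, anything containing a verb ("hood reduces") can never qualify, regardless of how frequent it is.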

In addition to the format of the phrase, Sketch Engine also makes use of its enormous general text corpora to check whether a phrase that passed the format check is more frequent in the text in question than in general language. During this check, each phrase is treated as one unit, and occurrences of the same phrase are searched and counted in the general text and compared. Lemmatization plays an important role here so that plurals and singulars or different cases can be counted as the same phrase. The combination of the format check and frequency comparison leads to exceptionally clean results. Here are term candidates as extracted from texts about photography. No manual cleaning applied, list presented as it comes out of Sketch Engine.

The quality of extraction can be checked immediately by using the new dedicated term extraction interface to Sketch Engine called OneClick Terms https://terms.sketchengine.co.uk/
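The frequency comparison against a general corpus described above can be illustrated with a small Python sketch. The scoring formula here (a smoothed ratio of per-million frequencies) is an assumption chosen for simplicity, not something stated in this post, and all counts and corpus sizes are made up.

```python
def keyness(focus_count, focus_size, ref_count, ref_size, smoothing=1.0):
    """Smoothed ratio of per-million frequency in the focus text vs. a reference corpus."""
    focus_fpm = focus_count / focus_size * 1_000_000
    ref_fpm = ref_count / ref_size * 1_000_000
    return (focus_fpm + smoothing) / (ref_fpm + smoothing)

# Made-up sizes: a 100,000-word photography text and a 1-billion-word general corpus.
FOCUS_SIZE, REF_SIZE = 100_000, 1_000_000_000

# Made-up counts per phrase: (occurrences in focus text, occurrences in reference corpus).
counts = {
    "shutter speed": (40, 20_000),
    "lens hood": (25, 15_000),
    "last year": (30, 900_000),
}

scores = {
    phrase: keyness(focus, FOCUS_SIZE, ref, REF_SIZE)
    for phrase, (focus, ref) in counts.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
for phrase in ranked:
    print(f"{scores[phrase]:6.2f}  {phrase}")
```

A phrase like "last year" is frequent in the photography text but even more frequent in general language, so its score falls below 1 and it drops to the bottom of the list.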

Overall Language Quality

While a great deal of the translation business relates to terminology, it is not the terms themselves that constitute the majority of text. There is a lot of language in between which may not always be completely straightforward to translate. Translators are used to working with concordances in their CAT tools where it is the translation memory (TM) that serves as the source of data. The TM is sufficient for terminology work but might not be as useful for the language in between. TMs are usually rather small and the concordance does not find enough occurrences to judge which usage is typical. This is where general text corpora come in handy. The word ‘general’ refers to the fact that these corpora were designed to contain the largest possible variety of text types and topics. A general text corpus will, therefore, contain even very specialized texts heavy in terminology as well as common neutral text from various sources. Sketch Engine contains multibillion-word corpora in many languages. The largest corpus is English with a size of 30 billion words, that is 30,000,000,000!

Languages with a corpus of 500+ million words

Language        Corpus size (million words)
English         33,100
German          19,900
Russian         18,300
French          12,400
Spanish         11,000
Japanese        10,300
Polish           9,700
Arabic           8,300
Italian          5,900
Czech            5,100
Catalan          4,800
Portuguese       4,600
Turkish          4,100
Swedish          3,900
Hungarian        3,200
Romanian         3,100
Dutch            3,000
Ukrainian        2,700
Danish           2,400
Chinese simp     2,100
Chinese trad     2,100
Greek            2,000
Norwegian        2,000
Finnish          1,700
Croatian         1,400
Slovak           1,200
Hebrew           1,100
Slovenian        1,000
Lithuanian       1,000
Hindi              900
Bulgarian          800
Latvian            700
Estonian           600
Serbian            600
Korean             600
Serbian            600
Persian            500
Maltese            500

Collocations

A corpus of this size will return thousands of hits for most words or phrases and millions in the case of frequent ones. Such a concordance is impossible for a human to process. This is why we developed an advanced feature, called the word sketch, that will cope with this amount of information and will present the results in a compact and easy to understand format. The word sketch is a one-page summary of word combinations (collocations) that the word keeps. It will give the user an instant idea about how the word should be used in context. The collocations are presented in groups reflecting the syntactic relations. An example of a word sketch might look like this:

Two million occurrences of ‘contract’ were found in the corpus and processed into the summary of collocations above, which the user can understand in seconds. It gives a clear picture of which adjectives or verbs are the typical collocations the word keeps, allowing the user to use the word as naturally as a native speaker would. This information is computed automatically without any manual intervention, meaning that the user can generate it for any word in the language, including rare words. It is highly recommended to use large corpora to get information this rich. A minimum size is around 1 billion words. A smaller corpus will also produce a word sketch but not with as much information, and a corpus below 50 million words is not likely to produce anything useful, especially for less frequent words. The largest preloaded corpora in Sketch Engine are recommended for use with the word sketch.
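The idea of condensing many concordance hits into collocations grouped by syntactic relation can be sketched as follows. This is a toy Python illustration only: the relation names and observations are invented, and raw counts stand in for whatever association score a real word sketch uses to rank collocates.

```python
from collections import Counter, defaultdict

# Made-up observations for the headword "contract":
# (grammatical relation, collocate lemma).
observations = [
    ("modifier", "long-term"), ("modifier", "long-term"), ("modifier", "binding"),
    ("verb with object", "sign"), ("verb with object", "sign"),
    ("verb with object", "sign"), ("verb with object", "terminate"),
    ("verb with subject", "expire"),
]

def word_sketch(observations, per_relation=3):
    """Group collocates by relation and keep the most frequent ones in each group."""
    grouped = defaultdict(Counter)
    for relation, collocate in observations:
        grouped[relation][collocate] += 1
    return {rel: counter.most_common(per_relation) for rel, counter in grouped.items()}

sketch = word_sketch(observations)
for relation, collocates in sketch.items():
    print(relation, collocates)
```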

Word choice - Thesaurus

I am sure everyone has been in a situation when they want to say something but the right word would not spring to mind. One can usually think of a similar word, just not the right one. This is when a thesaurus is useful. Traditional printed and hand-made thesaurus content is limited by space or money, and often both. The combination of NLP and distributional semantics led to algorithms that can generate thesaurus entries automatically. The idea of a computer identifying similar words by computations often leads to skepticism but the results are surprisingly usable. How does an algorithm discover words similar in meaning? Distributional semantics claims that words which appear in similar contexts are also similar in meaning. Therefore, to find a synonym for a noun, Sketch Engine will compare the word sketches for all nouns found in the corpus. The ones with the most similar word sketch will be identified as synonyms or similar words. Here is an example of what Sketch Engine will offer if you need a word similar to authorization:

permission
consent
approval
authorisation
permit
notification
verification
documentation

confirmation
license
oversight
disclosure
waiver
licence
exemption
certificate

compensation
registration
certification
notice
restriction
reimbursement
eligibility

The synonyms are sorted by the similarity score calculated from the similarity of word sketches of each word. The top of the list (the first column) is the most valuable. The list contains certain words which are not very good synonyms and they are listed because the collocations they form are similar to the collocations of authorization. This, however, still keeps the list very useful because the thesaurus functionality will be used by somebody with a decent knowledge of the language and these words serve as suggestions from which the user will pick the most suitable one.
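As a rough illustration of ranking words by how similar their word sketches are, the Python sketch below represents each sketch as a set of (relation, collocate) pairs and compares them with Jaccard overlap. Both the similarity measure and the toy data are assumptions made for the example; the post does not say how the actual score is computed.

```python
def jaccard(a, b):
    """Share of (relation, collocate) pairs that two sketches have in common."""
    return len(a & b) / len(a | b)

# Invented miniature word sketches.
sketches = {
    "authorization": {("object_of", "grant"), ("object_of", "obtain"),
                      ("object_of", "require"), ("modifier", "prior"),
                      ("modifier", "written")},
    "permission": {("object_of", "grant"), ("object_of", "obtain"),
                   ("object_of", "require"), ("modifier", "prior"),
                   ("modifier", "special")},
    "notification": {("object_of", "require"), ("object_of", "send"),
                     ("modifier", "prior"), ("modifier", "written")},
    "stapler": {("object_of", "buy"), ("modifier", "electric")},
}

def similar_words(target, sketches):
    """Rank every other word by sketch similarity to the target, best first."""
    scores = {
        word: jaccard(sketches[target], sketch)
        for word, sketch in sketches.items()
        if word != target
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

ranking = similar_words("authorization", sketches)
for word, score in ranking:
    print(f"{score:.2f}  {word}")
```

The toy ranking also shows why weaker candidates such as "notification" can appear in a list: they share collocations with the target even though they are not true synonyms.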

For words which cannot have synonyms, the thesaurus will produce a list of words belonging to the same category or the same topic. This is the thesaurus for stapler:

notepads
paperclip
sharpener
scissors
eraser
highlighter
nailer
plier

tweezers
sharpie
post-it
crayon
screwdriver
protractor
paintbrush
thumbtack

hacksaw
wrench
awl
scalpel
photocopier
trimmer

This type of thesaurus entry might help recall a word from the same category.

Examples in Context - Concordance

Sketch Engine also features a concordance with simple as well as complex search options, where the user can search both their own texts and the preloaded corpora. The options allow for searching by exactly the text typed, but also by lemma (the base form of the word, which will also find all derived forms), or for restricting the search by part of speech or grammatical categories such as the tense of the verb. It even allows for searching for lexical or grammatical patterns without specifying concrete words. This interesting concordance shows examples of sequences of nouns joined by the preposition of. This is something I actually had to look up recently to check how many of’s I can use in a row. While the concordance itself did not answer the question directly, I could see that it is normal to use 3 of’s as long as the expression consists of numbers and units of measurement, which is how I originally used it in my sentence, and the concordance helped me check I was right.
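Searching for a grammatical pattern without naming concrete words can be sketched in Python over tagged tokens. This is a toy illustration, not Sketch Engine's query mechanism: the tag labels (NOUN, ADP and so on), the sample sentence and the function name find_of_chains are assumptions.

```python
def find_of_chains(tagged, min_ofs=2):
    """Return runs of nouns joined by 'of' that contain at least min_ofs 'of's."""
    results = []
    i = 0
    while i < len(tagged):
        if tagged[i][1] != "NOUN":
            i += 1
            continue
        j = i
        ofs = 0
        # Extend the chain while the next two tokens are "of" followed by a noun.
        while (j + 2 < len(tagged)
               and tagged[j + 1][0].lower() == "of"
               and tagged[j + 2][1] == "NOUN"):
            j += 2
            ofs += 1
        if ofs >= min_ofs:
            results.append(" ".join(word for word, _ in tagged[i:j + 1]))
        i = j + 1
    return results

sentence = [
    ("we", "PRON"), ("used", "VERB"), ("thousands", "NOUN"), ("of", "ADP"),
    ("litres", "NOUN"), ("of", "ADP"), ("water", "NOUN"), ("per", "ADP"),
    ("day", "NOUN"),
]
chains = find_of_chains(sentence)
print(chains)
```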

Translation Lookup - Parallel Corpora

Sketch Engine also contains parallel multilingual corpora which can be used for translation lookup. Again, both simple and complex search criteria can be applied to both the first and the second language. This makes it possible for the user to learn about situations when a word is not translated by the most obvious equivalent. For example, this search looks for the word vehicle in English and matching Spanish segments not containing vehículo to discover the cases when it might need to be translated differently.

This is especially valuable to users who do not have any TM or the TM is not large enough to provide the required coverage. Users with a TM can upload it to Sketch Engine to gain access to the advanced searching tools.
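The vehicle/vehículo lookup described above amounts to filtering aligned segment pairs on two conditions. Here is a minimal Python sketch of that filter; the segment pairs are invented, the function name is hypothetical, and a crude optional plural "s" stands in for proper lemma-based matching.

```python
import re

def untypical_translations(pairs, source_word, expected_target):
    """Keep pairs where the source word occurs but the expected target word does not."""
    src = re.compile(rf"\b{re.escape(source_word)}s?\b", re.IGNORECASE)
    tgt = re.compile(rf"\b{re.escape(expected_target)}s?\b", re.IGNORECASE)
    return [(en, es) for en, es in pairs if src.search(en) and not tgt.search(es)]

# Invented English-Spanish segment pairs, as they might come from a TM.
pairs = [
    ("The vehicle was parked outside.", "El vehículo estaba aparcado fuera."),
    ("Music is a vehicle for emotion.", "La música es un medio para expresar emociones."),
    ("Please move your vehicle.", "Por favor, mueva su coche."),
    ("The road was closed.", "La carretera estaba cerrada."),
]

hits = untypical_translations(pairs, "vehicle", "vehículo")
for en, es in hits:
    print(en, "|", es)
```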

Building Specialized Domain Corpora

Sketch Engine has a built-in tool for automated corpus building. The user does not need any technical knowledge to build a corpus. It is enough to upload their own data (texts, documents) and if the user does not have any suitable data, Sketch Engine will automatically find them on the internet, download them and convert them to a corpus. It only takes minutes to build a 100,000-word specialized corpus.

The first option is obvious – the user uploads their texts and documents and Sketch Engine will lemmatize them and tag them and the corpus is ready.

If the user has no suitable texts or their length is not sufficient, the user can provide a few keywords that define the topic. For example, the keywords that define tooth care could be: tooth, gums, cavity, care. Sketch Engine will use these keywords to create web search queries and will interact with Bing. Bing will find pages which correspond to the web searches and will return the URLs back to Sketch Engine, where the content of the URLs will be downloaded, cleaned, tagged and lemmatized and converted to a corpus. The whole procedure only takes a few minutes. This is a great tool for anyone who needs a reliable sample of specialized language to explore how terms and phrases are used correctly and naturally.
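The first step of this procedure, turning a handful of keywords into web search queries, could look something like the Python sketch below. How Sketch Engine actually combines keywords is not described here, so combining them three at a time is purely an assumption, and the later steps (searching, downloading, cleaning, tagging) are left out.

```python
from itertools import combinations

def build_queries(seed_words, words_per_query=3, max_queries=10):
    """Combine seed keywords into search queries, one query per combination."""
    queries = [" ".join(combo) for combo in combinations(seed_words, words_per_query)]
    return queries[:max_queries]

seeds = ["tooth", "gums", "cavity", "care"]
queries = build_queries(seeds)
for query in queries:
    print(query)
```

Using several keywords per query is one way to keep the returned pages on topic, since a single word such as "care" on its own would match far too broadly.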

Free Sketch Engine trial

A free 30-day Sketch Engine trial giving access to the complete functionality and preloaded corpora in many languages is available from the Sketch Engine website: https://www.sketchengine.co.uk

------

Ondřej Matuška - Sales and Marketing Manager

Ondřej oversees sales and marketing activities and external communication. He is the main point of contact for anyone seeking information about Sketch Engine and is also keen to support existing users so that they can make the most of Sketch Engine.
