The Obscure and Controversial Importance of Metadata

This is a guest post by Luigi Muzii on the importance and value of metadata. Many of us in the MT field are astonished at how often valuable language data resources are left in a state of disuse, disrepair, and complete disorganization, which renders them useless or at best cumbersome to work with. At many translation agencies and even large enterprises, core language data assets lie in what many of us would consider ruins.

As we now enter the AI-driven phase of so many industrial processes, the value of metadata increases by the day, because it allows much of the work already done to remain useful for future projects. Data matters, and clean, organized data can become a foundation for competitive advantage, especially in the world of business translation, where the value of work is still counted by the word.

This post urges all of us in the language industry to start thinking about and implementing metadata strategies. Though elements of this are present in some TMSs, from my viewpoint they still focus largely on yesterday's problems and are ill-prepared to deal with emerging translation challenges, which are much more tightly integrated with machine learning and AI processes.

I have added some graphics and a video snippet to Luigi's article and the emphasis is all mine.

-------------------------------------

Metadata is data that provides information about other data to describe a resource for discovery and identification or management.

Describing the contents and context of data increases its usefulness and improves the user experience over the useful life of the data.

Metadata also helps to organize, identify, and preserve resources.

Metadata can be created either by automated information processing or by manual work. Elementary metadata captured by computers can include information about when a resource was created, who created it and when it was last updated together with file size and file extension information.
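As a minimal sketch of the automated side, the Python snippet below reads the elementary metadata that the file system already keeps for a file. The function name `elementary_metadata` and the sample file name are assumptions made for illustration. Creation time and author are deliberately left out, because how (and whether) they are exposed depends on the platform and the file format.

```python
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def elementary_metadata(path: Path) -> dict:
    """Collect basic metadata the file system already records for a file."""
    info = path.stat()
    return {
        "name": path.name,
        # Normalize the extension so ".DOCX" and ".docx" are treated alike.
        "extension": path.suffix.lower(),
        "size_bytes": info.st_size,
        # st_mtime is the last modification time, in seconds since the epoch.
        "last_modified": datetime.fromtimestamp(
            info.st_mtime, tz=timezone.utc
        ).isoformat(),
    }


# Demonstration on a throwaway file (hypothetical name and content).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "Brochure.DOCX"
    sample.write_bytes(b"hello")
    record = elementary_metadata(sample)
    print(record)
```

Anything beyond this, such as the client, the domain, or the intended audience, cannot be read from the file system and has to be entered by a person or pulled from another system.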

And yet, those who would actually benefit most from the existence of metadata oftentimes undervalue and disregard it.

The Semantic Web is a perfect example of this attitude. It is still largely unrealized, although widely imagined and projected as being of great value. It is supposed to thrive on machine-readable metadata about pages, which includes information on the content these pages hold and present, and the inter-relations between them and other pages.

The machine-readable descriptions enable automated agents to attribute meaning to content, and shape the knowledge about it, and exploit it accordingly. Many of the technologies for such agents already exist and metadata is the only part missing.

In a 2001 essay, Cory Doctorow illustrates the problems with metadata, especially its fragility. One of Doctorow’s issues merits special attention: People are lazy. Laziness is the major reason for not adding metadata to content. Moreover, metadata is elusive and perishable: it becomes obsolete when the data it describes loses relevance over time, or when it is not updated with new insights.

How, then, is metadata relevant and important in translation? When setting up a translation project, the first preparatory step should be to outline the project by recording its basic details. Alas, translation project managers are rarely taught never to skip this task, so translation project data is often scarce and partial, and it rapidly becomes totally irrelevant.

Translation project managers are not the only culprits of this contempt, though. Today, almost all of them rely heavily on TMSs, and it’s a pity that most TMSs do not provide any mechanism to have a translation project charter properly compiled with the relevant metadata. TMS vendors could help here with profiler tools that use dynamic list properties to associate values with external database tables, possibly from connected CMSs, CRMs, ERP systems, etc.

Also, not only are translation project charters important for LSPs, their staff, and vendors, to properly run project-related tasks; the information they store is essential to read and understand the data produced along the job, including—and above all—language data. In fact, language data are nearly useless without relevant metadata.

Metadata allow language data to be collected, organized, cleaned and explored for potential re-use, especially with MT engine training.

Likewise, automatically generated project data alone is insufficient to produce truly useful stats and practical KPIs, or to support any business inference whatsoever. Processing a set of data, however vast, is not enough per se for it to qualify as “big data.” To qualify, data sets must be so extensive in volume, velocity, variety and complexity as to require specific technologies and analytical methods for value extraction; traditional data processing applications and techniques are hardly adequate to deal with them. This is why translation big data is crap; no matter how much data a TMS or an LSP processes, it is not organized enough or durable enough for a reliable outlook.

Anyway, along with allowing the collation of relevant data for effective KPIs, translation metadata may help LSPs and their people to better understand the language data they hold, consume, and produce. Indeed, translation tools all add different types of metadata to language data during processing. Translation memory management software adds metadata to every translation unit; the collection of descriptive data in a terminological record is metadata on a term, along with the data that is automatically generated; and the annotations to a collection of reference material are metadata for knowledge representation.
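To make the translation-unit case concrete, here is a small sketch in Python that reads the metadata carried by a unit in a TMX file, the common exchange format for translation memories. The attributes `creationdate`, `creationid`, `changedate`, and `changeid` and the `prop` element come from the TMX format; the sample fragment, its values, the `x-project` and `x-domain` property names, and the helper `read_units` are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# A hypothetical one-unit TMX fragment; every value here is made up.
TMX_SAMPLE = """<tmx version="1.4">
  <header creationtool="ExampleTool" creationtoolversion="1.0"
          segtype="sentence" o-tmf="none" adminlang="en"
          srclang="en" datatype="plaintext"/>
  <body>
    <tu tuid="42" creationdate="20190115T101500Z" creationid="jdoe"
        changedate="20190301T083000Z" changeid="asmith">
      <prop type="x-project">PRJ-0001</prop>
      <prop type="x-domain">legal</prop>
      <tuv xml:lang="en"><seg>Terms of use</seg></tuv>
      <tuv xml:lang="it"><seg>Condizioni d'uso</seg></tuv>
    </tu>
  </body>
</tmx>"""

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def read_units(tmx_text: str) -> list:
    """Return each translation unit's attributes, properties, and segments."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        units.append({
            "attributes": dict(tu.attrib),
            "props": {p.get("type"): p.text for p in tu.findall("prop")},
            "segments": {
                tuv.get(XML_LANG): tuv.findtext("seg")
                for tuv in tu.findall("tuv")
            },
        })
    return units


units = read_units(TMX_SAMPLE)
print(units[0]["attributes"]["creationid"], units[0]["props"])
```

With the metadata separated from the segments like this, a memory can be filtered by project, domain, author, or date before it is reused, for instance when selecting material for MT engine training.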

For example, where the software can recognize the type of a source file, and automatically stamp an annotation, this may be useful in subsequent handling. The same goes for the author’s name or the last update date, the project code from the project charter, possibly generated automatically or semi-automatically based on the project manager’s input, possibly from a list. And think of adding metadata to segments indicating constants or untranslatable elements.
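One way to picture segment-level annotation is the rough Python sketch below, which marks elements that should pass through translation untouched. The three patterns (placeholders, URLs, inline tags), the `Segment` class, and the `annotate` function are assumptions chosen for the example and are not taken from any particular tool; a real system would need patterns matched to its own file formats.

```python
import re
from dataclasses import dataclass, field

# Hypothetical patterns for elements that should not be translated.
UNTRANSLATABLE = {
    "placeholder": re.compile(r"\{[A-Za-z0-9_]+\}|%[sd]"),
    "url": re.compile(r"https?://\S+"),
    "tag": re.compile(r"</?[A-Za-z][^>]*>"),
}


@dataclass
class Segment:
    text: str
    annotations: list = field(default_factory=list)


def annotate(segment: Segment) -> Segment:
    """Attach one annotation per untranslatable span found in the text."""
    for kind, pattern in UNTRANSLATABLE.items():
        for match in pattern.finditer(segment.text):
            segment.annotations.append({
                "kind": kind,
                "start": match.start(),
                "end": match.end(),
                "value": match.group(),
            })
    # Keep annotations in reading order, regardless of pattern order.
    segment.annotations.sort(key=lambda a: a["start"])
    return segment


seg = annotate(Segment("Hello {user_name}, visit <b>our site</b> today."))
for note in seg.annotations:
    print(note)
```

Because the annotations travel with the segment instead of altering its text, later steps (a translator's editor, an MT engine, a QA check) can each decide how to treat the protected spans.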

Metadata pertains to a field where standards typically apply and help. A lot. Alas, as we have learned from experience, especially in the last few years, standards are much talked about but little loved and poorly applied when translation software is involved. However sad, the reason is quite simple: being so vertical, the sector deeply reflects the state of the reference industry; it is highly fragmented with no company having the critical mass to rule the market. This translates into a superficial acceptance of standards but with different implementations to force customer lock-in.

In the immediate future, it will be possible to use a neural language model trained on the source language to automatically estimate how similar a text is to the training data, and hence how suitable it is for MT. If we invest in enriching the input with profile metadata and linguistic annotation for NMT systems, the whole task may be even simpler.

So, let’s start standardizing, entering and using metadata. For good.

Luigi Muzii has been in the "translation business" since 1982 and has been a business consultant in the translation and localization industry since 2002, through his firm. He focuses on helping customers choose and implement best-suited technologies and redesign their business processes for the greatest effectiveness of translation and localization related work.

