Recognizing names is easy for us. But what about machines? And what if you need to recognize names in several different languages? Meltwater asked us to help their system extract names from the first few lines of news articles (we call them by-lines), and we are currently working on solutions for Brazilian Portuguese, Arabic and Chinese.
The two faces of names
Recognizing names is a Janus-faced problem. Have a look at the following screenshot, taken from LeMonde.fr.
Even if you don’t speak French, you can still recognize the name of its author. Humans know that it is a tradition to mention the author’s name somewhere near the title. They are also aware that names follow special orthographic rules (the parts of a name usually start with a capital letter) and that Western European languages share a large stock of common given names. Things get more complicated with languages that use non-Latin scripts, such as Arabic or Russian. Machines lack this cultural and linguistic awareness, so it is our task to make these conventions explicit for them.
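To make the capitalization rule concrete, here is a minimal sketch of how it could be written down for a machine. It is an illustration only, not the system we built for Meltwater: the function name `candidate_names`, the two-word minimum and the example by-line are all assumptions made for this post.

```python
import re

# A word: Unicode letters, optionally joined by hyphens or apostrophes.
# [^\W\d_] matches any letter, so accented characters are covered.
WORD = re.compile(r"[^\W\d_]+(?:[-'][^\W\d_]+)*")


def candidate_names(byline, min_words=2):
    """Return (start, end, text) for runs of capitalized words.

    Hypothetical heuristic: a name candidate is a sequence of at least
    `min_words` capitalized words separated only by whitespace.
    """
    runs, current = [], []
    for match in WORD.finditer(byline):
        capitalized = match.group()[0].isupper()
        # The run continues only if nothing but whitespace lies between
        # the previous word and this one.
        adjacent = bool(current) and byline[current[-1].end():match.start()].isspace()
        if capitalized and adjacent:
            current.append(match)
        else:
            runs.append(current)
            current = [match] if capitalized else []
    runs.append(current)
    return [
        (run[0].start(), run[-1].end(), byline[run[0].start():run[-1].end()])
        for run in runs
        if len(run) >= min_words
    ]


print(candidate_names("publié par Jean-Pierre Dupont le 3 mai"))
# [(11, 29, 'Jean-Pierre Dupont')]
```

A rule like this only goes so far. A capitalized first word ("Par Jean Dupont") would be swallowed into the candidate, and the rule has nothing to hold on to in scripts without letter case, such as Arabic or Chinese.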
Teaching machines to recognize names
Our team works with translators who are either native speakers of the target language or have native-like fluency. In every case, we work with real-world data. First we annotate the data by tagging the start and end positions of names. Then we split it into two distinct data sets: one for development and one for testing. The development data is our primary source of information, and we use it to learn about the nature of names in a given language. What kinds of names does the language use? How many words are in a name? Does the language use prefixes, suffixes, infixes, etc.? Is there a convention for ascribing authorship, like “by” in English? The test data is used for continuous evaluation of the system, i.e. literally to see whether we can find the names in the by-lines. With these methods we can achieve high precision, that is, very few false positives.
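The split and the evaluation step can be sketched in a few lines. This is a simplified illustration under stated assumptions, not our actual tooling: the helper names `split_dev_test` and `precision_recall`, the 20% test share, the exact-match scoring and the sample spans are all made up for the example.

```python
import random


def split_dev_test(examples, test_fraction=0.2, seed=0):
    """Shuffle annotated by-lines and split them into (development, test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split stable
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def precision_recall(gold, predicted):
    """Score predicted name spans against annotated ones.

    Each span is a (byline_id, start, end) tuple. A prediction counts as
    correct only if it matches an annotated span exactly (an assumption;
    partial matches could be scored differently).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Made-up spans: two annotated names, three predictions, one of them wrong.
gold = {(0, 3, 14), (1, 11, 29)}
predicted = {(0, 3, 14), (1, 11, 29), (1, 0, 5)}
precision, recall = precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")
# precision=0.67 recall=1.00
```

Every false positive lowers precision, which is why the number of wrongly tagged spans is the figure to watch when the goal is a system that rarely reports a name that is not there.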
We are happy to work with multiple languages
We love working on multilingual solutions! Our company is located in the middle of Europe and we have become accustomed to the challenges of a multilingual environment. Our biggest client, too, is a company that offers translation memory and online services to professional translators.
Meltwater helps companies make better, more informed decisions based on insights from the outside. More than 23,000 companies use the Meltwater media intelligence platform to stay on top of billions of online conversations, extract relevant insights, and use them to strategically manage their brand and stay ahead of their competition. With 50 offices on six continents, Meltwater is dedicated to personal, global service built on local expertise. Meltwater also operates the Meltwater Entrepreneurial School of Technology (MEST), a nonprofit organization devoted to nurturing future generations of entrepreneurs. For more information, follow Meltwater on Twitter, Facebook, LinkedIn, YouTube, or visit www.meltwater.com.