Dubbed M2M-100, the model is, Facebook claims, the first multilingual machine translation (MMT) model that can translate directly back and forth between any pair out of a set of 100 languages. In all, Facebook AI has constructed an enormous dataset consisting of 7.5 billion sentences across 100 languages. Using that, the research team trained a universal translation model with more than 15 billion parameters “that captures information from related languages and reflects a more diverse script of languages and morphology,” according to a Facebook blog post Monday.
To do this, Facebook had to collect a whole slew of publicly available data from around the world using a variety of novel techniques. “A lot of this is really building upon work that we've done for many years at research at Facebook, which are like all of the different Lego pieces that we kind of put together to build the system today,” Fan explained.
To start, the team employed CommonCrawl, which maintains an open repository of web crawl data, to collect text examples from around the web. Then they set about identifying the language each piece of text was in using FastText, a text classification system Facebook developed and open-sourced a few years back. “It basically looks at some text and it tries to decide what language it's written in,” Fan said. “So we partition a bunch of texts from the web into all of these different languages, and then our goal is to identify sentences that would be translations.”
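The idea behind that first classification step can be sketched in a few lines of Python. This is a toy character-trigram identifier, not Facebook's actual FastText model (which is trained on far more languages and data), but it illustrates the same underlying signal: character n-gram statistics differ sharply between languages.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Character n-grams -- the kind of feature a language-ID model relies on."""
    text = f" {text.lower()} "
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# Tiny hypothetical "training" profiles; a real system learns these
# from large corpora in over a hundred languages.
PROFILES = {
    "en": Counter(char_ngrams("the quick brown fox jumps over the lazy dog this is english text")),
    "fr": Counter(char_ngrams("le renard brun saute par dessus le chien ceci est un texte francais")),
}

def identify_language(text):
    """Score each language profile by overlapping n-gram counts; return the best match."""
    grams = Counter(char_ngrams(text))
    scores = {
        lang: sum(min(grams[g], profile[g]) for g in grams)
        for lang, profile in PROFILES.items()
    }
    return max(scores, key=scores.get)
```

Once every scraped document is bucketed by language like this, the harder problem begins: finding which sentences across buckets actually say the same thing.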
“Traditionally, people use human translators to create translation data,” she continued. “This is difficult at scale because it's hard, for example, to find someone who speaks English and Tamil, but it's even harder to find someone who speaks French and Tamil together, because non-English translation is still an area that needs improvement.”
To mine that necessary data at scale, Fan’s team relied heavily on the LASER system. “It reads sentences, takes the text and creates a mathematical representation of that text, such that sentences that have the same meaning map to the same thought,” she said. “So if I have one sentence in Chinese and French, and they're saying the same thing, they will kind of overlap -- like a Venn diagram -- the overlapping area is the kind of text that we think are aligned sentences.”
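In rough Python terms, the mining step Fan describes looks something like the sketch below. The hand-made three-dimensional vectors are stand-ins for real LASER embeddings, which map sentences in any language into one shared, much higher-dimensional space so that translations land near each other; the sentences and threshold are illustrative only.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

# Toy stand-ins for LASER embeddings: translations get nearby vectors.
zh_sentences = {
    "你好世界": [0.9, 0.1, 0.0],   # "hello world"
    "我喜欢猫": [0.1, 0.8, 0.2],   # "I like cats"
}
fr_sentences = {
    "bonjour le monde": [0.88, 0.12, 0.05],
    "j'aime les chats": [0.15, 0.79, 0.18],
}

def mine_pairs(src, tgt, threshold=0.95):
    """Pair each source sentence with its nearest target sentence above threshold."""
    pairs = []
    for s, sv in src.items():
        best = max(tgt, key=lambda t: cosine(sv, tgt[t]))
        if cosine(sv, tgt[best]) >= threshold:
            pairs.append((s, best))
    return pairs
```

The overlapping region of Fan's Venn diagram corresponds to pairs that clear the similarity threshold; those become training examples without any human translator in the loop.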
Of course, not all languages have a large amount of written content available on the internet. In those situations, Fan’s team turned to monolingual data, which is just data written in a single language. Using the Chinese-to-French example, Fan explained: “So if my goal is to translate from Chinese to French, but for some reason, I don't get good quality, then I'm going to try and improve this by taking monolingual data in French. And what I do is train a reverse of the system: I go from French to Chinese. I take all of my French, for example, from Wikipedia, and I translate it into Chinese.”
Doing so produces a slew of machine generated “synthetic” data, Fan continued. “So I've created this synthetic Chinese based on my back-translated French, then I'm going to add it again to the forward model. So instead of going from Chinese to French, I have Chinese plus my supplemented synthetic Chinese, all going into French. And because this adds a bunch of new examples -- on both the input side and the output side -- the model will be much stronger.”
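The back-translation loop Fan walks through can be sketched as follows. Here `reverse_translate` is a hypothetical stub standing in for a trained French-to-Chinese model, and the word-substitution dictionary is purely illustrative; the point is the data flow, in which synthetic (Chinese, French) pairs are appended to the real parallel corpus before training the forward model.

```python
# Hypothetical stub lexicon standing in for a trained French -> Chinese model.
FR_TO_ZH_STUB = {
    "bonjour": "你好",
    "le monde": "世界",
}

def reverse_translate(french_sentence):
    """Stub for the reverse (French -> Chinese) model described above."""
    out = french_sentence
    for fr, zh in FR_TO_ZH_STUB.items():
        out = out.replace(fr, zh)
    return out

def augment_training_data(parallel_zh_fr, monolingual_fr):
    """Append (synthetic Chinese, real French) pairs to the forward training set."""
    synthetic = [(reverse_translate(fr), fr) for fr in monolingual_fr]
    return parallel_zh_fr + synthetic

# Usage: one real pair plus one monolingual French sentence
# yields a two-example forward training set.
real_pairs = [("你好 世界", "bonjour le monde")]
augmented = augment_training_data(real_pairs, ["bonjour le monde"])
```

Because the French side of each synthetic pair is genuine human-written text, the forward model still learns to produce fluent French output even though its Chinese inputs were machine-generated.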
Whether this will lead to a digital Babel Fish capable of losslessly translating between the world’s 6,200-odd spoken languages remains to be seen. Fan notes that the ultimate success of this project depends on the amount of resources the AI can leverage. For major languages like French, Chinese, German, Spanish, and Hindi, those resources are vast. “People write tons of text on the web in these languages,” Fan noted. “There's really a lot of data, and our models can use this data to get better.”
“I personally identify a lot of areas that we might need improvement in for the very low resource languages,” she continued. “For African languages, we're pretty good at Swahili and Afrikaans, we could use a lot of improvement on languages like Zulu, and these languages have additional research challenges that we need to confront.”
Facebook is releasing the dataset, model, and training and evaluation setups as open source to the research community to help spur further advancements. The company also plans to continue developing the system independently and eventually work the technology into its daily operations.