AI could make African languages more accessible with machine translation — but people need to make it happen
Machine translation benchmarks were recently set for more than 30 African languages, classified in the Natural Language Processing space as ‘The Left-Behinds’. The benchmarks are the first advances for some of the 2,000-odd living African languages and present a case for information accessibility through language technology.
If there were a perfect Machine Translation system for African languages, all the existing knowledge found on the Internet could be translated into someone’s home language.
For example, the number of Xitsonga articles on the global encyclopaedia Wikipedia is tiny. “If we had a perfect [Machine Translation] system… you can take the whole Wikipedia and translate that into someone’s language, then you give them direct access to basically all of knowledge. That’s a little bit amazing,” said Dr Herman Kamper of Stellenbosch University’s Department of Electrical and Electronic Engineering.
Machine Translation is automated translation of one language into another, performed by a computer.
Kamper was one of about 40 scholars and more than 400 participants who have been teaming up since 2019 to solve speech and language problems in Africa.
At the end of 2020, the volunteer community set the first Machine Translation benchmarks for more than 30 African languages.
Researchers used Natural Language Processing (NLP), a branch of artificial intelligence that helps computers understand, interpret and manipulate human language.
Their research paper, Participatory Research for Low-resourced Machine Translation: A Case Study in African Languages, which came out of the collaborative work, went on to win the 2020 Wikimedia Foundation Research Award, and some members of the community have since gone on to tackle other NLP tasks.
According to Kamper, the benchmarks are merely a starting point for these languages, since the systems are not yet as good as, for example, the English-to-French system that Google uses.
The benchmarks are evaluation sets to test the Machine Translation systems.
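The article does not detail how such evaluation sets are used, so the following is a hypothetical sketch of the idea: a benchmark pairs source sentences with human reference translations, and a system is scored on how closely its output matches the references. Real MT benchmarks use metrics such as BLEU; the toy word-overlap score below is an illustrative stand-in, and all sentences are made-up placeholders.

```python
def overlap_score(hypothesis: str, reference: str) -> float:
    """Toy metric: fraction of reference words that appear in the system output."""
    hyp_words = set(hypothesis.lower().split())
    ref_words = reference.lower().split()
    if not ref_words:
        return 0.0
    return sum(w in hyp_words for w in ref_words) / len(ref_words)

# A benchmark evaluation set pairs source sentences with reference translations.
benchmark = [
    ("source sentence 1", "the cat sat on the mat"),
    ("source sentence 2", "good morning to you"),
]

# Pretend outputs from the MT system being evaluated, one per source sentence.
system_outputs = ["the cat sat on a mat", "good morning"]

scores = [overlap_score(hyp, ref)
          for hyp, (_, ref) in zip(system_outputs, benchmark)]
average = sum(scores) / len(scores)
print(round(average, 2))  # prints 0.75
```

Because the references are fixed, any future system for the same language can be scored against the same set, which is what makes a benchmark a shared starting point.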
If a big tech company like Google wanted to, it could create Machine Translation systems for all the languages, stated Kamper, who focuses particularly on speech recognition.
But without native speakers on the ground, it is difficult to account for the long-tail languages, Kamper said.
At the moment, what is needed are native speakers of languages like those covered in the benchmarks, namely Khoekhoegowab, Igbo, Sepedi and Setswana, who know that Google won’t easily build Machine Translation systems for them.
Most NLP research lacks on-the-ground expertise in low-resourced languages.
You need people who will say, “I am going to do it”.
That was the thinking of the group who gathered in 2019 to discuss NLP at a Deep Learning Indaba held in Kenya.
At that 2019 teaching event, it was established that, “we want to do this thing, to build MT systems for all the languages that we possibly can”, Kamper said.
“In that room, there were already a whole bunch of people from all over Africa speaking different languages. That was where it started.”
The community of creators, translators, curators, language technologists and evaluators called their initiative the Masakhane (meaning “we build together”) project.
A year later the group, spearheaded by machine learning engineer Jade Abbott, accomplished some of the first advances in NLP for African languages.
From Nigeria, volunteers are translating their own writings, including personal religious stories and undergraduate theses, into Yoruba and Igbo. This is in an effort to ensure that accessible and representative data of their culture are used to train models.
“But there is still a lot of work to be done,” Kamper pointed out, adding that the community still continues to work and meet on a weekly basis.
“The systems [or benchmarks] are focused on a relatively small domain, meaning that the systems were trained and tested on a specific style of language. They won’t necessarily do well on other types of texts,” said Kamper.
More data covering more diverse styles would need to be collected for the systems to work across multiple domains, he said.
While Machine Translation systems for high-resourced languages like English and German work well, comparable systems do not work nearly as well for languages considered “low-resourced”.
There is a big discussion around what defines a low-resourced language, and definitions vary, Kamper said. According to him, most African languages are considered “low-resourced” because it is difficult to procure data: there is not enough labelled audio-speech or parallel translation between the different languages.
This means that it is difficult to procure the datasets the systems need: a sentence in one language alongside its translation into another, and thousands of other pairs like it.
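As an illustrative sketch, such a parallel corpus can be pictured as a list of sentence pairs. The example pairs below are placeholders chosen for this article, not drawn from any real dataset:

```python
# Illustrative sketch of a parallel corpus: each entry pairs a sentence
# with its translation. Real training datasets contain thousands of pairs.
parallel_corpus = [
    ("Good morning", "Goeie môre"),         # English -> Afrikaans
    ("Thank you very much", "Baie dankie"),
    # ... thousands more pairs like these
]

# An MT system is trained on the pairs: source sentences as input,
# target sentences as the output it learns to produce.
sources = [src for src, tgt in parallel_corpus]
targets = [tgt for src, tgt in parallel_corpus]
print(len(sources), len(targets))  # prints 2 2
```

For low-resourced languages, it is precisely this aligned, sentence-by-sentence data that is hard to come by, even when the language has millions of speakers.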
And, of the roughly 7,000 spoken languages in the world, most are considered endangered, with small numbers of speakers, said Kamper.
At the same time, there are some languages, like most South African languages, that are spoken by millions of people, but it remains difficult to get labelled data.
According to their paper, most of the 2,000-odd living languages in Africa are classified in NLP research as “The Left-Behinds”, with some as “The Rising Stars”.
“For me [a language technologist], to build a Machine Translation system for a language that I don’t speak is actually quite hard,” said Kamper, who helped create the Afrikaans system and contributed to the isiXhosa one.
“The beauty of this project is that we got people, there on the ground, speaking the language,” Kamper said. “Then, the initiative was to quickly upskill [those working on the Machine Translation systems] to build these first systems.”
“We are basically trying to equip people all over Africa to fix the problems in their own communities,” said Kamper.
Kamper pointed out that in the greater scheme of things his contribution was small — just three days devoted to working on the system.
“But the cool thing about it is that 40-something people made a tiny contribution like this, and it turned out to be a big thing,” Kamper said. “If you didn’t have native speakers of languages, and people who sacrificed just a few moments of their time, then that wouldn’t have happened.” DM