January 18, 2018

Statistical Methods in Translation

Machine translation is one of the oldest challenges in computational linguistics. The first attempts at automatic text translation were made as far back as the 1950s, and for a long time afterwards all machine translation systems were rule-based. Those rules described how structures in one language were to be transformed into structures of another. A machine translation system needs a great many such rules to work, tens of thousands in fact, and writing them is the linguist's job. Developing a system of rules that translates one language into another takes a long time and a lot of effort: months, perhaps even years.

Fortunately, in the 1990s researchers came up with a different architecture for machine translation systems: the so-called statistical machine translation systems, which need no handwritten rules. Instead, they work the rules out on their own by analyzing translation samples. Feed such a system enough translation samples, the so-called parallel texts (i.e. matched source and target texts), and it can process them automatically and derive the rules from a large number of samples like that.
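To make the idea concrete, here is a minimal sketch of how a statistical system can extract translation knowledge from parallel texts on its own. It implements one classic algorithm for this, IBM Model 1 word alignment trained with expectation-maximization; the tiny corpus and the parameter choices are invented purely for illustration, and a real system would of course use a far larger corpus and a richer model.

```python
from collections import defaultdict

# Toy parallel corpus (English-French sentence pairs). A real statistical
# MT system would need hundreds of thousands of such pairs.
corpus = [
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "book"], ["un", "livre"]),
]

def train_ibm_model1(corpus, iterations=10):
    """Learn word-translation probabilities t(f|e) from parallel sentences."""
    tgt_vocab = {w for _, tgt in corpus for w in tgt}
    # Start from a uniform distribution: every target word is equally likely.
    t = defaultdict(lambda: 1.0 / len(tgt_vocab))
    for _ in range(iterations):
        count = defaultdict(float)  # expected co-occurrence counts
        total = defaultdict(float)  # normalizers per source word
        # E-step: distribute each target word's probability mass over
        # the source words it could have been translated from.
        for src, tgt in corpus:
            for f in tgt:
                norm = sum(t[(f, e)] for e in src)
                for e in src:
                    c = t[(f, e)] / norm
                    count[(f, e)] += c
                    total[e] += c
        # M-step: re-estimate t(f|e) from the expected counts.
        for (f, e), c in count.items():
            t[(f, e)] = c / total[e]
    return t

t = train_ibm_model1(corpus)
```

After a few iterations, "maison" aligns much more strongly with "house" than with "the", because "the" also co-occurs with "le" and "livre" and its probability mass gets spread thin. No rule was written by hand; the regularity emerged from the samples alone.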

Now, it does take quite a few samples. To work well, a statistical machine translation system requires at least half a million sentence pairs (better still, a couple of million). But if the required data is available, the system will learn on its own, without a linguist's help. Of course, one may argue that while a linguist is not directly involved as the statistical system teaches itself, a great deal of work has to be done before the system can start learning: all those texts have to be translated in the first place. So which approach is the more laborious remains an open question.

However, there is a crucial difference. When linguists write rules for a machine translation system, they are working only for that system; in other words, you hire a team of linguists, and writing those rules is all they do. The parallel texts, however, are never produced specifically for the statistical machine translation system. People translate texts for their own purposes, quite independently of statistical systems or their development, and the developers simply use the translations that are already out there.

Please refer to the full version of the interview: https://postnauka.ru/video/82022