The generalization ability and reliability of multilingual machine translation systems often depend heavily on how much parallel data is available for each language pair. In this post, we focus on zero-shot generalization: a challenging setup that tests a model on translation directions it was never optimized for during training. To address the problem, we (i) reframe multilingual translation as probabilistic inference, (ii) define the notion of zero-shot consistency and show why standard training often yields models that are poorly suited for zero-shot tasks, and (iii) introduce an agreement-based training method that encourages the model to produce equivalent translations of parallel sentences into auxiliary languages. We evaluate our multilingual NMT models on several zero-shot translation benchmarks (IWSLT17, UN Corpus, Europarl) and show that agreement-based training often yields a 2-3 BLEU improvement on zero-shot directions over strong baselines, without losing performance on the supervised translation directions. In the UN Corpus experiments, agreement-trained models are on par with pivoting and in some cases outperform it, e.g., when the target language is Russian, perhaps because Russian is quite distant from the English pivot. On Europarl, models trained with the proposed objective exceed the baseline models by 2-3 BLEU points but lag behind pivoting systems. On IWSLT17, vanilla training (Johnson et al., 2017) already achieves very strong zero-shot performance, even surpassing pivoting, due to overlapping data and the presence of many supervised translation pairs; agreement-trained models give a slight additional gain over these strong baselines.
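To make the contrast between direct zero-shot translation and pivoting concrete, here is a minimal, purely illustrative sketch. The `translate` function and its tiny lookup table are hypothetical stand-ins for decoding with a single multilingual model conditioned on a target-language token, not an actual NMT system:

```python
# Toy stand-in for a multilingual NMT model: maps (source_text, target_lang)
# to a translation. In a real system this would be one encoder-decoder
# conditioned on a target-language token (Johnson et al., 2017).
TOY_TABLE = {
    ("bonjour", "en"): "hello",
    ("bonjour", "ru"): "привет",  # direct zero-shot direction (no Fr-Ru supervision)
    ("hello", "ru"): "привет",    # supervised direction out of the English pivot
}

def translate(text: str, tgt_lang: str) -> str:
    """Look up a translation in the toy table (stand-in for model decoding)."""
    return TOY_TABLE[(text, tgt_lang)]

def zero_shot(text: str, tgt_lang: str) -> str:
    """Decode directly into the target language in a single pass."""
    return translate(text, tgt_lang)

def pivot(text: str, tgt_lang: str, pivot_lang: str = "en") -> str:
    """Translate source -> pivot -> target using two supervised passes."""
    return translate(translate(text, pivot_lang), tgt_lang)

print(zero_shot("bonjour", "ru"))  # one decoding pass, Fr->Ru
print(pivot("bonjour", "ru"))      # two decoding passes, Fr->En->Ru
```

Pivoting doubles decoding cost and can compound errors across the two passes, which is why a zero-shot-consistent model that decodes directly is attractive.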
Multilingual MT systems can be evaluated in terms of their zero-shot performance, i.e., translation quality along directions for which they were not optimized (e.g., due to lack of parallel data). Formally, we define zero-shot generalization via consistency.

Definition (Expected Zero-shot Consistency). Let $\mathcal{E}_S$ and $\mathcal{E}_0$ be the supervised and zero-shot tasks, respectively. Let $\ell$ be a non-negative loss function and $\mathcal{M}$ be a model with maximum expected supervised loss bounded by some $\varepsilon > 0$:

$$\max_{(L_i, L_j) \in \mathcal{E}_S} \mathbb{E}\left[\ell(\mathcal{M})\right] < \varepsilon$$

We call $\mathcal{M}$ zero-shot consistent with respect to $\ell$ if its expected loss on the zero-shot tasks also vanishes as $\varepsilon \to 0$.

First, consider a purely bilingual setting, where we learn to translate from a source language, $L_s$, to a target language, $L_t$. We can train a translation model by optimizing the conditional log-likelihood of the bilingual data under the model:

$$\hat{\theta} := \arg\max_\theta \sum_{(x, y)} \log p_\theta(y \mid x)$$

In the multilingual case (e.g., translating between `En` and `Fr`), we rewrite the likelihood by introducing and marginalizing out latent translations into Spanish (`Es`) and Russian (`Ru`), i.e., summing over all sequences in the corresponding languages:

$$\log p_\theta(\mathrm{En}, \mathrm{Fr}) = \log \sum_{\mathrm{Es}, \mathrm{Ru}} p_\theta(\mathrm{En}, \mathrm{Fr}, \mathrm{Es}, \mathrm{Ru})$$

Again, note that this objective implies independence of the `En→Es,Ru` and `Fr→Es,Ru` models. We call the expression given in (7) the agreement-based likelihood.
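The intuition behind the agreement term can be sketched numerically. The snippet below is a minimal, hypothetical instantiation, not the paper's exact objective: it assumes we can score a small candidate set of auxiliary-language translations $z$ under both directions, and it penalizes the two models when they place probability mass on different candidates:

```python
import math

def agreement_loss(p_z_given_en, p_z_given_fr):
    """One simple agreement objective: -log sum_z p(z|x_En) * p(z|y_Fr).

    Both arguments are probability distributions over the same small,
    hypothetical candidate set of auxiliary-language translations z.
    The loss is small when both models concentrate mass on the same z.
    """
    overlap = sum(p * q for p, q in zip(p_z_given_en, p_z_given_fr))
    return -math.log(overlap)

# Two directions that agree on candidate 0 incur a small loss...
agree = agreement_loss([0.9, 0.05, 0.05], [0.85, 0.10, 0.05])
# ...while directions preferring different candidates are penalized heavily.
disagree = agreement_loss([0.9, 0.05, 0.05], [0.05, 0.05, 0.90])
print(agree < disagree)  # True
```

In practice the sum over all sequences in the auxiliary languages is intractable, so training would rely on sampled or decoded candidate translations rather than the full candidate set assumed here.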