[Figure 1 bar charts: BLEU scores per translation direction; x-axis labels include es-ar es-en es-fr es-ru es-zh, fr-ar fr-en fr-es fr-ru fr-zh, ru-ar ru-en ru-es ru-fr ru-zh, and zh-ar zh-en zh-es zh-fr zh-ru.]
Figure 1: Comparison between Moses baseline systems and neural models for the full language pair matrix of the
6-way corpus.
[Figure 2 bar charts: BLEU scores of Pb-SMT, Hiero, NMT 1.2M, and NMT 2.4M for language pairs involving English; x-axis labels include en-ar en-es en-fr en-ru en-zh.]
Figure 2: For all language pairs involving English, we also experimented with hierarchical machine translation and more training iterations for the neural models. NMT 1.2M means 1.2 million iterations with a batch size of 40 and a training time of ca. 8 days; NMT 2.4M means 2.4 million iterations accordingly.
which increased training time to 16 days in total per neural system.

Figure 2 summarizes the results. As expected, Hiero outperforms PB-SMT by a significant margin for Chinese-English and English-Chinese, but does not reach half the improvement of the NMT systems. For other language pairs we see mixed results with models trained for 1.2M iterations; for French-English and Russian-English, where results for PB-SMT and NMT are close, Hiero is the best system. However, training the NMT systems for another eight days helps, with gains between 0.3 and 1.3 BLEU. We did not observe improvements beyond 2M iterations. For our setting, it seems that stopping training after 10 days is a viable heuristic. Training with other corpus sizes, architectures, and hyper-parameters may behave differently.

7. Efficient decoding with AmuNMT

AmuNMT is a ground-up neural MT toolkit implementation, developed in C++. It currently consists of an efficient beam-search inference engine for models trained with Nematus. We focused mainly on efficiency and usability. Features of the AmuNMT decoder include:

- Multi-GPU, multi-CPU and mixed GPU/CPU mode with sentence-wise threads (different sentences are decoded in different threads; see the sketch after this list);
- Low-latency CPU-based decoding with intra-sentence multi-threading (one sentence makes use of multiple threads during matrix operations);
- Compatibility with Nematus [12];
- Ensembling of multiple models;
- Vocabulary selection in the output layer [15, 16];
- Integrated segmentation into subword units [3].
[Figure 3 plot: words per second (Wps) and BLEU versus beam size (0-50).]
Figure 3: Beam size versus speed and quality for a single English-French model. Speed was evaluated with AmuNMT on a single GPU. The ragged appearance is due to cuBLAS GEMM kernel switching for different matrix sizes.
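The speed curve in Figure 3 follows directly from how beam search expands hypotheses: at each step every active hypothesis is scored against the output vocabulary and only the best beamSize continuations are kept, so per-step work grows roughly linearly with the beam size. The C++ sketch below illustrates one such expansion step with made-up types (Hypothesis, scoreContinuations); it is a generic illustration, not AmuNMT's implementation.

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Minimal stand-in; a real decoder also carries RNN states, attention, etc.
struct Hypothesis {
  std::vector<int> tokens;  // partial translation (word ids)
  float logProb = 0.0f;     // accumulated log-probability
};

// Assumed model call: log-probabilities over the vocabulary for one
// hypothesis (here a dummy uniform distribution for illustration).
std::vector<float> scoreContinuations(const Hypothesis&, int vocabSize) {
  return std::vector<float>(vocabSize, std::log(1.0f / vocabSize));
}

// One beam-search step: expand every hypothesis over the vocabulary,
// then keep the beamSize best candidates. The work per step is roughly
// beamSize * vocabSize, which is why larger beams decode more slowly.
std::vector<Hypothesis> beamStep(const std::vector<Hypothesis>& beam,
                                 int vocabSize, std::size_t beamSize) {
  std::vector<Hypothesis> candidates;
  for (const Hypothesis& hyp : beam) {
    const std::vector<float> scores = scoreContinuations(hyp, vocabSize);
    for (int w = 0; w < vocabSize; ++w) {
      Hypothesis next = hyp;      // copy of the partial hypothesis
      next.tokens.push_back(w);
      next.logProb += scores[w];
      candidates.push_back(std::move(next));
    }
  }
  const std::size_t keep = std::min(beamSize, candidates.size());
  std::partial_sort(candidates.begin(),
                    candidates.begin() + static_cast<std::ptrdiff_t>(keep),
                    candidates.end(),
                    [](const Hypothesis& a, const Hypothesis& b) {
                      return a.logProb > b.logProb;
                    });
  candidates.resize(keep);
  return candidates;
}

Real decoders avoid copying hypotheses and batch the scoring on the GPU, but the dependence of the cost on beam size and vocabulary size remains.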
[Figure 4 bar charts: words per second for (a) the Google NMT system and (b) Moses, Nematus, and AmuNMT on CPU and GPU.]
Figure 4: (a) Words per second for the Google NMT system reported by [17]. We only give this as an example of a production-ready system. (b) Moses vs. Nematus vs. AmuNMT: our CPUs are Intel Xeon E5-2620 2.40GHz, our GPUs are GeForce GTX 1080. CPU-16 means 16 CPU threads, GPU-1 means a single GPU. All NMT systems were run with a beam size of 5. Systems marked with * use vocabulary selection.
However, decoding speed is significantly slower. We therefore choose a beam size of 5 for our experiments.

7.4. AmuNMT vs. Moses and Nematus

In Figure 4a we report speed in terms of words per second as provided by [17]. Although their models are more complex than ours, we quote these figures as a reference for deployment-ready NMT performance. We ran our experiments on an Intel Xeon E5-2620 2.40GHz server with four NVIDIA GeForce GTX 1080 GPUs. The phrase-based parameters are described in Section 3 and follow best practices for a reasonable speed vs. quality trade-off [19]. The neural MT models are as described in the previous section.

We present the words-per-second rates for our NMT models using AmuNMT and Nematus, executed on the CPU and GPU, in Figure 4b. For the CPU version we use 16 threads, translating one sentence per thread, and restrict the number of OpenBLAS threads to 1 per main Nematus thread. For the GPU version of Nematus we use 5 processes to maximize GPU saturation. As a baseline, the phrase-based model reaches 455 words per second using 16 threads.

The CPU-bound execution of Nematus reaches 47 words per second while the GPU-bound execution achieves 270 words per second. In similar settings, CPU-bound AmuNMT is three times faster than Nematus on the CPU, but three times slower than Moses. With vocabulary selection we can nearly double the speed of AmuNMT on the CPU. The GPU-executed version of AmuNMT is more than three times faster than Nematus and nearly twice as fast as Moses, achieving 865 words per second; with vocabulary selection we reach 1,192. Even the speed of the CPU version would already make it possible to replace a Moses-based SMT system with an AmuNMT-based NMT system in a production environment without severely affecting translation throughput.

AmuNMT can parallelize across multiple GPUs, processing one sentence per GPU. Translation speed increases to 3,368 words per second with four GPUs, and to 4,423 with vocabulary selection. AmuNMT has a start-up time of less than 10 seconds, while Nematus may need several minutes until the first translation can be produced. Nevertheless, the model used in AmuNMT is an exact implementation of the Nematus model. The size of the NMT model with the chosen parameters is approximately 300 MB, which means about 24 models fit into the memory of a single GPU.
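Vocabulary selection removes most of the cost of the output layer by computing logits only for a small per-sentence candidate list instead of the full target vocabulary. The sketch below shows the general idea on a toy dense output layer; the names OutputLayer and selectedLogits are illustrative only, and the actual AmuNMT implementation follows [15, 16].

#include <cstddef>
#include <vector>

// Toy dense output layer: logits = W * hidden + b, where W has one row
// per target word (vocabSize x hiddenSize) and b has vocabSize entries.
struct OutputLayer {
  std::vector<std::vector<float>> W;
  std::vector<float> b;
};

// With vocabulary selection, logits are computed only for the candidate
// word ids gathered for the current source sentence, so the cost is
// |candidates| * hiddenSize instead of |vocabulary| * hiddenSize.
std::vector<float> selectedLogits(const OutputLayer& layer,
                                  const std::vector<float>& hidden,
                                  const std::vector<std::size_t>& candidates) {
  std::vector<float> logits;
  logits.reserve(candidates.size());
  for (std::size_t wordId : candidates) {
    float dot = layer.b[wordId];
    for (std::size_t k = 0; k < hidden.size(); ++k)
      dot += layer.W[wordId][k] * hidden[k];
    logits.push_back(dot);
  }
  return logits;
}

Shrinking this output-layer product is what yields the speed-ups for the systems marked with * in Figure 4b.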
[Bar chart: latency in milliseconds per sentence for Moses, Nematus, and AmuNMT on CPU and GPU; systems marked with * use vocabulary selection.]

8. Conclusions and future work

Although NMT systems are known to generalize better than phrase-based systems for out-of-domain data, it was unclear how they perform in a purely in-domain setting, which is of interest for any organization with significant resources of its own data, such as the UN or other governmental bodies.

We evaluated the performance of neural machine translation on all thirty translation directions of the United Nations Parallel Corpus v1.0. We showed that for all translation directions NMT is either on par with or surpasses phrase-based SMT. For some language pairs, the gains in BLEU are substantial. These include all pairs that have Chinese as a source or target language.