Machine Translation of TV Subtitles for Large Scale Production
Martin Volk, Rico Sennrich
University of Zurich
Computational Linguistics
CH-8050 Zurich
(volk|sennrich)@cl.uzh.ch
Christian Hardmeier
Fondazione Bruno Kessler
Human Language Technologies
I-38123 Trento
ch@rax.ch
Frida Tidström
University of Stockholm
Datorlingvistik
SE-10691 Stockholm
fridatidstrom@hotmail.com
Abstract
This paper describes our work on building and employing Statistical Machine Translation systems for TV subtitles in Scandinavia. We have built translation systems for Danish, English, Norwegian and Swedish. They are used in daily subtitle production and translate large volumes. As an example we report on our evaluation results for three TV genres. We discuss our lessons learned in the system development process which shed interesting light on the practical use of Machine Translation technology.
1 Introduction
Media traditions distinguish between subtitling and dubbing countries. Subtitling countries broadcast TV programs with the spoken word in the original language and subtitles in the local language. Dubbing countries (like Germany, France and Spain) broadcast with audio in the local language. Scandinavia is a subtitling area and thus large amounts of TV subtitles are needed in Swedish, Danish and Norwegian.
Ideally subtitles are created for each language independently, but for efficiency reasons they are often translated from one source language to one or more target languages. To support the efficient translation we have teamed up with a Scandinavian subtitling company to build Machine Translation (MT) systems. The systems are in practical use today and used extensively. Because of the established language sequence in the company we have built translation systems from Swedish to Danish and to Norwegian. After the successful deployment of these two systems, we have started working on other language pairs including English, German and Swedish. The examples in this paper are taken from our work on Swedish to Danish. The issues for Swedish to Norwegian translation are the same to a large extent.
In this paper we describe the peculiarities of subtitles and their implications for MT. We argue that the text genre “TV subtitles” is well suited for MT, in particular for Statistical MT (SMT). We first introduce a few other MT projects for subtitles and will then present our own. We worked with large corpora of high-quality human translated subtitles as input to SMT training. Finally we will report on our experiences in the process of building and deploying the systems at the subtitling company. We will show some of the needs and expectations of commercial users that deviate from the research perspective.
2 Characteristics of TV Subtitles
When films, series, documentaries etc. are shown
in language environments that differ from the lan-
guage spoken in the video, then some form of trans-
lation is required. Larger markets like Germany
and France typically use dubbing of foreign media
so that it seems that the actors are speaking the lo-
cal language. Smaller countries often use subtitles.
Pedersen (2007) discusses the advantages and draw-
backs of both methods.
In Scandinavian TV, foreign programs are usually subtitled rather than dubbed. Therefore the demand for Swedish, Danish, Norwegian and Finnish subtitles is high. These subtitles are meant for the general public in contrast to subtitles that are specific for the hearing-impaired which often include descriptions of sounds, noises and music (cf. Matamala and Orero, 2010). Subtitles also differ with respect to whether they are produced online (e.g. in live talkshows or sport reports) or offline (e.g. for pre-produced series). This paper focuses on general-public subtitles that are produced offline.
Ventsislav Zhechev (ed.): Proceedings of the Second Joint EM+/CNGL Workshop “Bringing MT to the User: Research on Integrating MT in the Translation Industry” (JEC '10), Denver, CO, November 2010.
In our machine translation project, we use a par-
allel corpus of Swedish, Danish and Norwegian sub-
titles. The subtitles in this corpus are limited to 37
characters per line and to two lines. Depending on
their length, they are shown on screen between 2 and
8 seconds. Subtitles typically consist of one or two
short sentences with an average number of 10 to-
kens per subtitle in our corpus. Sometimes a sen-
tence spans more than one subtitle. The first sub-
title is then ended with a hyphen and the sentence
is resumed with a hyphen at the beginning of the
next subtitle. This occurs about 36 times for each
1000 subtitles in our corpus. TV subtitles contain a
lot of dialogue. One subtitle often consists of two
lines (each starting with a dash) with the first being
a question and the second being the answer.
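The layout conventions above can be made concrete with a small sketch. It checks the 37-character, two-line limit reported for the corpus and rejoins a sentence that was split across two subtitles with the hyphen convention. The function names and the exact joining rule (drop both hyphens, insert one space) are assumptions for illustration, not the authors' tooling.

```python
from typing import List

MAX_CHARS_PER_LINE = 37  # limit reported for the corpus
MAX_LINES = 2            # limit reported for the corpus


def fits_layout(subtitle: str) -> bool:
    """Return True if the subtitle respects the line and length limits."""
    lines = subtitle.split("\n")
    return len(lines) <= MAX_LINES and all(
        len(line) <= MAX_CHARS_PER_LINE for line in lines
    )


def merge_continuations(subtitles: List[str]) -> List[str]:
    """Rejoin sentences split across subtitles.

    Assumed convention: the first part ends with a hyphen and the
    next subtitle starts with a hyphen.
    """
    merged: List[str] = []
    for sub in subtitles:
        text = " ".join(sub.split("\n")).strip()
        if merged and merged[-1].endswith("-") and text.startswith("-"):
            # Drop both hyphens and join the two halves with one space.
            merged[-1] = merged[-1][:-1].rstrip() + " " + text[1:].lstrip()
        else:
            merged.append(text)
    return merged
```

A dialogue subtitle whose lines start with dashes is left alone by this rule unless the preceding subtitle also ends in a hyphen, so real data would likely need a stricter test.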
Although Swedish and Danish are closely related languages, translated subtitles might differ in many respects. Example 1 shows a human-translated pair of subtitles that are close translation correspondences although the Danish translator has decided to break the two sentences of the Swedish subtitle into three sentences.¹
(1) SV: Det är slut, vi hade förfest här. Jätten drack upp allt.
    DA: Den er væk. Vi holdt en forfest. Kæmpen drak alt.
    EN: It is gone. We had a pre-party here. The giant drank it all.
In contrast, the pair in Example 2 exemplifies a different wording chosen by the Danish translator.
(2) SV: Där ser man vad framgång kan göra med en ung person.
    DA: Der ser man, hvordan succes ødelægger et ungt menneske.
    EN: There you see, what success can do to a young person / how success destroys a young person.
¹ In all subtitle examples the English translations were added by the authors.
The space limitations on the screen result in special linguistic properties. For example, when we investigated English subtitles we noticed that apostrophe-s-contractions (for “is, has, us”) are particularly frequent in subtitles because of their closeness to spoken language. Examples are “He’s watching me; He’s lost his watch; Let’s go”. In a random selection of English subtitles we found that 15% contained apostrophe-s. These contractions need to be disambiguated, otherwise we end up with translations like “Oh my gosh, Nicole’s dad is the coolest” being rendered in German as “Mein Gott, Nicole ist Papa ist der coolste” where the possessive ‘s’ is erroneously translated as a copula verb. We have built a special PoS tagger for preprocessing the subtitles, which solves this problem well.
This paper can only give a rough characterization of subtitles. A more comprehensive description of the linguistic properties of subtitles can be found in de Linde and Kay (1999) and Díaz-Cintas and Remael (2007). Gottlieb (2001) and Pedersen (2007) describe the peculiarities of subtitling in Scandinavia, Nagel et al. (2009) in other European countries.
3 Approaches to the Automatic Translation of Film Subtitles
In this section we describe other projects on the automatic translation of subtitles.² We assume subtitles in one language as input and aim at producing an automatic translation of these subtitles into another language. In this paper we do not deal with the conversion of the film transcript into subtitles which requires shortening the original dialogue (cf. Prokopidis et al., 2008). We distinguish between rule-based, example-based, and statistical approaches.
² Throughout this paper we focus on TV subtitles, but in this section we deliberately use the term “film subtitles” in a general sense covering both TV and movie subtitles.
3.1 Rule-based MT of Film Subtitles
Popowich et al. (2000) provide a detailed account of an MT system tailored towards the translation of English subtitles into Spanish. Their approach is based on an MT paradigm which relies heavily on lexical resources but is otherwise similar to the transfer-based approach. A unification-based parser analyzes the input sentence (including proper-name recognition), followed by lexical transfer which provides the input for the generation process in the target language (including word selection and correct inflection).
Although Popowich et al. (2000) call their system “a hybrid of both statistical and symbolic approaches” (p. 333), it is a symbolic system by today’s standards. Statistics are only used for efficiency improvements but are not at the core of the methodology. The paper was published before automatic evaluation methods were invented. Instead Popowich et al. (2000) used the classical evaluation method where native speakers were asked to judge the grammaticality and fidelity of the system. These experiments resulted in “70% of the translations ... ranked as correct or acceptable, with 41% being correct” which is an impressive result. This project resulted in a practical real-time translation system and was meant to be sold by TCC Communications as “a consumer product that people would have in their homes, much like a VCR.” But unfortunately the company went out of business before the product reached the market.³
³ Personal communication with Fred Popowich in August 2010.
Melero et al. (2006) combined Translation Memory technology with Machine Translation for the language pairs Catalan-Spanish and Spanish-English but their Translation Memories were not filled with subtitles but rather with newspaper articles and UN texts. They do not give any motivation for this. Disappointingly they did not train their own MT system but rather worked only with free-access web-based MT systems (which we assume are rule-based systems).
They showed that a combination of Translation Memory with such web-based MT systems works better than the web-based MT systems alone. For English to Spanish translation this resulted in an improvement of around 7 points in BLEU (Papineni et al., 2001) but hardly any improvement at all for English to Czech.
3.2 Example-based MT of Film Subtitles
Armstrong et al. (2006) “ripped” German and English subtitles (40,000 sentences) as training material for their Example-based MT system and compared the performance to a system trained on the same amount of Europarl sentences (which have more than three times as many tokens!). Training on the subtitles gave slightly better results when evaluating against subtitles, compared to training on Europarl and evaluating against subtitles. This is not surprising, although the authors point out that this contradicts some earlier findings that have shown that heterogeneous training material works better.
They do not discuss the quality of the ripped translations nor the quality of the alignments (which we found to be a major problem when we did similar experiments with freely available English-Swedish subtitles). Their BLEU scores are on the order of 11 to 13 for German to English (and worse for the opposite direction).
3.3 Statistical MT of Film Subtitles
Descriptions of Statistical MT systems for subtitles are practically non-existent probably due to the lack of freely available training corpora (i.e. collections of human-translated subtitles). Both Tiedemann (2007) and Lavecchia et al. (2007) report on efforts to build such corpora with aligned subtitles.
Tiedemann (2007) works with a huge collection of subtitle files that are available on the internet at www.opensubtitles.org. These subtitles have been produced by volunteers in a great variety of languages. However the volunteer effort also results in subtitles of often dubious quality. Subtitles contain timing, formatting, and linguistic errors. The hope is that the enormous size of the corpus will still result in useful applications. The first step then is to align the files across languages on the subtitle level. Time codes alone are not sufficient as different (amateur) subtitlers have worked with different time offsets and sometimes even different versions of the same film. Still, Tiedemann (2007) shows that an alignment approach based on time overlap combined with cognate recognition is clearly superior to pure length-based alignment. He has evaluated his approach on English, German and Dutch. His results of 82.5% correct alignments for Dutch-English and 78.1% correct alignments for Dutch-German show how difficult the alignment task is.
Lavecchia et al. (2007) also work with subtitles obtained from the internet. They work on French-English subtitles and use a method which they call Dynamic Time Warping for aligning the files across the languages. This method requires access to a bilingual dictionary to compute subtitle correspondences. They compiled a small test corpus consisting of 40 subtitle files, randomly selecting around 1300 subtitles from these files for manual inspection. Their evaluation focused on precision while sacrificing recall. They report on 94% correct alignments when turning recall down to 66%. They then go on to use the aligned corpus to extract a bilingual dictionary and to integrate this dictionary in a Statistical MT system. They claim that this improves the MT system by 2 BLEU points (though it is not clear which corpus they have used for evaluating the MT system).
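To illustrate the general idea of dictionary-driven Dynamic Time Warping for subtitle alignment, here is a minimal sketch. It is a textbook DTW over two subtitle sequences and not a reconstruction of the method of Lavecchia et al. (2007). The scoring function, the `min_score` threshold and all names are assumptions made for this example.

```python
from typing import Dict, List, Set, Tuple


def match_score(src: str, tgt: str, lexicon: Dict[str, Set[str]]) -> float:
    """Fraction of source tokens with a dictionary translation in the target."""
    src_tokens = src.lower().split()
    tgt_tokens = set(tgt.lower().split())
    if not src_tokens:
        return 0.0
    hits = sum(1 for tok in src_tokens if lexicon.get(tok, set()) & tgt_tokens)
    return hits / len(src_tokens)


def dtw_align(
    src_subs: List[str],
    tgt_subs: List[str],
    lexicon: Dict[str, Set[str]],
    min_score: float = 0.5,  # hypothetical threshold: higher favours precision
) -> List[Tuple[int, int]]:
    """Align two subtitle sequences with DTW; return confident index pairs."""
    n, m = len(src_subs), len(tgt_subs)
    if n == 0 or m == 0:
        return []
    inf = float("inf")
    score = [[match_score(s, t, lexicon) for t in tgt_subs] for s in src_subs]
    # cost[i][j]: cheapest warping path covering the first i source
    # and the first j target subtitles (local cost = 1 - match score).
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = 1.0 - score[i - 1][j - 1]
            cost[i][j] = local + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]
            )
    # Trace the cheapest path back from the end to the start.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        moves = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(moves)
    path.reverse()
    # Keep only the links the dictionary supports well enough.
    return [(a, b) for a, b in path if score[a][b] >= min_score]
```

Raising `min_score` discards weakly supported links, which mirrors the trade-off reported above: higher precision at the price of lower recall.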
This summary indicates that work on the auto-
matic translation of film subtitles with Statistical MT
is limited because of the lack of freely available
high-quality training data. Our own efforts are based
on large proprietary subtitle data and have resulted
in mature MT systems. We will report on them in
the following section.
4 Our MT Systems for TV Subtitles
We have built Machine Translation systems for translating film subtitles from Swedish to Danish and to Norwegian in a commercial setting. Some of this work has been described earlier by Volk and Harder (2007) and Volk (2008).
Most films are originally in English and receive Swedish subtitles based on the English video and audio (sometimes accompanied by an English transcript). The creation of the Swedish subtitle is a manual process done by specially trained subtitlers following company-specific guidelines. In particular, the subtitlers set the time codes (beginning and end time) for each subtitle. They use an in-house tool which allows them to link the subtitle to specific frames in the video.
The Danish translator subsequently has access to the original English video and audio but also to the Swedish subtitles and the time codes. In most cases the translator will reuse the time codes and insert the Danish subtitle. She can, on occasion, change the time codes if she deems them inappropriate for the Danish text.
We have built systems that produce Danish and Norwegian draft translations to speed up the translators’ work. This project of automatically translating subtitles from Swedish to Danish and Norwegian benefited from three favorable conditions:
1. Subtitles are short textual units with little internal complexity (as described in section 2).
2. Swedish, Danish and Norwegian are closely related languages. The grammars are similar, however orthography differs considerably, word order differs somewhat and, of course, one language avoids some constructions that the other language prefers.
3. We have access to large numbers of Swedish subtitles and human-translated Danish and Norwegian subtitles. Their correspondence can easily be established via the time codes which leads to an alignment on the subtitle level.
There are other aspects of the task that are less favorable. Subtitles are not transcriptions, but written representations of spoken language. As a result the linguistic structure of subtitles is closer to written language than the original (English) speech, and the original spoken content usually has to be condensed by the Swedish subtitler.
The task of translating subtitles also differs from most other machine translation applications in that we are dealing with creative language, and thus we are closer to literary translation than technical translation. This is obvious in cases where rhyming song-lyrics or puns are involved, but also when the subtitler applies his linguistic intuitions to achieve a natural and appropriate wording which blends into the video without standing out. Finally, the language of subtitling covers a broad variety of domains from educational programs on any conceivable topic to exaggerated modern youth language.
We have decided to build statistical MT (SMT) systems in order to shorten the development time (compared to a rule-based system) and in order to best exploit the existing translations. We have trained our SMT systems by using standard open source SMT software. Since Moses was not yet available at the starting time of our project, we trained our systems by using GIZA++ (Och and Ney, 2004) for the alignment, Thot (Ortiz-Martínez et al., 2005) for phrase-based SMT, and Phramer (www.olteanu.info) as the decoder.
We will first present our setting and the evaluation results and then discuss the lessons learned from deploying the systems in the subtitling company.
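Condition 3 above rests on pairing subtitles by their time codes. The following minimal sketch shows one way such a pairing could work, using the tolerance of 15 TV frames (0.6 seconds, which implies 25 frames per second) that Section 4.1 reports for compiling the corpus. The `Subtitle` record, the frame-based time representation and the single-pass matching loop are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import List, Tuple

FPS = 25             # implied by "15 TV-frames (0.6 seconds)"
MAX_FRAME_DIFF = 15  # tolerance reported in Section 4.1


@dataclass
class Subtitle:
    start: int  # start time in frames (assumed representation)
    end: int    # end time in frames
    text: str


def pair_by_timecode(
    swedish: List[Subtitle],
    danish: List[Subtitle],
    max_diff: int = MAX_FRAME_DIFF,
) -> List[Tuple[Subtitle, Subtitle]]:
    """Pair subtitles whose start and end times both agree within max_diff.

    Both lists are assumed to be sorted by start time. Subtitles
    without a close enough counterpart are dropped.
    """
    pairs: List[Tuple[Subtitle, Subtitle]] = []
    j = 0
    for sv in swedish:
        # Skip Danish subtitles that start too early to match this one.
        while j < len(danish) and danish[j].start < sv.start - max_diff:
            j += 1
        if j == len(danish):
            break
        da = danish[j]
        if abs(da.start - sv.start) <= max_diff and abs(da.end - sv.end) <= max_diff:
            pairs.append((sv, da))
            j += 1
    return pairs
```

Discarding every pair outside the tolerance trades corpus size for alignment quality, which is affordable only because the translators usually reuse the Swedish time codes.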
4.1 Our Subtitle Corpus
Our corpus consists of TV subtitles from soap operas (like daily hospital series), detective series, animation series, comedies, documentaries, feature films etc. In total we have more than 14,000 subtitle files (= single TV programmes) in each language, corresponding to more than 5 million subtitles (equalling more than 50 million words).
When we compiled our corpus we included only subtitles with matching time codes. If the Swedish and Danish time codes differed by more than a threshold of 15 TV-frames (0.6 seconds) in either start or end-time, we suspected that they were not good translation equivalents and excluded them from the subtitle corpus. In this way we were able to avoid complicated alignment techniques. Most of the resulting subtitle pairs are high-quality translations thanks to the controlled workflow in the commercial setting. Note that we are not aligning sentences. We work with aligned subtitles which can consist of one or two or three short sentences. Sometimes a subtitle holds only the first part of a sentence which is finished in the following subtitle.
In a first profiling step we investigated the repetitiveness of the subtitles. We found that 28% of all Swedish subtitles in our training corpus occur more than once. Half of these recurring subtitles have exactly one Danish translation. The other half have two or more different Danish translations which are due to context differences combined with the high context dependency of short utterances and the Danish translators choosing less compact representations.
From our subtitle corpus we chose a random selection of files for training the translation model and the language model. We currently use 4 million subtitles for training. From the remaining part of the corpus, we selected 24 files (approximately 10,000 subtitles) representing the diversity of the corpus from which a random selection of 1000 subtitles was taken for our test set. Before the training step we tokenized the subtitles (e.g. separating punctuation symbols from words), converted all uppercase words into lower case, and normalized punctuation symbols, numbers and hyphenated words.
4.2 Unknown Words
Although we have a large training corpus, there are still unknown words (not seen in the training data) in the evaluation data. They comprise proper names of people or products, rare word forms, compounds, spelling deviations and foreign words. Proper names need not concern us in this context since the system will copy unseen proper names (like all other unknown words) into the target language output, which in almost all cases is correct.
Rare word forms and compounds are more serious problems. Hardly ever do all forms of a Swedish verb occur in our training corpus (regular verbs have 7 forms). So even if 6 forms of a Swedish verb have been seen frequently with clear Danish translations, the 7th will be regarded as an unknown if it is missing in the training data.
Both Swedish and Danish are compounding languages which means that compounds are spelled as orthographic units and that new compounds are dynamically created. This results in unseen Swedish compounds when translating new subtitles, although often the parts of the compounds were present in the training data. We therefore generate a translation suggestion for an unseen Swedish compound by combining the Danish translations of its parts. For an unseen word that is longer than 8 characters we split it into two parts in all possible ways. If the two parts are in our corpus, we gather the most frequent Danish translation of each for the generation of the target language compound. This has resulted in a measurable improvement in the translation quality. To keep things simple we disregard splitting compounds into three or more parts. These cases are extremely rare in subtitles.
Variation in graphical formatting also poses problems. Consider spell-outs, where spaces, commas, hyphens or even full stops are used between the letters of a word, like “I will n o t do it”, “Seinfeld” spelled “S, e, i, n, f, e, l , d” or “W E L C O M E T O L A S V E G A S”, or spelling variations like ä-ä-älskar or abso-jävla-lut which could be rendered in English as lo-o-ove or abso-damned-lutely.
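The compound handling described in Section 4.2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the deployed implementation: the lookup table `most_frequent_translation` stands in for whatever the trained translation model provides, the function name is invented, and the sketch returns every candidate because the text does not say how competing splits are ranked. Linking elements between compound parts are ignored.

```python
from typing import Dict, List

MAX_SIMPLE_LENGTH = 8  # only words longer than 8 characters are split


def suggest_compound_translations(
    word: str,
    most_frequent_translation: Dict[str, str],
) -> List[str]:
    """Suggest Danish compounds for an unseen Swedish compound.

    The word is split into two parts at every position. A split counts
    only if both parts are known, and the most frequent Danish
    translations of the parts are then concatenated.
    """
    if word in most_frequent_translation or len(word) <= MAX_SIMPLE_LENGTH:
        return []  # known word, or too short to be treated as a compound
    suggestions: List[str] = []
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in most_frequent_translation and tail in most_frequent_translation:
            suggestions.append(
                most_frequent_translation[head] + most_frequent_translation[tail]
            )
    return suggestions
```

With a toy table that maps Swedish "vinter" to Danish "vinter" and "jacka" to "jakke", the unseen word "vinterjacka" yields the single suggestion "vinterjakke".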