Multilingual speech metadata extraction


TOSCA-MP targets metadata extraction from speech in the audio track of videos in a heterogeneous collection, which includes videos of different genres and in different languages. Multilingual speech processing is addressed through spoken language identification, automatic transcription, named entity recognition and automatic translation.
This task poses several challenges, which were addressed during the project.

Multilingual Speech Recognition

The TOSCA-MP video collection contains videos in four languages: Dutch, English, German and Italian. During the project, automatic transcription systems were set up for all four languages using the FBK internal HMM software toolkit. The acoustic and linguistic data used for system development were gathered from several sources.

The TOSCA-MP system features spoken language identification capabilities, relying on either a multilingual speech recognizer or a system based on Gaussian mixture models.
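As an illustration of the Gaussian mixture model option, the following minimal sketch scores a sequence of acoustic feature vectors against one diagonal-covariance GMM per language and picks the language with the highest average log-likelihood. It is not the TOSCA-MP implementation: the model layout (a list of weight, means, variances tuples), the function names and the toy two-dimensional parameters are all assumptions made for the example, and a real system would train the models on acoustic features extracted from speech.

```python
import math


def gmm_log_likelihood(frame, gmm):
    """Log-likelihood of one feature vector under a diagonal-covariance GMM.

    `gmm` is a list of (weight, means, variances) tuples; this layout is
    a hypothetical choice for the sketch.
    """
    component_scores = []
    for weight, means, variances in gmm:
        log_p = math.log(weight)
        for x, mu, var in zip(frame, means, variances):
            log_p += -0.5 * (math.log(2 * math.pi * var) + (x - mu) ** 2 / var)
        component_scores.append(log_p)
    # Log-sum-exp over the components, for numerical stability.
    top = max(component_scores)
    return top + math.log(sum(math.exp(s - top) for s in component_scores))


def identify_language(frames, models):
    """Return (best_language, scores), scoring by average frame log-likelihood."""
    scores = {
        lang: sum(gmm_log_likelihood(f, gmm) for f in frames) / len(frames)
        for lang, gmm in models.items()
    }
    return max(scores, key=scores.get), scores


# Toy models with made-up parameters, one GMM per language.
toy_models = {
    "it": [(1.0, [0.0, 0.0], [1.0, 1.0])],
    "de": [(0.5, [3.0, 3.0], [1.0, 1.0]), (0.5, [5.0, 5.0], [1.0, 1.0])],
}
best, _ = identify_language([[0.1, -0.2], [0.3, 0.0]], toy_models)
print(best)
```

Averaging over frames keeps scores comparable across segments of different length; the decision itself is a simple maximum over the per-language scores.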

Machine Translation

The TOSCA-MP system features machine translation capabilities for the translation directions relevant to the project. Using Moses, a free software toolkit for statistical machine translation, systems were set up for the following translation directions: Dutch-to-English, English-to-Italian, German-to-English and German-to-Italian. In addition, by properly interfacing the automatic transcription and machine translation systems, the full chain for spoken language translation is supported.

Automatic Transcription of Talk-show TV Programs

To better understand the challenges posed by automatic transcription of TV programs, we analysed the content of several episodes of two popular Italian talk-shows. A comparison with television broadcast news, a video genre that has been extensively investigated in the past, confirms that automatic transcription of broadcast conversations is a more challenging task [Brugnara et al. 2012].

In particular, we have developed an approach for language model (LM) adaptation that aims to ensure robustness to variations in linguistic content and style occurring during a talk-show episode. The proposed approach enables the transcription system to adapt "quickly" to such variations by applying both automatic text data selection and dynamic LM adaptation techniques on a short running window of words [Falavigna et al. 2013].
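The idea of adapting to a short running window of words can be illustrated with a deliberately simple stand-in: a unigram cache built from the most recent words and linearly interpolated with a fixed background model. This is not the method of Falavigna et al., which combines text data selection with dynamic LM adaptation; the class name, the window size, the interpolation weight and the toy probabilities below are all assumptions made for the sketch.

```python
from collections import Counter, deque


class WindowAdaptedUnigram:
    """Background unigram model interpolated with a running-window cache."""

    def __init__(self, background, window_size=100, cache_weight=0.2):
        self.background = background          # dict: word -> probability
        self.window = deque(maxlen=window_size)
        self.counts = Counter()               # word counts inside the window
        self.cache_weight = cache_weight

    def observe(self, word):
        """Add a recognized word, dropping the oldest one if the window is full."""
        if len(self.window) == self.window.maxlen:
            oldest = self.window[0]
            self.counts[oldest] -= 1
            if self.counts[oldest] == 0:
                del self.counts[oldest]
        self.window.append(word)              # the deque evicts the oldest word
        self.counts[word] += 1

    def prob(self, word):
        """Interpolated probability of `word` given the current window."""
        background_p = self.background.get(word, 0.0)
        if not self.window:
            return background_p
        cache_p = self.counts[word] / len(self.window)
        return (1 - self.cache_weight) * background_p + self.cache_weight * cache_p


# Toy background probabilities (made up for the example).
lm = WindowAdaptedUnigram({"governo": 0.01, "calcio": 0.01},
                          window_size=3, cache_weight=0.5)
for w in ["calcio", "calcio", "governo"]:
    lm.observe(w)
print(lm.prob("calcio"))
```

With a short window, words tied to the current topic gain probability quickly and lose it again once the conversation moves on, which is the behaviour the running window is meant to provide.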

Speech Transcription through Crowdsourcing

Two methods for crowdsourcing speech transcription have been implemented. They incorporate two different quality control mechanisms (explicit versus implicit) and are based on two different processes (parallel versus iterative). The viability of the two methods for producing a significant amount of transcribed speech in several languages was assessed by targeting the four languages in the scope of the project [Sprugnoli et al., 2013].

Creation of Ground Truth

To create the ground truth for a subset of videos in the TOSCA-MP collection, trained personnel performed segmentation into speaker turns and accurate orthographic transcription of the speech in the audio track of the videos. In addition, a number of annotations were performed on the audio signal at the linguistic level as well as at the acoustic level. While for Dutch, German and Italian both news broadcasts and talk-show episodes were transcribed, only news broadcasts were transcribed for English. For German, Italian and English, orthographic transcriptions produced by trained workers were compared with those generated by non-experts through crowdsourcing.

In addition, manually generated orthographic transcriptions of some news broadcasts (about 20,000 words for each language) were translated in the following directions: Dutch-to-English, English-to-Italian, German-to-English and German-to-Italian.
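The comparison between expert and crowdsourced transcriptions mentioned above needs a distance measure between two word sequences. The text does not say which metric was used; a common choice, assumed here, is the word error rate (WER), computed as the word-level edit distance divided by the length of the reference. The function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcription must not be empty")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


expert = "the cat sat on the mat"
crowd = "the cat sat on mat"
print(word_error_rate(expert, crowd))
```

In practice both transcriptions would first be normalized (case, punctuation, number formats) so that purely orthographic differences are not counted as errors.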

© 2017 TOSCA-MP - Task-Oriented Search and Content Annotation for Media Production
The research leading to the presented results has received funding from the European Union's
Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 287532.