Topic segmentation of spoken documents consists of splitting an automatic transcript of the document into segments that each address a single topic. Such techniques are crucial in many multimedia applications to break long documents, such as broadcast news shows or documentaries, into smaller homogeneous units. For example, in the context of information retrieval, topic segmentation can be used to return only the portions of a document that are relevant to the user's query. Topic segmentation can also be used at the core of a news navigation system such as Voxalead news, which relies on Irisa's news topic segmentation technology.
Most topic segmentation methods rely on the broad notion of lexical cohesion, with the idea that different word distributions are observed for different topics. Typical measures of lexical cohesion exploit exact repetitions of words or repetitions of related words. In the topic segmentation literature, two approaches are commonly found. Global segmentation methods search for the best segmentation of the whole document, usually looking for segments with strong lexical cohesion [3,4]. Conversely, local methods search for a change in the word distribution, considering only a limited number of words to the left and right of a potential topic boundary [1,2].
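As an illustration of the local approach, the sketch below compares the word counts of a fixed-size window on each side of every candidate boundary and takes the position with the lowest cosine similarity as the most likely topic change. This is only a minimal example in the spirit of window-based methods; it is not the exact procedure of [1,2]. The function names, the window size and the toy word sequence are assumptions made for the example.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity between two word-count vectors (Counters)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def local_similarities(words, window=4):
    """Map each candidate boundary position to the similarity between
    the `window` words before it and the `window` words after it.
    Only positions with a full window on both sides are considered."""
    sims = {}
    for i in range(window, len(words) - window + 1):
        left = Counter(words[i - window:i])
        right = Counter(words[i:i + window])
        sims[i] = cosine(left, right)
    return sims


# Toy transcript (hypothetical): 8 words about politics, then 8 about sport.
words = ("election vote election party vote party election vote "
         "goal match team goal team match goal team").split()

sims = local_similarities(words, window=4)
boundary = min(sims, key=sims.get)  # lowest similarity = strongest change
print(boundary)  # -> 8
```

On real automatic transcripts, the similarity curve is much noisier than in this toy case, so a boundary is usually declared only at sufficiently deep minima rather than at the single lowest value.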
These two families rely on very distinct, though complementary, philosophies: global methods are designed to find coherent segments, relying on measures of lexical cohesion, while local methods are designed to find boundaries and rely instead on measures of the lexical dispersion between two segments. Ideally, however, a topic boundary should encompass both notions: a good boundary separates two segments that are sufficiently different from each other, each being coherent in itself. So far, no criterion has been proposed that captures this simple statement.
The internship will focus on the design and evaluation of novel topic segmentation criteria that combine the benefits of the local and global approaches, in the context of spoken documents. Contributions will be evaluated on a comprehensive segmentation benchmark consisting of transcribed TV news, reports and documentaries.
The Irisa news topic segmentation software currently implements a variation of the global method in [3] designed to compensate for the peculiarities of automatic transcripts [5] (misrecognized words, no punctuation, etc.). The general idea of the method is to build a weighted graph of all possible segmentations, where each edge corresponds to a candidate segment and its weight reflects the lexical cohesion of the words in that segment. The segmentation is obtained by searching for the best path in this graph. We will investigate how local criteria can be incorporated into this framework. For example, local criteria can be used as weights on the graph's nodes, as in [6]. Alternatively, one can define an optimization criterion other than a plain best-path search. We will evaluate several proposals along those lines, experimentally comparing their respective impact.
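The best-path search over the segmentation graph can be sketched with dynamic programming, as below. Nodes are the positions between consecutive units (e.g. utterances), an edge (i, j) stands for the candidate segment covering units i to j-1, and an optional node score lets a local criterion reward or penalize a boundary. This is a minimal sketch and not the Irisa implementation: the cohesion measure (log-likelihood of a segment under its own Laplace-smoothed unigram model), the per-segment penalty `seg_penalty`, the `node_score` callback and the toy data are all assumptions made for the example.

```python
from collections import Counter
from math import log


def cohesion(units, i, j, vocab_size):
    """Assumed cohesion score: log-likelihood of the words in units[i:j]
    under the segment's own Laplace-smoothed unigram model."""
    words = [w for unit in units[i:j] for w in unit]
    counts = Counter(words)
    n = len(words)
    return sum(c * log((c + 1) / (n + vocab_size)) for c in counts.values())


def segment(units, seg_penalty=1.0, node_score=None):
    """Return the sorted internal boundaries of the best path from node 0
    to node len(units). Path score = sum of edge cohesion scores, minus a
    penalty per segment, plus node_score(k) for each internal boundary k."""
    n = len(units)
    vocab_size = len({w for unit in units for w in unit})
    if node_score is None:
        node_score = lambda k: 0.0  # no local criterion by default
    best = [float("-inf")] * (n + 1)  # best[j]: best score of a path 0 -> j
    back = [0] * (n + 1)              # back[j]: predecessor of j on that path
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(j):
            score = best[i] + cohesion(units, i, j, vocab_size) - seg_penalty
            if j < n:  # j is an internal boundary, not the end of the document
                score += node_score(j)
            if score > best[j]:
                best[j], back[j] = score, i
    # Follow the back-pointers from the last node to recover the boundaries.
    bounds, k = [], n
    while k > 0:
        k = back[k]
        if k > 0:
            bounds.append(k)
    return sorted(bounds)


# Toy document (hypothetical): two units about politics, two about sport.
units = [["election", "vote"], ["vote", "party"],
         ["goal", "team"], ["team", "match"]]
print(segment(units))  # -> [2]
```

Passing a `node_score` function is one simple way to inject a local criterion into the global search: for instance, a score derived from the lexical dissimilarity of the windows around each node would favor paths whose boundaries fall where the word distribution changes.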
The internship, which will take place in the Texmex team at Irisa, comprises a theoretical part on how to combine local and global criteria, as well as an experimental part in which the proposed techniques will be implemented and evaluated on real data (TV broadcasts).