# Analysis and modelling of video sequences

## Scientific foundations: 3D scene modelling based on projective geometry

Key words: 3D reconstruction, computer vision, projective geometry, perspective projection, camera models, fundamental matrix, epipolar constraints.

3D reconstruction is the process of estimating the shape and position of 3D objects from views of these objects. TEMICS deals more specifically with the modelling of large scenes from monocular video sequences. 3D reconstruction using projective geometry is by definition an inverse problem. Key issues that still lack satisfactory solutions include the estimation of camera parameters, especially in the case of a moving camera. Specific problems to be addressed include the matching of features between images and the modelling of hidden areas and depth discontinuities.

3D reconstruction uses theory and methods from the areas of computer vision and projective geometry. When the camera $C_i$ is modelled as a perspective projection, the projection equations are:

$$\mathbf{p}_i = P_i\,\mathbf{x}, \qquad (1)$$

where $\mathbf{x}$ is a 3D point with homogeneous coordinates $\mathbf{x} = (x\ y\ z\ 1)^t$ in the scene reference frame $R_0$, and where $\mathbf{p}_i = (X_i\ Y_i\ 1)^t$ are the coordinates of its projection on the image plane $I_i$ (the equality holds up to a non-zero scale factor, as usual with homogeneous coordinates). The projection matrix $P_i$ associated with the camera $C_i$ is defined as $P_i = K\,(r_i \mid t_i)$. It is a function of both the intrinsic parameters $K$ of the camera and of the transformations (rotation $r_i$ and translation $t_i$) called the extrinsic parameters, which characterize the position of the camera reference frame $R_i$ with respect to the scene reference frame $R_0$. Intrinsic and extrinsic parameters are obtained through calibration or self-calibration procedures. Calibration is the estimation of camera parameters using a calibration pattern (objects providing known 3D points) and images of this pattern. Self-calibration is the estimation of camera parameters using only image data. These data must have previously been matched by identifying and grouping all the 2D image points resulting from projections of the same 3D point. Solving the 3D reconstruction problem is then equivalent to searching for $\mathbf{x}$ given the $\mathbf{p}_i$, i.e. to solving Eqn. (1) with respect to the coordinates $\mathbf{x}$. Like any inverse problem, 3D reconstruction is very sensitive to uncertainty. Solving it requires accurate image measurements and suitable numerical optimization techniques.
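As an illustration of Eqn. (1) and of its inversion, the following minimal sketch projects a 3D point with known projection matrices and then recovers it by linear least-squares triangulation. It is not the TEMICS implementation: the function names, the choice of a linear solver via normal equations, and all numeric values are assumptions made for the example, and the camera parameters are taken as already known.

```python
# Minimal sketch of Eqn. (1) and of its inversion by linear triangulation.
# All names and numeric values are illustrative assumptions.

def project(P, x):
    """Project the 3D point x = (x, y, z) with a 3x4 matrix P; returns (X, Y)."""
    xh = (x[0], x[1], x[2], 1.0)
    u, v, w = (sum(P[r][c] * xh[c] for c in range(4)) for r in range(3))
    return (u / w, v / w)  # divide by the homogeneous scale factor

def solve3(M, b):
    """Solve the 3x3 system M a = b by Gaussian elimination with pivoting."""
    A = [row[:] + [rhs] for row, rhs in zip(M, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, 3):
            f = A[r][col] / A[col][col]
            for c in range(col, 4):
                A[r][c] -= f * A[col][c]
    sol = [0.0] * 3
    for r in (2, 1, 0):
        s = A[r][3] - sum(A[r][c] * sol[c] for c in range(r + 1, 3))
        sol[r] = s / A[r][r]
    return sol

def triangulate(Ps, ps):
    """Least-squares 3D point from projection matrices Ps and image points ps."""
    rows, rhs = [], []
    for P, (X, Y) in zip(Ps, ps):
        for coord, line in ((X, 0), (Y, 1)):
            # coord * (P[2] . xh) - (P[line] . xh) = 0, with xh = (x, y, z, 1)
            rows.append([coord * P[2][c] - P[line][c] for c in range(3)])
            rhs.append(P[line][3] - coord * P[2][3])
    # Normal equations (A^t A) x = A^t b
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve3(AtA, Atb)

# Two hypothetical cameras sharing K (focal 500, principal point (160, 120)),
# identity rotation, and a unit translation between them along the x axis.
P1 = [[500.0, 0.0, 160.0, 0.0], [0.0, 500.0, 120.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
P2 = [[500.0, 0.0, 160.0, -500.0], [0.0, 500.0, 120.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
point = (0.2, -0.1, 4.0)
estimate = triangulate([P1, P2], [project(P1, point), project(P2, point)])
```

With noise-free measurements the estimate matches the original point. With noisy image points the linear solution is only a starting value, which is one reason the text stresses accurate measurements and suitable optimization techniques.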

## Application Domain

The field of video compression has evolved significantly over the last decade, leading to the emergence of a large number of international standards (e.g., MPEG-4, H.264).

Even though compression remains a key issue for most multimedia applications, it is not the only one to be taken into account. Emerging applications in the area of interactive audiovisual services show a growing interest in interactivity, in content-based capabilities (e.g., 3-D scene navigation or the creation of intermediate camera viewpoints), and in the integration of information of different natures, e.g., in augmented and virtual reality applications. These capabilities are not well supported by existing solutions. Interaction with and navigation in the video content require extracting appropriate models, such as regions, objects, 3-D models, mosaics or shots. These features are expected to benefit multimedia applications requiring 3-D virtual scenes, such as video games, virtual museum visits, and virtual and augmented reality.

## New results

### 3D scene modelling from monocular video sequences

Contributed by: Raphaele Balter, Luce Morin.

A video representation scheme based on a set of 3D models has been developed. The originality of the approach lies in the construction of a set of independent 3D models linked by common viewpoints (key images), instead of a single 3D model as in classical approaches. The sequence of 3D models can be streamed for remote navigation in the scene. Several aspects have been optimized and enhanced this year.

The approach assumes that the camera undergoes non-degenerate motion, i.e., motion that enables 3D reconstruction. Camera panning violates this assumption: the resulting motion is close to a pure rotation around the vertical axis, and the camera optical center is assumed to be static, leading to a unique viewpoint for all images. In this case, 3D information cannot be retrieved. However, the video sequence can be efficiently described with 2D models, e.g., with mosaics. This observation led to the design of a hybrid modelling approach in which 3D models are used in the presence of non-degenerate motion, and 2D mosaics are used for video segments where 3D reconstruction is not possible. The mosaic is obtained from a deformable mesh. The model type (2D or 3D) is chosen according to the magnitude of camera rotation and translation with respect to the apparent motion magnitude. A 3D textured cylinder is generated from the 2D panoramic image and is associated with camera positions. This allows a compatible 3D visualization scheme for both 2D and 3D models (see Fig. figure-rec3).

3D morphing has been added to cope with discontinuities between the models in the sequence. The successive 3D models are mapped onto a common parametric space; the 2D parameters are merged into a 2D mesh containing the respective vertices and edges. Morphing is then achieved by interpolating vertex positions, avoiding the need for re-meshing.
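The textured cylinder generated from a panoramic image can be sketched as follows. This is only an illustration under stated assumptions: the function name, the grid resolution, the angular span `fov` covered by the mosaic and the choice of placing the viewer on the cylinder axis are all hypothetical and are not taken from the TEMICS software.

```python
import math

def cylinder_from_panorama(pano_width, pano_height, fov=2 * math.pi,
                           radius=1.0, columns=16, rows=2):
    """Return a list of ((x, y, z), (s, t)) pairs: cylinder vertices together
    with the normalized texture coordinates that index the panoramic image.

    Assumption: the panorama covers a horizontal angle `fov` (in radians)
    around a vertical axis passing through the static optical center.
    """
    # Height chosen so that a pixel covers the same length in both directions.
    height = fov * radius * pano_height / pano_width
    vertices = []
    for j in range(rows + 1):
        t = j / rows                      # 0 at the top of the image, 1 at the bottom
        for i in range(columns + 1):
            s = i / columns               # 0 at the left edge, 1 at the right edge
            theta = fov * (s - 0.5)       # pan angle associated with this column
            position = (radius * math.sin(theta),
                        height * (0.5 - t),
                        radius * math.cos(theta))
            vertices.append((position, (s, t)))
    return vertices
```

Every vertex lies at distance `radius` from the vertical axis, so a virtual camera placed on that axis sees the mosaic as the original panning camera did, using the same rendering path as for the 3D models.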

A technique to encode the 3D models has been designed. The 3D models are independent and produced by elevation from a uniform triangular mesh on each key image. The coding algorithm is based on a wavelet decomposition of the model geometry, associated with a unique topological model. This leads to a scalable representation of the model geometry, preventing visual artifacts inherent to topological re-meshing. A patent has been filed.

Although promising, the approach so far did not enforce the consistency of the extracted 3D information along the entire sequence. Possible discontinuities between the different models resulted in annoying artifacts in the reconstructed images.

In 2004, effort was first dedicated to designing solutions to this problem of discontinuities. The approaches designed rely on morphing techniques and on so-called evolutive 3D models. The first approach relies on a posteriori 3D morphing over regularly meshed 3D models. A joint 2D parameterization of the surfaces of pairs of adjacent models gives a geometric correspondence between the two models. A common connectivity mesh including all vertices and faces of the two models is then created by 2D mesh fusion. Linear interpolation is then applied to the re-meshed 3D models. This scheme allows a smooth evolution of the geometry (shape) between the two models. However, it does not avoid breaks in the connectivity of the models. In addition, the re-meshing done for each 3D model significantly increases decoder complexity.

Alternatively, the 3D models can be constrained as they are extracted from the video, so that the visible subsets common to two adjacent models share the same connectivity (i.e., the same vertices and faces). The 3D models are first constructed independently by elevation from a uniform triangular mesh (i.e., from a meshed depth map) on each key image. The mesh is then tracked and updated to account for appearing areas, while preserving the existing connectivity. This common connectivity provides natural correspondences between the models. Hence, 3D morphing can be performed using classical interpolation techniques. The resulting set of 3D models turns out to have good properties in terms of continuity of geometry, texture and connectivity.
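When two adjacent models share the same connectivity, morphing reduces to interpolating corresponding vertex positions. The sketch below shows the linear case; the function name and the tiny example mesh are assumptions made for illustration, and linear interpolation is used only as the simplest instance of a classical interpolation technique.

```python
def morph_vertices(vertices_a, vertices_b, alpha):
    """Linearly interpolate between two vertex lists sharing one connectivity.

    alpha = 0 returns the first model, alpha = 1 the second one. The face list
    is not needed here: it is identical for both models and is left untouched.
    """
    if len(vertices_a) != len(vertices_b):
        raise ValueError("models must have the same number of vertices")
    return [tuple((1.0 - alpha) * a + alpha * b for a, b in zip(va, vb))
            for va, vb in zip(vertices_a, vertices_b)]

# Hypothetical example: one triangle whose depth differs between two models.
faces = [(0, 1, 2)]                      # shared by both models
model_a = [(0.0, 0.0, 4.0), (1.0, 0.0, 4.0), (0.0, 1.0, 4.0)]
model_b = [(0.0, 0.0, 6.0), (1.0, 0.0, 5.0), (0.0, 1.0, 4.0)]
halfway = morph_vertices(model_a, model_b, 0.5)
```

Because only vertex positions change, the decoder can render intermediate models without any re-meshing step.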

A scalable video coding scheme based on the 3D model representation of the scene described above has also been developed. The geometry, connectivity and texture of the 3D models are encoded and transmitted, together with the camera position for each frame. Both the texture and the geometry of the models are encoded using a wavelet-based representation. The consistent connectivity of the models enables a consistent wavelet decomposition of the sequence of 3D models. This in turn allows efficient and progressive coding and decoding of the model geometry.
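The idea of a progressive, wavelet-based description of geometry can be illustrated on a one-dimensional list of depth values. The sketch below uses the Haar wavelet purely as an assumption (the text does not name the wavelet used), works on a 1D signal rather than on a mesh, and leaves out quantization and entropy coding.

```python
def haar_forward(signal):
    """Full Haar decomposition of a list whose length is a power of two.

    Returns (mean, details), where details is ordered from the coarsest band
    to the finest one.
    """
    approx = list(signal)
    details = []
    while len(approx) > 1:
        pairs = list(zip(approx[0::2], approx[1::2]))
        details.insert(0, [(a - b) / 2 for a, b in pairs])
        approx = [(a + b) / 2 for a, b in pairs]
    return approx[0], details

def haar_inverse(mean, details, levels=None):
    """Rebuild the signal using only the first `levels` detail bands.

    levels=None uses every band (exact reconstruction); a smaller value
    simulates a decoder that has received only the coarse part of the stream.
    """
    approx = [mean]
    for k, band in enumerate(details):
        used = band if levels is None or k < levels else [0.0] * len(band)
        approx = [v for a, d in zip(approx, used) for v in (a + d, a - d)]
    return approx

depths = [4.0, 2.0, 6.0, 8.0]            # hypothetical vertex depths
mean, details = haar_forward(depths)
coarse = haar_inverse(mean, details, levels=1)
exact = haar_inverse(mean, details)
```

Each additional detail band refines the reconstruction, which is the property that makes such a representation scalable.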

### Video object segmentation and representation for compression purposes

Contributed by: Marc Chaumont, Dubhe Chavira-Martinez, Nathalie Cammas, Henri Nicolas, Stéphane Pateux.

Object-based video coding approaches are often proposed for compression with advanced functionalities. Object-based video representation and coding allow for semantic interpretation and associated manipulation. Compression efficiency can also be improved by handling the occlusions, by selecting adapted coding techniques for each object, and by optimal allocation of bit rates to the different objects.

Two object-based video coding algorithms have been developed; both use the TEMICS segmentation tools with some temporal tracking refinements. The first algorithm relies on a predictive texture-based coding approach. The segmentation extracts a set of objects together with their mean texture. Video is then reconstructed with a two-layer representation. In the first layer, the mean texture information of each object is robustly transmitted. In the second layer, segmentation, motion and texture refinement information is transmitted separately for each frame. Segmentation and motion information make it possible to warp the texture in order to obtain a coarse approximation of the image. This decomposition allows progressive and robust transmission: any frame may be lost or dropped, and refinement information can also be coded progressively (e.g., by bit-plane coding techniques). Early experiments show promising results at low bit rates.
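Progressive coding of refinement information by bit planes can be sketched as follows. The example is a generic illustration rather than the coder described above: function names are hypothetical, values are assumed to be non-negative integers (for instance residual magnitudes, with signs handled separately), and no entropy coding is applied.

```python
def to_bit_planes(values, n_bits):
    """Split non-negative integers into bit planes, most significant first."""
    return [[(v >> b) & 1 for v in values] for b in range(n_bits - 1, -1, -1)]

def from_bit_planes(planes, n_bits):
    """Rebuild values from the planes received so far.

    `planes` may hold fewer than n_bits planes; the missing low-order planes
    are treated as zero, which yields a coarser approximation.
    """
    values = [0] * len(planes[0])
    for k, plane in enumerate(planes):
        shift = n_bits - 1 - k
        values = [v | (bit << shift) for v, bit in zip(values, plane)]
    return values

refinement = [5, 12, 3, 9]               # hypothetical 4-bit magnitudes
planes = to_bit_planes(refinement, 4)
coarse = from_bit_planes(planes[:2], 4)  # only the two most significant planes
exact = from_bit_planes(planes, 4)
```

Truncating the stream after any plane still gives a usable approximation, which is what makes the refinement layer progressive.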

The second algorithm is based on an analysis-synthesis approach that de-correlates shape, motion and texture information, coupled with spatio-temporal wavelet decompositions. Note that in classical object-based coders, shape, motion and texture information are usually correlated, which limits scalable representations. Motion is first estimated using active meshes tracked over several frames. Using this information, texture and shape may be extracted and represented independently of the motion information. We then end up with three types of information to encode and transmit. Each type of information is then decomposed using spatio-temporal wavelet transforms and progressively encoded. The resulting bit-streams are fully scalable (spatially, temporally and in terms of SNR) and allow object-based scalability.

Video coding efficiency depends on the accuracy of the motion fields and on the way these motion fields are exploited in the motion-compensated temporal transform. In occlusion areas, most motion estimation methods fail because of spatial and temporal motion discontinuities. A new algorithm based on non-manifold motion has been developed for handling occlusions in motion estimation and representation. In an occlusion area, the mesh is constrained along the frontier between two objects having different displacements, so that different meshes can be constructed on both sides of the frontier. The corresponding triangles therefore overlap. A non-manifold motion field is thus produced, enabling occluding objects to move independently from each other. The hierarchical mesh representation and estimation ensures the consistency of the motion field. The use of active meshes together with this handling of the frontiers between objects (called cracklines) improves the temporal prediction (see Fig. 5).

### Shadow and motion analysis for video compression

Contributed by: Fabien Catteau, Mireya Garcia-Vasquez, Henri Nicolas.

Motion compensation is a core technique for video compression. It efficiently exploits the temporal redundancy between successive images. Moving shadows in a sequence create temporal activity which reduces the efficiency of motion-compensated temporal prediction. We have thus focused on optimizing motion analysis by taking into account the presence of shadows and by using augmented motion models beyond the classical translational model.

In order to be able to correctly compensate the moving cast shadows, a realistic cast shadow model has been defined. This model takes into account the penumbra effect and the modifications of the ambient light. It has been incorporated in the joint cast shadow and light source position estimation previously developed. The shadow segmentation method has also been improved. It is now based on the minimization of an energy term using a clustering method. If the contours of the object which creates a shadow are known, the projection of the light source position on the image plane is determined.

Once the shadow contours have been determined, they are represented using a set of level lines defined in the luminance ratio space. Breaking nodes are detected on the level lines, which are represented with B-spline functions. This provides smooth texture variations in the reconstructed shadows, allowing a relatively precise shadow prediction. The moving cast shadows are removed from the original images. Two data streams, corresponding respectively to the shadow information (contour and texture) and to the sequence without the shadows, are then coded separately. The images without the shadows are coded using the scalable video coding scheme based on a 3D subband decomposition developed by TEMICS. The approach can be beneficial for very low bit rate video surveillance systems, where the shadows carry no useful information and can be coded only coarsely or not at all, and where only the moving objects represent relevant information.
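The role of the luminance ratio can be illustrated with a small sketch. It assumes, hypothetically, that a shadow-free reference luminance is available for each pixel and that the ratio is simply the shadowed luminance divided by that reference. Snapping the ratio to a few values is a crude stand-in for the level lines, and the B-spline representation is not modelled. All names and values are illustrative.

```python
def luminance_ratio(shadowed, reference, eps=1e-6):
    """Per-pixel ratio between the shadowed luminance and a shadow-free reference."""
    return [[s / max(r, eps) for s, r in zip(row_s, row_r)]
            for row_s, row_r in zip(shadowed, reference)]

def quantize_levels(ratio, levels):
    """Snap each ratio to the nearest of the given level values."""
    return [[min(levels, key=lambda level: abs(level - v)) for v in row]
            for row in ratio]

def remove_shadow(image, ratio, eps=1e-6):
    """Compensate a cast shadow by dividing the image by the ratio map."""
    return [[p / max(r, eps) for p, r in zip(row_p, row_r)]
            for row_p, row_r in zip(image, ratio)]

reference = [[100.0, 100.0, 100.0]]      # hypothetical shadow-free luminance
shadowed = [[100.0, 78.0, 52.0]]         # lit pixel, penumbra, umbra
ratio = luminance_ratio(shadowed, reference)
coarse_ratio = quantize_levels(ratio, [0.5, 0.75, 1.0])
restored = remove_shadow(shadowed, coarse_ratio)
```

With only three levels the restored luminance is approximate in the penumbra and umbra. A finer set of levels gives a more precise shadow prediction at the cost of more shadow information to transmit.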

## Software: 3D Model-based video codec

Contributed by: Raphaele Balter, Luce Morin.

From a video sequence of a static scene viewed by a moving monocular camera, this software automatically constructs a representation of the video as a stream of textured 3D models. 3D models are extracted using stereovision and dense matching map estimation techniques. A virtual sequence is reconstructed by projecting the textured 3D models onto image planes. This representation enables 3D functionalities such as synthetic object insertion, lighting modification, stereoscopic visualization or interactive navigation. The codec compresses at low and very low bit rates (16 to 256 kb/s in 25 Hz CIF format) with satisfactory visual quality.
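To give an order of magnitude for these bit rates, the short calculation below compares them with the raw rate of a CIF sequence. It assumes a 352 x 288 frame with 8-bit 4:2:0 sampling (12 bits per pixel on average) and 1 kb = 1000 bits; neither assumption is stated in the text.

```python
CIF_WIDTH, CIF_HEIGHT = 352, 288         # CIF luminance resolution
FRAME_RATE = 25                          # frames per second
BITS_PER_PIXEL = 12                      # assumption: 8-bit 4:2:0 sampling

raw_bits_per_second = CIF_WIDTH * CIF_HEIGHT * BITS_PER_PIXEL * FRAME_RATE

def compression_ratio(target_kbps):
    """Ratio between the raw rate and a target rate given in kb/s (1 kb = 1000 bits)."""
    return raw_bits_per_second / (target_kbps * 1000)

def bits_per_frame(target_kbps):
    """Average bit budget available for each frame at the target rate."""
    return target_kbps * 1000 / FRAME_RATE

ratio_high = compression_ratio(256)      # upper end of the quoted range
ratio_low = compression_ratio(16)        # lower end of the quoted range
budget_low = bits_per_frame(16)
```

Under these assumptions the raw rate is about 30.4 Mb/s, so the quoted range corresponds to compression ratios of roughly 119:1 at 256 kb/s and 1900:1 at 16 kb/s, with an average budget of 640 bits per frame at the lowest rate.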

Last modified: 2006-02-20