Efficient, Causal Camera Tracking In Unprepared Environments

This work addresses the problem of tracking the 3D pose of a camera in space, using the images it acquires while moving freely in unmodeled, arbitrary environments. A novel feature-based approach for camera tracking is proposed, intended to facilitate tracking in on-line, time-critical applications such as video see-through augmented reality. In contrast to several existing methods which are designed to operate in a batch, off-line mode, assuming that the whole video sequence to be tracked is available before tracking commences, the proposed method operates on images incrementally. At its core lies a feature-based 3D plane tracking technique, which permits the estimation of the homographies induced by a virtual 3D plane between successive image pairs. Knowledge of these homographies allows the corresponding projection matrices encoding camera motion to be expressed in a common projective frame and, therefore, to be recovered directly, without estimating 3D structure. Projective camera matrices are then upgraded to Euclidean and used for recovering structure, which is in turn employed for refining the projection matrices through local bundle adjustment. The proposed approach is causal, is tolerant to erroneous and missing feature matches, does not require modifications of the environment and has realistic computational requirements.
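To illustrate why plane-induced homographies let camera matrices be written down without first estimating 3D structure, the following Python sketch builds a two-view camera pair in the standard canonical form P1 = [I | 0], P2 = [H | e2], where H is the homography induced by a plane and e2 is the epipole in the second image. This is a minimal synthetic example, not the implementation described in the technical report. The identity intrinsics, the motion (R, t), the plane (n, d), the test points and all function names are assumptions made purely for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def matvec(M, v):
    return [dot(row, v) for row in M]

def plane_homography(R, t, n, d):
    """H = R + t n^T / d for the plane n.X = d (camera-1 frame, K = I)."""
    return [[R[i][j] + t[i] * n[j] / d for j in range(3)] for i in range(3)]

def cameras_from_homography(H, e2):
    """Canonical projective pair P1 = [I | 0], P2 = [H | e2]."""
    P1 = [[1.0 if i == j else 0.0 for j in range(3)] + [0.0] for i in range(3)]
    P2 = [list(H[i]) + [e2[i]] for i in range(3)]
    return P1, P2

def epipolar_residual(H, e2, x1, x2):
    """x2^T [e2]_x H x1, which vanishes for a correct match."""
    return dot(x2, cross(e2, matvec(H, x1)))

# Assumed synthetic motion: small rotation about the y axis plus a translation.
a = 0.1
R = [[math.cos(a), 0.0, math.sin(a)],
     [0.0, 1.0, 0.0],
     [-math.sin(a), 0.0, math.cos(a)]]
t = [0.2, 0.0, 0.05]           # with K = I the epipole in image 2 equals t
n, d = [0.0, 0.0, 1.0], 4.0    # assumed (virtual) plane n.X = d

H = plane_homography(R, t, n, d)
P1, P2 = cameras_from_homography(H, t)

# Points that do not lie on the plane, expressed in the camera-1 frame.
points = [[0.5, -0.3, 3.0], [-1.0, 0.4, 5.0], [0.2, 0.8, 6.5]]

residuals = []
max_reproj_error = 0.0
for X in points:
    x1 = X                                              # projection by [I | 0]
    x2 = [r + ti for r, ti in zip(matvec(R, X), t)]     # projection by [R | t]
    residuals.append(epipolar_residual(H, t, x1, x2))
    # The same point in the projective frame defined by the plane.
    Xp = X + [1.0 - dot(n, X) / d]
    proj = matvec(P2, Xp)
    max_reproj_error = max(max_reproj_error,
                           max(abs(p - q) for p, q in zip(proj, x2)))

print(max(abs(r) for r in residuals), max_reproj_error)
```

In this sketch both cameras are obtained from H and e2 alone. The 3D points appear only to generate synthetic matches and to check the result.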

A detailed description of the approach can be found in ICS/FORTH Technical Report #324, Sep. 2003. A shorter version, entitled "Vision-based Camera Motion Recovery for Augmented Reality", was published in the 2004 Computer Graphics International Conference (CGI'04). Additionally, a journal version titled "Efficient, Causal Camera Tracking In Unprepared Environments" has been accepted for publication in the Computer Vision and Image Understanding Journal, and a demo video titled "Camera Matchmoving in Unprepared, Unknown Environments" will be included in the CVPR'05 video proceedings.

Reconstructions Gallery

Sample experimental results from the application of the proposed camera tracker to a variety of image sequences are shown below. For each sequence, a VRML file illustrating the recovered motion and structure is provided. Dots correspond to 3D points, red pyramids to camera locations and green polylines to camera trajectories. Running times were measured on an Intel P4@2.5 GHz laptop. Roughly 80% of the reported execution time is spent on detecting and matching image corners.

We recommend using VRMLview to inspect VRML models.


INRIA MOVI house

Images of a model house, acquired by a fixed camera as the house, placed on a turntable, made a full revolution around its vertical axis.

Frames used: 59
Average number of matched corners per pair: 197.7
Average number of matched corners per triplet: 127.96
Average running time per frame (ms): 485.05
VRML model.

Digital Air's cooks

"Frozen time" sequence captured with Digital Air's TimeTrack camera.

Frames used: 27
Average number of matched corners per pair: 486.49
Average number of matched corners per triplet: 354.05
Average running time per frame (ms): 946.89
VRML model.

Digital Air's cooks reconstructed from 27 frames

Oxford's basement

Sequence acquired by a camera mounted on a mobile robot as it approached the scene while smoothly turning left.

Frames used: 11
Average number of matched corners per pair: 144.1
Average number of matched corners per triplet: 91.0
Average running time per frame (ms): 308.3
VRML model.

Oxford's basement reconstructed from 11 frames

Leuven's Arenberg castle

Sequence shot with a handheld camera, exhibiting relatively large interframe translational motion, with the epipoles located outside the images.

Frames used: 22
Average number of matched corners per pair: 483.43
Average number of matched corners per triplet: 373.65
Average running time per frame (ms): 1173.2
VRML model.

Leuven's Arenberg castle reconstructed from 22 frames

Leuven's Sagalassos site

Sequence shot with a camcorder; frames are characterized by very small interframe motion. The imaged scene contains two dominant planes, relative to which the camera moves laterally.

Frames used: 26
Average number of matched corners per pair: 481.94
Average number of matched corners per triplet: 331.31
Average running time per frame (ms): 987.062
VRML model.

Leuven's Sagalassos site reconstructed from 26 frames

Leuven's Beguinages

Small interframe motion sequence, shot with a camcorder as the operator approached the scene. Forward camera motion results in a small angle between the triangulating 3D lines, making structure recovery challenging.
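The effect of forward motion on the triangulation angle can be sketched numerically. The Python snippet below computes the angle between the two rays joining a 3D point to two camera centers, and compares a lateral displacement with a forward displacement of the same length. The point, the camera centers and the function name are hypothetical values chosen for illustration; they are not taken from the Beguinages sequence.

```python
import math

def triangulation_angle_deg(c1, c2, X):
    """Angle (degrees) between the rays from camera centers c1 and c2 to X."""
    r1 = [x - c for x, c in zip(X, c1)]
    r2 = [x - c for x, c in zip(X, c2)]
    cos_angle = sum(a * b for a, b in zip(r1, r2)) / (
        math.hypot(*r1) * math.hypot(*r2))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.degrees(math.acos(cos_angle))

# Assumed setup: a point about 10 units in front of the first camera,
# slightly off the optical (z) axis.
X = [0.5, 0.0, 10.0]
c1 = [0.0, 0.0, 0.0]
c2_lateral = [1.0, 0.0, 0.0]   # unit sideways displacement
c2_forward = [0.0, 0.0, 1.0]   # unit displacement toward the scene

lateral = triangulation_angle_deg(c1, c2_lateral, X)
forward = triangulation_angle_deg(c1, c2_forward, X)
print(f"lateral: {lateral:.3f} deg, forward: {forward:.3f} deg")
```

For the same baseline length, the forward displacement yields a much smaller angle, so small errors in the image measurements translate into larger depth uncertainty.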

Frames used: 11
Average number of matched corners per pair: 382.89
Average number of matched corners per triplet: 240.37
Average running time per frame (ms): 1164.12
VRML model.

Office desk

Sequence shot with a FireWire webcam undergoing complex motion, resulting in large changes in the field of view.

Frames used: 46
Average number of matched corners per pair: 330.22
Average number of matched corners per triplet: 211.23
Average running time per frame (ms): 681.20
VRML model.

Office desk reconstructed from 46 frames

Calibration object

Images of a two-face calibration object that were acquired with a consumer digital camera. Corners were determined as the intersections of line segments fitted to the calibration grids.

Frames used: 27
Average number of matched corners per pair: 722
Average number of matched corners per triplet: 722
Average running time per frame (ms): 382.7 (not including corner extraction and matching)
VRML model.

Calibration object reconstructed from 27 frames

Augmented Video

In addition to VRML reconstructions, the tracking results were used to augment the original sequences with artificial 3D objects. To achieve this, the estimated camera trajectories were exported to 3DSMax using MaxScript. The augmented sequences were then generated with 3DSMax's rendering engine, with the original sequence serving as the background. The initial alignment of the coordinate systems employed by the camera tracker and 3DSMax was achieved interactively, by manually rotating and translating them until they lined up. The placement of the artificial graphical objects into the scene was guided by the structure information also provided by the camera tracker.
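The alignment step amounts to applying a rotation and a translation to the tracker's output. As a rough sketch of that operation only, the Python snippet below applies a rigid transform to a list of camera centers. The trajectory, the rotation axis and angle, the translation and the function names are all hypothetical; the snippet does not reproduce the MaxScript export or any 3DSMax functionality, and it handles camera positions only, not orientations.

```python
import math

def rotation_about_axis(axis, angle_deg):
    """3x3 rotation matrix about the given axis (Rodrigues' formula)."""
    norm = math.sqrt(sum(a * a for a in axis))
    kx, ky, kz = (a / norm for a in axis)
    c = math.cos(math.radians(angle_deg))
    s = math.sin(math.radians(angle_deg))
    v = 1.0 - c
    return [
        [c + kx * kx * v,      kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v,      ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]

def apply_rigid(R, t, points):
    """Map each point p to R p + t."""
    return [[sum(r * x for r, x in zip(row, p)) + ti
             for row, ti in zip(R, t)]
            for p in points]

# Hypothetical camera centers in the tracker's coordinate system.
trajectory = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.5]]

# Hypothetical alignment: quarter turn about the z axis, then a shift.
R = rotation_about_axis([0.0, 0.0, 1.0], 90.0)
t = [10.0, 0.0, -2.0]

aligned = apply_rigid(R, t, trajectory)
for before, after in zip(trajectory, aligned):
    print(before, "->", [round(v, 6) for v in after])
```

A rigid transform preserves distances between camera centers, so the shape of the trajectory is unchanged by the alignment.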

Click here for a ~16 MB video showing augmented versions of the above sequences, among others. Note that the frame dimensions have been reduced to keep the video's file size down.

Contact Address

For any questions, please contact lourakis@ics.forth.gr
