Computer Vision and Image Understanding
Volume 106, Issues 2-3, May-June 2007, Pages 270-287
Special issue on Advances in Vision Algorithms and Systems beyond the Visible Spectrum


doi:10.1016/j.cviu.2006.10.008    
Copyright © 2007 Elsevier Inc. All rights reserved.
Mutual information based registration of multimodal stereo videos for person tracking
Stephen J. Krotosky (a) and Mohan M. Trivedi (a)
(a) Computer Vision and Robotics Research Laboratory, University of California, San Diego, 9500 Gilman Dr. 0434, La Jolla, CA 92093-0434, USA
Received 15 September 2006;  accepted 23 October 2006.  Communicated by James Davis and Riad Hammoud.  Available online 20 December 2006.
Abstract
Research presented in this paper deals with the systematic examination, development, and evaluation of a novel multimodal registration approach that can perform accurately and robustly for relatively close range surveillance applications. An analysis of multimodal image registration gives insight into the limitations of assumptions made in current approaches and motivates the methodology of the developed algorithm. Using calibrated stereo imagery, we employ maximization of mutual information in sliding correspondence windows that inform a disparity voting algorithm to demonstrate successful registration of objects in color and thermal imagery. Extensive evaluation of scenes with multiple objects at different depths and levels of occlusion shows high rates of successful registration. Ground truth experiments demonstrate the utility of the disparity voting techniques for multimodal registration by yielding qualitative and quantitative results that outperform approaches that do not consider occlusions. A basic framework for multimodal stereo tracking is investigated and promising experimental studies show the viability of using registration disparity estimates as a tracking feature.
Keywords: Thermal infrared sensing; Multisensor fusion; Person tracking; Visual surveillance; Situational awareness
Article Outline
1. Introduction
2. Multimodal registration approaches: comparative analysis of algorithms
2.1. Infinite homographic registration
2.2. Global image registration
2.3. Stereo geometric registration
2.4. Partial image ROI registration
3. An approach to mutual information based multimodal registration
3.1. Multimodal image calibration
3.2. Image acquisition and foreground extraction
3.3. Correspondence matching using maximization of mutual information
3.4. Disparity voting with sliding correspondence windows
4. Experimental validation and analysis
4.1. Algorithmic evaluation
4.2. Accuracy evaluation using ground truth disparity values
4.3. Comparative study of registration algorithms with non-ideal segmentation
4.4. Robustness evaluation
5. Multimodal video analysis for person tracking: basic framework and experimental study
6. Discussion and concluding remarks
References
1. Introduction
A fundamental issue associated with multisensory vision is that of accurately registering corresponding information and features from the different sensory systems. This issue is exacerbated when the sensors capture signals derived from totally different physical phenomena, such as color (reflected energy) and thermal signature (emitted energy). Multimodal imagery applications for human analysis span a variety of application domains, including medical [1], in-vehicle safety systems [2] and long-range surveillance [3]. The combination of both types of imagery yields information about the scene that is rich in color, depth, motion and thermal detail. Once registered, such information can be used to successfully detect, track and analyze movement and activity patterns of persons and objects in the scene.
At the heart of any registration approach is the selection of the most relevant similarity metric, one that can accurately match the disparate physical properties manifested in images recorded by multimodal cameras. Mutual Information (MI) provides an attractive metric for situations where there are complex mappings between the pixel intensities of corresponding objects in each modality, due to the disparate physical mechanisms that give rise to the multimodal imagery [4]. Egnal has shown that mutual information is a viable similarity metric for multimodal stereo registration when the mutual information window sizes are large enough to sufficiently populate the joint probability histogram of the mutual information computation [5]. Further investigations into the properties and applicability of mutual information as a windowed correspondence measure have been conducted by Thevenaz and Unser [6]. Challenges lie in obtaining appropriately sized window regions for computing mutual information in scenes with multiple people and occlusions, where a balanced tradeoff is needed between larger windows for matching evidence and smaller windows for registration detail.
This paper presents the following contributions: we first give a detailed analysis of current approaches to multimodal registration, with a comparative analysis that motivates our approach. We then present our approach for mutual information based multimodal registration. A disparity voting technique that uses the accumulation of disparity values from sliding correspondence windows gives reliable and robust registration, and an analysis of several thousand frames demonstrates its success in complex scenes with high levels of occlusion and many objects occupying the imaged space. An accuracy evaluation against ground truth measurements is presented, and a comparative study using practical segmentation methods illustrates how the occlusion handling of the disparity voting algorithm improves on previous approaches. Additionally, a basic framework for person tracking in multimodal video is presented, and a promising experimental study illustrates the use of the disparities generated from multimodal registration as a feature for tracking. We then discuss current algorithmic issues and potential resolutions for future research.
2. Multimodal registration approaches: comparative analysis of algorithms
In a multimodal, multicamera setup, because each camera can be at a different position in the world and have different intrinsic parameters, objects in the scene cannot be assumed to be located at the same position in each image. Due to these camera effects, corresponding objects in each image may have different sizes, shapes, positions, and intensities. In order to combine the information in each image, it is required that the corresponding objects in the scene be aligned, or registered. Sensory measurements can then be fused or features combined in a variety of ways that can fuel algorithms that take advantage of the information provided from multiple and differing image sources [7]. Experiments in our previous work [2] have offered analysis and insight into the commonalities and uniqueness of the multimodal imagery. Multimodal image registration approaches vary based on factors such as camera placement, scene complexity and the desired range and density of registered objects in the scene. In order to better understand the algorithmic details of the various multimodal registration techniques, it is important to outline the underlying geometric framework for registration. Many of the multiple view geometry properties used in this paper are adapted from Hartley and Zisserman [8].
Given a two camera setup with camera center locations C and C′, a 3D point in space can be defined relative to each of the camera coordinate systems as P = (X, Y, Z)ᵀ and P′ = (X′, Y′, Z′)ᵀ, respectively. The coordinate system transformation between P and P′ is:
P′ = RP + T    (1)
 
where R is the matrix that defines the rotation between the two camera centers and T is the translation vector that represents the distance between them. Additionally, the projection matrices for each camera are defined as K and K′, where the projected points on the image plane are the homogeneous coordinates p = (x, y, 1) and p′ = (x′, y′, 1).
Let π be a plane in the scene parameterized by N, the surface normal of the plane, and dπ, its distance from the camera center C. A point lies on the plane if Nᵀ P = dπ. For such a point, T = T(Nᵀ P)/dπ, so substituting into (1) gives the homography induced by π, P′ = HP P, where
HP = R + (T Nᵀ)/dπ    (2)

Applying the projection matrices K and K′, we have p′ = Hp, where H = K′ HP K⁻¹, giving
H = K′ (R + (T Nᵀ)/dπ) K⁻¹    (3)
 
This homographic transformation describes the transformation of points only when the points lie on the plane π (i.e., Nᵀ P = dπ). When a point does not lie on this plane, an additional parallax component needs to be added to the transformation equation to accommodate the projective depth of other points in the scene relative to the plane π. It has been shown in [8] that the transformation that includes the additional parallax term is:
p′ = Hp + δe′    (4)
 
where e′ is the epipole in C′ and δ is the parallax relative to the plane π. The epipole is the intersection of the image plane with the line containing the optical centers of C and C′. The equation in (4) effectively decomposes the point correspondence equation into a term for the induced planar homography (Hp) and the parallax associated with points that do not satisfy the planar homography assumption (δe′). It is within this framework that we will describe the registration techniques used for multimodal imagery. Fig. 1 illustrates the main approaches to multimodal image registration that will be analyzed. Additionally, Table 1 provides a summary of references utilizing these approaches and indicates the assumptions, methods and limitations in each.
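To make these relations concrete, the following small numerical sketch (not taken from the paper; all camera and plane parameters are invented example values) verifies that the plane-induced homography of Eqs. (2) and (3) transfers on-plane points exactly, while off-plane points pick up the parallax residual of Eq. (4):

```python
import numpy as np

# Rotation R and translation T between the camera centers (Eq. (1)); the
# values below are invented for illustration (10 cm horizontal baseline).
R = np.eye(3)
T = np.array([0.1, 0.0, 0.0])
K = K2 = np.array([[800.0, 0.0, 320.0],      # identical example intrinsics K = K'
                   [0.0, 800.0, 240.0],
                   [0.0, 0.0, 1.0]])

# Plane pi: unit normal N and distance d_pi from the first camera center C.
N = np.array([0.0, 0.0, 1.0])
d_pi = 5.0

H_P = R + np.outer(T, N) / d_pi              # Eq. (2): plane-induced homography
H = K2 @ H_P @ np.linalg.inv(K)              # Eq. (3): image-to-image homography

def transfer(P):
    """Project P into both images and compare the homography transfer with the
    true projection; the gap between them is the parallax term of Eq. (4)."""
    p = K @ P; p = p / p[2]                   # projection in the first image
    p2 = K2 @ (R @ P + T); p2 = p2 / p2[2]    # true projection in the second image
    p2_h = H @ p; p2_h = p2_h / p2_h[2]       # transfer via the plane homography
    return p2, p2_h, p2 - p2_h

print(transfer(np.array([1.0, 0.5, 5.0])))    # on the plane: residual ~ 0
print(transfer(np.array([1.0, 0.5, 2.0])))    # off the plane: nonzero parallax residual
```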
Fig. 1. Geometric illustration of the four main approaches to multimodal image registration. (a) Infinite homography. (b) Global. (c) Stereo geometric. (d) Partial image ROI.
 
Table 1.
Review of approaches to multimodal registration and body analysis. Each entry lists the modalities used (visual, IR, 3D), assumptions, calibration, registration method, application, algorithm, and comments.

Trivedi et al. [2] (visual, IR, 3D). Assumptions: approximate colocation. Calibration: none. Registration: none (comparative evaluation). Application: head detection for airbag deployment. Algorithm: head detection and tracking using background subtraction and elliptical templates; segmentation using background subtraction in disparity for visual imagery and hot-spot localization for thermal infrared imagery. Comments: comparative analysis of head detection algorithms using both stereo and thermal infrared imagery.

Davis and Sharma [18], [19], [3] (visual, IR). Assumptions: colocation; observed scene far from camera. Calibration: none. Registration: infinite homographic. Application: person detection and background modeling. Algorithm: fused contour saliency maps (CSMs) are used to form silhouettes that enhance background modeling. Comments: does not deal with occlusion or with discriminating people merged into one silhouette; camera placement can be prohibitive.

Conaire et al. [9] (visual, IR). Assumptions: colocation; observed scene far from camera; majority of scene is background; hotspots valid for human segmentation. Calibration: none. Registration: infinite homographic. Application: person detection and background modeling. Algorithm: hysteresis threshold of initial foregrounds used to form the background model, updated from foreground object velocity, size, edge magnitude, and thermal brightness. Comments: hotspot segmentation is a limiting assumption; does not deal with occlusion or with discriminating people merged into one silhouette; the histogram assumption is only valid when the majority of the scene is background.

Irani and Anandan [10] (visual, IR). Assumptions: a parametric transformation model can globally match the entire scene. Calibration: none. Registration: global (parametric correlation surface). Application: general multimodal registration. Algorithm: directional-derivative-energy features obtained for a Gaussian pyramid of the input images; local correlation of features used to iteratively find the best global parametric alignment using Newton's method. Comments: experimental images contain only one dominant plane and no foreground objects; a global parametric model is not likely to model large parallax effects well.

Coiras et al. [11] (visual, IR). Assumptions: most edges are common across modalities. Calibration: none. Registration: global (affine). Application: general multimodal registration. Algorithm: the global affine transformation that best maximizes global edge-formed triangle matching is determined from transformations obtained by matching individual formed triangles in the images. Comments: a global affine model cannot account for large parallax effects; experiments are not performed for multiple objects at different planes.

Han and Bhanu [12] (visual, IR). Assumptions: simplified projective transformation for planar scene objects; background objects unregistered; the human must walk within the same plane during the sequence; hotspots valid for human segmentation. Calibration: uncalibrated. Registration: global (projective model for planar objects). Application: person detection. Algorithm: top-of-head and centroid points from two frames in the sequence are used as input to a Hierarchical Genetic Algorithm (HGA) that searches for the best registration. Comments: walking along different planes results in different registrations; multiple people at different depths will not be registered; it is unrealistic that humans walk in the same line for registration; the entire sequence is needed before the first frame can be registered.

Itoh et al. [13] (visual, IR, 3D). Assumptions: colocation; predefined operating region. Calibration: calibration board using 25 points. Registration: global (quadratic model from calibration points). Application: hand and object detection. Algorithm: features such as skin tone, hot spots, depth in the operating region, and motion are fused to localize hands in the operating region. Comments: registration assumptions are only valid for objects within a range of depths inside the limited "workspace"; information from each modality is heuristically thresholded and not probabilistically generalized.

Ye [14] (visual, IR). Assumptions: colocation; registration is a displacement and scaling. Calibration: none. Registration: global registration and tracking using Hausdorff distance edge matching. Application: person tracking. Algorithm: top points of segmented objects are tracked; registration is iteratively refined over time using motion information; registered images are used for face detection by hotspots. Comments: global matching is not valid when people are at large depth differences; experiments do not test large movements over sequences where registration parameters would be changing.

Ju et al. [15] (visual, IR, 3D). Assumptions: colocation; high-resolution stereo; only one face in the scene, positioned carefully. Calibration: stereo camera calibration. Registration: stereo geometric. Application: 3D thermography of the face. Algorithm: multiscale stereo depth information mapped onto a 3D face model. Comments: registration evaluation in a low-resolution stereo environment and in real-world conditions (multiple people, occlusions, lighting, etc.) remains open.

Bertozzi et al. [16] (visual, IR, 3D). Assumptions: colocation; stereo pairs for each modality. Calibration: calibrated stereo rigs. Registration: stereo geometric. Application: pedestrian detection. Algorithm: stereo estimates from each unimodal stereo pair are combined in disparity space. Comments: the four-camera system is cumbersome in terms of setup and maintenance, as well as image processing and data management.

Chen et al. [17] (visual, IR). Assumptions: target tracking problem assumed solved; registration is only a displacement and a known scale. Calibration: scale factor known a priori. Registration: partial image ROI. Application: object detection; concealed weapon detection. Algorithm: maximizing mutual information (MMI) of individual bounding box ROIs for each object in the scene; a simplex method is used to search for the MMI. Comments: the assumption of perfect target tracking gives ideal bounding boxes; with a real-world tracker, how are occlusions, overlaps, and incompleteness handled?

This paper (visual, IR, 3D). Assumptions: stereo configuration; reasonable foreground extraction; each object has a single disparity in the scene. Calibration: calibrated multimodal stereo. Registration: disparity voting. Application: object detection and person tracking. Algorithm: disparity voting over sliding mutual information correspondence windows yields registration disparities for objects in the scene. Comments: successful registration through occlusions and scenes with multiple people; disparity estimates can be used as a feature in tracking algorithms.
 
2.1. Infinite homographic registration
In Conaire et al.[9] and Davis and Sharma[3], it is assumed that the thermal infrared and color cameras are nearly colocated and the imaged scene is far from the camera, so that the deviation of pedestrians from the ground plane is negligible compared to the distance between the ground and the cameras. Under these assumptions, an infinite planar homography can be applied to the scene and all objects will be aligned in each image.
The infinite planar homography, H∞, is defined as the homography that occurs when the plane π is at infinity. An illustration of this type of registration geometry is shown in Fig. 1(a). Starting from (3), we define
H∞ = K′ R K⁻¹    (5)
 
When the plane is at infinity, the homography between points involves only the rotation R between the cameras and the internal projection matrices for each camera, K and K′. Similarly, from (4), Hartley and Zisserman [8] showed that the correspondence equation for image points under an infinite homography is:
p′ = H∞ p + (1/Z) K′t    (6)
 
where Z is the depth of the point from C and K′t = e′ is the epipole in C′.
Infinite homographic registration techniques are used when the scene distance is very far from the camera. When all observed objects are very far from C, then Z → ∞ and the parallax effects will be negligible. Alternatively, when the cameras are nearly colocated, i.e. t → 0, the parallax term also becomes negligible. In both cases the correspondence equation becomes:
p′=H∞p (7)
 
The use of an infinite planar homography is an effective way of registering the scene, but only when the scene being registered conforms to the homographic assumptions. This means that the scene must be very far from the camera so that an object's displacement from the ground plane is negligible compared to the observation distance. While this type of assumption is appropriate for long distance and overhead surveillance scenes, it is not valid in situations where objects and people can be at depths whose differences are significant relative to their distance from the camera. In these cases, the infinite homography assumption will not align all objects in the scene. In addition, when the assumption of an infinite homography does hold, the lack of a parallax term precludes any estimate of depth that could be used as a differentiator for occluding objects.
2.2. Global image registration
Global approaches to registration can be used when further assumptions about the movement and placement of objects and people in a scene are employed to make the registration fit a specific model. The registration will be accurate when the scene follows the specific model used, but can be grossly inaccurate when the imaged scene does not fit the assumptions of the model.
The usual assumption of these techniques is that all objects lie on the same plane in the scene. Often, to enforce this assumption, only foreground objects are considered. Global image registration techniques make the assumption that δ, the measure of difference from the homographic plane in (4), will be small for all objects in the scene. However, in scenes where objects of interest are at different planes, only the objects lying on the plane π that induces the homography will be registered. All other objects that lie on different planes will be misaligned due to the second term δe′ in (4).
If the distance of objects from the plane is small compared to the distance of the cameras from the plane, the parallax effects tend to zero and the homography accurately describes the registration of objects in the scene at any depth. Works that have applied this global registration technique operated either on the single plane or approximate colocation assumption to allow for accurate scene registration. An illustration of this type of registration is shown in Fig. 1(b).
Irani and Anandan[10] used directional-derivative-energy operators to generate features from a Gaussian Pyramid of the visual and thermal infrared images and used local correlation values for these features to obtain a global alignment for the multimodal image pair. Alignment is done by estimating a parametric surface correspondence that can estimate the registration alignment of the two images. Newton’s method is used to iteratively search for the parametric transformation that maximizes the global alignment.
Coiras et al. [11] match triangles formed from edge features in visual and thermal infrared images to learn an affine transformation model for static images. The global affine transformation that best maximizes the global edge-formed triangle matching is searched for among transformations obtained by matching individual formed triangles in one image to individual formed triangles in the second image.
Han and Bhanu [12] used the features extracted when a human walked through a scene to learn a projective transformation model to register visual and IR images. It is assumed that the person walking in the scene walks in a straight line during the registration sequence. This enforces that the person is located within a single plane throughout the sequence and ensures that the global projective transformation model assumption holds. Feature points derived from foreground silhouettes in two pairs of images in the sequence are used as input into a Hierarchical Genetic Algorithm that searches for the best global transformation.
Itoh et al. [13] used a calibration board to register colocated color and thermal infrared cameras for use in a system that recognized hand movement for multimedia production. The calibration board points were used to establish a quadratic transformation model between the color and thermal infrared images. Registration is only required for a predefined workspace with a fixed range within the image scene, and the calibration board was placed to ensure registration in that region.
Similarly, Ye[14] used silhouette tracking and Hausdorff distance edge matching to register visual and thermal infrared images. In this case, it is assumed that the cameras are nearly colocated and that registration can be accomplished with a displacement and scaling. The detected top points of foreground silhouettes are tracked using the motion associations with previously tracked points. The Hausdorff distance measure is used to match edge features in each silhouette and estimate the scale and translation parameters. The registration and tracking are then used and updated to provide simultaneous tracking and iterative registration.
Global image registration methods place some limiting assumptions on the configuration of objects in the scene. Specifically, it is assumed that all registered objects will lie on a single plane in the image and it is impossible to accurately register objects at different observation depths, as the registration transform for each object will depend on the varying perspective effects of the camera. This means that accurate registration can only occur when there is only one observed object in the scene[12], or when all the observed objects are restricted to lie at approximately the same distance from the camera[13] and[14]. The global alignment algorithms proposed by Irani and Anandan[10] and Coiras et al.[11] do not account for situations where there are objects at different depths or planes in the image. Both use the assumption that the colocation of the cameras and the observed distances are such that the parallax effects can be ignored.
The primary limitation to global registration methods is that it is impossible to register objects at different depths. Global methods effectively restrict the successfully registered area to be a single plane in the image. When colocated cameras are used to relax the single plane restriction, parallax effects become negligible, and the problem becomes akin to infinite homographic methods.
2.3. Stereo geometric registration
When a stereo camera setup is used in combination with additional cameras from other modalities, the images from each modality can be combined using the stereo 3D point estimates and the geometric relation between the stereo and multimodal cameras. As demonstrated in Ju et al.[15], stereo cameras can give accurate 3D point coordinates for objects in the image. If the remaining cameras are then calibrated to the reference stereo pair, usually with a calibration board, then the pixels in those images (thermal infrared) can be reprojected onto the reference stereo image. The resulting reprojection will be registered to the stereo reference image.
In this case, for a point p in the reference stereo image, an estimate of its 3D location P is given by the calibrated stereo geometry parameters. Additionally, the calibration between the left reference stereo image and the additional thermal infrared modality gives the rotation R and translation T between the camera coordinate systems. This allows the change of coordinate system to the thermal infrared reference frame, PTIR = RP + T. The 3D point can then be reprojected onto the infrared image plane:
pTIR = KTIR PTIR    (8)
 
The thermal image point is then put into homogeneous form, and the intensity value at this location in the thermal infrared image can then be assigned to the point p in the stereo reference image. Such a registration technique is illustrated in Fig. 1(c).
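As a rough illustration of this reprojection (Eq. (8)), the sketch below uses placeholder calibration values for R, T and KTIR and a synthetic thermal image; it is not the authors' implementation, only the general pattern:

```python
import numpy as np

def sample_thermal(P, R, T, K_TIR, thermal_img):
    """P: 3D point in the reference stereo camera frame (from stereo depth)."""
    P_tir = R @ P + T                       # change to the thermal camera frame
    p = K_TIR @ P_tir                       # Eq. (8): project onto the thermal image plane
    x, y = p[0] / p[2], p[1] / p[2]         # homogeneous normalization
    u, v = int(round(x)), int(round(y))
    h, w = thermal_img.shape
    if 0 <= v < h and 0 <= u < w:
        return thermal_img[v, u]            # thermal value assigned to the stereo point p
    return None                             # reprojects outside the thermal image

# Example with dummy calibration values and a synthetic thermal image.
R = np.eye(3)
T = np.array([0.1, 0.0, 0.0])
K_TIR = np.array([[500.0, 0.0, 160.0], [0.0, 500.0, 120.0], [0.0, 0.0, 1.0]])
thermal = np.random.rand(240, 320)
print(sample_thermal(np.array([0.2, 0.1, 3.0]), R, T, K_TIR, thermal))
```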
For stereo geometric registration techniques, objects in a scene at very different depths can be registered as long as stereo disparity information is available for each object. If the stereo algorithm can provide dense and accurate stereo for the objects in the scene, stereo geometric registration is a quick and effective way of registering the visual and infrared imagery. In the experiments of Ju et al. [15], the observed object (a head) was carefully placed into the scene and it was assumed to be the only object in the scene. Stereo data was captured using high resolution stereo cameras in a fairly stable and well-conditioned scene. The resulting 3D stereo image was dense and accurate in these conditions. However, experiments need to be conducted to see how these environmental conditions can be relaxed. Namely, it is important to examine how stereo geometric registration techniques perform in real world conditions, where using standard resolution cameras in environments with poor lighting, poor texture and occlusions can affect the quality and reliability of the 3D reprojection registration technique.
Multiple stereo camera approaches to stereo geometric registration have been investigated by Bertozzi et al. [16]. Using four cameras configured into two unimodal stereo pairs that yield two separate disparity estimates, registration can occur in the disparity domain. While this approach yields redundancy and registration success, the use of four cameras can be cumbersome both in physical setup, calibration and maintenance, as well as in data storage and processing.
2.4. Partial image ROI registration
An approach to registering objects at multiple depths is to use partial image region-of-interest registration. The main assumption of this approach is that each individual object in the scene lies on a specific plane and that each plane can be individually registered with a separate homography. For each of the i regions-of-interest Ωi in the image, if p ∈ Ωi then
p′ = Hi p + δi e′    (9)
 
Again, it is assumed that the parallax effects are negligible within each object, as each is approximated as a single planar object in the scene. As long as each Ωi satisfies this assumption, the registration technique will be applicable. This is illustrated in Fig. 1(d).
Chen et al. [17] proposed that the visual and infrared imagery be registered using a maximization of mutual information technique on bounding boxes that correspond to detected objects in one of the modalities. It is assumed that the corresponding region differs only by a scale and a displacement, and that the scale is fixed and known a priori. The matching bounding box is then searched for in the other modality using a simplex method. This allows bounding boxes that correspond to objects at different depths to be successfully registered.
Chen et al. assume that the bounding boxes representing a single object can always be properly segmented and tracked in one of the modalities. This assumption will often not hold, especially in uncontrolled scenes where issues of lighting, texture and occlusion can produce segmentation results that contain two or more merged objects at different depths. Bounding boxes that contain multiple objects will not register properly, as the required assumption that an ROI contains objects within a single plane will not hold.
3. An approach to mutual information based multimodal registration
Our registration algorithm [20] addresses the registration of objects at different depths in relatively close range surveillance scenes and eliminates the need for perfectly segmented bounding boxes by relying on reasonable initial foreground segmentation and using a disparity voting algorithm to resolve the registration for occluded or malformed segmentation regions. This approach gives robust registration disparity estimation with statistical confidence values for each estimate. Fig. 2 shows a flowchart outlining our algorithmic framework. Individual modules are described in the subsequent sections.
Fig. 2. Flowchart of disparity voting approach to multimodal image registration.
 
3.1. Multimodal image calibration
A minimum camera solution for registering multimodal imagery in these short range surveillance situations is to use a single camera from each modality, arranged as a stereo pair. Unlike colocating the cameras, arranging the cameras into a stereo pair allows objects at different depths to be registered. To perform this type of registration, it is desirable to first calibrate the color and thermal infrared cameras. Knowing the intrinsic and extrinsic calibration parameters allows the images to be rectified so that the epipolar lines lie along the image scanlines, enabling disparity correspondence matching to be a one-dimensional search. Calibration can be performed using standard techniques, such as those available in the Camera Calibration Toolbox for Matlab [21]. The toolbox assumes input images from each modality in which a calibration board is visible in the scene. In typical visual setups, this is simply a matter of placing a checkerboard pattern in front of the camera. However, due to the large differences between visual and thermal imagery, some extra care needs to be taken to ensure the calibration board looks similar in each modality. A solution is to use a standard calibration board and illuminate the scene with high intensity halogen bulbs placed behind the cameras. This effectively warms the checkerboard pattern, making the visually dark checks appear brighter in the thermal imagery. Placing the board under constant illumination reduces the blurring associated with thermal diffusion and keeps the checkerboard edges sharp, allowing for calibration with subpixel accuracy. An example pair of images in the visual and thermal infrared domains and the subsequently calibrated and rectified image pair are shown in Fig. 3.
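For readers who want to set up a comparable calibration pipeline, the sketch below uses OpenCV's stereo calibration and rectification functions as a stand-in for the Matlab toolbox cited above; the checkerboard geometry, file names, and image size are hypothetical example values:

```python
import cv2
import numpy as np

pattern = (8, 6)                                        # inner corners of the board (example)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2) * 0.03   # 3 cm squares

obj_pts, color_pts, thermal_pts = [], [], []
for i in range(20):                                     # hypothetical calibration image pairs
    color = cv2.imread(f"color_{i:02d}.png", cv2.IMREAD_GRAYSCALE)
    therm = cv2.imread(f"thermal_{i:02d}.png", cv2.IMREAD_GRAYSCALE)
    if color is None or therm is None:
        continue
    ok_c, corners_c = cv2.findChessboardCorners(color, pattern)
    ok_t, corners_t = cv2.findChessboardCorners(therm, pattern)   # board heated by halogen lamps
    if ok_c and ok_t:
        obj_pts.append(objp)
        color_pts.append(corners_c)
        thermal_pts.append(corners_t)

size = (640, 480)                                       # assumed image size
_, K1, d1, _, _ = cv2.calibrateCamera(obj_pts, color_pts, size, None, None)
_, K2, d2, _, _ = cv2.calibrateCamera(obj_pts, thermal_pts, size, None, None)
_, K1, d1, K2, d2, R, T, E, F = cv2.stereoCalibrate(
    obj_pts, color_pts, thermal_pts, K1, d1, K2, d2, size,
    flags=cv2.CALIB_FIX_INTRINSIC)
R1, R2, P1, P2, Q, _, _ = cv2.stereoRectify(K1, d1, K2, d2, size, R, T)
map_c = cv2.initUndistortRectifyMap(K1, d1, R1, P1, size, cv2.CV_32FC1)
map_t = cv2.initUndistortRectifyMap(K2, d2, R2, P2, size, cv2.CV_32FC1)
# cv2.remap(color_img, *map_c, cv2.INTER_LINEAR) then yields scanline-aligned image pairs.
```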
Fig. 3. Multimodal stereo calibration using a heated calibration board to allow for a visible checkerboard pattern in thermal imagery. (a) Color image. (b) Thermal image. (c) Rectified color image. (d) Rectified thermal image.
 
3.2. Image acquisition and foreground extraction
The acquired and rectified image pairs are denoted as IL, the left color image, and IR, the right thermal image. Due to the large differences in imaging characteristics, it is very difficult to find correspondences for the entire scene. Instead, registration is focused on the pixels that correspond to foreground objects of interest. Naturally, then, it is desirable to determine which pixels in the frame belong to the foreground. In this step, only a rough estimate of the foreground pixels is necessary and a fair number of false positives and negatives is acceptable. Any "good" segmentation algorithm could potentially be used with success. The corresponding foreground images are FL and FR, respectively. Additionally, the color image is converted to grayscale for mutual information based matching. Example input images and foreground maps are shown in Fig. 4.
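A minimal sketch of such a foreground-extraction step is shown below; OpenCV's MOG2 background subtractor stands in for a codebook-style background model, and the thermal threshold is an arbitrary example value, since any reasonable segmentation will do here:

```python
import cv2

# MOG2 is used here only as a stand-in for the codebook background model; any
# reasonable foreground segmenter would serve. The threshold assumes people are
# hotter (brighter) than the thermal background.
bg_model = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=False)
THERMAL_THRESHOLD = 120

def foreground_masks(color_frame, thermal_frame):
    # F_L: rough foreground in the color image (converted to grayscale for MI matching)
    gray = cv2.cvtColor(color_frame, cv2.COLOR_BGR2GRAY)
    f_l = bg_model.apply(color_frame)
    f_l = cv2.medianBlur(f_l, 5)                       # suppress speckle noise
    # F_R: rough foreground in the thermal image via a simple hot-spot threshold
    _, f_r = cv2.threshold(thermal_frame, THERMAL_THRESHOLD, 255, cv2.THRESH_BINARY)
    return gray, f_l > 0, f_r > 0
```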
Fig. 4. Image acquisition and foreground extraction for color and thermal imagery. (a) Color. (b) Color segmentation. (c) Thermal. (d) Thermal segmentation. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
 
3.3. Correspondence matching using maximization of mutual information
Once the foreground regions are obtained, the correspondence matching can begin. Matching occurs by fixing a correspondence window in one reference image of the pair and sliding a window along the second image to find the best match. Let h and w be the height and width of the image, respectively. For each column i ∈ 0…w, let WL,i be a correspondence window in the left image of height h and width M centered on column i. The width M that produces the best results can be experimentally determined for a given scene. Typically, the value for M is significantly less than the width of an object in the scene. Define a correspondence window WR,i,d in the right image having height h*, the largest spanning foreground distance in the correspondence window, and centered at column i + d, where d is a disparity offset. For each column i, a correspondence value is found for all d ∈ dmin…dmax.
Given the two correspondence windows WL,i and WR,i,d, we first linearly quantize the image to N levels such that
(10)
 
where Mh* is the area of the correspondence window. The result in (10) comes from Thevenaz and Unser's [6] suggestion that this equation is a reasonable way to determine the number of levels needed to give good results when maximizing the mutual information between image regions.
Now we can compute the quality of the match between the two correspondence windows by measuring the mutual information between them. The mutual information between two image patches is defined as
I(L, R) = Σl Σr PL,R(l, r) log [ PL,R(l, r) / (PL(l) PR(r)) ]    (11)
 
where PL,R(l, r) is the joint probability mass function (pmf) and PL(l) and PR(r) are the marginal pmf’s of the left and right image patches, respectively.
The two-dimensional histogram, g, of the correspondence window is utilized to evaluate the pmf’s needed to determine the mutual information. The histogram g is an N by N matrix so that for each point, the quantized intensity levels l and r from the left and right correspondence windows increment g(l, r) by one. Normalizing by the total sum of the histogram gives the probability mass function
PL,R(l, r) = g(l, r) / Σl,r g(l, r)    (12)
 
The marginal probabilities can be easily determined by summing PL,R(l, r) over the appropriate dimension.
PL(l) = Σr PL,R(l, r)    (13)

PR(r) = Σl PL,R(l, r)    (14)
 
Now that we are able to determine the mutual information for two generic image patches, let us define the mutual information between two specific image patches as Ii,d, where again i is the center of the reference correspondence window and i + d is the center of the second correspondence window. For each column i, we have a mutual information value Ii,d for d ∈ dmin…dmax. The disparity di* that best matches the two windows is the one that maximizes the mutual information:
di* = arg max_d Ii,d    (15)
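The matching step of Eqs. (10)–(15) can be sketched as follows. This simplified version uses full-height windows rather than the foreground-spanning height h*, a fixed number of quantization levels, and illustrative window and disparity settings, so it should be read as a sketch of the technique rather than the exact implementation:

```python
import numpy as np

def mutual_information(patch_l, patch_r, n_levels=16):
    """Eq. (11): MI between two equally sized 8-bit patches, quantized to n_levels."""
    l = np.floor(patch_l.astype(float) / 256.0 * n_levels).astype(int)
    r = np.floor(patch_r.astype(float) / 256.0 * n_levels).astype(int)
    g, _, _ = np.histogram2d(l.ravel(), r.ravel(), bins=n_levels,
                             range=[[0, n_levels], [0, n_levels]])
    p_lr = g / g.sum()                                    # Eq. (12): joint pmf
    p_l = p_lr.sum(axis=1, keepdims=True)                 # Eq. (13): left marginal
    p_r = p_lr.sum(axis=0, keepdims=True)                 # Eq. (14): right marginal
    nz = p_lr > 0
    return float((p_lr[nz] * np.log(p_lr[nz] / (p_l @ p_r)[nz])).sum())

def best_disparity(left, right, i, M=9, d_range=(0, 40)):
    """Eq. (15): slide the right window over d and keep the MI-maximizing shift."""
    h, w = left.shape
    half = M // 2
    wl = left[:, max(i - half, 0):min(i + half + 1, w)]
    scores = {}
    for d in range(*d_range):
        j = i + d
        if j - half < 0 or j + half + 1 > w:
            continue
        wr = right[:, j - half:j + half + 1]
        if wr.shape == wl.shape:
            scores[d] = mutual_information(wl, wr)
    return max(scores, key=scores.get) if scores else None
```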
 
The process of computing the mutual information for a specific correspondence window is illustrated in Fig. 5. An example plot of the mutual information values over the range of disparities is also shown. The red box in the color image is a visualization of a potential reference correspondence window. Candidate sliding correspondence windows for the thermal image are visualized in green boxes.
Fig. 5. Mutual information for correspondence windows. (a) Color image. (b) Thermal image. (c) Mutual information. (d) Disparity voting matrix. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
 
3.4. Disparity voting with sliding correspondence windows
We wish to assign a vote for di*, the disparity that maximizes the mutual information, to all foreground pixels in the reference correspondence window. Define a disparity voting matrix DL of size (h, w, dmax − dmin + 1), where the third dimension spans the range of disparities. Then, given a column i, for each image pixel that is in both the correspondence window and the foreground map, (u, v) ∈ (WL,i ∩ FL), we add one vote to the disparity voting matrix at DL(u, v, di*).
Since the correspondence windows are M pixels wide, pixels in each column in the image will have M votes for a correspondence matching disparity value. For each pixel (u, v) in the image, DL can be thought of as a distribution of matching disparities from the sliding correspondence windows. Since it is assumed that all the pixels attributed to a single person are at the same distance from the camera, a good match should have a large number of votes for a single disparity value. A poor match would be widely distributed across a number of different disparity values. Fig. 5(d) shows the disparity voting matrix for a sample row in the color image. The x-axis of the image is the columns i of the input image. The y-axis of the image is the range of disparities d = dmin…dmax, which can be experimentally determined based on scene structure and the areas in the scene where activity will occur. Entries in the matrix correspond to the number of votes given to a specific disparity at a specific column in the image. Brighter areas correspond to a higher vote tally.
The complementary process of correspondence window matching is also performed by keeping the right thermal infrared image fixed. The algorithm is identical to the one described above, switching the left and right denotations. The corresponding disparity accumulation matrix is given as DR.
Once the disparity voting matrices have been evaluated for the entire image, the final disparity registration values can be determined. For both the left and right images, we determine the best disparity value and its corresponding confidence measure as

DL*(u, v) = arg max_d DL(u, v, d)    (16)

CL(u, v) = max_d DL(u, v, d)    (17)

For a pixel (u, v), the value of CL(u, v) represents the number of times the best disparity value DL*(u, v) was voted for. A higher confidence value indicates that the disparity maximized the mutual information for a large number of correspondence windows and, in turn, the disparity value is more likely to be accurate than at a pixel with lower confidence. Values for DR* and CR are similarly determined. The values of DR* and CR are also shifted by their disparities so that they align to the left image:
(18)
 
(19)
 
Once the two disparity images are aligned, they can be combined. We have chosen to combine them using an AND operation, which tends to give the most robust results. So, for all pixels (u, v) where both the left-referenced and the aligned right-referenced disparity estimates are available,
(20)
 
The resulting image D*(u, v) is the disparity image for all the overlapping foreground object pixels in the image. It can be used to register multiple objects in the image, even at very different depths from the camera. Fig. 6 shows the result of registration for the example frame carried throughout the algorithmic derivation. Fig. 6(a) shows the computed disparity image D*, while Fig. 6(b) shows the initial alignment of the color and thermal images and Fig. 6(c) shows the alignment after shifting the foreground pixels by the resulting disparity image. The thermal foreground pixels are overlaid (in green) on the color foreground pixels (in purple).
Fig. 6. The resulting disparity image D* from combining the left and right disparity images, as defined in (20). (a) Disparity image. (b) Unregistered. (c) Registered. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
 
The resulting registration in Fig. 6 is successful in aligning the foreground areas associated with each of the three people in the scene. Each person in the scene lies at a different distance from the camera and yields a different disparity value that will align its corresponding image components.
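Putting the pieces together, a simplified sketch of the disparity voting stage is given below. It reuses the best_disparity helper from the previous sketch, builds only the left-referenced voting matrix DL, and leaves the complementary right-referenced pass and the AND combination of Eq. (20) as symmetric repetitions of the same pattern; the window width and disparity range are again illustrative:

```python
import numpy as np

def disparity_voting(left, right, fg_left, d_min=0, d_max=40, M=9):
    """Left-referenced disparity voting: every foreground pixel covered by a
    sliding correspondence window votes for that window's MI-maximizing
    disparity (Eq. (15)); the per-pixel argmax and vote count give the
    disparity and confidence maps (Eqs. (16) and (17))."""
    h, w = left.shape
    n_d = d_max - d_min + 1
    votes = np.zeros((h, w, n_d), dtype=np.int32)          # voting matrix D_L
    half = M // 2
    for i in range(half, w - half):
        d_star = best_disparity(left, right, i, M=M, d_range=(d_min, d_max + 1))
        if d_star is None:
            continue
        cols = slice(i - half, i + half + 1)
        in_window_fg = fg_left[:, cols]                     # (W_L,i intersected with F_L)
        votes[:, cols, d_star - d_min] += in_window_fg.astype(np.int32)
    disparity = votes.argmax(axis=2) + d_min                # Eq. (16)
    confidence = votes.max(axis=2)                          # Eq. (17)
    disparity[confidence == 0] = -1                         # pixels with no votes at all
    return disparity, confidence
```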
4. Experimental validation and analysis
The disparity voting registration algorithm was tested using color and thermal data collected with the cameras oriented in the same direction with a baseline of 10 cm. The cameras were placed so that the optical axis was approximately parallel to the ground, imaging a scene of approximately 6 m × 6 m. This placement was used to satisfy the assumption that there would be approximately constant disparity across all pixels associated with a specific person in the frame. Such a camera placement is practical and appropriate for many close range surveillance applications. Video was captured as up to four people moved throughout an indoor environment. For these specific experiments, foreground segmentation in the visual imagery was done using the codebook model proposed by Kim et al. [22]. In the thermal imagery, the foreground is obtained using an intensity threshold under the assumption that the people in the foreground are hotter than the background. This approach provided reasonable segmentation in each image. In cases where segmentation can only be obtained for one modality, the disparities can be computed with only that modality as the reference, at the cost of reduced robustness. We will show successful registration for examples of varying segmentation quality. The goal was to obtain registration results for various configurations of people, including different positions, distances from the camera, and levels of occlusion.
Examples of successful registration for additional frames are shown in Fig. 7. Columns (a) and (b) show the input color and thermal images, while column (c) illustrates the initial registration of the objects in the scene and column (d) shows the resulting registration overlay after the disparity voting has been performed. These examples show the registration success of the disparity voting algorithm in handling occlusion and properly registering multiple objects at widely disparate depths from the camera.
Fig. 7. Registration results using disparity voting algorithm for example frames. (a) Color. (b) Infrared. (c) Unregistered. (d) Registered. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
 
4.1. Algorithmic evaluation
We have analyzed the registration results of our disparity voting algorithm for more than 2000 frames of captured video. To evaluate the registration, we define a frame as correct when the color and infrared data corresponding to each foreground object in the scene are visibly aligned. If one or more objects in the scene are not visibly aligned, then the registration is deemed incorrect for the entire frame. Table 2 shows the results of this evaluation. The data is broken down into groups based on the number of objects in the scene.
Table 2.
Registration results for disparity voting algorithm with multiple people in a scene

No. objects in frame    No. frames correct    Total frames    % Correct
1                       55                    55              100.00
2                       171                   172             99.42
3                       1087                  1111            97.84
4                       690                   720             95.83
Total                   2003                  2058            97.33
 
This analysis shows that when there was no visible occlusion in the scene, registration was correct 100% of the time. We further break down the analysis to consider only the frames where there are occluding objects in the scene. Under these conditions, the registration success of the disparity voting algorithm is shown in Table 3. The registration results for the occluded frames are still quite high, with most errors occurring during times of near total occlusion.
Table 3.
Registration results for disparity voting algorithm with multiple people in a scene: frames with occlusion

No. objects in frame    No. frames correct    Total frames    % Correct
2                       51                    52              98.08
3                       653                   677             96.45
4                       581                   611             95.09
Total                   1285                  1340            95.90
 
4.2. Accuracy evaluation using ground truth disparity values
In order to demonstrate the accuracy of our disparity voting algorithm (DV) in handling occlusions, we offer a quantitative comparison to ground truth. It is our contention that the disparity voting algorithm will provide good registration results during occlusions, when initial segmentation gives regions that contain merged objects. Our disparity voting algorithm makes no assumptions about the assignment of pixels to individual objects, only that a reasonable segmentation can be obtained. We demonstrate that the disparity voting registration can successfully register all objects in the scene even through occlusions. We also show the results for the bounding box approach (BB) [17] for completeness.
We generate the ground truth by manually segmenting the regions that correspond to foreground for each image. We then determine the ground truth disparity by individually matching each manually segmented object in the scene. This ground truth disparity image allows us to directly and quantitatively compare the registration success of the disparity voting algorithm and the bounding box approach. By comparing the registration results to the ground truth disparities, we are able to quantify the success of each algorithm and show that the disparity voting algorithm outperforms the bounding box approach for occluding object regions.
Fig. 8 illustrates the ground truth disparity comparison tests. Column (a) shows the ground truth disparity, column (b) shows the disparity generated using the bounding box (BB) algorithm, and column (c) shows the disparity generated using the disparity voting (DV) algorithm. Fig. 9 plots the absolute difference in disparity values from the ground truth for each corresponding row in Fig. 8. The BB results are plotted in dotted red, while the DV results are plotted in solid blue. Notice how the two algorithms perform identically to ground truth in the first row, as there are no occlusion regions. The subsequent examples all have occlusion regions, and the DV approach more closely follows ground truth than the BB approach. The BB registration results have multiple objects registered at the same depth even though the ground truth shows that they are at separate depths. Our disparity voting algorithm is able to determine the distinct ground truth disparities for different objects, and the |Δ Disparity| plots show that the DV algorithm is quantitatively closer to the ground truth: most registration errors are within one pixel of ground truth, and larger errors usually occur only in small portions of the image. On the other hand, when errors occur in the bounding box approach, the resulting disparity offset error is large and occurs over the entire extent of the erroneously registered object.
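A small sketch of how such a ground truth comparison can be computed is given below; gt, est, and fg_mask are assumed to be same-sized arrays holding the ground truth disparities, the algorithm's disparities, and the manually labeled foreground:

```python
import numpy as np

def disparity_error(gt, est, fg_mask):
    """Mean |Delta disparity| over the labeled foreground, plus a per-column
    error profile (roughly the kind of curve plotted in Fig. 9)."""
    diff = np.abs(gt.astype(float) - est.astype(float))
    mean_err = diff[fg_mask].mean()
    counts = np.maximum(fg_mask.sum(axis=0), 1)          # avoid division by zero
    per_column = np.where(fg_mask, diff, 0.0).sum(axis=0) / counts
    return mean_err, per_column
```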
Fig. 8. Comparison of bounding box (BB) approach to the proposed disparity voting algorithm for ground truth segmentation. (a) Ground truth. (b) BB disparity. (c) DV disparity.
 
Fig. 9. Plots of |ΔD| from ground truth for each example inFig. 8. Bounding box errors for an example row are plotted in dotted red, while errors in disparity voting registration are plotted in solid blue. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
 
4.3. Comparative study of registration algorithms with non-ideal segmentation
We perform a qualitative evaluation using the real segmentations generated from codebook background subtraction in the color image and intensity thresholding in the thermal image. These common segmentation algorithms only give foreground pixels and make no attempt to discern the structure of objects in the scene. Fig. 10 illustrates several examples that compare the registration results of the disparity voting and bounding box algorithms. Notice how the disparities for the bounding box (BB) algorithm in row (5) are constant for the entire occlusion region even though the objects are clearly at very different disparities. The disparity results for our disparity voting algorithm in row (6) show distinct disparities in the occlusion regions that correspond to the appropriate objects in the scene. Visual inspection of rows (7) and (8) shows that the resulting registered alignment from the disparity values is more accurate for the DV approach.
Fig. 10. Comparison of BB algorithm[17] to the proposed disparity voting (DV) algorithm for a variety of occlusion examples using non-ideal segmentation: (1) the color image, (2) the color segmentation, (3) the thermal image, (4) the thermal segmentation, (5) the BB disparity image, (6) the DV disparity image, (7) the BB registration, (8) the DV registration. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
 
Fig. 11 shows the registration alignment for each algorithm in closer detail for a selection of frames. Notice how the disparity voting approach is able to align each object in the frame, while the bounding box approach has alignment errors due to the fact that the segmentation of the image yielded bounding boxes that contained more than one object. Clearly, disparity voting is able to handle the registration in these occlusion situations and the resulting alignment appears qualitatively better than the bounding box approach.
Fig. 11. Details of registration alignment errors in the bounding box registration approach and corresponding alignment success for the disparity voting (DV) Algorithm for several occlusion examples using non-ideal segmentation. (a) BB registration. (b) DV registration.
 
4.4. Robustness evaluation
We demonstrate the robustness of our algorithm by applying it to another set of data taken of a different scene with a different set of cameras. For these experiments, we have up to six people move through an approximately 6 m × 6 m environment. The cameras are arranged with a 10 cm baseline and are calibrated and rectified as described in Section 3.1. Again, segmentation is performed using the codebook background model for the color imagery and intensity thresholding for the thermal imagery. Correspondence window sizes and threshold values were kept constant from past experiments.
Fig. 12 shows successful registration for example frames containing an increasing number of people in the scene. Column (c) of the figure shows distinct levels of alignment disparity for each person in the scene and column (e) shows the resulting registered alignment. Notice how the disparity voting algorithm is able to properly determine the disparities necessary to align the color and thermal images in situations with multiple people and multiple levels of occlusion. Fig. 13 shows detailed examples of the registration alignment. Note how image features, especially the facial regions, appear well aligned in the images.
Fig. 12. Examples illustrating the robustness of the disparity voting algorithm in registering multiple people in a scene. Each row contains an increasing number of people. Column (e) illustrates the registration using disparity voting. It is a marked improvement over the initial, unregistered image in column (d). (a) Color. (b) Thermal. (c) Disparity. (d) Unregistered. (e) Registered.
 
Fig. 13. Detailed examples of successful registration alignment using disparity voting.
 
5. Multimodal video analysis for person tracking: basic framework and experimental study
We have shown that the disparity voting algorithm for multimodal registration is a robust approach to estimating the alignment disparities in scenes with multiple occluding people. The disparities generated from the registration process yield values that can be used to differentiate the people in the room. It is with this in mind that we investigate the use of multimodal disparity as a feature for tracking people in a scene.
Tracking human motion using computer vision approaches is a well-studied area of research, and a good survey by Moeslund and Granum [23] gives lucid insight into the issues, assumptions and limitations of a large variety of tracking approaches. One approach, disparity based tracking, has been investigated for conventional color stereo cameras and has proven quite robust in localizing and maintaining tracks through occlusion, as the tracking is performed in 3D space by transforming the stereo image estimates into a plan-view occupancy map of the imaged space [24]. We wish to explore the feasibility of using such approaches to tracking with the disparities generated from disparity voting registration. An example sequence of frames in Fig. 14 illustrates the type of people movements we aim to track. The sequence has multiple people occupying the imaged scene. Over the sequence, there are multiple occlusions of people at different depths. The registration disparities that are used to align the color and thermal images can be used as a feature for tracking people through these occlusions and maneuvers.
Fig. 14. Example input sequence for multiperson tracking experiments. Notice occlusions, scale, appearance and disparity variations. (a) Frame 0. (b) Frame 20. (c) Frame 40. (d) Frame 60. (e) Frame 80. (f) Frame 100. (g) Frame 120. (h) Frame 140.
 
Fig. 15 shows an algorithmic framework for multimodal person tracking. In tracking approaches, representative features are typically extracted from all available images in the setup[25]. Features are used to associate tracks from frame to frame and the output of the tracker is often used to guide subsequent feature extraction. All of these algorithmic modules are imperative for reliable and robust tracking. For our initial investigations, we will focus on the viability of registration disparity as a tracking feature.
Fig. 15. Algorithmic flowchart for multiperson tracking.
 
In order to determine the accuracy of the disparity estimates for tracking, we first calibrate the scene. This is done by having a person walk around the testbed area, stopping at preset locations in the scene. At each location we measure the disparity generated from our algorithm and use that as ground truth for analyzing the disparities generated when there are more complex scenes with multiple people and occlusions. Fig. 16(a) shows the variable baseline multimodal stereo rig and Fig. 16(b) shows the ground truth disparity range for the testbed from the calibration experiments captured with this rig.
Fig. 16. (a) Variable baseline multimodal stereo rig, (b) experimentally determined disparity range for testbed. The disparities were computed by determining the disparities for a single person standing at predetermined points in the imaged scene.
 
To show the viability of registration disparity as a tracking feature in a multimodal stereo context, we compare ground truth positional estimates to those generated from the disparity voting algorithm. Lateral position information for each track was hand segmented by clicking on the center point of the person's head in each image. This is a reasonable method, as robust head detection algorithms could be implemented for both color and thermal imagery (skin tone, hot spots, head template matching). Approaches such as vertical projection or v-disparity could also be used to determine the locations of people in the scene. Ground truth disparity estimates were generated by visually determining the disparity based on the person's position relative to the ground truth disparity range map as shown in Fig. 16. Experimental disparities were generated using the disparity voting algorithm, with the disparity of each person determined from disparity values in the head region. A moving average of 150 ms was used to smooth the instantaneous disparity estimates.
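A sketch of this temporal smoothing is given below; the frame rate is an assumed value used to convert the 150 ms window into a number of frames, and the person_id bookkeeping is hypothetical:

```python
from collections import deque, defaultdict

FRAME_RATE = 30.0                     # assumed capture rate (frames per second)
WINDOW_S = 0.150                      # 150 ms moving-average window
window_len = max(1, int(round(WINDOW_S * FRAME_RATE)))

# One short history of recent head-region disparities per tracked person.
history = defaultdict(lambda: deque(maxlen=window_len))

def smoothed_disparity(person_id, instantaneous_d):
    """Push the newest head-region disparity and return its moving average."""
    h = history[person_id]
    h.append(float(instantaneous_d))
    return sum(h) / len(h)
```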
Fig. 17 shows the track patterns and ground truth for the example sequence in Fig. 14. The ground truth is plotted in solid colors for each person in the sequence, while the disparity estimates from the disparity voting algorithm are shown in corresponding colored symbols with dotted lines connecting the estimates. Fig. 17(a) is a representation of the tracks, illustrating a "plan-view"-like representation of the movements and disparity changes of the people in the testbed. Fig. 17(b) shows a time varying version of the same data, with the frame number plotted in the third dimension.
Fig. 17. Tracking results showing close correlation between ground truth (in solid colors) and disparity tracked estimates (in dotted colors). Each color shows the path of each person in the sequence. (a) Track patterns and ground truth for four person tracking experiment. (b) Time varying track patterns and ground truth for four person tracking experiment. (For interpretation of the references to color in this figure legend, the reader is referred to the Web version of this article.)
 
The plots in Fig. 17 show that the disparities generated from disparity voting registration follow the ground truth tracks reasonably well. As the green tracked person moves behind and is occluded by the blue tracked person, the disparities generated when he re-emerges from the occlusion are in line with the ground truth disparities and can be used to re-associate the track after the occlusion.
Errors from ground truth are particularly apparent when people are further from the camera, because of the non-linearity of the disparity distribution: there are more distinct disparity values near the camera, and deeper in the scene in Fig. 16 the change in disparity for the same change in distance is much smaller. At these distances, errors of even one disparity shift are very pronounced. Conventional stereo algorithms typically use approaches that give subpixel accuracy, but the current implementation of our disparity voting algorithm gives only pixel-level disparity shifts. While this may be acceptable for registration alignment, refinement steps are necessary to make disparity a more robust tracking feature. Approaches that use multiple primitives[26], such as edges, shapes, and silhouettes, could be used to augment the accuracy of the disparity voting algorithm. Additionally, using multiple tracking features could provide additional measurements to boost the association accuracy.
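This effect follows directly from the stereo relation d = fB/Z: for a fixed step in depth, the corresponding change in disparity shrinks roughly quadratically with distance. The short example below uses placeholder focal length and baseline values, chosen only to illustrate the point.

# Illustration of the non-linear disparity distribution using d = f * B / Z.
# Focal length and baseline are assumed placeholder values.
FOCAL_PX = 800.0
BASELINE_M = 0.12

def disparity(depth_m):
    return FOCAL_PX * BASELINE_M / depth_m

for z in (2.0, 2.5, 6.0, 6.5):
    print(f"Z = {z:.1f} m -> d = {disparity(z):.1f} px")
# Near the camera a 0.5 m step changes disparity by ~9.6 px (48.0 -> 38.4),
# while at 6 m the same step changes it by only ~1.2 px (16.0 -> 14.8).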
6. Discussion and concluding remarks
Multimodal imagery applications for human analysis span a variety of application domains, including medical[1], in-vehicle safety systems[2] and long-range surveillance[3]. Often, the registration algorithms these systems employ do not operate on data containing multiple objects at depths that differ significantly relative to their distance from the camera. It is in this realm, including close-range surveillance[20] and pedestrian detection applications[27], that we believe disparity voting registration techniques and the corresponding tracking algorithms will prove useful.
In this paper we have provided an analysis of approaches to multimodal image registration and detailed the assumptions, applicability and limitations of each. We then introduced and analyzed a method for registering multimodal images with occluding objects in the scene. Using the disparity voting approach, an analysis of over 2000 frames yielded a registration success rate of over 97%, and a 96% success rate when considering only occlusion examples. Additionally, ground truth accuracy evaluations illustrate how the disparity voting algorithm provides accurate registration for multiple people in scenes with occlusion. Comparative studies show quantitative and qualitative improvements in accuracy and robustness over previous bounding box techniques. Finally, we have presented a framework for tracking and shown promising experimental studies suggesting that disparity voting results can be used as a feature that allows for the differentiation of people in a scene and gives accurate tracking associations in complex scenes with multiple people and occlusions.
References
[1] P. Thevenaz, M. Bierlaire, M. Unser, Halton sampling for image registration based on mutual information, Sampling Theory in Signal and Image Processing (in press).
[2] M.M. Trivedi, S.Y. Cheng, E.M.C. Childers and S.J. Krotosky, Occupant posture analysis with stereo and thermal infrared video: algorithms and experimental evaluation, IEEE Trans. Veh. Technol. 53 (2004) (6), pp. 1712–1968.
[3] J. Davis, V. Sharma, Fusion-based background-subtraction using contour saliency, in: IEEE CVPR Workshop on Object Tracking and Classification beyond the Visible Spectrum, 2005.
[4] P. Viola and W.M. Wells, Alignment by maximization of mutual information, Int. J. Comput. Vis. 24 (1997) (2), pp. 137–154.
[5] G. Egnal, Mutual information as a stereo correspondence measure, Tech. Rep. MS-CIS-00-20, University of Pennsylvania, 2000.
[6] P. Thevenaz and M. Unser, Optimization of mutual information for multiresolution image registration, IEEE Trans. Image Process. 9 (2000) (12), pp. 2083–2089.
[7] G.L. Foresti, C.S. Regazzoni and P.K. Varshney, Multisensor Surveillance Systems: The Fusion Perspective, Springer Press (2003).
[8] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision, Cambridge University Press (2002).
[9] C.O. Conaire, E. Cooke, N. O’Connor, N. Murphy, A. Smeaton, Background modeling in infrared and visible spectrum video for people tracking, in: IEEE CVPR Workshop on Object Tracking and Classification beyond the Visible Spectrum, 2005.
[10] M. Irani, P. Anandan, Robust multi-sensor image alignment, in: Sixth International Conference on Computer Vision, 1998.
[11] E. Coiras, J. Santamaria and C. Miravet, Segment-based registration technique for visual-infrared images, Opt. Eng. 39 (2000) (1), pp. 282–289.
[12] J. Han, B. Bhanu, Detecting moving humans using color and infrared video, in: IEEE Inter. Conf. on Multisensor Fusion and Integration for Intelligent Systems, 2003.
[13] M. Itoh, M. Ozeki, Y. Nakamura, Y. Ohta, Simple and robust tracking of hands and objects for video-based multimedia production, in: IEEE Conf. on Multisensor Fusion and Integration for Intelligent Systems, 2003.
[14] G. Ye, Image registration and super-resolution mosaicing, 2005.
[15] X. Ju, J.-C. Nebel, J.P. Siebert, 3D thermography imaging standardization technique for inflammation diagnosis, in: Proc. SPIE, Photonics Asia, 2004.
[16] M. Bertozzi, A. Broggi, M. Felias, G. Vezzoni, M.D. Rose, Low-level pedestrian detection by means of visible and far infra-red tetra-vision, in: IEEE Conf. on Intelligent Vehicles, 2006.
[17] H. Chen, P. Varshney, M. Slamani, On registration of regions of interest (ROI) in video sequences, in: IEEE Conf. on Advanced Video and Signal Based Surveillance (AVSS’03), 2003.
[18] J. Davis, V. Sharma, Robust detection of people in thermal imagery, in: IEEE 17th Inter. Conf. on Pattern Recognition, 2004.
[19] J. Davis, V. Sharma, Robust background-subtraction for person detection in thermal imagery, in: IEEE Conf. on Computer Vision and Pattern Recognition Workshops, 2004.
[20] S.J. Krotosky, M.M. Trivedi, Registration of multimodal stereo images using disparity voting from correspondence windows, in: IEEE Conf. on Advanced Video and Signal based Surveillance (AVSS’06), 2006.
[21] J.-Y. Bouguet, Camera Calibration Toolbox for Matlab.
[22] K. Kim, T. Chalidabhongse, D. Harwood and L. Davis, Real-time foreground-background segmentation using codebook model, Real-Time Imaging 11 (2005) (3), pp. 163–256.
[23] T.B. Moeslund and E. Granum, A survey of computer vision-based human motion capture, Comput. Vis. Image Und. 81 (2001) (3), pp. 231–268.
[24] M. Harville, D. Li, Fast, integrated person tracking and activity recognition with plan-view templates from a single stereo camera, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2004.
[25] K. Huang and M.M. Trivedi, Video arrays for real-time tracking of person, head, and face in an intelligent room, Mach. Vis. Appl. 14 (2003) (2), pp. 103–111.
[26] S. Marapane and M.M. Trivedi, Multi-primitive hierarchical (MPH) stereo analysis, IEEE Trans. Pattern Anal. Mach. Intell. 16 (1994) (3), pp. 227–240.
[27] S.J. Krotosky, M.M. Trivedi, Multimodal stereo image registration for pedestrian detection, in: IEEE Conf. on Intelligent Transportation Systems, 2006.
This research is sponsored by the Technical Support Working Group (TSWG) for Combating Terrorism, DHS and the U.C. Discovery Grant.