
The alignment method

We will consider one particular version of the problem in greater detail: we are asked to identify a three-dimensional object from its projection on the image plane. For convenience, the projection process is modelled as a scaled orthographic projection. We do not know the pose of the object, that is, its position and orientation with respect to the camera. The object is represented by a set of m features or distinguished points μ₁, μ₂, ..., μₘ in three-dimensional space, perhaps vertices for a polyhedral object. These are measured in some coordinate system natural for the object. The points are then subjected to an unknown 3-D rotation R, followed by translation by an unknown amount t and projection, to give rise to image feature points p₁, p₂, ..., pₙ on the image plane. In general, n ≠ m, because some model points may be occluded, and the feature detector in the image may also miss true features and mark false ones due to noise. We can express this as

    pᵢ = Π(Rμᵢ + t)

for a 3-D model point μᵢ and the corresponding image point pᵢ. Here Π denotes perspective projection or one of its approximations, such as scaled orthographic projection. We can summarize this by the equation pᵢ = Q(μᵢ), where Q is the (unknown) transformation that brings the model points into alignment with the image. Assuming the object is rigid, the transformation Q is the same for all the model points.

One can solve for Q given the 3-D coordinates of three model points and their 2-D projections. The intuition is as follows: one can write down equations relating the coordinates of pᵢ to those of μᵢ. In these equations, the unknown quantities correspond to the parameters of the rotation matrix R and the translation vector t. If we have sufficiently many equations, we ought to be able to solve for Q. We will not give any proof here, but merely state the following result (Huttenlocher and Ullman, 1990):

    Given three noncollinear points μ₁, μ₂, and μ₃ in the model, and their projections p₁, p₂, and p₃ on the image plane under scaled orthographic projection, there exist exactly two transformations from the three-dimensional model coordinate frame to a two-dimensional image coordinate frame.

These transformations are related by a reflection around the image plane and can be computed by a simple closed-form solution. We will just assume that there exists a function FIND-TRANSFORM, as shown in Figure 24.27. If we could identify the corresponding model features for three features in the image, we could compute Q, the pose of the object. The problem is that we do not know these correspondences. The solution is to operate in a generate-and-test paradigm: we guess an initial correspondence of an image triplet with a model triplet and use the function FIND-TRANSFORM to hypothesize Q. If the guessed correspondence was correct, then Q will be correct, and when applied to the remaining model points it will predict the corresponding image points. If the guessed correspondence was incorrect, then Q will be incorrect, and when applied to the remaining model points it will fail to predict the image points.
function FIND-TRANSFORM(p₁, p₂, p₃, μ₁, μ₂, μ₃) returns a transform Q
  such that Q(μ₁) = p₁, Q(μ₂) = p₂, Q(μ₃) = p₃
  inputs: p₁, p₂, p₃, image feature points
          μ₁, μ₂, μ₃, model feature points

Figure 24.27   The definition of the transformation-finding process. We omit the algorithm (Huttenlocher and Ullman, 1990).
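Under our modelling assumptions, the transform Q returned by FIND-TRANSFORM is a rigid motion (rotation R and translation t) followed by scaled orthographic projection. The following minimal Python sketch applies such a transform to a model point; the function name and the particular rotation, translation, and scale values are illustrative assumptions, not part of the omitted algorithm.

import numpy as np

def apply_pose(mu, R, t, s):
    """Apply a hypothesized pose Q to a 3-D model point mu:
    Q(mu) = s * Pi(R mu + t), where Pi is orthographic projection onto the
    image plane (drop the z coordinate), R is a 3x3 rotation matrix,
    t a translation vector, and s a scalar scale factor."""
    camera_point = R @ np.asarray(mu, dtype=float) + t   # rigid motion
    return s * camera_point[:2]                          # scaled orthographic projection

# Example: a model point viewed after a 30-degree rotation about the z axis.
theta = np.radians(30.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
t = np.array([0.5, -0.2, 4.0])
p = apply_pose([1.0, 2.0, 0.0], R, t, s=0.8)   # a 2-D image feature point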

function ALIGN(image feature points, model feature points) returns a solution or failure
  loop do
    choose an untried triplet pᵢ₁, pᵢ₂, pᵢ₃ from the image
    if no untried triplets are left then return failure
    while there are still untried model triplets left do
      choose an untried triplet μⱼ₁, μⱼ₂, μⱼ₃ from the model
      Q ← FIND-TRANSFORM(pᵢ₁, pᵢ₂, pᵢ₃, μⱼ₁, μⱼ₂, μⱼ₃)
      if projection according to Q explains the image then return (success, Q)
    end
  end

Figure 24.28   An informal description of the alignment algorithm.
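As a concrete (and hedged) illustration of Figure 24.28, the following Python sketch enumerates image and model triplets, hypothesizes a transform for each pairing, and tests whether it explains the image. The helpers find_transform (standing in for the omitted Huttenlocher-Ullman solver) and explains_image (the verification test) are assumed to be supplied; their names are placeholders for illustration.

from itertools import combinations

def align(image_points, model_points, find_transform, explains_image):
    """Generate-and-test alignment in the spirit of Figure 24.28.

    image_points  : 2-D feature points p_i detected in the image
    model_points  : 3-D feature points mu_j of the model
    find_transform: maps (image triplet, model triplet) to a candidate Q, or None
    explains_image: returns True if Q's projection explains the image
    """
    for image_triplet in combinations(image_points, 3):
        for model_triplet in combinations(model_points, 3):
            Q = find_transform(image_triplet, model_triplet)
            if Q is not None and explains_image(Q, image_points, model_points):
                return Q        # success: a pose that explains the image
    return None                 # failure: no triplet pairing yields a consistent pose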

This is the basis of the algorithm ALIGN, which seeks to find the pose for a given model and returns failure otherwise (see Figure 24.28). The worst-case time complexity of the algorithm is proportional to the number of combinations of model triplets and image triplets (this gives the number of times Q has to be computed) times the cost of verification. This gives (m choose 3)(n choose 3) times the cost of verification. The cost of verification is m log n, because we must predict the image position of each of the m model points and find the distance to the nearest image point, a log n operation if the image points are arranged in an appropriate data structure. Thus, the worst-case complexity of the alignment algorithm is O(m⁴n³ log n), where m and n are the number of model and image points, respectively.

One can lower the time complexity in a number of ways. One simple technique is to hypothesize matches only between pairs of image and model points. Given two image points and the edges at these points, a third virtual point can be constructed by extending the edges and finding their intersection. This lowers the complexity to O(m³n² log n). Techniques based on pose clustering in combination with randomization (Olson, 1994) can be used to bring the complexity down to O(mn³). Results from the application of this algorithm to the stapler image are shown in Figure 24.29.
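The log n term in the verification cost comes from a nearest-neighbour query per projected model point, which can be realized with a k-d tree over the image points. The sketch below shows one way the explains_image test assumed in the previous sketch might be realized; it assumes SciPy's cKDTree, a hypothetical projection routine project for applying Q, and illustrative tolerance and acceptance thresholds, and is meant only to make the cost argument concrete.

import numpy as np
from scipy.spatial import cKDTree

def explains_image(Q, image_points, model_points, project, tol=2.0, min_fraction=0.5):
    """Verify a hypothesized pose Q (illustrative sketch).

    Each of the m model points is projected into the image by project(Q, mu),
    and the distance to the nearest of the n image points is found with a k-d
    tree (an O(log n) query), giving O(m log n) verification cost overall."""
    tree = cKDTree(np.asarray(image_points, dtype=float))
    projected = np.array([project(Q, mu) for mu in model_points])
    distances, _ = tree.query(projected)     # nearest image point for each projection
    return np.mean(distances < tol) >= min_fraction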

Using projective invariants


The alignment method uses outline geometry, and recognition is considered successful if the outline geometry in an image can be explained as a perspective projection of the geometric model of the object. A disadvantage is that this involves trying each model in the model library, resulting in a recognition complexity proportional to the number of models in the library. A solution is provided by using geometric invariants as the shape representation. These shape descriptors are viewpoint invariant; that is, they have the same value whether measured on the object or measured from a perspective image of the object, and they are unaffected by object pose. The simplest example of a projective invariant is the "cross-ratio" of four points on a line, illustrated in Figure 24.30. Under perspective projection, ratios of distances are not preserved: think of the spacing of sleepers in an image of a receding railway track. The spacing is constant in the world, but decreases with distance from the camera in the image. However, the ratio of ratios of distances on a line is preserved; that is, it is the same measured on the object or in the image.

Figure 24.29   (a) Corners found in the stapler image. (b) Hypothesized reconstruction overlaid on the original image. (Courtesy of Clark Olson.)

Invariants are significant in vision because they can be used as index functions, so that a value measured in an image directly indexes a model in the library. To take a simple example, suppose there are three models {A, B, C} in the library, each with a corresponding and distinct invariant value {I(A), I(B), I(C)}. Recognition proceeds as follows: after edge detection and grouping, invariants are measured from image curves. If a value I = I(B) is measured, then there is evidence that object B is present, and it is not necessary to consider objects A and C any further. For a large model base, the invariants may not all be distinct; that is, several models may share invariant values. Consequently, when an invariant measured in the image corresponds to a value in the library, a recognition hypothesis is generated. Recognition hypotheses corresponding to the same object are merged if compatible. The hypotheses are then verified by back-projecting the outline, as in the alignment method. An example of object recognition using invariants is given in Figure 24.31.
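The cross-ratio itself is simple to compute, and its invariance (the subject of Exercise 24.7) can be checked numerically. Here is a minimal sketch, assuming the four collinear points are given by their one-dimensional coordinates along the line; the projective map used for the check has arbitrary illustrative coefficients.

import math

def cross_ratio(a, b, c, d):
    """Cross-ratio (AD * BC) / (AB * CD) of four collinear points with
    1-D coordinates a, b, c, d, in the form used in Figure 24.30."""
    return ((d - a) * (c - b)) / ((b - a) * (d - c))

def projective_map(x, alpha=2.0, beta=1.0, gamma=0.3, delta=1.5):
    """A 1-D projective map x -> (alpha*x + beta) / (gamma*x + delta);
    the coefficients are arbitrary illustrative values."""
    return (alpha * x + beta) / (gamma * x + delta)

a, b, c, d = 0.0, 1.0, 2.5, 4.0
before = cross_ratio(a, b, c, d)
after = cross_ratio(*(projective_map(x) for x in (a, b, c, d)))
assert math.isclose(before, after)   # the cross-ratio survives the projection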

Another advantage of invariant shape representation is that models can be acquired directly from images. It is not necessary to make measurements on the actual object, because the shape descriptors have the same value when measured in any image. This simplifies and facilitates the automation of model acquisition, which is particularly useful in applications such as recognition from satellite images. Although the two approaches to object recognition that we have described are useful in practice, it should be noted that we are still far from human competence. The generation of sufficiently rich and descriptive representations from images, segmentation and grouping to identify those features that belong together, and the matching of these to object models are all difficult research problems under active investigation.

Figure 24.30   Invariance of the cross-ratio: (AD · BC)/(AB · CD) = (A′D′ · B′C′)/(A′B′ · C′D′). Exercise 24.7 asks you to verify this fact.

Figure 24.31   (a) A scene containing a number of objects, two of which also appear in the model library. These are recognized using invariants based on lines and conics. The image shows 100 fitted lines and 27 fitted conics superimposed in white. Invariants are formed from combinations of lines and conics, and the values index into a model library. In this case, there are 35 models in the library. Note that many lines are caused by texture, and that some of the conics correspond to edge data over only a small section. (b) The two objects from the library are recognized correctly. The lock striker plate is matched with a single invariant and a 50.9% edge match, and the spanner with three invariants and a 70.7% edge match. (Courtesy of Andrew Zisserman.)
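An index of the kind used here, in which invariant values measured in the image point directly to candidate models, could be organized as a table from quantized invariant values to model names. The following sketch is purely illustrative: the class, its quantization step, and the invariant values shown are assumptions, not the actual system behind Figure 24.31.

from collections import defaultdict

class InvariantIndex:
    """Index models by quantized invariant value, so that a value measured in
    an image directly yields recognition hypotheses (illustrative sketch)."""

    def __init__(self, resolution=0.01):
        self.resolution = resolution
        self.table = defaultdict(list)

    def _key(self, value):
        # Quantize so that nearly equal invariant values fall in the same bucket.
        return round(value / self.resolution)

    def add_model(self, name, invariant_values):
        for v in invariant_values:
            self.table[self._key(v)].append(name)

    def hypotheses(self, measured_value):
        # Candidate models for a measured invariant; these hypotheses are then
        # merged and verified by back-projection, as described in the text.
        return self.table[self._key(measured_value)]

library = InvariantIndex()
library.add_model("lock striker plate", [1.73, 2.10])   # illustrative invariant values
library.add_model("spanner", [0.92, 1.38, 2.56])
print(library.hypotheses(1.38))    # -> ['spanner']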
