Image Stitching - Part 1: Feature Detection with SIFT
Image stitching is the process of combining multiple images, through a sequence of algorithms, into a single scene. I’ll be covering each component involved in stitching images in detail.
Jupyter notebook: sift.ipynb
Code: utils.py
Feature Detection with SIFT
If we want to construct a whole scene from several different images, we must first search for the features one image shares in common with another. These features must also be recognizable from different perspectives. An algorithm commonly used for this type of problem is SIFT (Scale Invariant Feature Transform). It applies a sequence of mathematical operations to an image and reveals the points that best describe it.
Scale-space Construction
When looking at an object in the real world, it is much easier to describe it in detail when it is right in front of your eyes than when it is further away. A person standing a few feet away is much easier to describe than a person that is 500 feet away. A scale-space progressively suppresses the fine detail of an object so that it can be described consistently at any size, meaning we can give the same description from different perspectives.
SIFT convolves an image with a 2D Gaussian to produce a scale-space defined as:
\[\begin{align} L(x,y,\sigma)=G(x,y,\sigma)\ast I(x,y) \end{align}\] \[\begin{align} G(x,y,\sigma)=\frac{1}{2\pi\sigma^{2}} \exp \left(-\frac{x^{2} + y^{2}}{2\sigma^2} \right) \end{align}\]where \(L\) is the convolution of image \(I\) and 2D Gaussian \(G\) with standard deviation \(\sigma\) as the scale parameter. Since \(I\) is discrete, Gaussian \(G\) is represented with a kernel to perform the operation in practice.
SIFT applies this multiple times to the same image with progressively higher scale values to create an octave, where the scale parameter increases by \(\sigma_{i}=k\sigma_{i-1}\). It then resizes the image by half and repeats the same process for the next octave.
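To make this concrete, here is a minimal sketch of building the Gaussian octaves with NumPy and SciPy. This is not the code from utils.py; the function and parameter names (`build_gaussian_octaves`, `sigma0`, `scales_per_octave`) are illustrative, and a grayscale float image is assumed.

```python
# A minimal sketch of constructing Gaussian octaves (illustrative names, not utils.py).
import numpy as np
from scipy.ndimage import gaussian_filter

def build_gaussian_octaves(image, n_octaves=4, scales_per_octave=3, sigma0=1.6):
    k = 2 ** (1.0 / scales_per_octave)          # constant factor between adjacent scales
    octaves = []
    for _ in range(n_octaves):
        sigma = sigma0
        octave = []
        for _ in range(scales_per_octave + 3):   # extra levels so DoG covers a full octave
            octave.append(gaussian_filter(image, sigma))
            sigma *= k                           # sigma_i = k * sigma_{i-1}
        octaves.append(np.stack(octave))
        image = image[::2, ::2]                  # downsample by half for the next octave
    return octaves
```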
Difference of Gaussian
The motivation behind the Difference of Gaussians (DoG) is to serve as a fast alternative for blob detection while maintaining scale invariance. Typically, a scale-normalized Laplacian of Gaussian (LoG) is used for this, but with a bit of clever math, DoG achieves a good approximation of the scale-normalized LoG efficiently, since the smoothed images \(L\) from the scale-space can simply be subtracted from one another.
Proving DoG \(\approx\) LoG
The Laplacian of Gaussian for image \(I\) is given as:
\[\nabla^{2}L=\nabla^{2}(G\ast I)=\underbrace{\nabla^{2}G}_\textrm{LoG}\ast I\]LoG can be expanded to:
\[\frac{\partial^{2}G}{\partial x^{2}}=\frac{G(x,y,\sigma)}{\sigma^{2}} \left(\frac{x^2}{\sigma^2} - 1\right) \quad \frac{\partial^{2}G}{\partial y^{2}}=\frac{G(x,y,\sigma)}{\sigma^{2}} \left(\frac{y^2}{\sigma^2} - 1\right)\] \[\nabla^{2}G=\frac{\partial^{2}G}{\partial x^{2}} + \frac{\partial^{2}G}{\partial y^{2}} = \frac{G(x,y,\sigma)}{\sigma^{2}}\left(\frac{x^2 + y^2}{\sigma^2} - 2\right)\]Then the derivative of the Gaussian with respect to \(\sigma\) can be expressed as:
\[\begin{align*} \frac{\partial G}{\partial \sigma} &= \sigma \left[ \frac{G(x,y,\sigma)}{\sigma^{2}} \left(\frac{x^2 + y^2}{\sigma^2} - 2\right) \right] = \sigma \nabla^{2}G \end{align*}\]The scale-normalized LoG is obtained by multiplying both sides by \(\sigma^{2}\), which eliminates the \(\sigma^{-2}\) term:
\[\underbrace{\sigma^{2}\nabla^{2}G}_\textrm{Scale Norm LoG} = G(x,y,\sigma)\left(\frac{x^2 + y^2}{\sigma^2} - 2\right)\]Using this, we can take a finite difference approximation between nearby scales \(\sigma\) and \(k\sigma\) to uncover the relationship between the scale-normalized LoG and DoG.
\[\begin{align*} \frac{\partial G}{\partial \sigma} = \sigma \nabla^{2}G \approx \frac{G(x,y,k\sigma) - G(x,y,\sigma)}{k\sigma - \sigma} = \frac{G(x,y,k\sigma) - G(x,y,\sigma)}{\sigma(k - 1)} \end{align*}\] \[\begin{align} (k-1)\underbrace{\sigma^{2}\nabla^{2}G}_\textrm{Scale Norm LoG} \approx \underbrace{G(x,y,k\sigma) - G(x,y,\sigma)}_\textrm{DoG} \end{align}\]This shows that by taking the difference between two Gaussians whose scales differ by a constant factor \(k\), the \(\sigma^{2}\) term required for the scale-normalized LoG is automatically incorporated. The factor \((k-1)\) is constant for every \(\sigma\), so it does not influence the locations of extrema, which is what matters for the following steps.
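In practice this means the DoG stack for an octave is just the difference of adjacent Gaussian levels. A minimal sketch, assuming the `(levels, height, width)` array produced by the octave sketch above:

```python
# A minimal sketch: subtracting adjacent Gaussian levels gives the DoG stack, which by
# the derivation above approximates (k - 1) * sigma^2 * LoG at each level.
import numpy as np

def difference_of_gaussians(octave):
    # octave[i + 1] has scale k * sigma_i, so each slice is G(k*sigma) - G(sigma) convolved with I
    return octave[1:] - octave[:-1]
```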
Local Extrema Detection
Now that DoG gives us a good representation of regions with rapid changes in the image, we can locate the extrema and use them as keypoint candidates. To detect extrema, each sample point is compared to its eight neighbors in the current scale and the nine neighbors in each of the adjacent scales (the one above and the one below). It is selected as a keypoint candidate only if its value is greater than all 26 of these neighbors or less than all of them.
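A minimal (and deliberately slow) sketch of this neighbor comparison over a DoG stack, with illustrative names; a real implementation would vectorize this or use maximum/minimum filters:

```python
# A minimal sketch of local extrema detection in a (scales, H, W) DoG stack.
import numpy as np

def find_extrema(dog):
    S, H, W = dog.shape
    candidates = []
    for s in range(1, S - 1):
        for y in range(1, H - 1):
            for x in range(1, W - 1):
                cube = dog[s - 1:s + 2, y - 1:y + 2, x - 1:x + 2]  # 3x3x3 neighborhood
                v = dog[s, y, x]
                if v == cube.max() or v == cube.min():             # ties count as extrema here
                    candidates.append((s, y, x))
    return candidates
```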
Keypoint Localization and Refinement
Naturally, the keypoints extracted from the previous step are crudely localized because of the quantization used in constructing the scale-space. As a result, there can be many unreliable keypoint candidates because their locations are approximations of the extrema. SIFT refines the keypoint candidates by fitting their local neighborhoods with the second order Taylor expansion of the DoG function. Then it interpolates the discrete approximations from the previous step to give a better representation of the extrema.
Interpolation of Nearby Data
This is done by using the second order Taylor expansion for the DoG function with a sample keypoint \(\mathbf z_i = [x_i \ y_i \ \sigma_i]^{T}\) and shifting it by an offset \(\mathbf z = [x \ y \ \sigma]^{T}\) so that the origin is at the sample’s location.
\[D(x,y,\sigma) = D(\mathbf{z_i}+\mathbf z) \approx D(\mathbf{z_i}) + \bigg(\frac{ \partial D}{\partial \mathbf z}\bigg)^T\bigg|_{\mathbf z=\mathbf z_i}\mathbf z + \frac 1 2 \mathbf z^T\frac{ \partial^2 D}{\partial \mathbf z^2}\bigg|_{\mathbf z=\mathbf z_i} \mathbf z\]To locate an extremum \(\mathbf{\hat z}\), take the derivative of the Taylor expansion with respect to \(\mathbf{z}\) and set it to 0:
\[\begin{align*} \frac{\partial}{\partial \mathbf{z}} \bigg[D(\mathbf{z_i}) + \bigg(\frac{ \partial D}{\partial \mathbf z}\bigg)^T\mathbf z + \frac 1 2 \mathbf z^T\frac{ \partial^2 D}{\partial \mathbf z^2} \mathbf z \bigg] = \frac{ \partial D}{\partial \mathbf z} + \frac{ \partial^2 D}{\partial \mathbf z^2} \mathbf z \\ \end{align*}\]Set it equal to 0 and solve for extremum \(\mathbf{\hat z}\):
\[\begin{align} \frac{ \partial D}{ \partial \mathbf z} + \frac{ \partial^2 D}{\partial \mathbf z^2} \mathbf{\hat z} = 0 \quad \Rightarrow \frac{ \partial^2 D}{\partial \mathbf z^2} \mathbf{\hat z} = -\frac{ \partial D}{ \partial \mathbf z} \quad \Rightarrow \mathbf{\hat z} = -\left(\frac{ \partial^2 D}{\partial \mathbf z^2}\right)^{-1}\frac{ \partial D}{ \partial \mathbf z} \end{align}\]
Eliminating low contrast points
If any component of \(\mathbf{\hat z}\) is greater than 0.5 or less than -0.5, the extremum lies closer to a neighboring sample point, so the candidate is moved to that sample and the fit is repeated. This process is done iteratively until the offset is valid; if it is still not valid by the final iteration, the candidate is discarded. Otherwise, the offset \(\mathbf{\hat z}\) is added to the candidate keypoint \(\mathbf{z_i}\) to get the refined extremum \(\mathbf{z_i} + \mathbf{\hat z}\). We can substitute \(\mathbf z\) in the Taylor expansion with \(\mathbf{\hat{z}}\) to get the interpolated value at the localized extremum:
\[\begin{align*} D(\mathbf{z_i}+\mathbf{\hat z}) = D(\mathbf{z_i}) + \bigg(\frac{ \partial D}{\partial \mathbf z}\bigg)^T\bigg|_{\mathbf z=\mathbf z_i} \mathbf{\hat z} + \frac 1 2 \mathbf{\hat z}^T\frac{ \partial^2 D}{\partial \mathbf z^2}\bigg|_{\mathbf z=\mathbf z_i} \mathbf{\hat z} \end{align*}\] \[\begin{align} = D(\mathbf{z_i}) + \frac 1 2 \bigg(\frac{ \partial D}{\partial \mathbf z}\bigg)^T\bigg|_{\mathbf z=\mathbf z_i} \mathbf{\hat z} \end{align}\]This value represents the contrast at that point. If \(|D(\mathbf{z_i}+\mathbf{\hat z})| < 0.03\), it is considered low contrast and discarded.
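Below is a minimal sketch of one refinement step, using central finite differences on the DoG stack for the gradient and Hessian. The function name is illustrative, and simply returning `None` for an invalid offset (rather than moving to the neighboring sample and iterating) is a simplification of this sketch:

```python
# A minimal sketch of one keypoint refinement step on a (scales, H, W) DoG stack.
import numpy as np

def refine_keypoint(dog, s, y, x, contrast_thresh=0.03):
    # first derivatives [dDx, dDy, dDs] via central differences
    grad = 0.5 * np.array([
        dog[s, y, x + 1] - dog[s, y, x - 1],
        dog[s, y + 1, x] - dog[s, y - 1, x],
        dog[s + 1, y, x] - dog[s - 1, y, x],
    ])
    # second derivatives for the 3x3 Hessian
    dxx = dog[s, y, x + 1] - 2 * dog[s, y, x] + dog[s, y, x - 1]
    dyy = dog[s, y + 1, x] - 2 * dog[s, y, x] + dog[s, y - 1, x]
    dss = dog[s + 1, y, x] - 2 * dog[s, y, x] + dog[s - 1, y, x]
    dxy = 0.25 * (dog[s, y + 1, x + 1] - dog[s, y + 1, x - 1]
                  - dog[s, y - 1, x + 1] + dog[s, y - 1, x - 1])
    dxs = 0.25 * (dog[s + 1, y, x + 1] - dog[s + 1, y, x - 1]
                  - dog[s - 1, y, x + 1] + dog[s - 1, y, x - 1])
    dys = 0.25 * (dog[s + 1, y + 1, x] - dog[s + 1, y - 1, x]
                  - dog[s - 1, y + 1, x] + dog[s - 1, y - 1, x])
    hessian = np.array([[dxx, dxy, dxs],
                        [dxy, dyy, dys],
                        [dxs, dys, dss]])
    try:
        offset = -np.linalg.solve(hessian, grad)       # \hat{z}
    except np.linalg.LinAlgError:
        return None                                    # degenerate neighborhood, discard
    if np.any(np.abs(offset) > 0.5):
        return None                                    # extremum lies closer to another sample
    value = dog[s, y, x] + 0.5 * grad.dot(offset)      # interpolated contrast
    if abs(value) < contrast_thresh:
        return None                                    # low contrast, discard
    return (x + offset[0], y + offset[1], s + offset[2]), value
```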
The red region shows some extrema that were discarded due to having low contrast. The green region shows extrema that shifted by the offset from the interpolation.
Eliminating edge responses
As you may have noticed in the DoG images, the contours of my face and hat were easily discernible, which means there was a strong response along these edges and curves. Some of the points along edges are poorly localized and easily influenced by small amounts of noise, resulting in undesirable extrema. This step also eliminates other erroneous extrema with little local structure.
Using the 2D Hessian \(\mathbf H\) of the DoG, information about the local structure around a keypoint can be obtained. The eigenvalues of \(\mathbf H\) provide the maximal and minimal principal curvatures at a DoG point. Edges will have a large principal curvature orthogonal to the contour and a small one along the contour.
All of the keypoints in the green region should be discarded. In the blue region, the points along the ridge should be kept because the principal curvature orthogonal to the contour is nearly the same as the one along it, indicating that the point does not lie on an edge.
SIFT discards keypoints whose ratio of principal curvatures exceeds a threshold \(r\), as this indicates that the point lies on an edge. In practice, \(r\) is checked against the ratio of the squared trace of the Hessian to its determinant.
\[\mathbf{H} = \begin{bmatrix} D_{xx} & D_{xy} \\ D_{xy} & D_{yy} \end{bmatrix} \quad \quad \frac {\text{Tr}(\mathbf{H})^2} {\text{Det}(\mathbf{H})} = \frac {(\lambda_{1} + \lambda_{2})^2}{\lambda_{1}\lambda_{2}} < \frac{(r+1)^2}{r}\]This relationship can also be expressed as:
\[\begin{align} \frac {\text{Tr}(\mathbf{H})^2} {\text{Det}(\mathbf{H})} < \frac{(r+1)^2}{r} \quad \Rightarrow \quad r\text{Tr}(\mathbf{H})^2 < (r+1)^2\text{Det}(\mathbf{H}) \end{align}\]
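A minimal sketch of this test, estimating the 2x2 spatial Hessian with finite differences; the default \(r = 10\) is the value suggested in the original paper, and the names are illustrative:

```python
# A minimal sketch of the edge-response test on a keypoint in a (scales, H, W) DoG stack.
import numpy as np

def passes_edge_test(dog, s, y, x, r=10.0):
    dxx = dog[s, y, x + 1] - 2 * dog[s, y, x] + dog[s, y, x - 1]
    dyy = dog[s, y + 1, x] - 2 * dog[s, y, x] + dog[s, y - 1, x]
    dxy = 0.25 * (dog[s, y + 1, x + 1] - dog[s, y + 1, x - 1]
                  - dog[s, y - 1, x + 1] + dog[s, y - 1, x - 1])
    tr = dxx + dyy
    det = dxx * dyy - dxy ** 2
    if det <= 0:                        # curvatures have different signs: reject
        return False
    return r * tr ** 2 < (r + 1) ** 2 * det   # keep only if ratio is below threshold
```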
The green dots show the keypoints from the second scale and the red dots are from the third scale, after refinement.
Descriptor Orientation
The descriptors are structured around a set of weighted histograms of gradient orientations taken over the region neighboring the keypoint. The most dominant gradient orientation over this neighborhood is selected as the reference orientation, which makes the resulting descriptor rotation invariant. The size of the region depends on the scale and octave in which the keypoint was located.
For each keypoint, the gradient magnitude \(m\) and the gradient angle \(\theta\) are computed:
\[L(x,y,\sigma) = G(x,y,\sigma) \ast I(x,y)\] \[\begin{align} m(x,y) = \sqrt{(L(x+1,y)-L(x-1,y))^2 + (L(x,y+1)-L(x,y-1))^2} \end{align}\] \[\begin{align} \theta(x,y) = \arctan\left(\frac{ L(x,y+1)-L(x,y-1)}{L(x+1,y)-L(x-1,y)}\right) \end{align}\]
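A minimal sketch of these two formulas over a blurred slice \(L\) (a 2D array taken from the Gaussian octave at the keypoint's scale); `np.arctan2` is used so the angle covers the full 360 degrees:

```python
# A minimal sketch of the gradient magnitude and angle over a blurred image L.
import numpy as np

def gradient_magnitude_and_angle(L):
    dx = np.zeros_like(L)
    dy = np.zeros_like(L)
    dx[:, 1:-1] = L[:, 2:] - L[:, :-2]               # L(x+1, y) - L(x-1, y)
    dy[1:-1, :] = L[2:, :] - L[:-2, :]               # L(x, y+1) - L(x, y-1)
    magnitude = np.sqrt(dx ** 2 + dy ** 2)
    angle = np.degrees(np.arctan2(dy, dx)) % 360.0   # angle in [0, 360)
    return magnitude, angle
```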
The left image shows the patch of a sample keypoint from the green region below. The right image shows the gradient magnitudes represented by the color with their respective gradient angles.
SIFT accumulates the local distribution of gradient angles within the region using a histogram of 36 bins, one for every 10 degrees of a full rotation (360 degrees). The amount added to a bin is the product of the gradient magnitude and a contribution weight at that particular point, where \(\lambda_{ori}\) adjusts the reach of the weighting.
\[\begin{align} \text{Contribution weight at (x,y):} \quad c_{x,y}^{ori} = \exp\left(\frac{-(x^2 + y^2)}{2(\lambda_{ori}\sigma)^2}\right) \end{align}\] \[\begin{align} \text{Bin index at (x,y):} \quad b_{x,y}^{ori} = \frac{\theta_{x,y} n_{bins}}{360} \end{align}\]The histogram is then smoothed by averaging each bin with its neighbors.
After computing the smoothed histogram, the orientation with the greatest magnitude (the peak) is selected as the reference orientation. Additionally, for any other orientation whose magnitude is greater than 80% of the peak, a new keypoint is created at the same location and assigned that orientation.
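Putting the weighting, binning, smoothing, and peak selection together, here is a minimal sketch for a single keypoint; the 3-lambda patch radius and the single smoothing pass are assumptions of this sketch rather than prescribed values, and all names are illustrative:

```python
# A minimal sketch of the orientation histogram around one keypoint at (kx, ky) with scale sigma.
import numpy as np

def orientation_histogram(magnitude, angle, kx, ky, sigma, lambda_ori=1.5, n_bins=36):
    hist = np.zeros(n_bins)
    radius = int(round(3 * lambda_ori * sigma))
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            y, x = ky + dy, kx + dx
            if not (0 <= y < magnitude.shape[0] and 0 <= x < magnitude.shape[1]):
                continue
            weight = np.exp(-(dx ** 2 + dy ** 2) / (2 * (lambda_ori * sigma) ** 2))
            b = int(angle[y, x] * n_bins / 360.0) % n_bins
            hist[b] += weight * magnitude[y, x]
    # smooth by averaging each bin with its circular neighbors, then pick the peak
    hist = (np.roll(hist, 1) + hist + np.roll(hist, -1)) / 3.0
    peak = hist.max()
    reference_bins = [b for b in range(n_bins) if hist[b] >= 0.8 * peak]
    return hist, reference_bins
```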
All together
The green bar shows the orientation that was selected for the feature descriptor and the red bars are the orientations that didn’t pass the threshold.
Descriptor Construction
A descriptor is constructed for the keypoint by partitioning the normalized patch into a set of \(n_{hists} \times n_{hists}\) (where \(n_{hists}=4\)) weighted histograms, forming a spatial grid of gradient orientations from different regions of the normalized patch. Instead of the 36 orientation bins used in the previous step, 8 are used for binning the angles (\(n_{ori}=8\)).
Trilinear interpolation is used to spread the weight across neighboring spatial bins and across neighboring angle bins instead of having it concentrated in a single bin. This creates more robust descriptors when trying to match them with images taken from different perspectives.
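A minimal sketch of spreading one sample's contribution with trilinear weights; `row` and `col` are the sample's continuous coordinates in the \(n_{hists} \times n_{hists}\) grid, `obin` its continuous orientation bin, and `w` its Gaussian-weighted gradient magnitude (all names are illustrative):

```python
# A minimal sketch of trilinear accumulation into an (n_hists, n_hists, n_ori) histogram array.
import numpy as np

def trilinear_accumulate(hist, row, col, obin, w, n_hists=4, n_ori=8):
    r0, c0, o0 = int(np.floor(row)), int(np.floor(col)), int(np.floor(obin))
    dr, dc, do = row - r0, col - c0, obin - o0
    for ri, rw in ((r0, 1 - dr), (r0 + 1, dr)):          # two nearest rows
        for ci, cw in ((c0, 1 - dc), (c0 + 1, dc)):      # two nearest columns
            if not (0 <= ri < n_hists and 0 <= ci < n_hists):
                continue                                  # contribution falls outside the grid
            for oi, ow in ((o0 % n_ori, 1 - do), ((o0 + 1) % n_ori, do)):  # wrap angle bins
                hist[ri, ci, oi] += w * rw * cw * ow
```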
The accumulated array of histograms after interpolation is encoded into a feature vector \(\mathbf{f}\) of length \(n_{hists} \times n_{hists} \times n_{ori} = 128\). The vector is normalized to unit length and each component is then saturated (clamped) at 20% of the norm, which reduces the impact of non-linear illumination changes. Finally, the vector is renormalized, scaled by 512, and quantized to 8-bit integers, which accelerates the computation of distances between feature vectors when matching.
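A minimal sketch of this finalization step, assuming `hist` is the accumulated \(4 \times 4 \times 8\) array from the previous sketch; the 0.2 saturation value and the 512 scaling follow the description above:

```python
# A minimal sketch of turning the accumulated histograms into the final 128-d descriptor.
import numpy as np

def finalize_descriptor(hist):
    f = hist.ravel()                          # 4 x 4 x 8 -> 128 components
    f = f / (np.linalg.norm(f) + 1e-12)       # normalize to unit length
    f = np.minimum(f, 0.2)                    # saturate large components at 20% of the norm
    f = f / (np.linalg.norm(f) + 1e-12)       # renormalize after clipping
    return np.minimum(np.floor(512 * f), 255).astype(np.uint8)   # scale by 512, quantize to 8 bits
```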
What’s next?
Next, I’ll be explaining how to match the feature keypoints gathered from using SIFT in order to stitch images together.
Additional Notes
The code I provided is a barebones implementation, so it is likely suboptimal and prone to errors. I found that the original paper alone did not provide enough information on how to construct the algorithm.
I did a little digging and found an article that goes into rigorous detail on how to go about implementing the algorithm. I referenced it a lot when constructing the algorithm but left out some details. If I didn’t explain something very well or you would like to understand the algorithm more clearly, I suggest you take a look at it.