CVDL
Visual Features & Representation
Unit - 2 Notes
Edge:
An edge in an image is a sharp variation of the intensity function. In grayscale images this applies to the intensity or brightness of pixels; in color images it can also refer to sharp variations of color. An edge is distinguished from noise by possessing long-range structure. Properties of edges include gradient magnitude and orientation.
Edge Detection:
Edge detection is an image-processing technique used to identify points in a digital image with discontinuities, that is, sharp changes in image brightness. These points where the image brightness varies sharply are called the edges (or boundaries) of the image.
Blob:
A blob is, loosely, any large object or anything bright against a dark background. In images, we can generalize it as a group of pixel values forming a connected region that is distinguishable from its background. Using image processing, we can detect such blobs in an image.
Corner:
A corner is a point whose local neighborhood contains two dominant and different edge directions. In other words, a corner can be interpreted as the junction of two edges, where an edge is a sudden change in image brightness. Corners are important features in an image and are generally termed interest points; they are invariant to translation, rotation, and illumination.

Corner Detection:
Corner detection is an approach used within computer vision systems to extract certain kinds of features and infer the contents of an image. It is frequently used in motion detection, image registration, video tracking, image mosaicing, panorama stitching, 3D reconstruction, and object recognition, and it overlaps with the topic of interest-point detection.

Scale-space:
Real-world objects are meaningful only at a certain scale. You might see a sugar cube perfectly well on a table, but when looking at the entire Milky Way it simply does not exist at that scale. This multi-scale nature of objects is quite common in nature, and a scale space attempts to replicate this concept for digital images.

Concept of Edge Detection

Edge detection locates the presence and position of edges from changes in the intensity of an image. Different operators are used in image processing to detect edges; an edge operator responds to variations of grey level, but it also responds readily to noise. Edge detection is a very important task in image processing and is a main tool in pattern recognition, image segmentation, and scene analysis. It is a type of filter applied to extract the edge points in an image. Sudden changes in brightness occur where an object contour crosses a region of different brightness. In image processing, edges are interpreted as a single class of singularity; for a continuous function, a singularity is characterized as a discontinuity at which the gradient approaches infinity. Since image data is discrete, the edges of an image are instead defined as the local maxima of the gradient.

Edges mostly exist between objects, between primitives, and between objects and the background; the intensities reflected by different objects change discontinuously at their boundaries. Edge detection methods study the change of grey level at individual pixels, and edge detection is mostly used for measuring, detecting, and locating changes in image grey level. Edges are a basic feature of an image: the clearest parts of an object are its edges and lines, and with their help the structure of an object can be recognized. That is why extracting edges is a very important technique in graphics processing and feature extraction. The basic idea behind edge detection is as follows:

1. To highlight local edges, an edge enhancement operator is applied.

The gradient magnitude is |∇f| = √(Gx² + Gy²), where Gx and Gy are the horizontal and vertical derivatives. Because it is much faster to compute, the approximate magnitude |∇f| ≈ |Gx| + |Gy| is often used instead.
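As a minimal sketch (not part of the original notes), the exact and approximate gradient magnitudes can be computed with NumPy and OpenCV; the Sobel kernels and the file name "image.png" are illustrative placeholders.

```python
# Sketch: exact vs. approximate gradient magnitude (assumes OpenCV and NumPy;
# "image.png" is a placeholder path).
import cv2
import numpy as np

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE).astype(np.float64)

# Horizontal and vertical first derivatives (Sobel is used here purely as an
# example of a derivative operator).
gx = cv2.Sobel(img, cv2.CV_64F, 1, 0, ksize=3)
gy = cv2.Sobel(img, cv2.CV_64F, 0, 1, ksize=3)

magnitude = np.sqrt(gx**2 + gy**2)   # |grad f| = sqrt(Gx^2 + Gy^2)
approx    = np.abs(gx) + np.abs(gy)  # cheaper approximation |Gx| + |Gy|
```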

  1. Laplacian of Gaussian
The Laplacian of Gaussian is a 2-D isotropic measure of the second spatial derivative of an image. The Laplacian highlights regions of rapid intensity change and is therefore used for edge detection. It is applied to an image that has first been smoothed with a Gaussian smoothing filter in order to reduce its sensitivity to noise. The operator takes a single grey-level image as input and produces a single grey-level image as output. The Laplacian L(x, y) of an image with pixel intensity values I(x, y) is

L(x, y) = ∂²I/∂x² + ∂²I/∂y².

Because the input image is represented as a set of discrete pixels, a discrete convolution kernel that approximates the second derivatives in this definition must be found; three such kernels are commonly used.

Commonly used discrete approximations of the Laplacian include, for example, the 4-neighbour kernel [[0, 1, 0], [1, -4, 1], [0, 1, 0]] and the 8-neighbour kernel [[1, 1, 1], [1, -8, 1], [1, 1, 1]] (or their negated forms). The 2-D LoG function centred on zero with Gaussian standard deviation σ is

LoG(x, y) = -(1/(πσ⁴)) · [1 - (x² + y²)/(2σ²)] · e^(-(x² + y²)/(2σ²)).
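A hedged sketch of the smooth-then-Laplacian pipeline described above, assuming OpenCV and NumPy are available; the file name, the σ value, and the simplified zero-crossing test are illustrative choices rather than anything prescribed by the notes.

```python
# Sketch: Laplacian of Gaussian via "smooth first, then apply the Laplacian".
import cv2
import numpy as np

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE).astype(np.float64)

smoothed     = cv2.GaussianBlur(img, (0, 0), sigmaX=2.0)       # Gaussian smoothing
log_response = cv2.Laplacian(smoothed, cv2.CV_64F, ksize=3)    # Laplacian of the result

# Zero-crossings of the LoG response mark candidate edges
# (simplified check: sign change against the left horizontal neighbour).
zero_cross = np.zeros_like(log_response, dtype=bool)
zero_cross[:, 1:] = np.sign(log_response[:, 1:]) != np.sign(log_response[:, :-1])
```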

  2. Prewitt operator
The Prewitt operator is a differentiation operator used to compute an approximation of the gradient of the image intensity function. At each point in the image, it yields the corresponding gradient vector (or its norm). The image is convolved in the horizontal and vertical directions with small, separable, integer-valued filters, which makes the operator inexpensive in terms of computation.
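A minimal sketch of Prewitt filtering, assuming OpenCV and NumPy; the kernel sign convention and the placeholder file name are illustrative.

```python
# Sketch: Prewitt filtering with explicit kernels.
import cv2
import numpy as np

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE).astype(np.float64)

prewitt_x = np.array([[-1, 0, 1],
                      [-1, 0, 1],
                      [-1, 0, 1]], dtype=np.float64)   # horizontal derivative
prewitt_y = prewitt_x.T                                # vertical derivative

gx = cv2.filter2D(img, cv2.CV_64F, prewitt_x)
gy = cv2.filter2D(img, cv2.CV_64F, prewitt_y)
grad_mag = np.hypot(gx, gy)                            # gradient magnitude
```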

CORNER DETECTOR

Harris Corner Detector is a corner detection operator commonly used in computer vision algorithms to extract corners and infer features of an image. It was first introduced by Chris Harris and Mike Stephens in 1988 as an improvement of Moravec's corner detector. Compared with its predecessor, the Harris detector takes the differential of the corner score into account with reference to direction directly, instead of using shifted patches.

Remember that we want the sum of squared differences (SSD) between a window and its shifted copies to be large for shifts in all eight directions, or conversely, to be small for none of the directions. By solving for the eigenvectors of the structure matrix M, we obtain the directions of both the largest and the smallest increase in SSD, and the corresponding eigenvalues give the actual amount of these increases. A score R is then calculated for each window:

R = det(M) - k·(trace(M))² = λ1·λ2 - k·(λ1 + λ2)²,

where λ1 and λ2 are the eigenvalues of M and k is an empirically chosen constant (typically 0.04 to 0.06). The values of these eigenvalues decide whether a region is a corner, an edge, or flat:
● When |R| is small, which happens when λ1 and λ2 are both small, the region is flat.
● When R < 0, which happens when λ1 >> λ2 or vice versa, the region is an edge.
● When R is large, which happens when λ1 and λ2 are large and λ1 ≈ λ2, the region is a corner.
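A short sketch of the Harris response using OpenCV's cornerHarris, assuming typical but arbitrary parameter values; the thresholds used to classify corners, edges, and flat regions are illustrative.

```python
# Sketch: Harris corner response map R = det(M) - k * trace(M)^2.
import cv2
import numpy as np

img  = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE)
gray = np.float32(img)

# blockSize: neighbourhood for the structure matrix M; ksize: Sobel aperture.
R = cv2.cornerHarris(gray, blockSize=2, ksize=3, k=0.04)

corners = R > 0.01 * R.max()   # large positive R  -> corner
edges   = R < 0                # negative R        -> edge
# |R| small everywhere else    -> flat region
```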

SCALE SPACE AND SCALE SELECTION

Scale-space theory is a framework for multi-scale signal representation developed by the computer vision, image processing, and signal processing communities, with complementary motivations from physics and biological vision. It is a formal theory for handling image structures at different scales by representing an image as a one-parameter family of smoothed images, the scale-space representation, parametrized by the size of the smoothing kernel used to suppress fine-scale structures. The parameter t in this family is referred to as the scale parameter, with the interpretation that image structures of spatial size smaller than about √t have largely been smoothed away at scale-space level t.

The main type of scale space is the linear (Gaussian) scale space, which has wide applicability as well as the attractive property of being derivable from a small set of scale-space axioms. The corresponding framework encompasses a theory for Gaussian derivative operators, which can be used as a basis for expressing a large class of visual operations in computerized systems that process visual information. This framework also allows visual operations to be made scale invariant, which is necessary for dealing with the size variations that may occur in image data: real-world objects may be of different sizes, and in addition the distance between the object and the camera may be unknown and may vary with the circumstances.

Scale selection: The theory presented so far describes a well-founded framework for representing image structures at multiple scales. In many cases, however, it is also necessary to select locally appropriate scales for further analysis. This need for scale selection originates from two major sources: (i) real-world objects may have different sizes, and this size may be unknown to the vision system, and (ii) the distance between the object and the camera can vary, and this distance may also be unknown.

SIFT (Scale-Invariant Feature Transform) detects and describes keypoints in the following stages:
● Scale-space peak selection: potential locations for finding features.
● Keypoint localization: accurately locating the feature keypoints.
● Orientation assignment: assigning an orientation to each keypoint.
● Keypoint descriptor: describing each keypoint as a high-dimensional vector.
● Keypoint matching.

The scale space of an image is a function L(x, y, σ) produced by convolving a Gaussian kernel (blurring) at different scales with the input image. The scale space is separated into octaves, and the number of octaves and scales depends on the size of the original image. Several octaves of the original image are generated, each octave's image being half the size of the previous one.

Blurring: Within an octave, images are progressively blurred using the Gaussian blur operator. Mathematically, "blurring" is the convolution of the Gaussian operator with the image:

L(x, y, σ) = G(x, y, σ) * I(x, y), with G(x, y, σ) = (1 / (2πσ²)) · e^(-(x² + y²) / (2σ²)),

where G is the Gaussian blur operator, I is the image, x and y are the location coordinates, and σ is the "scale" parameter. Think of σ as the amount of blur: the greater the value, the greater the blur.

DoG (Difference of Gaussians): The blurred images are then used to generate another set of images, the Differences of Gaussians (DoG). These DoG images are well suited to finding interesting keypoints in the image.
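A rough sketch of building Gaussian octaves and their DoGs, assuming OpenCV and NumPy; the number of octaves, scales per octave, and base σ are illustrative values rather than SIFT's exact constants.

```python
# Sketch: Gaussian octaves and Differences of Gaussians (DoG).
import cv2
import numpy as np

img = cv2.imread("image.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)

num_octaves, scales_per_octave, sigma0, k = 4, 5, 1.6, 2 ** 0.5

gaussian_pyramid, dog_pyramid = [], []
base = img
for _ in range(num_octaves):
    # Progressively blurred images within one octave.
    octave = [cv2.GaussianBlur(base, (0, 0), sigma0 * k ** i)
              for i in range(scales_per_octave)]
    gaussian_pyramid.append(octave)
    # DoG: difference of adjacent blur levels within the octave.
    dog_pyramid.append([octave[i + 1] - octave[i]
                        for i in range(scales_per_octave - 1)])
    # The next octave starts from an image of half the size.
    base = cv2.resize(base, (base.shape[1] // 2, base.shape[0] // 2))
```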

The DoG is obtained as the difference between Gaussian blurrings of an image with two different values of σ, say σ and kσ. This process is repeated for the different octaves of the image in the Gaussian pyramid.

Finding keypoints: So far, we have generated a scale space and used it to calculate the Differences of Gaussians, which in turn serve as scale-invariant approximations of the Laplacian of Gaussian. Each pixel in a DoG image is compared with its 8 neighbours at the same scale as well as the 9 pixels in the next scale and the 9 pixels in the previous scale, a total of 26 checks. If it is a local extremum, it is a potential keypoint; this essentially means that the keypoint is best represented at that scale.

Keypoint localization: The previous step produces a lot of keypoints. Some of them lie along an edge, or they do not have enough contrast; in both cases they are not useful as features, so we discard them.
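A minimal sketch of the 26-neighbour extremum test, assuming `dog` is a list of equally sized DoG images for one octave; boundary handling and the contrast/edge rejection steps are omitted.

```python
# Sketch: is the DoG sample at (scale s, row y, col x) an extremum of its
# 26 neighbours (3x3 patches in the scale below, the same scale, and above)?
import numpy as np

def is_extremum(dog, s, y, x):
    value = dog[s][y, x]
    cube = np.stack([dog[s - 1][y - 1:y + 2, x - 1:x + 2],
                     dog[s    ][y - 1:y + 2, x - 1:x + 2],
                     dog[s + 1][y - 1:y + 2, x - 1:x + 2]])
    # The centre must be strictly the largest or the smallest of all 27 samples.
    is_max = value == cube.max() and (cube == value).sum() == 1
    is_min = value == cube.min() and (cube == value).sum() == 1
    return is_max or is_min
```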

Orientation assignment: A neighbourhood is taken around the keypoint location, its size depending on the scale, and the gradient magnitude and direction are calculated in that region. An orientation histogram with 36 bins covering 360 degrees is created. Say the gradient direction at a certain point in the orientation-collection region is 18.759 degrees; it then goes into the 10 to 19 degree bin, and the amount added to that bin is proportional to the gradient magnitude at that point. Once this has been done for all pixels around the keypoint, the histogram will have a peak at some orientation. The highest peak is taken, and any peak above 80% of it is also used to compute an orientation. This can create keypoints with the same location and scale but different directions, which contributes to the stability of matching.
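A simplified sketch of the 36-bin, magnitude-weighted orientation histogram, assuming `patch` is a grayscale neighbourhood already extracted around the keypoint; SIFT's Gaussian weighting and peak interpolation are omitted.

```python
# Sketch: 36-bin orientation histogram, weighted by gradient magnitude.
import numpy as np

def orientation_histogram(patch):
    gy, gx = np.gradient(patch.astype(np.float64))       # row (y) and column (x) derivatives
    magnitude = np.hypot(gx, gy)
    angle = np.degrees(np.arctan2(gy, gx)) % 360.0        # directions in [0, 360)

    hist, _ = np.histogram(angle, bins=36, range=(0.0, 360.0),
                           weights=magnitude)             # magnitude-weighted bins
    dominant_bin_start = np.argmax(hist) * 10.0           # e.g. 18.759 deg lands in the 10-19 deg bin
    return hist, dominant_bin_start
```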

Keypoint descriptor: At this point each keypoint has a location, scale, and orientation. Next, a descriptor is computed for the local image region around each keypoint that is as distinctive and as invariant as possible to variations such as changes in viewpoint and illumination. To do this, a 16x16 window around the keypoint is taken and divided into sixteen 4x4 sub-blocks. For each sub-block, an 8-bin orientation histogram is created; 4 x 4 descriptors over the 16 x 16 sample array are used in practice, and 4 x 4 x 8 directions give 128 bin values. These are represented as a feature vector to form the keypoint descriptor. This feature vector introduces a few complications that we need to get rid of before finalizing the fingerprint (a short sketch of the fix follows the list):

  1. Rotation dependence: The feature vector uses gradient orientations, and clearly, if the image is rotated, all gradient orientations change as well. To achieve rotation independence, the keypoint's orientation is subtracted from each gradient orientation, so that each gradient orientation is relative to the keypoint's orientation.
  2. Illumination dependence: Thresholding large values gives illumination independence. The descriptor is first normalized to unit length; then any component (of the 128) greater than 0.2 is clipped to 0.2, and the resulting vector is normalized again. The outcome is an illumination-independent feature vector.
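A small sketch of the normalize, clip at 0.2, renormalize step described above, applicable to any raw 128-element descriptor vector; the helper name is made up for illustration.

```python
# Sketch: illumination normalization of a SIFT-style descriptor.
import numpy as np

def normalize_descriptor(desc, clip=0.2):
    desc = np.asarray(desc, dtype=np.float64)
    desc = desc / (np.linalg.norm(desc) + 1e-12)   # unit length
    desc = np.minimum(desc, clip)                  # suppress large gradient magnitudes
    return desc / (np.linalg.norm(desc) + 1e-12)   # renormalize
```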

SPEEDED UP ROBUST FEATURES (SURF)

With the integral image I_Σ calculated, it takes only four array references (additions and subtractions) to compute the sum of the intensities over any upright rectangular area, independent of its size.

Hessian matrix-based interest points: SURF uses the Hessian matrix because of its good performance in computation time and accuracy. Rather than using different measures for selecting the location and the scale (as the Hessian-Laplace detector does), SURF relies on the determinant of the Hessian matrix for both. For a given pixel, the Hessian consists of the second-order partial derivatives of the image intensity at that pixel. To adapt to any scale, the image is filtered with a Gaussian kernel, so given a point X = (x, y), the Hessian matrix H(X, σ) at X and scale σ is defined as

H(X, σ) = [ Lxx(X, σ)  Lxy(X, σ) ]
          [ Lxy(X, σ)  Lyy(X, σ) ],

where Lxx(X, σ) is the convolution of the Gaussian second-order derivative ∂²g(σ)/∂x² with the image I at point X, and similarly for Lxy(X, σ) and Lyy(X, σ).

Gaussians are optimal for scale-space analysis, but in practice they have to be discretized and cropped. This leads to a loss of repeatability under image rotations around odd multiples of π/4, a weakness that holds for Hessian-based detectors in general. Nevertheless, the detectors still perform well, and the slight decrease in performance does not outweigh the advantage of the fast convolutions brought by discretization and cropping. To calculate the determinant of the Hessian matrix, we first need to apply convolution with a Gaussian kernel and then take second-order derivatives. After Lowe's success with LoG approximations (SIFT), SURF pushes the approximation (both the convolution and the second-order derivative) even further with box filters. These approximate second-order Gaussian derivatives and can be evaluated at very low computational cost using integral images, independently of size, which is part of the reason why SURF is fast.
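A minimal sketch of an integral image and the four-lookup box sum it enables, using NumPy only; the helper names are illustrative.

```python
# Sketch: integral image and constant-time box sums.
import numpy as np

def integral_image(img):
    # ii[y, x] = sum of img[0:y+1, 0:x+1]
    return np.cumsum(np.cumsum(img.astype(np.float64), axis=0), axis=1)

def box_sum(ii, top, left, bottom, right):
    """Sum of img[top:bottom+1, left:right+1] from four integral-image lookups."""
    total = ii[bottom, right]
    if top > 0:
        total -= ii[top - 1, right]
    if left > 0:
        total -= ii[bottom, left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1, left - 1]
    return total
```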

(Figure: 9 x 9 box-filter approximations of the Gaussian second-order partial derivatives in the y direction and in the xy direction.)

The 9 x 9 box filters shown above are approximations of Gaussian second-order derivatives with σ = 1.2. We denote these approximations by Dxx, Dyy, and Dxy. The determinant of the (approximated) Hessian can then be written as

det(H_approx) = Dxx·Dyy - (w·Dxy)², with w = 0.9 (Bay's suggestion).

Scale-space representation: Scale spaces are usually implemented as image pyramids: the images are repeatedly smoothed with a Gaussian and subsequently sub-sampled to reach the next higher level of the pyramid. Thanks to the use of box filters and integral images, SURF does not have to apply the same filter iteratively to the output of a previously filtered layer; instead it can apply filters of any size at exactly the same speed directly on the original image, and even in parallel. Therefore the scale space is analyzed by up-scaling the filter size (9x9 → 15x15 → 21x21 → 27x27, etc.) rather than by iteratively reducing the image size. For each new octave, the increase in filter size is doubled, and simultaneously the sampling interval for the extraction of interest points (σ) can be doubled as well, which allows up-scaling of the filter at constant cost. To localize interest points in the image and over scales, non-maximum suppression in a 3 x 3 x 3 neighbourhood is applied.
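A hedged sketch of the approximated Hessian determinant and the 3 x 3 x 3 non-maximum suppression, assuming Dxx, Dyy, and Dxy are precomputed box-filter response stacks of shape (num_scales, H, W); the threshold value is arbitrary.

```python
# Sketch: det(H_approx) = Dxx*Dyy - (w*Dxy)^2 plus 3x3x3 non-maximum suppression
# over (scale, row, column).
import numpy as np
from scipy.ndimage import maximum_filter

def surf_keypoints(Dxx, Dyy, Dxy, w=0.9, threshold=1e-3):
    det = Dxx * Dyy - (w * Dxy) ** 2                 # approximated Hessian determinant
    # Keep samples that equal the maximum of their 3x3x3 neighbourhood
    # and exceed the threshold.
    local_max = maximum_filter(det, size=(3, 3, 3))
    keypoints = np.argwhere((det == local_max) & (det > threshold))
    return keypoints                                  # rows of (scale_index, y, x)
```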

Descriptor components
Now it's time to extract the descriptor.

  1. The first step consists of constructing a square region centred on the keypoint and oriented along the orientation obtained above. The size of this window is 20s, where s is the scale at which the keypoint was detected.
  2. The region is then split up regularly into smaller 4 x 4 square sub-regions. For each sub-region, a few simple features are computed at 5 x 5 regularly spaced sample points. For simplicity, we call dx the Haar wavelet response in the horizontal direction and dy the Haar wavelet response in the vertical direction (filter size 2s). To increase robustness towards geometric deformations and localization errors, the responses dx and dy are first weighted with a Gaussian (σ = 3.3s) centred at the keypoint. The wavelet responses dx and dy are then summed over each sub-region and form a first set of entries of the feature vector. To bring in information about the polarity of the intensity changes, the sums of the absolute values of the responses, |dx| and |dy|, are also extracted. Hence each sub-region has a four-dimensional descriptor vector v = (∑dx, ∑dy, ∑|dx|, ∑|dy|) for its underlying intensity structure. This results in a descriptor of length 64 over all 4 x 4 sub-regions (in SIFT the descriptor is a 128-D vector, which is part of the reason SURF is faster than SIFT). A sketch of this aggregation follows the list.
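A simplified sketch of aggregating the sub-region sums into a 64-D SURF-style descriptor, assuming `dx` and `dy` are already Gaussian-weighted Haar responses sampled on a 20 x 20 grid inside the oriented window; computing the Haar responses themselves is not shown.

```python
# Sketch: (sum dx, sum dy, sum |dx|, sum |dy|) per 4x4 sub-region -> 64-D vector.
import numpy as np

def surf_descriptor(dx, dy):
    vec = []
    for i in range(4):
        for j in range(4):
            sub_dx = dx[5 * i:5 * (i + 1), 5 * j:5 * (j + 1)]
            sub_dy = dy[5 * i:5 * (i + 1), 5 * j:5 * (j + 1)]
            vec += [sub_dx.sum(), sub_dy.sum(),
                    np.abs(sub_dx).sum(), np.abs(sub_dy).sum()]
    vec = np.array(vec)                               # 4 x 4 x 4 = 64 entries
    return vec / (np.linalg.norm(vec) + 1e-12)        # unit length for contrast invariance
```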

HISTOGRAM OF ORIENTED GRADIENTS (HoG)

Histogram of Oriented Gradients, also known as HOG, is a feature descriptor like the Canny edge detector or SIFT (Scale-Invariant Feature Transform). It is used in computer vision and image processing for the purpose of object detection. The technique counts occurrences of gradient orientations in localized portions of an image, and in that respect it is quite similar to Edge Orientation Histograms and to SIFT. The HOG descriptor focuses on the structure or shape of an object. It is richer than a plain edge descriptor because it uses both the magnitude and the angle of the gradient to compute the features: for each region of the image, it generates histograms using the magnitudes and orientations of the gradient.

LOCAL BINARY PATTERNS (LBP)

Local Binary Pattern (LBP) is a simple yet very efficient texture operator which labels the pixels of an image by thresholding the neighbourhood of each pixel and interpreting the result as a binary number. Due to its discriminative power and computational simplicity, the LBP texture operator has become a popular approach in various applications. It can be seen as a unifying approach to the traditionally divergent statistical and structural models of texture analysis. Perhaps the most important property of the LBP operator in real-world applications is its robustness to monotonic gray-scale changes caused, for example, by illumination variations. Another important property is its computational simplicity, which makes it possible to analyze images in challenging real-time settings.

VISUAL MATCHING - BAG-OF-WORDS

In computer vision, the bag-of-words model (BoW model), sometimes called the bag-of-visual-words model, can be applied to image classification or retrieval by treating image features as words. In document classification, a bag of words is a sparse vector of occurrence counts of words, that is, a sparse histogram over the vocabulary. In computer vision, a bag of visual words is a vector of occurrence counts of a vocabulary of local image features.

Image representation based on the BoW model: To represent an image using the BoW model, the image can be treated as a document. Similarly, "words" in images need to be defined, which usually involves three steps: feature detection, feature description, and codebook generation. A definition of the BoW model is therefore the "histogram representation based on independent features". Content-based image indexing and retrieval (CBIR) appears to be the early adopter of this image representation technique.

Feature representation: After feature detection, each image is abstracted by several local patches. Feature representation methods deal with how to represent the patches as numerical vectors, called feature descriptors. A good descriptor should be able to handle intensity, rotation, scale, and affine variations to some extent. One of the most famous descriptors is the Scale-Invariant Feature Transform (SIFT), which converts each patch into a 128-dimensional vector. After this step, each image is a collection of vectors of the same dimension (128 for SIFT), where the order of the different vectors is of no importance.
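A compact sketch of the bag-of-visual-words pipeline described above (feature detection and description with SIFT, codebook generation with k-means, and the histogram representation), assuming OpenCV, NumPy, and scikit-learn are available; the image paths and the vocabulary size are placeholders.

```python
# Sketch: bag-of-visual-words histogram from SIFT descriptors.
import cv2
import numpy as np
from sklearn.cluster import KMeans

sift = cv2.SIFT_create()
train_paths = ["img1.png", "img2.png"]          # placeholder training images

# 1) Feature detection + description: collect 128-D SIFT descriptors.
all_desc = []
for path in train_paths:
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(gray, None)
    if desc is not None:
        all_desc.append(desc)
all_desc = np.vstack(all_desc)

# 2) Codebook generation: cluster descriptors into k visual words.
k = 100
codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_desc)

# 3) Image representation: histogram of visual-word occurrences.
def bow_histogram(path):
    gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    _, desc = sift.detectAndCompute(gray, None)
    if desc is None:
        return np.zeros(k)
    words = codebook.predict(desc)
    return np.bincount(words, minlength=k).astype(np.float64)
```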