Computer Vision and Deep Learning
UNIT - III
Convolutional Neural Network
Notes
Introduction: Convolutional Neural Network (CNN)

A Convolutional Neural Network (CNN) is a type of Deep Learning architecture commonly used for image classification and recognition tasks. It consists of multiple layers, including convolutional layers, pooling layers, and fully connected layers. The convolutional layer applies filters to the input image to extract features, the pooling layer downsamples the image to reduce computation, and the fully connected layer makes the final prediction. The network learns the optimal filters through backpropagation and gradient descent.

Artificial Neural Networks are used in various classification tasks involving images, audio, and words. Different types of neural networks are used for different purposes: for example, for predicting a sequence of words we use Recurrent Neural Networks (more precisely, an LSTM), while for image classification we use Convolutional Neural Networks.

Convolutional Neural Network

Convolutional Neural Networks, or convnets, are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid with a width and height (the spatial dimensions of the image) and a depth (as images generally have red, green, and blue channels).

Now imagine taking a small patch of this image and running a small neural network on it, with, say, k outputs, represented vertically. Now slide that neural network across the whole image; as a result, we get another image with a different width, height, and depth. Instead of just R, G, and B channels we now have more channels, but a smaller width and height. This operation is called convolution. If the patch size were the same as that of the image, it would be a regular neural network. Because of this small patch, we have fewer weights. Now let's look at the mathematics involved in the convolution process.

● Convolution layers consist of a set of learnable filters (the patch in the example above). Every filter has a small width and height and the same depth as the input volume (3 if the input layer is an image).
● For example, to run a convolution on an image of dimensions 34x34x3, the possible filter sizes are a×a×3, where 'a' can be 3, 5, 7, etc., but small compared to the image dimensions (see the sketch after this list).
● During the forward pass, we slide each filter across the whole input volume step by step; each step is called a stride (which can have a value of 2, 3, or even 4 for high-dimensional images), and at each position we compute the dot product between the filter weights and the patch of the input volume.
● As we slide our filters we get a 2-D output for each filter; stacking these together gives an output volume with a depth equal to the number of filters. The network will learn all the filters.
● Layers used to build ConvNets: a convnet is a sequence of layers, and every layer transforms one volume to another through a differentiable function.
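To make the arithmetic concrete, here is a minimal NumPy sketch (not part of the original notes) of the output-size formula and the per-position dot product; the filter count and random values are illustrative assumptions:

    import numpy as np

    def conv_output_size(w, f, stride=1, padding=0):
        """Spatial output size of a convolution: (W - F + 2P) / S + 1."""
        return (w - f + 2 * padding) // stride + 1

    # The 34x34x3 example from the text, with ten 5x5x3 filters and stride 1:
    w, f, num_filters = 34, 5, 10
    out = conv_output_size(w, f)       # (34 - 5)/1 + 1 = 30
    print((out, out, num_filters))     # output volume: (30, 30, 10)

    # One output activation is a dot product between a filter and an image patch:
    image = np.random.rand(34, 34, 3)
    filt = np.random.rand(f, f, 3)
    patch = image[0:f, 0:f, :]         # the top-left 5x5x3 patch
    activation = np.sum(patch * filt)  # elementwise product, then sum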

Advantages of Convolutional Neural Networks (CNNs):
● End-to-end training, with no need for manual feature extraction.
● Can handle large amounts of data and achieve high accuracy.

Disadvantages of Convolutional Neural Networks (CNNs):
● Computationally expensive to train and require a lot of memory.
● Prone to overfitting if not enough data or proper regularization is used.
● Require a large amount of labeled data.
● Interpretability is limited; it is hard to understand what the network has learned.

----------------------------------------------------------------------------------------------------

CNN’s Basic Architecture

A CNN architecture consists of two key components:

● A convolution tool that separates and identifies the distinct features of an image for analysis, in a process known as feature extraction.
● A fully connected layer that takes the output of the convolution process and predicts the image’s class based on the features retrieved earlier.

The CNN is made up of three types of layers: convolutional layers, pooling layers, and fully-connected (FC) layers.

Convolutional Layers

This is the very first layer in the CNN and is responsible for extracting the different features from the input images. In this layer, the convolution operation is performed between the input image and a filter of a specific size MxM.

Fully Connected Layer

The Fully Connected (FC) layer comprises the weights and biases together with the neurons and is used to connect the neurons between two separate layers. FC layers usually form the last few layers of a CNN architecture, positioned just before the output layer.

Pooling Layer

The pooling layer is responsible for reducing the spatial size of the convolved feature. By significantly reducing the dimensions of the data, it reduces the computing power required to process it. A pooling layer is usually applied after a convolutional layer; its major goal is to lower the size of the convolved feature map to reduce computational expense. This is accomplished by reducing the connections between layers and operating independently on each feature map.

There are several sorts of pooling operations, depending on the mechanism utilised. In max pooling, the largest element is taken from each window of the feature map. In average pooling, the average of the elements in a predefined-size image segment is calculated. Sum pooling calculates the total sum of the components in the predefined section. The pooling layer is typically used to connect the convolutional layers and the FC layer.
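As a rough illustration of the pooling mechanics described above, here is a minimal NumPy sketch; the window size, stride, and input values are illustrative assumptions, not from the original notes:

    import numpy as np

    def pool2d(x, size=2, stride=2, mode="max"):
        """Downsample a 2-D feature map with max or average pooling."""
        h, w = x.shape
        out_h, out_w = (h - size) // stride + 1, (w - size) // stride + 1
        out = np.empty((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                window = x[i*stride:i*stride+size, j*stride:j*stride+size]
                out[i, j] = window.max() if mode == "max" else window.mean()
        return out

    fmap = np.arange(16, dtype=float).reshape(4, 4)
    print(pool2d(fmap, mode="max"))   # 4x4 map -> 2x2 map of window maxima
    print(pool2d(fmap, mode="avg"))   # 2x2 map of window averages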

The network makes adjustments to its weights for better accuracy; every full pass over the training image dataset is called an "epoch." The CNN goes through a series of epochs during training, adjusting its weights by small amounts each time. After each epoch, the neural network becomes a bit more accurate at classifying and correctly predicting the class of the training images. As the CNN improves, the adjustments made to the weights become smaller and smaller.

After training the CNN, we use a test dataset to verify its accuracy. The test dataset is a set of labelled images that were not included in the training process. Each image is fed to the CNN, and the output is compared to the actual class label of the test image. Essentially, the test dataset evaluates the prediction performance of the CNN. If a CNN's accuracy is good on its training data but bad on the test data, it is said to be "overfitting." This typically happens when the training dataset is too small.

----------------------------------------------------------------------------------------------------
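The epoch-by-epoch weight adjustment can be sketched in a few lines of PyTorch. This is a minimal illustration, not the notes' own training setup: the tiny model, random data, and hyperparameters are all stand-in assumptions.

    import torch
    from torch import nn

    # Illustrative stand-ins: a tiny CNN and random "images" with fake labels.
    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU(), nn.Flatten(),
                          nn.Linear(8 * 30 * 30, 10))
    images = torch.randn(64, 3, 32, 32)
    labels = torch.randint(0, 10, (64,))
    loss_fn = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    for epoch in range(5):              # each full pass over the data = one epoch
        optimizer.zero_grad()
        loss = loss_fn(model(images), labels)
        loss.backward()                 # backpropagation computes the gradients
        optimizer.step()                # gradient descent adjusts the weights
        print(f"epoch {epoch}: loss {loss.item():.4f}")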

Evolution of Convolutional Neural Network Architectures

The architecture of CNNs was inspired by the organization of the visual cortex of the human brain. A CNN is essentially a Deep Learning model that takes images, assigns learnable weights so that images can be differentiated from one another, and performs a given task such as image classification. Compared to hard-coded primitive approaches, CNNs can be trained to learn the required filters given enough training data. Over the years, there have been a number of developments in architectures to handle problems of computational efficiency and error rate, and to make further improvements in the domain.

AlexNet

Before the advent of AlexNet, CNNs had already been among the most sought-after models for object recognition. With a firm grip on the problem of overfitting, they are strong models that are quite easy to train, with performance similar to that of standard feedforward neural networks of the same size. Despite these qualities and the efficiency of their architecture, they proved expensive to apply at large scale to high-resolution images, which was exactly the problem when ImageNet arrived.

AlexNet Architecture

● Consists of eight layers: five convolutional layers and three fully connected layers (a sketch of this layout follows the list).
● Uses ReLU (Rectified Linear Units) in place of the tanh function, reaching a 25% training error rate on the CIFAR-10 dataset about six times faster than an equivalent network using tanh.
● Paved the way for multi-GPU training by splitting the neurons to be trained across multiple GPUs. This led to faster training times and allowed a bigger model to be trained.
● Uses overlapping pooling, in which neighboring pooling windows overlap rather than being disjoint, reducing the error by about 0.5%.
● The problem of overfitting increased with the use of 60 million parameters. This was taken care of by dropping out neurons with a predetermined probability (say 50%) and by data augmentation.
● The model won the 2012 version of the ImageNet competition with an error rate more than 11 percentage points lower than the runner-up.
● Though an amazingly powerful model, removing any of the convolutional layers drastically degrades the model's performance.
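For concreteness, the eight-layer layout described above can be sketched in PyTorch roughly as follows. The channel counts here follow one common variant (the original paper's numbers differ slightly); this is an illustrative sketch, not the original implementation:

    import torch
    from torch import nn

    # Five convolutional layers followed by three fully connected layers,
    # with ReLU activations, overlapping max pooling, and 50% dropout.
    alexnet_like = nn.Sequential(
        nn.Conv2d(3, 64, kernel_size=11, stride=4, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),          # overlapping pooling
        nn.Conv2d(64, 192, kernel_size=5, padding=2), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Conv2d(192, 384, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(384, 256, kernel_size=3, padding=1), nn.ReLU(),
        nn.Conv2d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(kernel_size=3, stride=2),
        nn.Flatten(),
        nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(),
        nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(),
        nn.Linear(4096, 1000),                          # 1000 ImageNet classes
    )

    print(alexnet_like(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])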

ZFNet

ZFNet was an improved version of AlexNet, proposed by Zeiler et al. (2013). The main reason ZFNet became widely popular is that it was accompanied by a better understanding of how CNNs work internally. Earlier, researchers were never fully sure why ConvNets work for computer vision; with ZFNet came a novel visualization technique based on a deconvolutional network. Deconvolution can be defined as the reconstruction of convolved features into a human-comprehensible visual form. This helped researchers know what the network was actually doing.

VGGNet

It was trained on the ImageNet dataset and achieved state-of-the-art results with up to 92.7% accuracy, beating GoogLeNet and Clarifai.
● It had an overwhelming ~138 million parameters to train, at least twice the number of parameters of other models used at the time; hence, it took weeks to train.
● It had a very systematic architecture: as we move to deeper layers, the image dimensions halve, while the number of channels (the number of filters used in each layer) doubles (a sketch of such a block follows this list).
● A prominent drawback of this model was that it was extremely slow to train and huge in size, making it less practical for real-time deployment.
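Assuming this section describes the VGG family (the ~138 million parameter figure matches VGG-16), the halving/doubling pattern can be sketched as a reusable block in PyTorch; the channel counts below are illustrative:

    import torch
    from torch import nn

    def vgg_block(in_ch, out_ch, num_convs):
        """A stack of 3x3 convolutions followed by 2x2 max pooling."""
        layers = []
        for _ in range(num_convs):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU()]
            in_ch = out_ch
        layers.append(nn.MaxPool2d(kernel_size=2, stride=2))  # halves H and W
        return nn.Sequential(*layers)

    # Channels double while spatial dimensions halve from block to block:
    net = nn.Sequential(vgg_block(3, 64, 2), vgg_block(64, 128, 2),
                        vgg_block(128, 256, 3))
    print(net(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 256, 28, 28])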

ResNet | ResNeXt

ResNet was put forward by He et al. in 2015, a model that could employ hundreds to thousands of layers whilst providing compelling performance. The problem with deep neural networks was the vanishing gradient: repeated multiplication of small gradients as the network goes deeper results in a vanishingly small gradient in the early layers.

ResNet introduces residual blocks with "shortcut connections" that skip one or more layers. These shortcuts perform identity mappings, and their outputs are added to the outputs of the stacked layers, giving the gradient a direct path backwards (a sketch of a residual block follows). With 152 layers (the deepest back then), ResNet won the ILSVRC 2015 classification competition with a top-5 error of 3.57%. With increasing demand in the research community, different interpretations of ResNet were developed; one treats ResNet as an ensemble of many smaller networks.
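A minimal PyTorch sketch of a residual block, assuming the standard conv-BN-ReLU layout (the notes do not specify the exact block internals):

    import torch
    from torch import nn

    class ResidualBlock(nn.Module):
        """Two 3x3 convolutions plus an identity shortcut: y = F(x) + x."""
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU()

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + x)   # the shortcut skips the stacked layers

    block = ResidualBlock(64)
    print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])

Because the shortcut is an identity mapping, gradients can flow straight through the addition, which is what lets ResNets grow to hundreds of layers without the gradient vanishing.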

Xie et al. proposed a variant of ResNet called ResNeXt, with blocks of cardinality 32. It is similar in appearance to the Inception module (both perform split-transform-merge); however, in ResNeXt the outputs of the different paths are added together, while in Inception they are depth-concatenated. Furthermore, in ResNeXt every path has the same topology, whereas Inception uses varying topologies for different paths (1x1, 3x3, 5x5 convolutions).

● The authors introduce cardinality, a hyperparameter that makes the model adaptable to different datasets; higher values increase accuracy.
● The input is divided into groups of feature maps, convolution is performed per group, and the outputs are concatenated along the depth dimension and fed into a 1x1 convolution layer, as sketched below.
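In practice the 32 parallel paths can be expressed as a single grouped convolution. A minimal PyTorch sketch, with illustrative channel counts, assuming the common bottleneck form of the ResNeXt block:

    import torch
    from torch import nn

    # A ResNeXt-style bottleneck: the grouped 3x3 convolution (groups=32)
    # is equivalent to 32 parallel paths whose outputs are depth-concatenated.
    cardinality = 32
    block = nn.Sequential(
        nn.Conv2d(256, 128, kernel_size=1), nn.ReLU(),     # reduce depth
        nn.Conv2d(128, 128, kernel_size=3, padding=1,
                  groups=cardinality), nn.ReLU(),          # 32 paths of 4 channels each
        nn.Conv2d(128, 256, kernel_size=1),                # restore depth
    )

    x = torch.randn(1, 256, 56, 56)
    print((block(x) + x).shape)   # add the shortcut, as in ResNet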

DenseNet | CondenseNet

The idea of DenseNet stemmed from the intuition that CNNs could be substantially deeper, more accurate, and more efficient to train if they contain shorter connections between layers close to the input and those close to the output. In short, every layer is connected to every other layer in a feed-forward fashion, as sketched below.
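A minimal sketch of this dense connectivity pattern in PyTorch, assuming the simplest form of a dense block (the growth rate and layer count are illustrative):

    import torch
    from torch import nn

    class DenseBlock(nn.Module):
        """Each layer receives the concatenation of all previous feature maps."""
        def __init__(self, in_channels, growth_rate, num_layers):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1)
                for i in range(num_layers))

        def forward(self, x):
            for layer in self.layers:
                x = torch.cat([x, torch.relu(layer(x))], dim=1)  # concat on depth
            return x

    block = DenseBlock(in_channels=16, growth_rate=12, num_layers=4)
    print(block(torch.randn(1, 16, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])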

CondenseNet uses learned group convolutions (for example, 3 groups with a condensation factor of C = 3). It learns a sparsified network automatically during the training process, producing a regular connectivity pattern that can be implemented efficiently using group convolutions.

● The filters of a layer are divided into multiple groups, and unnecessary features are removed from these groups during training.
● The groups of incoming features are learned; they are not predefined.
● For similar accuracy levels, it uses about a tenth of the computational power needed by traditional DenseNets.

----------------------------------------------------------------------------------------------------

Extra knowledge about the evolution of CNNs

LeNet

LeNet-5 was the first "famous" CNN architecture, developed by LeCun et al. (1998) for the recognition of handwritten digits. LeCun and his fellow researchers had been working on CNN models for a decade to come up with an efficient architecture. LeNet-5 is greatly responsible for inspiring deep learning researchers to develop the very efficient CNN models we use these days.

LeNet-5 Architecture

● The simple architecture was: INPUT -> CONV -> AVG_POOL -> CONV -> AVG_POOL -> FC -> FC -> OUTPUT
● Used the MNIST database for training.
● It was a very shallow CNN by modern standards, with only about 60,000 parameters to train for an input image of dimensions 32x32x1.
● As we go deeper into the model, the input image dimensions tend to decrease, while the number of channels in a layer tends to increase.

Inception

Inception networks were proposed by Szegedy et al. (2014) and brought the novel concept of multitasking to CNNs. Aiming to reduce the computational cost of CNNs, this architecture suggested that instead of building extensively deep networks, we can stack multiple convolutions in a single layer. The model also introduced the use of 1x1 filters for dimensionality reduction, in order to generate small-sized layers. In simple words, an inception module allows us to perform 3x3 CONV, 5x5 CONV, and MAX_POOL simultaneously, pass the data through 1x1 convolutions (before the CONVs and after the POOL), and finally concatenate the corresponding outputs across the third (depth) dimension; a sketch of such a module is given below. Inception networks paved the way for many other CNN architectures based on the same principles, such as GoogLeNet, Inception v3, Inception v4, Xception, etc., with some changes in the architecture. GoogLeNet is discussed below.

GoogLeNet

GoogLeNet was proposed by Szegedy et al. in 2015 as the initial version of the Inception family; this model put forward state-of-the-art image classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14) and secured first place in the classification task.
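A minimal PyTorch sketch of the inception module described above, with branch widths borrowed from the commonly cited GoogLeNet configuration (illustrative, not the notes' own specification):

    import torch
    from torch import nn

    class InceptionModule(nn.Module):
        """Parallel 1x1, 3x3, 5x5 convolutions and max pooling, concatenated
        on depth. 1x1 convolutions reduce depth before the 3x3/5x5 convs and
        after the pooling branch."""
        def __init__(self, in_ch):
            super().__init__()
            self.branch1 = nn.Conv2d(in_ch, 64, 1)
            self.branch3 = nn.Sequential(nn.Conv2d(in_ch, 96, 1), nn.ReLU(),
                                         nn.Conv2d(96, 128, 3, padding=1))
            self.branch5 = nn.Sequential(nn.Conv2d(in_ch, 16, 1), nn.ReLU(),
                                         nn.Conv2d(16, 32, 5, padding=2))
            self.branch_pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                             nn.Conv2d(in_ch, 32, 1))

        def forward(self, x):
            return torch.cat([self.branch1(x), self.branch3(x),
                              self.branch5(x), self.branch_pool(x)], dim=1)

    m = InceptionModule(192)
    print(m(torch.randn(1, 192, 28, 28)).shape)  # torch.Size([1, 256, 28, 28])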

ShuffleNet

● For ImageNet classification, it obtains a lower top-1 error than the MobileNet system, with roughly a 13x speedup over AlexNet at similar accuracy levels.
● It is found to consistently outperform MobileNet on platforms with lower computational power, paving the way for usage in mobile devices.

FractalNet

FractalNet is an interesting CNN because it drifts away from the trending ResNets and builds its own deep architecture without any residual blocks. Simply stated, FractalNet can be viewed as an alternative to ResNets for very deep networks. It was proposed by Larsson et al. (2017); one fractal unit is stacked up to form a fractal block, and these blocks stack up to form the FractalNet.

● Regularization is done using global (a single fixed path) and local (probability-based path) drop-path, in order to prevent overfitting.
● FractalNet outperforms several ResNets on numerous tasks.
● It achieves an accuracy of 92.61% on the ImageNet dataset, just a little higher than ResNet.

R-CNNs

R-CNNs are based on the proposition that only certain regions of an image contain the required features, and only these regions need to be fed to the CNN model, hence the name region-based CNNs. Their main application is in object detection, which has to be done in a lot of real-time systems. We discuss some of the R-CNNs below.

Fast R-CNN

Girshick (2015) improved his own R-CNN to create Fast R-CNN: instead of extracting the regions of interest first, the whole image is fed as input, and the regions of interest are extracted within the network and reshaped using a pooling layer. This drastically reduced training and test time, because thousands of regions of the same image no longer had to be fed into the model separately.

Region Proposal Network (RPN) / Faster R-CNN

Ren et al. (2015) proposed a new network in order to reduce the computation time of Fast R-CNN even further. Rather than performing selective search in the network, the image is first passed into a separate network called the RPN. This network has been trained to detect region proposals, and the output regions from the RPN are then fed into the CNN. Faster R-CNN made the R-CNN family fast enough to be deployed for real-time applications.