An introduction to Convolutional Neural Networks (CNNs), a type of Deep Learning architecture used for image classification and recognition tasks. It explains the different layers of CNNs, including Convolutional layers, Pooling layers, and fully connected layers, and how they work together to extract features and make predictions. The document also covers topics such as backpropagation, activation functions, and training CNNs. Additionally, it discusses the evolution of CNN architectures, including AlexNet and ResNet.
Introduction: Convolutional Neural Network (CNN)

A Convolutional Neural Network (CNN) is a type of Deep Learning architecture commonly used for image classification and recognition tasks. It consists of multiple layers, including convolutional layers, pooling layers, and fully connected layers. The convolutional layer applies filters to the input image to extract features, the pooling layer downsamples the image to reduce computation, and the fully connected layer makes the final prediction. The network learns the optimal filters through backpropagation and gradient descent.

Artificial Neural Networks are used in various classification tasks involving images, audio, and text. Different types of neural networks are used for different purposes: for predicting a sequence of words we use Recurrent Neural Networks (more precisely, an LSTM), while for image classification we use Convolutional Neural Networks.

Convolutional Neural Network

Convolutional Neural Networks, or ConvNets, are neural networks that share their parameters. Imagine you have an image. It can be represented as a cuboid having a length and width (the dimensions of the image) and a height (since images generally have red, green, and blue channels).
Now imagine taking a small patch of this image and running a small neural network on it, with, say, k outputs, represented vertically. Slide that neural network across the whole image and, as a result, we get another image with a different width, height, and depth. Instead of just the R, G, and B channels, we now have more channels but a smaller width and height. This operation is called convolution. If the patch size is the same as that of the image, it is a regular neural network. Because of this small patch, we have fewer weights. Now let's look at the mathematics involved in the convolution process.
● Convolution layers consist of a set of learnable filters (the patch described above). Every filter has a small width and height and the same depth as the input volume (3 if the input is an RGB image).
● For example, suppose we run a convolution on an image of dimensions 34x34x3. The possible filter sizes are a×a×3, where 'a' can be 3, 5, 7, etc., but small compared to the image dimensions.
● During the forward pass, we slide each filter across the whole input volume step by step, where each step is called a stride (which can be 2, 3, or even 4 for high-dimensional images), and compute the dot product between the filter weights and the corresponding patch of the input volume. A worked example of this shape arithmetic follows this list.
● As we slide our filters, we get a 2-D output for each filter; stacking these together gives an output volume with a depth equal to the number of filters. The network learns all the filters.
● Layers used to build ConvNets: a ConvNet is a sequence of layers, and every layer transforms one volume into another through a differentiable function.
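A minimal sketch of this shape arithmetic, using PyTorch purely for illustration (the 34x34x3 input and the choice of six 5x5 filters are just the example values from above):

```python
import torch
import torch.nn as nn

# A batch holding one 34x34 RGB image, matching the 34x34x3 example above.
x = torch.randn(1, 3, 34, 34)

# Six learnable 5x5 filters ('a' = 5), each with depth 3, slid with stride 1.
conv = nn.Conv2d(in_channels=3, out_channels=6, kernel_size=5, stride=1)

out = conv(x)
print(out.shape)  # torch.Size([1, 6, 30, 30]); depth 6 = number of filters
```

Each filter produces one 30x30 map ((34 - 5) / 1 + 1 = 30), and the six maps are stacked to form the output volume.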
Advantages of Convolutional Neural Networks (CNNs):
● End-to-end training; no need for manual feature extraction.
● Can handle large amounts of data and achieve high accuracy.

Disadvantages of Convolutional Neural Networks (CNNs):
● Computationally expensive to train, and they require a lot of memory.
● Prone to overfitting when there is not enough data or proper regularization is not used.
● Require a large amount of labeled data.
● Interpretability is limited; it is hard to understand what the network has learned.

—--------------------------------------------------------------------------------------------------

CNN's Basic Architecture

A CNN architecture consists of two key components: a feature-extraction stage built from convolutional and pooling layers, and a classification stage built from fully connected layers. A minimal sketch of this split follows.
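The sizes in this sketch (32x32 RGB inputs, 10 output classes) are purely illustrative and do not come from the notes:

```python
import torch
import torch.nn as nn

# Component 1: feature extraction via convolution + pooling.
features = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),  # downsamples 32x32 -> 16x16
)

# Component 2: classification via a fully connected layer.
classifier = nn.Sequential(
    nn.Flatten(),
    nn.Linear(16 * 16 * 16, 10),  # 16 channels x 16 x 16 spatial positions
)

logits = classifier(features(torch.randn(1, 3, 32, 32)))
print(logits.shape)  # torch.Size([1, 10])
```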
The Fully Connected (FC) layer comprises the weights and biases together with the neurons and is used to connect the neurons of two different layers. FC layers usually form the last few layers of a CNN architecture, positioned just before the output layer.

Pooling layer

The pooling layer is responsible for reducing the spatial size of the convolved feature map, and this significant reduction in dimensionality decreases the computing power required to process the data. A pooling layer is usually applied after a convolutional layer; it operates on each feature map independently and reduces the connections between layers. There are several kinds of pooling operations, depending on the mechanism utilised: in max pooling, the largest element is taken from each window of the feature map; in average pooling, the average of the elements in a predefined-size image segment is calculated; and in sum pooling, the total sum of the components in the predefined section is calculated. The pooling layer typically serves as a bridge between the convolutional layer and the FC layer. The two most common variants are shown in the sketch below.
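A small sketch contrasting max and average pooling on a hand-made 4x4 feature map (the values are chosen only for illustration):

```python
import torch
import torch.nn as nn

x = torch.tensor([[[[1., 3., 2., 4.],
                    [5., 6., 1., 2.],
                    [7., 2., 8., 1.],
                    [3., 4., 2., 9.]]]])  # one 4x4 feature map

max_pool = nn.MaxPool2d(kernel_size=2)  # largest element per 2x2 window
avg_pool = nn.AvgPool2d(kernel_size=2)  # mean of each 2x2 window

print(max_pool(x))  # [[6., 4.], [7., 9.]]
print(avg_pool(x))  # [[3.75, 2.25], [4.00, 5.00]]
```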
Every complete run of the training image dataset through the network, with small weight adjustments for better accuracy, is called an "epoch." The CNN goes through a series of epochs during training, adjusting its weights by the required small amounts. After each epoch, the neural network becomes a little more accurate at classifying and correctly predicting the class of the training images. As the CNN improves, the adjustments made to the weights become correspondingly smaller.

After training the CNN, we use a test dataset to verify its accuracy. The test dataset is a set of labelled images that were not included in the training process. Each image is fed to the CNN, and the output is compared to the actual class label of the test image. Essentially, the test dataset evaluates the prediction performance of the CNN. If a CNN's accuracy is good on its training data but bad on the test data, the model is said to be "overfitting." This usually happens when the training dataset is too small. A hedged sketch of this train-then-test procedure follows.
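In this sketch, `model`, `train_loader`, and `test_loader` are hypothetical names assumed to exist (e.g. built with torch.utils.data), not objects defined in these notes:

```python
import torch
import torch.nn as nn

def train(model, train_loader, epochs=10, lr=1e-3):
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for epoch in range(epochs):  # one full pass over the dataset = one epoch
        for images, labels in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()   # backpropagation
            optimizer.step()  # small adjustment of the weights

def evaluate(model, test_loader):
    correct = total = 0
    with torch.no_grad():
        for images, labels in test_loader:  # images never seen in training
            predictions = model(images).argmax(dim=1)
            correct += (predictions == labels).sum().item()
            total += labels.numel()
    return correct / total  # a large gap vs. training accuracy suggests overfitting
```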
—--------------------------------------------------------------------------------------------------

The architecture of CNNs was inspired by the organization of the visual cortex of the human brain. A CNN is essentially a Deep Learning model that takes images, assigns learnable weights that allow them to be differentiated from one another, and performs a given task such as image classification. Compared to hard-coded, primitive approaches, CNNs can be trained to learn the required filters given enough training data. Over the years, there have been a number of developments in CNN architectures to address computational efficiency, error rate, and further improvements in the domain.
AlexNet

Before the advent of AlexNet, CNNs had been among the most sought-after models for object recognition. With a firm grip on the problem of overfitting, they are strong models that are quite easy to train, with performance close to that of standard feedforward neural networks of the same size. Despite these qualities and the efficiency of their architecture, they proved expensive to apply at large scale to high-resolution images, which was exactly the problem when ImageNet arrived.
AlexNet Architecture
● Consists of eight layers: five convolutional layers and three fully connected layers.
● Uses ReLU (Rectified Linear Units) in place of the tanh function; on the CIFAR-10 dataset, a ReLU network reached a 25% training error about six times faster than an equivalent network using tanh.
● Paved the way for multi-GPU training by splitting the neurons to be trained across multiple GPUs. This led to faster training times and allowed a bigger model to be trained.
● Uses overlapping pooling, where the outputs of neighboring groups of neurons overlap rather than being pooled disjointly, reducing the error by about 0.5%.
● The problem of overfitting grew with the use of 60 million parameters. It was handled by dropping out neurons with a predetermined probability (say 50%) and by data augmentation.
● The model won the 2012 edition of the ImageNet competition, beating the runner-up by an error margin of more than 11%.
● Though an amazingly powerful model, removing any of the convolutional layers drastically degrades its performance.
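For reference, recent versions of torchvision ship an AlexNet implementation, which makes it easy to inspect the five-convolutional/three-FC structure described above (a sketch, assuming torchvision is installed):

```python
import torch
from torchvision import models

# Build AlexNet without pretrained weights just to inspect its structure.
alexnet = models.alexnet(weights=None)
print(alexnet)  # five Conv2d layers in .features, three Linear layers in .classifier

out = alexnet(torch.randn(1, 3, 224, 224))  # a random 224x224 RGB input
print(out.shape)  # torch.Size([1, 1000]); one logit per ImageNet class
```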
ZFNet

ZFNet was an improved version of AlexNet, proposed by Zeiler et al. (2013). The main reason ZFNet became widely popular is that it was accompanied by a better understanding of how CNNs work internally. Earlier, researchers were never fully sure why ConvNets worked for computer vision; with ZFNet came a novel visualization technique based on a deconvolutional network. Deconvolution can be defined as the reconstruction of convolved features into a human-comprehensible visual form. Hence, it helped researchers see exactly what their networks were doing.
VGGNet

VGGNet was trained on the ImageNet dataset and achieved state-of-the-art results with up to 92.7% accuracy, beating GoogLeNet and Clarifai.
● It had an overwhelming ~138 million parameters to train, at least twice the number of parameters of other models in use at the time. Hence, it took weeks to train.
● It had a very systematic architecture: moving to deeper layers, the image dimensions halve, while the number of channels (i.e., the number of filters used in each layer) doubles; a toy sketch of this pattern follows this list.
● A prominent drawback of this model was that it was extremely slow to train and huge in size, making it less practical for real-time deployment.
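The starting values in this sketch mirror VGG-16's 224x224 input and 64-channel first stage, and the 512-channel cap is VGG's usual upper bound:

```python
# Spatial size halves after each pooling stage while the channel count doubles.
size, channels = 224, 64
for stage in range(5):
    print(f"stage {stage}: {size}x{size}x{channels}")
    size //= 2
    channels = min(channels * 2, 512)  # VGG does not grow beyond 512 channels
```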
ResNet

ResNet was put forward by He et al. in 2015, a model that could employ hundreds to thousands of layers while providing compelling performance. The problem with deep neural networks was the vanishing gradient: as the network gets deeper, repeated multiplication during backpropagation results in vanishingly small gradients.

Residual Block

ResNet introduces "shortcut connections" that skip one or more layers. These shortcuts perform identity mappings, and their outputs are added to the outputs of the stacked layers. With 152 layers (the deepest back then), ResNet won the ILSVRC 2015 classification competition with a top-5 error of 3.57%. With increasing interest from the research community, different interpretations of ResNet were developed; the following model treats ResNet as an ensemble of many smaller networks. A minimal sketch of a residual block follows.
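Real ResNets also include batch normalization, which this sketch omits for brevity:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        # Shortcut connection: the identity-mapped input is added to the
        # output of the stacked layers, giving gradients a direct path back.
        return self.relu(out + x)

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```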
ResNeXt

A block of ResNeXt with cardinality = 32. Xie et al. proposed this variant of ResNet, called ResNeXt. It is similar in appearance to the Inception module (both perform split-transform-merge); however, in ResNeXt the outputs of the different paths are added together, while in Inception they are depth-concatenated. Furthermore, every path in ResNeXt shares the same topology, whereas Inception uses varying topologies for different paths (1x1, 3x3, 5x5 convolutions).
● The authors introduce cardinality, a hyperparameter that makes the model adaptable to different datasets; higher values increase accuracy.
● The input is divided into groups of feature maps that are convolved separately; the outputs are then concatenated along the depth and fed into a 1x1 convolution layer, as in the sketch below.
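In PyTorch terms, cardinality corresponds to the `groups` argument of a convolution; the channel counts here are illustrative:

```python
import torch
import torch.nn as nn

# 128 input channels split into 32 groups (cardinality = 32) of 4 channels,
# each group convolved with its own set of filters.
grouped = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32)

# The depth-concatenated group outputs are fused by a 1x1 convolution.
fuse = nn.Conv2d(128, 256, kernel_size=1)

x = torch.randn(1, 128, 56, 56)
print(fuse(grouped(x)).shape)  # torch.Size([1, 256, 56, 56])
```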
DenseNet

The idea of DenseNet stemmed from the intuition that CNNs could be substantially deeper, more accurate, and more efficient to train if they contained shorter connections between layers close to the input and layers close to the output. In short, every layer is connected to every other layer in a feed-forward fashion, as sketched below.
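A minimal sketch of this dense connectivity, assuming an illustrative growth rate of 12 new channels per layer:

```python
import torch
import torch.nn as nn

growth = 12  # channels added by each layer (a DenseNet hyperparameter)
layers = nn.ModuleList(
    nn.Conv2d(16 + i * growth, growth, kernel_size=3, padding=1)
    for i in range(4)
)

x = torch.randn(1, 16, 32, 32)
for layer in layers:
    # Each layer sees the concatenation of all earlier feature maps.
    x = torch.cat([x, torch.relu(layer(x))], dim=1)
print(x.shape)  # torch.Size([1, 64, 32, 32]); 16 + 4 * 12 channels
```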
CondenseNet

Learned group convolutions with 3 groups and a condensation factor C = 3. CondenseNet learns a sparsified network automatically during the training process, producing a regular connectivity pattern that can be implemented efficiently with group convolutions.
● The filters of a layer are divided into multiple groups, and unnecessary features are removed from these groups during training.
● The groups of incoming features are learned; they are not predefined.
● For similar accuracy levels, it uses about 1/10th of the computational power needed by traditional DenseNets.

—--------------------------------------------------------------------------------------------------

Extra knowledge about the evolution of CNNs

LeNet

LeNet-5 was the first "famous" CNN architecture, developed by LeCun et al. (1998) for the recognition of handwritten digits. LeCun and his fellow researchers had worked on CNN models for a decade before arriving at an efficient architecture. LeNet-5 is greatly responsible for inspiring deep learning researchers to develop the very efficient CNN models we use these days.

LeNet-5 Architecture
● The simple architecture was as follows: INPUT -> CONV -> AVG_POOL -> CONV -> AVG_POOL -> FC -> FC -> OUTPUT
● Used the MNIST database for training.
● It was a very shallow CNN by modern standards, with only about 60,000 parameters to train for an input image of dimensions 32x32x1.
● As we go deeper into the model, the input image dimensions tend to decrease, while the number of channels in a layer tends to increase.

Inception

Inception networks were proposed by Szegedy et al. (2014) and brought along the novel concept of performing several operations in parallel within a single CNN layer. Aiming to reduce the computational cost of CNNs, this architecture suggested that instead of building ever deeper networks, we can stack multiple convolutions in a single layer. The model also introduced the use of 1x1 filters for dimensionality reduction, in order to generate smaller layers. In simple words, an inception network allows us to perform, for example, a 3x3 CONV, a 5x5 CONV, and a MAX_POOL simultaneously, pass the data through 1x1 convolutions (before the larger convolutions and after the pooling), and finally concatenate the corresponding outputs along the 3rd dimension (see the sketch after this subsection).

An inception module

Inception networks paved the way for many other CNN architectures based on the same principles, such as GoogLeNet, Inception v3, Inception v4, and Xception, with some changes in architecture. GoogLeNet is discussed below.

GoogLeNet

GoogLeNet was proposed by Szegedy et al. in 2015 as the initial version of Inception; this model put forward state-of-the-art image classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC14) and secured first place in the competition.
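A hedged sketch of an inception module; the branch widths (16 channels each) and the 1x1 reduction sizes are illustrative choices, not the GoogLeNet values:

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 16, 1)                  # 1x1 branch
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 8, 1),    # 1x1 reduces depth first
                                nn.Conv2d(8, 16, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, 8, 1),
                                nn.Conv2d(8, 16, 5, padding=2))
        self.bp = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                nn.Conv2d(in_ch, 16, 1))   # 1x1 after pooling

    def forward(self, x):
        # The parallel branches are concatenated along the channel dimension.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

m = InceptionModule(32)
print(m(torch.randn(1, 32, 28, 28)).shape)  # torch.Size([1, 64, 28, 28])
```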
ShuffleNet
● For ImageNet classification, it obtains a lower top-1 error than the MobileNet system, with a roughly 13x speedup over AlexNet at similar accuracy levels.
● It is found to consistently outperform MobileNet on platforms with lower computational power. Hence, it paves the way for usage on mobile devices in the future.

FractalNet

FractalNet is an interesting CNN because it drifts away from the trend of ResNets and builds its own deep architecture without any residual blocks. Simply stated, FractalNet can be viewed as an alternative to ResNets for very deep networks. Proposed by Larsson et al. (2017), a single fractal unit is stacked to form a fractal block, and fractal blocks then stack to form the FractalNet.

Fractal Architecture
● Regularization is done using global (a single fixed path) and local (probability-based path) drop-path, in order to prevent overfitting.
● FractalNet outperforms several ResNets on numerous tasks.
● It achieves an accuracy of 92.61% on the ImageNet dataset, just a little higher than ResNet.

R-CNNs

R-CNNs build on the proposition that only certain regions of an image contain the required features, and that only these regions need be fed to the CNN models, hence the name region-based CNNs. Their main application is object detection, which has to be performed in many real-time systems. We discuss some of the R-CNNs below.
Fast R-CNN Architecture

Girshick (2015) improved his own R-CNN to create Fast R-CNN: instead of extracting the regions of interest first, the whole image is fed as input, and the regions of interest are extracted inside the network and reshaped using a pooling layer (an RoI pooling sketch appears at the end of this section). This drastically reduced training and test time, because thousands of regions of the same image no longer had to be fed through the model individually.

Region Proposal Network (RPN) / Faster R-CNN Architecture

Ren et al. (2015) proposed a new network to reduce the computation time of Fast R-CNN even further. Rather than relying on selective search for region proposals, the image is first passed into a separate network called the RPN. This network is trained to detect region proposals, and the output regions from the RPN are then fed into a CNN. Faster R-CNN made the R-CNN family fast enough to be deployed in real-time applications.
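As a hedged illustration of the RoI pooling step that Fast R-CNN introduced, using torchvision's `roi_pool` operator; the feature-map size and the two region proposals below are hypothetical:

```python
import torch
from torchvision.ops import roi_pool

features = torch.randn(1, 256, 50, 50)  # feature map of the whole image

# Two hypothetical region proposals: (batch_index, x1, y1, x2, y2).
rois = torch.tensor([[0.,  0.,  0., 20., 20.],
                     [0., 10., 15., 40., 45.]])

# Each region is cropped from the shared feature map and pooled to 7x7,
# so every proposal yields a fixed-size tensor for the FC layers.
pooled = roi_pool(features, rois, output_size=(7, 7))
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```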