Final Exam Notes for NLP (Natural Language Processing) — Study Notes

These notes cover the important topics discussed in class.

Q) What is deep learning?
Ans:
Deep learning is a type of machine learning that involves training artificial neural networks to
perform tasks. In simple terms, it's a way to teach computers to learn and make decisions by
simulating how the human brain works.
Here's a breakdown:
1. Neural Networks: Deep learning relies on neural networks, which are structures inspired by
the human brain. These networks consist of interconnected nodes (artificial neurons)
organized into layers. Information flows through these layers, and the network learns to
recognize patterns and relationships in the data.
2. Deep Neural Networks: "Deep" in deep learning refers to the depth of these networks,
indicating that they have multiple layers. More layers allow the network to learn complex
features and hierarchies in the data.
3. Applications: Deep learning is used in various applications, such as image and speech
recognition, language translation, playing games, and even autonomous vehicles. The
ability to automatically learn and adapt to different tasks makes deep learning a powerful
tool in the field of artificial intelligence.
Q) What are artificial neural networks?
Ans: Artificial Neural Networks (ANNs) are computational models inspired by the structure and
functioning of the human brain. They are a fundamental component of deep learning. Let's break
down the key concepts:
1. Neurons: The basic building blocks of artificial neural networks are artificial neurons, often
referred to as nodes or perceptrons. These are designed to simulate the functioning of
biological neurons in the human brain.
2. Layers: ANNs consist of layers of neurons. The three main types of layers are:
Input Layer: Neurons in this layer receive the initial input data.
Hidden Layers: Intermediate layers between the input and output layers where
complex patterns are learned.
Output Layer: Neurons in this layer produce the final output or prediction.
3. Connections (Weights): Neurons are connected to each other with weights, which represent
the strength of the connection. During training, these weights are adjusted to improve the
network's performance.
4. Activation Function: Each neuron applies an activation function to its input, determining
whether it should be "activated" (produce an output). Common activation functions include
the sigmoid, tanh, and ReLU (Rectified Linear Unit).
5. Feedforward and Backpropagation: The process of passing data through the network from
input to output is called feedforward. During training, the network learns by adjusting the
weights based on the error in its predictions. This process is known as backpropagation.
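To make the layers, weights, and activation functions concrete, here is a minimal Keras sketch of such a network (an illustrative example, not from the notes; assumes TensorFlow is installed, and the toy data and layer sizes are arbitrary):

```python
import numpy as np
import tensorflow as tf

# Toy data: 4 samples, 2 input features, binary labels (illustrative only).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([0., 1., 1., 0.])

# Input layer -> one hidden layer -> output layer, as described above.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2,)),               # input layer
    tf.keras.layers.Dense(4, activation="relu"),     # hidden layer (weights + ReLU)
    tf.keras.layers.Dense(1, activation="sigmoid"),  # output layer (prediction)
])

# Training = feedforward plus backpropagation to adjust the weights.
model.compile(optimizer="sgd", loss="binary_crossentropy")
model.fit(X, y, epochs=100, verbose=0)
print(model.predict(X, verbose=0))
```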


Q) Why did deep learning models start to outperform machine learning?
Ans:
• A lot of data
• Faster machines, GPUs
• New models/algorithms

Q) Role of Bias
• In a neural network, each neuron (or node) receives input signals, applies weights to these inputs, and produces an output.
• Bias is an additional parameter associated with each neuron. It allows the neuron to output non-zero values even when all the inputs are zero.
• The bias term essentially allows the neuron to introduce an offset or shift in the output, providing flexibility in modeling more complex relationships in the data.

Q) What is the difference between weights and activation functions, in simple terms, with an example?
Ans:
1. Weights:
• Role: Weights are like adjustable knobs on the connections between neurons in a neural network.
• Function: They control how much influence one neuron has on another, determining the strength of connections.
• Purpose: Adjusting weights during training allows the network to learn and adapt to different patterns in the data.

2. Activation Functions:
• Role: Activation functions are like switches at the output of each neuron.
• Function: They decide whether a neuron should "fire" (activate) based on the weighted sum of its inputs.
• Purpose: They introduce non-linearity, allowing the network to capture complex patterns and relationships in the data.
In essence, weights adjust the connections between neurons, while activation functions determine whether a neuron should contribute to the network's output. Together, they enable the network to learn and make predictions by adapting to the characteristics of the input data.
Example scenario: Imagine you have a neural network for predicting whether someone will play tennis based on two features: weather (sunny or cloudy) and temperature (hot or mild). The network has two input neurons (one for each feature) and one output neuron (predicting play or not play).
Weights:
• Each connection between an input neuron and the output neuron has a weight. Let's say the weights are adjusted during training to emphasize the importance of weather over temperature. The weights might be:
• Weight from the sunny neuron: 0.

• During backpropagation, the weights are adjusted based on the gradient of the loss function with respect to the weights, using optimization algorithms like stochastic gradient descent.

Let's use a simple example to illustrate the forward pass and backpropagation in a neural network. Consider a single-layer neural network for binary classification.

Forward Pass:

1. Initialization:
• We have one input feature (x) and one output neuron (y).
• The initial weight (w) and bias (b) are set to small random values.
2. Forward Pass Steps:
• Input (x): Let's say our input is 0.5.
• Weighted Sum: Multiply the input by the weight and add the bias.
• Activation Function: Pass the weighted sum through an activation function (e.g., sigmoid).
• Output (y): The result is the predicted output.
Mathematically:
Weighted Sum = (0.5 × w) + b
Output = sigmoid(Weighted Sum)

Backpropagation:
1. Compute Loss:
• Compare the predicted output with the actual output using a loss function (e.g., mean squared error):
Loss = (1/2) × (Actual Output − Predicted Output)²
2. Backpropagate Error:
• Calculate the gradient of the loss with respect to the weight and bias:

Gradient of Loss w.r.t. Weight: dw = −(Actual Output − Predicted Output) × sigmoid′(Weighted Sum) × Input
Gradient of Loss w.r.t. Bias: db = −(Actual Output − Predicted Output) × sigmoid′(Weighted Sum)

3. Update Weights and Bias:
• Adjust the weight and bias using the calculated gradients and a learning rate:
New Weight = Old Weight − Learning Rate × dw
New Bias = Old Bias − Learning Rate × db
4. Repeat:
• Repeat the forward pass and backpropagation for multiple iterations (epochs) until the network learns the patterns in the data and minimizes the loss.

Q) Learning Through Backpropagation
  • Backpropagation takes the difference between the predicted value and the actual value and uses that error term to adjust each node’s weights.
  • The process works backwards from the final layers to earlier layers, one layer at a time, and computes the contribution that each weight in the given layer had in the loss value.
• The algorithm that adjusts the weights to reduce the loss value is called "gradient descent": it iteratively moves in the direction of greatest improvement in prediction.
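The worked example above maps directly to code. Here is a minimal NumPy sketch of the same single-neuron forward pass and gradient-descent update (the input value, target, learning rate, and epoch count are illustrative assumptions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (not from the notes): one input, one target label.
x, target = 0.5, 1.0
w, b = np.random.randn() * 0.01, 0.0  # small random weight, zero bias
lr = 0.5                              # learning rate

for epoch in range(100):
    # Forward pass: weighted sum, then sigmoid activation.
    z = x * w + b
    pred = sigmoid(z)

    # Loss = 1/2 * (Actual - Predicted)^2
    loss = 0.5 * (target - pred) ** 2

    # Backpropagation, matching the gradients above.
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    dz = -(target - pred) * pred * (1 - pred)
    dw = dz * x
    db = dz

    # Gradient-descent update.
    w -= lr * dw
    b -= lr * db

print(f"final prediction: {sigmoid(x * w + b):.3f}, loss: {loss:.5f}")
```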

Word2Vec transforms words into numerical vectors in a continuous vector space. The key idea behind Word2Vec is to learn distributed representations of words based on their contextual usage in a large corpus of text.

How It Works:

1. Contextual Learning:
• Word2Vec learns by predicting the context of words in sentences. It considers the surrounding words to understand the meaning of a target word.
2. Continuous Vector Space:
• Words are represented as high-dimensional vectors in a continuous vector space.
• Similar words have similar vectors, meaning they are close to each other in the vector space.

Two Architectures:
1. Continuous Bag-of-Words (CBOW):
• Predicts the target word given its context (surrounding words).
• Effective for smaller datasets and frequent words.
2. Skip-gram:
• Predicts the context words given a target word.
• Better for larger datasets and capturing rare words.

Example: Consider the sentence: "The cat sat on the mat."
• For CBOW, given the context "The cat on the mat," predict the target word "sat."
• For Skip-gram, given the target word "sat," predict the context words "cat," "on," and "the."
A minimal training sketch follows; the (simplified) learned vectors are shown after it.
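Here is a minimal gensim sketch that trains both architectures on a toy corpus (gensim's Word2Vec class is real; the tiny corpus and parameter values are illustrative, and meaningful vectors require a large corpus):

```python
from gensim.models import Word2Vec

# Toy corpus: one tokenized sentence (a real corpus would be far larger).
sentences = [["the", "cat", "sat", "on", "the", "mat"]]

# sg=0 -> CBOW (predict target from context); sg=1 -> Skip-gram.
cbow = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=0)
skipgram = Word2Vec(sentences, vector_size=10, window=2, min_count=1, sg=1)

print(cbow.wv["sat"])                   # 10-dimensional vector for "sat"
print(skipgram.wv.most_similar("sat"))  # nearest words in the vector space
```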

The learned vectors might look like this (simplified):

Word | Vector Representation
cat  | [0.2, 0.8, -0.5, ...]
sat  | [-0.1, 0.7, 0.4, ...]
on   | [0.3, 0.6, -0.2, ...]
the  | [-0.5, 0.2, 0.9, ...]
mat  | [0.4, 0.3, -0.7, ...]

GloVe (Global Vectors for Word Representation):
GloVe is an unsupervised learning algorithm for obtaining vector representations (embeddings) of words based on global statistical information about their co-occurrence in a large corpus of text.

Key Concepts:

1. Co-occurrence Statistics:
• GloVe builds on the idea that the meaning of words can be captured by examining how often they co-occur with other words.
• It constructs a global co-occurrence matrix that represents word-word relationships.
2. Word Embeddings:
• The goal is to learn vector representations (embeddings) for words that encode their semantic relationships.
• Each word is associated with a vector in a continuous vector space.
3. Objective Function:
• GloVe's training objective is to learn word vectors that preserve the ratios of co-occurrence probabilities.
• It aims to minimize the difference between the dot product of word vectors and the logarithm of the observed co-occurrence counts (X_ij).

Example: Consider the following sentences:
  4. "The cat sat on the mat."
  5. "The dog sat on the rug." The co-occurrence matrix might represent the following (simplified) counts: the cat sat on mat dog rug the 2 1 2 2 1 1 1

• Special considerations are taken to address the problem of X_ij being 0 and to handle stop words and rare words appropriately.

In simpler terms, GloVe is a method that learns word representations by looking at how often words appear together globally in a large dataset. It aims to capture the nuances of word relationships based on their co-occurrence probabilities. The training process involves addressing issues related to zero entries, stop words, and rare words to create meaningful word embeddings.

Difference between Word2Vec and GloVe:
• Context:
  Word2Vec focuses on local context, making it effective for capturing word similarities within nearby contexts.
  GloVe considers global context, enabling it to capture broader semantic relationships.
• Data Requirements:
  Word2Vec can perform well with smaller datasets.
  GloVe often benefits from larger corpora to extract meaningful global co-occurrence statistics.
• Usage:
  Word2Vec may be preferred for tasks like sentiment analysis, document clustering, and word analogy tasks.
  GloVe may excel in capturing global semantic relationships and is suitable for various NLP applications.

Word Embedding – fastText:
Explanation in Simple Terms:
• fastText is a word embedding technique that not only considers individual words but also takes into account subword information.
• It breaks words into smaller subword components, allowing it to capture morphological information and handle out-of-vocabulary words better.
Example:
• Consider the word "playing." In fastText, it can be broken down into subwords like "play," "ing," and "<pl" (a special token representing the beginning of the word).

• By considering subwords, fastText can understand relationships between words even if they share common subword components.

Key Points:

1. Subword Information:
• fastText looks beyond entire words and considers smaller subword components.
• Example: For "playing," it analyzes "play," "ing," and "<pl."
2. Morphological Information:
• It captures information about word structure, including prefixes and suffixes.
• Example: Understanding relationships between "run," "running," and "runner."
3. Handling Out-of-Vocabulary Words:
• Since it works with subwords, fastText can generate representations for new or uncommon words based on their subword components.
• Example: Even if it hasn't seen the word "unbelievable," it may have encountered the subwords "un," "believ," and "able."

Summary: fastText is like a word detective that doesn't just look at words as a whole but examines their building blocks. By considering subwords, it gains insights into word structure, handles variations of words, and is more adaptable to out-of-vocabulary terms.
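The out-of-vocabulary behavior is easy to demonstrate with gensim's FastText implementation (a minimal sketch; the toy corpus and parameters are illustrative):

```python
from gensim.models import FastText

# Tiny tokenized corpus (illustrative; real training needs far more text).
sentences = [
    ["i", "believe", "running", "is", "fun"],
    ["the", "runner", "was", "playing", "outside"],
]

model = FastText(sentences, vector_size=20, window=3, min_count=1, epochs=10)

# "unbelievable" never appears in the corpus, but FastText can still build
# a vector for it from character n-grams shared with "believe".
print("unbelievable" in model.wv.key_to_index)  # False: out of vocabulary
print(model.wv["unbelievable"][:5])             # vector built from subwords
```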

Imp Q) What are some of the problems with Word2Vec/GloVe? Word Sense Disambiguation? (Disadvantages)

Ans:
Issue in Simple Terms:
• Word embeddings like Word2Vec and GloVe struggle with distinguishing between different senses or meanings of a word, a problem known as Word Sense Disambiguation (WSD).

Explanation:

1. Limited Context Understanding:
• Word embeddings consider the overall context a word appears in but may not capture different meanings in distinct contexts.
• Example: "Bank" can mean a financial institution or the side of a river, and word embeddings might not differentiate well.
2. Single Vector Representation:
• Each word is represented by a single vector, regardless of its various meanings.
• Example: The word "bat" could refer to sports equipment or a flying mammal, and a single vector may not capture both meanings effectively.
3. Polysemy Challenge:
ELMo (Embeddings from Language Models):
1. Multiple Embeddings:
• Each word can have multiple embeddings, reflecting its diverse meanings in varied contexts.
2. Improved Understanding:
• ELMo's embeddings enhance language understanding by considering the nuanced meaning of words.

Summary: ELMo goes beyond traditional embeddings by considering the diversity of word meanings in different sentence contexts. It provides a more nuanced understanding of language, making it valuable for tasks where context matters, such as natural language understanding and sentiment analysis.

BERT (Bidirectional Encoder Representations from Transformers):
Explanation in Simple Terms:
• BERT is a powerful language model that understands the context and relationships between words in a sentence by considering both the left and right sides of each word.

Key Features:
1. Bidirectional Understanding:
• Unlike traditional models that read text in one direction, BERT looks at all words in a sentence simultaneously, capturing the full context.
2. Transformer Architecture:
• BERT uses a transformer architecture, which excels at handling sequential data by attending to different parts of a sequence.
3. Pre-trained on Large Datasets:
• BERT is pre-trained on massive amounts of text data, allowing it to learn the nuances of language comprehensively.
4. Versatility in Tasks:
• BERT's pre-trained knowledge can be fine-tuned for various natural language processing tasks like text classification, sentiment analysis, and question answering.

Example:
• For the sentence "The cat is on the mat," BERT understands the relationships between words like "cat" and "mat" and how their positions impact meaning.

Why BERT Matters:
• BERT's bidirectional approach and transformer architecture make it exceptionally effective in understanding context, resulting in state-of-the-art performance on a wide range of language tasks.

Summary: BERT is a language model that excels at understanding context by considering both directions of a sentence. Its versatility and ability to capture intricate language nuances make it a groundbreaking tool in natural language processing.

BERT is trained on 2 NLP tasks:

1. Masked Language Modeling (MLM):
Objective: Predicting Missing Words
Process:
• Sentences are randomly selected.
• Certain words in these sentences are replaced with a special [MASK] token.
• The model is trained to predict the original words that were replaced with [MASK].
• Additionally, random words are substituted, and the model learns to distinguish between masked and random words.
• About 80% of the time, words are replaced with [MASK], 10% with random words, and 10% remain unchanged.
Example:
• Original Sentence: "This turns out to be the greatest thing that has ever happened to me."
• Masked Sentence: "This turns out to be the [MASK] thing that has ever happened to me."
Why Masked Language Modeling Matters:
• BERT learns contextual representations by understanding the relationships between words, even when some words are masked. This enhances its ability to grasp language nuances.

2. Next Sentence Prediction (NSP):
Objective: Predicting if Sentences Follow Each Other
Process:
• Pairs of sentences are chosen.
• The model is trained to determine if the second sentence logically follows the first.
• About 50% of the time, the second sentence follows the first in the original text, and 50% of the time, it is a random sentence.
Example:
• Pair of Sentences:
  1. "I love sunny days."
  2. "It makes me feel happy."
• BERT learns to predict whether sentence 2 follows sentence 1.
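The MLM objective can be tried directly with a pre-trained BERT through the Hugging Face transformers library (a minimal sketch; assumes transformers plus a backend such as PyTorch are installed, and the bert-base-uncased weights download on first use):

```python
from transformers import pipeline

# The fill-mask pipeline uses BERT's masked-language-modeling head.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

# Same masked sentence as the example above.
preds = unmasker("This turns out to be the [MASK] thing that has ever happened to me.")
for pred in preds:
    print(f"{pred['token_str']:>10}  (score: {pred['score']:.3f})")
```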

4. Significance:
• XLNet combines bidirectional context, permutation language modeling, and autoregressive modeling.
• This enhances its ability to understand intricate relationships and dependencies between words.

Summary: XLNet stands out by incorporating bidirectional context, avoiding the need for word masking in pre-training. It achieves this through permutation language modeling and autoregressive modeling, enabling a more comprehensive understanding of word relationships.

Q) What is a CNN?
Ans:
CNN in Simple Terms:

1. Concept:
• Convolutional Neural Network (CNN): a specialized neural network for image-related tasks.
2. Structure:
• Convolutional Layers: detect patterns like edges, textures, and shapes.
• Pooling Layers: downsample and reduce spatial dimensions.
• Fully Connected Layers: make decisions based on the patterns detected.
3. Example (Recognizing Objects in Images):
• Convolutional Layer: takes image pixels as input and detects edges, textures, or patterns (e.g., the edges of a cat's ears).
• Pooling Layer: takes the convolutional layer's output and downsamples it, reducing spatial dimensions (e.g., highlighting key features like a cat's face).
• Fully Connected Layer: takes the flattened output from the pooling layer and makes decisions based on the detected features (e.g., deciding whether the image contains a cat).
4. Significance:
• CNNs excel in image-related tasks due to their ability to recognize hierarchical patterns.
• They're used in image classification, object detection, and facial recognition.

Summary: CNNs are neural networks designed for image-related tasks. They consist of convolutional layers, pooling layers, and fully connected layers, allowing them to recognize intricate patterns in images. An example involves detecting features like edges and shapes in an image to make decisions about its content.

Key Insights From Mammalian Vision

  • An image is not processed, perceived or understood in one huge lump
  • The vision system considers small chunks of the visual field and extracts key features from each
  • Features are combined at later stages of processing into something recognizable as an object
• This insight suggests that at the lowest level we can slide a small "receptive window" over input data – convolution – to process small chunks of input.

Q) Two Keras Model Types; Many Types of Layers Supported
Partial list:
• Preprocessing layers (e.g., text)
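To tie the CNN structure and the Keras layer types together, here is a minimal sketch of the convolution → pooling → fully-connected pipeline using the Sequential model type (Keras's other main model-building style is the Functional API; the layer sizes and 10-class output here are illustrative assumptions):

```python
import tensorflow as tf

# Sequential: stack layers in order, matching the CNN structure above.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(28, 28, 1)),                      # e.g., grayscale images
    tf.keras.layers.Conv2D(16, kernel_size=3, activation="relu"),  # slide a 3x3 receptive window (convolution)
    tf.keras.layers.MaxPooling2D(pool_size=2),                     # downsample spatial dimensions
    tf.keras.layers.Conv2D(32, kernel_size=3, activation="relu"),  # deeper layer: more complex features
    tf.keras.layers.MaxPooling2D(pool_size=2),
    tf.keras.layers.Flatten(),                                     # flatten for the dense layers
    tf.keras.layers.Dense(10, activation="softmax"),               # fully connected: final decision
])
model.summary()
```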

Popular models based on the Transformer architecture include BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and T5 (Text-To-Text Transfer Transformer), among others. These models have achieved state-of-the-art performance on various NLP tasks.

Example: Consider the sentence: "The cat sat on the mat."

1. Embedding:
• Each word in the sentence is initially represented as an embedding vector. For instance:
  "The" might be represented as [0.2, 0.7, 0.1].
  "cat" might be represented as [0.5, 0.3, 0.8].
  ...
2. Positional Encoding:
• Since transformers do not inherently understand the order of words, positional encodings are added to the embeddings to provide information about the position of each word in the sequence.
3. Self-Attention:
• The transformer processes the embedded sequence through self-attention layers. Each word attends to every other word, and the attention scores determine how much each word contributes to the representation of others.
• For example, when processing "cat," the model assigns attention weights to other words like "The," "sat," "on," and "mat" based on their relevance to "cat."
4. Multi-Head Attention:
• Transformers often use multiple attention heads to capture different aspects of relationships within the input sequence. Each attention head focuses on different patterns and dependencies.
5. Feedforward Layer:
• The attended sequence is then passed through a feedforward layer for further non-linear transformations.
6. Normalization:
• Layer normalization is applied to stabilize and normalize the activations within each layer.
7. Position-wise Feedforward Networks:
• Additional position-wise feedforward networks process the outputs, allowing the model to capture complex patterns.
8. Output:

• The final output sequence represents a contextualized and enriched version of the input sequence, capturing relationships and dependencies.

This process is repeated across multiple layers of the transformer, allowing it to learn intricate patterns and dependencies in the data. Transformers have demonstrated exceptional performance in various natural language processing tasks due to their ability to capture long-range dependencies and understand context effectively.
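The self-attention step can be illustrated in a few lines of NumPy (a minimal sketch of scaled dot-product self-attention; the toy embeddings and dimensions are illustrative, and real transformers use learned projections per attention head):

```python
import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy embeddings for a 3-word sequence, dimension 4 (illustrative values).
X = np.random.randn(3, 4)

# Learned projection matrices would produce queries, keys, and values;
# random matrices stand in for the learned weights here.
d_k = 4
Wq, Wk, Wv = (np.random.randn(4, d_k) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv

# Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V
scores = Q @ K.T / np.sqrt(d_k)     # how much each word attends to each other word
weights = softmax(scores, axis=-1)  # rows sum to 1: attention distribution per word
output = weights @ V                # contextualized representations

print(weights.round(2))  # attention matrix (3 x 3)
print(output.shape)      # (3, 4): one enriched vector per word
```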

Limitations (of standard feedforward networks):
• They only accept a fixed-size vector as input and produce a fixed-size vector as output (e.g., probabilities of different classes).
• They use a fixed number of computational steps (e.g., the number of layers in the model).

RNN (Recurrent Neural Networks):
Recurrent Neural Networks (RNNs) are a family of neural networks introduced to learn sequential data. RNNs enable neural networks to remember the past words within a sentence. Recurrent Neural Networks are networks with loops, allowing information to persist. In the standard RNN diagram, a chunk of neural network A (a function f_W) looks at some input x_t and outputs a value h_t; a loop allows information to be passed from one step of the network to the next.

Number of parameters in a simple RNN layer:
# of parameters = shape(h) × (shape(h) + shape(x)) + shape(h)

What about parameters for a Dense layer?
• Output: y dimension
• Hidden state dimension: h
• Bias: y dimension
Parameters = shape(y) × shape(h) + shape(y)
(These counts are verified in the code sketch after the list below.)

Sequence Learning Applications
• RNNs can be applied to various types of sequential data to learn the temporal patterns.
• Time-series data (e.g., stock prices) → prediction, regression
• Raw sensor data (e.g., signal, voice, handwriting) → labels or text sequences
• Text → label (e.g., sentiment) or text sequence (e.g., translation, summary, answer)
• Image and video → text description (e.g., captions, scene interpretation)
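As referenced above, here is a quick Keras check of the parameter formulas (a minimal sketch; the dimensions h = 8, x = 4, y = 2 are illustrative):

```python
import tensorflow as tf

h, x, y = 8, 4, 2  # hidden size, input feature size, output size (illustrative)

rnn = tf.keras.layers.SimpleRNN(h)
dense = tf.keras.layers.Dense(y)
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(None, x)),  # variable-length sequence of x-dim vectors
    rnn,
    dense,
])

print(rnn.count_params(), h * (h + x) + h)  # 104 104
print(dense.count_params(), y * h + y)      # 18 18
```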