EE 6180: Assignment 1

February 15, 2025

Guidelines

  1. Deadline: Submit assignment solutions by March 1.
  2. Complexity Calculations: Express computational complexity in terms of scalar multiplications, providing a concise derivation in one or two lines (a short example of the counting convention follows this list).
  3. Conciseness: Avoid unnecessary verbosity. Provide precise and to-the-point explanations.
  4. Food for Thought / Refer: Note these points for your own learning, and think about the practical use of each question.
  5. LLM usage: Exercise caution with LLM-generated solutions as they confidently provide incorrect answers.
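
As a concrete instance of the counting convention in Guideline 2: multiplying an m x k matrix by a k x n matrix uses m * k * n scalar multiplications. The snippet below is only an illustration; the dimensions are arbitrary and NumPy is assumed purely for demonstration.

```python
import numpy as np

# Multiplying an (m x k) matrix by a (k x n) matrix costs m * k * n scalar
# multiplications: each of the m * n output entries is a dot product of length k.
m, k, n = 5, 7, 3
A = np.random.randn(m, k)
B = np.random.randn(k, n)
C = A @ B

print("output shape:", C.shape)              # (5, 3)
print("scalar multiplications:", m * k * n)  # 105
```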

Question 1 (17 marks)

A single Transformer layer consists of the following components: Multi-Head Self-Attention (MHSA), Layer Normalization, and Feedforward Neural Networks (MLP). The dimensions and details of these components are outlined below:

Input: $X \in \mathbb{R}^{H \times W \times d}$.

MHSA Weight Matrices: For $N_h$ heads, $W_Q^h, W_K^h, W_V^h \in \mathbb{R}^{d \times d_h}$, where $N_h \cdot d_h = d$.

MLP Architecture: The input is $\hat{Y}$, the output of MHSA with the residual connection.

First Linear Transformation: $F = \mathrm{ReLU}(\hat{Y} W_1 + b_1)$, with $W_1 \in \mathbb{R}^{d \times d_{ff}}$ and $b_1 \in \mathbb{R}^{d_{ff}}$.

Second Linear Transformation: $Y_{\mathrm{MLP}} = F W_2 + b_2$, with $W_2 \in \mathbb{R}^{d_{ff} \times d}$ and $b_2 \in \mathbb{R}^{d}$.

Here, $d_{ff} = 8d$.

Assume that computational complexity is always measured in terms of scalar multiplications.
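
To make these shapes concrete before the questions below, here is a minimal NumPy sketch of the forward pass just described. It is an illustration only: the sizes are arbitrary, the $1/\sqrt{d_h}$ scaling and softmax follow the standard Transformer convention rather than anything stated above, and LayerNorm and an output projection are omitted since the question does not detail them.

```python
import numpy as np

# Illustrative sizes (assumed, not specified in the assignment)
H, W, d, Nh = 14, 14, 64, 4
dh = d // Nh          # per-head dimension, so Nh * dh = d
dff = 8 * d           # as given: d_ff = 8d
N = H * W             # tokens after flattening the H x W grid

X = np.random.randn(N, d)                     # input X, flattened to (H*W) x d

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

# ---- MHSA: per head, Q = X W_Q^h, K = X W_K^h, V = X W_V^h ----
head_outputs = []
for _ in range(Nh):
    Wq, Wk, Wv = (np.random.randn(d, dh) for _ in range(3))
    Q, K, V = X @ Wq, X @ Wk, X @ Wv          # each (H*W) x dh
    S = softmax(Q @ K.T / np.sqrt(dh))        # attention map S^h, (H*W) x (H*W)
    head_outputs.append(S @ V)                # (H*W) x dh
Y_hat = np.concatenate(head_outputs, axis=-1) + X   # residual connection, (H*W) x d

# ---- MLP: F = ReLU(Y_hat W1 + b1), Y_MLP = F W2 + b2 ----
W1, b1 = np.random.randn(d, dff), np.zeros(dff)
W2, b2 = np.random.randn(dff, d), np.zeros(d)
F = np.maximum(Y_hat @ W1 + b1, 0.0)
Y_mlp = F @ W2 + b2

print(Y_hat.shape, Y_mlp.shape)               # both (H*W, d)
```

Counting the scalar multiplications of each matrix product above (for example, an $(H \cdot W \times d)$ by $(d \times d_h)$ product costs $H \cdot W \cdot d \cdot d_h$ of them) is exactly the bookkeeping asked for in part 1.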

  1. Complexity Analysis:

a) Derive the computational complexity of MHSA and of the MLP in the Transformer layer. Exclude the softmax operation in MHSA from the derivation. (2 marks)

b) Find the conditions on $d$, $H$, $W$ such that the MLP computations exceed those of MHSA. Is this possible in practical scenarios for image applications? (2 marks)

c) Compare the number of parameters in the MHSA and MLP layers, and determine which layer contributes more to the total parameter count in a Transformer layer. (1 mark)

  2. Sliding Window Attention Approximation: The full attention map $S^h \in \mathbb{R}^{H \cdot W \times H \cdot W}$ is approximated by considering only a patch of size $P \times P$ around each pixel $i$ (a sketch of this masking is given after part c) below):

$(\hat{S}^h)_{ij} = \begin{cases} (S^h)_{ij}, & \text{if } j \in \mathrm{Patch}(i), \\ 0, & \text{otherwise.} \end{cases}$

Only for this question, you may exclude the complexity of computing the queries, keys, and values.

a) Derive the computational complexity of MHSA with the above approximation. (2 marks)

b) Calculate the reduction in computations compared to the original MHSA. (2 marks)

c) Is there a reduction in the number of parameters for the layer? If yes, calculate the reduction. (1 mark)
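
The sketch referenced in item 2 above: a NumPy illustration of the sliding-window mask. The row-major flattening and the interpretation of $\mathrm{Patch}(i)$ as a centred $P \times P$ spatial window are assumptions made for the illustration, not part of the problem statement.

```python
import numpy as np

H, W, P = 14, 14, 3          # assumed grid size and (odd) window size
N = H * W

def in_patch(i, j, H, W, P):
    """True if pixel j lies inside the P x P window centred on pixel i
    (pixels indexed in row-major order over the H x W grid)."""
    ri, ci = divmod(i, W)
    rj, cj = divmod(j, W)
    return abs(ri - rj) <= P // 2 and abs(ci - cj) <= P // 2

S = np.random.randn(N, N)    # stand-in for the full attention map S^h

mask = np.zeros((N, N), dtype=bool)
for i in range(N):
    for j in range(N):
        mask[i, j] = in_patch(i, j, H, W, P)

S_hat = np.where(mask, S, 0.0)   # (S_hat)_ij = (S^h)_ij if j in Patch(i), else 0

# Each row now has at most P * P non-zero entries, which is what drives
# the reduction asked about in parts (a) and (b).
print(int((S_hat != 0).sum(axis=1).max()))   # <= P * P
```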

  3. Low-Rank Approximation for MLP: Approximate the weight matrices of the MLP using a low-rank factorization (a sketch is given after part c) below):

$W_1 \approx A_1 B_1, \qquad W_2 \approx A_2 B_2,$

where $A_1 \in \mathbb{R}^{d \times r}$, $B_1 \in \mathbb{R}^{r \times d_{ff}}$, $A_2 \in \mathbb{R}^{d_{ff} \times r}$, $B_2 \in \mathbb{R}^{r \times d}$, and $r \ll d$.

a) Derive the computational complexity of the approximated MLP. (2 marks)

b) Compare the computational savings with the original MLP. (2 marks)

c) Is there a reduction in the number of parameters for the layer? If yes, calculate the reduction. (1 mark)
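
The sketch referenced in item 3 above: a NumPy illustration of the factorized MLP. The rank $r$ and the dimensions are placeholder values, and the factors are random stand-ins rather than an actual factorization of trained weights.

```python
import numpy as np

d, r = 64, 8                 # assumed model dimension and rank, r << d
dff = 8 * d
N = 14 * 14                  # number of tokens, H * W

Y_hat = np.random.randn(N, d)    # MHSA output with the residual connection

# Low-rank factors standing in for W1 (d x dff) and W2 (dff x d)
A1, B1 = np.random.randn(d, r), np.random.randn(r, dff)
A2, B2 = np.random.randn(dff, r), np.random.randn(r, d)
b1, b2 = np.zeros(dff), np.zeros(d)

# Apply the factors left-to-right so the full d x dff matrix is never formed:
F = np.maximum((Y_hat @ A1) @ B1 + b1, 0.0)  # N*d*r + N*r*dff multiplications
Y_mlp = (F @ A2) @ B2 + b2                   # N*dff*r + N*r*d multiplications

full_params = d * dff + dff * d              # W1 and W2 (ignoring biases)
lowrank_params = 2 * r * (d + dff)           # A1, B1, A2, B2
print(full_params, lowrank_params)
```

Comparing these per-product counts with the $H \cdot W \cdot d \cdot d_{ff}$ multiplications of each original linear layer is the comparison asked for in part (b).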

  4. Modified Transformer Complexity: If MHSA is replaced with sliding window attention and the MLP is replaced with the low-rank approximation, derive the total computational complexity of the modified Transformer layer. Assume the computational complexity of a single Layer Normalization is $H \cdot W \cdot d$. (2 marks)

Refer: See LoRA for practical applications of low-rank approximations in LLMs.
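
On the Refer pointer above: in LoRA the low-rank factors are added to a frozen pretrained weight rather than replacing it, so only the small factors are trained. Below is a minimal sketch with assumed shapes and scaling; it illustrates the idea, not the reference implementation.

```python
import numpy as np

d_in, d_out, r, alpha = 64, 64, 8, 16        # assumed sizes and scaling factor
W0 = np.random.randn(d_in, d_out)            # frozen pretrained weight
A = np.random.randn(d_in, r) * 0.01          # small trainable factor
B = np.zeros((r, d_out))                     # second factor starts at zero, so the
                                             # low-rank update is zero at initialization

x = np.random.randn(1, d_in)
y = x @ W0 + (alpha / r) * (x @ A) @ B       # frozen path plus low-rank update
print(y.shape)                               # (1, 64)
```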

Question 2 (12 marks)

Given two density functions p(x) and q(x), we define the following divergence measures:

Total Variation Distance:

$D_{TV}(p \| q) = \frac{1}{2} \int |p(x) - q(x)| \, dx$

Kullback-Leibler (KL) Divergence:

$D_{KL}(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$

Hellinger Distance:

$D_{H}(p \| q) = \sqrt{\frac{1}{2} \int \left( \sqrt{p(x)} - \sqrt{q(x)} \right)^2 dx}$
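
As a numerical companion to these definitions, the sketch below evaluates all three divergences for two densities discretized on a grid, following the normalizations written above. The Gaussian test densities, grid range, and resolution are arbitrary choices, and the Riemann-sum integration is only an approximation.

```python
import numpy as np

# Discretize two example densities on a grid (assumed: two 1-D Gaussians)
x = np.linspace(-10.0, 10.0, 20001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

p = gaussian(x, 0.0, 1.0)
q = gaussian(x, 1.0, 1.5)

d_tv = 0.5 * np.sum(np.abs(p - q)) * dx                           # total variation
d_kl = np.sum(p * np.log(p / q)) * dx                             # KL divergence
d_h = np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2) * dx)  # Hellinger

print(f"TV = {d_tv:.4f}, KL = {d_kl:.4f}, Hellinger = {d_h:.4f}")
```

Evaluating these numbers for a few choices of p and q is a quick sanity check on the bounds derived in parts 2 and 3 below.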

Derive and analyze the relationships between these measures:

  1. Determine the range (minimum and maximum values) for $D_{TV}(p \| q)$ and $D_H(p \| q)$. (2 marks)
  2. Prove the following inequality between $D_{TV}(p \| q)$ and $D_H(p \| q)$ (4 marks):

$D_{TV}(p \| q) \le \sqrt{2} \, D_H(p \| q).$

  3. Prove the following inequality between $D_{TV}(p \| q)$ and $D_{KL}(p \| q)$ (1 + 3 marks):

(a) Show that $-\log(x + 1) \ge -x$ for all $x \ge 0$.

(b) Using the result in (a), prove that $\sqrt{2} \, D_H(p \| q) \le \sqrt{D_{KL}(p \| q)}$, and hence that $D_{TV}(p \| q) \le \sqrt{D_{KL}(p \| q)}$.

  4. Identify the limitation of the inequality between $D_{TV}(p \| q)$ and $D_{KL}(p \| q)$ in one sentence. (2 marks)

Food for Thought: Why are these inequalities significant? Can you think of a practical application for such inequalities?