DSP Instruction Set: TMS320C6713 DSK Overview and Features - Prof. Zhijie Shi, Study notes of Computer Science

University of Connecticut (UConn) - Avery Point Computer Science

Prof. Zhijie Shi

An overview of the TMS320C6713 DSP system, including its architecture, features, and specifications. It covers computational architectures, memory organization, and functional units, and includes a block diagram and a data path diagram of the C6713 DSP.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

koofers-user-j94 🇺🇸


DSP Instruction Set

Architecture

Case Study: TMS320C6713

What is DSP?

A digital signal processor (DSP) is an integrated circuit designed for high-speed data manipulation, for example in audio, video, and communications


DSP of Analog Signals

Characteristics of DSP Applications

Computationally demanding (multiply, multiply-accumulate, …)

Stringent real-time requirement

“Streaming” data

High data bandwidth

Predictable program flow

Modern DSPs

Texas Instruments

TMS320C6x DSP family

Freescale

MSC81xx multi-core DSP family

Analog Devices

SHARC, Blackfin

C67x DSP Roadmap

C6713 DSK Overview

225 MHz TMS320C6713 floating point DSP

AIC23 stereo codec

8 to 96 kHz sample rates

Memory

16 MB dynamic RAM

512 KB nonvolatile FLASH memory

General purpose I/O

4 LEDs

4 DIP switches

USB interface to host PC

C6713 DSK Physical Layout

C6713 Block Diagram

C6713 Data Path

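[Figure: C6713 data path, showing the .L, .S, .M, and .D functional units of each side with their source (S1, S2) and destination (D) ports, the two register files (A0 - A15 on side A, and the B file), and the 1X and 2X cross paths]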

2 Data Paths

8 Functional Units

Orthogonal/Independent

2 Floating Point Multipliers

2 Floating Point Arithmetic

2 Floating Point Auxiliary

Control

Independent

Up to 8 32-bit Instructions

Registers

2 Files

32, 32-bit registers total

Cross paths (1X, 2X)

L-Unit (L1, L2)

Floating-Point, 40-bit Integer ALU

Bit Counting, Normalization

S-Unit (S1, S2)

Floating Point Auxiliary Unit

32-bit ALU/40-bit shifter

Bitfield Operations, Branching

M-Unit (M1, M2)

Multiplier: Integer & Floating-Point

D-Unit (D1, D2)

32-bit add/subtract Addr Calculations

Cross Paths

40-bit Write Paths (8 MSBs)

40-bit Read Paths/Store Paths

Function Unit & Operations

Advanced VLIW (VelociTI®)

[Figure: three example fetch packets built from instructions A to H, labeled Example 1, Example 2, and Example 3]

Fetch Packet

CPU fetches 8 instructions/cycle

Execute Packet

CPU executes 1 to 8 instructions/cycle

Fetch packets can contain multiple execute packets

Parallelism determined at compile/assembly time

Examples

1) 8 parallel instructions

2) 8 serial instructions

3) Mixed Serial/Parallel Groups

Reduces

Code size

Number of Program Fetches

Power Consumption
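The split of a fetch packet into execute packets can be sketched in C. On the C6000 family, the least significant bit of each 32-bit instruction word (the p-bit) says whether the next instruction runs in parallel with it. The helper name `count_execute_packets` and the array-of-p-bits input are assumptions made for this illustration, not part of the slides.

```c
#include <stddef.h>

/* Count the execute packets in one fetch packet.
 * p[i] is the p-bit of instruction i:
 *   1 -> instruction i+1 executes in parallel with instruction i
 *   0 -> instruction i is the last one of its execute packet
 * Assumption: an execute packet does not extend past the fetch packet,
 * so the last instruction always closes a packet. */
int count_execute_packets(const unsigned char p[], size_t n)
{
    int packets = 0;
    for (size_t i = 0; i < n; i++) {
        if (p[i] == 0 || i + 1 == n)
            packets++;
    }
    return packets;
}
```

With p-bits 1,1,1,1,1,1,1,0 the eight instructions form one execute packet (fully parallel); with all p-bits 0 they form eight packets (fully serial); a mixed pattern gives a number in between.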

Addressing Modes

Indirect addressing

*R  Register R contains the address of the memory location

*R++(d)  Post-increment: access memory at R, then add d to R

*++R(d)  Pre-increment: add d to R, then access memory at the new R

*+R(d)  Pre-offset: access memory at R + d; R is not modified
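The four indirect forms can be modeled in C as follows. This is a minimal sketch that treats R as an index into a word array rather than a byte address, so offset scaling is ignored; the function names are hypothetical.

```c
/* *R : access memory at R; R is unchanged. */
int load_indirect(const int mem[], const int *r)
{
    return mem[*r];
}

/* *R++(d) : access memory at R, then add d to R. */
int load_post_inc(const int mem[], int *r, int d)
{
    int value = mem[*r];
    *r += d;
    return value;
}

/* *++R(d) : add d to R, then access memory at the new R. */
int load_pre_inc(const int mem[], int *r, int d)
{
    *r += d;
    return mem[*r];
}

/* *+R(d) : access memory at R + d; R is unchanged. */
int load_offset(const int mem[], const int *r, int d)
{
    return mem[*r + d];
}
```

The pre-increment and pre-offset forms read the same location; they differ only in whether R keeps the new address afterwards.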

Circular addressing

Address is bounded to a range

Controlled by AMR register

AMR Register

Example: AMR = 0004 0001 h
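The AMR example can be decoded with a few lines of C. The sketch assumes the C6000 AMR layout: 2-bit mode fields for A4 to A7 and B4 to B7 in bits 15..0 (A4 in bits 1..0), BK0 in bits 20..16, and a block size of 2^(BK0+1) bytes. For AMR = 0004 0001h this gives A4 in circular mode using BK0 with a 32-byte block. The function names are hypothetical.

```c
#include <stdint.h>

/* Addressing mode of A4: 0 = linear, 1 = circular using BK0,
 * 2 = circular using BK1. */
unsigned amr_a4_mode(uint32_t amr)
{
    return amr & 0x3u;
}

/* Block size in bytes selected by the BK0 field: 2^(BK0 + 1). */
uint64_t amr_bk0_block_bytes(uint32_t amr)
{
    uint32_t bk0 = (amr >> 16) & 0x1Fu;
    return (uint64_t)1 << (bk0 + 1);
}

/* Circular address update: only the low bits inside the block change,
 * so the address wraps around within its aligned block.
 * block_bytes must be a power of two; offset is in bytes. */
uint32_t circular_add(uint32_t addr, int32_t offset, uint32_t block_bytes)
{
    uint32_t mask = block_bytes - 1u;
    return (addr & ~mask) | ((addr + (uint32_t)offset) & mask);
}
```

With a 32-byte block, stepping 4 bytes forward from 0x8000001C wraps to 0x80000000 instead of leaving the buffer.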

Cross-Path Constraints

At most two instructions per cycle can use the cross paths, one on 1X and one on 2X

[Examples: valid and invalid uses of the cross paths]

Load/Store Constraints

The address register must be on the same side as the .D unit

Two loads (or two stores) issued in parallel must use different register files for their data

[Examples: valid and invalid pairs of parallel loads and stores]

Branch Instructions

Branch using a displacement

Unit can be S1 or S2

[A0] B .S1 LOOP

ADD .L1 A1, A2, A

LOOP: SUB .D1 A5, A6, A

Branch using a register

Only on S2

B .S2 B

Integer Instructions

Arithmetic instructions:

ADD .L1 A3, A7, A7 ; A3 + A7 -> A7

Move instructions:

MVKL .S1 X, A4 ; 16 LSBs of X, sign-extended -> A4

MVKH .S1 X, A4 ; 16 MSBs of X -> upper half of A4

MVKLH .S1 X, A4 ; 16 LSBs of X -> upper half of A4 (X << 16)

Comparison instructions:

CMPEQ, CMPGT, CMPLT
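The move instructions are typically used as an MVKL/MVKH pair to build a full 32-bit constant in a register. A minimal C model of the three moves, with hypothetical function names and the register passed in and returned as a plain 32-bit value:

```c
#include <stdint.h>

/* MVKL: load the 16 LSBs of the constant, sign-extended to 32 bits. */
uint32_t mvkl(uint32_t x)
{
    uint32_t low = x & 0xFFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* MVKH: write the 16 MSBs of the constant into the upper half of dst;
 * the lower half of dst is kept. */
uint32_t mvkh(uint32_t x, uint32_t dst)
{
    return (x & 0xFFFF0000u) | (dst & 0xFFFFu);
}

/* MVKLH: write the 16 LSBs of the constant into the upper half of dst;
 * the lower half of dst is kept. */
uint32_t mvklh(uint32_t x, uint32_t dst)
{
    return ((x & 0xFFFFu) << 16) | (dst & 0xFFFFu);
}
```

Because MVKL sign-extends, it has to come first in the pair: `mvkh(x, mvkl(x))` reproduces `x`, while the opposite order would overwrite the upper half.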

CLR – Clear a Bit Field

Syntax

CLR (.unit) src2, csta, cstb, dst

or

CLR (.unit) src2, src1, dst

EXT – Extract and Sign-Extend a Bit Field

Syntax

EXT (.unit) src2, csta, cstb, dst

or

EXT (.unit) src2, src1, dst
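The constant forms of CLR and EXT can be modeled in C. The sketch assumes the usual C6000 semantics: CLR clears bits csta (LSB of the field) through cstb (MSB of the field), and EXT shifts src2 left by csta and then arithmetically right by cstb. The function names are hypothetical.

```c
#include <stdint.h>

/* CLR src2, csta, cstb, dst: clear bits csta..cstb (inclusive).
 * Assumes 0 <= csta <= cstb <= 31. */
uint32_t clr_field(uint32_t src2, unsigned csta, unsigned cstb)
{
    unsigned width = cstb - csta + 1u;
    uint32_t mask = (width == 32u) ? 0xFFFFFFFFu
                                   : (((uint32_t)1 << width) - 1u) << csta;
    return src2 & ~mask;
}

/* EXT src2, csta, cstb, dst: shift left by csta, then arithmetic shift
 * right by cstb. Assumes csta and cstb are in 0..31. The sign fill is
 * done by hand because right-shifting a negative int is
 * implementation-defined in C. */
int32_t ext_field(uint32_t src2, unsigned csta, unsigned cstb)
{
    uint32_t shifted = src2 << csta;
    uint32_t result = shifted >> cstb;
    if ((shifted & 0x80000000u) && cstb > 0u)
        result |= ~(0xFFFFFFFFu >> cstb);
    return (int32_t)result;
}
```

To extract the field from bit `lo` to bit `hi`, use csta = 31 - hi and cstb = csta + lo. For bits 4..7 of 0xA5 this is csta = 24, cstb = 28, and the 4-bit field 1010b is sign-extended to -6.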

Example: Dot Product

float dotp(float a[], float b[])
{
    int i;
    float sum = 0.0f;   /* the accumulator must start at zero */

    for (i = 0; i < 200; i++)
        sum += a[i] * b[i];

    return sum;
}

MVK .S1 200, A

ZERO .L1 A

LOOP: LDW .D1 *A4++, A

LDW .D1 *A8++, A

NOP 4

MPYSP .M1 A2, A3, A

NOP 3

ADDSP .L1 A6, A7, A

SUB .S1 A1, 1, A

[A1] B .S2 LOOP

NOP 5

Assembly

Double-Word Loading

MVK .S1 100, A

ZERO .L1 A

|| ZERO .L2 B

LOOP: LDDW .D1 *A4++, A3:A

|| LDDW .D2 *B4++, B3:B

SUB .S1 A1, 1, A

NOP 2

[A1] B .S2 LOOP

MPYSP .M1x A2, B2, A

|| MPYSP .M2x A3, B3, B

NOP 3

ADDSP .L1 A6, A7, A

|| ADDSP .L2 B6, B7, B

; branch occurs here

NOP 3

ADDSP .L1 A7, B7, A

NOP 3
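The benefit of double-word loading shows up when the cycles of each loop body are added up. In the sketch below every execute packet costs one cycle and `NOP n` costs n cycles; memory stalls are assumed to be absent, and the array contents are transcribed by hand from the two dot-product listings (first the single-word loop, then the double-word loop).

```c
#include <stddef.h>

/* Sum the cycles of one loop iteration: 1 per execute packet, n per "NOP n". */
int cycles_per_iteration(const int slot_cycles[], size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += slot_cycles[i];
    return total;
}

/* LDW, LDW, NOP 4, MPYSP, NOP 3, ADDSP, SUB, B, NOP 5 */
static const int single_word_loop[] = {1, 1, 4, 1, 3, 1, 1, 1, 5};

/* LDDW||LDDW, SUB, NOP 2, B, MPYSP||MPYSP, NOP 3, ADDSP||ADDSP */
static const int double_word_loop[] = {1, 1, 2, 1, 1, 3, 1};
```

Under these assumptions the single-word loop takes 18 cycles per iteration and handles one product, while the double-word loop takes 10 cycles per iteration and handles two products, so the loop bodies cost about 3600 and 1000 cycles respectively for 200 elements.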

Optimization with Software Pipeline

Instructions in the first iteration

Assembly Code for Loop Kernel

The loop kernel can be done in one cycle!

Butterfly Diagram

[Figure: two 8-point FFT butterfly diagrams with twiddle factors. In one, the input X(0) to X(7) is in order and the output is out of order. In the other, the input is out of order (X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7)) and the output y(0) to y(7) is in order.]
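The out-of-order sequence 0, 4, 2, 6, 1, 5, 3, 7 in the butterfly diagram is the bit-reversed ordering of the 3-bit indices 0 to 7. A minimal C sketch of that index mapping (the function name is hypothetical):

```c
/* Reverse the lowest `bits` bits of index i.
 * For an N-point radix-2 FFT, bits = log2(N). */
unsigned bit_reverse(unsigned i, unsigned bits)
{
    unsigned reversed = 0;
    for (unsigned b = 0; b < bits; b++) {
        reversed = (reversed << 1) | (i & 1u);  /* move the current LSB in */
        i >>= 1;
    }
    return reversed;
}
```

For 8 points, index 1 (001b) maps to 4 (100b) and index 3 (011b) maps to 6 (110b), while 0, 2, 5, and 7 map to themselves.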

Implementation in C

void DSPF_sp_cfftr2_dit(float* x, float* w, short n)
{
    short n2, ie, ia, i, j, k, m;
    float rtemp, itemp, c, s;

    n2 = n;
    ie = 1;

    for (k = n; k > 1; k >>= 1)          /* one pass per stage */
    {
        n2 >>= 1;
        ia = 0;
        for (j = 0; j < ie; j++)         /* one pass per sub-group */
        {
            c = w[2*j];
            s = w[2*j+1];
            for (i = 0; i < n2; i++)     /* one butterfly per pass */
            {
                m = ia + n2;
                rtemp     = c * x[2*m]   + s * x[2*m+1];
                itemp     = c * x[2*m+1] - s * x[2*m];
                x[2*m]    = x[2*ia]      - rtemp;
                x[2*m+1]  = x[2*ia+1]    - itemp;
                x[2*ia]   = x[2*ia]      + rtemp;
                x[2*ia+1] = x[2*ia+1]    + itemp;
                ia++;
            }
            ia += n2;
        }
        ie <<= 1;
    }
}

[Figure: stage k of the FFT divided into sub-groups and cells, with the even and odd elements of each sub-group marked. The three loops of the C code are annotated as number of stages, number of sub-groups, and number of elements in cell.]

Implementation in Linear Assembly Code

           .global _DSPF_sp_cfftr2_dit
_DSPF_sp_cfftr2_dit .cproc A_xptr, B_wptr, A_n

           MV      A_n, A_n2                     ; init n
           SHR     A_n, 1, A_n                   ; outer loop cntr
           MV      A_n, A_cnt                    ; inner loop cntr
oloop:     SHR     A_n2, 1, A_n2                 ; n2>>
           LDDW    *B_wptr, B_s:B_c              ; load s:c
           ADD     B_wptr, 8, B_w                ; init w ptr
           MV      A_n2, A_i                     ; init ia
           MV      A_cnt, A_icntr                ; init loop cntr
           SHL     A_n2, 3, A_8n2                ; n2<<
           ADDAD   A_xptr, A_n2, A_x             ; init load ptr
           ADDAD   A_xptr, A_n2, A_xs            ; init store ptr
           MV      A_xptr, B_x                   ; init load ptr
           MV      B_x, B_xs                     ; init store ptr
loop:
  [!A_i]   ADD     A_x, A_8n2, A_x               ; if(!i) A_x+=8n
  [!A_i]   ADD     B_x, A_8n2, B_x               ; if(!i) B_x+=8n
  [!A_i]   LDDW    *B_w++, B_s:B_c               ; if(!i) load s:c
  [!A_i]   ADD     A_xs, A_8n2, A_xs             ; reset store ptr
  [!A_i]   ADD     B_xs, A_8n2, B_xs             ; reset store ptr

           LDDW    *A_x++, A_x2mp1:A_x2m         ; load x[2m+1]:x[2m]

  [!A_i]   MV      A_n2, A_i                     ; reset ia
  [A_i]    SUB     A_i, 1, A_i                   ; decr ia

           MPYSP   A_x2m, B_c, A_p1              ; p1=c*x[2m]
           MPYSP   A_x2m, B_s, B_p4              ; p4=s*x[2m]
           MPYSP   A_x2mp1, B_s, A_p2            ; p2=s*x[2m+1]
           MPYSP   A_x2mp1, B_c, A_p3            ; p3=c*x[2m+1]

           ADDSP   A_p1, A_p2, A_rtemp           ; rtemp=p1+p
           SUBSP   A_p3, B_p4, B_itemp           ; itemp=p3-p

           LDDW    *B_x++, B_x2iap1:B_x2ia       ; load x[2ia+1]:x[2ia]

           SUBSP   B_x2ia, A_rtemp, A_x2ms       ; x[2m]=x[2ia]-rtemp
           ADDSP   B_x2ia, A_rtemp, A_x2ias      ; x[2ia]=x[2ia]+rtemp

           SUBSP   B_x2iap1, B_itemp, B_x2mp1s   ; x[2m+1]=x[2ia+1]-itemp
           ADDSP   B_x2iap1, B_itemp, B_x2iap1s  ; x[2ia+1]=x[2ia+1]+itemp

           STW     A_x2ms, *A_xs++               ; perform all stores
           STW     A_x2ias, *B_xs++
           STW     B_x2mp1s, *A_xs++
           STW     B_x2iap1s, *B_xs++

  [A_icntr] SUB    A_icntr, 1, A_icntr           ; decr loop cntr
  [A_icntr] B      loop                          ; branch inner

           SHR     A_n, 1, A_n                   ; half outer loop cntr
  [A_n]    B       oloop                         ; branch outer loop
