DSP Instruction Set: TMS320C6713 DSK Overview and Features - Prof. Zhijie Shi, Study notes of Computer Science

An overview of the TMS320C6713 DSP system, including its architecture, features, and specifications. It covers computational architecture, memory organization, and functional units, and includes a block diagram and a data path diagram of the C6713 DSP.

Typology: Study notes

Pre 2010

Uploaded on 09/17/2009

koofers-user-j94

DSP Instruction Set Architecture

Case Study: TMS320C6713

What is DSP?

‡ A digital signal processor (DSP) is an integrated circuit designed for high-speed data manipulation in applications such as audio, video, and communications

DSP of Analog Signals

Characteristics of DSP Applications

‡ Computationally demanding (multiply, multiply-accumulate …)

‡ Stringent real-time requirement

‡ “Streaming” data

‡ High data bandwidth

‡ Predictable program flow
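The multiply-accumulate pattern named in the list above can be sketched in C as one output sample of an FIR filter. This is only an illustration: the function name `fir_output` and the layout of `recent` (newest sample first) are assumptions made for the sketch, not something taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* One FIR output sample: y = sum over k of coeff[k] * recent[k].
   recent[0] is assumed to be the newest input sample (hypothetical layout). */
float fir_output(const float *coeff, const float *recent, size_t taps)
{
    assert(coeff != NULL && recent != NULL);
    float acc = 0.0f;
    for (size_t k = 0; k < taps; k++) {
        acc += coeff[k] * recent[k];   /* one multiply-accumulate per tap */
    }
    return acc;
}
```

Every new input sample needs `taps` multiply-accumulates before the next sample arrives, which is where the real-time and bandwidth requirements come from.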

Modern DSPs

‡ Texas Instruments

„ TMS320C6x DSP family

‡ Freescale

„ MSC81xx multi-core DSP family

‡ Analog Devices

„ SHARC, Blackfin

C67x DSP Roadmap

C6713 DSK Overview

‡ 225 MHz TMS320C6713 floating point DSP

‡ AIC23 stereo codec

„ 8 to 96 kHz sample rates

‡ Memory

„ 16 MB dynamic RAM

„ 512 KB nonvolatile FLASH memory

‡ General purpose I/O

„ 4 LEDs

„ 4 DIP switches

‡ USB interface to host PC

C6713 DSK Physical Layout

C6713 Block Diagram

C6713 Data Path

[Figure: C6713 data path. Data path A holds registers A0 - A15 and units L1, S1, M1, D1; data path B holds registers B0 - B15 and units L2, S2, M2, D2. The 1X and 2X cross paths link the two sides.]

‡ 2 Data Paths

‡ 8 Functional Units

„ Orthogonal/Independent

„ 2 Floating Point Multipliers

„ 2 Floating Point Arithmetic

„ 2 Floating Point Auxiliary

‡ Control

„ Independent

„ Up to eight 32-bit instructions per cycle

‡ Registers

„ 2 files

„ Thirty-two 32-bit registers in total (16 per file)

‡ Cross paths (1X, 2X)

‡ L-Unit (L1, L2)

„ Floating-Point, 40-bit Integer ALU

„ Bit Counting, Normalization

‡ S-Unit (S1, S2)

„ Floating Point Auxiliary Unit

„ 32-bit ALU/40-bit shifter

„ Bitfield Operations, Branching

‡ M-Unit (M1, M2)

„ Multiplier: Integer & Floating-Point

‡ D-Unit (D1, D2)

„ 32-bit add/subtract, address calculations

Cross Paths

40-bit Write Paths (8 MSBs)

40-bit Read Paths/Store Paths

Function Unit & Operations

Advanced VLIW (VelociTI®)

[Figure: three fetch packets holding instructions A to H, one for each of Examples 1 to 3 below. In the mixed case the packet is grouped as A B, C, D, E, F G H.]

‡ Fetch Packet

„ CPU fetches 8 instructions/cycle

‡ Execute Packet

„ CPU executes 1 to 8 instructions/cycle

„ Fetch packets can contain multiple execute packets

‡ Parallelism determined at compile/assembly time

‡ Examples

„ 1) 8 parallel instructions

„ 2) 8 serial instructions

„ 3) Mixed Serial/Parallel Groups

‡ Reduces

„ Code size

„ Number of Program Fetches

„ Power Consumption
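How a fetch packet splits into execute packets can be sketched in C. The slides do not show the encoding; the sketch relies on the p-bit convention described in TI's C6000 CPU documentation (bit 0 of each instruction word marks whether the next instruction runs in parallel). The function name `count_execute_packets` is made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FETCH_PACKET_SIZE 8

/* Count the execute packets inside one fetch packet.
   Assumption (TI C6000 CPU documentation, not the slides): bit 0 of each
   32-bit instruction is the p-bit, and p = 1 means the next instruction
   executes in parallel with this one. */
int count_execute_packets(const uint32_t fetch_packet[FETCH_PACKET_SIZE])
{
    assert(fetch_packet != NULL);
    int packets = 0;
    for (int i = 0; i < FETCH_PACKET_SIZE; i++) {
        int p_bit = (int)(fetch_packet[i] & 1u);
        /* A packet ends where the p-bit is 0. The last slot also ends one,
           because on the C67x an execute packet cannot span fetch packets. */
        if (p_bit == 0 || i == FETCH_PACKET_SIZE - 1) {
            packets++;
        }
    }
    return packets;
}
```

With only the p-bits set, a fully parallel packet is `{1,1,1,1,1,1,1,0}` (one execute packet), a fully serial packet is all zeros (eight), and the mixed grouping A B, C, D, E, F G H is `{1,0,0,0,0,1,1,0}` (five).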

Addressing Modes

‡ Indirect addressing

„ *R Register R contains the address of the memory location

„ *R++(d) Post-increment: the access uses R, then R is increased by d

„ *++R(d) Pre-increment: R is increased by d, then the access uses the new R

„ *+R(d) Pre-offset: the access uses R + d, and R is not modified

‡ Circular addressing

„ Address is bounded to a range

„ Controlled by AMR register
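A minimal C sketch of what "bounded to a range" means for circular addressing, assuming a block size that is a power of two and a block that starts on a multiple of its size. The helper name `circ_advance` is hypothetical; the hardware performs this wrap as part of the address update.

```c
#include <assert.h>
#include <stdint.h>

/* Advance a byte address by step inside a circular block.
   Assumption: block_bytes is a power of two, so only the low address bits
   change and the upper bits (the block base) stay fixed. */
uint32_t circ_advance(uint32_t addr, int32_t step, uint32_t block_bytes)
{
    assert(block_bytes != 0 && (block_bytes & (block_bytes - 1u)) == 0);
    uint32_t mask = block_bytes - 1u;
    return (addr & ~mask) | ((addr + (uint32_t)step) & mask);
}
```

For a 32-byte block, stepping forward by 4 from the last word wraps to the first word of the same block, and stepping backward by 4 from the first word wraps to the last.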

AMR Register

Example: AMR = 0004 0001h
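The AMR example value can be decoded with a short C sketch. The field layout used here comes from TI's C6000 CPU documentation rather than from the slide text: eight 2-bit mode fields in bits 15 to 0 (for A4, A5, A6, A7, B4, B5, B6, B7) and two 5-bit block-size fields, BK0 in bits 20 to 16 and BK1 in bits 25 to 21. The function names are made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Mode field for register slot 0..7 (A4, A5, A6, A7, B4, B5, B6, B7):
   0 = linear, 1 = circular using BK0, 2 = circular using BK1. */
unsigned amr_mode(uint32_t amr, unsigned slot)
{
    assert(slot < 8);
    return (unsigned)((amr >> (2u * slot)) & 0x3u);
}

/* Circular block size in bytes for block field 0 (BK0) or 1 (BK1).
   A field value N selects a block of 2^(N+1) bytes; for N = 31 the
   result wraps to 0 in 32-bit arithmetic. */
uint32_t amr_block_bytes(uint32_t amr, unsigned bk)
{
    assert(bk < 2);
    uint32_t n = (amr >> (16u + 5u * bk)) & 0x1Fu;
    return (uint32_t)2u << n;
}
```

Under that layout, 0004 0001h puts A4 in circular mode using BK0, leaves the other registers linear, and sets BK0 = 4, which gives a 32-byte block.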

Cross-Path Constraints

‡ At most two instructions per cycle can use the cross paths, one on 1X and one on 2X

[Figure: one valid and one invalid example of cross-path use.]

Load/Store Constraints

‡ The address register must be on the same side as the .D unit

‡ A load (store) issued in parallel with another load (store) must use a different register file from it

[Figure: valid and invalid examples of each constraint.]

Branch Instructions

‡ Branch using a displacement

„ Unit can be .S1 or .S2

[A0] B .S1 LOOP

ADD .L1 A1, A2, A

LOOP: SUB .D1 A5, A6, A

‡ Branch using a register

„ Only on .S2

B .S2 B

Integer Instructions

‡ Arithmetic instructions:

ADD .L1 A3, A7, A7 ; A3+A7 -> A7

‡ Move instructions:

MVKL .S1 X, A4 ; move 16 LSBs of X -> A4

MVKH .S1 X, A4 ; move 16 MSBs of X -> A4

MVKLH .S1 X, A4 ; move (X<<16) -> A4

‡ Comparison instructions:

CMPEQ, CMPGT, CMPLT
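The MVKL/MVKH pair from the move instructions above is the usual way to build a full 32-bit constant in a register. The C sketch below models that pair on plain integers. It assumes, following TI's instruction descriptions, that MVKL sign-extends its 16-bit constant and that MVKH replaces only the upper halfword; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Model of MVKL: low 16 bits of the constant, sign-extended to 32 bits. */
uint32_t mvkl(uint32_t cst)
{
    uint32_t low = cst & 0xFFFFu;
    return (low & 0x8000u) ? (low | 0xFFFF0000u) : low;
}

/* Model of MVKH: upper 16 bits of reg are replaced by the upper 16 bits
   of the constant; the lower 16 bits of reg are kept. */
uint32_t mvkh(uint32_t reg, uint32_t cst)
{
    return (reg & 0x0000FFFFu) | (cst & 0xFFFF0000u);
}

/* MVKL followed by MVKH on the same register yields the full constant. */
uint32_t load_const32(uint32_t cst)
{
    uint32_t a4 = mvkl(cst);
    a4 = mvkh(a4, cst);
    assert(a4 == cst);
    return a4;
}
```

The order matters: MVKL overwrites the whole register with a sign-extended value, so it has to come before MVKH.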

CLR – Clear a Bit Field

‡ Syntax

CLR (.unit) src2, csta, cstb, dst

or

CLR (.unit) src2, src1, dst
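A C sketch of the constant form of CLR, assuming (per TI's instruction description, which the slide does not spell out) that csta is the lowest bit and cstb the highest bit of the field to clear. The name `clr_field` is made up for the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Clear bits csta..cstb (inclusive) of src2 and return the result. */
uint32_t clr_field(uint32_t src2, unsigned csta, unsigned cstb)
{
    assert(csta <= cstb && cstb < 32);
    unsigned width = cstb - csta + 1u;
    /* Build a run of 'width' ones without shifting by 32. */
    uint32_t ones = (width == 32u) ? 0xFFFFFFFFu : ((1u << width) - 1u);
    return src2 & ~(ones << csta);
}
```

For example, clearing bits 4 to 19 of 0x07A43F2A leaves 0x07A0000A.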

EXT – Extract and Sign-Extend a Bit Field

‡ Syntax

EXT (.unit) src2, csta, cstb, dst

or

EXT (.unit) src2, src1, dst
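A C sketch of the constant form of EXT, assuming the behaviour given in TI's instruction description: src2 is shifted left by csta, then shifted right by cstb with sign extension. The sign fill is written out by hand because right-shifting a negative signed value is implementation-defined in C. The name `ext_field` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Extract and sign-extend a bit field: (src2 << csta) >> cstb, where the
   right shift is arithmetic. The result is returned as a raw 32-bit pattern. */
uint32_t ext_field(uint32_t src2, unsigned csta, unsigned cstb)
{
    assert(csta < 32 && cstb < 32);
    uint32_t shifted = src2 << csta;      /* drop the bits above the field */
    uint32_t result = shifted >> cstb;    /* logical shift down */
    if ((shifted & 0x80000000u) && cstb > 0) {
        result |= ~(0xFFFFFFFFu >> cstb); /* fill the top bits with the sign */
    }
    return result;
}
```

For example, with src2 = 0x07A43F2A, csta = 10 and cstb = 19, the result is 0xFFFFF21F.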

Example: Dot Product

C

float dotp(float a[], float b[])
{
    int i;
    float sum = 0.0f;   /* the accumulator must start at zero */

    for (i = 0; i < 200; i++)
        sum += a[i] * b[i];

    return sum;
}

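Assembly

        MVK    .S1   200, A1        ; loop counter
        ZERO   .L1   A7             ; sum = 0
LOOP:   LDW    .D1   *A4++, A2      ; load first operand
        LDW    .D1   *A8++, A3      ; load second operand
        NOP    4                    ; load delay slots
        MPYSP  .M1   A2, A3, A6     ; product
        NOP    3                    ; multiply delay slots
        ADDSP  .L1   A6, A7, A7     ; sum += product
        SUB    .S1   A1, 1, A1      ; decrement counter
 [A1]   B      .S2   LOOP           ; loop while counter != 0
        NOP    5                    ; branch delay slots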

Double-Word Loading

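        MVK    .S1   100, A1        ; loop counter (two products per pass)
        ZERO   .L1   A7             ; partial sum on side A
||      ZERO   .L2   B7             ; partial sum on side B
LOOP:   LDDW   .D1   *A4++, A3:A2   ; load two floats from the first array
||      LDDW   .D2   *B4++, B3:B2   ; load two floats from the second array
        SUB    .S1   A1, 1, A1      ; decrement counter
        NOP    2
 [A1]   B      .S2   LOOP
        MPYSP  .M1x  A2, B2, A6
||      MPYSP  .M2x  A3, B3, B6
        NOP    3
        ADDSP  .L1   A6, A7, A7
||      ADDSP  .L2   B6, B7, B7
        ; branch occurs here
        NOP    3
        ADDSP  .L1x  A7, B7, A4     ; combine the partial sums (A4 assumed as destination)
        NOP    3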
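A rough cycle count for the two dot-product loops above can be sketched in C. The model is an assumption, not something stated in the slides: each execute packet costs one cycle, `NOP n` costs n cycles, and there are no memory stalls. Setup and epilogue code outside the loops is ignored. All names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Per-packet cycle costs of the single-word loop:
   LDW, LDW, NOP 4, MPYSP, NOP 3, ADDSP, SUB, B, NOP 5 */
static const int SINGLE_WORD_LOOP[9] = {1, 1, 4, 1, 3, 1, 1, 1, 5};

/* Per-packet cycle costs of the double-word loop:
   LDDW||LDDW, SUB, NOP 2, B, MPYSP||MPYSP, NOP 3, ADDSP||ADDSP */
static const int DOUBLE_WORD_LOOP[7] = {1, 1, 2, 1, 1, 3, 1};

/* Total cycles spent in a loop body repeated 'iterations' times. */
long loop_cycles(const int *packet_cycles, size_t n_packets, long iterations)
{
    assert(packet_cycles != NULL && iterations >= 0);
    long per_iteration = 0;
    for (size_t i = 0; i < n_packets; i++) {
        per_iteration += packet_cycles[i];
    }
    return per_iteration * iterations;
}
```

Under this model the single-word loop needs 18 cycles per pass (3600 cycles for 200 products), while the double-word loop needs 10 cycles per pass and handles two products each time (1000 cycles for the same 200 products).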

Optimization with Software Pipeline

Instructions in the first iteration

Assembly Code for Loop Kernel

The loop kernel can be done in one cycle!

Butterfly Diagram

[Figure: two 8-point radix-2 butterfly diagrams with twiddle factors w. The first takes inputs X(0) to X(7) in order and produces its outputs out of order (bit-reversed). The second takes its inputs out of order, X(0), X(4), X(2), X(6), X(1), X(5), X(3), X(7), and produces outputs y(0) to y(7) in order.]
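The out-of-order sequence 0, 4, 2, 6, 1, 5, 3, 7 in the butterfly diagram is the bit-reversed ordering of the indices 0 to 7. A small C sketch of that mapping follows; the function name `bit_reverse` is made up for the illustration.

```c
#include <assert.h>
#include <limits.h>

/* Reverse the lowest 'bits' bits of index (for an N-point FFT, bits = log2 N). */
unsigned bit_reverse(unsigned index, unsigned bits)
{
    assert(bits < sizeof(unsigned) * CHAR_BIT);
    unsigned reversed = 0;
    for (unsigned b = 0; b < bits; b++) {
        reversed = (reversed << 1) | (index & 1u);  /* move the LSB across */
        index >>= 1;
    }
    return reversed;
}
```

With bits = 3, indices 0 to 7 map to 0, 4, 2, 6, 1, 5, 3, 7, and applying the function twice returns the original index.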

Implementation in C

void DSPF_sp_cfftr2_dit(float* x, float* w, short n)
{
    short n2, ie, ia, i, j, k, m;
    float rtemp, itemp, c, s;

    n2 = n;
    ie = 1;

    for (k = n; k > 1; k >>= 1)            /* one pass per stage */
    {
        n2 >>= 1;
        ia = 0;
        for (j = 0; j < ie; j++)           /* one pass per sub-group */
        {
            c = w[2*j];
            s = w[2*j+1];
            for (i = 0; i < n2; i++)       /* one butterfly per pass */
            {
                m = ia + n2;
                rtemp     = c * x[2*m]   + s * x[2*m+1];
                itemp     = c * x[2*m+1] - s * x[2*m];
                x[2*m]    = x[2*ia]   - rtemp;
                x[2*m+1]  = x[2*ia+1] - itemp;
                x[2*ia]   = x[2*ia]   + rtemp;
                x[2*ia+1] = x[2*ia+1] + itemp;
                ia++;
            }
            ia += n2;
        }
        ie <<= 1;
    }
}

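Loop annotations for the C code:

‡ Outer loop (k): number of stages

‡ Middle loop (j): number of sub-groups

‡ Inner loop (i): number of elements in a cell

[Figure: structure of stage k of an N-point FFT. The stage has 2^(k-1) sub-groups in total. Each sub-group is split into cells of N/2^k elements, labeled as the even elements and the odd elements of the sub-group.]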
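The stage structure can be checked against the C implementation above with a small sketch. It assumes stages are numbered from 1, so that `ie` (the sub-group count) is 2^(k-1) and `n2` (the cell size) is N/2^k in stage k. The function name `stage_shape` is hypothetical.

```c
#include <assert.h>

/* Sub-group count and cell size for stage k (k = 1 is the first stage)
   of an n-point radix-2 FFT. They mirror ie and n2 in the C loop nest. */
void stage_shape(unsigned n, unsigned k, unsigned *subgroups, unsigned *cell_size)
{
    assert(subgroups != 0 && cell_size != 0);
    assert(k >= 1 && (n >> k) >= 1);
    *subgroups = 1u << (k - 1u);   /* ie: doubles after every stage */
    *cell_size = n >> k;           /* n2: halves at the start of every stage */
}
```

The product of the two values is n/2 in every stage, which is the number of butterflies the inner loops execute per stage.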

Implementation in Linear Assembly Code

        .global _DSPF_sp_cfftr2_dit
_DSPF_sp_cfftr2_dit .cproc A_xptr, B_wptr, A_n

        MV      A_n, A_n2                       ; init n2
        SHR     A_n, 1, A_n                     ; outer loop cntr
        MV      A_n, A_cnt                      ; inner loop cntr
oloop:
        SHR     A_n2, 1, A_n2                   ; n2>>1
        LDDW    *B_wptr, B_s:B_c                ; load s:c
        ADD     B_wptr, 8, B_w                  ; init w ptr
        MV      A_n2, A_i                       ; init ia
        MV      A_cnt, A_icntr                  ; init loop cntr
        SHL     A_n2, 3, A_8n2                  ; n2<<3
        ADDAD   A_xptr, A_n2, A_x               ; init load ptr
        ADDAD   A_xptr, A_n2, A_xs              ; init store ptr
        MV      A_xptr, B_x                     ; init load ptr
        MV      B_x, B_xs                       ; init store ptr
loop:
 [!A_i] ADD     A_x, A_8n2, A_x                 ; if(!i) A_x+=8n2
 [!A_i] ADD     B_x, A_8n2, B_x                 ; if(!i) B_x+=8n2
 [!A_i] LDDW    *B_w++, B_s:B_c                 ; if(!i) load s:c
 [!A_i] ADD     A_xs, A_8n2, A_xs               ; reset store ptr
 [!A_i] ADD     B_xs, A_8n2, B_xs               ; reset store ptr

        LDDW    *A_x++, A_x2mp1:A_x2m           ; load x[2m+1]:x[2m]

 [!A_i] MV      A_n2, A_i                       ; reset ia
 [A_i]  SUB     A_i, 1, A_i                     ; decr ia

        MPYSP   A_x2m, B_c, A_p1                ; p1=c*x[2m]
        MPYSP   A_x2m, B_s, B_p4                ; p4=s*x[2m]
        MPYSP   A_x2mp1, B_s, A_p2              ; p2=s*x[2m+1]
        MPYSP   A_x2mp1, B_c, A_p3              ; p3=c*x[2m+1]

        ADDSP   A_p1, A_p2, A_rtemp             ; rtemp=p1+p2
        SUBSP   A_p3, B_p4, B_itemp             ; itemp=p3-p4

        LDDW    *B_x++, B_x2iap1:B_x2ia         ; load x[2ia+1]:x[2ia]

        SUBSP   B_x2ia, A_rtemp, A_x2ms         ; x[2m]=x[2ia]-rtemp
        ADDSP   B_x2ia, A_rtemp, A_x2ias        ; x[2ia]=x[2ia]+rtemp

        SUBSP   B_x2iap1, B_itemp, B_x2mp1s     ; x[2m+1]=x[2ia+1]-itemp
        ADDSP   B_x2iap1, B_itemp, B_x2iap1s    ; x[2ia+1]=x[2ia+1]+itemp

        STW     A_x2ms, *A_xs++                 ; perform all stores
        STW     A_x2ias, *B_xs++
        STW     B_x2mp1s, *A_xs++
        STW     B_x2iap1s, *B_xs++

 [A_icntr] SUB  A_icntr, 1, A_icntr             ; decr loop cntr
 [A_icntr] B    loop                            ; branch inner

        SHR     A_n, 1, A_n                     ; half outer loop cntr
 [A_n]  B       oloop                           ; branch outer loop