Software Development Techniques for the TMS320C6201

Richard Scales

Abstract

The advancements in performance and flexibility of modern digital signal processor (DSP) devices are clearly demonstrated by the release of the TMS320C62xx family of DSPs from Texas Instruments. The TMS320C62xx is a high-performance Very Long Instruction Word (VLIW) DSP based on the VelociTI(TM) architecture. The need to support such an advanced device has fueled the need for development software that is equal to the task when designing high-performance systems. In order to extract optimum performance from the TMS320C62xx devices, it is necessary to use high level language (HLL) compilers that perform beyond the currently expected norm in the following areas:

    Code size, to allow greater use of on-chip memory
    Execution efficiency by algorithmic and functional optimizations
    Data throughput
    Utilization of on-chip features and functionality

Contents

    Problem
    Solution
    Conclusion

Figures

    Figure 1. TMS320C6201 Core
    Figure 2. TMS320C62xx Instruction Delay Slots

Tables

    Table 1. IIR Filter Benchmark Results
    Table 2. Tips for Optimizing TMS320C62xx Code

Digital Signal Processing Solutions, December 1998

Problem

To gain the maximum benefits from the development tools and the device, it is necessary for the programmer to become familiar with the functionality of both the hardware and the software, which can involve a steep learning curve; however, the development tools available for the TMS320C62xx ensure that the learning process is as smooth as possible. The author has been designing systems and working with these devices for the past year and has learned much about the challenges that will face programmers in the future. These challenges are described in this document, along with typical solutions and suggestions for future system implementation. Due to the complexity of the devices, and indeed of future devices, the trend toward high-level languages (HLL) will continue, and the overwhelming majority of future application programs will be fundamentally HLL-based, with assembly code used for time-critical sections.

Solution

The code that comprises typical DSP applications can usually be split into two major categories: signal-processing code and system-control code.
The system-control code is often not as time-critical as the signal-processing code, and the performance of pure ANSI C code is usually more than adequate. For the signal-processing code, however, timing is a factor, and it often benefits to a greater degree from closer examination; this paper focuses on this code in particular.

Typically, TMS320C62xx code will be generated using a top-down design technique, as follows:

    High-level ANSI C functionality
    Optimized C code, which can include intrinsic functions
    Code sections in linear assembly
    Optimized assembly for time-critical sections

To support the devices and development tools, Texas Instruments has introduced some programming concepts and techniques that will be new to many programmers. The techniques effectively increase the number of stages that the programmer must pass through in the quest for fully optimizing an algorithm. The techniques are designed to structure the whole process, and this will both reduce the initial design time and reduce the possibility of errors in the final code.

The algorithms presented here have been chosen because they cover some common DSP requirements, including some specifically associated with multidimensional-array-based operations like imaging. The examples provide some useful guidelines that can be applied to other applications and offer an enlightening demonstration of the required software development techniques. Each algorithm was benchmarked on a 120-MHz TMS320C6201-based PCI/C6200.

The algorithms described are:

    Infinite-Impulse Response (IIR) filter
    Vector addition
    Two-dimensional convolution

Before considering programming any microprocessor, it is necessary to fully understand the architecture of the device. The TMS320C62xx is a VLIW device, which can be viewed as a central processing core surrounded by peripheral devices that support the operation of the core. For code optimization purposes, the core is the vital component.
The TMS320C62xx core is shown in Figure 1.

Figure 1. TMS320C6201 Core (block diagram of the C6200 megamodule: program fetch, instruction dispatch, instruction decode, two data paths each with its own register file, control registers, control logic, test, emulation, interrupts)

The TMS320C62xx architecture incorporates two virtually identical data paths, each of which is capable of performing two 16-bit parallel multiply-accumulate operations per cycle. Each data path contains four independent functional units, sixteen general-purpose 32-bit registers, a 32-bit load/store path to memory, and a 32-bit cross path to the other data path.

The TMS320C62xx reads a 256-bit-wide (eight 32-bit instructions) instruction fetch packet; each fetch packet can contain between one and eight execute packets. An execute packet is simply one or more instructions that operate in parallel. Each instruction within the execute packet is then passed to the appropriate functional unit. A fetch packet executed as eight separate execute packets, i.e., as eight serial instructions, will take eight times as long as a single eight-instruction execute packet.

Each of the register banks incorporates execution units as follows:

    S Unit    Logical Unit With Shifter
    L Unit    Logical Unit
    D Unit    Data Unit
    M Unit    Multiply Unit

I.e., the TMS320C62xx has 2 multipliers and 6 ALUs.

The TMS320C62xx features a register-based architecture, with a load-store structure to the program code. Each register bank consists of 16 registers, and there are cross-paths between the register banks to allow cross-transfer of data. All instructions on the TMS320C62xx are conditional; the conditions are valid on a subset of the registers. Parallel instructions are indicated with the || symbol at the start of the command line.

The TMS320C62xx device uses a pipeline for parallel instruction execution. Of all the instructions, only three operations (multiply, load, and branch) experience delay slots, i.e., there is a delay before the result is written to the register file and before it is available to subsequent instructions. In cases where a single operation is being performed and there are no other instructions to execute during the delay slots, multicycle NOP instructions can be used to fill the delay slots, while minimizing code size.
Figure 2. TMS320C62xx Instruction Delay Slots

    Most instructions    No delay
    Integer multiply     1 delay slot
    Loads                4 delay slots
    Branches             5 delay slots (until the branch target reaches the execute stage)

The pipeline effects of the delay slots experienced by the three instructions mentioned are shown in Figure 2. The diagram shows that the majority of instructions complete in a single execute cycle (E1) and the others require additional delay cycles. The branch operation shows that the delay is the number of pipeline stages it takes for the branch target to reach the execute stage. The delays reduce the ability of the TMS320C62xx to issue a single instruction execute packet every clock cycle.

The TMS320C62xx devices incorporate a very rich orthogonal instruction set that is supported by a powerful ANSI C compiler. Many of the powerful TMS320C62xx instructions, however, particularly the 16-bit parallel operations that operate on separate halves of 32-bit words, are unsupported by the ANSI C standard. TI has, therefore, incorporated intrinsic functions within the C compiler to enable the TMS320C62xx instructions to be executed with no function call overhead.

As described earlier, the best approach to implementing an algorithm on the TMS320C62xx is a top-down approach, i.e., define the algorithm at the C source level and verify that the correct results are generated. Having proved the algorithm, it is then necessary to benchmark the performance and then optimize, where appropriate.

Most DSP operations require repeated processing of arrays of data, with the same mathematical operation performed on all the samples. The instructions performing the processing are repeated with maximum efficiency in a pipelined loop for all the samples; hence, they are referred to as the "piped-loop kernel" of the algorithm. For the analysis in this article, each algorithm will be developed using the top-down approach described, and the piped-loop kernel will be presented. The piped-loop kernel is often preceded by a prologue for initialization and followed by an epilogue to clear down; however, the kernel is the central part of the algorithm that processes the majority of the data, and it is here that optimization is most critical.
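The prologue, kernel, and epilogue structure described above can be illustrated in portable C. The following is a minimal sketch rather than compiler output: the function names scale_ref and scale_piped, the three-stage split (load, multiply, store), and the Q15 gain are assumptions chosen for illustration, and the piped version assumes n is at least 2.

```c
#include <stddef.h>

/* Straightforward reference loop: y[i] = (x[i] * gain) >> 15 (Q15 scaling). */
void scale_ref(const short *x, short *y, size_t n, short gain)
{
    size_t i;
    for (i = 0; i < n; i++)
        y[i] = (short)(((int)x[i] * gain) >> 15);
}

/* The same computation written as a hand-pipelined loop.
 * Hypothetical illustration only; assumes n >= 2. */
void scale_piped(const short *x, short *y, size_t n, short gain)
{
    size_t i;
    short loaded;
    int product;

    /* Prologue: fill the pipeline before the kernel starts. */
    loaded  = x[0];                /* load for iteration 0     */
    product = (int)loaded * gain;  /* multiply for iteration 0 */
    loaded  = x[1];                /* load for iteration 1     */

    /* Kernel: each pass stores iteration i, multiplies iteration i+1,
     * and loads iteration i+2, so three iterations are in flight. */
    for (i = 0; i + 2 < n; i++) {
        y[i]    = (short)(product >> 15);
        product = (int)loaded * gain;
        loaded  = x[i + 2];
    }

    /* Epilogue: drain the two iterations still in flight. */
    y[n - 2] = (short)(product >> 15);
    product  = (int)loaded * gain;
    y[n - 1] = (short)(product >> 15);
}
```

On a VLIW device the three statements in the kernel are independent of one another and can be issued in the same execute packet, which is the reason the kernel dominates the cycle count for long arrays.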
The first algorithm analyzed is an Infinite-Impulse Response (IIR) filter, which is defined by the following C code:

    void iir(const short *coefs, const short *input, short *optr, short *state)
    {
        short x;
        short t;
        int n;

        x = input[0];
        for (n = 0; n < 50; n++)
        {
            t = x + ((coefs[2] * state[0] + coefs[3] * state[1]) >> 15);
            x = t + ((coefs[0] * state[0] + coefs[1] * state[1]) >> 15);
            state[1] = state[0];
            state[0] = t;
            coefs += 4;     /* point to next filter coefs */
            state += 2;     /* point to next filter states */
        }
        *optr++ = x;
    }

The assembly code for the piped-loop kernel, as produced by the C compiler, is:

    PIPED-LOOP KERNEL
    B4,15,B4 A3,15,A5 .M2X B6,A5,B6 *+A6(16),A4 *+B7(10),B6
    .M1X .M2X .M1X .L1X A0,A5,A0 B6,A3,A3 B5,A4,B5 *+A6(22),A3 *+B7(8),B5
    A0,16,16,A0 B5,*+B7(6) B5,A3,A4 *+A6(20),A3 8,A6,A6 A0,*B7++(4)
    A0,B4,A0 B0,1,B0 B6,B5,B4 A0,16,16,A0 A3,A4,A3 *+A6(18),A5 ;@@@

The results show that the execute packets contain either four or five parallel instructions; hence, not all the TMS320C62xx processing units are fully utilized, and there is a possibility of optimizing the performance of this code. The @ characters in the comments specify the iteration of the loop that an instruction is on in the software pipeline, and they are automatically generated by the tools. For example, while some instructions are executing iteration j of the loop, others are executing iteration j+1, and others are executing iteration j+2. The scheduling of the iteration instructions within the piped loop is a result of the prologue leading up to the execution of the piped-loop kernel.

If processing 16-bit data, then the first level of optimization will be to utilize the 32-bit external bus to increase data rates through the core by performing parallel 16-bit reads as a single 32-bit word.
The 16-bit data is then processed using the TMS320C62xx _mpy and _mpyhl operations, which are accessed as C-level intrinsic functions, as shown in the following code:

    void iir(const int *coefs, const short *input, short *optr, short *state)
    {
        short x;
        short t;
        int n;

        x = input[0];
        for (n = 0; n < 50; n++)
        {
            t = x + ((_mpy(coefs[1], state[0]) + _mpyhl(coefs[1], state[1])) >> 15);
            x = t + ((_mpy(coefs[0], state[0]) + _mpyhl(coefs[0], state[1])) >> 15);
            state[1] = state[0];
            state[0] = t;
            coefs += 2;
            state += 2;
        }
        *optr++ = x;
    }

The assembly code for the piped-loop kernel, as produced by the C compiler, is:

    PIPED-LOOP KERNEL
    MPYHL MPYHL .M2X .L1X .L1X .M1X B7,B8,B7 A0,A3,A0 B6,B9 A5,*+A4(6) *B5++(8),B8
    B7,15,B7 A0,16,16,A0 B0,1,B0 B8,A5,B8 B6,A3,A3 *+B4(14),B6
    A0,B7,A6 B8,B9,B7 A3,15,A3 *+B5(4),B7 *+A4(12),A5 4,B4,B4
    A0,*A4++(4) A6,16,16,A0 B7,B6,B6 B7,A5,A3 ;@@@ ;@@@ ;@@@

The assembly code shows that the coefficients are loaded two at a time, as single 32-bit operations. The parallel loads optimize data efficiency but require that the coefficients are contiguous in memory, although this is not usually a problem in DSP applications. The results also show that the piped-loop kernel has been reduced to four instruction fetch packets, which is the most efficient implementation of the algorithm that is possible using pure C code.

To optimize the code further, it is necessary to use linear assembly code. Linear assembly is similar to regular TMS320C62xx assembly code, in that TMS320C62xx instructions are used to write the code; however, it frees the programmer from some of the time-consuming aspects of pure assembly code programming and, hence, shortens development time drastically. In linear assembly code, the programmer can specify some, or all, of the information required, or he/she can allow the assembly optimizer to specify it. Information such as register usage, functional unit, and more can be omitted during a first-pass approach, and then more detail can be added to further control resource allocation and fully utilize the device.
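For readers who want to check intrinsic-based C code on a host machine before moving to the target, the half-word multiplies used in this section can be modelled in portable C. This is a hedged sketch: the function names (low_half, high_half, mpy_model, and so on) are hypothetical, and the models assume the usual signed 16x16 behaviour of the intrinsics, with each one selecting the lower or upper half of its 32-bit operands.

```c
#include <stdint.h>

/* Signed value of the lower 16 bits of a packed 32-bit word. */
int32_t low_half(uint32_t w)
{
    int32_t v = (int32_t)(w & 0xFFFFu);
    return v >= 0x8000 ? v - 0x10000 : v;
}

/* Signed value of the upper 16 bits of a packed 32-bit word. */
int32_t high_half(uint32_t w)
{
    int32_t v = (int32_t)(w >> 16);
    return v >= 0x8000 ? v - 0x10000 : v;
}

/* Host-side models (assumed semantics, hypothetical names):
 *   mpy   : low(a)  * low(b)
 *   mpyh  : high(a) * high(b)
 *   mpyhl : high(a) * low(b)
 *   mpylh : low(a)  * high(b)                                  */
int32_t mpy_model(uint32_t a, uint32_t b)   { return low_half(a)  * low_half(b);  }
int32_t mpyh_model(uint32_t a, uint32_t b)  { return high_half(a) * high_half(b); }
int32_t mpyhl_model(uint32_t a, uint32_t b) { return high_half(a) * low_half(b);  }
int32_t mpylh_model(uint32_t a, uint32_t b) { return low_half(a)  * high_half(b); }
```

With two coefficients packed into one word, one 32-bit load feeds two multiplies, which is why the intrinsic version of the filter needs the coefficients to be contiguous in memory.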
The following linear assembly code shows how the function can be implemented and, also, how the optional parameters can be utilized:

    .def _iir _iir3 .cproc cptr0,sptr0
    .reg cptr1, s01, s10, s23, c10, c32, s10_s, s10_t
    .reg s23_s, mask, sptr1, s10p,
    LOOP: .trip MPYH s23_s,x,t t,mask,t MPYH s10_s,t,x StateAddr[0] [ctr] [ctr]
    s10p,16,s1 t,s1,s01 s01,*sptr1 c10,s10,p0 c10,s10,p1 p0,p1,s10_t s10_t,15,s10_s
    clear upper bits CoefAddr[0] StateAddr[0] CoefAddr[1] StateAddr[1]
    CA[0] SA[0] CA[1] SA[1] (CA[0] SA[0] CA[1] SA[1])
    .D1T1 *cptr0,c32 .D2T2 *cptr1,c10 .D1T2 *sptr0,s10 s10,s10p
    c32,s10,p2 c32,s10,p3 p2,p3,s23 s23,15,s23_s
    coefAddr[3] CoefAddr[2] CoefAddr[1] CoefAddr[0] StateAddr[1] StateAddr[0]
    save StateAddr[1] StateAddr[0] CoefAddr[2] StateAddr[0] CoefAddr[3] StateAddr[1]
    CA[2] SA[0] CA[3] SA[1] (CA[2] SA[0] CA[3] SA[1])
    cptr0,cptr1 sptr0,sptr1 50,ctr setup loop counter
    StateAddr[1] StateAddr[0] StateAddr[0] store StateAddr[1]
    .endproc -1,ctr,ctr LOOP outer cntr Branch outer loop

The linear assembly is passed through the linear assembler, and the resultant assembly code for the piped-loop kernel is:

    PIPED-LOOP KERNEL
    .M1X B3,B7,B0 B0,B8,B8 A4,A5,A4 B2,B1,B8 A0,B1,A4 *B6,B2 *A3,A0
    clear upper bits CA[0] SA[0] CA[1] Branch outer loop CA[2] SA[0] CA[3]
    CoefAddr[1] CoefAddr[2] ;@@@@ CoefAddr[1] ;@@@@ coefAddr[3] SA[1] SA[1]
    MPYH StateAddr[1] StateAddr[0] CoefAddr[0] CoefAddr[2]
    B4,B0,B9 B0,B9,B0 B8,0xf,B4 SA[1]) A4,0xf,A5 SA[1]) B2,B1,B0 StateAddr[0]
    MPYH .M1X A0,B1,A5 StateAddr[1] *A6,B1 StateAddr[0] StateAddr[0]
    (CA[0] SA[0] CA[1] (CA[2] SA[0] CA[3] CoefAddr[0] CoefAddr[3] ;@@@@
    StateAddr[1] B0,*A7 store StateAddr[1] StateAddr[0] B5,0x10,B9 StateAddr[1] StateAddr[0]
    .L2X B9,A5,B3 0xffffffff,A1,A1 outer cntr B1,B5 save StateAddr[1] StateAddr[0]

The piped-loop kernel has been reduced to three instruction fetch packets, and it is clear that, with eight instructions in a fetch packet, there is no further room for optimization.
The pipelined code also shows that all eight TMS320C62xx processing units (.Dx, .Sx, .Mx, .Lx) are almost fully utilized. Another benefit of the assembly optimizer, as shown above, is that it puts the original comments in the scheduled output, so it is easy to see what is going on in the code.

The following table shows the results of the different levels of optimization of the IIR filter:

Table 1. IIR Filter Benchmark Results

    Development Technique         Number of Cycles
    ANSI C
    C with intrinsic functions
    Linear assembler

The next algorithm analyzed is a vector addition operation, which is defined by the following C code:

    void Add(short *x1, short *x2, short *y, short count)
    {
        short i;

        for (i = 0; i < count; i++)
            y[i] = x1[i] + x2[i];
    }

It is clear from the source code that this operation requires three external memory accesses per sample (two reads and one write). The assembly code produced for this operation will not be able to execute in a single instruction cycle because the TMS320C62xx has only two units for loading and storing data. The assembly code for the piped-loop kernel produced by the C compiler is thus:

    L31: PIPED LOOP KERNEL
    .L1X B4,A0,A5 *A3++,A0 A5,*A4++ B0,1,B0 *B5++,B4

This function executes one addition operation every two instruction cycles, which suggests that there might be room for improvement by better utilizing the currently unbalanced addition resources. The routine can be rewritten using intrinsic functions to perform parallel additions as follows:

    void Add(short *x1, short *x2, short *y,
             short *x1a, short *x2a, short *ya, short count)
    {
        short i;

        for (i = 0; i < count; i++)
        {
            y[i]  = _add2(x1[i], x2[i]);
            ya[i] = _add2(x1a[i], x2a[i]);
        }
    }

The code produced by the C compiler is now:

    L13: PIPED LOOP KERNEL
    ADD2 ADD2 .S2X .S1X A3,B5,B6 *A4++,A5 *B7++,B5
    B5,A5,A3 B6,*B4++ *A0++,A3
    A3,*A6++ B0,1,B0 *B8++,B5

The above code calculates four data samples in parallel. The assembly code above shows that the version with balanced addition resources executes four addition operations every three cycles, which is a performance improvement of 167%. This implementation uses 32-bit load and store operations, in preference to 16 bits; this halves the memory bandwidth requirement and is 100% optimized with respect to load and store operations.
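The packed addition used in the vector example can be checked on a host with a small portable model. The sketch below is illustrative only: add2_model, pack2, and add_packed are hypothetical names, and the model assumes the packed add treats the two 16-bit halves independently, with no carry passing from the lower half into the upper half.

```c
#include <stdint.h>

/* Pack two 16-bit samples into one 32-bit word (hi in bits 31..16). */
uint32_t pack2(int16_t hi, int16_t lo)
{
    return ((uint32_t)(uint16_t)hi << 16) | (uint32_t)(uint16_t)lo;
}

/* Model of a packed 16-bit add: each half wraps on its own and
 * nothing carries from the lower half into the upper half. */
uint32_t add2_model(uint32_t a, uint32_t b)
{
    uint32_t lo = (a + b) & 0xFFFFu;
    uint32_t hi = (((a >> 16) + (b >> 16)) & 0xFFFFu) << 16;
    return hi | lo;
}

/* Vector add over packed pairs: one 32-bit access moves two samples. */
void add_packed(const uint32_t *x1, const uint32_t *x2,
                uint32_t *y, int n_pairs)
{
    int i;
    for (i = 0; i < n_pairs; i++)
        y[i] = add2_model(x1[i], x2[i]);
}
```

The 167% figure follows from the cycle counts: one addition every two cycles is 0.5 additions per cycle, four additions every three cycles is about 1.33 per cycle, and 1.33 / 0.5 is about 2.67, i.e., an improvement of roughly 167% over the original loop.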
The TMS320C62xx is ideally suited to image-processing applications because the 16-bit operations give optimum performance and processing headroom for the pixel word width.

The convolution operation is defined by the following C code:

    void Conv3x3(short row0[], short row1[], short row2[], short y[])
    {
        short i;

        for (i = 0; i < width - 2; i++)
            y[i] = row0[i]*kernel[0][0] + row0[i+1]*kernel[0][1] + row0[i+2]*kernel[0][2]
                 + row1[i]*kernel[1][0] + row1[i+1]*kernel[1][1] + row1[i+2]*kernel[1][2]
                 + row2[i]*kernel[2][0] + row2[i+1]*kernel[2][1] + row2[i+2]*kernel[2][2];
    }

This implementation requires a number of cycles to process each pixel; however, it can be rewritten as follows:

    void Conv3x3(short row0[], short row1[], short row2[], short y[])
    {
        short i;

        for (i = 0; ...; i++)
        {
            acc1  = _mpy  (row0[i], a00) + _mpyh (row0[i],   a00) + _mpy  (row0[i+1], a02);
            acc2  = _mpyhl(row0[i], a00) + _mpylh(row0[i+1], a00) + _mpyhl(row0[i+1], a02);
            acc1 += _mpy  (row1[i], a10) + _mpyh (row1[i],   a10) + _mpy  (row1[i+1], a12);
            acc2 += _mpyhl(row1[i], a10) + _mpylh(row1[i+1], a10) + _mpyhl(row1[i+1], a12);
            acc1 += _mpy  (row2[i], a20) + _mpyh (row2[i],   a20) + _mpy  (row2[i+1], a22);
            acc2 += _mpyhl(row2[i], a20) + _mpylh(row2[i+1], a20) + _mpyhl(row2[i+1], a22);
            *y++ = acc1;
            *y++ = acc2;
            row0++;
            row1++;
            row2++;
        }
    }

This implementation of the algorithm requires fewer instruction cycles to calculate the results for two pixels, which allows a performance increase; again, parallel data loads and stores improve the bandwidth requirements. In this application, the TMS320C62xx core is 100% optimized, since both multipliers are used on every cycle.

It is clear from the tests performed that the C compiler optimizer provides a good first pass toward optimum code development, improving the performance of the C code, and it is very successful at eliminating redundant variables and repeated memory accesses. The optimizer also enables gains by pipelining loops. In many instances, all eight parallel instruction slots are used. In order to achieve this high performance, the compiler must have certain knowledge of the loop, such as memory dependencies and the minimum loop-trip count.
These are both documented in the TMS320C62xx programmer's guide. When the compiler cannot ensure that the trip count is large enough to pipeline a loop for maximum performance, a pipelined and a nonpipelined version of the same loop are generated. The C compiler provides a statement (_nassert) and the assembly optimizer includes a directive (.trip) for indicating the minimum number of iterations of a loop for this purpose.

Performance gains can often be made by unrolling loops, which can optimize the data-load bandwidth and balance resources. Loop unrolling helps many of the smaller code loops; however, it can lead to a significant increase in code size when the loop contains a large number of instructions. The loop must not contain conditional breaks or function calls, although these can be inlined.

In applications where the TMS320C62xx execution units are not fully utilized, it is often possible to process separate data-flow streams in parallel. This separates the data dependencies and allows greater performance.

Table 2. Tips for Optimizing TMS320C62xx Code

    Use internal memory
    Pointers do not necessarily beat arrays
    Use intrinsics where possible
    Use 32-bit loads and stores, if possible
    Try unrolling loops (if short)
        Separates data dependencies
        Balances resources
    Experiment!

Conclusion

This article has shown how the Texas Instruments TMS320C6201 is supported by a new generation of development tools. The compilers and assemblers fully utilize the on-chip features and functionality of the device; good examples include the packing of instructions into fetch packets to enable greater use of on-chip memory, and also the algorithmic and functional optimizations that optimize performance and data throughput. Once the programmer leaves C, linear assembly, as shown in the first example, is an excellent language for developing highly optimized routines, and the time savings over using pure assembly code are enormous. There are cases where resorting to pure assembler is essential; however, it is often advantageous to use linear assembly as a starting point and then modify the output to save valuable coding time. A final benefit of using linear assembly is that it provides virtually 100% code portability to future devices in the family and beyond, which will ensure that the optimized code developed now will not become redundant in a few years' time.
It is clear that future generations of devices will require engineers to embrace new programming techniques and disciplines. The techniques described in this application report cover several of the trade-offs between computational and memory requirements. The next generation of C compilers for DSPs will make life a great deal easier by the addition of such features as intrinsic functions. The choice of optimization techniques used in the development of a particular project will depend entirely on the final system requirements.

References

    Oppenheim, A. V., and Schafer, R. W., Discrete-Time Signal Processing, Prentice-Hall, 1989.
    Loughborough Sound Images Inc., PCI/C6200 User's Guide, Loughborough Sound Images Inc., 1997.
    Loughborough Sound Images Inc., PCI/C6200 Technical Reference Manual, Loughborough Sound Images Inc., 1997.
    Texas Instruments Inc., TMS320C62xx Instruction Set, Texas Instruments Inc., 1997.
    Texas Instruments Inc., TMS320C6x Optimizing C Compiler User's Guide, Texas Instruments Inc., 1997.
    Texas Instruments Inc., TMS320C6x Assembly Language Tools User's Guide, Texas Instruments Inc., 1997.
    Texas Instruments Inc., TMS320C62xx Programmer's Guide, Texas Instruments Inc., 1997.
IMPORTANT NOTICE

Texas Instruments (TI) reserves the right to make changes to its products or to discontinue any semiconductor product or service without notice, and advises its customers to obtain the latest version of relevant information to verify, before placing orders, that the information being relied on is current and complete.

TI warrants performance of its semiconductor products and related software to the specifications applicable at the time of sale in accordance with TI's standard warranty. Testing and other quality control techniques are utilized to the extent TI deems necessary to support this warranty. Specific testing of all parameters of each device is not necessarily performed, except those mandated by government requirements.

Certain applications using semiconductor products may involve potential risks of death, personal injury, or severe property or environmental damage ("Critical Applications").

TI SEMICONDUCTOR PRODUCTS ARE NOT DESIGNED, INTENDED, AUTHORIZED, OR WARRANTED TO BE SUITABLE FOR USE IN LIFE-SUPPORT APPLICATIONS, DEVICES OR SYSTEMS OR OTHER CRITICAL APPLICATIONS.
Inclusion of TI products in such applications is understood to be fully at the risk of the customer. Use of TI products in such applications requires the written approval of an appropriate TI officer. Questions concerning potential risk applications should be directed to TI through a local sales office.

In order to minimize risks associated with the customer's applications, adequate design and operating safeguards should be provided by the customer to minimize inherent or procedural hazards.

TI assumes no liability for applications assistance, customer product design, software performance, or infringement of patents or services described herein. Nor does TI warrant or represent that any license, either express or implied, is granted under any patent right, copyright, mask work right, or other intellectual property right of TI covering or relating to any combination, machine, or process in which such semiconductor products or services might be or are used.

Copyright 1998, Texas Instruments Incorporated

TI is a trademark of Texas Instruments Incorporated. Other brands and names are the property of their respective owners.