Using the TMS320C6x for Non-Traditional Applications

Mathew George, (Joe) Mohsen Khayami
Digital Signal Processing Solutions

Abstract

The Texas Instruments (TI(TM)) TMS320C6x digital signal processor (DSP) architecture, with its RISC-like instruction set, flexible parallelism, and conditional execution, can be used in nontypical applications ranging from microcontroller-type to FPGA/ASIC/data flow-type tasks. This paper uses code examples to explore ways to efficiently handle bit manipulation, address manipulation, and dataflow configurations. In addition, this document includes an example table lookup benchmark and a system architecture discussion of data input/output.

Contents

Introduction
C6x CPU/Instruction Features With Code Examples
    Bit Manipulation
    Address Manipulation
    Decision Execution (Conditionally Execute Advantages Over Test/Branch)
Application Example
    Table Lookup Example Description
    Table Lookup Example Code
System Discussion - C6x DMAs and Data (Eliminate Components)
Conclusion
Appendix. Table Lookup Code
    ipp.c
    iploop.sa
    ipp.cmd
    ipploop.asm (tool generated)
    ipptab.asm

Digital Signal Processing Solutions 1999

Figures

Figure 1. Clear/Set/Toggle Example
Figure 2. Clear/Set/Toggle Code
Figure 3. Byte-Swap Example
Figure 4. Byte-Swap Code
Figure 5. Table Parsing Example
Figure 6. Table Parsing Code
Figure 7. Link List Example
Figure 8. Link List Code
Figure 9. Decision Execution Concepts
Figure 10. Decision Execution Example (Comparator)
Figure 11. Decision Execution Code (Bit Test/Branch)
Figure 12. Decision Execution Code (Conditional Execute)
Figure 13. Decision Execution Code (Conditional Execute in Parallel)
Figure 14. Decision Execution Code (Conditional Execute in Parallel-II)
Figure 15. Decision Execution Code (Conditional Execute - Software Pipelined)
Figure 16. Table Lookup Example Description
Figure 17. Table Lookup Example Code Initialization
Figure 18. Primary Lookup Table Loop in Linear Assembly
Figure 19. Primary Lookup Table Loop in Pure Assembly
Figure 20. (... C6x) Architecture
Figure 21. Architecture (Size)
Figure 22. Architecture (Speed)
Figure 23. C6202 Architecture With Second ...

Introduction

The TMS320C6x is not a traditional DSP, even though it handles the traditional applications, such as filtering, FFTs, and vocoders, that other DSPs do. It also includes a variety of additional features that make it attractive for non-standard applications. These applications include, but are not limited to:

    - Microcontroller-style bit manipulation ("bit banging," as it is often called), with instructions that in some cases perform even better than on microcontrollers (in a single 5-ns cycle)
    - Byte addressability
    - Address manipulation (with improved results over the C3x/C4x)
    - Dynamic operations (performed as well as on the C3x/C4x)
    - An efficient "conditionally execute" method in place of the classic test/branch seen in "controller"-type housekeeping code
    - The ability to replace an FPGA/ASIC with a "dataflow" style of design, and innovative tools to develop these operations easily
    - Elegant four-channel DMA (direct memory access) data movement

This application report examines various aspects of their implementation in code and hardware. Specific architectural features are described and accompanied by code examples. The document includes an application example and discusses the tools that assist in the process of optimization. An elegant hardware architecture is also presented for data movement and processing. The information presented in this document should encourage an appreciation of the power of this DSP.

C6x CPU/Instruction Features With Code Examples

We examine various non-traditional features of the architecture through instructions that the C6x does well and that traditional DSPs often do not. A graphic example of the concept and/or application, followed by a code segment, is provided for each. Note that the code examples are authentic (assembled and run on the simulator) assembly code, but most are UNOPTIMIZED and meant purely for descriptive, academic purposes. The features are classified into three descriptive groups: bit manipulation, address manipulation, and decision execution. These features are important in enabling the C6x to optimally execute some of these non-traditional functions.
Bit Manipulation

This section examines the types of bit manipulation done in both microcontroller-type and ASIC/FPGA-type applications. In microcontroller-type applications, registers are often manipulated to control peripherals and perform housekeeping functions. In ASIC/FPGA-type applications, fast data streams are often manipulated. Note that bit manipulation is usually done on data, not addresses (for more information, see the section, Address Manipulation).

Clear/Set/Toggle

Figure 1 shows the value in a register being set, cleared, or toggled, and then placed in another register. This value might have been loaded from a register or be part of a data stream.

Figure 1. Clear/Set/Toggle Example
Bit manipulation: clear/set/toggle with a single-cycle instruction.
    Clear:   1234h -> 0000h
    Set:     1234h -> 0ffffh
    Toggle:  5555h -> 0aaaah

The code that corresponds to each of these operations is shown in Figure 2.

Figure 2. Clear/Set/Toggle Code
Bit manipulation: set/clear/toggle with a single-cycle instruction*. Typical bit banging.
Listing bitbang (.text), operating on bbdata: .word 012345555h (.data):
    MVK/MVKH bbdata      ; Initialize pointer with MVK/MVKH**
    load through *A15    ; Load value, bbdata=12345555h
    CLEAR                ; upper byte of upper halfword: 12345555h -> 00345555h
    SET                  ; lower byte of upper halfword: 00345555h -> 00ff5555h
    TOGGLE               ; all bits: 00ff5555h -> 0ff00aaaah
*See the TMS320C62xx Instruction Reference Guide, 3-42, 117.
**See the TMS320C62xx Instruction Reference Guide, 3-77, 3-80.

The three operations are shown in bold in Figure 2. Each instruction is executed in a single cycle. In the example, the address of (pointer to) bbdata, where bbdata is located in the .data section, is loaded into a register using the MVK/MVKH instructions. The value of bbdata is loaded into a register using a load instruction, with *A15 acting as the pointer register. Pipeline considerations are not discussed here.

NOTE: Remember that most of the following code examples are UNOPTIMIZED.

SET/CLR is accomplished by specifying from which bit to which bit the field needs to be set or cleared. The from and to bit values are specified here as constants (and fit well in the opcode, because 5-bit constants fit well in an opcode). With XOR, all of the bits are toggled because the constant value "-1" is sign-extended before the operation is done.
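As a rough C model of the three single-cycle operations just described, the sketch below reproduces the values quoted for bbdata. The helper names (field_mask, clr_field, set_field, toggle_all) are illustrative only and are not TI intrinsics; the bit ranges 24-31 and 16-23 are assumptions chosen to match the "upper byte" and "lower byte of the upper halfword" comments.

```c
#include <assert.h>
#include <stdint.h>

/* Build a mask with bits lo..hi set (inclusive, bit 0 = LSB).
   This mirrors the "from which bit to which bit" form of SET/CLR. */
uint32_t field_mask(unsigned lo, unsigned hi)
{
    assert(lo <= hi && hi <= 31);
    return (uint32_t)(0xFFFFFFFFu >> (31u - hi)) & (uint32_t)(0xFFFFFFFFu << lo);
}

/* Clear bits lo..hi of v. */
uint32_t clr_field(uint32_t v, unsigned lo, unsigned hi)
{
    return v & ~field_mask(lo, hi);
}

/* Set bits lo..hi of v. */
uint32_t set_field(uint32_t v, unsigned lo, unsigned hi)
{
    return v | field_mask(lo, hi);
}

/* Toggle every bit: XOR with the sign-extended constant -1. */
uint32_t toggle_all(uint32_t v)
{
    return v ^ 0xFFFFFFFFu;
}
```

Starting from 12345555h, clr_field(v, 24, 31), then set_field(v, 16, 23), then toggle_all(v) walks through the same three intermediate values listed for Figure 2.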
Note that the SET, CLR, and XOR instructions can also be executed using a mask held in a register (and hence in real time, dynamically, if needed). These operations are standard on microcontrollers and well supported by other TMS320 DSPs. The C2xx requires the accumulator and the C54x requires one of its accumulators, making them unavailable for other operations. A PLU (parallel logic unit), where present, directly manipulates data, thus off-loading the accumulator and offering an advantage over other fixed-point processors. Even so, it does not have SET/CLR. In the TMS320 family, only the parallel processor (PP) offers improved performance on these types of operations.

Byte-Swapping

Byte swapping is a classic operation going back to the Intel and Motorola models of big-endian and little-endian. Usually very difficult in software, it was made much easier by the shifters on other TMS320 DSPs. The byte-swapping operation is shown in Figure 3.

Figure 3. Byte-Swap Example
Bit manipulation: byte swap done with the "extract" instruction, EXT/EXTU.
    Byte swap:  1234h -> 3412h

Without a shifter, this operation must be accomplished in pure software. Multiplication (which is slow) is most probably required, along with some masking and addition. In the case of the C6x, the EXT instruction makes it even easier by letting the programmer actually pick the contiguous bits he wants to manipulate. This powerful feature is implemented in an interesting way, with two shifts in one cycle. (See Figure 4 for the code.)

Figure 4. Byte-Swap Code
Bit manipulation: byteswap using the EXTU (extract unsigned) instruction.
Listing byteswap (.text):
    MVK/MVKH bbdata      ; Initialize pointer
    load through *A15    ; Load value, bbdata=12345555h
    EXTU                 ; Extract "34": 12345555h -> 00000034h
    EXTU                 ; Extract "12": 12345555h -> 00000012h
    shift**              ; Shift/align to make "3400": 00000034h -> 00003400h
    add                  ; Swap by adding: 00003400h + 00000012h -> 00003412h
    store through *A15   ; Store "3412"
Byte swapping (EXTU* does two shifts in one cycle). Dynamic EXTU (with registers) is shown in the Table Lookup example.
*See the TMS320C62xx Instruction Reference Guide, 3-55.
**See the TMS320C62xx Instruction Reference Guide, 3-94.
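The byte swap of Figure 4 can be modelled in C as a sanity check on the extract-then-combine idea. This is a minimal sketch, not the listing itself: extu() imitates the shift-left-then-shift-right behaviour of the extract instruction, and the constant pairs (8, 24) and (0, 24) are assumptions derived from that behaviour, since the operands of the original listing are not shown here.

```c
#include <assert.h>
#include <stdint.h>

/* Model of an unsigned extract with constant operands: shift left by
   csta to discard the upper bits, then shift right (logically) by cstb
   to discard the lower bits and right-justify the field. */
uint32_t extu(uint32_t src, unsigned csta, unsigned cstb)
{
    assert(csta <= 31 && cstb <= 31);
    return (uint32_t)(src << csta) >> cstb;
}

/* Swap the two bytes of the upper halfword ("1234h -> 3412h"). */
uint32_t swap_upper_halfword_bytes(uint32_t v)
{
    uint32_t b34 = extu(v, 8, 24);   /* 12345555h -> 00000034h */
    uint32_t b12 = extu(v, 0, 24);   /* 12345555h -> 00000012h */
    return (b34 << 8) + b12;         /* 00003400h + 00000012h  */
}
```

Note that cstb counts the bits dropped on the right plus the csta bits already dropped on the left, which is why extracting the second byte uses 24 rather than 16.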
In Figure 4, after the pointer and value are loaded as in the last example's code, EXTU (the U means unsigned) pulls out the appropriate bits, specified in this case by constants. The first bolded EXTU pulls out "34" and saves it in a register, while the second bolded EXTU pulls out "12" and saves it in a register. The "34" is then left-shifted to make "3400" and added to "12". On other processors without this instruction, the values must be masked off.

When using the instruction, the first number denotes how many bits on the left to throw out. A slight wrench in the system is the fact that the second number denotes how many bits on the right to throw out PLUS how many bits on the left were already thrown out. That's right: you must specify the bits on the left twice. This is because the operation is accomplished with two shifts in one cycle. After the shift left to throw out the bits on the left, you must shift right the same distance to return to the original position before you start shifting right, if you want a right-justified answer in the destination register. (In some operations, such as an optimized byte swap, you might not want the answer right justified.) This instruction is graphically explained in the TMS320C62xx Instruction Reference Guide, 3-55. Again, note that the value can be dynamic, in a register (this is shown in the Application Example section).

Again, remember that this example is UNOPTIMIZED. (This operation can be accomplished in two cycles rather than the four used here by parallelizing the EXTUs, with one of them not right justifying "34", in the first cycle and then adding in the second cycle.)

Data is not the only item that might need to be manipulated. Addresses also often need some manipulation, as discussed in the following section.

Address Manipulation

In the Bit Manipulation section, we said that bit manipulation is often performed on data. Theoretically, you can perform bit manipulation on addresses also, by treating them as data. This section describes how the C6x treats addresses. Most processors provide specific, separate address registers to allow "pointer"-type address manipulation. In contrast, on the C6x, the "general-purpose" registers are used as address/pointer registers.

Table Parsing

Table parsing/lookup is important to allow a base-pointer register to be set up, from which offsets are applied to jump through the table.
The C3x/C4x does this well, but only the C54x came close among the fixed-point processors, in that its constant (immediate) modify is usually only good for stacks. Figure 5 shows a contrived example of a table lookup summing some Pythagorean triples. The base register is set with the address "8000h" and the offset is indicated by the value in "[]".

Figure 5. Table Parsing Example
Address manipulation: table parsing (base address + offset) in a single cycle (and using byte addressability). A little-endian (le) byte table occupies addresses 8000h through 8008h, with the base register holding 8000h.

Although this example is not complex, it shows a byte-addressing capability that few other DSPs support (and then in dynamic memory only). The corresponding code itself is shown in Figure 6.

Figure 6. Table Parsing Code
Address manipulation: table parsing (base address + offset)* in a single cycle. Table lookup (base address + offset) of Pythagorean triples, c^2=a^2+b^2 calculation.
Listing tablel (.text):
    MVK/MVKH table                 ; Initialize pointer
    loads through *A15[0], *A15[1] ; Load a^2 = 00000009h, load b^2 = 00000010h
    add                            ; Calculate c^2 = 00000019h
    store through *A15[2]          ; Store c^2
Data (.data), table: .byte entries for (3^2) (4^2) (5^2) and (6^2) (8^2) (10^2).
Dynamic parsing/addressing is available with registers.
*See the TMS320C62xx Instruction Reference Guide, 3-20.

As in the previous examples, the pointer is loaded into a register using the MVK/MVKH instructions. The values are loaded into registers using a byte-load instruction (to show byte addressability), with *A15 acting as the pointer register. Pipeline considerations are not discussed here. Remember that the code examples are UNOPTIMIZED.

This contrived example reads two squares from a table of bytes and adds them together. The resulting value is then written over an initialized "zero", again as a byte. The index of each element is denoted by "[]" in the instruction. Remember that the C6x is a pure load/store architecture; only load and store instructions perform address accesses. Thus, you see "*" with only these instructions.

This method also works well for manipulating registers dedicated to a peripheral (such as the McBSP or DMA). The main peripheral control register often comes first in the memory map, and it can be used as the base. Other secondary peripheral registers can be used as offsets.
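The base-plus-offset access pattern of the Pythagorean example maps directly onto C array indexing on a byte pointer. The sketch below is only an illustration under assumptions: the function name is hypothetical, and the table layout (two squares followed by a zero placeholder, twice) is inferred from the description of the example rather than copied from the listing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read the bytes at base[0] and base[1], add them, and store the sum
   as a byte at base[2], overwriting the initialized zero. */
uint8_t sum_squares_into_third(uint8_t *base)
{
    assert(base != NULL);
    uint8_t a2 = base[0];            /* e.g. 3^2 = 09h */
    uint8_t b2 = base[1];            /* e.g. 4^2 = 10h */
    base[2] = (uint8_t)(a2 + b2);    /* e.g. 5^2 = 19h */
    return base[2];
}
```

With a table such as {9, 16, 0, 36, 64, 0}, calling the function on the table and then on table + 3 fills in 25 and 100; the second call shows the same offsets reused against a different base.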
In the table-parsing example, the offsets are the constant (immediate) offsets that were derided at the beginning of this section. The following section shows not only a dynamic example but also other features.

Link Lists

Dynamically calculating pointer addresses constitutes a programming practice that is often used extensively in real-time processor code. This example first calculates the initial address of a linked list (dynamically). It then shows how pointer access is accomplished using the same register (a feature mentioned in the Address Manipulation section). Finally, the example shows a subtle feature: the link list is made circular with one instruction on the C6x. Figure 7 shows the example.

Figure 7. Link List Example
Address manipulation: pointer/address calculation (dynamic), including link lists (the example becomes circular after the initial calculation). All values are little-endian (le).
    Initial pointer calculation:  firstlnk (80000000h) -> xptr (80008000h)
    Circular list:                xptr (80008000h) -> yptr (80008100h) -> zptr (80008200h) -> xptr

The value of xptr is initially dynamically calculated (and forced to 80008000h), and then the link list points to the next location in circular fashion. Note that each "ptr" could be in an arbitrary place in memory and that it just points to the next "ptr" in an arbitrary place in memory. Figure 8 shows the corresponding code:

Figure 8. Link List Code
Address manipulation: link lists, using the fact that registers can be used for both calculation/general-purpose and pointer/address functions.
Listing llcirc (.text), a circular three-element link list load:
    MVK/MVKH firstlnk          ; Initialize firstlnk pointer
    MVK/MVKH 08000h            ; Hand calc of xptr offset; clear upper bits (A1=8000h, A15=80000000h)
    add to firstlnk            ; (bad programming practice) xptr=80008000h
    circ: load *A15 into A15   ; Load next link: yptr, zptr, xptr, yptr, ...
    branch to circ             ; Repeat infinitely
Data:
    .data            firstlnk  .word firstlnk   ; 80000000h
    .sect "ptrs"     xptr      .word yptr       ; 80008000h
                     yptr      .word zptr       ; 80008100h
                     zptr      .word xptr       ; 80008200h

The pointer "firstlnk" is initialized in data memory at 80000000h (to push the example) and loaded with MVK/MVKH.
Then, to the address that the pointer "firstlnk" points to, 8000h is added to reach the hard-coded address (hard-coded in the linker command file) of xptr. (Technically, there should be a separate "ptrs" section with a .sect directive for EACH pointer to get the accurate addresses shown in the comments by the .sect directive above.) Then the load overwrites the present pointer with the next one, in circular, endless-loop fashion. This single-instruction pointer update/overwrite is possible because the C6x registers (A0-A15 and B0-B15) are BOTH calculation (general-purpose) and address (auxiliary) registers. No other TMS320 does this. Thus, no cycles are wasted moving a value from a general-purpose register/accumulator to an address/auxiliary register.

Address and data manipulation are not the only features that the C6x does well. The execution of code, especially decision execution, should be examined next.

Decision Execution (Conditionally Execute Advantages Over Test/Branch)

Much non-traditional code involves "controller", housekeeping-type functions that often involve decision trees with testing and branches. The disadvantage of this on many DSPs (and other microprocessors) is the branching overhead caused by deep pipelines. Previous TMS320 DSPs needed several cycles of overhead to branch (sometimes the overhead was reduced with a delayed branch instruction). The C6x also has overhead cycles in the traditional sense when its delay slots are not used. (Microcontrollers often have shorter pipelines but much slower cycle times, so overall execution speed is much worse than with a DSP.) Often delay slots do not help when there are tight data dependencies; that is, when the next decision is based on the very operations following the results of the last decision. Such a configuration is inherently inefficient. One option to optimally execute decisions with tight data dependencies uses a C6x feature by which every instruction can be conditionally executed. This option presents a linear, non-branching method of achieving these decision trees.
The concept is shown in Figure 9.

Figure 9. Decision Execution Concepts
Conditionally execute (in parallel): advantages over test/branch. Left side: a decision tree of bit tests and branches. Right side: a linear sequence of conditionally executed instructions.

Instead of the classic "bit test and branch" that flushes the pipeline at each decision, shown on the left side of the diagram, the C6x can either execute an instruction or not, based on a condition. This method avoids the branching overhead.

Other microprocessors, often RISCs, use this methodology. Some, such as the Intel IA64, execute both legs of a branch ahead of time until it is determined which will be used, at which point the other is voided. Of course, this method is expensive in hardware. The C6x method is software-based and less expensive in hardware.

Comparator Example

A real-world example to illustrate the concept is the saturation of an input signal, as seen in Figure 10, using a comparator function on the C6x.

Figure 10. Decision Execution Example (Comparator)
Comparator example (unsigned). A 16-bit input between the analog negative and positive rails is digitized (digital positive rail FFFF, midpoint 8000), then compared and saturated, and the 16-bit output sits at either the positive rail (FFFF) or the negative rail.

In this example system, an analog signal is converted to digital, resulting in a 16-bit unsigned value. A voltage above the midpoint (that is, greater than 8000h) will be saturated to the maximum positive unsigned value of 0ffffh on the C6x. A voltage below the midpoint (that is, less than 8000h) will be saturated to the minimum negative unsigned value of 0000h on the C6x. The digital signal is then converted back to analog. This example does nothing fancy, but it does allow us to compare styles of decision execution. The more classical "bit test and branch" is shown in Figure 11, implemented in C6x assembly with a conditional branch instruction (called BCND on other TMS320 DSPs).
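Before turning to the assembly, the two decision styles being compared can be contrasted in C. This is a sketch under assumptions, not a translation of the listings: the function names are hypothetical, both versions use a greater-than-8000h test, and a C compiler is free to turn either form into branches or predicated instructions as it sees fit.

```c
#include <assert.h>
#include <stdint.h>

/* Test-and-branch style: exactly one of the two paths is taken. */
uint16_t saturate_branch(uint16_t x)
{
    if (x > 0x8000u) {
        return 0xFFFFu;   /* positive saturation */
    }
    return 0x0000u;       /* negative saturation */
}

/* Conditional-execute style: compute the predicate once, then keep
   whichever of the two candidate results the predicate selects,
   without any branch in the source. */
uint16_t saturate_predicated(uint16_t x)
{
    uint16_t pred = (uint16_t)(x > 0x8000u);   /* 1 or 0 */
    uint16_t mask = (uint16_t)(0u - pred);     /* FFFFh or 0000h */
    return (uint16_t)((0xFFFFu & mask) | (0x0000u & (uint16_t)~mask));
}

/* Self-check: both styles must agree on the two sample inputs. */
void saturate_self_check(void)
{
    assert(saturate_branch(0xA000u) == saturate_predicated(0xA000u));
    assert(saturate_branch(0x2000u) == saturate_predicated(0x2000u));
}
```

The predicated form mirrors the idea of two mutually exclusive conditional instructions: both results are always available, and the condition only decides which one takes effect.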
Figure 11. Decision Execution Code (Bit Test/Branch)
Test/Branch (every instruction is conditional). OLD: the typical "bit test and branch", with much pipeline overhead (using registers, not constants).
Listing OLD (preloaded values: 8000h, 0000h, A4=0ffffh):
    load through *A15    ; Load value
    CMPGT                ; Test if greater than (8000h): if a000h, then 00000001h; if 2000h, then 00000000h
    [A1] branch possat   ; If so, branch to (pos sat)
    negsat:              ; If not, fall thru and clear (neg sat): 00000000h
    possat:              ; Set (pos sat): 0000ffffh
    LOOP: branch LOOP

Note that some values have been preloaded into registers so that register operands can be used. (For brevity, we have omitted every MVK/MVKH seen in previous examples.) The data is LDWed (into a register) and tested (with CMPGT) against a register holding 8000h. The result is written to register A1 and the branch is conditioned on [A1] (other TMS320 DSPs have a specific "branch conditional" instruction). If the value is greater than 8000h, the code branches to possat: and positively saturates the value. If the value is not greater than 8000h, it falls through the branch to negsat: and negatively saturates the value. Often you design code so that the condition that will statistically happen more often falls through, although this is not applicable when comparing sine waves. The count of cycles to execute the loop once is 1(LDW) + 4(NOP) + 1(AND) + 6(B) + 1(AND/OR) = 13 cycles. Can we improve this number? The "conditionally execute" method is more conducive to this, as shown in Figure 12.

Figure 12. Decision Execution Code (Conditional Execute)
Conditional execute operations. NEW: the "conditional execute" method of saving pipeline overhead.
Listing NEW (preloaded values: 8000h, 0000h, A4=0ffffh):
    load through *A15    ; Load value
    mask                 ; Mask for 0/!0 check: if a000h, then 00000001h; if 2000h, then 00000000h
    negsa: [!A1]         ; If !=1, clear (neg sat): 00000000h
    possa: [A1]          ; Set (pos sat): 0000ffffh
    LOOP: branch LOOP    ; Tight loop

Note that in this code example also, some values have been preloaded into registers so that register operands can be used. (Again, for brevity, we have omitted every MVK/MVKH seen in previous examples.) Again the data is LDWed, but this time it is tested by doing an AND operation with the value 8000h, which will result in either zero or non-zero in the register. (The reason for using "and", along with other optimization methods, is discussed in the section, Optimization Methods/Rationales - ASIC/FPGA.
CMPGT would have been just as valid.) Then A1 is used as the conditional to test for negative or positive saturation, identical to Figure 11. The conditions are mutually exclusive; thus, one is executed while the other becomes a NOP. No branches are needed. (The tight loop is just meant to give an example.) This code is equivalent to that seen in Figure 11. Now think about benchmarking the number of cycles needed to execute the code. The count of cycles to execute the loop once is 1(LDW) + 4(NOP) + 1(AND) + 2(AND/OR) = 8 cycles. Again, can we improve this number?

Optimizing Code (Parallelism and Unit Utilization)

Examining the code, we see that the positive and negative saturation instructions have no data dependencies between them (for more information on data dependencies, see the TMS320C6000 Programmer's Guide, literature number SPRU198). Thus nothing prevents us from executing them at the same time. We can start optimizing the code by adding "||" to the code to perform the negative and positive saturation in the same cycle. Again note that the conditions are mutually exclusive; thus, one is executed while the other becomes a NOP in parallel, as shown in Figure 13.

Figure 13. Decision Execution Code (Conditional Execute in Parallel)
Conditional execute (in parallel). NEW: we can also start parallelizing the code, to start using more functional units (except the multiplier?) and take seven cycles.
Listing NEW:
    load through *A15    ; Load value
    mask                 ; Mask for 0/!0 check
    negsa: [!A1]         ; If !=1, clear (neg sat)
    possa: || [A1]       ; Set (pos sat)
    LOOP: branch LOOP    ; Tight loop

Note that mutually exclusive conditionals (like [!A1] and [A1]) always have one conditional acting as a NOP. The second issue to be concerned about is unit resources. Because a unit cannot be used twice in the same cycle, we must also use another unit, as shown in Figure 13. The count of cycles to execute the loop once is 1(LDW) + 4(NOP) + 1(AND) + 1(AND/OR) = 7 cycles.

It is interesting to note that the cost of not branching is that one unit becomes a NOP. Thus, we could almost say that instead of losing the six branch cycles of Figure 11 (eight units times six cycles, or forty-eight potential unit slots), we "lose" only one unit slot instead of forty-eight.

Now consider how to parallelize using more units. This is accomplished by bringing in two values at a time, keeping them on separate sides, and executing in parallel. Each of the two load units loads a value into a register on its own side in the first cycle, and we wait the appropriate NOPs.
Two units then test each of the values and write the results into the registers A1 and B1, respectively, in the sixth cycle. Then we conditionally positively saturate the values using two units, and even use a trick to conditionally negatively saturate the values using the multiplier units (multiply the value by "0"), in the seventh cycle. The code is in Figure 14.

Figure 14. Decision Execution Code (Conditional Execute in Parallel-II)
Conditional execute (in parallel). Better yet, we can bring in two values to be saturated, parallelize the algorithm, and execute in seven cycles (but doubling the throughput, to 3.5 cycles/val), even using the multiplier (to clear), as shown below.
Listing nosw:
       load through *A15    ; Load value
    || load through *B15    ; Load value
       mask                 ; Mask for 0/!0 check
    || mask                 ; Mask for 0/!0 check
       [A1]  ... 0Fh        ; Set LSN=Fh (pos sat)
    || [B1]  ... 0Fh        ; Set LSN=Fh (pos sat)
    || [!A1] ... 00h        ; If !=1, clear LSN=0 (neg sat)
    || [!B1] ... 00h        ; If !=1, clear LSN=0 (neg sat)

Note that this is a 4-bit "nibble" saturation. Technically, the multiplier units have a latency, so the negatively saturated values would not be ready until the eighth cycle. Nevertheless, counting by the "||" combinations, the number of cycles comes to seven for two values, thus averaging 3.5 cycles per value. Note that the example was simplified by doing nibbles so that we could stick with constants. Using registers is possible, but resource conflicts will start to appear in Figure 14 unless the accesses are spread among registers. Finally, with software pipelining, the kernel shown in Figure 15 is possible (for more information, see the TMS320C6000 Programmer's Guide, literature number SPRU198).

Figure 15. Decision Execution Code (Conditional Execute - Software Pipelined)
Conditional execute (with software pipeline). Best yet, we can bring in two values to be saturated, heavily software pipelined, and execute in a single cycle, even using the multiplier (to clear), as shown below.
    PIPED LOOP PROLOG
    PIPED LOOP KERNEL
       load through *A15    ; Load value
    || load through *B15    ; Load value
    || mask                 ; Mask for 0/!0 check
    || mask                 ; Mask for 0/!0 check
    || [A1]  ... 0Fh        ; Set LSB=Fh (pos sat)
    || [B1]  ... 0Fh        ; Set LSB=Fh (pos sat)
    || [!A1] ... 00h        ; If !=1, clear LSB=0 (neg sat)
    || [!B1] ... 00h        ; If !=1, clear LSB=0 (neg sat)
    PIPED LOOP EPILOG
With prolog and epilog, this code would be running at 1600 MIPS, except that no unit is left for looping!
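The cycle counts quoted so far can be tallied with a few lines of C. This is only bookkeeping for the per-instruction counts given in the text (load, four NOPs, test, branch, saturate); the function names are hypothetical, and the last case simply assumes a kernel that retires two values every cycle while ignoring prolog and epilog.

```c
#include <assert.h>

/* Cycles per value = cycles for one pass / values handled in that pass. */
double cycles_per_value(int total_cycles, int values)
{
    assert(total_cycles > 0 && values > 0);
    return (double)total_cycles / (double)values;
}

/* LDW + NOP 4 + test + branch (6) + one saturate. */
int bit_test_branch_cycles(void)      { return 1 + 4 + 1 + 6 + 1; }

/* LDW + NOP 4 + AND + two conditional saturates issued one after another. */
int conditional_cycles(void)          { return 1 + 4 + 1 + 2; }

/* LDW + NOP 4 + AND + both conditional saturates issued together. */
int conditional_parallel_cycles(void) { return 1 + 4 + 1 + 1; }
```

Dividing the parallel count by two values gives 3.5 cycles per value, and a one-cycle kernel handling two values gives 0.5, which is the progression the optimization steps are driving toward.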
After some prolog to initialize the pipeline, the kernel in Figure 15 uses all eight units to execute two samples per cycle. Then some epilog code is often needed to gracefully exit from the kernel. This method allows two values to be loaded, compared, and saturated in a single cycle, assuming, of course, the appropriate prolog and epilog code. This eliminates the "NOP 4" following the "LDW" seen in previous code examples. Thus, in 50 cycles a theoretical maximum of 100 values could be processed; with prolog/epilog overhead, it is probably more like 55-60 cycles. Thus, the effective benchmark is one cycle per two values, or about 0.5 cycles per value.

No unit in Figure 15 is available for looping. Thus, there are two ways to repeat this instruction, for example, 50 times. One method is a dual-cycle loop, which will cause it to take 105-110 cycles (for more information, see the TMS320C6000 Programmer's Guide, literature number SPRU198). The second method is to unroll the loop. In other words, repeat/copy it 50 times, if you have the available code space. Thus, the "loop" benchmark remains within 55-60 cycles, with the classic code size versus speed tradeoff.

Optimization Methods/Rationales - ASIC/FPGA

In the section, Optimizing Code (Parallelism and Unit Utilization), Figure 11 shows the 8000h test performed using the CMPGT instruction. In Figure 12 through Figure 15, an equivalent test could be done using the AND instruction. Such a method was chosen to allow flexibility in the later unit allocation of instructions, because CMPGT is only available on the .L units. Because AND is available on the .L units and .S units, using this equivalent test makes later flexibility in the allocation of units possible.

The C compiler uses a similar trick when it tests a value for being equal to something. (i==5) could be tested by subtracting 5 from the variable and testing for 0/!0. Because this operation is available on more units (.L, .D's), it gives greater flexibility in unit allocation.

The C6x feature that "conditionally executes", allowing NOPs (such as with mutually exclusive conditions) when an operation is not performed, combined with the parallelism of the architecture, offers an interesting observation. This architecture allows operation in a sequential execution mode with very few branches and many conditionals, similar to the dataflow seen in an FPGA or ASIC.
These functions could run in lockstep at very fast speeds and are much easier to program/route than when implemented in an FPGA/ASIC.

Decision Execution Cycle Summary

Thus, we summarize the cycle savings in Table 1 (please bear with the relative levels of optimization, which were presented academically to get the concepts across):

Table 1. Execution Decision Cycle Summary

    Coding Style                                                                           Cycles per value
    Bit test and branch (Figure 11)                                                        13
    Conditionally execute (Figure 12)                                                      8
    Conditionally execute with parallel saturate (Figure 13)                               7
    Dual value conditionally execute with parallel saturate (Figure 14)                    3.5
    Software pipelined dual value conditionally execute with parallel saturate (Figure 15) ~0.5

Now that we have seen the specifics at the heart of the architecture in assembly, advanced tools help make using this architecture easier.

Application Example

The section, C6x CPU/Instruction Features With Code Examples, examined various specific features of the architecture, albeit written in assembly. Often the programmer, especially when starting out, does not want to get involved in the intricacies of a certain CPU's assembly language. Thus, they write in ANSI C to produce portable, general code. There are various code optimization levels between ANSI C and pure assembly (intrinsics, C-callable assembly, etc.) that will be fully explored in a future application report with code benchmarks. In this section, we write something unique to the C6x called "linear assembly" and run it through a code-generation tool called the "assembly optimizer" (for more information, see the TMS320C6000 Optimizing C Compiler User's Guide, literature number SPRU187). The presented example is just a first pass at a non-traditional application.

Table Lookup Example Description

We examine a certain networking lookup algorithm, one implemented in the TNETX15VE address lookup engine. Of course, additional optimizations are possible in hand assembly, but the assembly optimizer tool can accomplish some of the functions we have mentioned. The algorithm is explained in Figure 16 and the full code is listed in the Appendix.

Figure 16. Table Lookup Example Description
Table lookup example using EXTU (algorithm).
Code summary (assume the table is set up already):
    - Input a value for lookup.
    - Traverse through the table in 6-bit chunks.
    - Read the pointer value/link list for the next lookup.
    - Six-iteration loop (32/6 ~= 6).
    - Written in linear assembly (using the assembly optimizer).
Example steps:
    - Load 0851C928h into a register. Base = table = 80000000h.
    - Extract, using the EXTU instruction, the first 6 bits (= 2h) as the offset.
    - Add the offset to the base: 80000000h + 2h = 80000002h.
    - Load the value at 80000002h = 01h.
    - New base = table + (value<<6) = 80000000h + (40h) = 80000040h.
    - Extract, using the EXTU instruction, the next 6 bits (= 5h) as the offset.
    - Add the offset to the base: 80000040h + 5h = 80000045h.
    - Load the value at 80000045h = 02h.
    - New base = table + (value<<6) = 80000000h + (80h) = 80000080h.
    - Repeat from EXTU four more times.
Internal memory (block base, then the address accessed in that block):
    0x80000000    80000002h
    0x80000040    80000045h
    0x80000080    80000087h
    0x800000C0    800000C9h
    0x80000100    8000010ah
    0x80000140    80000140h

The code summary gives an overview of what the code does, while the example steps through a contrived but actual data value. The actual data value is displayed at the lower right-hand side in hex, binary, and 6-bit values coded in hex. The table, hard-coded in internal memory, is displayed on the upper right side of the graphic. The specific initialized values (along with their addresses) used in this contrived example are displayed in boxes, not to scale. This example shows off the EXTU instruction. It is assumed that the lookup table is already built, and the code benchmarks apply to processing one 32-bit value.

The algorithm ended up being a six-iteration loop. Loops are obviously good for DSPs. More iterations would be helpful but would require buffering much more data at the system level (2K bytes per packet). In other words, 2K bytes of buffer space per six iterations of the loop are needed. Thus, for a thousand iterations, we would need (1000/6) x 2K = 333K bytes, which is prohibitive for some systems.

Table Lookup Example Code

The initialization code is written in C, as initialization code should be. The actual lookup function could be written in ANSI C, C with intrinsics, linear assembly, or pure assembly. Figure 17 shows both the main C code at the beginning and the called linear assembly function named "iploop". (The code in Figure 17 actually resides in two separate files. The C code is in a ".c" file. The linear assembly is in a ".sa" file, which stands for "serial assembly".)
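The walk described in the example steps can be prototyped in portable C before committing to linear assembly. The sketch below is a model under stated assumptions, not the report's code: table_lookup and its constants are hypothetical names, the table is treated as an array of 64-byte blocks starting at offset 0 rather than at 80000000h, and on the sixth pass the two remaining data bits are taken as the top of the 6-bit chunk with zeros below.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { CHUNK_BITS = 6, ITERATIONS = 6 };

/* Walk a table of 64-byte blocks, consuming `data` six bits at a time
   from the most significant end. Each byte read selects the next block.
   Returns the byte read on the final iteration. */
uint8_t table_lookup(const uint8_t *table, uint32_t data)
{
    assert(table != NULL);
    size_t base = 0;       /* byte offset of the current block */
    uint8_t value = 0;
    for (int i = 0; i < ITERATIONS; i++) {
        /* Extract-style step: drop the bits already used, then
           right-justify the next six bits. */
        uint32_t chunk = (uint32_t)(data << (CHUNK_BITS * i)) >> (32 - CHUNK_BITS);
        value = table[base + chunk];            /* load at base + offset  */
        base = (size_t)value << CHUNK_BITS;     /* new base = value << 6  */
    }
    return value;
}
```

For the data value 0851C928h the successive offsets are 2h, 5h, 7h, 9h, Ah, and 0h, so a table whose entries at 02h, 45h, 87h, C9h, and 10Ah hold 1 through 5 reproduces the block bases 40h, 80h, C0h, 100h, and 140h listed in the internal memory map.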
Figure 17. Table Lookup Example Code Initialization
C code (".c" file):
    main()
    {
        /* Init pointer and data */
        int *llptr;
        int data = 0x0851C928;
        /* Assign 0x80000000 (reserved in linker; bad programming practice) and call */
        llptr = (int *) 0x80000000;
        ipploop (llptr, data);
    } /* main */
Linear assembly (".sa" file):
    _ipploop: .cproc llptr, data
              .reg   count, cstal, cstbr, cstfinal
              .reg   base, offset
              mvk    ..., count     ; init count
              mvk    0, cstal       ; init shift
              mvk    ..., cstbr
              mvk    0, base        ; init base

The called linear assembly function, seen from the calling C function, resembles a C function with passable parameters and a return value. The top half shows the C code that hard-codes the pointer to the internal memory location 0x80000000 (and allocates the memory using the linker). Then the function is called, as a C function, with the passed parameters of the pointer and the data value. When using .cproc, the called linear assembly function understands the parameters passed from the calling C function to the linear assembly function. The bottom half shows the linear assembly function file and how the parameters received are used in the function, along with some initializations. Figure 18 shows the rest of the iploop() function written in linear assembly, which appears in the same file as shown in Figure 17 (for more information, see the TMS320C6000 Optimizing C Compiler User's Guide, literature number SPRU187).

Figure 18. Primary Lookup Table Loop in Linear Assembly
Listing loop (operands and comments as recovered, in order):
    ...      cstal, cstfinal, cstfinal   ; build cstfinal from cstal and cstbr (annoying)
    ...      cstbr, cstfinal, cstfinal
    ...      llptr, base, base           ; base with llptr
    extu     data, cstfinal, offset      ; get offset
    ...      base, offset, base          ; base + offset
    ...      *base, offset               ; next offset
    ...      offset, base                ; update base (offset->base)
    ...      cstal, cstal                ; increment cstal, cstbr
    ...      cstbr, cstbr
    [count]  ... count, count
    [count]  ... loop
    .return  count

Linear assembly allows mnemonics with symbolic (including passed parameter) values. It is written "as you think", without optimization or software pipelining. Figure 18 shows the meat of the code. The EXTU instruction is the meat of the loop. It extracts bits from the data value as an offset and looks up the base of the next table location. EXTU is used dynamically, with the cstal and cstbr variables specifying which bits to extract. They are pasted together into the register cstfinal at the beginning of the loop and updated toward the end.
The loop counter operation needed is seen in the last lines of the code before the return. Note that the return value merely confirms that the loop executed. After the code is run through the assembly optimizer, pure assembly is automatically generated. Figure 19 shows the kernel of the optimized assembly that would reside in an ".asm" file. Note that the epilog and prolog have been omitted for brevity and that little time was spent optimizing this or looking at data dependencies.

Figure 19. Primary Lookup Table Loop in Pure Assembly
PIPED LOOP KERNEL (operands and comments as recovered, in order):
    EXTU     A0,A5,A5
    ...      A4,A6,A6       ; base with llptr
    ...      B0,0x1,B0
    ...      A5,A6,A5       ; base + offset
    ...      *A5,B4         ; next offset
    ...      0x6,A7,A7
    ...      A7,0x5,A6      ; annoying
    ...      A3,0x6,A3
    ...      A7,A6,A6
    ... .S1X B4,0x6,A5      ; offset->base
    ...      A3,A6,A6

Thus, Figure 19 shows the assembly code generated by the assembly optimizer for the software-pipelined kernel. You do not have to think about software pipelining or optimization because it is done for you. The code clearly shows the number of cycles required. To count the cycles and data dependencies, follow the sets of parallel bars.

System Discussion - C6x DMAs and Data (Eliminate Components)

As mentioned earlier, the C6x has certain architectural features that make it powerful in operation for "dataflow" applications. In addition, it provides an efficient configuration for bringing data on-chip and taking data off-chip without much overhead. Also, the internal memory allows the elimination of expensive external device I/Os, such as FIFOs. A networking data mover is a typical example, as shown in Figure 20.

Figure 20. (... C6x) Architecture
Block diagram: Router, FIFO, FPGA, and FIFOs, connected to a Quad MAC serving four Physical Layer (Phy) devices at 10/100 Mbit/s each. Annotation: ... Mbit/s/32 = 25. Let's eliminate the FIFOs and FPGA!

In this networking example, the physical layer (PHY) is akin to the speech codec in a typical DSP system. The media access controller (MAC) receives digital data from the Ethernet wire (as a DSP would), and the data is sent to the router (which you could imagine to be like a host, but much faster) through the FIFOs and FPGA. Everything is running fast, with many parts, in a bi-directional manner. The maximum size of an Ethernet packet is 1538 bytes.
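The buffer-size and bandwidth budgeting behind this system discussion can be worked through in a few lines of C. The sketch assumes the planning figures used in this section's diagrams: a 2K slot per 1538-byte packet, four MACs, two directions, 100 Mbit/s per MAC, and 32-bit transfers. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* On-chip buffer bytes: one slot per MAC per direction,
   optionally doubled for a ping-pong scheme. */
uint32_t buffer_bytes(uint32_t slot_bytes, uint32_t macs,
                      uint32_t directions, int ping_pong)
{
    assert(slot_bytes > 0 && macs > 0 && directions > 0);
    uint32_t total = slot_bytes * macs * directions;
    return ping_pong ? total * 2u : total;
}

/* 32-bit word transfers per second needed to carry the aggregate line rate. */
uint32_t words_per_second(uint32_t bits_per_second_per_mac,
                          uint32_t macs, uint32_t directions)
{
    assert(macs > 0 && directions > 0);
    return (bits_per_second_per_mac * macs * directions) / 32u;
}
```

With those inputs the buffer comes to 16K bytes, or 32K with ping-pong, and the aggregate 800 Mbit/s works out to 25 million 32-bit transfers per second.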
The next figure shows the C6x substituted for the FIFOs of the previous figure.

Figure: Architecture (Size)
DMA's Dataflow (memory size). Blocks: EMIF to the Router, Program and Data memory, DMA, two MCSP ports, TMS320C62xx, Quad MAC chip, and four Physical Layers (PHY).
Size of Ethernet packet: 1538 bytes -> (2K * 4 MAC * 2 bi-dir) = 16K; 16K * 2 = 32K (for efficient ping pong).

The internal memory is able to replace the FIFOs. Size-wise, there is easily enough internal memory on the C6201B silicon to fit the entire maximum Ethernet packet size of 1538 bytes in each direction (as worked out in the figure above). Because the internal memory is well partitioned on the C6201B silicon, doubling the buffer size with a ping-pong approach would cause fewer CPU/DMA conflicts. The next figure shows how the DMA/EMIF replaces the FPGA and addresses the speeds and bandwidths necessary for system operation.

Figure: Architecture (Speed)
DMA's Dataflow (speed). Blocks: EMIF to the Router, Data and Program memory, DMA, two MCSP ports, TMS320C62xx, Quad MAC, and four Physical Layers (PHY).
Function: The DMA does the data moving work. What does the CPU do that is conveniently in the path? Protocol conversion, VOIP switch, repeater, encryption, compression, echo cancellation.
Speed: [(100 Mbit/s) * 4 MACs * 2 dir] / 32 bits. Unidirectional * 2 = Bidirectional. Bus turnaround?

The four DMA channels give an elegant solution, one for each direction of each of the "ports" that you are hooking up. Speed-wise, you might have trouble keeping up because of presently uncharacterized "bus turnaround" issues when operating in a bi-directional manner. To enhance the discussion, if you modify the C6x to have a second bus, as you have on the C6202, some enhancements to the system architecture can be made, as shown in the next figure.

Figure: C6202 Architecture With Second Bus
Unidirectional case: a second parallel bus (or EMIF) would speed up unidirectional systems by eliminating "bus turnaround" overhead. Each bus handles one direction. Blocks: Parallel Data In, C6202 Process, Parallel Data Out.
Bidirectional case: a second parallel bus (or EMIF) would simplify the interface logic of bidirectional systems by providing a second "port" for parallel access.
Blocks (bidirectional case): Router Data In/Out, C6202 Process, Quad MAC Data In/Out.

The second parallel bus adds some major advantages to system interfacing, not only in reducing bandwidths but also in simplifying the interface. In the simpler uni-directional system, there is no turnaround overhead because one side is writing and one side is reading. In the more complex bi-directional system, the second bus provides a second "port" for parallel access and simpler decode (see Figure 23). The latter looks like the router described in this section.

Conclusion

The TMS320C6x CPU/architecture has a variety of features that are attractive for non-typical DSP functions, especially in a dataflow/"virtual FPGA"-type architecture. It is preferable to write much of this code in linear assembly because the C compiler does not comprehend these features. The four-channel DMA provides an attractive architecture for such dataflow applications (the second parallel bus is appropriate for uni- or bi-directional applications).

Appendix: Table Lookup Code

The following code was used in the example described in the section, Application Example. The following command lines were used to invoke the tools:

    cl6x ipp.c
    cl6x ipploop.sa
    asm6x ipptab.asm
    lnk6x ipp.cmd

ipp.c

    extern void ipploop();

    main()
    {
        int *llptr;
        int data = 0x0851C928;

        /* Could hack by making this 0x80000000; in the simulator this works */
        llptr = (int *) 0x80000000;

        ipploop(llptr, data);
    }   /* main */

ipploop.sa

    ; Texas Instruments, Inc.
    ; Linear Assembly to perform Packet Parsing
    ; Executive Author: David Alter, PhD.
    ; Author: George
    ; Date: 02/02/98
    ; Description:
    ; Requirements:
    ; Parameters:
    ; Return:
    ;   finds / doesn't
    ;   llptr    Table to parse
    ;   Parse header in chunks

            .def    _ipploop
    _ipploop: .cproc  llptr, data
            .reg    count, cstal, cstbr, cstfinal
            .reg    base, offset
            mvk     6, count                    ; init count
            mvk     0, cstal                    ; init shift
            mvk     26, cstbr
            mvk     0, base                     ; init base
            ; build
    loop:   shl     cstal, 5, cstfinal
            add     cstal, cstfinal, cstfinal   ; annoying
            add     cstbr, cstfinal, cstfinal
            add     llptr, base, base           ; base with llptr
            extu    data, cstfinal, offset      ; offset
            add     base, offset, base          ; base
            ldb     *base, offset               ; next offset
            ; update base
            shl     offset, 6, base             ; offset->base
            ; increment cstal and cstbr
            add     6, cstal, cstal
            sub     cstbr, 6, cstbr
    [count] sub     count, 1, count
    [count] b       loop
            .return count
            .endproc

ipp.cmd

    /* lnk.cmd v1.00                                      */
    /* Texas Instruments Incorporated Copyright 1996-1997 */
    -heap  0x2000
    -stack 0x0800

    /* Link Command file for test code */
    -o ipp.out
    -m ipp.map
    ipp.obj
    ipploop.obj
    ipptab.obj
    -l c:\dsp\c6x\c6xc\lib\rts6201.lib

    MEMORY
    {
        VECS:    o = 00000000h   l = 00400h   /* reset and interrupt vectors */
        PMEM:    o = 00000400h   l = 0FC00h   /* intended for initialization */
        LTABLE0: o = 80000000h   l = 0003Fh   /* table */
        LTABLE1: o = 80000040h   l = 0003Fh   /* table */
        LTABLE2: o = 80000080h   l = 0003Fh   /* table */
        LTABLE3: o = 800000C0h   l = 0003Fh   /* table */
        LTABLE4: o = 80000100h   l = 0003Fh   /* table */
        LTABLE5: o = 80000140h   l = 0003Fh   /* table */
        BMEM:    o = 80008000h   l = 08000h   /* .bss, .sysmem, .stack, .cinit */
    }

    SECTIONS
    {
        vectors    > VECS
        .text      > PMEM
        lnktable0  > LTABLE0
        lnktable1  > LTABLE1
        lnktable2  > LTABLE2
        lnktable3  > LTABLE3
        lnktable4  > LTABLE4
        lnktable5  > LTABLE5
        .tables    > BMEM
        .data      > BMEM
        .stack     > BMEM
        .bss       > BMEM
        .sysmem    > BMEM
        .cinit     > BMEM
        .const     > BMEM
        .cio       > BMEM
        .far       > BMEM
    }

ipploop.asm (tool generated)

    ;* TMS320C6x ANSI C Codegen Version 1.10
    ;* Date/Time created: 14:12:41 1998
    ;*
    ;* GLOBAL FILE PARAMETERS
    ;*   Architecture    : TMS320C6200
    ;*   Endian          : Little
    ;*   Memory Model    : Small
    ;*   Redundant Loops : Enabled
    ;*   Pipelining      : Enabled
    ;*   Debug Info      : Debug
            .set
            .set
            .set
            .file   "ipploop.sa"
    ; Texas Instruments, Inc.
    ; Linear Assembly to perform Packet Parsing
    ; Executive Author: David Alter, PhD.
    ; Author: George
    ; Date: 02/02/98
    ; Description:
    ; Requirements:
    ; Parameters:
    ; Return:
    ;   finds / doesn't
    ;   llptr    Table to parse
    ;   Parse header in chunks

            .def    _ipploop
            .sect   ".text"
            .align
            .sym    _ipploop,_ipploop,36,2,0
            .func
    ;* FUNCTION NAME: _ipploop
    ;*   Regs Modified :
    ;*   Regs Used     : A0,A1,A3,A4,A5,A6,A7,B0,B4,B5
    _ipploop:
    ; _ipploop: .cproc llptr, data
    ;       .reg    count, cstal, cstbr, cstfinal
    ;       .reg    base, offset
            .sym    llptr,0,4,4,32
            .sym    data,4,4,4,32
            .line
            MV      A4,A0
            MV      .L1X B4,A4
            .sym    count,16,4,4,32
            .sym    cstal,7,4,4,32
            .sym    cstbr,3,4,4,32
            .sym    cstfinal,6,4,4,32
            .sym    base,5,4,4,32
            .sym    offset,6,4,4,32
            .line
            MVK     0x6,B0              ; init count
            .line
            MVK     0x0,A7              ; init shift
            .line
            MVK     0x1a,A3
            .line
            MVK     0x0,A5              ; init base
            CMPGTU  .L1X B0,1,A1
            ; BRANCH OCCURS
    loop:
            .line
            SHL     A7,0x5,A6
            .line
            ADD     A7,A6,A6            ; annoying
            .line
            ADD     A3,A6,A6
            .line
            ADD     A0,A5,A5            ; base with llptr
            .line
            EXTU    A4,A6,A6            ; offset
            .line
            ADD     A5,A6,A5            ; base
            .line
            LDB     *A5,A6              ; next offset
            .line
            SHL     A6,0x6,A5           ; offset->base
            .line
            ADD     0x6,A7,A7
            .line
            SUB     A3,0x6,A3
            .line
            SUB     B0,0x1,B0
            .line
            ; BRANCH OCCURS
            ; BRANCH OCCURS

            MVC     CSR,B5
            AND     -2,B5,B4
            B       loop
            MVC     B4,CSR
            SUB     B0,1,B0

    ; PIPED LOOP PROLOG
            SHL     A7,0x5,A6
            ADD     A7,A6,A6
            ADD     A3,A6,A6            ; annoying

    ; PIPED LOOP KERNEL
            ADD     A0,A5,A5            ; base with llptr
            EXTU    A4,A6,A6            ; offset
            SUB     B0,0x1,B0
            ADD     A5,A6,A5            ; base
            LDB     *A5,B4              ; next offset
            ADD     0x6,A7,A7
            SHL     A7,0x5,A6
            SUB     A3,0x6,A3
            ADD     A7,A6,A6            ; annoying
            SHL     .S1X B4,0x6,A5      ; offset->base
            ADD     A3,A6,A6

    ; PIPED LOOP EPILOG
            ADD     A0,A5,A5            ; base with llptr
            EXTU    A4,A6,A6            ; offset
            ADD     A5,A6,A5            ; base
            LDB     *A5,B4              ; next offset
            ADD     0x6,A7,A7
            SUB     A3,0x6,A3
            SHL     .S1X B4,0x6,A5      ; offset->base
            MVC     B5,CSR
            .line
            ; BRANCH OCCURS
    L10:
            .line
            MV      .L1X B0,A4
            ; BRANCH OCCURS
            .endfunc 64,000000000h,0
    ;       .endproc

ipptab.asm

            .global ippacket
            .sect
    ; data Table
    ; USAGE This table Packet
    ; Revision Data: 04/22/97
    ; TEXAS INSTRUMENTS, INC.
    ippacket:                           ; values value
            .word

            .sect "lnktable0"
    t000:   .byte .byte .byte 01h; Packet
            .byte .byte .byte .byte .byte .byte .byte
            .byte .byte .byte .byte .byte .byte .byte

            .sect "lnktable1"
    t001:   .byte .byte .byte .byte .byte .byte 02h; Packet
            .byte .byte .byte .byte .byte
            .byte .byte .byte .byte .byte

            .sect "lnktable2"
    t002:   .byte .byte .byte .byte .byte .byte .byte .byte 03h; Packet
            .byte .byte .byte .byte
            .byte .byte .byte .byte

            .sect "lnktable3"
    t003:   .byte .byte .byte .byte .byte .byte .byte .byte .byte .byte 04h; Packet
            .byte .byte .byte
            .byte .byte .byte

            .sect "lnktable4"
    t004:   .byte .byte .byte .byte .byte .byte .byte .byte .byte .byte .byte 05h; Packet
            .byte .byte .byte .byte .byte

            .sect "lnktable5"
    t005:   .byte 00h; Packet
            .byte .byte .byte .byte .byte
            .byte .byte .byte .byte .byte
            .byte .byte .byte .byte .byte

            .end
TI is a trademark of Texas Instruments Incorporated. Other brands and names are the property of their respective owners.

IMPORTANT NOTICE

Texas Instruments and its subsidiaries (TI) reserve the right to make changes to their products or to discontinue any product or service without notice, and advise customers to obtain the latest version of relevant information to verify, before placing orders, that information being relied on is current and complete. All products are sold subject to the terms and conditions of sale supplied at the time of order acknowledgement, including those pertaining to warranty, patent infringement, and limitation of liability.

TI warrants performance of its semiconductor products to the specifications applicable at the time of sale in accordance with TI's standard warranty. Testing and other quality control techniques are utilized to the extent TI deems necessary to support this warranty. Specific testing of all parameters of each device is not necessarily performed, except those mandated by government requirements.

CERTAIN APPLICATIONS USING SEMICONDUCTOR PRODUCTS MAY INVOLVE POTENTIAL RISKS OF DEATH, PERSONAL INJURY, OR SEVERE PROPERTY OR ENVIRONMENTAL DAMAGE ("CRITICAL APPLICATIONS"). TI SEMICONDUCTOR PRODUCTS ARE NOT DESIGNED, AUTHORIZED, OR WARRANTED TO BE SUITABLE FOR USE IN LIFE-SUPPORT DEVICES OR SYSTEMS OR OTHER CRITICAL APPLICATIONS. INCLUSION OF TI PRODUCTS IN SUCH APPLICATIONS IS UNDERSTOOD TO BE FULLY AT THE CUSTOMER'S RISK.

In order to minimize risks associated with the customer's applications, adequate design and operating safeguards must be provided by the customer to minimize inherent or procedural hazards.

TI assumes no liability for applications assistance or customer product design. TI does not warrant or represent that any license, either express or implied, is granted under any patent right, copyright, mask work right, or other intellectual property right of TI covering or relating to any combination, machine, or process in which such semiconductor products or services might be or are used. TI's publication of information regarding any third party's products or services does not constitute TI's approval, warranty, or endorsement thereof.
Copyright 1999 Texas Instruments Incorporated