NEW DATABASE - 350 MILLION DATASHEETS FROM 8500 MANUFACTURERS
SPRA027A TMS320C40 256KB 512KB TMS320 TMS320C30 002FF800H 0100020H DMA04 DMA02 - Datasheet Archive
With TMS320C4x DSPs Application Report Rose Marie Piedra Digital Signal Processing - Semiconductor Group SPRA027A February 1994
Parallel 2-D FFT Implementation With TMS320C4x DSPs Application Report Rose Marie Piedra Digital Signal Processing - Semiconductor Group SPRA027A SPRA027A February 1994 Printed on Recycled Paper IMPORTANT NOTICE Texas Instruments (TI) reserves the right to make changes to its products or to discontinue any semiconductor product or service without notice, and advises its customers to obtain the latest version of relevant information to verify, before placing orders, that the information being relied on is current. TI warrants performance of its semiconductor products and related software to the specifications applicable at the time of sale in accordance with TI's standard warranty. Testing and other quality control techniques are utilized to the extent TI deems necessary to support this warranty. Specific testing of all parameters of each device is not necessarily performed, except those mandated by government requirements. Certain applications using semiconductor products may involve potential risks of death, personal injury, or severe property or environmental damage ("Critical Applications"). TI SEMICONDUCTOR PRODUCTS ARE NOT DESIGNED, INTENDED, AUTHORIZED, OR WARRANTED TO BE SUITABLE FOR USE IN LIFE-SUPPORT APPLICATIONS, DEVICES OR SYSTEMS OR OTHER CRITICAL APPLICATIONS. Inclusion of TI products in such applications is understood to be fully at the risk of the customer. Use of TI products in such applications requires the written approval of an appropriate TI officer. Questions concerning potential risk applications should be directed to TI through a local SC sales office. In order to minimize risks associated with the customer's applications, adequate design and operating safeguards should be provided by the customer to minimize inherent or procedural hazards. TI assumes no liability for applications assistance, customer product design, software performance, or infringement of patents or services described herein. Nor does TI warrant or represent that any license, either express or implied, is granted under any patent right, copyright, mask work right, or other intellectual property right of TI covering or relating to any combination, machine, or process in which such semiconductor products or services might be or are used. Copyright © 1996, Texas Instruments Incorporated Introduction Fourier transform techniques are of fundamental importance in digital signal processing (DSP) applications. Among the most commonly used algorithms in image processing is the Fast Fourier Transform (FFT). FFT is used for computation of the Discrete Fourier Transform (DFT). FFT computations can be used to solve image correlations and convolutions. Two-dimensional convolutions and correlations are used for feature extraction in image processing. For example, applications on fluid dynamics (2-D turbulences) can lead to calculation of velocity vectors and gradients. One important advantage of using frequency domain tools over direct methods is faster execution. The FFT algorithm significantly reduces the computation time of the DFT. This application note compares serial and parallel implementations of 2-D complex FFTs with the TMS320C40 TMS320C40 processor. Special attention is given to parallel implementation of 2-D FFTs. The increasing demands for speed and performance in some real-time DSP applications make sequential systems inadequate. Parallel systems provide higher throughput rates. The algorithms were implemented on the Parallel Processing Development System (PPDS), a system with four TMS320C40s and with both shared- and distributed-memory support. This report is structured as follows: 2-D FFT Algorithm Gives a brief review of the FFT algorithm and its extension to the 2-D case. Describes applications of FFTs in the calculation of correlation and convolution algorithms. Parallel 2-D FFT Focuses on parallel implementations of 2-D FFTs. Shared- and distributed-memory implementations are considered, as well as the TMS320C40 TMS320C40's suitability for each. TMS320C40 TMS320C40 Implementation Presents the results of shared- and distributed-memory implementations of parallel 2-D FFT realized on the PPDS. Gives analyses of speed-up and efficiency. Conclusion States conclusions. Appendixes Lists the code for performing serial and parallel 2-D FFTs in C and 'C40 assembly language code. The 2-D FFT Algorithm The Discrete Fourier Transform (DFT) of an n-point discrete signal x(i) is defined by: X(k) where 0 v k v n * 1 and W + n + p e *j 2 n . n1 + x(i)W n ik i 0 Direct DFT computation requires O (n2) arithmetic operations. A faster method of computing the DFT is the Fast Fourier Transform (FFT) algorithm. If FFT is used to solve an n-point DFT, (log 2 n) steps are required, with n/2 butterfly operations per step. The FFT algorithm therefore requires 1 [ approximately n log 2 n times faster than direct O (n log 2 n) arithmetic operations (which is n 2 log 2 n DFT computation). See [3], [6], and [4] for a more detailed analysis of the 1-D FFT case. Two-dimensional DFT can be defined in a manner similar to the 1-D case [10]. The 2-D DFT is given by: where 0 v k , k v n * 1 and W 1 2 * 1 n* 1 + + + e* n X (k 1,k 2) i1 n 0 i2 +0 p j2n x ( i 1,i 2 ) W n ( i1k1 ) i2k2 ) . A standard approach to computing the 2-D FFT of a matrix A is to perform a 1-D FFT on the rows of A, giving an intermediate matrix A', then performing a 1-D FFT on the columns of A'[10]. This is the approach followed in this application report. Timing Analysis A 2-D FFT of a complex matrix of size (n n) requires the execution of a 1-D FFT on n rows, followed by a 1-D FFT on n columns. The number of arithmetic operations required will therefore be as follows: Time + n * O (n log 2 n) ) n * O (n log (FFT on n rows) n) 2 [ O (n 2 log 2 n) (FFT on n columns) Application of FFT on Correlation/Convolution Algorithms Relationships between image and transform domains can be described by convolution and correlation. Convolution is used for linear interpolation or filtering. Correlation plays an important role in feature extraction in image processing. These image operations are computationally intensive; Fourier transforms can be used to enhance speed. The correlation of two sequences x(i) and y(i) of length n is defined as[10]: + w(i) n k *1 +0 x (k) y (i ) k) For a 1-D correlation, the common direct approach (in time domain) based on shift-and-multiply operations requires O (n 2) arithmetic operations. Based on the convolution property of the Fourier transform [3], an efficient way to compute correlation is by using FFT and inverse FFT (IFFT), as illustrated below: 1. 2. 3. Compute FFT{x(i)} and FFT{y(i)}. Multiply FFT{x(i)} by the complex conjugate of FFT{y(i)}. Compute the IFFT of this result. Similarly, a convolution operation reduces to a simple multiplication in the Fourier domain. FFT correlation/convolution becomes computationally faster than spatial convolution for large images. n2 n , which is significant when dealing with very large Speed-up is approximately log 2 n n log 2 n images. + Parallel 2-D FFT A 2-D FFT is an intrinsically parallel operation; a 1-D FFT is applied separately to each row and column of a matrix. 2 The Parallel Algorithm w Let n = qp, where n is the order of the squared matrix A, p is the number of processors, and q 1 is an integer. The basic idea is to allocate a unique working set of rows/columns to each processor. The algorithm consists of three basic steps: Step 1. FFT on rows: Processor i executes 1-D sequential FFT on rows qi, qi+1, . , qi+q-1, with i = 0,1,.,p-1. Because each processor executes a 1-D FFT on q different rows, this step requires 2 q * O (n log 2 n) O (n log 2 n) arithmetic operations. p [ Step 2. Matrix Transposition: Because column elements are not stored in contiguous memory location, row/column transposition of matrix A is necessary prior to executing FFT on columns. Matrix transposition n (n 1) requires (n 1) exchange steps, where n is the order of the matrix (n 2) 1 2 2 [7]. The computation delay involved in this operation is therefore O (n 2) for serial execution or O n p for parallel execution. * ) * ) AAA ) + * Step 3. FFT on columns: Same as in Step 1, but by column. Speed-Up Analysis The speed-up factor can be calculated as follows: n 2 log 2 n Speed-up O(p) n2 p log 2 n [ [ The parallelism in the 2D-FFT is suitable for implementation on distributed-memory or shared-memory multiprocessors. Let's consider those cases. Shared-Memory Implementation Matrix A is stored in global memory, so each processor has easy access to all the rows/columns. Even when all processors share the same physical data memory, each processor points to a different row/column working set. Shared-memory systems require careful consideration of the memory-contention problem. Matrix transposition (Step 2) is simpler, but row/column access can create a major bottleneck. Shared-memory implementation requires at least 2 * n * n = 2 n2 words of shared memory. If this amount of RAM is unavailable in the system, consider either intermediate downloading of files or distributed-memory implementation as an alternative. Figure 1 illustrates the shared-memory 2-D FFT implementation for p = 4 and n = 8. Distributed-Memory Implementation Matrix A is partitioned into p regions. Each region contains q rows and is assigned to each processor's local memory. Processors communicate via message passing. · · Steps 1 and 3 of the parallel 2-D FFT algorithm are executed in the local memory of each processor. No interprocessor communication is necessary. Step 2, matrix transposition, is more complex because matrix A is distributed between processors by row. You must perform message passing of row segments before you execute matrix transposition. This procedure can be described as follows: 3 Total-exchange step: Processor i sends to processor j columns qj, qj + 1, . , qj + q 1 of each of the rows allocated to it, where 0 j < p and i j. In such DMA-supported devices as the TMS320C40 TMS320C40, this step can be executed simultaneously with Step 1, after computation of each row FFT. Better DMA-CPU parallelism can thus be achieved. This is the approach followed in the parallel 2-D FFT (distributed-memory implementation) presented in this application report. Transposition of submatrices: After the total exchange step, each processor contains all the column elements needed to perform a row/column transposition. Transposition is executed on p squared submatrices of size (q q). Submatrix Gk contains elements (kq + i, j), where (0 i, j q) and (0 k p). The computation delay involved in this operation is 2 p O (q 2) O n . p v t + v t Figure 2 illustrates Step 2 of the distributed-memory implementation, with p = 4 and n = 8. · · · 4 Memory requirements: Each processor requires at least 2n2 / p words of local memory to store the ( n/p ) rows allocated to it. Required topology: This implementation requires a fully connected multiprocessor configuration. Other configurations with rerouting capabilities are also feasible. Refer to [9] for information on total exchange techniques for configurations for cube, mesh, and linear arrays. Output result: Matrix results are stored by column, with column elements stored in successive memory locations in the local memory of each processor. Processor i contains columns qi, qi + 1, . , qi + q 1, where i = 0,1,.,p 1. Figure 1. 2-D FFT Shared-Memory Implementation FFT on Rows: Matrix A (Shared Memory) Processor 0 Processor 2 01 02 03 04 05 06 07 11 12 13 14 15 16 17 20 21 22 23 24 25 26 27 30 Processor 1 00 10 31 32 33 34 35 36 37 41 42 43 44 45 46 47 51 52 53 54 55 56 57 60 61 62 63 64 65 66 67 70 Processor 3 40 50 71 72 73 74 75 76 77 FFT on Columns: Matrix A (Shared Memory) Processor 0 Processor 1 Processor 2 Processor 3 00 02 03 04 05 06 07 10 11 12 13 14 15 16 17 20 21 22 23 24 25 26 27 30 31 32 33 34 35 36 37 40 41 42 43 44 45 46 47 50 51 52 53 54 55 56 57 60 61 62 63 64 65 66 67 70 NOTE: 01 71 72 73 74 75 76 77 Matrix element A[i][j] Matrix size Number of processors = ij = n = 8 = p =4 5 Figure 2. 2-D FFT Distributed-Memory Implementation (Step 2: Transposition of Submatrices) Matrix A After Processors Have Completed FFTs on All the Rows 00 10 01 11 02 12 03 13 04 14 05 15 06 16 07 17 20 30 21 31 22 32 23 33 24 34 25 35 26 36 27 37 40 50 41 51 42 52 43 53 44 54 45 55 46 56 47 57 60 70 61 71 62 72 63 73 64 74 65 75 66 76 67 77 } } } } Processor 0 (rows 0,1) Processor 1 (rows 2,3) Processor 2 (rows 4,5) Processor 3 (rows 6,7) Matrix A During Total Exchange Step 00 10 01 11 02 12 03 13 04 14 05 15 06 16 07 17 20 30 21 31 22 32 23 33 24 34 25 35 26 36 27 37 40 50 41 51 42 52 43 53 44 54 45 55 46 56 47 57 60 70 61 71 62 72 63 73 64 74 65 75 66 76 67 77 Exchange of (q q) submatrices q q Matrix A After Total Exchange Step and During Submatrix Transposition 00 10 01 11 20 30 21 31 40 50 41 51 60 70 61 71 02 12 03 13 22 32 23 33 42 52 43 53 62 72 63 73 04 14 05 15 24 34 25 35 44 54 45 55 64 74 65 75 06 16 07 17 26 36 27 37 46 56 47 57 66 76 67 77 } } } } Processor 0 Processor 1 Processor 2 Processor 3 Matrix A After Transpositions of Submatrices (ready for FFT on columns) 00 01 20 21 30 31 40 41 50 51 60 61 70 71 02 03 12 13 22 23 32 33 42 43 52 53 62 63 72 73 04 05 14 15 24 25 34 35 44 45 54 55 64 65 74 75 06 07 6 10 11 16 17 26 27 36 37 46 47 56 57 66 67 76 77 } } } } Processor 0 (columns 0,1) Processor 1 (columns 2,3) Processor 2 (columns 4,5) Processor 3 (columns 6,7) TMS320C40 TMS320C40 Implementation The TMS320C40 TMS320C40 The TMS320C40 TMS320C40 is the world's first parallel-processing DSP. In addition to a powerful CPU that can execute up to 11 operations per cycle with a 40- or 50-ns cycle time, the TMS320C40 TMS320C40 contains six communication ports and a multichannel DMA [8]. The on-chip communication ports allow direct (glueless) processor-to-processor communication, and the DMA unit provides concurrent I/O by running parallel to the CPU. Special interlocked instructions also provide support for shared-memory arbitration. These features make the TMS320C40 TMS320C40 suitable for both distributed- and shared-memory computing systems [2]. The 2-D FFT algorithm was implemented on the TMS320C40 TMS320C40 Parallel Processing Development System (PPDS). The PPDS includes four TMS320C40s, fully interconnected via the on-chip communication ports. Each 'C40 has 256KB 256KB of local memory SRAM, and all share a 512KB 512KB global memory [5]. General features of the programs: · · · · · · · · All the programs have been written to be independent of the FFT size and the number of processors in the system. Further optimization is possible for a fixed number of processors. Real and imaginary parts of complex numbers are stored in successive memory locations. Both C and assembly language versions of the programs are in the appendices. The programs can be downloaded from the TMS320 TMS320 bulletin board at (713) 274-2323. Set your modem to 8 data bits,1 stop bit, no parity. For the C programs, there are core functions, such as the 1-D FFT and CPU complex moves, and routines to set DMA register values in assembly code to enhanced optimization. For the assembly programs, most of the functions are in-lined to avoid the delay associated with calling routines. But, in order to keep the program flexible, the 1D-FFT has been retained as a subroutine. The new 'C40 LAJ and BUD R11 instructions permit routine calls with just one delay cycle. To make the programs more generic, the (5/4)-cycle sine/cosine table [4] and the input matrix are provided in separate files. The radix-2 1-D FFT routine presented in Appendix D is used as the core for the 2-D FFT implementation. But you can use any FFT routine that complies with the calling conventions. Thus, as faster 1-D FFT algorithms are developed, they can be used to implement faster 2-D FFT algorithms. The 'C40 timer 0 and the timer routines in the parallel runtime support (PRTS) library, available with the 'C40 C compiler, are used for benchmark measures. The real benchmark timing is equal to the timer counter value multiplied by 2 * ('C40 cycle time). For the parallel programs, the total execution time of the parallel algorithm can be defined as T = max (Ti ), where Ti is the execution time taken by processor i. Note that in the programs, Ti is the time between labels t2 and t0. The load-imbalancing case was not considered. Refer to [2] for an example of this case. The compiler/assembler tools were run under OS/2 to avoid memory-limitation problems with the optimizer. Serial Implementation Serial implementations of the 2-D FFT provide accurate speed-up measures for the parallel programs. Appendix NO TAG illustrates single- and double-buffered 2-D FFT serial implementations in C and 'C40 assembly code. 7 Observations: · · Although the 1-D FFT can be executed directly in off-chip RAM, the preferred method is to transfer the row/column to on-chip RAM first. This fully exploits the dual-access single-cycle characteristic of the 'C40 on-chip RAM for some parallel instructions. This transfer delay can be minimized with double-buffering techniques. Double-buffering technique: · · While the CPU is computing FFT on row/column i, the DMA processor transfers the vector result of row/column (i-1) bit-reversed and the next row/column (i+1) from external RAM to on-chip RAM to have it ready for the next FFT computation. This technique is called double buffering. · CPU operations are confined to computing 1-D FFT (1 row at a time). The DMA processor performs the data transfer between on-chip RAM and external RAM, providing the CPU with a new set of data. This requires a continuous CPU-DMA synchronization. The 2K 32-bit-word on-chip RAM constantly holds 2 buffers. Each buffer must be located in a different on-chip RAM block. Because each on-chip RAM block has an independent bus path, CPU/DMA access conflict is minimized. You can compute up to 1K-point real FFT or 512-point complex FFT in on-chip memory. If the double-buffering technique is not used, the system can compute up to 2K-point real FFT or 1K-point complex FFT . For DMA bit-reversed complex transfers, you can use autoinitialization to transfer the real part of the FFT vector result first and the imaginary part later. You must set the "read bit-rev" bit (control register bit 12) to 1 and the source address index to the FFT size (n). Given the 'C40 architecture, no extra delay occurs with CPU/DMA bit-reversed addressing. For DMA column transfers, autoinitialization is also used to transfer the real and imaginary parts of each complex vector. Given the offset addressing capabilities of the 'C40, the transposition step requires no extra cycles when moving columns from off-chip to on-chip RAM. Shared-Memory Parallel Implementation Appendix B contains single- and double-buffered versions of the 2-D FFT (shared-memory version) in C and 'C40 assembly code. Observations: · · · 8 A node ID (my_node) is allocated by software to each processor. In this way, each processor automatically selects its associated row/column working set. Each row/column is initially transferred to on-chip memory to minimize memory access conflict among the processors. Using the DMA for double-buffering minimizes not only this access delay but also the effect of a nonzero wait-state global memory similar to that of the PPDS. Interprocessor synchronization is required before you execute FFT on columns. Synchronization is implemented via a counter flag in global memory. Every processor increments the counter by 1 after completing the execution on the rows allocated to it. In this way, the processors begin executing FFT on columns only after the counter equals p (number of processors). · · · The transposition step that is necessary prior to executing FFT on columns is implemented simultaneously with the transfer of columns to on-chip memory, with no delay penalty. The global memory of the PPDS can contain a complex matrix with a maximum of 256 256 elements. Because of the need for an extra location for the synchronization counter, the program has been tested with a maximum of 128 128 elements. For benchmarking of shared-memory programs, a global start of all the processors is absolutely necessary; otherwise, the real-memory-access conflict resolution will not occur. To facilitate this process, a C-callable routine (syncount.asm) is provided in Appendix D for debugging systems lacking global start capability capability. Rotating priority for shared memory access should be selected by setting the PPDS LCSR register to 0x40. Distributed-Memory Parallel Implementation Two distinct single-buffered 2-D FFT implementations were used: · · Use of DMA only for interprocessor communication (See dis1.c in Appendix C). Use of DMA for interprocessor communication and matrix transposition (See dis2.c and dis2.asm in Appendix C). Observations: · The six-channel DMA coprocessor is used for interprocessor communication during the total-exchange step as follows: · · · As soon as 1-D FFT is completed on a row, the DMA coprocessor will be in charge of transferring (n/p) complex points of this result already stored in local memory, from one processor to the other in total exchange fashion. Each processor will transmit a total of ( n / p ) ( p1 ) O ( n ) complex numbers per row. In the PPDS, a memory-to-memory interprocessor transfer operation of an integer number requires seven clock cycles-four to transmit the 4-byte word, two to write it to/from memory, and one to set up the communication channel. The communication delay will therefore be approximately 7 * 2 * n = 14n clock cycles. [ Careful consideration of the communication delay involved is necessary to achieve true CPU-DMA parallelism. If the total exchange step requires more time than the 1-D FFT computation, the application will slow down. DMA channels are set in split mode [8] with source and destination synchronization using the ICRDY and OCRDY port signals, respectively. In this way, DMA will be interrupted when there is new data to read in the input FIFO. A value will be written to the output FIFO if the output FIFO is not full. Transferring is done in a linear fashion (not bit-reversed). Because the interprocessor communication occurs in an exchange fashion, no extra memory is needed for temporary buffers. Source address and destination addresses are set to the same address values. Destination node IDs help to determine the location of the data to be exchanged. Data will never overlap, because the communication port FIFOs act as data buffers, as long as the communicating processors start executing the exchange.asm routine at approximately the same time. This can be achieved by using a common system reset or the parallel debugger manager (PDM), which is part of the 'C4x emulator. For systems without common reset, use the exch2.asm routine instead of exchange.asm. The exch2.asm routine can be downloaded from the TMS320 TMS320 bulletin board at (713) 274-2323. Set your modem to 8 data bits,1 stop bit, no parity. 9 · The destination node is selected in such a way that each pair of processors are synchronized to talk to each other at approximately the same moment, thus facilitating communication scheduling and avoiding a system lock that could occur if a processor sent data with no processor ready to receive it. You select the destination node by using a XOR operation: (my_node) XOR (i), 0 < i {} > {} > /* LINK USING C CONVENTIONS /* GET RUNTIME SUPPORT = = = = = 0x00 0x002ff800 0x002ffc00 0x40000000 0x80000000 RAM1 RAM1 RAM1 RAM1 RAM1 LM len len len len len = = = = = 0x1000 0x0400 0x0400 0x10000 0x20000 */ */ /* LOCAL MEMORY /* GLOBAL MEMORY */ */ /* /* /* /* /* /* */ */ */ */ */ */ CODE INITIALIZATION TABLES SYSTEM STACK GLOBAL & STATIC VARS Sine tables Input matrix INPUT.ASM * * * INPUT.ASM : input matrix 4 x 4 for serial/shared program * * .global .sect _MATRIX "INPUT" _MATRIX .float .float .float .float 130.0,90.0 66.0,230.0 205.0,136.0 15.0,187.0 ;[0][0] : ;[0][1] : ;[0][2] : ;[0][3] : output output output output 2264.0 56.0 50.0 182.0 , , , , 2288.0 532.0 378.0 22.0 .float .float .float .float 150.0,164.0 222.0,44.0 95.0,243.0 80.0,60.0 ;[1][0] : ;[1][1] : ;[1][2] : ;[1][3] : output output output output 401.0 353.0 423.0 167.0 , , , , 227.0 1.0 373.0 229.0 .float .float .float .float 97.0,36.0 215.0,191.0 209.0,239.0 161.0,22.0 ;[2][0] : ;[2][1] : ;[2][2] : ;[2][3] : output output output output 68.0 106.0 418.0 616.0 , , , , 26.0 176.0 636.0 266.0 .float .float .float .float .end 117.0,238.0 203.0,44.0 104.0,187.0 195.0,177.0 ;[3][0] : ;[3][1] : ;[3][2] : ;[3][3] : output output output output 131.0 175.0 225.0 159.0 , , , , 83.0 319.0 133.0 79.0 23 SINTAB.ASM * * * SINTAB.ASM : Table with twiddle factors for a 4point CFFT * and data input. File to be linked with the * source code for a 4point, radix2 FFT. * * .global .global .global N M SINE N M .set .set 4 2 .data SINE COSINE .float .float .float .float .float .end 24 0.000000 1.000000 0.000000 1.000000 0.000000 A.2. SER.ASM: Single-Buffered Implementation ('C40 Assembly Program) SER.ASM * * * SER.ASM : TMS320C40 TMS320C40 complex 2DFFT serial program * (Singlebuffered version) * * Routines used: * cfft.asm (complex FFT) * * Requirements: Matrix size = N > 0 * * To run: * * asm30 v40 gs ser.asm * asm30 v40 gs sintab.asm * asm30 v40 gs input.asm * lnk30 ser.cmd * * .global .global .global .global N _MATRIX CFFT C2DFFT ; ; ; ; FFT size Matrix address Complex 1DFFT subroutine Entry point for execution _STACK .usect "STACK",10h ; Stack definition FFTSIZE MATR STACK BLOCK0 TIMER .text .word .word .word .word .word N _MATRIX _STACK 002FF800H 002FF800H 0100020H 0100020H ; Stack address ; Onchip buffer (RAM block 0) ; Timer 0 address LDP LDI FFTSIZE @STACK,SP ; Data page pointer initialization ; Stack pointer initialization LDI STIK LDI STI @TIMER,AR2 1,*+AR2(8) 961,R0 R0,*AR2 ; Optional: benchmarking (time_start) OR LDI SUBI 1800h,ST @FFTSIZE,AR3 1,AR3,AR5 ; Enabling cache ; AR3 = N = FFT size ; AR5 = row counter LDI LDI @MATR,AR7 @BLOCK0,AR6 ; AR7 = matrix pointer ; AR6 = onchip buffer pointer C2DFFT t0: * * FFT ON ROWS * LOOPR 25 * * Move row X * * to onchip memory * * SUBI3 LDI RPTBD LDI LDI LDF || LOOP1 || 2,AR3,RC AR7,AR0 LOOP1 AR6,AR1 2,IR0 *+AR0(1),R0 ; RC = N2 ; Source address LDF STF LDF STF *AR0+(IR0),R1 R0, *+AR1(1) *+AR0(1),R0 R1,*AR1+(IR0) ; ; ; ; CFFT *AR0,R1 ; Call 1DFFT (complex) ; Load X(N1) Re R0,*+AR1(1) R1,*AR1 ; Store X(N1) Im ; Store X(N1) Re ; Destination address ; R0 = X(I) Im X(I) Re & points to X(I+1) Store X(I) Im R0 = X(I+1) Im Store X(I) Re * * FFT on row X * * || LAJ LDF NOP STF STF * * Move row X (bitreversed) from * * onchip memory to external memory * * SUBI3 LDI LDI RPTBD LDI LDI LDF || LOOP2 || || LDF STF LDF STF LDF STF DBUD STF LSH3 ADDI 2,AR3,RC AR6,AR0 AR7,AR1 LOOP2 AR3,IR0 2,IR1 *+AR0(1),R0 ; Source address ; Destination Address ; Source offset for bitreverse = N ; Destination offset *AR0+(IR0)B,R1 R0,*+AR1(1) *+AR0(1),R0 R1,*AR1+(IR1) *AR0+(IR0)B,R1 R0,*+AR1(1) AR5,LOOPR R1,*AR1+(IR1) 1,AR3,R0 R0,AR7 * * FFT ON COLUMNS * t1: 26 SUBI LDI 1,AR3,AR5 @MATR,AR7 ; AR5 = column counter ; AR7 = Matrix pointer LOOPC * * Move column X (X=AR7) * * to onchip memory * * SUBI3 LDI LDI RPTBD LSH3 LDI LDF || LOOP3 || 2,AR3,RC AR7,AR0 AR6,AR1 LOOP3 1,AR3,IR1 2,IR0 *+AR0(1),R0 ; RC=N2 ; Source address ; Destination address LDF STF LDF STF *AR0+(IR1),R1 R0, *+AR1(1) *+AR0(1),R0 R1,*AR1+(IR0) ; ; ; ; ; Source offset = 2*N ; Destination offset ; R0= X(I) Im X(I) Re & points to X(I+1) Store X(I) Im R0=X(I+1) Im Store X(I) Re * * FFT on column X * * || LAJ LDF NOP STF STF CFFT *AR0,R1 ; Load X(N1) Re R0,*+AR1(1) R1,*AR1 ; Store X(N1) Im ; Store X(N1) Re * * Move column X (bitreversed) from * * onchip memory to external memory * * SUBI3 LDI LDI RPTBD LDI LSH LDF || t2 LDF STF LDF STF *AR0+(IR0)B,R1 R0,*+AR1(1) *+AR0(1),R0 R1,*AR1+(IR1) DBUD LDF STF STF ADDI AR5,LOOPC *AR0+(IR0)B,R1 R0,*+AR1(1) R1,*AR1+(IR1) 2,AR7 LDI LDI || LOOP4 || 2,AR3,RC AR6,AR0 AR7,AR1 LOOP4 AR3,IR0 1,AR3,IR1 *+AR0(1),R0 @TIMER,AR2 *+AR2(4),R0 B .end ; RC=N2 ; Source address ; Destination address t2 ; Source offset = IR0 = N (bitreverse) ; Destination offset (columns) = IR1 = 2N ; Optional: benchmarking ; tcomp = R0 27 SER.CMD input.obj ser.obj sintab.obj lmylib.lib m ser.map o ser.out MEMORY { ROM: RAM0: RAM1: LM: GM: o o o o o = = = = = .text .data STACK INPUT :{} :{} :{} :{} 0x00000000 0x002ff800 0x002ffc00 0x40000000 0x80000000 l l l l l = = = = = 0x1000 0x400 0x400 0x10000 0x20000 } SECTIONS { } 28 > > > > RAM1 RAM1 RAM1 LM /* SINE TABLE /* STACK */ /* INPUT MATRIX */ */ A.3. SERB.C: Double-Buffered Implementation (C Program) SERB.C /* SERB.C : Serial 2dimensional complex FFT (Doublebuffered version) To run: cl30 v40 g s mr o2 serb.c asm30 v40 s sintab.asm asm30 v40 s input.asm lnk30 serbc.cmd Requirement: SIZE 4 w */ #define SIZE 4 /* FFT size */ #define LOGSIZE 2 /* log(FFT size) */ #define #define #define #define #define BLOCK0 0x002ff800 /* onchip buffer 1 BLOCK1 0x002ffc00 /* onchip buffer 2 DMA0 0x001000a0 /* DMA0 address SWAP(x,y) temp = x; x = y; y = temp; WAIT_DMA(x) while (0x00c00000 & *x) != 0x00800000); extern void cfftc(), cmove(), cmoveb(), set_dma(); extern float MATRIX[SIZE][SIZE*2]; /* Input complex matrix /* /* /* /* */ */ */ Ccallable complex FFT CPU complex move CPU bitreversed move Setup DMA registers */ */ */ */ */ /* DMA control register values */ int int int int int ctrl0= 0x00c41004; /* no autoinit.,dmaint,bit_rev ctrl1= 0x00c01008; /* autoinit.,no dmaint,bitrev ctrl2= 0x00c00008; /* autoinit, no dmaint ctrl3= 0x00c40004; /* no autoinit.,dmain dma01[7], dma02[7], dma03[7], dma04[7]; */ */ */ */ float *CPUbuffer =(float *)BLOCK0, *DMAbuffer =(float *)BLOCK1, *MM[SIZE], *temp; */ */ /* block for CPU FFT operations /* block for DMA operations volatile int *dma0 = (int *)DMA0; int size2 = (SIZE*2),i,j; int tcomp; /*/ main() { asm(" or 1800h,st"); for (i=0;i > > > LM LM RAM1 RAM1 RAM1 RAM1 /* /* /* /* /* /* INPUT MATRIX CODE INITIALIZATION TABLES SYSTEM STACK GLOBAL & STATIC VARS SINE TABLES */ */ */ */ */ */ } 33 A.4. SERB.ASM: Double-Buffered Implementation ('C40 Assembly Program) SERB.ASM * * * SERB.ASM : TMS320C40 TMS320C40 complex 2DFFT serial program * (Doublebuffered version) * * Routines used: cfft.asm (complex FFT) * * Requirements: matrix size = N >= 4 * * To run: * asm30 v40 s g serb.asm * asm30 v40 s g sintab.asm * asm30 v40 s g input.asm * lnk30 serb.cmd * * .global .global .global .global _STACK N _MATRIX CFFT C2DFFT ; ; ; ; FFT SIZE MATRIX ADDRESS 1DFFT SUBROUTINE ENTRY POINT FOR EXECUTION .usect "STACK", 10h ; Stack definition * DMA AUTOINITIALIZATION VALUES DMA04 DMA04 .sect .space .word .space .word .space .word .space "DMA_AUTOINT' ' 6 DMA02 DMA02 6 DMA03 DMA03 6 DMA04 DMA04 6 FFTSIZE MATR BLOCK0 BLOCK1 STACK_A DMA0 CTRL0 CTRL1 CTRL2 CTRL3 P04 P03 P02 P01 MASK TIMER .text .word .word .word .word .word .word .word .word .word .word .word .word .word .word .word .word N _MATRIX 002FF800H 002FF800H 002FFC00H 002FFC00H _STACK 001000A0H 001000A0H 00C41004H 00C41004H 00C01008H 00C01008H 00C00008H 00C00008H 00C40004H 00C40004H DMA04 DMA04 DMA03 DMA03 DMA02 DMA02 DMA01 DMA01 02000000H 02000000H 0100020H 0100020H ; ; ; ; ; ; ; ; ; ; ; ; ; ; LDP LDI FFTSIZE @STACK_A,SP ; LOAD DATA PAGE POINTER ; INITIALIZE THE STACK POINTER DMA01 DMA01 DMA02 DMA02 DMA03 DMA03 C2DFFT 34 ; DMA autoinitialization values RAM BLOCK 0 RAM BLOCK 1 STACK ADDRESS ADDRESS OF DMA0 NO AUTOINITIALIZATION, DMA AUTOINITIALIZATION, NO DMA AUTOINITIALIZATION, NO DMA NO AUTOINITIALIZATION, DMA POINTER TO REGISTER VALUES POINTER TO REGISTER VALUES POINTER TO REGISTER VALUES POINTER TO REGISTER VALUES 1 IN DMAINT0 TIMER 0 address INT., BITREV INT., BITREV INT. INT. (DMA04 DMA04) (DMA03 DMA03) (DMA02 DMA02) (DMA01 DMA01) t0: LDI STIK LDI STI OR LDI LDI LDI LDI @TIMER,AR2 1,*+AR2(8) 961,R0 R0,*AR2 1800h,ST @FFTSIZE,AR3 @MATR,AR7 @BLOCK1,R7 @BLOCK0,AR6 ; OPTIONAL: BENCHMARKING (TIME_START) ; ; ; ; ; ENABLE CACHE AR3=N POINTER TO MATRIX POINTER TO DMA BUFFER POINTER TO FFT BUFFER * * CPU MOVES ROW 0 * * SUBI3 2,AR3,RC ; RC=N2 LDI AR7,AR0 ; SOURCE RPTBD LOOP1 LDI AR6,AR1 ; DESTINATION LDI 2,IR1 LDF *+AR0(1),R0 ; R0= X0(I) IM * LOOP || LOOP1 || LDF STF LDF STF *AR0+(IR1),R1 R0, *+AR1(1) *+AR0(1),R0 R1,*AR1+(IR1) * STORE LAST VALUE LDI @P02,AR2 LDF *AR0+(IR1) || STF R0,*+AR1(1) STF R1,*AR1 ; ; ; ; X0(I) RE & POINTS TO X0(I+1) STORE X0(I) IM R0=X0(I+1) IM STORE X0(I) RE ; POINTS DMA REGISTER ,R1 ; LOAD X0(N1) RE ; STORE X0(N1) IM ; STORE X0(N1) RE * * * SET PARAMETERS FOR DMA02 DMA02, DMA03 DMA03 THAT WILL ALWAYS BE FIXED * * * LDI LDI LDI STI STI STI STI STI STI LSH3 STIK STIK LDI STI LDI STIK SUBI STI STI STIK @P03,AR1 @P01,AR4 @CTRL1,R0 R0,*AR2 AR3,*+AR2(2) R0,*AR4 AR3,*+AR4(2) AR3,*+AR2(3) AR3,*+AR4(3) 1,AR3,R0 2H,*+AR2(5) 2H,*+AR4(5) @CTRL3,R2 R2,*AR1 @DMA0,AR2 1H,*+AR1(2) 3,AR3,AR5 R0,*+AR1(3) AR0,*+AR2(1) 1H,*+AR1(5) ; POINTS DMA REGISTER ; SOURCE INDEX ; ; ; ; ; ; SOURCE INDEX COUNTER COUNTER R0=2*N DESTINATION INDEX DESTINATION INDEX ; ; ; ; ; ; POINTS DMA REGISTER SOURCE INDEX AR5=N3 : (N2) DMA TRANSFERS COUNTER SOURCE DESTINATION INDEX 35 * * CPU : FFT ON ROW 0 * * DMA : BEGINS TO TRANSFER ROW1 * * STIK 1H,*+AR2(2) ; SOURCE INDEX STI R0,*+AR2(3) ; COUNTER=2N LAJ CFFT ; FFT ON ROW 0 STI R7,*+AR2(4) ; DESTINATION STIK 1H,*+AR2(5) ; DESTINATION INDEX STI R2,*AR2 ; CONTROL3 * * FFT ON ROWS * * LDI @P02,AR1 LDI @P01,AR0 LSH3 1,AR3,R0 LOOP2 TSTB @MASK,IIF BZAT LOOP2 * DMA02 DMA02: BITREVERSED TRANSFER OF LAST RESULT (Im) ADDI 1H,AR6,R1 STI R1,*+AR1(1) ; SOURCE ADDI 1H,AR7,R1 STI R1,*+AR1(4) ; DST * DMA01 DMA01: BITREVERSED TRANSFER OF LAST RESULT (Re) STI AR6,*+AR0(1) ; SOURCE STI AR7,*+AR0(4) ; DST * DMA03 DMA03: TRANSFER LDI ADDI ADDI STI AND STI LDI LDI LDI STIK STI NEXT ROW @P03,AR1 R0,AR7 AR7,R0 R0,*+AR1(1) 0H,IIF AR6,*+AR1(4) @DMA0,AR1 R7,R2 AR6,R7 0,*+AR1(3) AR0,*+AR1(6) * FFT ON CURRENT ROW LAJ CFFT LDI R2,AR6 LDI @CTRL2,R0 STI R0,*AR1 DBUD LDI LDI LSH3 36 AR5,LOOP2 @P02,AR1 @P01,AR0 1,AR3,R0 ; DMA0 ; SOURCE: NEXT ROW ; CLEAR FLAG ; DESTINATION: ; EXCHANGE BUFFER POINTERS ; R7: POINTER FOR NEXT DMA ; TEMPORAL FIX ; AR6: POINTER FOR NEXT FFT ; START (DMA) ; DMA0 * * DMA: TRANSFER BACK RESULT (ROW N2). BITREVERSED * * TRANSFER FIRST COLUMN (EXCEPT LAST LOCATION) * * * * CPU: FFT ON LAST ROW (ROW N1) * * LDI @P01,AR2 ; DMA0 LDI @P02,AR1 ; DMA02 DMA02 LDI @P03,AR0 ; AR0 POINTS TO DMA03 DMA03 LDI @P04,AR4 ; AR1 POINTS TO DMA04 DMA04 B2 TSTB @MASK,IIF BZAT B2 * DMA02 DMA02: BITREVERSED TRANSFER OF LAST RESULT (Im) ADDI 1H,AR6,R0 STI R0,*+AR1(1) ; SOURCE ADDI 1H,AR7,R0 STI R0,*+AR1(4) ; DST * DMA01 DMA01: BITREVERSED TRANSFER OF LAST RESULT (ROW N2: Re) STI AR6,*+AR2(1) ; SOURCE STI AR6,*+AR0(4) ; DESTINATION: BLOCK0(RE) STI AR7,*+AR2(4) ; DST STIK 2H,*+AR0(5) ; DESTINATION INDEX=2 (RE) STIK 2H,*+AR4(5) ; DESTINATION INDEX=2 (IM) * DMA03 DMA03: TRANSFER COLUMN 0 (Re) EXCEPT LAST LOCATION * DMA04 DMA04: TRANSFER COLUMN 0 (Im) EXCEPT LAST LOCATION LDI @CTRL2,R1 STI R1,*AR0 LDI @CTRL3,R0 STI R0,*AR4 LDI @MATR,R0 ; R0:ADDRESS OF FIRST COLUMN STI R0,*+AR0(1) ; SOURCE: (RE) ADDI STI ADDI STI LSH3 STI SUBI STI AND STI ADDI STI 1,R0 ; POINTS TO IMAGINARY PART R0,*+AR4(1) ; SOURCE: (IM) 1,AR6,R1 ; R1,*+AR4(4) ; DESTINATION: BLOCK0(IM) 1,AR3,R1 R1,*+AR0(2) ; SOURCE INDEX=2*N 1,AR3,R0 ; R0=N1 R1,*+AR4(2) ; 0H,IIF ; CLEAR FLAG R0,*+AR0(3 ; COUNTER=N1 R1,AR7 ; R1=2N R0,*+AR4(3) LDI LDI LDI STIK STI @DMA0,AR0 R7,R2 AR6,R7 0,*+AR0(3) AR2,*+AR0(6) * FFT ON LAST ROW LAJ CFFT LDI R2,AR6 LDI @CTRL2,R0 STI R0,*AR0 ; GIVE THE START ; EXCHANGE BUFFER POINTERS ; R7: POINTER FOR NEXT DMA ; AR6: POINTER FOR NEXT FFT ; START (DMA) 37 * * DMA: TRANSFER BACK RESULT (LAST ROW) * * TRANSFER SECOND COLUMN (COLUMN 1) * * CPU: FFT ON FIRST COLUMN * * LDI LDI LDI LDI @P02,AR1 @P01,AR0 @P03,AR4 @P04,AR5 ; ; ; ; DMA02 DMA02 P01 AR0 POINTS TO DMA03 DMA03 AR1 POINTS TO DMA04 DMA04 B3 TSTB @MASK,IIF BZAT B3 * CPU MOVES LAST VALUE (1ST COLUMN)FROM AR6: BLOCK1 TO R7: LSH3 1,AR3,R2 ADDI R2,R7,AR2 SUBI 2,AR2 ; AR2= BLOCK0+SIZE22 ADDI 1H,AR6,R2 LDF *AR6,R0 ; RE || LDF *+AR6(1),R1 ; IM STI R2,*+AR1(1) ; SOURCE STF R0,*AR2 || STF R1,*+AR2(1) * DMA02 DMA02: BITREVERSED TRANSFER OF ADDI 1H,AR7,R0 STI R0,*+AR1(4) * DMA01 DMA01: BITREVERSED TRANSFER OF STI AR6,*+AR0(1) STI AR7,*+AR0(4) * DMA03 DMA03: TRANSFER COLUMN 1 (Re) * DMA04 DMA04: TRANSFER COLUMN 1 (Im) LDI @MATR,AR7 ADDI 2,AR7,R0 STI R0,*+AR4(1) ADDI 1,R0 STI R0,*+AR5(1) AND 0H,IIF STI AR6,*+AR4(4) ADDI 1,AR6,R1 STI R1,*+AR5(4) LDI R7,R2 STI AR3,*+AR4(3) LDI @DMA0,AR1 LDI AR6,R7 STI AR3,*+AR5(3) STIK 0,*+AR1(3) STI AR0,*+AR1(6) LAST RESULT (Im) ; DST LAST RESULT (Re) ; SOURCE ; DST ; ; ; ; ; ; R0: POINTS TO COLUMN 1 SOURCE: (RE) POINTS TO IMAGINARY PART SOURCE: (IM) CLEAR FLAG DESTINATION: BLOCK1(RE) ; DESTINATION: BLOCK0(IM) ; COUNTER=N ; GIVE THE START ; R7: BLOCK1 * FFT ON FIRST COLUMN LAJ LDI LDI STI 38 CFFT R2,AR6 @CTRL2,R0 R0,*AR1 ; AR6: POINTER FOR NEXT FFT ; START (DMA) * * FFT ON COLUMNS * * t1: B4 LDI LDI SUBI LSH3 STI STI ADDI @P02,AR1 @P01,AR0 3,AR3,AR5 1,AR3,R1 R1,*+AR0(5) R1,*+AR1(5) 1H,AR6,R0 TSTB BZAT @MASK,IIF B4 ; P01 ; AR5=N3: (N2) DMA TRANSFERS ; DST INDEX ; DST INDEX * DMA02 DMA02: BITREVERSED TRANSFER OF LAST RESULT (Im) STI R0,*+AR1(1) ; SOURCE ADDI 1H,AR7,R0 STI R0,*+AR1(4) ; DST * DMA01 DMA01: BITREVERSED TRANSFER OF LAST RESULT (Re) STI AR6,*+AR0(1) ; SOURCE STI AR7,*+AR0(4) ; DST * DMA03 DMA03: TRANSFER NEXT COLUMN (Re) * DMA04 DMA04: TRANSFER NEXT COLUMN (Im) LDI LDI ADDI ADDI STI AND STI ADDI STI ADDI STI LDI LDI LDI STIK STI @P03,AR4 ; AR0 POINTS TO DMA03 DMA03 @P04,AR2 ; AR1 POINTS TO DMA04 DMA04 2,AR7 ; R0: POINTS TO NEXT COLUMN 2,AR7,R0 R0,*+AR4(1) ; SOURCE: (RE) 0H,IIF ; CLEAR FLAG AR6,*+AR4(4) ; DESTINATION: BLOCK1(RE) 1,R0 ; POINTS TO IMAGINARY PART R0,*+AR2(1) ; SOURCE: (IM) 1,AR6,R1 R1,*+AR2(4) ; DESTINATION: BLOCK0(IM) @DMA0,AR1 ; GIVE THE START R7,R2 AR6,R7 ; R7: BLOCK1 0,*+AR1(3) AR0,*+AR1(6) * FFT ON CURRENT COLUMN LAJ CFFT LDI R2,AR6 LDI @CTRL2,R0 STI R0,*AR1 DBUD AR5,B4 LDI @P02,AR1 LDI @P01,AR0 ADDI 1H,AR6,R0 ; AR6: POINTER FOR NEXT FFT ; START (DMA) * * DMA: TRANSFER LAST FFT RESULT * CPU: FFT ON LAST COLUMN * B5 LDI TSTB BZAT @P02,AR1 @MASK,IIF B5 ; DMA0 39 * DMA02/DMA01 DMA02/DMA01: BITREVERSED TRANSFER ADDI 1H,AR6,R0 STI R0,*+AR1(1) ADDI 1H,AR7,R0 STI R0,*+AR1(4) LDI @P01,AR0 LDI @CTRL0,R0 STI R0,*AR1 STI AR6,*+AR0(1) STI AR7,*+AR0(4) LDI @DMA0,AR1 ADDI 2,AR7 AND 0,IIF STIK 0,*+AR1(3) STI AR0,*+AR1(6) LDI @CTRL2,R0 STI R0,*AR1 OF LAST RESULT ; SOURCE ; DST ; P01 ; SOURCE ; DST ; GIVE THE START ; START (DMA) * FFT ON CURRENT ROW LAJ CFFT LDI R7,R2 LDI AR6,R7 LDI R2,AR6 SUBI3 2,AR3,RC ; RC=N2 LDI AR6,AR0 ; SOURCE LDI AR7,AR1 ; DESTINATION RPTBD B6 LDI AR3,IR0 LSH3 1,AR3,IR1 LDF *+AR0(1),R0 ; R0= X(I) IM * LOOP *AR0+(IR0)B,R1 R0, *+AR1(1) *+AR0(1),R0 R1,*AR1+(IR1) ; ; ; ; X(I) RE & POINTS TO X(I+1) STORE X(I) IM R0=X(I+1) IM STORE X(I) RE * STORE LAST VALUE B7 TSTB BZ LDF || STF STF LDI LDI @MASK,IIF B7 *AR0+(IR0)B,R1 R0,*+AR1(1) R1,*AR1 @TIMER,AR2 *+AR2(4),R0 ; ; ; ; ; LOAD X(N1) RE STORE X(N1) IM STORE X(N1) RE OPTIONAL: BENCHMARKING (TIME_READ) TCOMP = R0 t2 t2 || B6 || 40 LDF STF LDF STF B .end SERB.CMD input.obj serb.obj sintab.obj lmylib.lib m serb.map o serb.out MEMORY { ROM: BUF0: RAM0: BUF1: RAM1: LM: GM: o o o o o o o = = = = = = = 0x00000000 0x002ff800 0x002ffa00 0x002ffc00 0x002ffe00 0x40000000 0x80000000 l l l l l l l = = = = = = = 0x1000 0x200 0x200 0x200 0x200 0x10000 0x20000 } SECTIONS { INPUT .text .data STACK DMA_AUTOINI :{} :{} :{} :{} :{} > > > > > LM LM RAM1 RAM1 RAM1 /* SINE TABLE */ /* AUTOINIT. VALUES */ } 41 Appendix B: Parallel 2-D FFT (Shared-Memory Version) B.1. SH.C: Single-Buffered Implementation (C Program) SH.C /* SH.C : Parallel 2dimensional complex FFT (sharedmemory singlebuffered version) Routines used: cfftc.asm cmove.asm cmoveb.asm syncount.asm (Ccallable complexfft) (CPU complex move) (CPU bitreversed complex move) (synchronization routine via counter in-shared memory) To run: cl30 v40 g as mr o2 sh.c asm30 v40 s input.asm asm30 v40 s synch.asm asm30 v40 s sintab.asm lnk30 shc.cmd Note: Before running, initialize my_node variable with the corresponding value, using the 'C40 emulator or an assembly file. */ #define SIZE 16 /* FFT size (n) */ #define LOGSIZE 4 /* log (FFT size) */ #define P 2 /* number of processors */ #define Q SIZE/P /* rows/col. per processor */ #define BLOCK0 0x002ff800 /* onchip RAM buffer */ extern extern extern void cfftc(), cmove(), cmoveb(), syncount(); int colsynch; /* column/row synchronization */ float MATRIX[SIZE][SIZE*2]; /* Input matrix */ float *block0 = (float *)BLOCK0, *MM[SIZE]; int my_node , *colsynch_p =& colsynch, size2 = 2*SIZE, q2 = 2*Q, i,l1,l2; tcomp; int /* nodeid */ /* row/column synchronization */ /*/ main() { asm(" OR 1800h,st"); /* cache enable */ for (i=0;i RAM1 RAM1 RAM1 RAM1 RAM1 GM GM /* INITIALIZATION TABLES /* SYSTEM STACK */ */ /* Sine table /* Input matrix /* Synchronization */ */ */ } SYNCH.ASM * * * SYNCH.ASM : File containing sharedmemory location for * interprocessor synchronization * * .global _colsynch .sect "SYNCH" _colsynch .int 0 .end 44 */ */ */ */ B.2. SH.ASM: Single-Buffered Implementation ('C40 Assembly Program) SH.ASM * * * SH.ASM : TMS320C40 TMS320C40 complex 2DFFT serial program * (Singlebuffered version) * * Routines used: cfft.asm (radix2 complex FFT) * * Requirements: Number of processors = P > 0 * rows/columns per processor = Q > 0 * * To run: * asm30 v40 g s sh.asm * asm30 v40 g s spinput.asm * asm30 v40 g s input.asm * asm30 v40 g s ssintab.asm * asm30 v40 g s synch.asm * asm30 v40 g s 0.asm * asm30 v40 g s 1.asm * lnk30 sh.cmd 0.obj o a0.out (program for processor 0) * lnk30 sh.cmd 1.obj o a1.out (program for processor 1) * * _STACK FFTSIZE PROC NROWS MYID MATR SYNCH STACK BLOCK0 TIMER .global .global .global .global .global .global .global .global .global .usect .text .word .word .word .word .word .word .word .word .word N P Q MYNODE _MATRIX _colsynch CFFT C2DFFT _syncount "STACK",10h ; FFT size ; Number of processors ; Rows per processor N P Q MYNODE _MATRIX _colsynch _STACK 002FF800H 002FF800H 0100020H 0100020H ; Stack address ; Onchip buffer (RAM block 0) ; Timer 0 address (Benchmarking) LDP LDI LDI LDI FFTSIZE @STACK,SP @SYNCH,AR2 @PROC,R2 ; ; ; ; ; ; ; ; Matrix address Synchronization counter Complex 1DFFT subroutine Entry point for execution ; Stack definition C2DFFT Data page pointer initialization Stack pointer initialization Optional: Common start (benchmarking) wait until counter = P 45 t0: CALL LDI STIK LDI STI OR LDI LDI LDI MPYI MPYI LSH LDI ADDI SUBI LDI _syncount @TIMER,AR2 1,*+AR2(8) 961,R0 R0,*AR2 1800h,ST @FFTSIZE,AR3 @MYID,R0 @NROWS,AR5 AR5,R0 AR3,R0 1,R0 @MATR,AR7 R0,AR7 1,AR5 @BLOCK0,AR6 ; Optional: benchmarking (time_start) ; Enabling cache ; AR3 = N = FFT size ; ; ; ; AR5 = row counter = Q Q*MY_NODE R0 = N*Q*MYNODE R0 = 2*N*Q*MYNODE ; AR7 = matrix pointer = &MATRIX[Q*MYNODE][0] ; AR5 = Q1 ; AR6 = onchip buffer pointer * * FFT ON ROWS * LOOPR * * Move row X (X = AR7) * * to onchip memory * * || LOOP1 || SUBI3 LDI RPTBD LDI LDI LDF LDF STF LDF STF 2,AR3,RC AR7,AR0 LOOP1 AR6,AR1 2,IR0 *+AR0(1),R0 *AR0+(IR0),R1 R0, *+AR1(1) *+AR0(1),R0 R1,*AR1+(IR0) ; RC = N2 ; Source address CFFT *AR0,R1 ; Call 1DFFT (complex) ; Load X(N1) Re R0,*+AR1(1) R1,*AR1 ; Store X(N1) Im ; Store X(N1) Re ; ; ; ; ; ; ; Destination address Destination offset R0 = X(I) Im X(I) Re & points to X(I+1) Store X(I) Im R0 = X(I+1) Im Store X(I) Re * * FFT on row X * * || LAJ LDF NOP STF STF * * Move row X (bitreversed) from * * onchip memory to external memory * * SUBI3 LDI LDI RPTBD LDI LDI LDF 46 2,AR3,RC AR6,AR0 AR7,AR1 LOOP2 AR3,IR0 2,IR1 *+AR0(1),R0 ; Source address ; Destination Address ; Source offset for bitreverse = N ; Destination offset || LOOP2 || || LDF STF LDF STF LDF STF DBUD STF LSH3 ADDI *AR0+(IR0)B,R1 R0,*+AR1(1) *+AR0(1),R0 R1,*AR1+(IR1) *AR0+(IR0)B,R1 R0,*+AR1(1) AR5,LOOPR R1,*AR1+(IR1) 1,AR3,R0 R0,AR7 * * FFT ON COLUMNS * LDI LDI MPYI LSH ; AR5 = column counter = Q ; Q*MY_NODE ; 2*Q*MY_NODE (Complex numbers) LDI ADDI @MATR,AR7 R0,AR7 ; AR7 = &MATRIX[0][2*Q*MYNODE] SUBI 1,AR5 ; AR5 = Q1 LDI LDI LSH t1: @MYID,R0 @NROWS,AR5 AR5,R0 1,R0 @SYNCH,AR2 @PROC,R2 1,R2 ; Row/column synchronization CALL _syncount ; Optional: not needed if no common start ; is required LOOPC * * Move column X (X=AR7) * * to onchip memory * * SUBI3 LDI LDI RPTBD LSH3 LDI LDF || LOOP3 || 2,AR3,RC AR7,AR0 AR6,AR1 LOOP3 1,AR3,IR1 2,IR0 *+AR0(1),R0 ; RC=N2 ; Source address ; Destination address LDF STF LDF STF *AR0+(IR1),R1 R0, *+AR1(1) *+AR0(1),R0 R1,*AR1+(IR0) ; ; ; ; CFFT *AR0,R1 ; Load X(N1) Re R0,*+AR1(1) R1,*AR1 ; Store X(N1) Im ; Store X(N1) Re ; Source offset = 2*N ; Destination offset ; R0= X(I) Im X(I) Re & points to X(I+1) Store X(I) Im R0=X(I+1) Im Store X(I) Re * * FFT on column X * * || LAJ LDF NOP STF STF 47 * * Move column X (bitreversed) from * * onchip memory to external memory * * || LOOP4 || || t2 SUBI3 LDI LDI RPTBD LDI LSH LDF LDF STF LDF STF DBUD LDF STF STF ADDI LDI LDI 2,AR3,RC AR6,AR0 AR7,AR1 LOOP4 AR3,IR0 1,AR3,IR1 *+AR0(1),R0 *AR0+(IR0)B,R1 R0,*+AR1(1) *+AR0(1),R0 R1,*AR1+(IR1) AR5,LOOPC *AR0+(IR0)B,R1 R0,*+AR1(1) R1,*AR1+(IR1) 2,AR7 @TIMER,AR2 *+AR2(4),R0 B .end ; RC=N2 ; Source address ; Destination address t2 ; Source offset = IR0 = N (bitreverse) ; Destination offset (columns) = IR1 = 2N ; Optional: benchmarking (time_read) ; tcomp = R0 SH.CMD input.obj sh.obj spinput.obj ssintab.obj synch.obj m sh.map lmylib.lib osh.out MEMORY { ROM: RAM0: RAM1: LM: GM: org org org org org = = = = = 0x00 0x002ff800 0x002ffc00 0x40000000 0x80000000 len len len len len = = = = = 0x1000 0x0400 0x0400 0x10000 0x20000 /* /* /* /* Onchip RAM block 0 Onchip RAM block 1 LOCAL MEMORY GLOBAL MEMORY } /* SPECIFY THE SECTIONS ALLOCATION INTO MEMORY */ SECTIONS { .text: .data: INPUT: STACK: SYNCH: } 48 {} {} {} {} {} >RAM1 >RAM1 >GM >RAM1 >GM /* /* /* /* /* CODE Sine tables Input matrix SYSTEM STACK synchronization */ */ */ */ */ */ */ */ */ SPINPUT.ASM * * * SPINPUT.ASM : input file for sharedmemory program (Data on parallel system) * * .global .global .global .global .set .set .set .set N M P Q N M P Q 16 4 2 N/P ; ; ; ; FFT size LOG2 FFT Number of processors Rows per processor .end 0.ASM * * 0.ASM : File containing nodeid for processor 0 * .global MYNODE MYNODE .set 0 1.ASM * * 1.ASM : File containing nodeid for processor 1 * .global MYNODE MYNODE .set 1 49 B.3. SHB.C: Double-Buffered Implementation (C Program) SHB.C /* SHB.C : Parallel 2dimensional complex FFT (sharedmemory doublebuffered version) Routines used: cfftc.asm cmove.asm cmoveb.asm set_dma.asm syncount.asm (Ccallable radix2 complexfft) (CPU complex move) (CPU bitreversed complex move) (Routine to set DMA register values) (synchronization routine) To run: cl30 v40 g as mr o2 shb.c asm30 v40 s input.asm asm30 v40 s synch.asm asm30 v40 s sintab.asm lnk30 shbc.cmd Requirement: Q 4 w Note: Before running initialize the my_node variable to the corresponding value using the 'C40 emulator or an assembly file. */ #define SIZE 16 /* FFT size */ #define LOGSIZE 4 /* log (FFT size) */ #define P 2 /* number of processors */ #define Q SIZE/P /* row/col. per processor */ #define BLOCK0 0x002ff800 /* onchip buffer 0 */ #define BLOCK1 0x002ffc00 /* onchip buffer 1 */ #define DMA0 0x001000a0 /* DMA0 address */ #define SWAP(x,y) temp = x; x = y; y = temp; #define WAIT_DMA(x) while (0x00c00000 & *x) != 0x00800000); extern extern extern int float volatile int int int 50 void cfftc(), set_dma(), cmove(), cmoveb(), syncount(); int colsynch; /* counter in GM */ float MATRIX[SIZE][SIZE*2]; /* input matrix */ ctrl0= 0x00c41004, /* no autoinit.,dmaint,bit_rev */ ctrl1= 0x00c01008, /* autoinit.,no dmaint,bitrev */ ctrl2= 0x00c00008, /* autoinit., no dmaint */ ctrl3= 0x00c40004; /* no autoinit.,dmaint */ *CPUbuffer =(float *)BLOCK0, /* For CPU FFT operations */ *DMAbuffer =(float *)BLOCK1, /* For DMA operations */ *temp, *MM[SIZE]; int *dma0 = (int *)DMA0; dma01[7], dma02[7], dma03[7], dma04[7]; /* DMA autoinit.values */ my_node, base, *colsynch_p = &colsynch, size2 = (SIZE*2), q = Q, q2 = 2*Q, ii,i,j; tcomp; /*/ main() { asm(" OR 1800h,ST"); for (i=0;i > > > > LM RAM1 RAM1 RAM1 RAM1 GM GM /* initialization tables /* system stack */ */ /* Sine table /* Input matrix /* Synchronization counter */ */ */ B.4. SHB.ASM: Double-Buffered Implementation ('C40 Assembly Program) SHB.ASM * * * SHB.ASM : TMS320C40 TMS320C40 complex 2DFFT sharedmemory program * (Doublebuffered version) * * Routines used: cfft.asm (radix2 complex FFT) * * Requirements: Number of processors = P > 0 * Rows/columns per processor = Q >= 4 * * * To run: * asm30 v40 g s shb.asm * asm30 v40 g s spinput.asm * asm30 v40 g s input.asm * asm30 v40 g s sintab.asm * asm30 v40 g s synch.asm * asm30 v40 g s 0.asm * asm30 v40 g s 1.asm * lnk30 shb.cmd 0.obj o a0.out (program for processor 0) * lnk30 shb.cmd 1.obj o a1.out (program for processor 1) * * .global .global .global .global .global .global .global .global .global _STACK N P Q MYNODE _MATRIX _colsynch CFFT C2DFFT _syncount ; ; ; ; ; ; ; ; ; FFT SIZE NUMBER OF PROCESSORS ROWS/COLUMNS PER PROCESSOR PROCESSOR ID MATRIX ADDRESS SYNCHRONIZATION COUNTER COMPLEX 1DFFT SUBROUTINE ENTRY POINT FOR EXECUTION SYNCHRONIZATION ROUTINE .usect "STACK", 10h ; STACK DEFINITION * DMA AUTOINITIALIZATION VALUES DMA01 DMA01 DMA02 DMA02 DMA03 DMA03 DMA04 DMA04 FFTSIZE PROC NROWS MYID MATR SYNCH STACK BLOCK0 BLOCK1 .sect .space .word .space .word .space .word .space .word .word .word .word .word .word .word .word .word "DMA_AUTOINI" 6 DMA02 DMA02 6 DMA03 DMA03 6 DMA04 DMA04 6 .text N P Q MYNODE _MATRIX _colsynch _STACK 002FF800H 002FF800H 002FFC00H 002FFC00H ; DMA AUTOINITIALIZATION VALUES ; STACK ADDRESS ; RAM BLOCK 0 ; RAM BLOCK 1 55 DMA0 CTRL0 CTRL1 CTRL2 CTRL3 P04 P03 P02 P01 MASK TIMER .word .word .word .word .word .word .word .word .word .word .word 001000A0H 001000A0H 00C41004H 00C41004H 00C01008H 00C01008H 00C00008H 00C00008H 00C40004H 00C40004H DMA04 DMA04 DMA03 DMA03 DMA02 DMA02 DMA01 DMA01 02000000H 02000000H 0100020H 0100020H ; ; ; ; ; ; ; ; ; ; ; ADDRESS OF DMA0 NO AUTOINITIALIZATION, DMA INT., BITREV AUTOINITIALIZATION, NO DMA INT., BITREV AUTOINITIALIZATION, NO DMA INT. NO AUTOINITIALIZATION, DMA INT. POINTER TO REGISTER VALUES (DMA04 DMA04) POINTER TO REGISTER VALUES (DMA03 DMA03) POINTER TO REGISTER VALUES (DMA02 DMA02) POINTER TO REGISTER VALUES (DMA01 DMA01) 1 IN DMAINT0 TIMER 0 ADDRESS (BENCHMARKING) C2DFFT LDP LDI LDI LDI FFTSIZE @STACK,SP @SYNCH,AR2 @PROC,R2 ; ; ; ; LOAD DATA PAGE POINTER INITIALIZE THE STACK POINTER OPTIONAL: COMMON START (BENCHMARKING) WAIT UNTIL COUNTER = P t0: CALL LDI STIK LDI STI OR LDI LDI LDI MPYI MPYI LSH LDI ADDI LDI LDI _syncount @TIMER,AR2 1,*+AR2(8) 961,R0 R0,*AR2 1800h,ST @FFTSIZE,AR3 @MYID,R0 @NROWS,AR5 AR5,R0 AR3,R0 1,R0 @MATR,AR7 R0,AR7 @BLOCK1,R7 @BLOCK0,AR6 ; OPTIONAL: BENCHMARKING (TIME_START) ; ENABLE CACHE ; AR3 = N = FFT SIZE ; ; ; ; AR5 = Q = ROW COUNTER R0 = Q*MYNODE R0 = N*Q*MYNODE R0 = 2*N*Q*MYNODE ; MATRIX POINTER = &MATRIX[2*Q*MYNODE][0] ; POINTER TO DMA BUFFER ; POINTER TO FFT BUFFER * * CPU MOVES ROW 0 * * SUBI3 LDI RPTBD LDI LDI LDF 2,AR3,RC AR7,AR0 LOOP1 AR6,AR1 2,IR1 *+AR0(1),R0 ; RC=N2 ; SOURCE LDF STF LDF STF *AR0+(IR1),R1 R0, *+AR1(1) *+AR0(1),R0 R1,*AR1+(IR1) ; ; ; ; X0(I) RE & POINTS TO X0(I+1) STORE X0(I) IM R0=X0(I+1) IM STORE X0(I) RE @P02,AR2 *AR0+(IR1),R1 R0,*+AR1(1) R1,*AR1 ; ; ; ; POINTS DMA REGISTER LOAD X0(N1) RE STORE X0(N1) IM STORE X0(N1) RE ; DESTINATION ; R0= X0(I) IM * LOOP || LOOP1 || * STORE LAST VALUE || 56 LDI LDF STF STF * * * SET PARAMETERS FOR DMA AUTOINITIALIZATION VALUES THAT ARE FIXED * * LDI @P03,AR1 ; POINTS DMA REGISTER LDI @P01,AR4 LDI @CTRL1,R0 STI R0,*AR2 STI AR3,*+AR2(2) ; SOURCE INDEX STI R0,*AR4 STI AR3,*+AR4(2) ; SOURCE INDEX STI AR3,*+AR2(3) ; COUNTER STI AR3,*+AR4(3) ; COUNTER LSH3 1,AR3,R0 ; R0=2*N STIK 2H,*+AR2(5) ; DESTINATION INDEX STIK 2H,*+AR4(5) ; DESTINATION INDEX LDI @CTRL3,R2 STI R2,*AR1 LDI @DMA0,AR2 ; POINTS DMA REGISTER STIK 1H,*+AR1(2) ; SOURCE INDEX SUBI 3,AR5 ; AR5=Q3 : (Q2) DMA TRANSFERS STI R0,*+AR1(3) ; COUNTER STI AR0,*+AR2(1) ; SOURCE STIK 1H,*+AR1(5) ; DESTINATION INDEX * * CPU : FFT ON ROW 0 * * DMA : BEGINS TO TRANSFER ROW1 * * STIK STI LAJ STI STIK STI 1H,*+AR2(2) R0,*+AR2(3) CFFT R7,*+AR2(4) 1H,*+AR2(5) R2,*AR2 ; ; ; ; ; SOURCE INDEX COUNTER=2N FFT ON ROW 0 DESTINATION DESTINATION INDEX * * FFT ON ROWS * * LOOP2 LDI LDI LSH3 TSTB BZAT @P02,AR1 @P01,AR0 1,AR3,R0 @MASK,IIF LOOP2 * DMA02 DMA02: BITREVERSED TRANSFER OF LAST RESULT (Im) ADDI 1H,AR6,R1 STI R1,*+AR1(1) ; SOURCE ADDI 1H,AR7,R1 STI R1,*+AR1(4) ; DST * DMA01 DMA01: BITREVERSED TRANSFER OF LAST RESULT (Re) STI AR6,*+AR0(1) ; SOURCE STI AR7,*+AR0(4) ; DST 57 * DMA03 DMA03: TRANSFER LDI ADDI ADDI STI AND STI LDI LDI LDI STIK STI NEXT ROW @P03,AR1 R0,AR7 AR7,R0 R0,*+AR1(1) 0H,IIF AR6,*+AR1(4) @DMA0,AR1 R7,R2 AR6,R7 0,*+AR1(3) AR0,*+AR1(6) * FFT ON CURRENT ROW LAJ CFFT LDI R2,AR6 LDI @CTRL2,R0 STI R0,*AR1 DBUD AR5,LOOP2 LDI @P02,AR1 LDI @P01,AR0 LSH3 1,AR3,R0 ; DMA0 ; SOURCE: NEXT ROW ; CLEAR FLAG ; DESTINATION: ; EXCHANGE BUFFER POINTERS ; R7: POINTER FOR NEXT DMA ; TEMPORAL FIX ; AR6: POINTER FOR NEXT FFT ; START (DMA) ; DMA0 * * DMA: TRANSFERS BACK RESULT (ROW N2). BITREVERSED * * CPU: FFT ON LAST ROW (ROW N1) * TRANSFERS BACK FFT OF LAST ROW (ROW N1) * B2 LDI LDI TSTB BZAT @P01,AR2 @P02,AR1 @MASK,IIF B2 ; DMA01 DMA01 ; DMA02 DMA02 * DMA02 DMA02: BITREVERSED TRANSFER OF LAST RESULT (Im) ADDI 1H,AR6,R0 STI R0,*+AR1(1) ; SOURCE ADDI 1H,AR7,R0 STI R0,*+AR1(4) ; DST LDI @CTRL0,R0 STI R0,*AR1 ; CONTROL REGISTER * DMA01 DMA01: BITREVERSED TRANSFER OF STI AR6,*+AR2(1) STI AR7,*+AR2(4) AND 0H,IIF LSH 1,AR3,R0 ADDI R0,AR7 LDI @DMA0,AR0 LDI R7,R2 LDI AR6,R7 STIK 0,*+AR0(3) STI AR2,*+AR0(6) * FFT ON LAST ROW LAJ CFFT LDI R2,AR6 LDI @CTRL2,R0 STI R0,*AR0 58 LAST RESULT (ROW N2: Re) ; SOURCE ; DST ; CLEAR FLAG ; ; ; ; ; ; AR7 POINTS TO MATRIX (LAST ROW) GIVE THE START EXCHANGE BUFFER POINTERS R7: POINTER FOR NEXT DMA COUNTER = 0 LINK POINTER = DMA01 DMA01 ; AR6: POINTER FOR NEXT FFT ; START (DMA) * CPU TRANSFERS FFT OF LAST ROW TO OFFCHIP RAM SUBI3 LDI LDI RPTBD LDI LDI LDF 2,AR3,RC AR6,AR0 AR7,AR1 B66 AR3,IR0 2,IR1 *+AR0(1),R0 ; RC=N2 ; SOURCE ; DESTINATION LDF STF LDF STF *AR0+(IR0)B,R1 R0, *+AR1(1) *+AR0(1),R0 R1,*AR1+(IR1) ; ; ; ; ; R0= X(I) IM * LOOP || B66 || * STORE LAST VALUE LDF *AR0+(IR0)B,R1 || STF R0,*+AR1(1) STF R1,*AR1 X(I) RE & POINTS TO X(I+1) STORE X(I) IM R0=X(I+1) IM STORE X(I) RE ; LOAD X(N1) RE ; STORE X(N1) IM ; STORE X(N1) RE * * FFT ON COLUMNS * * WAIT t1: TSTB BZAT LDI LDI NOP LSH @MASK,IIF WAIT @SYNCH,AR2 @PROC,R2 AND CALL 0,IIF _syncount 1,R2 ; ROW/COLUMN SYNCHRONIZATION ; OPTIONAL: NOT NEEDED IF COMMON START ; IS NOT REQUIRED * * SET AUTOINITIALIZATION VALUES * || || LDI LDI LDI LDI LDI LSH STI STI STI STI STI LDI LDI STI STI STI STI STIK STIK STI @P01,AR0 @P02,AR1 @P03,AR2 @P04,AR4 @CTRL1,R4 1,AR3,R0 R4,*AR1 R0,*+AR0(IR0) R0,*+AR1(IR0) R0,*+AR2(2) R0,*+AR4(2) @CTRL2,R0 @CTRL3,R1 R0,*AR2 R1,*AR4 AR3,*+AR2(3) AR3,*+AR4(3) 2,*+AR2(IR0) 2,*+AR4(IR0) AR4,*+AR2(6) ; ; : ; ; DMA01 DMA01 DMA02 DMA02 DMA03 DMA03 DMA04 DMA04 LDI 5,IR0 SET DMA02 DMA02 AND DMA01 DMA01 NEW VALUES ; SET DMA03 DMA03 AND DMA04 DMA04 VALUES ; (SRC INDEX) ; (CTRL) ; (COUNTER) ; (DST INDEX) ; (LINK POINTER) 59 * * DMA : TRANSFER COLUMN 1 TO ONCHIP RAM (BLOCK1) * LDI @BLOCK0,AR6 LDI @BLOCK1,R7 LDI @MYID,R0 LDI @NROWS,AR5 ; AR5 = Q = COLUMN COUNTER MPYI AR5,R0 ; R0 = Q*MYNODE LSH 1,R0 ; R0 = 2*Q*MYNODE LDI ADDI LDI LDI ADDI ADDI STI STI STI ADDI STI 2,AR7,R0 1,R0,R1 R0,*+AR2(1) R1,*+AR4(1) R7,*+AR2(4) 1,R7,R0 R0,*+AR4(4) LDI STI STIK LDI STI || @MATR,AR7 R0,AR7 @BLOCK1,R7 @BLOCK0,AR6 @DMA0,AR0 AR2,*+AR0(6) 0,*+AR0(3) @CTRL2,R4 R4,*AR0 ; MATRIX POINTER = &MATRIX[2*Q*MYNODE][0] ; POINTER TO DMA BUFFER ; POINTER TO FFT BUFFER ; SRC ; SRC ; DST ADDRESS (Re PART) ADDRESS (Im PART) ADDRESS (DMA_BUFFER) ; DST ADDRESS (DMA_BUFFER+1) ; DMA START * * CPU : TRANSFER COLUMN 0 TO ONCHIP RAM (BLOCK0) * FFT ON COLUMN 0 * SUBI3 2,AR3,RC ; RC=N2 LDI AR7,AR0 ; SOURCE ADDRESS LDI AR6,AR1 ; DESTINATION ADDRESS RPTBD LOOP3 LSH3 1,AR3,IR1 ; SOURCE OFFSET = 2*N LDI 2,IR0 ; DESTINATION OFFSET LDF *+AR0(1),R0 ; R0= X(I) IM || LOOP3 || LDF STF LDF STF *AR0+(IR1),R1 R0, *+AR1(1) *+AR0(1),R0 R1,*AR1+(IR0) ; ; ; ; X(I) RE & POINTS TO X(I+1) STORE X(I) IM R0=X(I+1) IM STORE X(I) RE CFFT *AR0,R1 ; LOAD X(N1) RE || LAJ LDF NOP STF STF R0,*+AR1(1) R1,*AR1 ; STORE X(N1) IM ; STORE X(N1) RE * * DMA: MOVES FFT COLUMN (I) (Re PART) TO OFFCHIP RAM * MOVES FFT COLUMN (I) (Im PART) TO OFFCHIP RAM * MOVES COLUMN (I+2) (Re PART) TO