NEW DATABASE - 350 MILLION DATASHEETS FROM 8500 MANUFACTURERS
AN2197 SC140/SC1400 SC140 G729A CH370 - Datasheet Archive
Application Note AN2197 Rev. 1, 1/2005 Implementing the Levinson-Durbin Algorithm on the StarCoreTM SC140/SC1400 Cores By
Freescale Semiconductor Application Note AN2197 AN2197 Rev. 1, 1/2005 Implementing the Levinson-Durbin Algorithm on the StarCoreTM SC140/SC1400 SC140/SC1400 Cores By Corneliu Margina and Bogdan Costinescu This application note describes a set of techniques for implementing and optimizing the Levinson-Durbin algorithm on DSPs based on the StarCoreTM SC140/SC1400 SC140/SC1400 cores. This algorithm is used in linear prediction coding (LPC) to compute the linear prediction filter coefficients in speech encoders. The Levinson-Durbin algorithm uses the autocorrelation method to estimate the linear prediction parameters for a segment of speech. Linear prediction coding, also known as linear prediction analysis (LPA), is used to represent the shape of the spectrum of a segment of speech with relatively few parameters. This coding eliminates redundancy in the short-term correlation of adjacent samples, thereby providing more efficient coding. © Freescale Semiconductor, Inc., 2001, 2005. All rights reserved. CONTENTS 1 1.1 1.2 1.3 1.4 2 3 3.1 3.2 4 4.1 4.2 4.3 5 6 Vocoder Recommendations .2 ITU-T Recommendation G.723-1 .2 ITU-T Recommendation G.729 .2 ITU-T Recommendation G.729A .3 ETSI GSM EFR .3 Background Theory .4 Levinson-Durbin Implementation on the SC140 SC140 Core .5 C Implementation .6 Assembly Implementation . 8 Implementation Differences .12 ITU-T G.729 .12 ITU-T G.723-1 .12 ETSI GSM EFR .12 Conclusions .12 References .13 Vocoder Recommendations 1 Vocoder Recommendations Low bit rate speech coders used in digital communications systems use audio signal compression to eliminate redundancy, thus reducing bandwidth. ITU-T has proposed several algorithms for speech signal coding at a low bit rate. This section surveys the vocoders for which the Levinson-Durbin algorithm was implemented and optimized. These include ITU-T Recommendations G.723-1, G.729, and G.729A, as well as the ETSI GSM EFR vocoder. 1.1 ITU-T Recommendation G.723-1 The G.723-1 system encodes the audio signal in frames using linear predictive analysis-by-synthesis coding. The coder uses a limited amount of complexity to represent speech at high quality. The G.723-1 coder has two associated bit rates-5.3 and 6.3 kbits per second. Although both bit rates provide good audio reproduction quality, the higher bit rate produces higher quality than the lower bit rate. It is possible to switch between the two bit rates at any 30-ms frame boundary. In this system, the analog voice input signal is passed through a telephone bandwidth filter (Recommendation G.712 [5]), sampled at 8000 Hz, and then converted at 16-bit linear PCM to generate the encoder input. The output of the decoder is converted back to an analog signal in the same way. Linear prediction analysis-by-synthesis is used to minimize the weighted error signal. The encoder operates on frames of 240 samples each that are equivalent to 30 ms at an 8000 Hz sample rate. A high pass filter is used to remove the DC component. Each frame is then divided into four sub-frames of 60 samples each. A 10th order LPC filter is computed for every sub-frame using the unprocessed input signal. The LPC filter for the last sub-frame is quantized using a predictive split vector quantizer. The LPC unquantized coefficients are then used to construct the perceptually weighted speech signal. Using pitch estimation, a shaping noise filter is constructed and combined with the LPC synthesis filter and the formant perceptual weighting filter to construct an impulse response. The pitch period is computed as a small differential value using pitch estimation and the impulse response to compute a closed loop pitch predictor. Both pitch period and the differential value are transmitted to the decoder. The decoder constructs the LPC synthesis filter using the quantized LPC indices. It also decodes the adaptive codebook and fixed codebook excitations and inputs them to the LPC synthesis filter. The excitation signal for the higher rate coder is Multipulse Maximum Likelihood Quantization (MP-MLQ). For the lower rate coder the excitation signal is Algebraic-Code-Excited Linear-Prediction (ACELP). The vocoder operates on 30 ms frames with 7.5 ms look-ahead for linear prediction analysis. The overall algorithmic delay is 37.5 ms. 1.2 ITU-T Recommendation G.729 ITU-T Recommendation G.729 proposes an algorithm for the speech signal coding at 8 kbit/s using ConjugateStructure Algebraic-Code-Excited Linear-Prediction (CS-ACELP). The analog voice input signal is passed through a telephone bandwidth filter, sampled at 8000 Hz, and converted at 16-bit linear PCM to generate the encoder input. The output of the decoder is converted back to an analog signal in the same way. The coder operates on 10 ms speech frames equivalent to 80 samples at a sampling rate of 8000 samples per second, with 5 ms look-ahead for linear prediction analysis. The overall algorithmic delay is 15 ms. Implementing the Levinson-Durbin Algorithm on the StarCoreTM SC140/SC1400 SC140/SC1400 Cores, Rev. 1 2 Freescale Semiconductor Vocoder Recommendations The encoder analyzes the speech signal to extract the CELP model parameters, which include linear prediction filter coefficients and adaptive and fixed-codebook indices and gains. These parameters are transmitted to the decoder. The decoder uses the CELP parameters to retrieve the excitation and synthesis filter parameters corresponding to a 10 ms frame. The speech is reconstructed by passing the excitation signal through a short-term synthesis filter based on a 10th order linear prediction filter. During the LP analysis, the LP coefficients are converted to line spectrum pairs (LSP) and then quantized using predictive two-stage vector quantization (VQ) with 18 bits. The fixed and adaptive codebook parameters are computed for every 5 ms (40-sample) sub-frame. The long-term synthesis filter uses the adaptive-codebook approach. A postfilter enhances the reconstructed speech. 1.3 ITU-T Recommendation G.729A The ITU-T Recommendation G.729A proposes a reduced complexity version of the 8 kbit/s CS-ACELP speech codec (ITU-T Recommendation G.729). This version is primarily for use in simultaneous voice and data applications, although it is not limited to these applications. The coder operates on 10ms frames with a 5 ms lookahead for linear prediction analysis. The changes applied to G.729A to reduce the codec algorithm complexity include the following: · · Open-loop pitch analysis is simplified by using decimation while computing the correlations of the weighted speech. · The adaptive codebook search is simplified by maximizing the correlation between the past excitation and the backward filtered target signal. · The fixed algebraic codebook search is simplified by using an iterative depth-first tree search approach. · 1.4 The perceptual weighting filter is based on the quantized LP filter coefficients. This reduces the number of filtering operations required to compute the impulse response, compute the target signal and update the filter states. The harmonic postfilter in the decoder is simplified by using only integer delays. ETSI GSM EFR The Global System for Mobile Communication Enhanced Full Rate (GSM EFR) encodes speech at 12.2 kbit/s using algebraic code excited linear prediction (ACELP). The algorithm divides the 20 ms frame into four 5 ms subframes. A 10th order linear prediction analysis, including an asymmetric 30 ms window, is performed twice per frame. The first window emphasizes the second sub-frame, and the second window emphasizes the fourth subframe. The LP parameters are converted to line spectral pairs. The algorithm incorporates an initial open-loop search twice per frame, and a closed-loop search is repeated for each sub-frame. The pulse positions are optimized by a non-exhaustive analysis-by-synthesis search to minimize the perceptually weighted error. At the decoder, the synthesized speech is filtered with an adaptive postfilter. Decoded speech quality is high. Implementing the Levinson-Durbin Algorithm on the StarCoreTM SC140/SC1400 SC140/SC1400 Cores, Rev. 1 Freescale Semiconductor 3 Background Theory 2 Background Theory Linear prediction analysis characterizes the shape of the spectrum of a short segment of speech with a small number of parameters for efficient coding. Linear prediction, also referred to as linear prediction coding (LPC), predicts a time-domain speech sample based on a linearly-weighted combination of previous samples. LP analysis removes the redundancy in the short-term correlation of adjacent samples. LPC determines the coefficients of a FIR filter that predicts the next value in a sequence from current and the previous inputs. This type of filter is also known as a one-step forward linear predictor. LP analysis is based on the all-pole filter described in Equation 1: 11 H ( z ) = - = -p A(z) k 1 ak z k Equation 1 1 where { k ( 1 k p ) } are the predictor coefficients and p is the order of the filter. a Transforming Equation 1 to the time-domain, as shown in Equation 2, predicts a speech sample based on a sum of weighted past samples. p s ( n ) = ak s(n k) Equation 2 k=1 where s ( n ) is the predicted value based on the previous values of the speech signal s ( n ) . LP analysis requires estimating the LP parameters for a segment of speech. The idea is to find a k s so that Equation 2 provides the closest approximation to the speech samples. This means that s ( n ) is closest to s ( n ) for all values of n in the segment. The spectral shape of s ( n ) is assumed to be stationary across the frame, or a short segment of speech. The error, e , between the predicted value and the actual value is e ( n ) = s ( n ) s ( n Equation 3 The summed squared error, E , over a finite window of length N is E = e 2 (n) Equation 4 n where 0 n N + p 1 . The minimum value of E occurs when the derivative is zero with respect to each of the parameters a k . By setting the partial derivatives of E , a set of p equations are obtained. The matrix form of these equations is Implementing the Levinson-Durbin Algorithm on the StarCoreTM SC140/SC1400 SC140/SC1400 Cores, Rev. 1 4 Freescale Semiconductor Background Theory a1 r(0) r(1 ) r(p 1) r(1 a2 r(1) r(0 ) r(p 2) × = r(2 . . . . . r(p 1 ) r(p 2) r(0) r(p ap Equation 5 where r ( i ) is the autocorrelation of lag i computed as N1i r(i) = s(m ) s(m + i ) Equation 6 m=0 and N is the length of the speech segment s ( n ) . The Levinson-Durbin algorithm solves the nth order system of linear equations R a = b Equation 7 for the particular case where R is a Hermitian, positive-definite, Toeplitz matrix and b is identical to the first column of R shifted by one element. The autocorrelation coefficients r ( k ) are used to compute the LP filter coefficients a i , i = 1. , by solving the p set of equations p ai r( i k ) = r(k) Equation 8 i=1 where k = 1. . p This set of equations is solved using the Levinson-Durbin recursion, Equation 9 through Equation 13. E(0) = r(0) Equation 9 i1 r(i) aj i1 r(i j) j=1 k i = -E(i 1) ai aj (i) = aj (i) i1 = ki ki ai j 2 Equation 10 Equation 11 (i 1 E ( i ) = ( 1 ki ) E ( i 1 Equation 12 Equation 13 where 1 j i 1 and 1 i p . The parameters k i are known as the reflection parameters. If the condition k i 1 where 1 i p is satisfied, the roots of the polynomial predictor all lie within the unit circle in the z-plane, and the all-pole filter is stable. Implementing the Levinson-Durbin Algorithm on the StarCoreTM SC140/SC1400 SC140/SC1400 Cores, Rev. 1 Freescale Semiconductor 5 Levinson-Durbin Implementation on the SC140 SC140 Core 3 Levinson-Durbin Implementation on the SC140 SC140 Core This section presents the implementation of Levinson-Durbin algorithm for StarCore processors following the ITU-T G.729A Recommendation. The optimization techniques to increase execution speed are also described. Two implementations were made, one using C language and other using assembly language. The assembly implementation is based on the C implementation, but special techniques are applied to increase the execution speed based on StarCore processor architecture. The source code for the C implementation is listed in Appendix A, and the code for the assembly implementation is listed in Appendix B. In G.729A, linear prediction analysis is performed once per speech frame using the autocorrelation method with a 30 ms asymmetric window. The autocorrelation coefficients of windowed speech are computed and converted to LP coefficients using the Levinson algorithm every 80 samples (10ms). The LP coefficients are then transformed to line spectrum pairs for quantization and interpolation. The interpolated quantized and unquantized filters are converted back to the LP filter coefficients to construct the synthesis and weighting filters for each subframe. Both the C and assembly implementations resolve Equation 7 on page 5 using the Levinson-Durbin recursion by implementing the algorithm described in Equation 9 through Equation 13. The algorithm is summarized in the pseudo-code shown in Code Listing 1. Code Listing 1. Pseudo-Code of Levinson-Durbin Recursion /* input data */ R[i] - autocorrelation coefficients A[i] - filter coefficients K - reflection coefficients Alpha - prediction gain /* initialization */ A[0] = 1 K = -R[1]/R[0] A[1] = K Alpha = R[0]* (1-K^2) (1) For i = 2 To M S = SUM(R[j]*A[i-j];j=1,i-1) + R[i] (2) K = -S/Alpha An[j] = A[j] + K*A[i-j] /*for j=1 to i-1 where An[i] = new A[i]*/ (3) An[i] = K Alpha = Alpha * (1-K^2) End If the filter proves to be unstable when the algorithm executes, the filter coefficients from the previous execution are stored as the new values and the execution terminates. Because the algorithm contains several data dependencies, implementation on multiple execution units is difficult. However, the optimization techniques described in the following sections demonstrate how to take the most advantage of the StarCore architecture. 3.1 C Implementation This section presents the optimization techniques applied to the original ITU-T G.729A Recommendation code of the Levinson-Durbin algorithm using the Metrowerks® CodeWarrior® for StarCore, Release 1.0. The C-optimized implementation is listed in Appendix A. The original C implementation of the Levinson-Durbin algorithm from the ITU-T G.729A Recommendation follows the pseudo-code in Code Listing 1. Operands used in several functions are represented in double precision format (DPF). For details on these functions and the way they are implemented, see Appendix C. Implementing the Levinson-Durbin Algorithm on the StarCoreTM SC140/SC1400 SC140/SC1400 Cores, Rev. 1 6 Freescale Semiconductor Levinson-Durbin Implementation on the SC140 SC140 Core The 32-bit DPF format is designed for 16-bit processors that do not support 32-bit operations. Thus, although StarCore processors are 16-bit processors that support 32-bit operations, the 32-bit operations must be implemented in DPF format to maintain bit-exactness with the original ITU-T implementation. The two 16-bit portions, originally processed in two separate DALU registers, are combined into a single 32-bit value using only one DALU register. Thus, functions that originally received two pointers to two 16-bit arrays can now operate with one pointer to a 32-bit array. This optimization step also reduces the required number of memory moves. The least significant bit of partial or final computation results must be reset to maintain bit-exactness with G.729A. The DPF representations of the algorithm input parameters (for example., the R[ ] vector of autocorrelation coefficients and the A[ ] vector of LPC coefficients) are changed. Therefore, the L_Comp() function, which composes a DPF 32-bit integer from two 16-bit integers, and L_Extract(), which extracts two 16-bit integers from a DPF 32-bit integer, are no longer used. The code for the intrinsic function div_s() is inlined in the Div_32() function to eliminate the extra cycles required to enter and exit the subroutine and implement a hardware loop. The compiler does not inline the code for div_s(). The inlined div_s() intrinsic consists of a loop that executes div_iter() intrinsic function sixteen times. Code Listing 2 shows the source code for Div_32() with an inlined div_s() function. Because the compiler does not inline the code for intrinsic function div_s(), the code for div_s() was inlined in the Div_32() function to eliminate the extra cycles required to enter and exit the subroutine and to allow the loop that contains div_s() to be transformed in a hardware loop. The inlined div_s() intrinsic consists of a loop that executes the div_iter() intrinsic function sixteen times. Code Listing 2 shows the source code for Div_32() with an inlined div_s() function. Code Listing 2. Div_32() with Inlined div_s() static Word32 Div_32(Word32 L_num, Word32 denom) { #pragma inline Word16 approx; Word32 L_32; #ifdef _SCC_METROWERKS_ { int i; clearSRbit(1); L_32 = 0x3fff0000; for ( i = 0; i < 16; i+ ){ L_32 = div_iter(L_32, extract_h(denom); } approx = extract_l(L_32); } #else approx = div_s(Word16)0x3fff, extract_h(denom); #endif L_32 = Mpy_32_16(denom, approx); L_32 = L_sub(Word32)0x7ffffffeL, L_32); L_32 = Mpy_32_16(L_32, approx); /* L_num * (1/L_denom) */ L_32 = Mpy_32(L_num, L_32); return L_shl(L_32, 2); /* result in Q30 */ /* result in Q30 */ /* 1/L_denom in Q29 */ /* result in Q29 /* From Q29 to Q31 */ */ } Implementing the Levinson-Durbin Algorithm on the StarCoreTM SC140/SC1400 SC140/SC1400 Cores, Rev. 1 Freescale Semiconductor 7 Levinson-Durbin Implementation on the SC140 SC140 Core The original implementation uses a vector to store new values of the reflection coefficients. However, these values are no longer used in the G.729A vocoder, so the storing phase of the reflection coefficients is no longer done. As a result, the input vector parameter for reflection coefficients and the phase of storing first two values of reflection coefficients in case of unstable filter were eliminated from implementation, thus reducing the occupied size and the number of execution cycles. Referring to the pseudo-code in Code Listing 1, another optimization involves loop merging the summation part of (2) with the equation in (3), thus making better use of the registers (refer to the section of code marked `E1' in Appendix A). In addition, the first calculation of Alpha (the alp variable) is removed from the initialization phase of the algorithm and inserted into the outer loop. The use of DPF representation in the first inner loop, which copies the filter coefficients values used to recompute the backwards coefficients, results in half the number of cycles of the original implementation. When L_shl() or L_shr() is used and the shift offset is variable, the compiler is forced to insert a function call that checks the sign and magnitude of the offset, providing the proper offset handling and saturation. However, when alp and the reflection coefficient K are calculated, the obtained offset is part of a normalization process, so no overflow can occur. Thus, in these cases, the 32-bit fractional shift left intrinsic L_shl can be replaced with the 40-bit intrinsic X_shl(), which does not perform checking and is thus translated into a single SC140 SC140 instruction, asll(). 3.2 Assembly Implementation This section presents the optimization techniques used in the assembly implementation of the Levinson-Durbin algorithm to reduce code size and the number execution cycles. The assembly-optimized implementation is listed in Appendix B. The outer loop of the algorithm (FIRST_LOOP) contains two inner loops (LOOP_1 and LOOP_2) that are not executed in the first pass through the outer loop. This resulted in several jump and branch instructions in the assembly code generated from the C implementation. The assembly implementation of Levinson-Durbin algorithm optimizes the execution of loops in the optimized C implementation by reducing the number of jump and branch instructions, using registers more efficiently, minimizing the number of values stored on the stack (only the filter coefficients are stored), and reordering certain DPF operations. Using the assembly implementations of the DPF operations Mpy_32(), Mpy_32_16(), and Div_32() (Appendix C) enables them to be mixed with other independent operations, reducing the number of execution cycles. 3.2.1 Dedicated Registers The first optimization was to store values that are used throughout the algorithm in dedicated registers at initialization so that they do not have to be restored after various operations that use them. They are used during DPF operations and for storing the Alpha (alp) and Alpha Exponent (alp_exp) variables. 3.2.2 Software Pipelining A software pipelining technique reduces the number of execution sets in loops that use the Mpy_32() function. These loops include the outer loop (FIRST_LOOP), the second inner loop (LOOP_2), and the loop that stores the filter coefficients (LFINAL). Refer to the areas marked E1, E2 and E3 in Appendix B. In this technique, · the first Mpy_32() operation is performed before the loop · the succeeding Mpy_32() operations are performed at the end of the loop in parallel with storing the results of the previous Mpy_32() operation Implementing the Levinson-Durbin Algorithm on the StarCoreTM SC140/SC1400 SC140/SC1400 Cores, Rev. 1 8 Freescale Semiconductor Levinson-Durbin Implementation on the SC140 SC140 Core This technique is illustrated in Code Listing 3. Code Listing 3. Software Pipelining move.l (r0)+,d0move.l (r1)+,d1 mpyus d0,d1,d2 mpysu d0,d1,d3 FALIGN LOOPSTART3 Label dmacss d0,d1,d2asrw d3,d3 and d11,d2 and d11,d3 add d2,d3,d4 move.l (r0)+,d0move.l (r1)+,d1 [ mpyus d0,d1,d2mpysu d0,d1,d3 moves.f d4,(r2)+ ] LOOPEND3 In addition, the pipelining technique provides execution sets that can be used to fulfill the delay required after a break or skipls instruction. 3.2.2.1 The break Instruction When the filter is determined to be unstable, the outer loop must be terminated and the previous filter coefficients maintained. This is done with a break instruction that jumps to a loop at the end of the program that copies the filter coefficients. (The loop counter is cleared as well.) The break instruction must be placed at least three execution sets before the last execution set in the loop. In the original code, there is no gap between break and the end of the loop; in the optimized code, the required additional execution sets are provided by the software pipeline technique, as illustrated in Code Example 4. Code Listing 4. Using the break Instruction With Software Pipelining FALIGN LOOPSTART2 FIRST_LOOP . [ mpysu d0,d1,d3mpyus d0,d1,d14 cmpgt d4,d5 ] [ dmacss d0,d1,d3asrw d14,d14 inc d8 ] [ ift break L_LOOP ] [ and d11,d3 and d11,d14 tfr d9,d0 ] add d3,d14,d3 sub d3,d12,d1 LOOPEND2 ;Mpy_32(K,K) ;test unstable filter ;increment loop counter ;unstable filter ;alp=d9=d0 ;t0=d3=Mpy_32(K,K) ;L_sub(0x7ffffffe,t0) Implementing the Levinson-Durbin Algorithm on the StarCoreTM SC140/SC1400 SC140/SC1400 Cores, Rev. 1 Freescale Semiconductor 9 Levinson-Durbin Implementation on the SC140 SC140 Core 3.2.2.2 The skipls Instruction The second inner loop (LOOP_2) cannot be performed during the first pass through the outer loop because the first computed reflection coefficient would be altered and the final results would be incorrect. The skipls instruction is used to bypass the second inner loop. This instruction tests the value of the loop counter and jumps past the loop if the loop counter is less than or equal to zero. The skipls instruction requires four execution cycles. The additional execution sets are provided by the software pipelining technique (refer to the area marked E6 in Appendix B). Although the first inner loop (LOOP_1) is executed the first time through the outer loop, it does not affect the final results because the values are rewritten in the next pass through the loop. The reason for executing LOOP_1 despite the fact that it is not needed is to avoid using the skipls instruction, which would have caused two cycles penalty per outer loop iteration. In two instances, a loop is reduced to a single cycle by reading the initial value for the loop before the loop is entered. This technique, illustrated in Code Example 5, is applied to LOOP_1 (refer to the section of code marked E5 Appendix B) and THIRD_LOOP, which is performed near the end of the routine if the filter is unstable. Code Listing 5. Single-Cycle Loop move.l (r0)+,d0 LOOPSTART3 LOOP_1 move.l d0,(r1)+move.l (r0)+,d0 LOOPEND3 3.2.3 Relocating a Calculation The calculation of the (i1)th order filter coefficient was moved before the first inner loop because its value depends only on the reflection coefficient (K), which has already been computed at the beginning of the outer loop (refer area marked E4 in Appendix B). The necessary instructions were inserted into the alp and alp_exp execution sets to reduce the of number execution sets in the code. 3.2.4 Combining Execution Sets The div_s intrinsic executes immediately after the end of the second inner loop. (Refer to the area marked E6 page 18 in Appendix B and the Div_32() function on page 21 in Appendix C.) This instruction cannot be placed earlier in the program because the bmclr instruction changes the SR register, and the modified SR required by the div instruction is available only after two additional cycles. Therefore, the execution sets of the Mpy_32() and Mpy_32_16() operations can be combined with the execution sets that compute the new reflection coefficient K. In addition, the first div operation executes before the loop for div_s, thus reducing the number of iterations of the loop. The result is more condensed code and fewer execution cycles. The code is shown in Code Listing 6. Code Listing 6. Mixing Mpy_32() and Mpy_32_16() Execution Sets bmclr #old_A[M] = A[M] = L_shl_nosat_r(L_shr_nosat(K,4),1); } } Implementing the Levinson-Durbin Algorithm on the StarCoreTM SC140/SC1400 SC140/SC1400 Cores, Rev. 1 Freescale Semiconductor 15 References Appendix B Assembly Implementation ;*; ;* COPYRIGHT 2000 - 2005 Freescale Semiconductor, Inc.*; ;*; ;* *; ;* FILE NAME: levinson.asm *; ;* TARGET PROCESSOR: StarCore SC140 SC140 *; ;*; INCLUDE 'constants.inc' SET Lvsn_AnSize MP1