AMD Athlon Processor x86 Code Optimization Guide
Publication 22007, Revision K, February 2002
© 2001, 2002 Advanced Micro Devices, Inc. All rights reserved.

The contents of this document are provided in connection with Advanced Micro Devices, Inc. ("AMD") products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD's Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.

AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.

Trademarks

AMD, the AMD Arrow logo, AMD Athlon, and combinations thereof, 3DNow!, AMD-751, and Super7 are trademarks, and AMD-K6 and AMD-K6-2 are registered trademarks of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation. MMX is a trademark and Pentium is a registered trademark of Intel Corporation. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Contents

List of Figures
List of Tables
Revision History

Chapter 1 Introduction
  About This Document
  AMD Athlon Processor Family
  AMD Athlon Processor Microarchitecture Summary

Chapter 2 Top Optimizations
  Optimization Star
  Group I Optimizations: Essential Optimizations
    Memory-Size and Alignment Issues
    Use the 3DNow! Prefetching Instructions
    Select DirectPath Over VectorPath Instructions
  Group II Optimizations: Secondary Optimizations
    Load-Execute Instruction Usage
    Take Advantage of Write Combining
    Optimizing Main Memory Performance for Large Arrays
    Use 3DNow! Instructions
    Recognize 3DNow! Professional Instructions
    Avoid Branches Dependent on Random Data
    Avoid Placing Code and Data in the Same 64-Byte Cache Line

Chapter 3 C Source-Level Optimizations
  Ensure Floating-Point Variables and Expressions Are of Type Float
  Use 32-Bit Data Types for Integer Code
  Consider the Sign of Integer Operands
  Use Array-Style Instead of Pointer-Style Code
  Completely Unroll Small Loops
  Avoid Unnecessary Store-to-Load Dependencies
  Always Match the Size of Stores and Loads
  Consider Expression Order in Compound Branch Conditions
  Switch Statement Usage
  Use Prototypes for All Functions
  Use Const Type Qualifier
  Generic Loop Hoisting
  Declare Local Functions as Static
  Dynamic Memory Allocation Consideration
  Introduce Explicit Parallelism into Code
  Explicitly Extract Common Subexpressions
  C Language Structure Component Considerations
  Sort Local Variables According to Base Type Size
  Accelerating Floating-Point Divides and Square Roots
  Fast Floating-Point-to-Integer Conversion
  Speeding Up Branches Based on Comparisons Between Floats
  Avoid Unnecessary Integer Division
  Copy Frequently Dereferenced Pointer Arguments to Local Variables
  Use Block Prefetch Optimizations

Chapter 4 Instruction Decoding Optimizations
  Overview
  Select DirectPath Over VectorPath Instructions
  Load-Execute Instruction Usage
    Use Load-Execute Integer Instructions
    Use Load-Execute Floating-Point Instructions with Floating-Point Operands
    Avoid Load-Execute Floating-Point Instructions with Integer Operands
  Use Read-Modify-Write Instructions Where Appropriate
  Align Branch Targets in Program Hot Spots
  Use 32-Bit LEA Rather than 16-Bit LEA Instruction
  Use Short Instruction Encodings
  Avoid Partial-Register Reads and Writes
  Use LEAVE Instruction for Function Epilogue Code
  Replace Certain SHLD Instructions with Alternative Code
  Use 8-Bit Sign-Extended Immediates
  Use 8-Bit Sign-Extended Displacements
  Code Padding Using Neutral Code Fillers
    Recommendations for AMD-K6 Family and AMD Athlon Processor Blended Code

Chapter 5 Cache and Memory Optimizations
  Memory Size and Alignment Issues
    Avoid Memory-Size Mismatches
    Align Data Where Possible
  Optimizing Main Memory Performance for Large Arrays
    Memory Copy Optimization
    Array Addition
    Summary
  Use the PREFETCH 3DNow! Instruction
    Determining Prefetch Distance
  Take Advantage of Write Combining
  Avoid Placing Code and Data in the Same 64-Byte Cache Line
  Multiprocessor Considerations
  Store-to-Load Forwarding Restrictions
    Store-to-Load Forwarding Pitfalls: True Dependencies
    Summary of Store-to-Load Forwarding Pitfalls to Avoid
  Stack Alignment Considerations
  Align TBYTE Variables on Quadword Aligned Addresses
  C Language Structure Component Considerations
  Sort Variables According to Base Type Size

Chapter 6 Branch Optimizations
  Avoid Branches Dependent on Random Data
    AMD Athlon Processor Specific Code
    Blended AMD-K6 and AMD Athlon Processor Code
  Always Pair CALL and RETURN
  Recursive Functions
  Replace Branches with Computation in 3DNow! Code
    Muxing Constructs
    Sample Code Translated into 3DNow! Code
  Avoid the Loop Instruction
  Avoid Far Control Transfer Instructions

Chapter 7 Scheduling Optimizations
  Schedule Instructions According to Their Latency
  Unrolling Loops
    Complete Loop Unrolling
    Partial Loop Unrolling
  Use Function Inlining
    Overview
    Always Inline Functions if Called from One Site
    Always Inline Functions with Fewer than 25 Machine Instructions
  Avoid Address Generation Interlocks
  Use MOVZX and MOVSX
  Minimize Pointer Arithmetic in Loops
  Push Memory Data Carefully

Chapter 8 Integer Optimizations
  Replace Divides with Multiplies
    Multiplication by Reciprocal (Division) Utility
    Unsigned Division by Multiplication of Constant
    Signed Division by Multiplication of Constant
  Consider Alternative Code When Multiplying by a Constant
  Use MMX Instructions for Integer-Only Work
  Repeated String Instruction Usage
    Latency of Repeated String Instructions
    Guidelines for Repeated String Instructions
  Use XOR Instruction to Clear Integer Registers
  Efficient 64-Bit Integer Arithmetic
  Efficient Implementation of Population Count Function
  Efficient Binary-to-ASCII Decimal Conversion
  Derivation of Multiplier Used for Integer Division by Constants
    Derivation of Algorithm, Multiplier, and Shift Factor for Unsigned Integer Division
    Derivation of Algorithm, Multiplier, and Shift Factor for Signed Integer Division

Chapter 9 Floating-Point Optimizations
  Ensure All FPU Data Is Aligned
  Use Multiplies Rather than Divides
  Use FFREEP Macro to Pop One Register from the FPU Stack
  Floating-Point Compare Instructions
  Use the FXCH Instruction Rather than FST/FLD Pairs
  Avoid Using Extended-Precision Data
  Minimize Floating-Point-to-Integer Conversions
  Check Argument Range of Trigonometric Instructions Efficiently
  Take Advantage of the FSINCOS Instruction

Chapter 10 3DNow! and MMX Optimizations
  Use 3DNow! Instructions
  Use FEMMS Instruction
  Use 3DNow! Instructions for Fast Division
    Optimized 14-Bit Precision Divide
    Optimized Full 24-Bit Precision Divide
    Pipelined Pair of 24-Bit Precision Divides
    Newton-Raphson Reciprocal
  Use 3DNow! Instructions for Fast Square Root and Reciprocal Square Root
    Optimized 15-Bit Precision Square Root
    Optimized 24-Bit Precision Square Root
    Newton-Raphson Reciprocal Square Root
  Use MMX PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel
  Use PMULHUW to Compute Upper Half of Unsigned Products
  3DNow! and MMX Intra-Operand Swapping
  Fast Conversion of Signed Words to Floating-Point
  Width of Memory Access Differs Between PUNPCKL* and PUNPCKH*
  Use MMX PXOR to Negate 3DNow! Data
  Use MMX PCMP Instead of 3DNow! PFCMP
  Use MMX Instructions for Block Copies and Block Fills
  Efficient 64-Bit Population Count Using MMX Instructions
  Use MMX PXOR to Clear All Bits in an MMX Register
  Use MMX PCMPEQD to Set All Bits in an MMX Register
  Use MMX PAND to Find Floating-Point Absolute Value in 3DNow! Code
  Integer Absolute Value Computation Using MMX Instructions
  Optimized Matrix Multiplication
  Efficient 3D-Clipping Code Computation Using 3DNow! Instructions
  Efficiently Determining Similarity Between RGBA Pixels
  Use 3DNow! PAVGUSB for MPEG-2 Motion Compensation
  Efficient Implementation of floor() Using 3DNow! Instructions
  Stream of Packed Unsigned Bytes
  Complex Number Arithmetic

Chapter 11 General x86 Optimization Guidelines
  Short Forms
  Dependencies
  Register Operands
  Stack Allocation

Appendix A AMD Athlon Processor Microarchitecture
  Introduction
  AMD Athlon Processor Microarchitecture
  Superscalar Processor
  Instruction Cache
  Predecode
  Branch Prediction
  Early Decoding
  Instruction Control Unit
  Data Cache
  Integer Scheduler
  Integer Execution Unit
  Floating-Point Scheduler
  Floating-Point Execution Unit
  Load-Store Unit (LSU)
  L2 Cache
  Write Combining
  AMD Athlon System Bus

Appendix B Pipeline and Execution Unit Resources Overview
  Fetch and Decode Pipeline Stages
  Integer Pipeline Stages
  Floating-Point Pipeline Stages
  Execution Unit Resources
    Terminology
    Integer Pipeline Operations
    Floating-Point Pipeline Operations
    Load/Store Pipeline Operations
    Code Sample Analysis

Appendix C Implementation of Write Combining
  Introduction
  Write-Combining Definitions and Abbreviations
  What Is Write Combining?
  Programming Details
  Write-Combining Operations
  Sending Write-Buffer Data to the System

Appendix D Performance-Monitoring Counters
  Overview
  Performance Counter Usage
  PerfEvtSel[3:0] MSRs (MSR Addresses C001_0000h-C001_0003h)
  PerfCtr[3:0] MSRs (MSR Addresses C001_0004h-C001_0007h)

Appendix E Programming the MTRR and PAT
  Introduction
  Memory Type Range Register (MTRR) Mechanism
  Page Attribute Table (PAT)

Appendix F Instruction Dispatch and Execution Resources/Timing

Index

List of Figures

Figure 1. AMD Athlon Processor Block Diagram
Figure 2. Integer Execution Pipeline
Figure 3. Floating-Point Unit Block Diagram
Figure 4. Load/Store Unit
Figure 5. Fetch/Scan/Align/Decode Pipeline Hardware
Figure 6. Fetch/Scan/Align/Decode Pipeline Stages
Figure 7. Integer Execution Pipeline
Figure 8. Integer Pipeline Stages
Figure 9. Floating-Point Unit Block Diagram
Figure 10. Floating-Point Pipeline Stages
Figure 11. PerfEvtSel[3:0] Registers
Figure 12. MTRR Mapping of Physical Memory
Figure 13. MTRR Capability Register Format
Figure 14. MTRR Default Type Register Format
Figure 15. Page Attribute Table (MSR 277h)
Figure 16. MTRRphysBasen Register Format
Figure 17. MTRRphysMaskn Register Format

List of Tables

Table 1. Latency of Repeated String Instructions
Table 2. Integer Pipeline Operation Types
Table 3. Integer Decode Types
Table 4. Floating-Point Pipeline Operation Types
Table 5. Floating-Point Decode Types
Table 6. Load/Store Unit Stages
Table 7. Sample 1: Integer Register Operations
Table 8. Sample 2: Integer Register and Memory Load Operations
Table 9. Write Combining Completion Events
Table 10. AMD Athlon System Bus Command Generation Rules
Table 11. Performance-Monitoring Counters
Table 12. Memory Type Encodings
Table 13. Standard MTRR Types and Properties
Table 14. PATi 3-Bit Encodings
Table 15. Effective Memory Type Based on PAT and MTRRs
Table 16. Final Output Memory Types
Table 17. MTRR Fixed Range Register Format
Table 18. MTRR-Related Model-Specific Register (MSR) Map
Table 19. Integer Instructions
Table 20. MMX Instructions
Table 21. MMX Extensions
Table 22. Floating-Point Instructions
Table 23. 3DNow! Instructions
Table 24. 3DNow! Extensions
Table 25. Instructions Introduced with 3DNow! Professional
Revision History

Feb. 2002
- Corrected code sequences labeled "Consider Alternative Code When Multiplying by a Constant."
- Removed outdated references to the removed appendix.
- Removed blank pages in the chapter "3DNow! and MMX Optimizations."
- Replaced memcpy example for arrays in "AMD Athlon Processor-Specific Code."
- Revised "PerfCtr[3:0] MSRs (MSR Addresses C001_0004h-C001_0007h)."
- Added the table "Instructions Introduced with 3DNow! Professional."

July 2001
- Updated wording regarding the L2 cache in "L2 Cache."
- Added block copy/prefetch to "Optimizing Main Memory Performance for Large Arrays."
- Removed an appendix.

Sept. 2000
- Corrected the example under "Muxing Constructs."

June 2000
- Added the appendix "Performance-Monitoring Counters."

April 2000
- Added more details to the optimizations in the chapter "Top Optimizations."
- Further clarified the information in "Use Array-Style Instead of Pointer-Style Code."
- Added the optimization "Always Match the Size of Stores and Loads."
- Added the optimization "Fast Floating-Point-to-Integer Conversion."
- Added the optimization "Speeding Up Branches Based on Comparisons Between Floats."
- Added the optimization "Use Read-Modify-Write Instructions Where Appropriate."
- Further clarified the information in "Align Branch Targets in Program Hot Spots."
- Added the optimization "Use 32-Bit LEA Rather than 16-Bit LEA Instruction."
- Added the optimization "Use LEAVE Instruction for Function Epilogue Code."
- Added more examples to "Memory Size and Alignment Issues."
- Further clarified the information in "Use the PREFETCH 3DNow! Instruction."
- Further clarified the information in "Store-to-Load Forwarding Restrictions."
- Changed the epilogue code in the example in "Stack Alignment Considerations."
- Added an example to "Avoid Branches Dependent on Random Data."
- Fixed comments in the examples in "Unsigned Division by Multiplication of Constant."
- Revised code in the two "Algorithm: Divisors < 2^31" listings.
- Added more examples to "Efficient 64-Bit Integer Arithmetic."
- Fixed a typo in the integer example and added an MMX version in "Efficient Implementation of Population Count Function."
- Added the optimization "Efficient Binary-to-ASCII Decimal Conversion."
- Updated code in "Derivation of Multiplier Used for Integer Division by Constants" and the Software Development Kit (SDK).
- Further clarified the information in "Use FFREEP Macro to Pop One Register from the FPU Stack."
- Corrected the example in "Minimize Floating-Point-to-Integer Conversions."
- Added the optimization "Use PMULHUW to Compute Upper Half of Unsigned Products."
- Added "Width of Memory Access Differs Between PUNPCKL* and PUNPCKH*."
- Rewrote "Use MMX Instructions for Block Copies and Block Fills."
- Added the optimization "Integer Absolute Value Computation Using MMX Instructions."
- Added the optimization "Efficient 64-Bit Population Count Using MMX Instructions."
- Added the optimization "Efficiently Determining Similarity Between RGBA Pixels."
- Added the optimization "Efficient Implementation of floor() Using 3DNow! Instructions."
- Corrected the instruction mnemonics for AAM, AAD, BOUND, FDIVP, FMULP, FSUBP, DIV, IDIV, IMUL, MUL, and TEST in "Instruction Dispatch and Execution Resources/Timing" and in "DirectPath versus VectorPath Instructions."

Nov. 1999
- Added "About This Document."
- Further clarified the information in "Consider the Sign of Integer Operands."
- Added the optimization "Use Array-Style Instead of Pointer-Style Code."
- Added the optimization "Accelerating Floating-Point Divides and Square Roots."
- Clarified the examples in "Copy Frequently Dereferenced Pointer Arguments to Local Variables."
- Further clarified the information in "Select DirectPath Over VectorPath Instructions."
- Further clarified the information in "Align Branch Targets in Program Hot Spots."
- Further clarified the use of the instruction filler in "Code Padding Using Neutral Code Fillers."
- Further clarified the information in "Use the PREFETCH 3DNow! Instruction."
- Modified the examples in "Unsigned Division by Multiplication of Constant."
- Added the optimization "Efficient Implementation of Population Count Function."
- Further clarified the information in "Use FFREEP Macro to Pop One Register from the FPU Stack."
- Further clarified the information in "Minimize Floating-Point-to-Integer Conversions."
- Added the optimization "Check Argument Range of Trigonometric Instructions Efficiently."
- Added the optimization "Take Advantage of the FSINCOS Instruction."
- Further clarified the information in "Use 3DNow! Instructions for Fast Division."
- Further clarified the information in "Use FEMMS Instruction."
- Further clarified the information in "Use 3DNow! Instructions for Fast Square Root and Reciprocal Square Root."
- Clarified "3DNow! and MMX Intra-Operand Swapping."
- Corrected the PCMPGT information in "Use MMX PCMP Instead of 3DNow! PFCMP."
- Added the optimization "Use MMX Instructions for Block Copies and Block Fills."
- Modified the rule for "Use MMX PXOR to Clear All Bits in an MMX Register."
- Modified the rule for "Use MMX PCMPEQD to Set All Bits in an MMX Register."
- Added the optimization "Optimized Matrix Multiplication."
- Added the optimization "Efficient 3D-Clipping Code Computation Using 3DNow! Instructions."
- Added the optimization "Complex Number Arithmetic."

Oct. 1999
- Added the appendix "Programming the MTRR and PAT."
- Rearranged the appendixes.
- Added the index.

Chapter 1 Introduction

The AMD Athlon processor is the newest microprocessor in the AMD family of microprocessors.
The advances in the AMD Athlon processor take superscalar operation and out-of-order execution to a new level. The AMD Athlon processor has been designed to efficiently execute code written for previous-generation x86 processors. However, to enable the fastest code execution with the AMD Athlon processor, programmers should write software that includes specific code optimization techniques.

About This Document

This document contains information to assist programmers in creating optimized code for the AMD Athlon processor. In addition to compiler and assembler designers, this document has been targeted to C and assembly-language programmers writing execution-sensitive code sequences.

This document assumes that the reader possesses in-depth knowledge of the x86 instruction set, the x86 architecture (registers and programming modes), and the IBM PC-AT platform.

This guide has been written specifically for the AMD Athlon processor, but it includes considerations for previous-generation processors and describes how those optimizations are applicable to the AMD Athlon processor. This guide covers the following topics:

Chapter 1, Introduction: Outlines the material covered in this document. Summarizes the AMD Athlon microarchitecture.
Chapter 2, Top Optimizations: Provides convenient descriptions of the most important optimizations a programmer should take into consideration.
Chapter 3, C Source-Level Optimizations: Describes optimizations that C/C++ programmers can implement.
Chapter 4, Instruction Decoding Optimizations: Describes methods that will make the most efficient use of the three sophisticated instruction decoders in the AMD Athlon processor.
Chapter 5, Cache and Memory Optimizations: Describes optimizations that make efficient use of the large L1 caches and high-bandwidth buses of the AMD Athlon processor.
Chapter 6, Branch Optimizations: Describes optimizations that improve branch prediction and minimize branch penalties.
Chapter 7, Scheduling Optimizations: Describes optimizations that improve code scheduling for efficient execution resource utilization.
Chapter 8, Integer Optimizations: Describes optimizations that improve integer arithmetic and make efficient use of the integer execution units in the AMD Athlon processor.
Chapter 9, Floating-Point Optimizations: Describes optimizations that make maximum use of the superscalar and pipelined floating-point unit (FPU) of the AMD Athlon processor.
Chapter 10, 3DNow! and MMX Optimizations: Describes code optimization guidelines for 3DNow!, MMX, and Enhanced 3DNow!/MMX.
Chapter 11, General x86 Optimization Guidelines: Lists generic optimization techniques applicable to x86 processors.
Appendix A, AMD Athlon Processor Microarchitecture: Describes in detail the microarchitecture of the AMD Athlon processor.
Appendix B, Pipeline and Execution Unit Resources Overview: Describes in detail the execution units and their relation to the instruction pipeline.
Appendix C, Implementation of Write Combining: Describes the algorithm used by the AMD Athlon processor to write-combine.
Appendix D, Performance-Monitoring Counters: Describes the usage of the performance counters available in the AMD Athlon processor.
Appendix E, Programming the MTRR and PAT: Describes the steps needed to program the Memory Type Range Registers and the Page Attribute Table.
Appendix F, Instruction Dispatch and Execution Resources/Timing: Lists each instruction's execution resource usage and latency.

AMD Athlon Processor Family

The AMD Athlon processor family uses state-of-the-art decoupled decode/execution design techniques to deliver next-generation performance with x86 binary software compatibility. This next-generation processor family advances x86 code execution by using flexible instruction predecoding, wide and balanced decoders, aggressive out-of-order execution, parallel integer execution pipelines, parallel floating-point execution pipelines, deep pipelined execution for higher delivered operating frequency, dedicated cache memory, and a new high-performance double-rate 64-bit local bus. As an x86 binary-compatible processor, the AMD Athlon processor implements the industry-standard x86 instruction set by decoding and executing the x86 instructions using a proprietary microarchitecture. This microarchitecture allows the delivery of maximum performance when running x86-based PC software.

AMD Athlon Processor Microarchitecture Summary

The AMD Athlon processor brings superscalar performance and high operating frequencies to PC systems running industry-standard x86 software. A brief summary of the next-generation design features implemented in the AMD Athlon processor is as follows:

- High-speed double-rate local-bus interface
- Large, split 128-Kbyte level-one (L1) cache
- External level-two (L2) cache on some models
- On-die L2 cache on other models
- Dedicated level-two (L2) cache
- Instruction predecode and branch detection during cache-line fills
- Decoupled decode/execution core
- Three-way x86 instruction decoding
- Dynamic scheduling and speculative execution
- Three-way integer execution
- Three-way address generation
- Three-way floating-point execution
- 3DNow! technology and MMX single-instruction multiple-data (SIMD) instruction extensions
- Super data forwarding
- Deep out-of-order integer and floating-point execution
- Register renaming
- Dynamic branch prediction

The AMD Athlon processor communicates through a next-generation high-speed local bus that is beyond the current Socket 7 or Super7 bus standard. The local bus can transfer data at twice the rate of the bus operating frequency by using both the rising and falling edges of the clock (see "AMD Athlon System Bus" in Appendix A for more information).

To reduce on-chip cache-miss penalties and to avoid subsequent data-load or instruction-fetch stalls, the AMD Athlon processor has a dedicated high-speed L2 cache. The large 128-Kbyte L1 on-chip cache and the L2 cache allow the AMD Athlon execution core to achieve and sustain maximum performance.

As a decoupled decode/execution processor, the AMD Athlon processor makes use of a proprietary microarchitecture, which defines the heart of the AMD Athlon processor. With the inclusion of all these features, the AMD Athlon processor is capable of decoding, issuing, executing, and retiring multiple x86 instructions per cycle, resulting in superior scalable performance.

The AMD Athlon processor includes both the industry-standard MMX SIMD integer instructions and the 3DNow! SIMD floating-point instructions that were first introduced in the AMD-K6-2 processor. The design of 3DNow! technology is based on suggestions from leading graphics vendors and independent software vendors (ISVs). Using SIMD format, the AMD Athlon processor can generate up to four 32-bit, single-precision floating-point results per clock cycle.

The 3DNow! execution units allow for high-performance floating-point vector operations, which can replace x87 instructions and enhance the performance of 3D graphics and other floating-point-intensive applications. Because the 3DNow! architecture uses the same registers as the MMX instructions, switching between MMX and 3DNow! has no penalty.

The AMD Athlon processor designers took another innovative step by carefully integrating the traditional x87 floating-point, MMX, and 3DNow! execution units into one operational engine. With the introduction of the AMD Athlon processor, the switching overhead between x87, MMX, and 3DNow! technology is virtually eliminated. The AMD Athlon processor combined with 3DNow! technology brings a better multimedia experience to mainstream PC users while maintaining backward compatibility with all existing x86 software.

Although the AMD Athlon processor can extract code parallelism on-the-fly from off-the-shelf, commercially available x86 software, specific code optimization for the AMD Athlon processor can result in even higher delivered performance. This document describes the proprietary microarchitecture in the AMD Athlon processor and makes recommendations for optimizing execution of x86 software on the processor.

The coding techniques for achieving peak performance on the AMD Athlon processor include, but are not limited to, those for the AMD-K6, AMD-K6-2, Pentium, Pentium Pro, and Pentium II processors. However, many of these optimizations are not necessary for the AMD Athlon processor to achieve maximum performance. Due to the more flexible pipeline control and aggressive out-of-order execution, the AMD Athlon processor is not as sensitive to instruction selection and code scheduling. This flexibility is one of the distinct advantages of the AMD Athlon processor.

The AMD Athlon processor uses the latest in processor microarchitecture design techniques to provide the highest x86 performance for today's PC. In short, the AMD Athlon processor offers true next-generation performance with x86 binary software compatibility.

Chapter 2 Top Optimizations

This chapter contains descriptions of the best optimizations for improving the performance of the AMD Athlon processor. Subsequent chapters contain more detailed descriptions of these and other optimizations. The optimizations in this chapter are divided into two groups and listed in order of importance.

Group I: Essential Optimizations

Group I contains essential optimizations. Users should follow these critical guidelines closely. The optimizations in Group I are as follows:

- Memory Size and Alignment Issues
  - Avoid memory size mismatches
  - Align data where possible
- Use the PREFETCH 3DNow! Instruction
- Select DirectPath Over VectorPath Instructions

Group II: Secondary Optimizations

Group II contains secondary optimizations that can significantly improve the performance of the AMD Athlon processor. The optimizations in Group II are as follows:

- Load-Execute Instruction Usage
  - Use Load-Execute instructions
  - Avoid load-execute floating-point instructions with integer operands
- Take Advantage of Write Combining
- Optimization of Array Operations With Block Prefetching
- Use 3DNow! Instructions
- Recognize 3DNow! Professional Instructions
- Avoid Branches Dependent on Random Data
- Avoid Placing Code and Data in the Same 64-Byte Cache Line

Optimization Star

The top optimizations described in this chapter are flagged with a star. In addition, the star appears beside the more detailed descriptions found in subsequent chapters.

Group I Optimizations: Essential Optimizations

Memory-Size and Alignment Issues

Avoid Memory-Size Mismatches

Avoid memory-size mismatches when different instructions operate on the same data. When one instruction stores and another instruction reloads the same data, keep their operands aligned and keep the loads/stores of each operand the same size.
The following code examples result in a store-to-load-forwarding (STLF) stall:

Example 1 (Avoid):

    MOV   DWORD PTR [FOO], EAX      ; store lower half
    MOV   DWORD PTR [FOO+4], EDX    ; store upper half
    MOVQ  MM0, QWORD PTR [FOO]      ; 64-bit load of two 32-bit stores stalls

Avoid large-to-small mismatches, as shown in the following code:

Example 2 (Avoid):

    MOVQ  QWORD PTR [FOO], MM0      ; 64-bit store
    MOV   EAX, DWORD PTR [FOO]      ; 32-bit load of the lower half
    MOV   EDX, DWORD PTR [FOO+4]    ; 32-bit load of the upper half

Align Data Where Possible

Avoid misaligned data references. All data whose size is a power of two is considered aligned if it is naturally aligned. For example:

- Word accesses are aligned if they access an address divisible by two.
- Doubleword accesses are aligned if they access an address divisible by four.
- Quadword accesses are aligned if they access an address divisible by eight.
- TBYTE accesses are aligned if they access an address divisible by eight.

A misaligned store or load operation suffers a minimum one-cycle penalty in the AMD Athlon processor load/store pipeline. In addition, using misaligned loads and stores increases the likelihood of encountering a store-to-load forwarding pitfall. For a more detailed discussion of store-to-load forwarding issues, see "Store-to-Load Forwarding Restrictions."

Use the 3DNow! Prefetching Instructions

For code that can take advantage of prefetching, use the 3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor, thereby significantly improving performance. The prefetch instructions are essentially integer instructions and can be used anywhere, in any type of code (for example, integer, x87, 3DNow!, and MMX). Use the following formula to determine the prefetch distance:

    Prefetch Distance = 200 (DS/C) bytes

- Round up to the nearest cache line.
- DS is the data stride in bytes per loop iteration.
- C is the number of cycles per loop iteration when hitting in the L1 cache.

See "Use the PREFETCH 3DNow! Instruction" for more details.

Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized to decode and execute efficiently by minimizing the number of operations per x86 instruction, which includes "register = register op memory" as well as "register = register op register" forms of instructions. Up to three DirectPath instructions can be decoded per cycle.
VectorPath instructions block the decoding of DirectPath instructions.

The AMD Athlon processor implements the majority of instructions used by a compiler as DirectPath instructions. Nevertheless, the usage of DirectPath versus VectorPath instructions must still be taken into consideration. See Appendix F, "Instruction Dispatch and Execution Resources/Timing," for tables of DirectPath and VectorPath instructions.

Group II Optimizations: Secondary Optimizations

Load-Execute Instruction Usage

Use Load-Execute Instructions

Most load-execute integer instructions are DirectPath decodable and can be decoded at the rate of three per cycle. Splitting a load-execute integer instruction into two separate instructions (a load instruction and a "reg, reg" instruction) reduces decoding bandwidth and increases register pressure, which results in lower performance. Use the split-instruction form to avoid scheduler stalls for longer executing instructions and to explicitly schedule the load and execute operations.

Use Load-Execute Floating-Point Instructions with Floating-Point Operands

When operating on single-precision or double-precision floating-point data, wherever possible use floating-point load-execute instructions to increase code density.

Note: This optimization applies only to floating-point instructions with floating-point operands and not to those with integer operands, as described in the next section.

This coding style helps in two ways. First, denser code allows more work to be held in the instruction cache. Second, denser code generates fewer internal MacroOPs, allowing the FPU scheduler to hold more work, which increases the chances of extracting parallelism from the code.

Example 1 (Avoid):

    FLD   QWORD PTR [TEST1]
    FLD   QWORD PTR [TEST2]
    FMUL  ST, ST(1)

Example 1 (Preferred):

    FLD   QWORD PTR [TEST1]
    FMUL  QWORD PTR [TEST2]

Avoid Load-Execute Floating-Point Instructions with Integer Operands

Do not use load-execute floating-point instructions with integer operands: FIADD, FISUB, FISUBR, FIMUL, FIDIV, FIDIVR, FICOM, and FICOMP. Remember that floating-point instructions can have integer operands, while integer instructions cannot have floating-point operands.

Use separate FILD and arithmetic instructions for floating-point computations involving integer-memory operands.
This optimization has the potential to increase decode bandwidth and OP density in the FPU scheduler. The floating-point load-execute instructions with integer operands are VectorPath and generate two OPs in a cycle, while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle. In some situations, this optimization can also reduce execution time if the FILD can be scheduled several instructions ahead of the arithmetic instruction in order to cover the FILD latency.

Example 2 (Avoid):

    FLD   QWORD PTR [foo]
    FIMUL DWORD PTR [bar]
    FIADD DWORD PTR [baz]

Example 2 (Preferred):

    FILD  DWORD PTR [bar]
    FILD  DWORD PTR [baz]
    FLD   QWORD PTR [foo]
    FMULP ST(2), ST
    FADDP ST(1), ST

Take Advantage of Write Combining

This guideline applies only to operating-system, device-driver, and BIOS programmers. In order to improve system performance, the AMD Athlon processor aggressively combines multiple memory-write cycles of any data size that address locations within a 64-byte cache line aligned write buffer.

See Appendix C, "Implementation of Write Combining," for more details.

Optimizing Main Memory Performance for Large Arrays

Reading Large Arrays and Streams

To process a large array (200 Kbytes or more), or other large sequential data sets that are not already in cache, use block prefetch to achieve maximum performance. The block prefetch technique involves processing data in blocks. The data for each block is preloaded into the cache by reading just one address per cache line, causing each cache line to be filled with data from main memory. Filling cache lines in this manner, with a single read operation per line, allows the memory system to burst the data at the highest achievable read bandwidth. Once the input data is in cache, processing can then proceed at the maximum instruction execution rate, because no memory read accesses will slow down the processor.

Writing Large Arrays to Memory

If data needs to be written back to memory during processing, a similar technique can be used to accelerate the write phase. The processing loop writes data to a temporary in-cache buffer, to avoid memory-access cycles and to allow the processor to execute at the maximum instruction rate.
Once a complete data block has been processed, the results are copied from the in-cache buffer to main memory, using a loop that employs the very fast streaming store instruction, MOVNTQ. See "Optimizing Main Memory Performance for Large Arrays" for detailed optimization examples, where the block-prefetch method is used for simply copying memory, and also for adding two floating-point arrays through the floating-point unit. Also see the complete optimized memcpy routine in "Use MMX Instructions for Block Copies and Block Fills" on page 174. This example employs block prefetch for large memory blocks.

Use 3DNow! Instructions

When single precision is required, perform floating-point computations using 3DNow! instructions instead of x87 instructions. The SIMD nature of 3DNow! instructions achieves twice the number of FLOPs that are achieved through x87 instructions. 3DNow! instructions also provide a flat register file instead of the stack-based approach of x87 instructions. See the 3DNow! instruction table in this guide for a list of 3DNow! instructions. For information about instruction usage, see the 3DNow! Technology Manual, order# 21928.

Recognize 3DNow! Professional Instructions

AMD Athlon processors that include 3DNow! Professional instructions indicate the presence of Streaming SIMD Extensions (SSE) through the standard CPUID feature bit. See the corresponding table in this guide for a list of the additional instructions introduced with 3DNow! Professional technology. Where SSE optimizations already exist, or are planned for future development, the feature-detection code using CPUID should be checked to ensure correct, vendor-independent recognition of SSE on AMD processors. For a full description of feature detection on AMD processors, please refer to the AMD Processor Recognition Application Note, order# 20734.

Avoid Branches Dependent on Random Data

Avoid conditional branches depending on random data, as these are difficult to predict. For example, a piece of code receives a random stream of characters and branches if the character comes before a given character in the collating sequence. Data-dependent branches acting upon basically random data cause the branch-prediction logic to mispredict the branch about half of the time.
If possible, design branch-free alternative code sequences, which result in shorter average execution time. This technique is especially important if the branch body is small. See "Avoid Branches Dependent on Random Data" for more details.

Avoid Placing Code and Data in the Same 64-Byte Cache Line

Sharing code and data in the same 64-byte cache line may cause the caches to thrash (unnecessary castout of code/data) in order to maintain coherency between the separate instruction and data caches. The AMD Athlon processor has a cache-line size of 64 bytes, which is twice the size of previous processors. Avoid placing code and data together within this larger cache line, especially if the data becomes modified.

For example, a memory-indirect jump instruction may have the data for its jump table residing in the same 64-byte cache line as the instruction itself. This mixing of code and data in the same cache line would result in lower performance.

Although rare, do not place critical code at the border between 32-byte aligned code segments and data segments. Code at the start or end of a data segment should be executed as seldom as possible or simply padded with garbage.

In general, avoid the following:

    Self-modifying code
    Storing data in code segments

C Source-Level Optimizations

This chapter details C programming practices for optimizing code for the AMD Athlon processor. Guidelines are listed in order of importance.

Ensure Floating-Point Variables and Expressions Are of Type Float

For compilers that generate 3DNow! instructions, make sure that all floating-point variables and expressions are of type float. Pay special attention to floating-point constants. These require a suffix of "F" or "f" (for example: 3.14f) to be of type float, otherwise they default to type double. To avoid automatic promotion of float arguments to double, always use function prototypes for all functions that accept float arguments.
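The two points about float constants and prototypes can be illustrated with a minimal sketch. The function names scale_sample and constant_types_ok are hypothetical and do not come from this guide; the sketch only relies on standard C rules (an unsuffixed floating constant has type double, and a visible prototype prevents default promotion of float arguments).

```c
/* Prototype visible before any call: float arguments are passed as float
   instead of being promoted to double (scale_sample is a hypothetical name). */
float scale_sample(float value, float gain);

float scale_sample(float value, float gain)
{
    /* 0.5f has type float, so the whole expression stays in single precision.
       Writing 0.5 instead would promote the multiplication to double. */
    return value * gain * 0.5f;
}

/* Returns 1 when a suffixed constant has the size of float and an
   unsuffixed constant has the size of double (hypothetical helper). */
int constant_types_ok(void)
{
    return sizeof(3.14f) == sizeof(float) && sizeof(3.14) == sizeof(double);
}
```

Calling scale_sample(4.0f, 2.0f) evaluates 4.0f * 2.0f * 0.5f entirely in type float. Whether the compiler then maps the arithmetic to 3DNow! instructions depends on the compiler and its options.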
Use 32-Bit Data Types for Integer Code

Use 32-bit data types for integer code. Compiler implementations vary, but typically the following data types are included: int, signed, signed int, unsigned, unsigned int, long, signed long, long int, signed long int, unsigned long, and unsigned long int.

Consider the Sign of Integer Operands

In many cases, the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate. For example, to record the weight of a person in pounds, no negative numbers are required, so an unsigned type is appropriate. However, recording temperatures in degrees Celsius may require both positive and negative numbers, so a signed type is needed.

Where there is a choice of using either a signed or an unsigned type, take into consideration that certain operations are faster with unsigned types while others are faster with signed types.

Integer-to-floating-point conversion using integers larger than 16 bits is faster with signed types, as the x86 architecture provides instructions for converting signed integers to floating-point, but has no instructions for converting unsigned integers. In a typical case, a 32-bit integer is converted by a compiler to assembly as follows:

Example (Avoid):

    double x;            ====>   MOV   [temp+4], 0
    unsigned int i;              MOV   EAX, i
                                 MOV   [temp], EAX
    x = i;                       FILD  QWORD PTR [temp]
                                 FSTP  QWORD PTR [x]

The previous code is slow not only because of the number of instructions, but also because a size mismatch prevents store-to-load forwarding to the FILD instruction. Instead, use the following code:

Example (Preferred):

    double x;            ====>   FILD  DWORD PTR [i]
    int i;                       FSTP  QWORD PTR [x]
    x = i;

Computing quotients and remainders in integer division by constants is faster when performed on unsigned types.
The following is a typical case of compiler output for a 32-bit integer divided by four:

Example (Avoid):

    int i;               ====>   MOV  EAX, i
                                 CDQ
    i = i / 4;                   AND  EDX, 3
                                 ADD  EAX, EDX
                                 SAR  EAX, 2
                                 MOV  i, EAX

Example (Preferred):

    unsigned int i;      ====>   SHR  i, 2
    i = i / 4;

In summary:

Use unsigned types for:

    Division and remainders
    Loop counters
    Array indexing

Use signed types for:

    Integer-to-float conversion

Use Array-Style Instead of Pointer-Style Code

The use of pointers in C makes work difficult for the optimizers in C compilers. Without detailed and aggressive pointer analysis, the compiler has to assume that writes through a pointer can write to any place in memory. This includes storage allocated to other variables, creating the issue of aliasing, i.e., the same block of memory is accessible in more than one way.

To help the optimizer of the C compiler in its analysis, avoid the use of pointers where possible. One example where this is trivially possible is in the access of data organized as arrays. C allows the use of either the array operator [] or pointers to access the array. Using array-style code makes the task of the optimizer easier by reducing possible aliasing. For example, x[0] and x[2] cannot possibly refer to the same memory location, whereas two pointers could. The array style is recommended, as significant performance advantages can be achieved with most compilers.
Example (Avoid):

    typedef struct {
        float x,y,z,w;
    } VERTEX;

    typedef struct {
        float m[4][4];
    } MATRIX;

    void XForm (float *res, const float *v, const float *m, int numverts)
    {
        float dp;
        int i;
        const VERTEX* vv = (VERTEX *)v;

        for (i = 0; i < numverts; i++) {
            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;          /* write transformed x */

            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;          /* write transformed y */

            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;          /* write transformed z */

            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;          /* write transformed w */

            ++vv;                 /* next input vertex */
            m -= 16;              /* reset to start of transform matrix */
        }
    }

Example (Preferred):

    typedef struct {
        float x,y,z,w;
    } VERTEX;

    typedef struct {
        float m[4][4];
    } MATRIX;

    void XForm (float *res, const float *v, const float *m, int numverts)
    {
        int i;
        const VERTEX* vv = (VERTEX *)v;
        const MATRIX* mm = (MATRIX *)m;
        VERTEX* rr = (VERTEX *)res;

        for (i = 0; i < numverts; i++) {
            rr->x = vv->x*mm->m[0][0] + vv->y*mm->m[0][1] +
                    vv->z*mm->m[0][2] + vv->w*mm->m[0][3];
            rr->y = vv->x*mm->m[1][0] + vv->y*mm->m[1][1] +
                    vv->z*mm->m[1][2] + vv->w*mm->m[1][3];
            rr->z = vv->x*mm->m[2][0] + vv->y*mm->m[2][1] +
                    vv->z*mm->m[2][2] + vv->w*mm->m[2][3];
            rr->w = vv->x*mm->m[3][0] + vv->y*mm->m[3][1] +
                    vv->z*mm->m[3][2] + vv->w*mm->m[3][3];
            ++vv;                 /* next input vertex */
            ++rr;                 /* next output vertex */
        }
    }

Reality Check

Note that source code transformations interact with a compiler's code generator and that it is difficult to control the generated machine code from the source level. It is even possible that source code transformations for improving performance and compiler optimizations "fight" each other. Depending on the compiler and the specific source code, it is therefore possible that pointer-style code will be compiled into machine code that is faster than that generated from equivalent array-style code. It is advisable to check the performance after any source code transformation to see whether performance has really improved.

Completely Unroll Small Loops

Take advantage of the large 64-Kbyte instruction cache of the AMD Athlon processor and completely unroll small loops.
Unrolling loops can be beneficial to performance, especially if the loop body is small, which makes the loop overhead significant. Many compilers are not aggressive at unrolling loops. For loops that have a small fixed loop count and a small loop body, completely unroll the loops at the source level.

Example (Avoid):

    /* 3D-transform: multiply vector V by 4x4 transform matrix M */
    for (i=0; i<4; i++) {
        r[i] = 0;
        for (j=0; j<4; j++) {
            r[i] += M[j][i]*V[j];
        }
    }

Example (Preferred):

    /* 3D-transform: multiply vector V by 4x4 transform matrix M */
    r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
    r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
    r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
    r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];

Avoid Unnecessary Store-to-Load Dependencies

A store-to-load dependency exists when data is stored to memory, only to be read back shortly thereafter. See "Store-to-Load Forwarding Restrictions" for more details. The AMD Athlon processor contains hardware to accelerate such store-to-load dependencies, allowing the load to obtain the store data before it has been written to memory. However, it is still faster to avoid such dependencies altogether and keep the data in an internal register.

Avoiding store-to-load dependencies is especially important if they are part of a long dependency chain, as might occur in a recurrence computation. If the dependency occurs while operating on arrays, many compilers are unable to optimize the code in a way that avoids the store-to-load dependency. In some instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency. It is therefore recommended that the programmer remove the dependency manually, e.g., by introducing a temporary variable that can be kept in a register. This can result in a significant performance increase. The following is an example of this.
Example (Avoid):

    double x[VECLEN], y[VECLEN], z[VECLEN];
    unsigned int k;

    for (k = 1; k < VECLEN; k++) {
        x[k] = x[k-1] + y[k];
    }

    for (k = 1; k < VECLEN; k++) {
        x[k] = z[k] * (y[k] - x[k-1]);
    }

Example (Preferred):

    double x[VECLEN], y[VECLEN], z[VECLEN];
    unsigned int k;
    double t;

    t = x[0];
    for (k = 1; k < VECLEN; k++) {
        t = t + y[k];
        x[k] = t;
    }

    t = x[0];
    for (k = 1; k < VECLEN; k++) {
        t = z[k] * (y[k] - t);
        x[k] = t;
    }

Always Match the Size of Stores and Loads

The AMD Athlon processor contains a load/store buffer (LS) to speed up the forwarding of store data to dependent loads. However, this store-to-load forwarding (STLF) inside the LS occurs in general only when the addresses and sizes of the store and the dependent load match, and when both memory accesses are aligned (see section "Store-to-Load Forwarding Restrictions" for details).

It is impossible to control load and store activity at the source level so as to avoid all cases that violate the restrictions placed on store-to-load forwarding. In some instances it is possible to spot such cases in the source code. Size mismatches can easily occur when different-sized data items are joined in a union. Address mismatches could be the result of pointer manipulation.

The following examples show a situation involving a union of differently sized data items. The examples show a user-defined unsigned 16.16 fixed-point type, and two operations defined on this type. Function fixed_add() adds two fixed-point numbers, and function fixed_int() extracts the integer portion of a fixed-point number. Example (Avoid) shows an inappropriate implementation of fixed_int(), which, when used on the result of fixed_add(), causes misalignment, address mismatch, or size mismatch between memory operands, such that no STLF takes place.
Example (Preferred) shows how to properly implement fixed_int() in order to allow store-to-load forwarding.

Example (Avoid):

    typedef union {
        unsigned int whole;
        struct {
            unsigned short frac;   /* lower 16 bits are fraction */
            unsigned short intg;   /* upper 16 bits are integer  */
        } parts;
    } FIXED_U_16_16;

    _inline FIXED_U_16_16 fixed_add (FIXED_U_16_16 x, FIXED_U_16_16 y)
    {
        FIXED_U_16_16 z;
        z.whole = x.whole + y.whole;
        return (z);
    }

    _inline unsigned int fixed_int (FIXED_U_16_16 x)
    {
        return ((unsigned int)(x.parts.intg));
    }

    FIXED_U_16_16 y, z;
    unsigned int q;

    label1:
        y = fixed_add (y, z);
        q = fixed_int (y);
    label2:

The object code generated for the source code between $label1 and $label2 typically follows one of these two variants:

    ;variant 1
    MOV   EDX, DWORD PTR [z]
    MOV   EAX, DWORD PTR [y]
    ADD   EAX, EDX
    MOV   DWORD PTR [y], EAX
    MOV   EAX, DWORD PTR [y+2]   ; misaligned/address mismatch, no forwarding in LS
    AND   EAX, 0FFFFh
    MOV   DWORD PTR [q], EAX

    ;variant 2
    MOV   EDX, DWORD PTR [z]
    MOV   EAX, DWORD PTR [y]
    ADD   EAX, EDX
    MOV   DWORD PTR [y], EAX
    MOVZX EAX, WORD PTR [y+2]    ; size and address mismatch, no forwarding in LS
    MOV   DWORD PTR [q], EAX

Example (Preferred):

    typedef union {
        unsigned int whole;
        struct {
            unsigned short frac;   /* lower 16 bits are fraction */
            unsigned short intg;   /* upper 16 bits are integer  */
        } parts;
    } FIXED_U_16_16;

    _inline FIXED_U_16_16 fixed_add (FIXED_U_16_16 x, FIXED_U_16_16 y)
    {
        FIXED_U_16_16 z;
        z.whole = x.whole + y.whole;
        return (z);
    }

    _inline unsigned int fixed_int (FIXED_U_16_16 x)
    {
        return (x.whole >> 16);
    }

    FIXED_U_16_16 y, z;
    unsigned int q;

    label1:
        y = fixed_add (y, z);
        q = fixed_int (y);
    label2:

The object code generated for the source code between $label1 and $label2 typically looks as follows:

    MOV   EDX, DWORD PTR [z]
    MOV   EAX, DWORD PTR [y]
    ADD   EAX, EDX
    MOV   DWORD PTR [y], EAX
    MOV   EAX, DWORD PTR [y]     ; size/address match, aligned: forwarding in LS
    SHR   EAX, 16
    MOV   DWORD PTR [q], EAX

Consider the Expression Order in Compound Branch Conditions

Branch conditions in C programs are often compound conditions consisting of multiple boolean expressions joined by the boolean operators && and ||. C guarantees a short-circuit evaluation of these operators.
This means that in the case of ||, the first operand to evaluate to TRUE terminates the evaluation, i.e., the following operands are not evaluated at all. Similarly, for &&, the first operand to evaluate to FALSE terminates the evaluation. Because of this short-circuit evaluation, it is not always possible to swap the operands of || and &&. This is especially the case when the evaluation of one of the operands causes a side effect. However, in most cases the exchange of operands is possible.

When used to control conditional branches, expressions involving || and && are translated into a series of conditional branches. The ordering of the conditional branches is a function of the ordering of the expressions in the compound condition, and can have a significant impact on performance. It is not possible to give an easy, closed-form formula for how to order the conditions. Overall performance is a function of a variety of the following factors:

    Probability of a branch mispredict for each of the branches generated
    Additional latency incurred due to a branch mispredict
    Cost of evaluating the conditions controlling each of the branches generated
    Amount of parallelism that can be extracted in evaluating the branch conditions
    Data stream consumed by an application (mostly due to the dependence of mispredict probabilities on the nature of the incoming data in data-dependent branches)

It is therefore recommended to experiment with the ordering of expressions in compound branch conditions in the most active areas of a program (so-called hot spots) where most of the execution time is spent. Such hot spots can be found through profiling. Feed a "typical" data stream to the program while doing the experiments.

Switch Statement Usage

Optimize Switch Statements

Switch statements are translated using a variety of algorithms. The most common of these are jump tables and comparison chains/trees. It is recommended to sort the cases of a switch statement according to the probability of occurrence, with the most probable first. This improves performance when the switch is translated as a comparison chain. It is further recommended to make the case labels small, contiguous integer values, as this allows the switch to be translated as a jump table. Most compilers allow the switch statement to be translated as a jump table if the case labels are small and contiguous integer values.
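The recommendation about small, contiguous case labels can be sketched as follows. This is an illustration only: the enumeration msg_kind, its members, and the function handling_cost are hypothetical names, the returned values are arbitrary, and MSG_DATA is merely assumed to be the most frequent case. Whether a given compiler actually emits a jump table depends on the compiler and its options.

```c
/* Hypothetical message codes, numbered 0..3 so that the case labels are
   small and contiguous (a form that can be translated as a jump table). */
enum msg_kind { MSG_DATA = 0, MSG_ACK = 1, MSG_PING = 2, MSG_CLOSE = 3 };

/* Returns an arbitrary illustrative cost per message kind, or -1 for an
   unknown kind. The case assumed to be most probable is listed first,
   which helps if the switch is translated as a comparison chain instead. */
int handling_cost(int kind)
{
    switch (kind) {
    case MSG_DATA:  return 1;
    case MSG_ACK:   return 2;
    case MSG_PING:  return 3;
    case MSG_CLOSE: return 4;
    default:        return -1;
    }
}
```

Had the codes been sparse values such as 10, 500, and 70000, a compiler would more likely fall back to a comparison chain or tree, where the ordering by probability matters most.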
Example (Avoid):

    int days_in_month, short_months, normal_months, long_months;

    switch (days_in_month) {
        case 28:
        case 29: short_months++; break;
        case 30: normal_months++; break;
        case 31: long_months++; break;
        default: printf ("month has fewer than 28 or more than 31 days\n");
    }

Example (Preferred):

    int days_in_month, short_months, normal_months, long_months;

    switch (days_in_month) {
        case 31: long_months++; break;
        case 30: normal_months++; break;
        case 28:
        case 29: short_months++; break;
        default: printf ("month has fewer than 28 or more than 31 days\n");
    }

Use Prototypes for All Functions

In general, use prototypes for all functions. Prototypes can convey additional information to the compiler that might enable more aggressive optimizations.

Use the Const Type Qualifier

Use the "const" type qualifier as much as possible. This optimization makes code more robust and may enable higher-performance code to be generated due to the additional information available to the compiler. For example, the C standard allows compilers not to allocate storage for objects that are declared "const" if their address is never taken.

Generic Loop Hoisting

To improve the performance of inner loops, it is beneficial to reduce redundant constant calculations (i.e., loop-invariant calculations). However, this idea can be extended to invariant control structures. The first case is that of a constant if() statement in a for() loop.

Example:

    for( i ... ) {
        if( CONSTANT0 ) {
            DoWork0( i );   // does not affect CONSTANT0
        } else {
            DoWork1( i );   // does not affect CONSTANT0
        }
    }

Transform the above loop into:

    if( CONSTANT0 ) {
        for( i ... ) {
            DoWork0( i );
        }
    } else {
        for( i ... ) {
            DoWork1( i );
        }
    }

This makes the inner loops tighter by avoiding repetitious evaluation of a known if() control structure. Although the branch would be easily predicted, the extra instructions and decode limitations imposed by branching are saved, which is usually well worth it.

Generalization for Multiple Constant Control Code

To generalize this further for multiple constant control code, some more work may have to be done to create the proper outer loop. Enumeration of the constant cases reduces this to a simple switch statement.
Example:

    for( i ... ) {
        if( CONSTANT0 ) {
            DoWork0( i );   // does not affect CONSTANT0 or CONSTANT1
        } else {
            DoWork1( i );   // does not affect CONSTANT0 or CONSTANT1
        }
        if( CONSTANT1 ) {
            DoWork2( i );   // does not affect CONSTANT0 or CONSTANT1
        } else {
            DoWork3( i );   // does not affect CONSTANT0 or CONSTANT1
        }
    }

Transform the above loop using the switch statement into:

    #define combine( c1, c2 ) (((c1) << 1) + (c2))

    switch( combine( CONSTANT0!=0, CONSTANT1!=0 ) ) {
        case combine( 1, 1 ):
            for( i ... ) {
                DoWork0( i );
                DoWork2( i );
            }
            break;
        case combine( 0, 1 ):
            for( i ... ) {
                DoWork1( i );
                DoWork2( i );
            }
            break;
        case combine( 1, 0 ):
            for( i ... ) {
                DoWork0( i );
                DoWork3( i );
            }
            break;
        case combine( 0, 0 ):
            for( i ... ) {
                DoWork1( i );
                DoWork3( i );
            }
            break;
        default:
            break;
    }

The trick here is that there is some up-front work involved in generating all the combinations for the switch constant, and the total amount of code has doubled. However, it is also clear that the inner loops are "if()-free". In ideal cases where the "DoWork*()" functions are inlined, the successive functions will have greater overlap, leading to greater parallelism than would be possible in the presence of intervening if() statements.

The same idea can be applied to constant switch() statements, or to combinations of switch() statements and if() statements inside for() loops. The method for combining the input constants gets more complicated, but it is worth it for the performance benefit.

However, the number of inner loops can also substantially increase. If the number of inner loops is prohibitively high, then only the most common cases need to be dealt with directly, and the remaining cases can fall back to the old code in a "default:" clause of the switch() statement.

This typically comes up when the programmer is considering runtime-generated code. While runtime-generated code can lead to similar levels of performance improvement, it is much harder to maintain, and the developer must do their own optimizations for their code generation without the help of an available compiler.

Declare Local Functions as Static

Functions that are not used outside the file where they are defined should always be declared static, which forces internal linkage. Otherwise, such functions default to external linkage, which might inhibit certain optimizations with some compilers, for example, aggressive inlining.
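As a minimal sketch of the static-function guideline, the fragment below keeps a helper private to its file. The names clamp_to_byte and brighten are hypothetical and serve only as illustration.

```c
/* Used only inside this file, so it is declared static (internal linkage).
   The compiler can see every call site and is free to inline it. */
static int clamp_to_byte(int v)
{
    if (v < 0)   return 0;
    if (v > 255) return 255;
    return v;
}

/* Part of this file's external interface, so it keeps external linkage. */
int brighten(int pixel, int delta)
{
    return clamp_to_byte(pixel + delta);
}
```

Because no other translation unit can reference clamp_to_byte, a compiler that inlines it into brighten does not need to emit a separate out-of-line copy.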
Dynamic Memory Allocation Consideration

Dynamic memory allocation ("malloc" in the C language) should always return a pointer that is suitably aligned for the largest base type (quadword alignment). Where this aligned pointer cannot be guaranteed, use the technique shown in the following code to make the pointer quadword aligned, if needed. This code assumes that the pointer can be cast to a long.

Example:

    double* p;
    double* np;

    p  = (double *)malloc(sizeof(double)*number_of_doubles+7L);
    np = (double *)((((long)(p))+7L) & (-8L));

Then use "np" instead of "p" to access the data. "p" is still needed in order to deallocate the storage.

Introduce Explicit Parallelism into Code

Where possible, break long dependency chains into several independent dependency chains that can then be executed in parallel, exploiting the pipelined execution units. This is especially important for floating-point code, whether it is mapped to x87 or 3DNow! instructions, because of the longer latency of floating-point operations. Since most languages, including ANSI C, guarantee that floating-point expressions are not reordered, compilers cannot usually perform such optimizations unless they offer a switch to allow ANSI-noncompliant reordering of floating-point expressions according to algebraic rules.

Note that reordered code that is algebraically identical to the original code does not necessarily deliver identical computational results, due to the lack of associativity of floating-point operations. There are numerical considerations in applying these optimizations (consult a book on numerical analysis). In some cases, these optimizations may lead to unexpected results. Fortunately, in the vast majority of cases, the final result differs only in the least significant bits.

Example (Avoid):

    double a[100],sum;
    int i;

    sum = 0.0;
    for (i=0; i<100; i++) {
        sum += a[i];
    }

Example (Preferred):

    double a[100],sum1,sum2,sum3,sum4,sum;
    int i;

    sum1 = 0.0;
    sum2 = 0.0;
    sum3 = 0.0;
    sum4 = 0.0;
    for (i=0; i<100; i+=4) {
        sum1 += a[i];
        sum2 += a[i+1];
        sum3 += a[i+2];
        sum4 += a[i+3];
    }
    sum = (sum4+sum3)+(sum1+sum2);

Notice that the four-way unrolling was chosen to exploit the four-stage fully pipelined floating-point adder.
Each stage of the floating-point adder is occupied on every clock cycle, ensuring maximal sustained utilization.

Explicitly Extract Common Subexpressions

In certain situations, C compilers are unable to extract common subexpressions from floating-point expressions due to the guarantee against reordering of such expressions in the ANSI standard. Specifically, the compiler cannot rearrange the computation according to algebraic equivalencies before extracting common subexpressions. In such cases, manually extract the common subexpression. Note that rearranging the expression may produce different computational results due to the lack of associativity of floating-point operations, but the results usually differ only in the least significant bits.

Example (Avoid):

    double a,b,c,d,e,f;

    e = b*c/d;
    f = b/d*a;

Example (Preferred):

    double a,b,c,d,e,f,t;

    t = b/d;
    e = c*t;
    f = a*t;

Example (Avoid):

    double a,b,c,e,f;

    e = a/c;
    f = b/c;

Example (Preferred):

    double a,b,c,e,f,t;

    t = 1/c;
    e = a*t;
    f = b*t;

C Language Structure Component Considerations

Many compilers have options that allow padding of structures to make their size multiples of words, doublewords, or quadwords, in order to achieve better alignment for structures. In addition, to improve the alignment of structure members, some compilers might allocate structure elements in an order that differs from the order in which they are declared. However, some compilers might not offer any of these features, or their implementation might not work properly in all situations. Therefore, to achieve the best alignment of structures and structure members while minimizing the amount of padding regardless of compiler optimizations, the following methods are suggested.

Sort by Base Type Size

Sort structure members according to their base type size, declaring members with a larger base type size ahead of members with a smaller base type size.

Pad by Multiple of Largest Base Type Size

Pad the structure to a multiple of the largest base type size of any member. In this fashion, if the first member of a structure is naturally aligned, all other members are naturally aligned as well.
The padding of the structure to a multiple of the largest base type size allows, for example, arrays of structures to be perfectly aligned. The following example demonstrates the reordering of structure member declarations:

Example, Original ordering (Avoid):

    struct {
        char   a[5];
        long   k;
        double x;
    } baz;

Example, New ordering with padding (Preferred):

    struct {
        double x;
        long   k;
        char   a[5];
        char   pad[7];
    } baz;

See "C Language Structure Component Considerations" for a different perspective.

Sort Local Variables According to Base Type Size

When a compiler allocates local variables in the same order in which they are declared in the source code, it can be helpful to declare local variables in such a manner that variables with a larger base type size are declared ahead of variables with a smaller base type size. Then, if the first variable is allocated so that it is naturally aligned, all other variables are allocated contiguously in the order they are declared and are naturally aligned without any padding.

Some compilers do not allocate variables in the order they are declared. In these cases, the compiler should automatically allocate variables in such a manner as to make them naturally aligned with the minimum amount of padding. In addition, some compilers do not guarantee that the stack is aligned suitably for the largest base type (that is, they do not guarantee quadword alignment), so that quadword operands might be misaligned, even if this technique is used and the compiler does allocate variables in the order they are declared.

The following example demonstrates the reordering of local variable declarations:

Example, Original ordering (Avoid):

    short  ga, gu, gi;
    long   foo, bar;
    double x, y, z[3];
    char   a, b;
    float  baz;

Example, Improved ordering (Preferred):

    double z[3];
    double x, y;
    long   foo, bar;
    float  baz;
    short  ga, gu, gi;
    char   a, b;

See "Sort Variables According to Base Type Size" for more information from a different perspective.

Accelerating Floating-Point Divides and Square Roots

Divides and square roots have a much longer latency than other floating-point operations, even though the AMD Athlon processor provides significant acceleration of these operations.
In some codes, these operations occur so often as to seriously impact performance. In these cases, it is recommended to port the code to 3DNow! inline assembly or to use a compiler that can generate 3DNow! code. If code has hot spots that use single-precision arithmetic only (i.e., all computation involves data of type float) and for some reason cannot be ported to 3DNow! code, the following technique may be used to improve performance.

The x87 FPU has a precision-control field as part of the FPU control word. The precision-control setting determines to what precision results are rounded. It affects the basic arithmetic operations, including divides and square roots. AMD Athlon and AMD-K6 family processors implement divide and square root in such a fashion as to compute only the number of bits necessary for the currently selected precision. This means that setting precision control to single precision (versus the Win32 default of double precision) lowers the latency of those operations.

The Microsoft Visual C environment provides functions to manipulate the FPU control word and thus the precision control. Note that these functions are not very fast, so insert changes of precision control where they create little overhead, such as outside a computation-intensive loop. Otherwise the overhead created by the function calls outweighs the benefit of reducing the latencies of divide and square root operations.

The following example shows how to set the precision control to single precision and later restore the original settings in the Microsoft Visual C environment.

Example:

    /* prototype for _controlfp() function */
    #include <float.h>
    unsigned int orig_cw;

    /* Get current FPU control word and save it */
    orig_cw = _controlfp (0,0);

    /* Set precision control in FPU control word to single precision.
       This reduces the latency of divide and square root operations. */
    _controlfp (_PC_24, MCW_PC);

    /* restore original FPU control word */
    _controlfp (orig_cw, 0xfffff);

Fast Floating-Point-to-Integer Conversion

Floating-point-to-integer conversion in C programs is typically a very slow operation. The semantics of C and C++ demand that the conversion use truncation.
If the floating-point operand is of type float, and the compiler supports 3DNow! code generation, the 3DNow! PF2ID instruction, which performs truncating conversion, can be utilized by the compiler to accomplish rapid floating-point-to-integer conversion.

For double-precision operands, the usual way to accomplish truncating conversion involves the following algorithm:

    1. Save the current x87 rounding mode (this is usually round to nearest or even).
    2. Set the x87 rounding mode to truncation.
    3. Load the floating-point source operand and store the integer result.
    4. Restore the original x87 rounding mode.

This algorithm is typically implemented through a C runtime library function called ftol(). While the AMD Athlon processor has special hardware optimizations to speed up the changing of x87 rounding modes and therefore ftol(), calls to ftol() may still tend to be slow.

For situations where very fast floating-point-to-integer conversion is required, the conversion code in the "Fast" example below may be helpful. Note that this code uses the current rounding mode instead of truncation when performing the conversion. Therefore the result may differ from the ftol() result. The replacement code adds the "magic number" 2^52+2^51 to the source operand, then stores the double-precision result to memory and retrieves the lower doubleword of the stored result. Adding the magic number shifts the original argument to the right inside the double-precision mantissa, placing the binary point of the sum immediately to the right of the least significant mantissa bit. Extracting the lower doubleword of the sum then delivers the integral portion of the original argument.

Note: This conversion code causes a 64-bit store to feed into a 32-bit load. The load is from the lower 32 bits of the 64-bit store, the one case of size mismatch between a store and a depending load specifically supported by the store-to-load-forwarding hardware of the AMD Athlon processor.
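The difference between rounding and truncation noted above can be checked with a small sketch of the magic-number technique. This is not the guide's macro: the function name double_to_int_round is hypothetical, memcpy replaces the pointer cast to stay within C aliasing rules, and the code assumes IEEE-754 doubles, a little-endian target, the default round-to-nearest-even mode, and an argument whose rounded value fits in 32 bits.

```c
#include <stdint.h>
#include <string.h>

/* Converts d to a 32-bit integer using the current rounding mode
   (assumed: round to nearest even), not truncation.
   Assumes IEEE-754 doubles and a little-endian target. */
int32_t double_to_int_round(double d)
{
    /* 6755399441055744.0 is 2^52 + 2^51. The volatile store forces the sum
       to be rounded to double precision in memory. */
    volatile double t = d + 6755399441055744.0;
    double stored = t;
    int32_t low;

    /* On a little-endian target the first four bytes are the lower
       doubleword of the stored sum, which holds the integer result. */
    memcpy(&low, &stored, sizeof low);
    return low;
}
```

Under the stated assumptions, double_to_int_round(3.5) yields 4 and double_to_int_round(2.5) yields 2 (ties round to even), whereas the truncating casts (int)3.5 and (int)2.5 yield 3 and 2. Code that depends on truncation semantics therefore cannot use this technique as a drop-in replacement.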
Example (Slow):

    double x;
    int i;

    i = x;

Example (Fast):

    #define DOUBLE2INT(i,d) \
        {double t = ((d)+6755399441055744.0); i=*((int *)(&t));}

    double x;
    int i;

    DOUBLE2INT(i,x);

Speeding Up Branches Based on Comparisons Between Floats

Branches based on floating-point comparisons are often slow. The AMD Athlon processor supports instructions such as FCOMI and FUCOMI that allow implementation of fast branches based on comparisons between operands of type double or type float. However, many compilers do not support generating these instructions. Likewise, floating-point comparisons between operands of type float can be accomplished quickly by using the 3DNow! PFCMP instruction if the compiler supports 3DNow! code generation.

With many compilers, the only way they implement branches based on floating-point comparisons is to use the FCOM or FCOMP instructions to compare the floating-point operands, followed by "FSTSW AX" in order to transfer the x87 condition code flags into EAX. This allows a branch based on the contents of that register. Although the AMD Athlon processor has acceleration hardware to speed up the FSTSW instruction, this process is still fairly slow.

Branches Dependent on Integer Comparisons Are Fast

One alternative for branches based on comparisons between operands of type float is to store the operand(s) into a memory location and then perform an integer comparison with that memory location. Branches dependent on integer comparisons are very fast. It should be noted that the replacement code uses a load dependent on an immediately prior store. If the store is not doubleword aligned, no store-to-load forwarding takes place and the branch is still slow. Also, if there is a lot of activity in the load-store queue, forwarding of the store data may be somewhat delayed, thus negating some of the advantages of using the replacement code. It is recommended to experiment with the replacement code to test whether it actually provides a performance increase in the code at hand.

The replacement code works well for comparisons against zero, including correct behavior when encountering a negative zero as allowed by IEEE-754. It also works well for comparing to positive constants. In that case the user must first determine the integer representation of that floating-point constant.
This can be accomplished with the following C code snippet:

    float x;
    scanf ("%g", &x);
    printf ("%08X\n", (*((int *)(&x))));

The replacement code is IEEE-754 compliant for all classes of floating-point operands except NaNs. However, NaNs do not occur in properly working software.

Examples:

    #define FLOAT2INTCAST(f)  (*((int *)(&f)))
    #define FLOAT2UINTCAST(f) (*((unsigned int *)(&f)))

    /* comparisons against zero */
    if (f <  0.0f)  ==>  if (FLOAT2UINTCAST(f) >  0x80000000U)
    if (f <= 0.0f)  ==>  if (FLOAT2INTCAST(f)  <= 0)
    if (f >  0.0f)  ==>  if (FLOAT2INTCAST(f)  >  0)
    if (f >= 0.0f)  ==>  if (FLOAT2UINTCAST(f) <= 0x80000000U)

    /* comparisons against positive constant */
    if (f <  3.0f)  ==>  if (FLOAT2INTCAST(f) <  0x40400000)
    if (f <= 3.0f)  ==>  if (FLOAT2INTCAST(f) <= 0x40400000)
    if (f >  3.0f)  ==>  if (FLOAT2INTCAST(f) >  0x40400000)
    if (f >= 3.0f)  ==>  if (FLOAT2INTCAST(f) >= 0x40400000)

    /* comparisons among two floats */
    if (f1 <  f2)   ==>  float t = f1 - f2;
                         if (FLOAT2UINTCAST(t) >  0x80000000U)
    if (f1 <= f2)   ==>  float t = f1 - f2;
                         if (FLOAT2INTCAST(t)  <= 0)
    if (f1 >  f2)   ==>  float t = f1 - f2;
                         if (FLOAT2INTCAST(t)  >  0)
    if (f1 >= f2)   ==>  float t = f1 - f2;
                         if (FLOAT2UINTCAST(t) <= 0x80000000U)

Avoid Unnecessary Integer Division

Integer division is the slowest of all integer arithmetic operations and should be avoided wherever possible. One possibility for reducing the number of integer divisions is multiple divisions, in which division can be replaced with multiplication as shown in the following examples. This replacement is possible only if no overflow occurs during the computation of the product. This can be determined by considering the possible ranges of the divisors.

Example (Avoid):

    int i,j,k,m;

    m = i / j / k;

Example (Preferred):

    int i,j,k,m;

    m = i / (j * k);

Copy Frequently Dereferenced Pointer Arguments to Local Variables

Avoid frequently dereferencing pointer arguments inside a function. Since the compiler has no knowledge of whether aliasing exists between the pointers, such dereferencing cannot be optimized away by the compiler. This prevents data from being kept in registers and significantly increases memory traffic.

Note that many compilers have an "assume no aliasing" optimization switch. This allows the compiler to assume that two different pointers always have disjoint contents and does not require copying of pointer arguments to local variables.
Otherwise, copy the data pointed to by the pointer arguments to local variables at the start of the function and, if necessary, copy them back at the end of the function.

Example (Avoid):

    //assumes pointers are different and q!=r
    void isqrt ( unsigned long a,
                 unsigned long *q,
                 unsigned long *r)
    {
        *q = a;
        if (a > 0)
        {
            while (*q > (*r = a / *q))
            {
                *q = (*q + *r) >> 1;
            }
        }
        *r = a - *q * *q;
    }

Example (Preferred):

    //assumes pointers are different and q!=r
    void isqrt ( unsigned long a,
                 unsigned long *q,
                 unsigned long *r)
    {
        unsigned long qq, rr;
        qq = a;
        if (a > 0)
        {
            while (qq > (rr = a / qq))
            {
                qq = (qq + rr) >> 1;
            }
        }
        rr = a - qq * qq;
        *q = qq;
        *r = rr;
    }

Use Block Prefetch Optimizations

Block prefetching can be applied to C code without using any assembly level instructions. This example adds all the values in two arrays of double precision floating point values, to produce a single double precision floating point total. The optimization technique can be applied to any code that processes large arrays from system memory.

This is an ordinary C loop that does the job. Bandwidth is approximated for code execution on an AMD Athlon 4 with DDR:

Example: Standard C code (bandwidth: ~750 MB/sec)

    double summo = 0.0;
    for (int i = 0; i < MEM_SIZE; i += 8) {   // 8 bytes per double
        summo += *a_ptr++ + *b_ptr++;         // reads from memory
    }

Using a block prefetch can significantly improve memory read bandwidth. The same function, optimized using block prefetch to read the arrays into cache at maximum bandwidth, follows. The block prefetch is implemented in C source code, as the procedure BLOCK_PREFETCH_4K. It reads 4 Kbytes of data per block. This version gets about 1125 Mbytes/sec on an AMD Athlon 4 processor with DDR, a 50% performance gain over the Standard C Code Example.

Example: C code Using Block Prefetching (bandwidth: ~1125 Mbytes/sec)

    static const int CACHEBLOCK = 0x1000;   // prefetch chunk size (4K bytes)
    int p_fetch;                            // this "anchor" variable helps to
                                            // fool the C optimizer

    static inline void BLOCK_PREFETCH_4K (void* addr) {
        int* a = (int*) addr;               // cast as INT pointer for speed

        p_fetch += a[0]   + a[16]  + a[32]  + a[48]    // Grab every 64th address,
                 + a[64]  + a[80]  + a[96]  + a[112]   // to hit each cache line once.
                 + a[128] + a[144] + a[160] + a[176]
                 + a[192] + a[208] + a[224] + a[240];

        a += 256;                           // point to second 1K stretch of addresses
        p_fetch += a[0]   + a[16]  + a[32]  + a[48]
                 + a[64]  + a[80]  + a[96]  + a[112]
                 + a[128] + a[144] + a[160] + a[176]
                 + a[192] + a[208] + a[224] + a[240];

        a += 256;                           // point to third 1K stretch of addresses
        p_fetch += a[0]   + a[16]  + a[32]  + a[48]
                 + a[64]  + a[80]  + a[96]  + a[112]
                 + a[128] + a[144] + a[160] + a[176]
                 + a[192] + a[208] + a[224] + a[240];

        a += 256;                           // point to fourth 1K stretch of addresses
        p_fetch += a[0]   + a[16]  + a[32]  + a[48]
                 + a[64]  + a[80]  + a[96]  + a[112]
                 + a[128] + a[144] + a[160] + a[176]
                 + a[192] + a[208] + a[224] + a[240];
    }

    for (int m = 0; m < MEM_SIZE; m += CACHEBLOCK) {   // process in blocks
        BLOCK_PREFETCH_4K(a_ptr);           // read next 4K bytes of "a" into cache
        BLOCK_PREFETCH_4K(b_ptr);           // read next 4K bytes of "b" into cache
        for (int i = 0; i < CACHEBLOCK; i += 8) {
            summo += *a_ptr++ + *b_ptr++;   // reads from cache!
        }
    }

Caution: Since the prefetch code does not really do anything from the compiler's point of view, there is a danger that it might be optimized out of the generated code. The block prefetch function BLOCK_PREFETCH_4K therefore uses a trick to prevent that from happening. The memory values are read as INTs, added together (which is very fast for INTs), and then assigned to the global variable p_fetch. This assignment should "fool" the optimizer into leaving the prefetch code intact. However, be aware that in general the compiler might remove block prefetch code.

For a more thorough discussion of block prefetch, see "Optimizing Main Memory Performance for Large Arrays" and the optimized memory-copy code in the section "Use MMX Instructions for Block Copies and Block Fills".

Instruction Decoding Optimizations

This chapter describes ways to maximize the number of instructions decoded by the instruction decoders in the AMD Athlon processor. Guidelines are listed in order of importance.

Overview

The AMD Athlon processor instruction fetcher reads 16-byte aligned code windows from the instruction cache. The instruction bytes are then merged into a 24-byte instruction queue. On each cycle, the in-order front-end engine selects for decode up to three x86 instructions from the instruction-byte queue.
All instructions (x86, x87, 3DNow!, and MMX instructions) are classified into two types of decodes, DirectPath and VectorPath (see "DirectPath Decoder" and "VectorPath Decoder" under "Early Decoding" for more information). DirectPath instructions are common instructions that are decoded directly in hardware. VectorPath instructions are more complex instructions that require the use of a sequence of multiple operations issued from an on-chip ROM.

Up to three DirectPath instructions can be selected for decode per cycle. Only one VectorPath instruction can be selected for decode per cycle. DirectPath instructions and VectorPath instructions cannot be decoded simultaneously.

Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized for decode and execute efficiently by minimizing the number of operations per x86 instruction, which includes 'register <- register op memory' as well as 'register <- register op register' forms of instructions. Up to three DirectPath instructions can be decoded per cycle. VectorPath instructions block the decoding of DirectPath instructions.

The AMD Athlon processor implements the majority of instructions used by a compiler as DirectPath instructions. However, assembly writers must still take into consideration the usage of DirectPath versus VectorPath instructions.

See the appendix "Instruction Dispatch and Execution Resources/Timing" for tables of DirectPath and VectorPath instructions.

Load-Execute Instruction Usage

Use Load-Execute Integer Instructions

Most load-execute integer instructions are DirectPath decodable and can be decoded at the rate of three per cycle. Splitting a load-execute integer instruction into two separate instructions (a load instruction and a "reg, reg" instruction) reduces decoding bandwidth and increases register pressure, which results in lower performance. Use the split-instruction form to avoid scheduler stalls for longer executing instructions and to explicitly schedule the load and execute operations.
Use Load-Execute Floating-Point Instructions with Floating-Point Operands

When operating on single-precision or double-precision floating-point data, use floating-point load-execute instructions wherever possible to increase code density.

Note: This optimization applies only to floating-point instructions with floating-point operands and not to integer operands, as described in the next section.

This coding style helps in two ways. First, denser code allows more work to be held in the instruction cache. Second, denser code generates fewer internal MacroOPs and, therefore, the FPU scheduler holds more work, increasing the chances of extracting parallelism from the code.

Example (Avoid):

    FLD   QWORD PTR [TEST1]
    FLD   QWORD PTR [TEST2]
    FMUL  ST, ST(1)

Example (Preferred):

    FLD   QWORD PTR [TEST1]
    FMUL  QWORD PTR [TEST2]

Avoid Load-Execute Floating-Point Instructions with Integer Operands

Do not use load-execute floating-point instructions with integer operands, such as FIADD, FISUB, FISUBR, FIMUL, FIDIV, and FIDIVR. Remember that floating-point instructions can have integer operands while integer instructions cannot have floating-point operands.

Floating-point computations involving integer-memory operands should use separate FILD and arithmetic instructions. This optimization has the potential to increase decode bandwidth and OP density in the FPU scheduler. The floating-point load-execute instructions with integer operands are VectorPath and generate two OPs in a cycle, while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle. In some situations this optimization can also reduce execution time if the FILD can be scheduled several instructions ahead of the arithmetic instruction in order to cover the FILD latency.
Example (Avoid):

    FLD    QWORD PTR [foo]
    FIMUL  DWORD PTR [bar]
    FIADD  DWORD PTR [baz]

Example (Preferred):

    FILD   DWORD PTR [bar]
    FILD   DWORD PTR [baz]
    FLD    QWORD PTR [foo]
    FMULP  ST(2), ST
    FADDP  ST(1), ST

Use Read-Modify-Write Instructions Where Appropriate

The AMD Athlon processor handles read-modify-write (RMW) instructions such as "ADD [mem], reg32" very efficiently. The vast majority of RMW instructions are DirectPath instructions. Use of RMW instructions can provide a performance benefit over the use of an equivalent combination of load, load-execute, and store instructions. In comparison to the load/load-execute/store combination, the equivalent RMW instruction promotes code density (better I-cache utilization), preserves decode bandwidth, and saves execution resources, as it occupies only one reservation station and requires only one address computation. It may also reduce register pressure, as demonstrated in the iteration_count example below.

Use of RMW instructions is indicated if an operation is performed on data that is in memory, and the result of that operation is not reused soon. Due to the limited number of integer registers in an x86 processor, it is often the case that data needs to be kept in memory instead of in registers. Additionally, it can be the case that the data, once operated upon, is not reused soon. An example would be an accumulator inside a loop of unknown trip count, where the accumulator result is not reused inside the loop. Note that for loops with a known trip count, the accumulator manipulation can frequently be hoisted out of the loop.
Example (C code):

    /* C code */
    int accu, increment;
    while (condition) {
        ...
        /* accu is not read and increment is not written here */
        ...
        accu += increment;
    }

Example (Avoid):

    MOV  EAX, [increment]
    ADD  EAX, [accu]
    MOV  [accu], EAX

Example (Preferred):

    MOV  EAX, [increment]
    ADD  [accu], EAX

Example (C code):

    /* C code */
    int iteration_count;
    iteration_count = 0;
    while (condition) {
        ...
        /* iteration count is not read here */
        ...
        iteration_count++;
    }

Example (Avoid):

    MOV  EAX, [iteration_count]
    ADD  EAX, 1
    MOV  [iteration_count], EAX

Example (Preferred):

    INC  [iteration_count]

Align Branch Targets in Program Hot Spots

In program hot spots (as determined by either profiling or loop nesting analysis), place branch targets at or near the beginning of 16-byte aligned code windows. This guideline improves performance inside hot spots by maximizing the number of instruction fills into the instruction-byte queue, and preserves I-cache space in branch-intensive code outside such hot spots.

Use 32-Bit LEA Rather than 16-Bit LEA Instruction

The 32-bit Load Effective Address (LEA) instruction is implemented as a DirectPath operation with an execute latency of only two cycles. The 16-bit LEA instruction, however, is a VectorPath instruction, which lowers the decode bandwidth and has a longer execution latency.

Use Short Instruction Encodings

Assemblers and compilers should generate the shortest instruction encodings possible to optimize use of the I-cache and increase the average decode rate. Wherever possible, use instructions with shorter lengths. Using shorter instructions increases the number of instructions that can fit into the instruction-byte queue. For example, use 8-bit displacements as opposed to 32-bit displacements. In addition, use the single-byte format of simple integer instructions whenever possible, as opposed to the 2-byte opcode ModR/M format.
Example (Avoid):

    ADD  EAX, 12345678h   ;uses 2-byte opcode form (with ModR/M)
    ADD  EBX, -5          ;uses 32-bit immediate
    JZ   $label1          ;uses 2-byte opcode, 32-bit immediate

Example (Preferred):

    ADD  EAX, 12345678h   ;uses single byte opcode form
    ADD  EBX, -5          ;uses 8-bit sign extended immediate
    JZ   $label1          ;uses 1-byte opcode, 8-bit immediate

Avoid Partial-Register Reads and Writes

In order to handle partial-register writes, the AMD Athlon processor execution core implements a data-merging scheme. In the execution unit, an instruction writing a partial register merges the modified portion with the current state of the remainder of the register. Therefore, the dependency hardware can potentially force a false dependency on the most recent instruction that writes to any part of the register.

Example (Avoid):

    MOV  AL, 10    ;inst 1
    MOV  AH, 12    ;inst 2 has a false dependency on inst 1
                   ;inst 2 merges new AH with current EAX
                   ; register value forwarded by inst 1

In addition, an instruction that has a read dependency on any part of a given architectural register has a read dependency on the most recent instruction that modifies any part of the same architectural register.

Example (Avoid):

    MOV  BX, 12h   ;inst 1
    MOV  BL, DL    ;inst 2, false dependency on completion of inst 1
    MOV  BH, CL    ;inst 3, false dependency on completion of inst 2
    MOV  AL, BL    ;inst 4, depends on completion of inst 3

Use LEAVE Instruction for Function Epilogue Code

The classical approach for referencing function arguments and local variables inside a function is the use of a so-called frame pointer. In x86 code, the EBP register is customarily used as a frame pointer. In function prologue code, the frame pointer is set up as follows:

    PUSH EBP             ;save old frame pointer
    MOV  EBP, ESP        ;new frame pointer
    SUB  ESP, nnnnnnnn   ;allocate local variables

Function arguments on the stack can now be accessed at positive offsets relative to EBP, and local variables are accessible at negative offsets relative to EBP. In the function epilogue code, the following work is performed:

    MOV  ESP, EBP        ;deallocate local variables
    POP  EBP             ;restore old frame pointer

The functionality of these two instructions is identical to that of the LEAVE instruction.
The LEAVE instruction is a single-byte instruction and thus saves two bytes of code space over the MOV/POP epilogue sequence. Replacing the MOV/POP sequence with LEAVE also preserves decode bandwidth. Therefore, use the LEAVE instruction in function epilogue code for both code optimized specifically for the AMD Athlon processor and blended code (code that performs well on both AMD-K6 and AMD Athlon processors).

For functions that do not allocate local variables, the prologue and epilogue code can be simplified to the following:

    PUSH EBP        ;save old frame pointer
    MOV  EBP, ESP   ;new frame pointer
    ...
    POP  EBP        ;restore old frame pointer

This is optimal in cases where the use of a frame pointer is desired. For highest performance code, do not use a frame pointer at all. Function arguments and local variables should be accessed directly through ESP, thus freeing up EBP for use as a general purpose register and reducing register pressure.

Replace Certain SHLD Instructions with Alternative Code

Certain instances of the SHLD instruction can be replaced by alternative code sequences using ADD and ADC, or SHR and LEA. The alternative code has lower latency and requires less execution resources. ADD, ADC, SHR, and LEA (32-bit version) are DirectPath instructions, while SHLD is a VectorPath instruction. Use of the replacement code optimizes decode bandwidth, as it potentially enables the decoding of a third DirectPath instruction. The replacement code may increase register pressure since it destroys the contents of REG2, whereas REG2 is preserved by SHLD. In situations where register pressure is high, use of the replacement sequences may therefore not be indicated.

Example (Avoid):

    SHLD REG1, REG2, 1

Example (Preferred):

    ADD  REG2, REG2
    ADC  REG1, REG1

Example (Avoid):

    SHLD REG1, REG2, 2

Example (Preferred):

    SHR  REG2, 30
    LEA  REG1, [REG1*4 + REG2]

Example (Avoid):

    SHLD REG1, REG2, 3

Example (Preferred):

    SHR  REG2, 29
    LEA  REG1, [REG1*8 + REG2]

Use 8-Bit Sign-Extended Immediates

Using 8-bit sign-extended immediates improves code density with no negative effects on the AMD Athlon processor. For example, ADD EBX, -5 should be encoded with the 8-bit immediate form "83 C3 FB" rather than with a full-size immediate.

Use 8-Bit Sign-Extended Displacements

Use 8-bit sign-extended displacements for conditional branches.
Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative effects on the AMD Athlon processor.

Code Padding Using Neutral Code Fillers

Occasionally a need arises to insert neutral code fillers into the code stream, e.g., for code alignment purposes or to space out branches. Since this filler code can be executed, it should take up as few execution resources as possible, not diminish decode density, and not modify any processor state other than advancing EIP. A one-byte padding can easily be achieved using the NOP instruction (XCHG EAX, EAX; opcode 0x90). In the x86 architecture, there are several multi-byte "NOP" instructions available that do not change processor state other than EIP:

    MOV    REG, REG
    XCHG   REG, REG
    CMOVcc REG, REG
    SHR    REG, 0
    SAR    REG, 0
    SHL    REG, 0
    SHRD   REG, REG, 0
    SHLD   REG, REG, 0
    LEA    REG, [REG]
    LEA    REG, [REG+00]
    LEA    REG, [REG*1+00]
    LEA    REG, [REG+00000000]
    LEA    REG, [REG*1+00000000]

Not all of these instructions are equally suitable for purposes of code padding. For example, SHLD/SHRD are microcoded, which reduces decode bandwidth and takes up execution resources.

Recommendations for AMD-K6 Family and AMD Athlon Processor Blended Code

On x86 processors, the instructions and instruction sequences presented below are recommended for code padding on both AMD-K6 family processors and the AMD Athlon processor.

Each of the instructions and instruction sequences below utilizes an x86 register. To avoid performance degradation, select a register for the padding that does not lengthen existing dependency chains, i.e., select a register that is not used by instructions in the vicinity of the neutral code filler. Certain instructions use registers implicitly. For example, PUSH, POP, CALL, and RET all make implicit use of the ESP register. The 5-byte filler sequence below consists of two instructions. If flag changes across the code padding are acceptable, the following instructions may be used as single-instruction 5-byte code fillers:

    TEST EAX, 0FFFF0000h
    CMP  EAX, 0FFFF0000h

The following assembly language macros show the recommended neutral code fillers for code optimized for the AMD Athlon processor that also has to run well on other x86 processors. Note that for some padding lengths, versions using ESP or EBP are missing due to the lack of fully generalized addressing modes.
    NOP2_EAX TEXTEQU <DB 08Bh,0C0h>   ;MOV EAX, EAX
    NOP2_EBX TEXTEQU <DB 08Bh,0DBh>   ;MOV EBX, EBX
    NOP2_ECX TEXTEQU <DB 08Bh,0C9h>   ;MOV ECX, ECX
    NOP2_EDX TEXTEQU <DB 08Bh,0D2h>   ;MOV EDX, EDX
    NOP2_ESI TEXTEQU <DB 08Bh,0F6h>   ;MOV ESI, ESI
    NOP2_EDI TEXTEQU <DB 08Bh,0FFh>   ;MOV EDI, EDI
    NOP2_ESP TEXTEQU <DB 08Bh,0E4h>   ;MOV ESP, ESP
    NOP2_EBP TEXTEQU <DB 08Bh,0EDh>   ;MOV EBP, EBP

    NOP3_EAX TEXTEQU <DB 08Dh,004h,020h>   ;LEA EAX, [EAX]
    NOP3_EBX TEXTEQU <DB 08Dh,01Ch,023h>   ;LEA EBX, [EBX]
    NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h>   ;LEA ECX, [ECX]
    NOP3_EDX TEXTEQU <DB 08Dh,014h,022h>   ;LEA EDX, [EDX]
    NOP3_ESI TEXTEQU <DB 08Dh,034h,026h>   ;LEA ESI, [ESI]
    NOP3_EDI TEXTEQU <DB 08Dh,03Ch,027h>   ;LEA EDI, [EDI]
    NOP3_ESP TEXTEQU <DB 08Dh,024h,024h>   ;LEA ESP, [ESP]
    NOP3_EBP TEXTEQU <DB 08Dh,06Dh,000h>   ;LEA EBP, [EBP]

    NOP4_EAX TEXTEQU <DB 08Dh,044h,020h,000h>   ;LEA EAX, [EAX+00]
    NOP4_EBX TEXTEQU <DB 08Dh,05Ch,023h,000h>   ;LEA EBX, [EBX+00]
    NOP4_ECX TEXTEQU <DB 08Dh,04Ch,021h,000h>   ;LEA ECX, [ECX+00]
    NOP4_EDX TEXTEQU <DB 08Dh,054h,022h,000h>   ;LEA EDX, [EDX+00]
    NOP4_ESI TEXTEQU <DB 08Dh,074h,026h,000h>   ;LEA ESI, [ESI+00]
    NOP4_EDI TEXTEQU <DB 08Dh,07Ch,027h,000h>   ;LEA EDI, [EDI+00]
    NOP4_ESP TEXTEQU <DB 08Dh,064h,024h,000h>   ;LEA ESP, [ESP+00]

    ;LEA EAX, [EAX+00];NOP
    NOP5_EAX TEXTEQU <DB 08Dh,044h,020h,000h,090h>
    ;LEA EBX, [EBX+00];NOP
    NOP5_EBX TEXTEQU <DB 08Dh,05Ch,023h,000h,090h>
    ;LEA ECX, [ECX+00];NOP
    NOP5_ECX TEXTEQU <DB 08Dh,04Ch,021h,000h,090h>
    ;LEA EDX, [EDX+00];NOP
    NOP5_EDX TEXTEQU <DB 08Dh,054h,022h,000h,090h>
    ;LEA ESI, [ESI+00];NOP
    NOP5_ESI TEXTEQU <DB 08Dh,074h,026h,000h,090h>
    ;LEA EDI, [EDI+00];NOP
    NOP5_EDI TEXTEQU <DB 08Dh,07Ch,027h,000h,090h>
    ;LEA ESP, [ESP+00];NOP
    NOP5_ESP TEXTEQU <DB 08Dh,064h,024h,000h,090h>

    ;LEA EAX, [EAX+00000000]
    NOP6_EAX TEXTEQU <DB 08Dh,080h,0,0,0,0>
    ;LEA EBX, [EBX+00000000]
    NOP6_EBX TEXTEQU <DB 08Dh,09Bh,0,0,0,0>
    ;LEA ECX, [ECX+00000000]
    NOP6_ECX TEXTEQU <DB 08Dh,089h,0,0,0,0>
    ;LEA EDX, [EDX+00000000]
    NOP6_EDX TEXTEQU <DB 08Dh,092h,0,0,0,0>
    ;LEA ESI, [ESI+00000000]
    NOP6_ESI TEXTEQU <DB 08Dh,0B6h,0,0,0,0>
    ;LEA EDI, [EDI+00000000]
    NOP6_EDI TEXTEQU <DB 08Dh,0BFh,0,0,0,0>
    ;LEA EBP, [EBP+00000000]
    NOP6_EBP TEXTEQU <DB 08Dh,0ADh,0,0,0,0>

    ;LEA EAX, [EAX*1+00000000]
    NOP7_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0>
    ;LEA EBX, [EBX*1+00000000]
    NOP7_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0>
    ;LEA ECX, [ECX*1+00000000]
    NOP7_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0>
    ;LEA EDX, [EDX*1+00000000]
    NOP7_EDX TEXTEQU <DB 08Dh,014h,015h,0,0,0,0>
    ;LEA ESI, [ESI*1+00000000]
    NOP7_ESI TEXTEQU <DB 08Dh,034h,035h,0,0,0,0>
    ;LEA EDI, [EDI*1+00000000]
    NOP7_EDI TEXTEQU <DB 08Dh,03Ch,03Dh,0,0,0,0>
    ;LEA EBP, [EBP*1+00000000]
    NOP7_EBP TEXTEQU <DB 08Dh,02Ch,02Dh,0,0,0,0>

    ;LEA EAX, [EAX*1+00000000] ;NOP
    NOP8_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0,90h>
    ;LEA EBX, [EBX*1+00000000] ;NOP
    NOP8_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0,90h>
    ;LEA ECX, [ECX*1+00000000] ;NOP
    NOP8_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0,90h>
    ;LEA EDX, [EDX*1+00000000] ;NOP
    NOP8_EDX TEXTEQU <DB 08Dh,014h,015h,0,0,0,0,90h>
    ;LEA ESI, [ESI*1+00000000] ;NOP
    NOP8_ESI TEXTEQU <DB 08Dh,034h,035h,0,0,0,0,90h>
    ;LEA EDI, [EDI*1+00000000] ;NOP
    NOP8_EDI TEXTEQU <DB 08Dh,03Ch,03Dh,0,0,0,0,90h>
    ;LEA EBP, [EBP*1+00000000] ;NOP
    NOP8_EBP TEXTEQU <DB 08Dh,02Ch,02Dh,0,0,0,0,90h>

    ;JMP
    NOP9 TEXTEQU <DB 0EBh,007h,90h,90h,90h,90h,90h,90h,90h>

Cache and Memory Optimizations

This chapter describes code optimization techniques that take advantage of the large L1 caches and high-bandwidth buses of the AMD Athlon processor. Guidelines are listed in order of importance.

Memory Size and Alignment Issues

Avoid Memory-Size Mismatches

Avoid memory-size mismatches when different instructions operate on the same data. When one instruction stores and another instruction reloads the same data, keep their operands aligned and keep the loads/stores of each operand the same size.
The following code examples result in a store-to-load-forwarding (STLF) stall:

Example (avoid):

    MOV  DWORD PTR [FOO], EAX
    MOV  DWORD PTR [FOO+4], EDX
    FLD  QWORD PTR [FOO]

Example (avoid):

    MOV  [FOO], EAX
    MOV  [FOO+4], EDX
    ...
    MOVQ MM0, [FOO]

Example (preferred):

    MOV       [FOO], EAX
    MOV       [FOO+4], EDX
    ...
    MOVD      MM0, [FOO]
    PUNPCKLDQ MM0, [FOO+4]

Example (preferred if stores are close to the load):

    MOVD      MM0, EAX
    MOV       [FOO+4], EDX
    PUNPCKLDQ MM0, [FOO+4]

Avoid large-to-small mismatches, as shown in the following code examples:

Example (avoid):

    FST  QWORD PTR [FOO]
    MOV  EAX, DWORD PTR [FOO]
    MOV  EDX, DWORD PTR [FOO+4]

Example (avoid):

    MOVQ [foo], MM0
    ...
    MOV  EAX, [foo]
    MOV  EDX, [foo+4]

Example (preferred):

    MOVD   [foo], MM0
    PSWAPD MM0, MM0
    MOVD   [foo+4], MM0
    PSWAPD MM0, MM0
    ...
    MOV    EAX, [foo]
    MOV    EDX, [foo+4]

Example (preferred if the contents of MM0 are no longer needed):

    MOVD      [foo], MM0
    PUNPCKHDQ MM0, MM0
    MOVD      [foo+4], MM0
    ...
    MOV       EAX, [foo]
    MOV       EDX, [foo+4]

Example (preferred if the stores and loads are close together, option 1):

    MOVD   EAX, MM0
    PSWAPD MM0, MM0
    MOVD   EDX, MM0
    PSWAPD MM0, MM0

Example (preferred if the stores and loads are close together, option 2):

    MOVD      EAX, MM0
    PUNPCKHDQ MM0, MM0
    MOVD      EDX, MM0

Align Data Where Possible

In general, avoid misaligned data references. All data whose size is a power of two is considered aligned if it is naturally aligned. For example:

  - Word accesses are aligned if they access an address divisible by two.
  - Doubleword accesses are aligned if they access an address divisible by four.
  - Quadword accesses are aligned if they access an address divisible by eight.
  - TBYTE accesses are aligned if they access an address divisible by eight.

A misaligned store or load operation suffers a minimum one-cycle penalty in the AMD Athlon processor load/store pipeline. In addition, using misaligned loads and stores increases the likelihood of encountering a store-to-load forwarding pitfall.
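The natural-alignment rule above can be checked at run time with a small C helper. This is a minimal sketch; the function names are hypothetical and not part of the guide, and the size argument is assumed to be a power of two (for TBYTE data, pass 8, matching the rule above).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: nonzero if addr is a multiple of size.
   size must be a power of two (1, 2, 4, 8, ...). */
static int addr_is_aligned(uintptr_t addr, size_t size)
{
    return (addr & (uintptr_t)(size - 1)) == 0;
}

/* Convenience wrapper for pointers. */
static int ptr_is_aligned(const void *p, size_t size)
{
    return addr_is_aligned((uintptr_t)p, size);
}
```

A check such as ptr_is_aligned(buf, 8) in a debug build can catch misaligned quadword buffers before they reach a hot loop.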
For a more detailed discussion of store-to-load forwarding issues, see "Store-to-Load Forwarding Restrictions".

Optimizing Main Memory Performance for Large Arrays

This section outlines the process of taking advantage of main memory bandwidth by using block prefetch and three-phase processing.

Block prefetch is a technique for reading blocks of data from main memory at very high data rates. Three-phase processing is a programming style that divides data into blocks, which are processed in sequence. Specifically, three-phase processing employs block prefetch to read the input data for each block, operates on each block entirely within the cache, and writes the results to memory with high efficiency.

The prefetch techniques are applicable to applications that access large, localized data objects in system memory, in a sequential or near-sequential manner. The best advantage is realized with large data transfers (many Kbytes or more). The basis of the techniques is to take best advantage of the processor's cache memory.

The code examples in this section explore the most basic and useful memory function: copying data from one area of memory to another. This foundation is used to explore the main optimization ideas, then these ideas are applied to optimizing a bandwidth-limited function that uses the FPU to process linear data arrays.

The performance metrics were measured for code samples running on an AMD Athlon 4 processor with DDR2100 memory. The data sizes were chosen to be several megabytes, i.e., much larger than the cache. Exact performance numbers are different on other platforms, but the basic techniques are widely applicable.

Memory Copy Optimization

Memory Copy: Step 1

The simplest way to copy memory is to use the REP MOVSB instruction, as used in the Baseline example.

Example Code: Baseline (bandwidth: ~570 Mbytes/sec)

    mov  esi, [src]     // source array
    mov  edi, [dst]     // destination array
    mov  ecx, [len]     // number of QWORDS (8 bytes)
    shl  ecx, 3         // convert to byte count
    rep  movsb

Memory Copy: Step 2

Starting from this baseline, several optimizations can be implemented to improve performance.
The next example increases the data size from a byte copy to a doubleword copy using the REP MOVSD instruction.

Example Code: Doubleword Copy (bandwidth: ~700 Mbytes/sec, improvement: 23%)

    mov  esi, [src]     // source array
    mov  edi, [dst]     // destination array
    mov  ecx, [len]     // number of QWORDS (8 bytes)
    shl  ecx, 1         // convert to DWORD count
    rep  movsd

Memory Copy: Step 3

The doubleword copy improved the bandwidth. However, the REP MOVS instructions are often not as efficient as an explicit loop which uses simple "RISC" instructions. The simple explicit instructions can be executed in parallel, and sometimes even out-of-order, within the CPU. The explicit loop example performs the copy using MOV instructions.

Example Code: Explicit Loop (bandwidth: ~720 Mbytes/sec, improvement: ~3%)

    mov  esi, [src]     // source array
    mov  edi, [dst]     // destination array
    mov  ecx, [len]     // number of QWORDS (8 bytes)
    shl  ecx, 1         // convert to DWORD count
    copyloop:
    mov  eax, dword ptr [esi]
    mov  dword ptr [edi], eax
    add  esi, 4
    add  edi, 4
    dec  ecx
    jnz  copyloop

Memory Copy: Step 4

The explicit loop is a bit faster than REP MOVSD. Now that there is an explicit loop, further optimization can be implemented by unrolling the loop. This reduces the overhead of incrementing the pointers and counter, and reduces branching. The unrolled loop example uses the [Register + Offset] form of addressing, which runs just as fast as the simple [Register] address, and uses an unroll factor of four.

Example Code: Unrolled Loop, Unroll Factor Four (bandwidth: ~700 Mbytes/sec, improvement: -3%)

    mov  esi, [src]     // source array
    mov  edi, [dst]     // destination array
    mov  ecx, [len]     // number of QWORDS (8 bytes)
    shr  ecx, 1         // convert to 16-byte size count
                        // (assumes len / 2 is an integer)
    copyloop:
    mov  eax, dword ptr [esi]
    mov  dword ptr [edi], eax
    mov  ebx, dword ptr [esi+4]
    mov  dword ptr [edi+4], ebx
    mov  eax, dword ptr [esi+8]
    mov  dword ptr [edi+8], eax
    mov  ebx, dword ptr [esi+12]
    mov  dword ptr [edi+12], ebx
    add  esi, 16
    add  edi, 16
    dec  ecx
    jnz  copyloop

Memory Copy: Step 5

The performance drops when the loop is unrolled, but a new optimization can now be implemented: grouping read operations together and write operations together. In general, it is a good idea to read data in blocks and write in blocks, rather than alternating frequently.
Example Code: Read and Write Grouping (bandwidth: ~750 Mbytes/sec, improvement: ~7%)

    mov  esi, [src]     // source array
    mov  edi, [dst]     // destination array
    mov  ecx, [len]     // number of QWORDS (8 bytes)
    shr  ecx, 1         // convert to 16-byte size count
    copyloop:
    mov  eax, dword ptr [esi]
    mov  ebx, dword ptr [esi+4]
    mov  dword ptr [edi], eax
    mov  dword ptr [edi+4], ebx
    mov  eax, dword ptr [esi+8]
    mov  ebx, dword ptr [esi+12]
    mov  dword ptr [edi+8], eax
    mov  dword ptr [edi+12], ebx
    add  esi, 16
    add  edi, 16
    dec  ecx
    jnz  copyloop

Memory Copy: Step 6

The next optimization uses MMX extensions, available on all modern x86 processors. The MMX registers permit 64 bytes of sequential reading, followed by 64 bytes of sequential writing. The MMX register example also introduces an optimization of the loop counter, which starts negative and counts up to zero. This allows the counter to serve double duty as a pointer, and eliminates the need for a CMP instruction.

Example Code: Grouping Using MMX Registers (bandwidth: ~800 Mbytes/sec, improvement: ~7%)

    mov  esi, [src]           // source array
    mov  edi, [dst]           // destination array
    mov  ecx, [len]           // number of QWORDS (8 bytes)
    lea  esi, [esi+ecx*8]     // end of source
    lea  edi, [edi+ecx*8]     // end of destination
    neg  ecx                  // use a negative offset
    emms
    copyloop:
    movq mm0, qword ptr [esi+ecx*8]
    movq mm1, qword ptr [esi+ecx*8+8]
    movq mm2, qword ptr [esi+ecx*8+16]
    movq mm3, qword ptr [esi+ecx*8+24]
    movq mm4, qword ptr [esi+ecx*8+32]
    movq mm5, qword ptr [esi+ecx*8+40]
    movq mm6, qword ptr [esi+ecx*8+48]
    movq mm7, qword ptr [esi+ecx*8+56]
    movq qword ptr [edi+ecx*8], mm0
    movq qword ptr [edi+ecx*8+8], mm1
    movq qword ptr [edi+ecx*8+16], mm2
    movq qword ptr [edi+ecx*8+24], mm3
    movq qword ptr [edi+ecx*8+32], mm4
    movq qword ptr [edi+ecx*8+40], mm5
    movq qword ptr [edi+ecx*8+48], mm6
    movq qword ptr [edi+ecx*8+56], mm7
    add  ecx, 8
    jnz  copyloop
    emms

Memory Copy: Step 7

MOVNTQ can be used now that MMX registers are being used. This is a streaming store instruction for writing data to memory. It bypasses the on-chip cache and goes directly into a write combining buffer, effectively increasing the total write bandwidth. The MOVNTQ instruction executes much faster than an ordinary MOV to memory. An SFENCE is required to flush the write buffer.
Example Code: MOVNTQ and SFENCE Instructions (bandwidth: ~1120 Mbytes/sec, improvement: 32%)

    mov  esi, [src]           // source array
    mov  edi, [dst]           // destination array
    mov  ecx, [len]           // number of QWORDS (8 bytes)
    lea  esi, [esi+ecx*8]
    lea  edi, [edi+ecx*8]
    neg  ecx
    emms
    copyloop:
    movq mm0, qword ptr [esi+ecx*8]
    movq mm1, qword ptr [esi+ecx*8+8]
    movq mm2, qword ptr [esi+ecx*8+16]
    movq mm3, qword ptr [esi+ecx*8+24]
    movq mm4, qword ptr [esi+ecx*8+32]
    movq mm5, qword ptr [esi+ecx*8+40]
    movq mm6, qword ptr [esi+ecx*8+48]
    movq mm7, qword ptr [esi+ecx*8+56]
    movntq qword ptr [edi+ecx*8], mm0
    movntq qword ptr [edi+ecx*8+8], mm1
    movntq qword ptr [edi+ecx*8+16], mm2
    movntq qword ptr [edi+ecx*8+24], mm3
    movntq qword ptr [edi+ecx*8+32], mm4
    movntq qword ptr [edi+ecx*8+40], mm5
    movntq qword ptr [edi+ecx*8+48], mm6
    movntq qword ptr [edi+ecx*8+56], mm7
    add  ecx, 8
    jnz  copyloop
    sfence
    emms

Memory Copy: Step 8

The MOVNTQ instruction in the previous example improves the speed of writing the data. The Prefetch Instruction example uses a prefetch instruction to improve the performance of reading the data. Prefetching cannot increase the total read bandwidth, but it can get the processor started on loading the data into the cache before the data is needed.
Example Code: Prefetch Instruction (prefetchnta) (bandwidth: ~1250 Mbytes/sec, improvement: 12%)

    mov  esi, [src]           // source array
    mov  edi, [dst]           // destination array
    mov  ecx, [len]           // number of QWORDS (8 bytes)
    lea  esi, [esi+ecx*8]
    lea  edi, [edi+ecx*8]
    neg  ecx
    emms
    copyloop:
    prefetchnta [esi+ecx*8 + 512]
    movq mm0, qword ptr [esi+ecx*8]
    movq mm1, qword ptr [esi+ecx*8+8]
    movq mm2, qword ptr [esi+ecx*8+16]
    movq mm3, qword ptr [esi+ecx*8+24]
    movq mm4, qword ptr [esi+ecx*8+32]
    movq mm5, qword ptr [esi+ecx*8+40]
    movq mm6, qword ptr [esi+ecx*8+48]
    movq mm7, qword ptr [esi+ecx*8+56]
    movntq qword ptr [edi+ecx*8], mm0
    movntq qword ptr [edi+ecx*8+8], mm1
    movntq qword ptr [edi+ecx*8+16], mm2
    movntq qword ptr [edi+ecx*8+24], mm3
    movntq qword ptr [edi+ecx*8+32], mm4
    movntq qword ptr [edi+ecx*8+40], mm5
    movntq qword ptr [edi+ecx*8+48], mm6
    movntq qword ptr [edi+ecx*8+56], mm7
    add  ecx, 8
    jnz  copyloop
    sfence
    emms

Memory Copy: Step 9 (final)

In the final optimization to the memory copy code, the technique called block prefetch is applied. Much as read grouping gave a boost to performance, block prefetch is an extreme extension of this idea. The strategy is to read a large stream of sequential data from main memory into the cache, without any interruptions.

In block prefetch, the MOV instruction is used, rather than the software prefetch instruction. Unlike a prefetch instruction, the MOV instruction cannot be ignored by the CPU. The result is that a series of MOVs forces the memory system to read sequential, back-to-back address blocks, which maximizes memory bandwidth.

Because the processor always loads an entire cache line (e.g., 64 bytes) whenever it accesses main memory, the block prefetch MOV instructions only need to read one address per cache line. Reading just one address per cache line is a subtle point that matters for read performance.
Example: Block Prefetching (bandwidth: ~1630 Mbytes/sec, improvement: 30%)

    #define CACHEBLOCK 400h        // number of QWORDs in a block (8K bytes)

    mov  esi, [src]                // source array
    mov  edi, [dst]                // destination array
    mov  ecx, [len]                // total number of QWORDS (8 bytes)
                                   // (assumes len / CACHEBLOCK is an integer)
    lea  esi, [esi+ecx*8]
    lea  edi, [edi+ecx*8]
    neg  ecx
    emms

    mainloop:
    mov  eax, CACHEBLOCK / 16      // note: prefetch loop is unrolled 2X
    prefetchloop:
    mov  ebx, [esi+ecx*8]          // Read one address in this line,
    mov  ebx, [esi+ecx*8+64]       // and one address in the next.
    add  ecx, 16                   // add 16 QWORDS, = 2 64-byte cache lines
    dec  eax
    jnz  prefetchloop

    sub  ecx, CACHEBLOCK
    mov  eax, CACHEBLOCK / 8
    writeloop:
    movq mm0, qword ptr [esi+ecx*8]
    . . .
    movq mm7, qword ptr [esi+ecx*8+56]
    movntq qword ptr [edi+ecx*8], mm0
    . . .
    movntq qword ptr [edi+ecx*8+56], mm7
    add  ecx, 8
    dec  eax
    jnz  writeloop

    or   ecx, ecx
    jnz  mainloop
    sfence
    emms

Array Addition

The following Array Addition example applies the block prefetch technique and other concepts from the memory copy optimization example, and optimizes a memory-intensive loop that processes large arrays.

Baseline Code

This loop adds two arrays of floating-point numbers together, using the x87 FPU, and writes the results to a third array. This example also shows how to handle the issue of combining MMX code (required for using the MOVNTQ instruction) with FPU code (needed for adding the numbers).

The Array Add Baseline example is a slightly optimized, first pass, baseline version of the code.
Example: Array Add Baseline (bandwidth: ~840 MB/sec, baseline performance)

    mov  esi, [src1]          // source array one
    mov  ebx, [src2]          // source array two
    mov  edi, [dst]           // destination array
    mov  ecx, [len]           // number of Floats (8 bytes)
                              // (assumes len / 8 is an integer)
    lea  esi, [esi+ecx*8]
    lea  ebx, [ebx+ecx*8]
    lea  edi, [edi+ecx*8]
    neg  ecx
    addloop:
    fld  qword ptr [esi+ecx*8+56]
    fadd qword ptr [ebx+ecx*8+56]
    fld  qword ptr [esi+ecx*8+48]
    fadd qword ptr [ebx+ecx*8+48]
    fld  qword ptr [esi+ecx*8+40]
    fadd qword ptr [ebx+ecx*8+40]
    fld  qword ptr [esi+ecx*8+32]
    fadd qword ptr [ebx+ecx*8+32]
    fld  qword ptr [esi+ecx*8+24]
    fadd qword ptr [ebx+ecx*8+24]
    fld  qword ptr [esi+ecx*8+16]
    fadd qword ptr [ebx+ecx*8+16]
    fld  qword ptr [esi+ecx*8+8]
    fadd qword ptr [ebx+ecx*8+8]
    fld  qword ptr [esi+ecx*8+0]
    fadd qword ptr [ebx+ecx*8+0]
    fstp qword ptr [edi+ecx*8+0]
    fstp qword ptr [edi+ecx*8+8]
    fstp qword ptr [edi+ecx*8+16]
    fstp qword ptr [edi+ecx*8+24]
    fstp qword ptr [edi+ecx*8+32]
    fstp qword ptr [edi+ecx*8+40]
    fstp qword ptr [edi+ecx*8+48]
    fstp qword ptr [edi+ecx*8+56]
    add  ecx, 8
    jnz  addloop

Optimized Code

After all the relevant optimization techniques have been applied, the code appears as in the Array Add with Optimizations example. The data is still processed in blocks, as in the memory copy example. In this case, however, the code must process the data using the FPU, not simply copy it.

Because of this need for FPU mode operation, the processing is divided into three distinct phases: block prefetch, processing, and memory write. The block prefetch phase reads the input data into the cache at maximum bandwidth. The processing phase operates on the in-cache input data and writes the results to an in-cache temporary buffer. The memory write phase uses MOVNTQ to quickly transfer the temporary buffer to the destination array in main memory.

These three phases are the components of three-phase processing. This general technique provides a significant performance boost, as seen in the Array Add with Optimizations example.
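Before the assembly listing, the control flow of three-phase processing can be sketched in portable C. This is only an illustration under stated assumptions: the function names and the block size are hypothetical, 64-byte cache lines are assumed, and a plain memcpy stands in for the MOVNTQ streaming stores, which standard C cannot express.

```c
#include <stddef.h>
#include <string.h>

#define BLOCK_DOUBLES 1024   /* doubles per block (8 Kbytes); an assumption */
#define LINE_DOUBLES  8      /* assumes 64-byte cache lines */

/* Sink variable so the compiler keeps the otherwise unused reads. */
static volatile double prefetch_sink;

/* Phase 1: touch one element per cache line to pull a block into cache. */
static void block_prefetch(const double *p, size_t count)
{
    double acc = 0.0;
    for (size_t i = 0; i < count; i += LINE_DOUBLES)
        acc += p[i];
    prefetch_sink = acc;
}

/* dst[i] = a[i] + b[i]; count must be a multiple of BLOCK_DOUBLES. */
static void add_arrays_three_phase(double *dst, const double *a,
                                   const double *b, size_t count)
{
    static double buffer[BLOCK_DOUBLES];   /* in-cache temporary buffer */

    for (size_t base = 0; base < count; base += BLOCK_DOUBLES) {
        /* Phase 1: block prefetch of both input streams, one loop each. */
        block_prefetch(a + base, BLOCK_DOUBLES);
        block_prefetch(b + base, BLOCK_DOUBLES);

        /* Phase 2: process in-cache data into the in-cache buffer. */
        for (size_t i = 0; i < BLOCK_DOUBLES; i++)
            buffer[i] = a[base + i] + b[base + i];

        /* Phase 3: write the buffer out in one sequential pass
           (the guide uses MOVNTQ here; memcpy is a stand-in). */
        memcpy(dst + base, buffer, sizeof buffer);
    }
}
```

Whether this C version gains anything depends on the compiler and platform; the sketch is meant to show the structure of the three phases, not to reproduce the measured bandwidths.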
Example: Array Add with Optimizations (bandwidth: ~1370 MB/sec, improvement: about 63% over the ~840 MB/sec baseline)

    #define CACHEBLOCK 400h          // Number of QWORDs in a block (8K bytes)
    int* storedest
    char buffer[CACHEBLOCK * 8]      // in-cache temporary storage

        mov    esi, [src1]           // First source array
        mov    ebx, [src2]           // Second source array
        mov    edi, [dst]            // Destination array
        mov    ecx, [len]            // Number of floats (8 bytes)
                                     // (assumes len / CACHEBLOCK is an integer)
        lea    esi, [esi+ecx*8]
        lea    ebx, [ebx+ecx*8]
        lea    edi, [edi+ecx*8]
        mov    [storedest], edi      // save the real dest for later
        mov    edi, [buffer]         // temporary in-cache buffer,
        lea    edi, [edi+ecx*8]      // stays in cache from heavy use
        neg    ecx

    mainloop:
        mov    eax, CACHEBLOCK / 16
    prefetchloop1:                   // block prefetch array 1
        mov    edx, [esi+ecx*8]
        mov    edx, [esi+ecx*8+64]   // (this loop is unrolled 2X)
        add    ecx, 16
        dec    eax
        jnz    prefetchloop1

        sub    ecx, CACHEBLOCK
        mov    eax, CACHEBLOCK / 16
    prefetchloop2:                   // block prefetch array 2
        mov    edx, [ebx+ecx*8]
        mov    edx, [ebx+ecx*8+64]   // (this loop is unrolled 2X)
        add    ecx, 16
        dec    eax
        jnz    prefetchloop2

        sub    ecx, CACHEBLOCK
        mov    eax, CACHEBLOCK / 8
    processloop:                     // this loop reads and writes all in cache!
        fld    qword ptr [esi+ecx*8+56]
        fadd   qword ptr [ebx+ecx*8+56]
                                     // (offsets +48 down to +8 likewise)
        fld    qword ptr [esi+ecx*8+0]
        fadd   qword ptr [ebx+ecx*8+0]
        fstp   qword ptr [edi+ecx*8+0]
                                     // (offsets +8 up to +48 likewise)
        fstp   qword ptr [edi+ecx*8+56]
        add    ecx, 8
        dec    eax
        jnz    processloop

        emms
        sub    ecx, CACHEBLOCK
        mov    edx, [storedest]
        mov    eax, CACHEBLOCK / 8
    writeloop:                       // write buffer to main memory
        movq   mm0, qword ptr [edi+ecx*8]
                                     // (mm1 through mm6 load offsets +8 through +48)
        movq   mm7, qword ptr [edi+ecx*8+56]
        movntq qword ptr [edx+ecx*8], mm0
                                     // (mm1 through mm6 store offsets +8 through +48)
        movntq qword ptr [edx+ecx*8+56], mm7
        add    ecx, 8
        dec    eax
        jnz    writeloop

        or     ecx, ecx
        jge    exit
        sub    edi, CACHEBLOCK * 8   // reset edi back to start of buffer
        sfence
        emms
        jmp    mainloop
    exit:

Summary

Block prefetch and three phase processing are general techniques for improving the performance of memory-intensive applications. The key points are:

- To get the maximum memory read bandwidth, read data into the cache in large blocks, using block prefetch. A block prefetch loop should:
  - Be unrolled by at least 2X
  - Use the MOV instruction (not the PREFETCH instruction)
  - Read only one address per cache line
  - Read data into a scratch register (the examples above use EBX and EDX)
  - Make sure all data is aligned
  - Read only one address stream per loop
  - Use separate loops to prefetch several streams
- To get the maximum memory write bandwidth, write data from the cache to main memory in large blocks, using streaming store instructions. The write loop should:
  - Use the MMX registers to pass the data
  - Read from cache
  - Use MOVNTQ (streaming store) for writing to memory
  - Make sure the data is aligned
  - Write every address, in ascending order, without gaps
  - End with an SFENCE to flush the write buffer
- Whenever possible, code that actually "does the real work" should read data from cache, and write output to an in-cache buffer. To enable this cache access, follow the first two guidelines above.

Align branch targets to 16-byte boundaries in these critical sections of code. This optimization is described in "Align Branch Targets in Program Hot Spots".

Use the PREFETCH 3DNow! Instruction

Prefetching versus Preloading

For code that can take advantage of prefetching, and in situations where small data sizes or other constraints limit the applicability of block prefetch optimizations, use the 3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the Athlon processor. This takes advantage of the high bus bandwidth of the Athlon processor to hide long latencies when fetching data from system memory. The prefetch instructions are essentially integer instructions and can be used anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.).

In code that uses the block prefetch technique described in "Optimizing Main Memory Performance for Large Arrays", a standard load instruction is the best way to prefetch data. In other situations, load instructions may be able to mimic the functionality of prefetch instructions, but they do not offer the same performance advantage. Prefetch instructions only update the cache line in the L1/L2 cache and do not update an architectural register. This uses one less register compared to a load instruction. Prefetch instructions also do not cause normal instruction retirement to stall. Another benefit of prefetching versus preloading is that prefetching instructions can retire even if the load data has not arrived yet.
A regular load used for preloading will stall the machine if it gets to the bottom of the fixed-issue reorder buffer (part of the Instruction Control Unit) and the load data has not arrived yet. The load is "blocking" whereas the prefetch is "non-blocking."

Unit-Stride Access

Large data sets typically require unit-stride access to ensure that all data pulled in by PREFETCH or PREFETCHW is actually used. If necessary, reorganize algorithms or data structures to allow unit-stride access. See "Definitions" for a definition of unit-stride access.

Hardware Prefetch

Some Athlon processors implement a hardware prefetch mechanism. This feature was implemented beginning with Athlon processor Model 6. The prefetched data is loaded into the cache. The hardware prefetcher works most efficiently when data is accessed on a cache-line-by-cache-line basis (that is, without skipping cache lines). Cache lines on current Athlon processors are 64 bytes, but cache line size is implementation dependent.

In some cases, using the PREFETCH or PREFETCHW instruction on processors with hardware prefetch may incur a reduction in performance. In these cases, the PREFETCH instruction may need to be removed. The engineer needs to weigh the measured gains obtained on processors without hardware prefetch by using the PREFETCH instruction, versus any loss in performance on processors with the hardware prefetcher.

PREFETCH/W versus PREFETCHNTA/T0/T1/T2

The PREFETCHNTA/T0/T1/T2 instructions belong to later instruction-set extensions and are processor implementation dependent. If the developer needs to maintain compatibility with the millions of AMD-K6®-2 and AMD-K6-III processors already sold, use the 3DNow! PREFETCH/W instructions instead of the various prefetch instructions that are part of those newer extensions.

Code that intends to modify the cache line brought in through prefetching should use the PREFETCHW instruction. While PREFETCHW works the same as PREFETCH on the AMD-K6-2 and AMD-K6-III processors, PREFETCHW gives a hint to the Athlon processor of an intent to modify the cache line.
The Athlon processor marks the cache line being brought in by PREFETCHW as modified.
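The distinction between a read prefetch and a prefetch with intent to modify can be illustrated from C with the GCC/Clang builtin __builtin_prefetch(addr, rw, locality), where rw = 0 requests a read prefetch and rw = 1 a write prefetch. The sketch below rests on several assumptions: the function names and the PREFETCH_AHEAD distance are hypothetical, the 64-byte line size is taken from the text above, and whether the compiler actually emits PREFETCH or PREFETCHW depends on the target options in use. It is not code from the guide.

```c
#include <assert.h>
#include <stddef.h>

#define LINE_DOUBLES   8    /* assumed 64-byte cache line / 8-byte double   */
#define PREFETCH_AHEAD 64   /* hypothetical distance, in elements; tune it  */

/* Read-only pass: hint with rw = 0, since the lines are not modified. */
double sum_read_only(const double *data, size_t len)
{
    double total = 0.0;

    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        /* One hint per cache line, unit-stride, a fixed distance ahead. */
        if (i % LINE_DOUBLES == 0 && i + PREFETCH_AHEAD < len)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 3);
        total += data[i];
    }
    return total;
}

/* Read-modify-write pass: hint with rw = 1 to signal intent to modify. */
void scale_in_place(double *data, size_t len, double factor)
{
    assert(data != NULL || len == 0);
    for (size_t i = 0; i < len; i++) {
        if (i % LINE_DOUBLES == 0 && i + PREFETCH_AHEAD < len)
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 1, 3);
        data[i] *= factor;
    }
}
```

Because a prefetch is only a hint, removing the two __builtin_prefetch calls leaves the results unchanged, which makes it straightforward to measure the effect on processors with and without a hardware prefetcher.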