Athlon Processor
Code Optimization Guide
© 2000 Advanced Micro Devices, Inc. All rights reserved. The contents of this document are provided in connection with Advanced Micro Devices, Inc. ("AMD") products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD's Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right. AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.

Trademarks: AMD, the AMD logo, AMD Athlon, 3DNow!, and combinations thereof, AMD-751, K86, and Super7 are trademarks, and AMD-K6 is a registered trademark of Advanced Micro Devices, Inc. Microsoft, Windows, and Windows NT are registered trademarks of Microsoft Corporation. MMX is a trademark and Pentium is a registered trademark of Intel Corporation. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
22007I-0-September 2000
AthlonProcessor Code Optimization
Contents

Revision History

Chapter 1 Introduction
About this Document
AMD Athlon Processor Family
AMD Athlon Processor Microarchitecture Summary

Chapter 2 Top Optimizations
Optimization Star
Group I Optimizations: Essential Optimizations
Memory Size and Alignment Issues
Use the 3DNow! PREFETCH and PREFETCHW Instructions
Select DirectPath Over VectorPath Instructions
Group II Optimizations: Secondary Optimizations
Load-Execute Instruction Usage
Take Advantage of Write Combining
Use 3DNow! Instructions
Avoid Branches Dependent on Random Data
Avoid Placing Code and Data in the Same 64-Byte Cache Line

Chapter 3 C Source Level Optimizations
Ensure Floating-Point Variables and Expressions are of Type Float
Use 32-Bit Data Types for Integer Code
Consider the Sign of Integer Operands
Use Array Style Instead of Pointer Style Code
Completely Unroll Small Loops
Avoid Unnecessary Store-to-Load Dependencies
Always Match the Size of Stores and Loads
Consider Expression Order in Compound Branch Conditions
Switch Statement Usage
Optimize Switch Statements
Use Prototypes for All Functions
Use Const Type Qualifier
Generic Loop Hoisting
Generalization for Multiple Constant Control Code
Declare Local Functions as Static
Dynamic Memory Allocation Consideration
Introduce Explicit Parallelism into Code
Explicitly Extract Common Subexpressions
C Language Structure Component Considerations
Sort Local Variables According to Base Type Size
Accelerating Floating-Point Divides and Square Roots
Fast Floating-Point-to-Integer Conversion
Speeding Up Branches Based on Comparisons Between Floats
Avoid Unnecessary Integer Division
Copy Frequently De-Referenced Pointer Arguments to Local Variables

Chapter 4 Instruction Decoding Optimizations
Overview
Select DirectPath Over VectorPath Instructions
Load-Execute Instruction Usage
Use Load-Execute Integer Instructions
Use Load-Execute Floating-Point Instructions with Floating-Point Operands
Avoid Load-Execute Floating-Point Instructions with Integer Operands
Use Read-Modify-Write Instructions Where Appropriate
Align Branch Targets in Program Hot Spots
Use 32-Bit LEA Rather than 16-Bit LEA Instruction
Use Short Instruction Encodings
Avoid Partial Register Reads and Writes
Use LEAVE Instruction for Function Epilogue Code
Replace Certain SHLD Instructions with Alternative Code
Use 8-Bit Sign-Extended Immediates
Use 8-Bit Sign-Extended Displacements
Code Padding Using Neutral Code Fillers
Recommendations for the AMD-K6® Family and AMD Athlon Processor Blended Code

Chapter 5 Cache and Memory Optimizations
Memory Size and Alignment Issues
Avoid Memory Size Mismatches
Align Data Where Possible
Use the 3DNow! PREFETCH and PREFETCHW Instructions
Determining Prefetch Distance
Take Advantage of Write Combining
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Store-to-Load Forwarding Restrictions
Store-to-Load Forwarding Pitfalls: True Dependencies
Summary of Store-to-Load Forwarding Pitfalls to Avoid
Stack Alignment Considerations
Align TBYTE Variables on Quadword Aligned Addresses
C Language Structure Component Considerations
Sort Variables According to Base Type Size

Chapter 6 Branch Optimizations
Avoid Branches Dependent on Random Data
AMD Athlon Processor Specific Code
Blended AMD-K6 and AMD Athlon Processor Code
Always Pair CALL and RETURN
Replace Branches with Computation in 3DNow! Code
Muxing Constructs
Sample Code Translated into 3DNow! Code
Avoid the Loop Instruction
Avoid Far Control Transfer Instructions
Avoid Recursive Functions

Chapter 7 Scheduling Optimizations
Schedule Instructions According to their Latency
Unrolling Loops
Complete Loop Unrolling
Partial Loop Unrolling
Use Function Inlining
Overview
Always Inline Functions if Called from One Site
Always Inline Functions with Fewer than 25 Machine Instructions
Avoid Address Generation Interlocks
Use MOVZX and MOVSX
Minimize Pointer Arithmetic in Loops
Push Memory Data Carefully

Chapter 8 Integer Optimizations
Replace Divides with Multiplies
Multiplication by Reciprocal (Division) Utility
Unsigned Division by Multiplication of Constant
Signed Division by Multiplication of Constant
Consider Alternative Code When Multiplying by a Constant
Use MMX Instructions for Integer-Only Work
Repeated String Instruction Usage
Latency of Repeated String Instructions
Guidelines for Repeated String Instructions
Use XOR Instruction to Clear Integer Registers
Efficient 64-Bit Integer Arithmetic
Efficient Implementation of Population Count Function
Efficient Binary-to-ASCII Decimal Conversion
Derivation of Multiplier Used for Integer Division by Constants
Derivation of Algorithm, Multiplier, and Shift Factor for Unsigned Integer Division
Derivation of Algorithm, Multiplier, and Shift Factor for Signed Integer Division

Chapter 9 Floating-Point Optimizations
Ensure All FPU Data is Aligned
Use Multiplies Rather than Divides
Use FFREEP Macro to Pop One Register from the FPU Stack
Floating-Point Compare Instructions
Use the FXCH Instruction Rather than FST/FLD Pairs
Avoid Using Extended-Precision Data
Minimize Floating-Point-to-Integer Conversions
Check Argument Range of Trigonometric Instructions Efficiently
Take Advantage of the FSINCOS Instruction

Chapter 10 3DNow! and MMX Optimizations
Use 3DNow! Instructions
Use FEMMS Instruction
Use 3DNow! Instructions for Fast Division
Optimized 14-Bit Precision Divide
Optimized Full 24-Bit Precision Divide
Pipelined Pair of 24-Bit Precision Divides
Newton-Raphson Reciprocal
Use 3DNow! Instructions for Fast Square Root and Reciprocal Square Root
Optimized 15-Bit Precision Square Root
Optimized 24-Bit Precision Square Root
Newton-Raphson Reciprocal Square Root
Use MMX PMADDWD Instruction to Perform Two 32-Bit Multiplies in Parallel
Use PMULHUW to Compute Upper Half of Unsigned Products
3DNow! and MMX Intra-Operand Swapping
Fast Conversion of Signed Words to Floating-Point
Width of Memory Access Differs Between PUNPCKL* and PUNPCKH*
Use MMX PXOR to Negate 3DNow! Data
Use MMX PCMP Instead of 3DNow! PFCMP
Use MMX Instructions for Block Copies and Block Fills
Efficient 64-Bit Population Count Using MMX Instructions
Use MMX PXOR to Clear All Bits in an MMX Register
Use MMX PCMPEQD to Set All Bits in an MMX Register
Use MMX PAND to Find Floating-Point Absolute Value in 3DNow! Code
Integer Absolute Value Computation Using MMX Instructions
Optimized Matrix Multiplication
Efficient 3D-Clipping Code Computation Using 3DNow! Instructions
Efficiently Determining Similarity Between RGBA Pixels
Use 3DNow! PAVGUSB for MPEG-2 Motion Compensation
Efficient Implementation of floor() Using 3DNow! Instructions
Stream of Packed Unsigned Bytes
Complex Number Arithmetic

Chapter 11 General x86 Optimization Guidelines
Short Forms
Dependencies
Register Operands
Stack Allocation

Appendix A AMD Athlon Processor Microarchitecture
Introduction
AMD Athlon Processor Microarchitecture
Superscalar Processor
Instruction Cache
Predecode
Branch Prediction
Early Decoding
Instruction Control Unit
Data Cache
Integer Scheduler
Integer Execution Unit
Floating-Point Scheduler
Floating-Point Execution Unit
Load-Store Unit (LSU)
L2 Cache Controller
Write Combining
AMD Athlon System Bus

Appendix B Pipeline and Execution Unit Resources Overview
Fetch and Decode Pipeline Stages
Integer Pipeline Stages
Floating-Point Pipeline Stages
Execution Unit Resources
Terminology
Integer Pipeline Operations
Floating-Point Pipeline Operations
Load/Store Pipeline Operations
Code Sample Analysis

Appendix C Implementation of Write Combining
Introduction
Write-Combining Definitions and Abbreviations
What is Write Combining?
Programming Details
Write-Combining Operations
Sending Write-Buffer Data to the System

Appendix D Performance-Monitoring Counters
Overview
Performance Counter Usage
PerfEvtSel[3:0] MSRs (MSR Addresses C001_0000h-C001_0003h)
PerfCtr[3:0] MSRs (MSR Addresses C001_0004h-C001_0007h)
Starting and Stopping the Performance-Monitoring Counters
Event and Time-Stamp Monitoring Software
Monitoring Counter Overflow

Appendix E Programming the MTRR and PAT
Introduction
Memory Type Range Register (MTRR) Mechanism
Page Attribute Table (PAT)

Appendix F Instruction Dispatch and Execution Resources/Timing

Appendix G DirectPath versus VectorPath Instructions
Select DirectPath Over VectorPath Instructions
DirectPath Instructions
VectorPath Instructions

Index

List of Figures
AMD Athlon Processor Block Diagram
Integer Execution Pipeline
Floating-Point Unit Block Diagram
Load/Store Unit
Fetch/Scan/Align/Decode Pipeline Hardware
Fetch/Scan/Align/Decode Pipeline Stages
Integer Execution Pipeline
Integer Pipeline Stages
Floating-Point Unit Block Diagram
Floating-Point Pipeline Stages
PerfEvtSel[3:0] Registers
MTRR Mapping of Physical Memory
MTRR Capability Register Format
MTRR Default Type Register Format
Page Attribute Table (MSR 277h)
MTRRphysBasen Register Format
MTRRphysMaskn Register Format

List of Tables
Latency of Repeated String Instructions
Integer Pipeline Operation Types
Integer Decode Types
Floating-Point Pipeline Operation Types
Floating-Point Decode Types
Load/Store Unit Stages
Sample Integer Register Operations
Sample Integer Register and Memory Load Operations
Write Combining Completion Events
AMD Athlon System Bus Commands Generation Rules
Performance-Monitoring Counters
Memory Type Encodings
Standard MTRR Types and Properties
PATi 3-Bit Encodings
Effective Memory Type Based on PAT and MTRRs
Final Output Memory Types
MTRR Fixed Range Register Format
MTRR-Related Model-Specific Register (MSR)
Integer Instructions
MMX Instructions
MMX Extensions
Floating-Point Instructions
3DNow! Instructions
3DNow! Extensions
DirectPath Integer Instructions
DirectPath MMX Instructions
DirectPath MMX Extensions
DirectPath Floating-Point Instructions
VectorPath Integer Instructions
VectorPath MMX Instructions
VectorPath MMX Extensions
VectorPath Floating-Point Instructions

Revision History

Nov. 1999
Added "About this Document."
Further clarification of "Consider the Sign of Integer Operands."
Added the optimization "Use Array Style Instead of Pointer Style Code."
Added the optimization "Accelerating Floating-Point Divides and Square Roots."
Clarified the examples in "Copy Frequently De-Referenced Pointer Arguments to Local Variables."
Further clarification of "Select DirectPath Over VectorPath Instructions."
Further clarification of "Align Branch Targets in Program Hot Spots."
Further clarification of instruction fillers in "Code Padding Using Neutral Code Fillers."
Further clarification of "Use the 3DNow! PREFETCH and PREFETCHW Instructions."
Modified the examples in "Unsigned Division by Multiplication of Constant."
Added the optimization "Efficient Implementation of Population Count Function."
Further clarification of "Use FFREEP Macro to Pop One Register from the FPU Stack."
Further clarification of "Minimize Floating-Point-to-Integer Conversions."
Added the optimization "Check Argument Range of Trigonometric Instructions Efficiently."
Added the optimization "Take Advantage of the FSINCOS Instruction."
Further clarification of "Use 3DNow! Instructions for Fast Division."
Further clarification of "Use FEMMS Instruction."
Further clarification of "Use 3DNow! Instructions for Fast Square Root and Reciprocal Square Root."
Clarified "3DNow! and MMX Intra-Operand Swapping."
Corrected the PCMPGT information in "Use MMX PCMP Instead of 3DNow! PFCMP."
Added the optimization "Use MMX Instructions for Block Copies and Block Fills."
Modified the rule for "Use MMX PXOR to Clear All Bits in an MMX Register."
Modified the rule for "Use MMX PCMPEQD to Set All Bits in an MMX Register."
Added the optimization "Optimized Matrix Multiplication."
Added the optimization "Efficient 3D-Clipping Code Computation Using 3DNow! Instructions."
Added the optimization "Complex Number Arithmetic."
Added the appendix "Programming the MTRR and PAT."
Rearranged the appendices.
Added the Index.

April 2000
Added more details to the optimizations in the chapter "Top Optimizations."
Further clarification of "Use Array Style Instead of Pointer Style Code."
Added the optimization "Always Match the Size of Stores and Loads."
Added the optimization "Fast Floating-Point-to-Integer Conversion."
Added the optimization "Speeding Up Branches Based on Comparisons Between Floats."
Added the optimization "Use Read-Modify-Write Instructions Where Appropriate."
Further clarification of "Align Branch Targets in Program Hot Spots."
Added the optimization "Use 32-Bit LEA Rather than 16-Bit LEA Instruction."
Added the optimization "Use LEAVE Instruction for Function Epilogue Code."
Added more examples to "Memory Size and Alignment Issues."
Further clarification of "Use the 3DNow! PREFETCH and PREFETCHW Instructions."
Further clarification of "Store-to-Load Forwarding Restrictions."
Changed the epilogue code in the example in "Stack Alignment Considerations."
Added an example to "Avoid Branches Dependent on Random Data."
Fixed the comments in the example in "Unsigned Division by Multiplication of Constant."
Revised the code in both "Algorithm: Divisors < 2^31" sections.
Added more examples to "Efficient 64-Bit Integer Arithmetic."
Fixed a typo in the integer example and added an MMX version in "Efficient Implementation of Population Count Function."
Added the optimization "Efficient Binary-to-ASCII Decimal Conversion."
Updated the codes in "Derivation of Multiplier Used for Integer Division by Constants" and in the AMD Software Development Kit (SDK).
Further clarification of "Use FFREEP Macro to Pop One Register from the FPU Stack."
Corrected the example in "Minimize Floating-Point-to-Integer Conversions."
Added the optimization "Use PMULHUW to Compute Upper Half of Unsigned Products."
Added information that "Width of Memory Access Differs Between PUNPCKL* and PUNPCKH*."
Rewrote the section "Use MMX Instructions for Block Copies and Block Fills."
Added the optimization "Integer Absolute Value Computation Using MMX Instructions."
Added the optimization "Efficient 64-Bit Population Count Using MMX Instructions."
Added the optimization "Efficiently Determining Similarity Between RGBA Pixels."
Added the optimization "Efficient Implementation of floor() Using 3DNow! Instructions."
Corrected the instruction mnemonics for AAM, AAD, BOUND, FDIVP, FMULP, FSUBP, DIV, IDIV, IMUL, MUL, and TEST in "Instruction Dispatch and Execution Resources/Timing" and "DirectPath versus VectorPath Instructions."

June 2000
Added the appendix "Performance-Monitoring Counters."

Sept. 2000
Corrected the example under "Muxing Constructs" in the chapter "Branch Optimizations."

Introduction

The AMD Athlon processor is the newest microprocessor in the AMD K86 family of microprocessors. The advances in the AMD Athlon processor take superscalar operation and out-of-order execution to a new level. The AMD Athlon processor has been designed to efficiently execute code written for previous-generation x86 processors. However, to enable the fastest code execution with the AMD Athlon processor, programmers should write software that includes specific code optimization techniques.
About this Document
This document contains information to assist programmers in creating optimized code for the AMD Athlon processor. In addition to compiler and assembler designers, this document has been targeted to C and assembly language programmers writing execution-sensitive code sequences. This document assumes that the reader possesses in-depth knowledge of the x86 instruction set, the x86 architecture (registers, programming modes, etc.), and the IBM PC-AT platform. This guide has been written specifically for the AMD Athlon processor, but it includes considerations for previous-generation processors and describes how those optimizations are applicable to the AMD Athlon processor.

This guide contains the following chapters:

Chapter 1: Introduction. Outlines the material covered in this document. Summarizes the AMD Athlon microarchitecture.
Chapter 2: Top Optimizations. Provides convenient descriptions of the most important optimizations a programmer should take into consideration.
Chapter 3: C Source Level Optimizations. Describes optimizations that C/C++ programmers can implement.
Chapter 4: Instruction Decoding Optimizations. Describes methods that will make the most efficient use of the three sophisticated instruction decoders in the AMD Athlon processor.
Chapter 5: Cache and Memory Optimizations. Describes optimizations that make efficient use of the large L1 caches and high-bandwidth buses of the AMD Athlon processor.
Chapter 6: Branch Optimizations. Describes optimizations that improve branch prediction and minimize branch penalties.
Chapter 7: Scheduling Optimizations. Describes optimizations that improve code scheduling for efficient execution resource utilization.
Chapter 8: Integer Optimizations. Describes optimizations that improve integer arithmetic and make efficient use of the integer execution units in the AMD Athlon processor.
Chapter 9: Floating-Point Optimizations. Describes optimizations that make maximum use of the superscalar and pipelined floating-point unit (FPU) of the AMD Athlon processor.
Chapter 10: 3DNow! and MMX Optimizations. Describes code optimization guidelines for 3DNow!, MMX, and Enhanced 3DNow!/MMX.
Chapter 11: General x86 Optimization Guidelines. Lists generic optimization techniques applicable to x86 processors.
Appendix A: AMD Athlon Processor Microarchitecture. Describes in detail the microarchitecture of the AMD Athlon processor.
Appendix B: Pipeline and Execution Unit Resources Overview. Describes in detail the execution units and their relation to the instruction pipeline.
Appendix C: Implementation of Write Combining. Describes the algorithm used by the AMD Athlon processor to write combine.
Appendix D: Performance-Monitoring Counters. Describes the usage of the performance counters available in the AMD Athlon processor.
Appendix E: Programming the MTRR and PAT. Describes the steps needed to program the Memory Type Range Registers and the Page Attribute Table.
Appendix F: Instruction Dispatch and Execution Resources/Timing. Lists each instruction's execution resource usage and latency.
Appendix G: DirectPath versus VectorPath Instructions. Lists the x86 instructions that are DirectPath and VectorPath instructions.

AMD Athlon Processor Family

The AMD Athlon processor family uses state-of-the-art decoupled decode/execution design techniques to deliver next-generation performance with x86 binary software compatibility. This next-generation processor family advances x86 code execution by using flexible instruction predecoding, wide and balanced decoders, aggressive out-of-order execution, parallel integer execution pipelines, parallel floating-point execution pipelines, deep pipelined execution for higher delivered operating frequency, dedicated backside cache memory, and a new high-performance double-rate 64-bit local bus. As an x86 binary-compatible processor, the AMD Athlon processor implements the industry-standard x86 instruction set by decoding and executing the x86 instructions using a proprietary microarchitecture. This microarchitecture allows the delivery of maximum performance when running x86-based PC software.

AMD Athlon Processor Microarchitecture Summary

The AMD Athlon processor brings superscalar performance to systems running industry-standard x86 software. A brief summary of the AMD Athlon processor follows:

High-speed double-rate local bus interface
Large, split 128-Kbyte level-one (L1) cache
Dedicated backside level-two (L2) cache
Instruction predecode and branch detection during cache line fills
Decoupled decode/execution core
Three-way x86 instruction decoding
Dynamic scheduling and speculative execution
Three-way integer execution
Three-way address generation
Three-way floating-point execution
3DNow! technology and MMX single-instruction multiple-data (SIMD) instruction extensions
Super data forwarding
Deep out-of-order integer and floating-point execution
Register renaming
Dynamic branch prediction

The AMD Athlon processor communicates through a next-generation high-speed local bus that is beyond the current Socket 7 or Super7 bus standard. The local bus can transfer data at twice the rate of the bus operating frequency by using both the rising and falling edges of the clock (see "AMD Athlon System Bus" in Appendix A for more information). To reduce on-chip cache miss penalties and to avoid subsequent data load or instruction fetch stalls, the AMD Athlon processor has a dedicated high-speed backside L2 cache. The large 128-Kbyte L1 on-chip cache and the backside L2 cache allow the AMD Athlon execution core to achieve and sustain maximum performance.

As a decoupled decode/execution processor, the AMD Athlon processor makes use of a proprietary microarchitecture, which defines the heart of the AMD Athlon processor. With the inclusion of all these features, the AMD Athlon processor is capable of decoding, issuing, executing, and retiring multiple x86 instructions per cycle, resulting in superior scalable performance.

The AMD Athlon processor includes both the industry-standard MMX SIMD integer instructions and the 3DNow! SIMD floating-point instructions that were first introduced in the AMD-K6®-2 processor. The design of 3DNow! technology was based on suggestions from leading graphics and independent software vendors (ISVs). Using SIMD format, the AMD Athlon processor can generate up to four 32-bit, single-precision floating-point results per clock cycle.

The 3DNow! execution units allow for high-performance floating-point vector operations, which can replace x87 instructions and enhance the performance of 3D graphics and other floating-point-intensive applications. Because the 3DNow! architecture uses the same registers as the MMX instructions, switching between MMX and 3DNow! has no penalty.

The AMD Athlon processor designers took another innovative step by carefully integrating the traditional x87 floating-point, MMX, and 3DNow! execution units into one operational engine. With the introduction of the AMD Athlon processor, the switching overhead between x87, MMX, and 3DNow! technology is virtually eliminated. The AMD Athlon processor combined with 3DNow! technology brings a better multimedia experience to mainstream PC users while maintaining backwards compatibility with all existing x86 software.

Although the AMD Athlon processor can extract code parallelism on-the-fly from off-the-shelf, commercially available x86 software, specific code optimization for the AMD Athlon processor can result in even higher delivered performance. This document describes the proprietary microarchitecture in the AMD Athlon processor and makes recommendations for optimizing execution of x86 software on the processor.

The coding techniques for achieving peak performance on the AMD Athlon processor include, but are not limited to, those for the AMD-K6, AMD-K6-2, Pentium®, Pentium Pro, and Pentium II processors. However, many of these optimizations are not necessary for the AMD Athlon processor to achieve maximum performance. Due to the more flexible pipeline control and aggressive out-of-order execution, the AMD Athlon processor is not as sensitive to instruction selection and code scheduling. This flexibility is one of the distinct advantages of the AMD Athlon processor.

The AMD Athlon processor uses the latest in processor microarchitecture design techniques to provide the highest x86 performance for today's PC. In short, the AMD Athlon processor offers true next-generation performance with x86 binary software compatibility.

Top Optimizations

This chapter contains descriptions of the best optimizations for improving the performance of the AMD Athlon processor. Subsequent chapters contain more detailed descriptions of these and other optimizations. The optimizations in this chapter are divided into two groups and listed in order of importance.

Group I: Essential Optimizations

Group I contains essential optimizations. Users should follow these critical guidelines closely. The optimizations in Group I are as follows:

Memory Size and Alignment Issues: avoid memory size mismatches, and align data where possible
Use the 3DNow! PREFETCH and PREFETCHW Instructions
Select DirectPath Over VectorPath Instructions

Group II: Secondary Optimizations

Group II contains secondary optimizations that can significantly improve the performance of the AMD Athlon processor. The optimizations in Group II are as follows:

Load-Execute Instruction Usage: use load-execute instructions, and avoid load-execute floating-point instructions with integer operands
Take Advantage of Write Combining
Use 3DNow! Instructions
Avoid Branches Dependent on Random Data
Avoid Placing Code and Data in the Same 64-Byte Cache Line

Optimization Star

The top optimizations described in this chapter are flagged with a star. In addition, the star appears beside the more detailed descriptions found in subsequent chapters.

Group I Optimizations: Essential Optimizations

Memory Size and Alignment Issues

Avoid Memory Size Mismatches

Avoid memory size mismatches when different instructions operate on the same data. When one instruction stores and another instruction reloads the same data, keep their operands aligned and keep the loads/stores of each operand the same size. The following code examples result in a store-to-load-forwarding (STLF) stall:

Example 1 (Avoid):

    MOV   DWORD PTR [FOO], EAX
    MOV   DWORD PTR [FOO+4], EDX
    FLD   QWORD PTR [FOO]

Avoid large-to-small mismatches, as shown in the following code:

Example 2 (Avoid):

    FST   QWORD PTR [FOO]
    MOV   EAX, DWORD PTR [FOO]
    MOV   EDX, DWORD PTR [FOO+4]
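
The same idea can be sketched at the C level. This is only an illustration under stated assumptions: the helper names `combine_halves` and `split_halves` are hypothetical, and whether the values actually stay in registers depends on the compiler and its settings. The sketch builds or takes apart a 64-bit value with shifts, rather than storing two 32-bit halves and reloading them as one 64-bit operand (or the reverse).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build a 64-bit value from two 32-bit halves with
 * shifts, so that no pair of 32-bit stores has to be reloaded as a single
 * 64-bit operand. */
uint64_t combine_halves(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)hi << 32) | (uint64_t)lo;
}

/* Hypothetical helper: extract both halves with shifts instead of storing
 * 64 bits and reloading two 32-bit pieces. */
void split_halves(uint64_t value, uint32_t *lo, uint32_t *hi)
{
    assert(lo != NULL && hi != NULL);
    *lo = (uint32_t)(value & 0xFFFFFFFFu);
    *hi = (uint32_t)(value >> 32);
}
```

When a value must go through memory, the corresponding habit is to read it back with the same operand size that was used to write it.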

Align Data Where Possible

Avoid misaligned data references. In general, data whose size is a power of two is considered aligned if it is naturally aligned. For example:

WORD accesses are aligned if they access an address divisible by 2.
DWORD accesses are aligned if they access an address divisible by 4.
QWORD accesses are aligned if they access an address divisible by 8.
TBYTE accesses are aligned if they access an address divisible by 8.

A misaligned store or load operation suffers a minimum one-cycle penalty in the AMD Athlon processor load/store pipeline. In addition, using misaligned loads and stores increases the likelihood of encountering a store-to-load forwarding pitfall. For a more detailed discussion of store-to-load forwarding issues, see "Store-to-Load Forwarding Restrictions."
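
A minimal sketch of how natural alignment could be checked or enforced from C follows. The function names `is_aligned` and `align_up` are hypothetical, and the code assumes that converting a pointer to `uintptr_t` yields its address, which holds on common flat-memory platforms but is implementation-defined.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Returns 1 if addr is a multiple of alignment, 0 otherwise.
 * alignment must be a power of two. */
int is_aligned(const void *addr, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)addr & (alignment - 1)) == 0;
}

/* Rounds offset up to the next multiple of alignment (a power of two),
 * for example when laying out fields in a buffer by hand. */
size_t align_up(size_t offset, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (offset + alignment - 1) & ~(alignment - 1);
}
```

For instance, `align_up(13, 8)` gives 16, the next offset at which an 8-byte item would be naturally aligned.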

Use the 3DNow! PREFETCH and PREFETCHW Instructions

For code that can take advantage of prefetching, use the 3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor, which significantly improves performance. The prefetch instructions are essentially integer instructions and can be used anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.). Use the following formula to determine prefetch distance:

    Prefetch Length = 200 (DS/C)

Round up to the nearest cache line.
DS is the data stride per loop iteration.
C is the number of cycles per loop iteration when hitting in the L1 cache.

See "Use the 3DNow! PREFETCH and PREFETCHW Instructions" in the chapter "Cache and Memory Optimizations" for more details.
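
The prefetch-distance calculation described above can be sketched as a small C helper. This is an illustration only: the function name `prefetch_distance` is hypothetical, the constant factor of the formula is passed in as a parameter rather than hard-coded, and the result is assumed to be rounded up to a whole number of 64-byte cache lines.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64u  /* cache-line size assumed for this sketch */

/* latency_factor: constant multiplier of the prefetch formula.
 * ds: data stride per loop iteration, in bytes.
 * c:  cycles per loop iteration when the loop hits in the cache.
 * Returns latency_factor * ds / c, rounded up to whole cache lines. */
size_t prefetch_distance(size_t latency_factor, size_t ds, size_t c)
{
    assert(c != 0);
    size_t raw = (latency_factor * ds + c - 1) / c;  /* ceiling division */
    return (raw + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}
```

As a worked case, with a factor of 200, a stride of 8 bytes, and 10 cycles per iteration, the raw value is 160 bytes, which rounds up to 192 bytes (three cache lines).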

Select DirectPath Over VectorPath Instructions

Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized for efficient decode and execution by minimizing the number of operations per x86 instruction, which includes "register <- register op memory" as well as "register <- register op register" forms of instructions. Up to three DirectPath instructions can be decoded per cycle. VectorPath instructions block the decoding of DirectPath instructions. The AMD Athlon processor implements the majority of instructions used by a compiler as DirectPath instructions. However, assembly writers must still take into consideration the usage of DirectPath versus VectorPath instructions. See Appendix F, "Instruction Dispatch and Execution Resources/Timing," and Appendix G, "DirectPath versus VectorPath Instructions," for tables of DirectPath and VectorPath instructions.

Group II Optimizations: Secondary Optimizations

Load-Execute Instruction Usage

Use Load-Execute Instructions

Most load-execute integer instructions are DirectPath decodable and can be decoded at the rate of three per cycle. Splitting a load-execute integer instruction into two separate instructions (a load instruction and a "reg, reg" instruction) reduces decoding bandwidth and increases register pressure, which results in lower performance. Use the split-instruction form to avoid scheduler stalls for longer executing instructions and to explicitly schedule the load and execute operations.

Use Load-Execute Floating-Point Instructions with Floating-Point Operands

When operating on single-precision or double-precision floating-point data, wherever possible use floating-point load-execute instructions to increase code density.

Note: This optimization applies only to floating-point instructions with floating-point operands and not to those with integer operands, as described in the immediately following section.

This coding style helps in two ways. First, denser code allows more work to be held in the instruction cache. Second, the denser code generates fewer internal OPs and, therefore, the FPU scheduler holds more work, which increases the chances of extracting parallelism from the code.

Example 1 (Avoid):

    FLD   QWORD PTR [TEST1]
    FLD   QWORD PTR [TEST2]
    FMUL  ST, ST(1)

Example 1 (Preferred):

    FLD   QWORD PTR [TEST1]
    FMUL  QWORD PTR [TEST2]

Avoid Load-Execute Floating-Point Instructions with Integer Operands

Do not use load-execute floating-point instructions with integer operands: FIADD, FISUB, FISUBR, FIMUL, FIDIV, FIDIVR, FICOM, and FICOMP. Remember that floating-point instructions can have integer operands, while integer instructions cannot have floating-point operands. Use separate FILD and arithmetic instructions for floating-point computations involving integer-memory operands.

This optimization has the potential to increase decode bandwidth and OP density in the FPU scheduler. The floating-point load-execute instructions with integer operands are VectorPath and generate two OPs in a cycle, while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle. In some situations this optimization can also reduce execution time if the FILD can be scheduled several instructions ahead of the arithmetic instruction in order to cover the FILD latency.

Example 2 (Avoid):

    FLD    QWORD PTR [foo]
    FIMUL  DWORD PTR [bar]
    FIADD  DWORD PTR [baz]

Example 2 (Preferred):

    FILD   DWORD PTR [bar]
    FILD   DWORD PTR [baz]
    FLD    QWORD PTR [foo]
    FMULP  ST(2), ST
    FADDP  ST(1), ST

Take Advantage of Write Combining

This guideline applies only to operating system, device driver, and BIOS programmers. In order to improve system performance, the AMD Athlon processor aggressively combines multiple memory-write cycles of any data size that address locations within a 64-byte cache line aligned write buffer. See Appendix C, "Implementation of Write Combining," for more details.

Use 3DNow! Instructions

When single precision is required, perform floating-point computations using the 3DNow! instructions instead of x87 instructions. The SIMD nature of 3DNow! instructions achieves twice the number of FLOPs that are achieved through x87 instructions. 3DNow! instructions also provide a flat register file instead of the stack-based approach of x87 instructions. See the "3DNow! Instructions" table for a list of 3DNow! instructions. For information about instruction usage, see the 3DNow! Technology Manual, order# 21928.

Avoid Branches Dependent on Random Data

Avoid conditional branches depending on random data, as these are difficult to predict. For example, a piece of code receives a random stream of characters "A" through "Z" and branches if the character is before "M" in the collating sequence. Data-dependent branches acting upon basically random data cause the branch prediction logic to mispredict the branch about 50% of the time. If possible, design branch-free alternative code sequences, which results in shorter average execution time. This technique is especially important if the branch body is small. See "Avoid Branches Dependent on Random Data" in the chapter "Branch Optimizations" for more details.
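
One way to picture a branch-free alternative is the C sketch below, which counts how many characters of a buffer sort before a pivot character. Both function names are hypothetical, and the example makes no claim about the instructions a particular compiler emits; it only shows a source-level formulation in which the comparison result (0 or 1) is used as data instead of steering a conditional branch.

```c
#include <assert.h>
#include <stddef.h>

/* Branching version: the if depends directly on the input data. */
size_t count_before_branchy(const char *s, size_t n, char pivot)
{
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] < pivot)
            count++;
    }
    return count;
}

/* Branch-free formulation: the comparison yields 0 or 1, which is added
 * directly, so the loop body has no data-dependent conditional. */
size_t count_before_branchless(const char *s, size_t n, char pivot)
{
    assert(s != NULL || n == 0);
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        count += (size_t)(s[i] < pivot);
    return count;
}
```

Both functions return the same result; the second form simply gives the compiler an easier route to a compare-and-add sequence without a jump.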
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Sharing code and data in the same 64-byte cache line may cause the L1 caches to thrash (unnecessary castout of code/data) in order to maintain coherency between the separate instruction and data caches. The AMD Athlon processor has a cache-line size of 64 bytes, which is twice the size of previous processors. Avoid placing code and data together within this larger cache line, especially if the data becomes modified. For example, consider that a memory indirect JMP instruction may have the data for the jump table residing in the same 64-byte cache line as the JMP instruction. This mixing of code and data in the same cache line would result in lower performance. Although rare, do not place critical code at the border between 32-byte aligned code segments and data segments. Code at the start or end of a data segment should be executed as seldom as possible or simply padded with garbage. In general, avoid the following:
- self-modifying code
- storing data in code segments
Source Level Optimizations
This chapter details C programming practices for optimizing code for the AMD Athlon processor. Guidelines are listed in order of importance.
Ensure Floating-Point Variables and Expressions are of Type Float
For compilers that generate 3DNow! instructions, make sure that all floating-point variables and expressions are of type float. Pay special attention to floating-point constants. These require a suffix of "F" or "f" (for example: 3.14f) in order to be of type float, otherwise they default to type double. To avoid automatic promotion of float arguments to double, always use function prototypes for all functions that accept float arguments.
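The following sketch illustrates both points of this guideline: a suffixed constant that keeps an expression in single precision, and a prototype that keeps float arguments from being promoted. The function names are hypothetical.

```c
/* With this prototype in scope, float arguments are passed as float and
   are not promoted to double at the call site. */
float scale(float value, float factor);

float scale(float value, float factor)
{
    return value * factor;
}

float circle_area(float radius)
{
    /* 3.14f has type float. A plain 3.14 would have type double and would
       pull the whole expression into double precision. */
    return 3.14f * radius * radius;
}
```

The `sizeof` operator offers a quick check: `sizeof(3.14f)` equals `sizeof(float)`, while `sizeof(3.14)` equals `sizeof(double)`.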
Use 32-Bit Data Types for Integer Code
Use 32-bit data types for integer code. Compiler implementations vary, but typically the following data types are included: int, signed, signed int, unsigned, unsigned int, long, signed long, long int, signed long int, unsigned long, and unsigned long int.
Consider the Sign of Integer Operands
In many cases, the data stored in integer variables determines whether a signed or an unsigned integer type is appropriate. For example, to record the weight of a person in pounds, no negative numbers are required, so an unsigned type is appropriate. However, recording temperatures in degrees Celsius may require both positive and negative numbers, so a signed type is needed. Where there is a choice of using either a signed or an unsigned type, take into consideration that certain operations are faster with unsigned types while others are faster for signed types.
Integer-to-floating-point conversion using integers larger than 16 bits is faster with signed types, as the x86 architecture provides instructions for converting signed integers to floating-point, but has no instructions for converting unsigned integers. In a typical case, a 32-bit integer is converted by a compiler to assembly as follows:
Example (Avoid):
    double x;            ====>   MOV   [temp+4], 0
    unsigned int i;              MOV   EAX, i
    x = i;                       MOV   [temp], EAX
                                 FILD  QWORD PTR [temp]
                                 FSTP  QWORD PTR [x]
The above code is slow not only because of the number of instructions, but also because a size mismatch prevents store-to-load forwarding to the FILD instruction. Instead, use the following code:
Example (Preferred):
    double x;            ====>   FILD  DWORD PTR [i]
    int i;                       FSTP  QWORD PTR [x]
    x = i;
Computing quotients and remainders in integer division by constants is faster when performed on unsigned types. The following typical case is the compiler output for a 32-bit integer divided by four:
Example (Avoid):
    int i;               ====>   MOV   EAX, i
    i = i / 4;                   CDQ
                                 AND   EDX, 3
                                 ADD   EAX, EDX
                                 SAR   EAX, 2
                                 MOV   i, EAX
Example (Preferred):
    unsigned int i;      ====>   SHR   i, 2
    i = i / 4;
In summary, use unsigned types for:
- Division and remainders
- Loop counters
- Array indexing
Use signed types for:
- Integer-to-float conversion
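The reason the signed case needs extra instructions is that C division rounds toward zero, while an arithmetic right shift rounds toward negative infinity. The sketch below spells out that fix-up in C. The function names are hypothetical, and the signed version assumes that `>>` on a negative `int32_t` is an arithmetic shift, which is implementation-defined in C but holds for common x86 compilers.

```c
#include <stdint.h>

/* Unsigned division by 4 is a single logical shift. */
uint32_t udiv4(uint32_t i)
{
    return i >> 2;
}

/* Signed division by 4 must round toward zero, so negative dividends get
   a correction of +3 before the shift. This mirrors the kind of fix-up
   sequence a compiler typically emits for the signed case. */
int32_t sdiv4(int32_t i)
{
    int32_t sign_mask = (i < 0) ? -1 : 0;   /* all ones for negative i */
    return (i + (sign_mask & 3)) >> 2;      /* assumes arithmetic shift */
}
```

For example, -7 / 4 is -1 in C, whereas shifting -7 right by two without the correction would give -2.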
Use Array Style Instead of Pointer Style Code
The use of pointers in C makes work difficult for the optimizers in C compilers. Without detailed and aggressive pointer analysis, the compiler has to assume that writes through a pointer can write to any place in memory. This includes storage allocated to other variables, creating the issue of aliasing, i.e., the same block of memory is accessible in more than one way. To help the optimizer of the C compiler in its analysis, avoid the use of pointers where possible. One example where this is trivially possible is in the access of data organized as arrays. C allows the use of either the array operator [] or pointers to access the array. Using array-style code makes the task of the optimizer easier by reducing possible aliasing. For example, x[0] and x[2] cannot possibly refer to the same memory location, while *p and *q could. It is highly recommended to use the array style, as significant performance advantages can be achieved with most compilers.
Example (Avoid):
    typedef struct {
        float x, y, z, w;
    } VERTEX;

    typedef struct {
        float m[4][4];
    } MATRIX;

    void XForm (float *res, const float *v, const float *m, int numverts)
    {
        float dp;
        int i;
        const VERTEX* vv = (VERTEX *)v;

        for (i = 0; i < numverts; i++) {
            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;        /* write transformed x */

            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;        /* write transformed y */

            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;        /* write transformed z */

            dp  = vv->x * *m++;
            dp += vv->y * *m++;
            dp += vv->z * *m++;
            dp += vv->w * *m++;
            *res++ = dp;        /* write transformed w */

            ++vv;               /* next input vertex */
            m -= 16;            /* reset to start of transform matrix */
        }
    }
Example (Preferred):
    typedef struct {
        float x, y, z, w;
    } VERTEX;

    typedef struct {
        float m[4][4];
    } MATRIX;

    void XForm (float *res, const float *v, const float *m, int numverts)
    {
        int i;
        const VERTEX* vv = (VERTEX *)v;
        const MATRIX* mm = (MATRIX *)m;
        VERTEX* rr = (VERTEX *)res;

        for (i = 0; i < numverts; i++) {
            rr->x = vv->x*mm->m[0][0] + vv->y*mm->m[0][1] +
                    vv->z*mm->m[0][2] + vv->w*mm->m[0][3];
            rr->y = vv->x*mm->m[1][0] + vv->y*mm->m[1][1] +
                    vv->z*mm->m[1][2] + vv->w*mm->m[1][3];
            rr->z = vv->x*mm->m[2][0] + vv->y*mm->m[2][1] +
                    vv->z*mm->m[2][2] + vv->w*mm->m[2][3];
            rr->w = vv->x*mm->m[3][0] + vv->y*mm->m[3][1] +
                    vv->z*mm->m[3][2] + vv->w*mm->m[3][3];
            ++vv;               /* next input vertex */
            ++rr;               /* next output vertex */
        }
    }
Reality Check
Note that source code transformations will interact with a compiler's code generator and that it is difficult to control the generated machine code from the source level. It is even possible that source code transformations for improving performance and compiler optimizations "fight" each other. Depending on the compiler and the specific source code, it is therefore possible that pointer style code will be compiled into machine code that is faster than that generated from equivalent array style code. It is advisable to check the performance after any source code transformation to see whether performance really has improved.
Completely Unroll Small Loops
Take advantage of the AMD Athlon processor's large, 64-Kbyte instruction cache and completely unroll small loops. Unrolling loops can be beneficial to performance, especially if the loop body is small, which makes the loop overhead significant. Many compilers are not aggressive at unrolling loops. For loops that have a small fixed loop count and a small loop body, completely unroll the loops at the source level.
Example (Avoid):
    // 3D-transform: multiply vector V by 4x4 transform matrix M
    for (i=0; i<4; i++) {
        r[i] = 0;
        for (j=0; j<4; j++) {
            r[i] += M[j][i]*V[j];
        }
    }
Example (Preferred):
    // 3D-transform: multiply vector V by 4x4 transform matrix M
    r[0] = M[0][0]*V[0] + M[1][0]*V[1] + M[2][0]*V[2] + M[3][0]*V[3];
    r[1] = M[0][1]*V[0] + M[1][1]*V[1] + M[2][1]*V[2] + M[3][1]*V[3];
    r[2] = M[0][2]*V[0] + M[1][2]*V[1] + M[2][2]*V[2] + M[3][2]*V[3];
    r[3] = M[0][3]*V[0] + M[1][3]*V[1] + M[2][3]*V[2] + M[3][3]*V[3];
Avoid Unnecessary Store-to-Load Dependencies
A store-to-load dependency exists when data is stored to memory, only to be read back shortly thereafter. See "Store-to-Load Forwarding Restrictions" for more details. The AMD Athlon processor contains hardware to accelerate such store-to-load dependencies, allowing the load to obtain the store data before it has been written to memory. However, it is still faster to avoid such dependencies altogether and keep the data in an internal register. Avoiding store-to-load dependencies is especially important if they are part of a long dependency chain, as might occur in a recurrence computation. If the dependency occurs while operating on arrays, many compilers are unable to optimize the code in a way that avoids the store-to-load dependency. In some instances the language definition may prohibit the compiler from using code transformations that would remove the store-to-load dependency. It is therefore recommended that the programmer remove the dependency manually, e.g., by introducing a temporary variable that can be kept in a register. This can result in a significant performance increase. The following is an example of this.
Example (Avoid):
    double x[VECLEN], y[VECLEN], z[VECLEN];
    unsigned int k;

    for (k = 1; k < VECLEN; k++) {
        x[k] = x[k-1] + y[k];
    }
    for (k = 1; k < VECLEN; k++) {
        x[k] = z[k] * (y[k] - x[k-1]);
    }
Example (Preferred):
    double x[VECLEN], y[VECLEN], z[VECLEN];
    unsigned int k;
    double t;

    t = x[0];
    for (k = 1; k < VECLEN; k++) {
        t = t + y[k];
        x[k] = t;
    }
    t = x[0];
    for (k = 1; k < VECLEN; k++) {
        t = z[k] * (y[k] - t);
        x[k] = t;
    }
Always Match the Size of Stores and Loads
The AMD Athlon processor contains a load/store buffer (LS) to speed up the forwarding of store data to dependent loads. However, this store-to-load forwarding (STLF) inside the LS occurs in general only when the addresses and sizes of the store and the dependent load match, and when both memory accesses are aligned (see section "Store-to-Load Forwarding Restrictions" for details). It is impossible to control load and store activity at the source level so as to avoid all cases that violate the restrictions placed on store-to-load forwarding. In some instances it is possible to spot such cases in the source code. Size mismatches can easily occur when different sized data items are joined in a union. Address mismatches could be the result of pointer manipulation.
The following examples show a situation involving a union of differently sized data items. The examples show a user defined unsigned 16.16 fixed point type, and two operations defined on this type. Function fixed_add() adds two fixed point numbers, and function fixed_int() extracts the integer portion of a fixed point number. Example (Avoid) shows an inappropriate implementation of fixed_int(), which when used on the result of fixed_add() causes misalignment, address mismatch, or size mismatch between memory operands, such that no STLF in the LS takes place. Example (Preferred) shows how to properly implement fixed_int() in order to allow store-to-load forwarding in the LS.
Example (Avoid):
    typedef union {
        unsigned int whole;
        struct {
            unsigned short frac;   /* lower 16 bits are fraction */
            unsigned short intg;   /* upper 16 bits are integer  */
        } parts;
    } FIXED_U_16_16;

    _inline FIXED_U_16_16 fixed_add (FIXED_U_16_16 x, FIXED_U_16_16 y)
    {
        FIXED_U_16_16 z;
        z.whole = x.whole + y.whole;
        return (z);
    }

    _inline unsigned int fixed_int (FIXED_U_16_16 x)
    {
        return ((unsigned int)(x.parts.intg));
    }

    FIXED_U_16_16 y, z;
    unsigned int q;

    label1:
        y = fixed_add (y, z);
        q = fixed_int (y);
    label2:
The object code generated for the source code between $label1 and $label2 typically follows one of these two variants:
    ;variant 1
    MOV   EDX, DWORD PTR [z]
    MOV   EAX, DWORD PTR [y]
    ADD   EAX, EDX
    MOV   DWORD PTR [y], EAX
    MOV   EAX, DWORD PTR [y+2]    ; misaligned/address mismatch, no forwarding in LS
    AND   EAX, 0FFFFh
    MOV   DWORD PTR [q], EAX

    ;variant 2
    MOV   EDX, DWORD PTR [z]
    MOV   EAX, DWORD PTR [y]
    ADD   EAX, EDX
    MOV   DWORD PTR [y], EAX
    MOVZX EAX, WORD PTR [y+2]     ; size and address mismatch, no forwarding in LS
    MOV   DWORD PTR [q], EAX
Example (Preferred):
    typedef union {
        unsigned int whole;
        struct {
            unsigned short frac;   /* lower 16 bits are fraction */
            unsigned short intg;   /* upper 16 bits are integer  */
        } parts;
    } FIXED_U_16_16;

    _inline FIXED_U_16_16 fixed_add (FIXED_U_16_16 x, FIXED_U_16_16 y)
    {
        FIXED_U_16_16 z;
        z.whole = x.whole + y.whole;
        return (z);
    }

    _inline unsigned int fixed_int (FIXED_U_16_16 x)
    {
        return (x.whole >> 16);
    }

    FIXED_U_16_16 y, z;
    unsigned int q;

    label1:
        y = fixed_add (y, z);
        q = fixed_int (y);
    label2:
The object code generated for the source code between $label1 and $label2 typically looks as follows:
    MOV   EDX, DWORD PTR [z]
    MOV   EAX, DWORD PTR [y]
    ADD   EAX, EDX
    MOV   DWORD PTR [y], EAX
    MOV   EAX, DWORD PTR [y]      ; aligned, size/address match, forwarding in LS
    SHR   EAX, 16
    MOV   DWORD PTR [q], EAX
Consider Expression Order in Compound Branch Conditions
Branch conditions in C programs are often compound conditions consisting of multiple boolean expressions joined by the boolean operators && and ||. C guarantees a short-circuit evaluation of these operators. This means that in the case of ||, the first operand to evaluate to TRUE terminates the evaluation, i.e., following operands are not evaluated at all. Similarly for &&, the first operand to evaluate to FALSE terminates the evaluation. Because of this short-circuit evaluation, it is not always possible to swap the operands of || and &&. This is especially the case when the evaluation of one of the operands causes a side effect. However, in most cases the exchange of operands is possible.
When used to control conditional branches, expressions involving || and && are translated into a series of conditional branches. The ordering of the conditional branches is a function of the ordering of the expressions in the compound condition, and can have a significant impact on performance. It is unfortunately not possible to give an easy, closed-form formula on how to order the conditions. Overall performance is a function of a variety of the following factors:
- probability of a branch mispredict for each of the branches generated
- additional latency incurred due to a branch mispredict
- cost of evaluating the conditions controlling each of the branches generated
- amount of parallelism that can be extracted in evaluating the branch conditions
- data stream consumed by an application (mostly due to the dependence of mispredict probabilities on the nature of the incoming data in data dependent branches)
It is therefore recommended to experiment with the ordering of expressions in compound branch conditions in the most active areas of a program (so called hot spots) where most of the execution time is spent. Such hot spots can be found through the use of profiling. Feed a "typical" data stream to the program while doing the experiments.
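As an illustration of one of the factors listed above (the cost of evaluating each condition), the sketch below shows two orderings of the same compound condition. All names are hypothetical, and both tests are free of side effects, so swapping the operands of && does not change the result. Which ordering is actually faster also depends on the mispredict behavior for the data at hand, so this is a starting point for experiments rather than a rule.

```c
#include <stddef.h>

/* Hypothetical cheap test: a single comparison. */
static int is_candidate(int value)
{
    return value > 100;
}

/* Hypothetical expensive test: loops over the decimal digits. */
static int has_even_digit_sum(int value)
{
    int sum = 0;
    while (value > 0) {
        sum += value % 10;
        value /= 10;
    }
    return (sum % 2) == 0;
}

/* Ordering A: the expensive test is evaluated for every element. */
size_t count_expensive_first(const int *data, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (has_even_digit_sum(data[i]) && is_candidate(data[i]))
            hits++;
    }
    return hits;
}

/* Ordering B: short-circuit evaluation skips the expensive test whenever
   the cheap test is already false. */
size_t count_cheap_first(const int *data, size_t n)
{
    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        if (is_candidate(data[i]) && has_even_digit_sum(data[i]))
            hits++;
    }
    return hits;
}
```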
Switch Statement Usage
Optimize Switch Statements
Switch statements are translated using a variety of algorithms. The most common of these are jump tables and comparison chains/trees. It is recommended to sort the cases of a switch statement according to the probability of occurrences, with the most probable first. This improves performance when the switch is translated as a comparison chain. It is further recommended to make the case labels small, contiguous integer values, as this allows the switch to be translated as a jump table. Most compilers allow the switch statement to be translated as a jump table if the case labels are small and contiguous integer values.
Example (Avoid):
    int days_in_month, short_months, normal_months, long_months;

    switch (days_in_month) {
        case 28:
        case 29: short_months++; break;
        case 30: normal_months++; break;
        case 31: long_months++; break;
        default: printf ("month has fewer than 28 or more than 31 days\n");
    }
Example (Preferred):
    int days_in_month, short_months, normal_months, long_months;

    switch (days_in_month) {
        case 31: long_months++; break;
        case 30: normal_months++; break;
        case 28:
        case 29: short_months++; break;
        default: printf ("month has fewer than 28 or more than 31 days\n");
    }
Use Prototypes for All Functions
In general, use prototypes for all functions. Prototypes can convey additional information to the compiler that might enable more aggressive optimizations.
Use Const Type Qualifier
Use the "const" type qualifier as much as possible. This optimization makes code more robust and may enable higher performance code to be generated due to the additional information available to the compiler. For example, the C standard allows compilers to not allocate storage for objects that are declared "const", if their address is never taken.
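A short sketch of the "const" guideline: the table and function names below are hypothetical and only illustrate where the qualifier can be applied.

```c
#include <stddef.h>

/* A file-scope table declared const: the compiler knows it never changes. */
static const int days_per_month[12] = {
    31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31
};

/* const on the pointed-to data states that the function only reads it. */
long sum_values(const int *values, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; i++)
        total += values[i];
    return total;
}

int days_in_common_year(void)
{
    /* A const scalar whose address is never taken need not occupy storage. */
    const size_t months = 12;
    return (int)sum_values(days_per_month, months);
}
```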
Generic Loop Hoisting
To improve the performance of inner loops, it is beneficial to reduce redundant constant calculations (i.e., loop invariant calculations). However, this idea can be extended to invariant control structures. The first case is that of a constant if() statement in a for() loop.
Example:
    for( i ... ) {
        if( CONSTANT0 ) {
            DoWork0( i );   // does not affect CONSTANT0
        } else {
            DoWork1( i );   // does not affect CONSTANT0
        }
    }
Transform the above loop into:
    if( CONSTANT0 ) {
        for( i ... ) {
            DoWork0( i );
        }
    } else {
        for( i ... ) {
            DoWork1( i );
        }
    }
This makes the inner loops tighter by avoiding repetitious evaluation of a known if() control structure. Although the branch would be easily predicted, the extra instructions and decode limitations imposed by branching are saved, which is usually well worth it.
Generalization for Multiple Constant Control Code
To generalize this further for multiple constant control code, some more work may have to be done to create the proper outer loop. Enumeration of the constant cases will reduce this to a simple switch statement.
Example:
    for(i ... ) {
        if( CONSTANT0 ) {
            DoWork0( i );   //does not affect CONSTANT0 or CONSTANT1
        } else {
            DoWork1( i );   //does not affect CONSTANT0 or CONSTANT1
        }
        if( CONSTANT1 ) {
            DoWork2( i );   //does not affect CONSTANT0 or CONSTANT1
        } else {
            DoWork3( i );   //does not affect CONSTANT0 or CONSTANT1
        }
    }
Transform the above loop using a switch statement into:
    #define combine( c1, c2 ) (((c1) << 1) + (c2))

    switch( combine( CONSTANT0!=0, CONSTANT1!=0 ) ) {
        case combine( 1, 1 ):
            for( i ... ) {
                DoWork0( i );
                DoWork2( i );
            }
            break;
        case combine( 0, 1 ):
            for( i ... ) {
                DoWork1( i );
                DoWork2( i );
            }
            break;
        case combine( 1, 0 ):
            for( i ... ) {
                DoWork0( i );
                DoWork3( i );
            }
            break;
        case combine( 0, 0 ):
            for( i ... ) {
                DoWork1( i );
                DoWork3( i );
            }
            break;
        default:
            break;
    }
The trick here is that there is some up-front work involved in generating all the combinations for the switch constant, and the total amount of code has doubled. However, it is also clear that the inner loops are "if()-free". In ideal cases where the "DoWork*()" functions are inlined, the successive functions will have greater overlap, leading to greater parallelism than would be possible in the presence of intervening if() statements. The same idea can be applied to constant switch() statements, or combinations of switch() statements and if() statements inside of for() loops. The method for combining the input constants gets more complicated, but will be worth it for the performance benefit. However, the number of inner loops can also substantially increase. If the number of inner loops is prohibitively high, then only the most common cases need to be dealt with directly, and the remaining cases can fall back to the old code in a "default:" clause for the switch() statement. This typically comes up when the programmer is considering runtime generated code. While runtime generated code can lead to similar levels of performance improvement, it is much harder to maintain, and the developer must do their own optimizations for their code generation without the help of an available compiler.
Declare Local Functions as Static
Functions that are not used outside the file in which they are defined should always be declared static, which forces internal linkage. Otherwise, such functions default to external linkage, which might inhibit certain optimizations with some compilers, for example, aggressive inlining.
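A minimal sketch of the distinction, with hypothetical function names: the helper is only used inside its own file and is therefore declared static, while the function called from other files keeps external linkage.

```c
/* Helper used only in this file: static gives it internal linkage, so the
   compiler is free to inline it and omit an out-of-line copy. */
static int clamp_to_byte(int value)
{
    if (value < 0)
        return 0;
    if (value > 255)
        return 255;
    return value;
}

/* Function intended to be called from other files: it keeps the default
   external linkage. */
int brighten(int pixel, int amount)
{
    return clamp_to_byte(pixel + amount);
}
```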
Dynamic Memory Allocation Consideration
Dynamic memory allocation (`malloc' in the C language) should always return a pointer that is suitably aligned for the largest base type (quadword alignment). Where this aligned pointer cannot be guaranteed, use the technique shown in the following code to make the pointer quadword aligned, if needed. This code assumes the pointer can be cast to a long.
Example:
    double* p;
    double* np;

    p = (double *)malloc(sizeof(double)*number_of_doubles+7L);
    np = (double *)((((long)(p))+7L) & (-8L));
Then use `np' instead of `p' to access the data. `p' is still needed in order to deallocate the storage.
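The cast through long in the example above only works where a long is as wide as a pointer. The following variant is a sketch of the same rounding technique using `uintptr_t`, assuming a compiler that provides `<stdint.h>`. The helper name and the `raw_out` parameter are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

/* Allocates room for count doubles plus 7 spare bytes and returns a
   quadword-aligned pointer into that block. The pointer returned by
   malloc() is stored in *raw_out and is the one to pass to free(). */
double *alloc_aligned_doubles(size_t count, void **raw_out)
{
    void *raw = malloc(count * sizeof(double) + 7);

    *raw_out = raw;
    if (raw == NULL)
        return NULL;

    /* Round the address up to the next multiple of 8. */
    return (double *)(((uintptr_t)raw + 7u) & ~(uintptr_t)7);
}
```

As in the original technique, the aligned pointer is used to access the data and the raw pointer is kept only for deallocation.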
Introduce Explicit Parallelism into Code
Where possible, break long dependency chains into several independent dependency chains, which can then be executed in parallel, exploiting the pipeline execution units. This is especially important for floating-point code, whether it is mapped to x87 or 3DNow! instructions, because of the longer latency of floating-point operations. Since most languages, including ANSI C, guarantee that floating-point expressions are not re-ordered, compilers cannot usually perform such optimizations unless they offer a switch to allow ANSI non-compliant reordering of floating-point expressions according to algebraic rules.
Note that re-ordered code that is algebraically identical to the original code does not necessarily deliver identical computational results, due to the lack of associativity of floating-point operations. There are well-known numerical considerations in applying these optimizations (consult a book on numerical analysis). In some cases, these optimizations may lead to unexpected results. Fortunately, in the vast majority of cases, the final result will differ only in the least significant bits.
Example (Avoid):
    double a[100], sum;
    int i;

    sum = 0.0;
    for (i=0; i<100; i++) {
        sum += a[i];
    }
Example (Preferred):
    double a[100], sum1, sum2, sum3, sum4, sum;
    int i;

    sum1 = 0.0;
    sum2 = 0.0;
    sum3 = 0.0;
    sum4 = 0.0;
    for (i=0; i<100; i+=4) {
        sum1 += a[i];
        sum2 += a[i+1];
        sum3 += a[i+2];
        sum4 += a[i+3];
    }
    sum = (sum4+sum3)+(sum1+sum2);
Notice that the 4-way unrolling was chosen to exploit the 4-stage fully pipelined floating-point adder. Each stage of the floating-point adder is occupied on every clock cycle, ensuring maximal sustained utilization.
Explicitly Extract Common Subexpressions
In certain situations, C compilers are unable to extract common subexpressions from floating-point expressions due to the guarantee against reordering of such expressions in the ANSI standard. Specifically, the compiler cannot re-arrange the computation according to algebraic equivalencies before extracting common subexpressions. In such cases, the programmer should manually extract the common subexpression. Note that re-arranging the expression may result in different computational results due to the lack of associativity of floating-point operations, but the results usually differ in only the least significant bits.
Example (Avoid):
    double a,b,c,d,e,f;

    e = b*c/d;
    f = b/d*a;
Example (Preferred):
    double a,b,c,d,e,f,t;

    t = b/d;
    e = c*t;
    f = a*t;
Example (Avoid):
    double a,b,c,e,f;

    e = a/c;
    f = b/c;
Example (Preferred):
    double a,b,c,e,f,t;

    t = 1/c;
    e = a*t;
    f = b*t;
C Language Structure Component Considerations
Many compilers have options that allow padding of structures to make their size multiples of words, doublewords, or quadwords, in order to achieve better alignment for structures. In addition, to improve the alignment of structure members, some compilers might allocate structure elements in an order that differs from the order in which they are declared. However, some compilers might not offer any of these features, or their implementation might not work properly in all situations. Therefore, to achieve the best alignment of structures and structure members while minimizing the amount of padding regardless of compiler optimizations, the following methods are suggested.
Sort by Base Type Size
Sort structure members according to their base type size, declaring members with a larger base type size ahead of members with a smaller base type size.
Pad by Multiple of Largest Base Type Size
Pad the structure to a multiple of the largest base type size of any member. In this fashion, if the first member of a structure is naturally aligned, all other members are naturally aligned as well. The padding of the structure to a multiple of the largest base type size allows, for example, arrays of structures to be perfectly aligned.
The following example demonstrates the reordering of structure member declarations:
Example, original ordering (Avoid):
    struct {
        char   a[5];
        long   k;
        double x;
    } baz;
Example, new ordering with padding (Preferred):
    struct {
        double x;
        long   k;
        char   a[5];
        char   pad[7];
    } baz;
See "C Language Structure Component Considerations" for a different perspective.
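The effect of sorting and padding can be checked with `offsetof` and `sizeof`. The sketch below uses hypothetical structure names and `int32_t` for the 4-byte member so that member sizes do not depend on how wide `long` is on a given platform. It assumes an 8-byte `double`.

```c
#include <stddef.h>
#include <stdint.h>

/* Members in arbitrary order: alignment relies on compiler-inserted padding. */
struct unsorted {
    char    a[5];
    int32_t k;
    double  x;
};

/* Members sorted by decreasing base type size and padded by hand to a
   multiple of sizeof(double): 8 + 4 + 5 + 7 = 24 bytes. */
struct sorted {
    double  x;
    int32_t k;
    char    a[5];
    char    pad[7];
};

/* Returns 1 if every member of struct sorted sits at a multiple of its own
   size and the structure size is a multiple of the largest member size. */
int sorted_is_naturally_aligned(void)
{
    return offsetof(struct sorted, x) % sizeof(double) == 0
        && offsetof(struct sorted, k) % sizeof(int32_t) == 0
        && sizeof(struct sorted) % sizeof(double) == 0;
}
```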
Sort Local Variables According to Base Type Size
When a compiler allocates local variables in the same order in which they are declared in the source code, it can be helpful to declare local variables in such a manner that variables with a larger base type size are declared ahead of the variables with a smaller base type size. Then, if the first variable is allocated so that it is naturally aligned, all other variables are allocated contiguously in the order they are declared, and are naturally aligned without any padding.
Some compilers do not allocate variables in the order they are declared. In these cases, the compiler should automatically allocate variables in such a manner as to make them naturally aligned with the minimum amount of padding. In addition, some compilers do not guarantee that the stack is aligned suitably for the largest base type (that is, they do not guarantee quadword alignment), so that quadword operands might be misaligned, even if this technique is used and the compiler does allocate variables in the order they are declared.
The following example demonstrates the reordering of local variable declarations:
Example, original ordering (Avoid):
    short  ga, gu, gi;
    long   foo, bar;
    double x, y, z[3];
    char   a, b;
    float  baz;
Example, improved ordering (Preferred):
    double z[3];
    double x, y;
    long   foo, bar;
    float  baz;
    short  ga, gu, gi;
    char   a, b;
See "Sort Variables According to Base Type Size" for more information from a different perspective.
Accelerating Floating-Point Divides and Square Roots
Divides and square roots have a much longer latency than other floating-point operations, even though the AMD Athlon processor provides significant acceleration of these two operations. In some codes, these operations occur so often that it is recommended to port the code to 3DNow! inline assembly or to use a compiler that can generate 3DNow! code. If code has hot spots that use single-precision arithmetic only (i.e., all computation involves data of type float) and for some reason cannot be ported to 3DNow!, the following technique may be used to improve performance.
The x87 FPU has a precision-control field as part of the FPU control word. The precision-control setting determines to what precision results get rounded. It affects the basic arithmetic operations, including divides and square roots. AMD Athlon and AMD-K6® family processors implement divide and square root in such a fashion as to only compute the number of bits necessary for the currently selected precision. This means that setting precision control to single precision (versus the Win32 default of double precision) lowers the latency of those operations.
The Microsoft Visual C environment provides functions to manipulate the FPU control word and thus the precision control. Note that these functions are not very fast, so insert changes of precision control where it creates little overhead, such as outside a computation-intensive loop. Otherwise the overhead created by the function calls outweighs the benefit from reducing the latencies of divide and square root operations.
The following example shows how to set the precision control to single precision and later restore the original settings in the Microsoft Visual C environment.
Example:
    /* prototype for _controlfp() function */
    #include <float.h>

    unsigned int orig_cw;

    /* Get current FPU control word and save it */
    orig_cw = _controlfp (0,0);

    /* Set precision control in FPU control word to single precision.
       This reduces the latency of divide and square root operations. */
    _controlfp (_PC_24, MCW_PC);

    /* restore original FPU control word */
    _controlfp (orig_cw, 0xfffff);
Fast Floating-Point-to-Integer Conversion
Floating-point-to-integer conversion in C programs is typically a very slow operation. The semantics of C demand that the conversion use truncation. If the floating-point operand is of type float, and the compiler supports 3DNow! code generation, the 3DNow! PF2ID instruction, which performs truncating conversion, can be utilized by the compiler to accomplish rapid floating-point to integer conversion. For double-precision operands, the usual way to accomplish truncating conversion involves the following algorithm:
1. Save the current x87 rounding mode (this is usually round to nearest or even).
2. Set the x87 rounding mode to truncation.
3. Load the floating-point source operand and store out the integer result.
4. Restore the original x87 rounding mode.
This algorithm is typically implemented through a C runtime library function called ftol(). While the AMD Athlon processor has special hardware optimizations to speed up the changing of x87 rounding modes and therefore ftol(), calls to ftol() may still tend to be slow.
For situations where very fast floating-point-to-integer conversion is required, the conversion code in the "Fast" example below may be helpful. Note that this code uses the current rounding mode instead of truncation when performing the conversion. Therefore the result may differ by one from the ftol() result. The replacement code adds the "magic number" 2^52 + 2^51 to the source operand, then stores the double precision result to memory and retrieves the lower DWORD of the stored result. Adding the magic number shifts the original argument to the right inside the double precision mantissa, placing the binary point of the sum immediately to the right of the least significant mantissa bit. Extracting the lower DWORD of the sum then delivers the integral portion of the original argument.
Note: This conversion code causes a 64-bit store to feed into a 32-bit load. The load is from the lower 32 bits of the 64-bit store, the one case of size mismatch between a store and a depending load specifically supported by the store-to-load-forwarding hardware of the AMD Athlon processor.
Example (Slow):
    double x;
    int i;

    i = x;
Example (Fast):
    #define DOUBLE2INT(i,d) \
        {double t = ((d)+6755399441055744.0); i=*((int *)(&t));}

    double x;
    int i;

    DOUBLE2INT(i,x);
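The DOUBLE2INT macro above reads a double through an int pointer, which depends on byte order and on the compiler tolerating that kind of aliasing. The following function is a sketch of the same magic-number technique using memcpy for the reinterpretation. The function name is hypothetical, and the code assumes IEEE-754 doubles, two's complement integers, the default round-to-nearest mode, and an argument whose rounded value fits in 32 bits.

```c
#include <stdint.h>
#include <string.h>

/* Converts d to a 32-bit integer using the current rounding mode
   (not truncation), via the 2^52 + 2^51 magic number. */
int32_t double_to_int_fast(double d)
{
    double   t = d + 6755399441055744.0;    /* 2^52 + 2^51 */
    uint64_t bits;
    uint32_t low;
    int32_t  result;

    memcpy(&bits, &t, sizeof bits);          /* bit pattern of the sum */
    low = (uint32_t)bits;                    /* lower DWORD of the sum */
    memcpy(&result, &low, sizeof result);    /* reinterpret as signed  */
    return result;
}
```

Because the current rounding mode is used, a value such as 2.5 converts to 2 and 3.5 converts to 4 under round-to-nearest-even, whereas a truncating conversion would give 2 and 3.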
Speeding Up Branches Based on Comparisons Between Floats
Branches based on floating-point comparisons are often slow. The AMD Athlon processor supports the FCOMI, FUCOMI, FCOMIP, and FUCOMIP instructions that allow implementation of fast branches based on comparisons between operands of type double or type float. However, many compilers do not support generating these instructions. Likewise, floating-point comparisons between operands of type float can be accomplished quickly by using the 3DNow! PFCMP instruction if the compiler supports 3DNow! code generation.
With many compilers, the only way they implement branches based on floating-point comparisons is to use the FCOM or FCOMP instructions to compare the floating-point operands, followed by "FSTSW AX" in order to transfer the x87 condition code flags into EAX. This allows a branch based on the contents of that register. Although the AMD Athlon processor has acceleration hardware to speed up the FSTSW instruction, this process is still fairly slow.
Branches Dependent on Integer Comparisons are Fast
One alternative for branches based on comparisons between operands of type float is to store the operand(s) into a memory location and then perform an integer comparison with that memory location. Branches dependent on integer comparisons are very fast. It should be noted that the replacement code uses a load dependent on an immediately prior store. If the store is not DWORD aligned, no store-to-load forwarding takes place and the branch is still slow. Also, if there is a lot of activity in the load-store queue, forwarding of the store data may be somewhat delayed, thus negating some of the advantages of using the replacement code. It is recommended to experiment with the replacement code to test whether it actually provides a performance increase on the code at hand.
The replacement code works well for comparisons against zero, including correct behavior when encountering a negative zero as allowed by IEEE-754. It also works well for comparing to positive constants. In that case the user must first determine the integer representation of that floating-point constant. This can be accomplished with the following C code snippet:
    float x;

    scanf ("%g", &x);
    printf ("%08X\n", (*((int *)(&x))));
The replacement code is IEEE-754 compliant for all classes of floating-point operands except NaNs. However, NaNs do not occur in properly working software.
Examples:
    #define FLOAT2INTCAST(f)  (*((int *)(&f)))
    #define FLOAT2UINTCAST(f) (*((unsigned int *)(&f)))

    // comparisons against zero
    if (f <  0.0f)  ==>  if (FLOAT2UINTCAST(f) >  0x80000000U)
    if (f <= 0.0f)  ==>  if (FLOAT2INTCAST(f)  <= 0)
    if (f >  0.0f)  ==>  if (FLOAT2INTCAST(f)  >  0)
    if (f >= 0.0f)  ==>  if (FLOAT2UINTCAST(f) <= 0x80000000U)

    // comparisons against positive constant
    if (f <  3.0f)  ==>  if (FLOAT2INTCAST(f) <  0x40400000)
    if (f <= 3.0f)  ==>  if (FLOAT2INTCAST(f) <= 0x40400000)
    if (f >  3.0f)  ==>  if (FLOAT2INTCAST(f) >  0x40400000)
    if (f >= 3.0f)  ==>  if (FLOAT2INTCAST(f) >= 0x40400000)

    // comparisons among two floats
    if (f1 <  f2)   ==>  float t = f1 - f2;
                         if (FLOAT2UINTCAST(t) >  0x80000000U)
    if (f1 <= f2)   ==>  float t = f1 - f2;
                         if (FLOAT2INTCAST(t)  <= 0)
    if (f1 >  f2)   ==>  float t = f1 - f2;
                         if (FLOAT2INTCAST(t)  >  0)
    if (f1 >= f2)   ==>  float t = f1 - f2;
                         if (FLOAT2UINTCAST(t) <= 0x80000000U)
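The FLOAT2INTCAST macros above rely on reading a float through an integer pointer. As a sketch under the assumption of 32-bit IEEE-754 floats, the same comparisons can be written with memcpy, which avoids that pointer cast. The function names are hypothetical, and like the macros they are not valid for NaNs.

```c
#include <stdint.h>
#include <string.h>

/* Returns the IEEE-754 bit pattern of f as a signed 32-bit integer. */
int32_t float_bits(float f)
{
    int32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* f > 3.0f, evaluated as an integer comparison against 0x40400000. */
int greater_than_three(float f)
{
    return float_bits(f) > 0x40400000;
}

/* f < 0.0f, evaluated as an unsigned comparison so that negative zero
   (bit pattern 0x80000000) is not treated as less than zero. */
int less_than_zero(float f)
{
    return (uint32_t)float_bits(f) > 0x80000000u;
}
```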
Avoid Unnecessary Integer Division
Integer division is the slowest of all integer arithmetic operations and should be avoided wherever possible. One possibility for reducing the number of integer divisions is multiple divisions, in which division can be replaced with multiplication as shown in the following examples. This replacement is possible only if no overflow occurs during the computation of the product. This can be determined by considering the possible ranges of the divisors.
Example (Avoid):
    int i,j,k,m;

    m = i / j / k;
Example (Preferred):
    int i,j,k,m;

    m = i / (j * k);
Copy Frequently De-Referenced Pointer Arguments to Local Variables
Avoid frequently de-referencing pointer arguments inside a function. Since the compiler has no knowledge of whether aliasing exists between the pointers, such de-referencing cannot be optimized away by the compiler. This prevents data from being kept in registers and significantly increases memory traffic. Note that many compilers have an "assume no aliasing" optimization switch. This allows the compiler to assume that two different pointers always have disjoint contents and does not require copying of pointer arguments to local variables. Otherwise, copy the data pointed to by the pointer arguments to local variables at the start of the function and, if necessary, copy them back at the end of the function.
Example (Avoid):
    //assumes pointers are different and q!=r
    void isqrt ( unsigned long a, unsigned long *q, unsigned long *r)
    {
        *q = a;
        if (a > 0)
        {
            while (*q > (*r = a / *q))
            {
                *q = (*q + *r) >> 1;
            }
        }
        *r = a - *q * *q;
    }
Example (Preferred):
    //assumes pointers are different and q!=r
    void isqrt ( unsigned long a, unsigned long *q, unsigned long *r)
    {
        unsigned long qq, rr;

        qq = a;
        if (a > 0)
        {
            while (qq > (rr = a / qq))
            {
                qq = (qq + rr) >> 1;
            }
        }
        rr = a - qq * qq;
        *q = qq;
        *r = rr;
    }
Instruction Decoding Optimizations
This chapter describes ways to maximize the number of instructions decoded by the instruction decoders in the AMD Athlon processor. Guidelines are listed in order of importance.
Overview
The AMD Athlon processor instruction fetcher reads 16-byte aligned code windows from the instruction cache. The instruction bytes are then merged into a 24-byte instruction queue. On each cycle, the in-order front-end engine selects for decode up to three x86 instructions from the instruction-byte queue. All instructions (x86, x87, 3DNow!, and MMX) are classified into two types of decodes, DirectPath and VectorPath (see "DirectPath Decoder" and "VectorPath Decoder" for more information). DirectPath instructions are common instructions that are decoded directly in hardware. VectorPath instructions are more complex instructions that require the use of a sequence of multiple operations issued from an on-chip ROM. Up to three DirectPath instructions can be selected for decode per cycle. Only one VectorPath instruction can be selected for decode per cycle. DirectPath instructions and VectorPath instructions cannot be simultaneously decoded.
AthlonProcessor Code Optimization
22007I-0-September 2000
Select DirectPath Over VectorPath Instructions
Use DirectPath instructions rather than VectorPath instructions. DirectPath instructions are optimized for decode and execute efficiently by minimizing the number of operations per x86 instruction, which includes the "register = register op memory" as well as the "register = register op register" forms of instructions. Up to three DirectPath instructions can be decoded per cycle. VectorPath instructions block the decoding of DirectPath instructions. The AMD Athlon processor implements the majority of instructions used by a compiler as DirectPath instructions. However, assembly writers must still take into consideration the usage of DirectPath versus VectorPath instructions. See the appendixes "Instruction Dispatch and Execution Resources/Timing" and "DirectPath versus VectorPath Instructions" for tables of DirectPath and VectorPath instructions.
Load-Execute Instruction Usage
Use Load-Execute Integer Instructions
Most load-execute integer instructions are DirectPath decodable and can be decoded at the rate of three per cycle. Splitting a load-execute integer instruction into two separate instructions (a load instruction and a "reg, reg" instruction) reduces decoding bandwidth and increases register pressure, which results in lower performance. The split-instruction form can be used to avoid scheduler stalls for longer executing instructions and to explicitly schedule the load and execute operations.
Use Load-Execute Floating-Point Instructions with Floating-Point Operands
When operating on single-precision or double-precision floating-point data, use floating-point load-execute instructions wherever possible to increase code density.
Note: This optimization applies only to floating-point instructions with floating-point operands and not to those with integer operands, as described in the immediately following section.
This coding style helps in two ways. First, denser code allows more work to be held in the instruction cache. Second, denser code generates fewer internal OPs and, therefore, the FPU scheduler holds more work, increasing the chances of extracting parallelism from the code.
Example (Avoid):
    FLD  QWORD PTR [TEST1]
    FLD  QWORD PTR [TEST2]
    FMUL ST, ST(1)
Example (Preferred):
    FLD  QWORD PTR [TEST1]
    FMUL QWORD PTR [TEST2]
Avoid Load-Execute Floating-Point Instructions with Integer Operands
Do not use load-execute floating-point instructions with integer operands: FIADD, FISUB, FISUBR, FIMUL, FIDIV, FIDIVR, FICOM, and FICOMP. Remember that floating-point instructions can have integer operands while integer instructions cannot have floating-point operands. Floating-point computations involving integer-memory operands should use separate FILD and arithmetic instructions. This optimization has the potential to increase decode bandwidth and OP density in the FPU scheduler. The floating-point load-execute instructions with integer operands are VectorPath and generate two OPs in a cycle, while the discrete equivalent enables a third DirectPath instruction to be decoded in the same cycle. In some situations this optimization can also reduce execution time if the FILD can be scheduled several instructions ahead of the arithmetic instruction in order to cover the FILD latency.
Example (Avoid):
    FLD   QWORD PTR [foo]
    FIMUL DWORD PTR [bar]
    FIADD DWORD PTR [baz]
Example (Preferred):
    FILD  DWORD PTR [bar]
    FILD  DWORD PTR [baz]
    FLD   QWORD PTR [foo]
    FMULP ST(2), ST
    FADDP ST(1),ST
Use Read-Modify-Write Instructions Where Appropriate
The AMD Athlon processor handles read-modify-write (RMW) instructions such as "ADD [mem], reg32" very efficiently. The vast majority of RMW instructions are DirectPath instructions. Use of RMW instructions can provide a performance benefit over the use of an equivalent combination of load, load-execute and store instructions. In comparison to the load/load-execute/store combination, the equivalent RMW instruction promotes code density (better I-cache utilization), preserves decode bandwidth, and saves execution resources, as it occupies only one reservation station and requires only one address computation. It may also reduce register pressure, as demonstrated by the iteration-count example below.
Use of RMW instructions is indicated if an operation is performed on data that is in memory and the result of that operation is not reused soon. Due to the limited number of integer registers in an x86 processor, it is often the case that data needs to be kept in memory instead of in registers. Additionally, it can be the case that the data, once operated upon, is not reused soon. An example would be an accumulator inside a loop of unknown trip count, where the accumulator result is not reused inside the loop. Note that for loops with a known trip count, the accumulator manipulation can frequently be hoisted out of the loop.
Example (C code):
    /* C code */
    int accu, increment;
    while (condition) {
        ...
        /* accu is not read or written here */
        ...
        accu += increment;
    }
Example (Avoid):
    MOV EAX, [increment]
    ADD EAX, [accu]
    MOV [accu], EAX
Example (Preferred):
    MOV EAX, [increment]
    ADD [accu], EAX
Example (C code):
    /* C code */
    int iteration_count;
    iteration_count = 0;
    while (condition) {
        ...
        /* iteration count is not read here */
        ...
        iteration_count++;
    }
Example (Avoid):
    MOV EAX, [iteration_count]
    INC EAX
    MOV [iteration_count], EAX
Example (Preferred):
    INC [iteration_count]
Align Branch Targets in Program Hot Spots
In program hot spots (as determined by either profiling or loop nesting analysis), place branch targets at or near the beginning of 16-byte aligned code windows. This guideline improves performance inside hot spots by maximizing the number of instruction fills into the instruction-byte queue, and it preserves I-cache space in branch-intensive code outside such hot spots.
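As a rough illustration of the 16-byte code-window guideline, the C sketch below computes where a branch target sits inside its window and how many filler bytes would move it to the start of the next window. The function names are hypothetical and do not belong to any AMD tool; only the 16-byte window size is taken from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Size of the aligned code window read by the instruction fetcher. */
#define CODE_WINDOW_BYTES 16u

/* Offset of `address` inside its 16-byte code window (0 means aligned). */
unsigned offset_in_window(uint32_t address)
{
    return (unsigned)(address % CODE_WINDOW_BYTES);
}

/* Number of filler bytes needed so that a branch target currently at
   `address` starts on the next 16-byte boundary (0 if already aligned). */
unsigned padding_to_window(uint32_t address)
{
    unsigned offset = offset_in_window(address);
    return offset == 0 ? 0 : CODE_WINDOW_BYTES - offset;
}
```

An assembler's alignment directive normally does this arithmetic; the sketch only makes the cost visible, since every padded byte is I-cache space that is worth spending inside hot spots only.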
Use 32-Bit LEA Rather than 16-Bit LEA Instruction
The 32-bit Load Effective Address (LEA) instruction is implemented as a DirectPath operation with an execute latency of only two cycles. The 16-bit LEA instruction, however, is a VectorPath instruction, which lowers the decode bandwidth and has a longer execution latency.
Use Short Instruction Encodings
Assemblers and compilers should generate the shortest instruction encodings possible to optimize use of the I-cache and increase the average decode rate. Wherever possible, use instructions with shorter lengths. Using shorter instructions increases the number of instructions that can fit into the instruction-byte queue. For example, use 8-bit displacements as opposed to 32-bit displacements. In addition, use the single-byte format of simple integer instructions whenever possible, as opposed to the 2-byte opcode ModR/M format.
Example (Avoid):
    ADD EAX, 12345678h  ;uses 2-byte opcode form (with ModR/M)
    ADD EBX, -5         ;uses 32-bit immediate
    JZ  $label1         ;uses 2-byte opcode, 32-bit immediate
Example (Preferred):
    ADD EAX, 12345678h  ;uses single byte opcode form
    ADD EBX, -5         ;uses 8-bit sign extended immediate
    JZ  $label1         ;uses 1-byte opcode, 8-bit immediate
Avoid Partial Register Reads and Writes
In order to handle partial register writes, the AMD Athlon processor execution core implements a data-merging scheme. In the execution unit, an instruction writing a partial register merges the modified portion with the current state of the remainder of the register. Therefore, the dependency hardware can potentially force a false dependency on the most recent instruction that writes to any part of the register.
Example (Avoid):
    MOV AL, 10   ;inst 1
    MOV AH, 12   ;inst 2 has a false dependency on inst 1
                 ;inst 2 merges new AH with current EAX register value forwarded by inst 1
In addition, an instruction that has a read dependency on any part of a given architectural register has a read dependency on the most recent instruction that modifies any part of the same architectural register.
Example (Avoid):
    MOV BX, 12h  ;inst 1
    MOV BL, DL   ;inst 2, false dependency on completion of inst 1
    MOV BH, CL   ;inst 3, false dependency on completion of inst 2
    MOV AL, BL   ;inst 4, depends on completion of inst 2
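The data-merging scheme can be pictured with a small C model. This is only a sketch of the idea, not a description of the hardware: each helper (the names are made up for this illustration) builds the new 32-bit register value from the previous full value plus the newly written part, which is why a partial write cannot complete before the previous writer of the same register has produced its result.

```c
#include <assert.h>
#include <stdint.h>

/* Model of a write to the low byte (AL, BL, ...): the upper 24 bits
   are carried over from the previous register value. */
uint32_t write_low_byte(uint32_t previous, uint8_t value)
{
    return (previous & 0xFFFFFF00u) | value;
}

/* Model of a write to the high byte (AH, BH, ...): bits 8..15 change,
   everything else is merged in from the previous register value. */
uint32_t write_high_byte(uint32_t previous, uint8_t value)
{
    return (previous & 0xFFFF00FFu) | ((uint32_t)value << 8);
}

/* Model of a write to the low word (AX, BX, ...). */
uint32_t write_low_word(uint32_t previous, uint16_t value)
{
    return (previous & 0xFFFF0000u) | value;
}
```

Every helper takes `previous` as an input, so a chain of partial writes forms a serial dependency chain even when the written bytes do not overlap.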
Use LEAVE Instruction for Function Epilogue Code
A classical approach for referencing function arguments and local variables inside a function is the use of a so-called frame pointer. In x86 code, the EBP register is customarily used as a frame pointer. In function prologue code, the frame pointer is set up as follows:
    PUSH EBP            ;save old frame pointer
    MOV  EBP, ESP       ;new frame pointer
    SUB  ESP, nnnnnnnn  ;allocate local variables
Function arguments on the stack can now be accessed at positive offsets relative to EBP, and local variables are accessible at negative offsets relative to EBP. In the function epilogue code, the following work is performed:
    MOV  ESP, EBP       ;deallocate local variables
    POP  EBP            ;restore old frame pointer
The functionality of these two instructions is identical to that of the LEAVE instruction. The LEAVE instruction is a single-byte instruction and thus saves two bytes of code space over the MOV/POP epilogue sequence. Replacing the MOV/POP sequence with LEAVE also preserves decode bandwidth. Therefore, use the LEAVE instruction in function epilogue code for both code optimized specifically for the AMD Athlon processor and blended code (code that performs well on both AMD-K6 and AMD Athlon processors). Note that for functions that do not allocate local variables, the prologue and epilogue code can be simplified to the following:
    PUSH EBP            ;save old frame pointer
    MOV  EBP, ESP       ;new frame pointer
    ...
    POP  EBP            ;restore old frame pointer
This is optimal in cases where the use of a frame pointer is desired. For highest performance code, do not use a frame pointer at all. Function arguments and local variables should be accessed directly through ESP, thus freeing up EBP for use as a general purpose register and reducing register pressure.
Replace Certain SHLD Instructions with Alternative Code
Certain instances of the SHLD instruction can be replaced by alternative code sequences using SHR and LEA. The alternative code has lower latency and requires fewer execution resources. ADD, ADC, SHR, and LEA (32-bit version) are DirectPath instructions, while SHLD is a VectorPath instruction. The replacement code optimizes decode bandwidth, as it potentially enables the decoding of a third DirectPath instruction. The replacement code may increase register pressure, since it destroys the contents of REG2, whereas REG2 is preserved by SHLD. In situations where register pressure is high, the replacement sequences are therefore not indicated.
Example (Avoid):
    SHLD REG1, REG2, 1
Example (Preferred):
    ADD  REG2, REG2
    ADC  REG1, REG1
Example (Avoid):
    SHLD REG1, REG2, 2
Example (Preferred):
    SHR  REG2, 30
    LEA  REG1, [REG1*4 + REG2]
Example (Avoid):
    SHLD REG1, REG2, 3
Example (Preferred):
    SHR  REG2, 29
    LEA  REG1, [REG1*8 + REG2]
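The equivalence between SHLD and its replacements can be checked with plain 32-bit arithmetic. The C sketch below is a model written for this purpose (the function names are illustrative): `shld` follows the architectural definition of a double-precision left shift, and the other three functions mirror the ADD/ADC and SHR/LEA sequences for shift counts of 1, 2, and 3.

```c
#include <assert.h>
#include <stdint.h>

/* Reference model of SHLD REG1, REG2, count for counts 1..31:
   REG1 is shifted left and the vacated bits are filled from the top of REG2. */
uint32_t shld(uint32_t reg1, uint32_t reg2, unsigned count)
{
    assert(count >= 1 && count <= 31);
    return (reg1 << count) | (reg2 >> (32 - count));
}

/* ADD REG2, REG2 / ADC REG1, REG1 */
uint32_t shld1_add_adc(uint32_t reg1, uint32_t reg2)
{
    uint32_t carry = reg2 >> 31;   /* carry out of ADD REG2, REG2 */
    return reg1 + reg1 + carry;    /* ADC REG1, REG1 (wraps modulo 2^32) */
}

/* SHR REG2, 30 / LEA REG1, [REG1*4 + REG2] */
uint32_t shld2_shr_lea(uint32_t reg1, uint32_t reg2)
{
    reg2 >>= 30;
    return reg1 * 4u + reg2;
}

/* SHR REG2, 29 / LEA REG1, [REG1*8 + REG2] */
uint32_t shld3_shr_lea(uint32_t reg1, uint32_t reg2)
{
    reg2 >>= 29;
    return reg1 * 8u + reg2;
}
```

The model also shows why REG2 is destroyed: both replacement sequences overwrite it, whereas SHLD only reads it. Flag results differ between the forms and are not modeled here.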
Use 8-Bit Sign-Extended Immediates
Using 8-bit sign-extended immediates improves code density with no negative effects on the AMD Athlon processor. For example, ADD EBX, -5 should be encoded as "83 C3 FB" and not as "81 C3 FB FF FF FF".
Use 8-Bit Sign-Extended Displacements
Use 8-bit sign-extended displacements for conditional branches. Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative effects on the AMD Athlon processor.
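The choice between the short and long immediate forms comes down to a range test. The following C sketch (a hypothetical helper, not taken from any assembler) encodes ADD EBX, imm using the standard x86 opcodes 83 /0 (sign-extended 8-bit immediate) and 81 /0 (32-bit immediate). The same -128..127 test decides whether a conditional branch can use an 8-bit displacement.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True when `value` can be represented as a sign-extended 8-bit field. */
int fits_in_imm8(int32_t value)
{
    return value >= -128 && value <= 127;
}

/* Encode "ADD EBX, imm" into `out` and return the number of bytes used.
   ModR/M byte C3h selects register-direct EBX with opcode extension /0. */
size_t encode_add_ebx_imm(int32_t imm, uint8_t out[6])
{
    uint32_t bits = (uint32_t)imm;

    if (fits_in_imm8(imm)) {
        out[0] = 0x83;                       /* ADD r/m32, imm8 */
        out[1] = 0xC3;
        out[2] = (uint8_t)(bits & 0xFFu);
        return 3;
    }
    out[0] = 0x81;                           /* ADD r/m32, imm32 */
    out[1] = 0xC3;
    for (int i = 0; i < 4; ++i) {
        out[2 + i] = (uint8_t)(bits >> (8 * i));   /* little-endian */
    }
    return 6;
}
```

For -5 the short form takes 3 bytes instead of 6, which is the whole benefit: more instructions per 16-byte fetch window at no execution cost.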
Code Padding Using Neutral Code Fillers
Occasionally a need arises to insert neutral code fillers into the code stream, e.g., for code alignment purposes or to space out branches. Since this filler code can be executed, it should take up as few execution resources as possible, not diminish decode density, and not modify any processor state other than advancing EIP. One-byte padding can easily be achieved using the NOP instruction (XCHG EAX, EAX; opcode 0x90). In the x86 architecture, there are several multi-byte "NOP" instructions available that do not change processor state other than EIP:
    MOV    REG, REG
    XCHG   REG, REG
    CMOVcc REG, REG
    SHR    REG, 0
    SAR    REG, 0
    SHL    REG, 0
    SHRD   REG, REG, 0
    SHLD   REG, REG, 0
    LEA    REG, [REG]
    LEA    REG, [REG+00]
    LEA    REG, [REG*1+00]
    LEA    REG, [REG+00000000]
    LEA    REG, [REG*1+00000000]
Not all of these instructions are equally suitable for purposes of code padding. For example, SHLD/SHRD are microcoded, which reduces decode bandwidth and takes up execution resources.
Recommendations for the AMD-K6 Family and AMD Athlon Processor Blended Code
The instructions and instruction sequences presented below are recommended for code padding on both AMD-K6 family processors and the AMD Athlon processor. Note that each of the instructions and instruction sequences below utilizes an x86 register. To avoid performance degradation, the register used in the padding should be selected so as not to lengthen existing dependency chains, i.e., select a register that is not used by instructions in the vicinity of the neutral code filler. Note that certain instructions use registers implicitly. For example, PUSH, POP, CALL, and RET all make implicit use of the ESP register. The 5-byte filler sequence below consists of two instructions. If flag changes across the code padding are acceptable, the following instructions may be used as single-instruction, 5-byte code fillers:
    TEST EAX, 0FFFF0000h
    CMP  EAX, 0FFFF0000h
The following assembly language macros show the recommended neutral code fillers for code optimized for the AMD Athlon processor that also has to run well on other x86 processors. Note that for some padding lengths, versions using ESP or EBP are missing due to the lack of fully generalized addressing modes.
    NOP2_EAX TEXTEQU <DB 08Bh,0C0h> ;MOV EAX, EAX
    NOP2_EBX TEXTEQU <DB 08Bh,0DBh> ;MOV EBX, EBX
    NOP2_ECX TEXTEQU <DB 08Bh,0C9h> ;MOV ECX, ECX
    NOP2_EDX TEXTEQU <DB 08Bh,0D2h> ;MOV EDX, EDX
    NOP2_ESI TEXTEQU <DB 08Bh,0F6h> ;MOV ESI, ESI
    NOP2_EDI TEXTEQU <DB 08Bh,0FFh> ;MOV EDI, EDI
    NOP2_ESP TEXTEQU <DB 08Bh,0E4h> ;MOV ESP, ESP
    NOP2_EBP TEXTEQU <DB 08Bh,0EDh> ;MOV EBP, EBP

    NOP3_EAX TEXTEQU <DB 08Dh,004h,020h> ;LEA EAX, [EAX]
    NOP3_EBX TEXTEQU <DB 08Dh,01Ch,023h> ;LEA EBX, [EBX]
    NOP3_ECX TEXTEQU <DB 08Dh,00Ch,021h> ;LEA ECX, [ECX]
    NOP3_EDX TEXTEQU <DB 08Dh,014h,022h> ;LEA EDX, [EDX]
    NOP3_ESI TEXTEQU <DB 08Dh,034h,026h> ;LEA ESI, [ESI]
    NOP3_EDI TEXTEQU <DB 08Dh,03Ch,027h> ;LEA EDI, [EDI]
    NOP3_ESP TEXTEQU <DB 08Dh,024h,024h> ;LEA ESP, [ESP]
    NOP3_EBP TEXTEQU <DB 08Dh,06Dh,000h> ;LEA EBP, [EBP]

    NOP4_EAX TEXTEQU <DB 08Dh,044h,020h,000h> ;LEA EAX, [EAX+00]
    NOP4_EBX TEXTEQU <DB 08Dh,05Ch,023h,000h> ;LEA EBX, [EBX+00]
    NOP4_ECX TEXTEQU <DB 08Dh,04Ch,021h,000h> ;LEA ECX, [ECX+00]
    NOP4_EDX TEXTEQU <DB 08Dh,054h,022h,000h> ;LEA EDX, [EDX+00]
    NOP4_ESI TEXTEQU <DB 08Dh,074h,026h,000h> ;LEA ESI, [ESI+00]
    NOP4_EDI TEXTEQU <DB 08Dh,07Ch,027h,000h> ;LEA EDI, [EDI+00]
    NOP4_ESP TEXTEQU <DB 08Dh,064h,024h,000h> ;LEA ESP, [ESP+00]

    NOP5_EAX TEXTEQU <DB 08Dh,044h,020h,000h,090h> ;LEA EAX, [EAX+00];NOP
    NOP5_EBX TEXTEQU <DB 08Dh,05Ch,023h,000h,090h> ;LEA EBX, [EBX+00];NOP
    NOP5_ECX TEXTEQU <DB 08Dh,04Ch,021h,000h,090h> ;LEA ECX, [ECX+00];NOP
    NOP5_EDX TEXTEQU <DB 08Dh,054h,022h,000h,090h> ;LEA EDX, [EDX+00];NOP
    NOP5_ESI TEXTEQU <DB 08Dh,074h,026h,000h,090h> ;LEA ESI, [ESI+00];NOP
    NOP5_EDI TEXTEQU <DB 08Dh,07Ch,027h,000h,090h> ;LEA EDI, [EDI+00];NOP
    NOP5_ESP TEXTEQU <DB 08Dh,064h,024h,000h,090h> ;LEA ESP, [ESP+00];NOP

    NOP6_EAX TEXTEQU <DB 08Dh,080h,0,0,0,0> ;LEA EAX, [EAX+00000000]
    NOP6_EBX TEXTEQU <DB 08Dh,09Bh,0,0,0,0> ;LEA EBX, [EBX+00000000]
    NOP6_ECX TEXTEQU <DB 08Dh,089h,0,0,0,0> ;LEA ECX, [ECX+00000000]
    NOP6_EDX TEXTEQU <DB 08Dh,092h,0,0,0,0> ;LEA EDX, [EDX+00000000]
    NOP6_ESI TEXTEQU <DB 08Dh,0B6h,0,0,0,0> ;LEA ESI, [ESI+00000000]
    NOP6_EDI TEXTEQU <DB 08Dh,0BFh,0,0,0,0> ;LEA EDI, [EDI+00000000]
    NOP6_EBP TEXTEQU <DB 08Dh,0ADh,0,0,0,0> ;LEA EBP, [EBP+00000000]

    NOP7_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0> ;LEA EAX, [EAX*1+00000000]
    NOP7_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0> ;LEA EBX, [EBX*1+00000000]
    NOP7_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0> ;LEA ECX, [ECX*1+00000000]
    NOP7_EDX TEXTEQU <DB 08Dh,014h,015h,0,0,0,0> ;LEA EDX, [EDX*1+00000000]
    NOP7_ESI TEXTEQU <DB 08Dh,034h,035h,0,0,0,0> ;LEA ESI, [ESI*1+00000000]
    NOP7_EDI TEXTEQU <DB 08Dh,03Ch,03Dh,0,0,0,0> ;LEA EDI, [EDI*1+00000000]
    NOP7_EBP TEXTEQU <DB 08Dh,02Ch,02Dh,0,0,0,0> ;LEA EBP, [EBP*1+00000000]

    NOP8_EAX TEXTEQU <DB 08Dh,004h,005h,0,0,0,0,90h> ;LEA EAX, [EAX*1+00000000] ;NOP
    NOP8_EBX TEXTEQU <DB 08Dh,01Ch,01Dh,0,0,0,0,90h> ;LEA EBX, [EBX*1+00000000] ;NOP
    NOP8_ECX TEXTEQU <DB 08Dh,00Ch,00Dh,0,0,0,0,90h> ;LEA ECX, [ECX*1+00000000] ;NOP
    NOP8_EDX TEXTEQU <DB 08Dh,014h,015h,0,0,0,0,90h> ;LEA EDX, [EDX*1+00000000] ;NOP
    NOP8_ESI TEXTEQU <DB 08Dh,034h,035h,0,0,0,0,90h> ;LEA ESI, [ESI*1+00000000] ;NOP
    NOP8_EDI TEXTEQU <DB 08Dh,03Ch,03Dh,0,0,0,0,90h> ;LEA EDI, [EDI*1+00000000] ;NOP
    NOP8_EBP TEXTEQU <DB 08Dh,02Ch,02Dh,0,0,0,0,90h> ;LEA EBP, [EBP*1+00000000] ;NOP

    ;JMP
    NOP9 TEXTEQU
Cache and Memory Optimizations
This chapter describes code optimization techniques that take advantage of the large caches and high-bandwidth buses of the AMD Athlon processor. Guidelines are listed in order of importance.
Memory Size and Alignment Issues
Avoid Memory Size Mismatches
Avoid memory size mismatches when different instructions operate on the same data. When one instruction stores and another instruction reloads the same data, keep their operands aligned and keep the loads/stores of each operand the same size. The following code examples result in a store-to-load-forwarding (STLF) stall:
Example (Avoid):
    MOV       DWORD PTR [FOO], EAX
    MOV       DWORD PTR [FOO+4], EDX
    FLD       QWORD PTR [FOO]
Example (Avoid):
    MOV       [FOO], EAX
    MOV       [FOO+4], EDX
    ...
    MOVQ      MM0, [FOO]
Example (Preferred):
    MOV       [FOO], EAX
    MOV       [FOO+4], EDX
    ...
    MOVD      MM0, [FOO]
    PUNPCKLDQ MM0, [FOO+4]
Example (Preferred if stores are close to the load):
    MOVD      MM0, EAX
    MOV       [FOO+4], EDX
    PUNPCKLDQ MM0, [FOO+4]
Avoid large-to-small mismatches, as shown in the following code:
Example (Avoid):
    FST       QWORD PTR [FOO]
    MOV       EAX, DWORD PTR [FOO]
    MOV       EDX, DWORD PTR [FOO+4]
Example (Avoid):
    MOVQ      [foo], MM0
    ...
    MOV       EAX, [foo]
    MOV       EDX, [foo+4]
Example (Preferred):
    MOVD      [foo], MM0
    PSWAPD    MM0, MM0
    MOVD      [foo+4], MM0
    PSWAPD    MM0, MM0
    ...
    MOV       EAX, [foo]
    MOV       EDX, [foo+4]
Example (Preferred if the contents of MM0 are no longer needed):
    MOVD      [foo], MM0
    PUNPCKHDQ MM0, MM0
    MOVD      [foo+4], MM0
    ...
    MOV       EAX, [foo]
    MOV       EDX, [foo+4]
Example (Preferred if stores and loads are close together, Option 1):
    MOVD      EAX, MM0
    PSWAPD    MM0, MM0
    MOVD      EDX, MM0
    PSWAPD    MM0, MM0
Example (Preferred if stores and loads are close together, Option 2):
    MOVD      EAX, MM0
    PUNPCKHDQ MM0, MM0
    MOVD      EDX, MM0
Align Data Where Possible
In general, avoid misaligned data references. All data whose size is a power of 2 is considered aligned if it is naturally aligned. For example:
• WORD accesses are aligned if they access an address divisible by 2.
• DWORD accesses are aligned if they access an address divisible by 4.
• QWORD accesses are aligned if they access an address divisible by 8.
• TBYTE accesses are aligned if they access an address divisible by 8.
A misaligned store or load operation suffers a minimum one-cycle penalty in the AMD Athlon processor load/store pipeline. In addition, using misaligned loads and stores increases the likelihood of encountering a store-to-load forwarding pitfall. For a more detailed discussion of store-to-load forwarding issues, see "Store-to-Load Forwarding Restrictions".
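Natural alignment means that an access of N bytes (N a power of 2) starts at an address divisible by N. A minimal C sketch of that test follows; the function names are illustrative only. TBYTE (10-byte) data is not a power of 2 in size, so it gets its own helper using the quadword rule that this guide applies to TBYTE variables.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* True when an access of `size` bytes at `address` is naturally aligned.
   `size` must be a power of two (2 = WORD, 4 = DWORD, 8 = QWORD). */
int is_naturally_aligned(uintptr_t address, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (address & (size - 1)) == 0;
}

/* TBYTE data is treated as aligned on a quadword boundary. */
int is_tbyte_aligned(uintptr_t address)
{
    return (address & 7u) == 0;
}
```

A check like this is mostly useful in debug builds, for example to assert that a buffer handed to an MMX or x87 routine starts on the boundary the routine assumes.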
Use the 3DNow! PREFETCH and PREFETCHW Instructions
For code that can take advantage of prefetching, use the 3DNow! PREFETCH and PREFETCHW instructions to increase the effective bandwidth to the AMD Athlon processor. These instructions take advantage of the AMD Athlon processor's high bus bandwidth to hide long latencies when fetching data from system memory. The prefetch instructions are essentially integer instructions and can be used anywhere, in any type of code (integer, x87, 3DNow!, MMX, etc.).
Prefetching versus Preloading
Although load instructions can mimic the functionality of prefetch instructions, they do not offer the same performance advantage. Prefetch instructions only update the cache line in the L1/L2 cache and do not update an architectural register. This saves a register compared to a load instruction. Prefetch instructions also do not cause normal instruction retirement to stall. Another benefit of prefetching versus preloading is that prefetch instructions can retire even if the load data has not arrived yet. A regular load used for preloading will stall the machine if it gets to the bottom of the fixed-issue reorder buffer (part of the Instruction Control Unit) and the load data has not arrived yet. The load is "blocking" whereas the prefetch is "non-blocking".
Unit-Stride Access
Large data sets typically require unit-stride access to ensure that all data pulled in by PREFETCH or PREFETCHW is actually used. If necessary, reorganize algorithms or data structures to allow unit-stride access. See "Definitions" under "Determining Prefetch Distance" for a definition of unit-stride access.
PREFETCH/W versus PREFETCHNTA/T0/T1/T2
The PREFETCHNTA/T0/T1/T2 instructions in the MMX extensions are processor implementation dependent. If a developer needs to maintain compatibility with the millions of AMD-K6-2 and AMD-K6-III processors already sold, use the 3DNow! PREFETCH/W instructions instead of the various prefetch instructions that are new MMX extensions.
PREFETCHW Usage
Code that intends to modify the cache line brought in through prefetching should use the PREFETCHW instruction. While PREFETCHW works the same as PREFETCH on the AMD-K6-2 and AMD-K6-III processors, PREFETCHW gives a hint to the AMD Athlon processor of an intent to modify the cache line. The AMD Athlon processor marks the cache line being brought in by PREFETCHW as modified. Using PREFETCHW can save an additional 15-25 cycles compared to a PREFETCH and the subsequent cache state change caused by a write to the prefetched cache line. Only use PREFETCHW if there will be a write to the same cache line soon afterwards.
Multiple Prefetches
Programmers can initiate multiple outstanding prefetches on the AMD Athlon processor. While the AMD-K6-2 and AMD-K6-III processors can have only one outstanding prefetch, the AMD Athlon processor can have up to six outstanding prefetches. When all six buffers are filled by various memory read requests, the processor will simply ignore any new prefetch requests until a buffer frees up. Multiple prefetch requests are essentially handled in-order. Prefetch data in the order that it is needed. The following example shows how to initiate multiple prefetches when traversing more than one array.
Example (Multiple Prefetches Code):
    .CODE
    .K3D
    .686

    ; original C code
    ;
    ; #define LARGE_NUM 65536
    ; #define ARR_SIZE (LARGE_NUM*8)
    ;
    ; double array_a[LARGE_NUM];
    ; double array_b[LARGE_NUM];
    ; double array_c[LARGE_NUM];
    ; int i;
    ;
    ; for (i = 0; i < LARGE_NUM; i++) {
    ;     a[i] = b[i] * c[i];
    ; }

    MOV  ECX, (-LARGE_NUM)     ;used biased index
    MOV  EAX, OFFSET array_a   ;get address of array_a
    MOV  EDX, OFFSET array_b   ;get address of array_b
    MOV  EBX, OFFSET array_c   ;get address of array_c (kept apart from the index in ECX)

    $loop:
    PREFETCHW [EAX+ECX*8+ARR_SIZE+128]          ;two cachelines ahead
    PREFETCH  [EDX+ECX*8+ARR_SIZE+128]          ;two cachelines ahead
    PREFETCH  [EBX+ECX*8+ARR_SIZE+128]          ;two cachelines ahead
    FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE]         ;b[i]
    FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE]         ;b[i]*c[i]
    FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE]         ;a[i] = b[i]*c[i]
    FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+8]       ;b[i+1]
    FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+8]       ;b[i+1]*c[i+1]
    FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+8]       ;a[i+1] = b[i+1]*c[i+1]
    FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+16]      ;b[i+2]
    FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+16]      ;b[i+2]*c[i+2]
    FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+16]      ;a[i+2] = b[i+2]*c[i+2]
    FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+24]      ;b[i+3]
    FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+24]      ;b[i+3]*c[i+3]
    FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+24]      ;a[i+3] = b[i+3]*c[i+3]
    FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+32]      ;b[i+4]
    FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+32]      ;b[i+4]*c[i+4]
    FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+32]      ;a[i+4] = b[i+4]*c[i+4]
    FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+40]      ;b[i+5]
    FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+40]      ;b[i+5]*c[i+5]
    FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+40]      ;a[i+5] = b[i+5]*c[i+5]
    FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+48]      ;b[i+6]
    FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+48]      ;b[i+6]*c[i+6]
    FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+48]      ;a[i+6] = b[i+6]*c[i+6]
    FLD  QWORD PTR [EDX+ECX*8+ARR_SIZE+56]      ;b[i+7]
    FMUL QWORD PTR [EBX+ECX*8+ARR_SIZE+56]      ;b[i+7]*c[i+7]
    FSTP QWORD PTR [EAX+ECX*8+ARR_SIZE+56]      ;a[i+7] = b[i+7]*c[i+7]
    ADD  ECX, 8                                 ;next 8 products
    JNZ  $loop                                  ;until none left
The following optimization rules were applied to this example:
• Partially unroll loops to ensure that the data stride per loop iteration is equal to the length of a cache line. This avoids overlapping PREFETCH instructions and thus makes optimal use of the available number of outstanding PREFETCHes.
• Since the array "array_a" is written rather than read, use PREFETCHW instead of PREFETCH to avoid the overhead of switching cache lines to the correct MESI state. The PREFETCH lookahead is optimized such that each loop iteration is working on three cache lines while six active PREFETCHes bring in the next six cache lines.
• Reduce index arithmetic to a minimum by use of complex addressing modes and biasing of the array base addresses in order to cut down on loop overhead.
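The same structure can be sketched at the C level. The example below is an illustration under stated assumptions, not the guide's own code: it relies on the GCC/Clang builtin `__builtin_prefetch` (falling back to a no-op elsewhere), assumes 64-byte cache lines and 8-byte doubles, and uses names chosen for this sketch. Whether a compiler emits PREFETCH/PREFETCHW for the builtin depends on the target options.

```c
#include <assert.h>
#include <stddef.h>

#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_READ(p)  __builtin_prefetch((p), 0)   /* intent: read  */
#define PREFETCH_WRITE(p) __builtin_prefetch((p), 1)   /* intent: write */
#else
#define PREFETCH_READ(p)  ((void)(p))
#define PREFETCH_WRITE(p) ((void)(p))
#endif

#define DOUBLES_PER_LINE 8   /* assumed: 64-byte line / 8-byte double */
#define LINES_AHEAD      2   /* prefetch two cache lines ahead */

/* a[i] = b[i] * c[i], processed one cache line per outer iteration. */
void multiply_arrays(double *a, const double *b, const double *c, size_t n)
{
    const size_t ahead = (size_t)LINES_AHEAD * DOUBLES_PER_LINE;

    assert(n % DOUBLES_PER_LINE == 0);
    for (size_t i = 0; i < n; i += DOUBLES_PER_LINE) {
        if (i + ahead < n) {               /* stay inside the arrays */
            PREFETCH_WRITE(&a[i + ahead]); /* a is written: write intent */
            PREFETCH_READ(&b[i + ahead]);
            PREFETCH_READ(&c[i + ahead]);
        }
        for (size_t j = 0; j < DOUBLES_PER_LINE; ++j) {
            a[i + j] = b[i + j] * c[i + j];
        }
    }
}
```

One prefetch per array per cache line matches the first rule above: the inner loop consumes exactly one line of each array, so no two prefetches target the same line.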
Determining Prefetch Distance
When determining how far ahead to prefetch, the basic guideline is to initiate the prefetch early enough so that the data is in the cache by the time it is needed, under the constraint that there can't be more than six PREFETCHes in flight at any given time. As processors achieve faster speeds, the second constraint starts to limit how far ahead a programmer can PREFETCH.
Formula
Given the latency of a typical AMD Athlon processor system and expected processor speeds, use the following formula to determine the prefetch distance in bytes for a single array:
    Prefetch Distance = 200 (DS/C) bytes
• Round up to the nearest 64-byte cache line.
• The number 200 is a constant based upon expected AMD Athlon processor clock frequencies and typical system memory latencies.
• DS is the data stride in bytes per loop iteration.
• C is the number of cycles for one loop iteration to execute entirely from the L1 cache.
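As a worked illustration of the prefetch-distance calculation, the C sketch below scales the ratio of data stride to loop cycles by a latency constant and rounds the result up to a whole 64-byte cache line. The function name is hypothetical, and the latency constant is passed in as a parameter rather than hard-coded, since it depends on clock frequency and memory latency.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_BYTES 64u

/* Prefetch distance in bytes for a single array:
   latency_constant * (data_stride_bytes / loop_cycles),
   rounded up to a multiple of the 64-byte cache line. */
size_t prefetch_distance_bytes(size_t latency_constant,
                               size_t data_stride_bytes,
                               size_t loop_cycles)
{
    size_t raw;

    assert(loop_cycles > 0);
    /* Ceiling division, so short distances do not round down to zero. */
    raw = (latency_constant * data_stride_bytes + loop_cycles - 1) / loop_cycles;
    return (raw + CACHE_LINE_BYTES - 1) / CACHE_LINE_BYTES * CACHE_LINE_BYTES;
}
```

For example, with a constant of 200, a 64-byte stride, and a loop that takes 60 cycles per iteration out of the L1 cache, the raw value is about 213 bytes, which rounds up to 256 bytes (four cache lines ahead).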
Programmers should isolate the loop and have it work on a data set that fits in the L1 cache in order to determine the L1 loop time:
    L1_loop_time = (execution time in cycles of N loop iterations) / N
Where multiple arrays are being prefetched, the prefetch distance usually needs to be increased over what the above formula suggests, as prefetches for one array are delayed by prefetches to a different array.
Definitions
Unit-stride access refers to a memory access pattern where consecutive memory accesses are to consecutive array elements, in ascending or descending order. If the arrays are made of elemental types, then it implies adjacent memory locations as well. For example:
    char j, k[MAX];
    for (i=0; i<MAX; i++) {
        ...
        j += k[i];
        ...
    }

    double x, y[MAX];
    for (i=0; i<MAX; i++) {
        ...
        x += y[i];
        ...
    }
Exception to Unit Stride
The unit-stride concept works well when stepping through arrays of elementary data types. In some instances, unit stride alone may not be sufficient to determine how to use PREFETCH properly. For example, assume a large vertex structure and code that steps through the vertices in unit stride, but uses only a few components, each being of type float (e.g., only the first bytes of each vertex). In this case, the prefetch distance obviously should be some function of the data size of the structure (for a properly chosen "n"):
    PREFETCH [EAX+n*STRUCTURE_SIZE]
    ...
    ADD EAX, STRUCTURE_SIZE
Programmers may need to experiment to find the optimal prefetch distance; there is no formula that works for all situations.
Data Stride per Loop Iteration
Assuming unit-stride access to a single array, the data stride of a loop refers to the number of bytes accessed in the array per loop iteration. For example:
    FLDZ
    $add_loop:
    FADD QWORD PTR [EBX*8+base_address]
    DEC  EBX
    JNZ  $add_loop
The data stride of the above loop is 8 bytes. In general, for optimal use of prefetch, the data stride per iteration is the length of a cache line (64 bytes in the AMD Athlon processor). If the "loop stride" is smaller, unroll the loop. Note that this can be unfeasible if the original loop stride is very small, e.g., only a few bytes.
Prefetch at Least 64 Bytes Away from Surrounding Stores
The PREFETCH and PREFETCHW instructions can be affected by false dependencies on stores. If there is a store to an address that matches a request, that request (the PREFETCH or PREFETCHW instruction) may be blocked until the store is written to the cache. Therefore, code should prefetch data that is located at least 64 bytes away from any surrounding store's data address.
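Take Advantage of Write Combining
Operating system and device driver programmers should take advantage of the write-combining capabilities of the AMD Athlon processor. The AMD Athlon processor has a very aggressive write-combining algorithm, which can improve performance significantly. See the appendix "Implementation of Write Combining" for more details.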
Avoid Placing Code and Data in the Same 64-Byte Cache Line
Sharing code and data in the same 64-byte cache line may cause the L1 caches to thrash (unnecessary castout of code/data) in order to maintain coherency between the separate instruction and data caches. The AMD Athlon processor has a cache-line size of 64 bytes, which is twice the size of previous processors. Programmers must be aware that code and data should not be shared within this larger cache line, especially if the data becomes modified. For example, programmers should consider that a memory indirect JMP instruction may have the data for the jump table residing in the same 64-byte cache line as the JMP instruction, which would result in lower performance. Although unlikely to matter in most code, do not place critical code at the border between 32-byte aligned code segments and data segments. The code at the start or end of your data segment should be executed as infrequently as possible or simply padded with garbage. In general, avoid the following:
• self-modifying code
• storing data in code segments
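Whether a piece of data shares a cache line with code is a matter of comparing line indices. The C sketch below (helper names invented for this illustration, 64-byte line size taken from the text) shows the check together with the rounding a linker script or assembler directive would apply to push data onto the next line.

```c
#include <assert.h>
#include <stdint.h>

#define CACHE_LINE_BYTES 64u

/* True when both addresses fall into the same 64-byte cache line. */
int same_cache_line(uintptr_t a, uintptr_t b)
{
    return (a / CACHE_LINE_BYTES) == (b / CACHE_LINE_BYTES);
}

/* First address at or after `address` that starts a new cache line.
   Placing a jump table here keeps it out of the line holding the code
   that precedes it. */
uintptr_t next_line_start(uintptr_t address)
{
    return (address + (CACHE_LINE_BYTES - 1)) & ~(uintptr_t)(CACHE_LINE_BYTES - 1);
}
```

For a jump table that follows an indirect JMP, rounding the table's start address with `next_line_start` guarantees that the table and the JMP never share a line, at the cost of up to 63 bytes of padding.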
Store-to-Load Forwarding Restrictions
Store-to-load forwarding refers to the process of a load reading (forwarding) data from the store buffer (LS2). There are instances in the AMD Athlon processor load/store architecture when either a load operation is not allowed to read needed data from a store in the store buffer, or a load OP detects a false data dependency on a store in the store buffer. In either case, the load cannot complete (load the needed data into a register) until the store has retired out of the store buffer and written to the data cache. A store-buffer entry cannot retire and write to the data cache until every instruction before the store has completed and retired from the reorder buffer.
The implication of this restriction is that all instructions in the reorder buffer, up to and including the store, must complete and retire out of the reorder buffer before the load can complete. Effectively, the load has a false dependency on every instruction up to the store. Due to the significant depth of the AMD Athlon processor's LS buffer, any load dependent on a store that cannot bypass data through the LS buffer may experience significant delays of up to tens of clock cycles, where the exact delay is a function of pipeline conditions. The following sections describe store-to-load forwarding examples that are acceptable and those to avoid.
Store-to-Load Forwarding Pitfalls: True Dependencies
A load is allowed to read data from the store-buffer entry only if all of the following conditions are satisfied:
• The start address of the load matches the start address of the store.
• The load operand size is equal to or smaller than the store operand size.
• Neither the load nor the store is misaligned.
• The store data is not from a high-byte register (AH, BH, CH, or DH).
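The four forwarding conditions can be written down as a simple predicate. The C sketch below is a software model for reasoning about code sequences, not a description of the hardware implementation; the `mem_access` type and function names are invented for this illustration, and operand sizes are restricted to 1, 2, 4, or 8 bytes.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t address;        /* start address of the access */
    unsigned size;           /* operand size in bytes: 1, 2, 4 or 8 */
    int      high_byte_reg;  /* nonzero if store data comes from AH, BH, CH or DH */
} mem_access;

/* Natural alignment: the address is a multiple of the operand size. */
int is_aligned(const mem_access *m)
{
    assert(m->size == 1 || m->size == 2 || m->size == 4 || m->size == 8);
    return (m->address % m->size) == 0;
}

/* Returns 1 if the load may read its data from the store-buffer entry,
   0 if it has to wait until the store has been written to the cache. */
int can_forward(const mem_access *store, const mem_access *load)
{
    if (load->address != store->address)         return 0; /* same start address */
    if (load->size > store->size)                return 0; /* load not wider     */
    if (!is_aligned(store) || !is_aligned(load)) return 0; /* no misalignment    */
    if (store->high_byte_reg)                    return 0; /* no AH/BH/CH/DH     */
    return 1;
}
```

Under this model, a doubleword load from the start of an aligned quadword store forwards, while a word load from the upper half of a doubleword store does not, because the start addresses differ.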
The following sections describe common-case scenarios to avoid whereby a load has a true dependency on an LS2-buffered store but cannot read (forward) data from a store-buffer entry.
Narrow-to-Wide Store-Buffer Data Forwarding Restriction
If the following conditions are present, there is a narrow-to-wide store-buffer data forwarding restriction:
• The operand size of the store data is smaller than the operand size of the load data.
• The range of addresses spanned by the store data covers some sub-region of the range of addresses spanned by the load data.
Avoid the type of code shown in the following two examples.
Example (Avoid):
    MOV EAX, 10h
    MOV WORD PTR [EAX], BX       ;word store
    ...
    MOV ECX, DWORD PTR [EAX]     ;doubleword load
                                 ;cannot forward upper byte from store buffer
Example (Avoid):
    MOV EAX, 10h
    MOV BYTE PTR [EAX + 3], BL   ;byte store
    ...
    MOV ECX, DWORD PTR [EAX]     ;doubleword load
                                 ;cannot forward upper byte from store buffer
Wide-to-Narrow Store-Buffer Data Forwarding Restriction
If the following conditions are present, there is a wide-to-narrow store-buffer data forwarding restriction:
• The operand size of the store data is greater than the operand size of the load data.
• The start address of the store data does not match the start address of the load.
Example (Avoid):
    MOV EAX, 10h
    ADD DWORD PTR [EAX], EBX     ;doubleword store
    MOV CX, WORD PTR [EAX + 2]   ;word load-cannot forward high word from store buffer
Example (Avoid):
    MOVQ      [foo], MM1     ;store upper and lower half
    ...
    ADD       EAX, [foo]     ;fine
    ADD       EDX, [foo+4]   ;not good!
Example (Preferred):
    MOVD      [foo], MM1     ;store lower half
    PUNPCKHDQ MM1, MM1       ;get upper half into lower half
    MOVD      [foo+4], MM1   ;store lower half
    ...
    ADD       EAX, [foo]     ;fine
    ADD       EDX, [foo+4]   ;fine
Misaligned Store-Buffer Data Forwarding Restriction
If the following condition is present, there is a misaligned store-buffer data forwarding restriction:
• The store or load address is misaligned. For example, a quadword store is not aligned to a quadword boundary, a doubleword store is not aligned to a doubleword boundary, etc.
A common case of misaligned store-data forwarding involves the passing of misaligned quadword floating-point data on the doubleword-aligned integer stack. Avoid the type of code shown in the following example.
Example (Avoid):
    MOV  ESP, 24h
    FSTP QWORD PTR [ESP]     ;esp=24
    ...                      ;store occurs to quadword misaligned address
    FLD  QWORD PTR[ESP]      ;quadword load cannot forward from quadword misaligned 'fstp[esp]' store OP
High-Byte Store-Buffer Data Forwarding Restriction
If the following condition is present, there is a high-byte store-data buffer forwarding restriction:
• The store data is from a high-byte register (AH, BH, CH, DH).
Avoid the type of code shown in the following example.
Example (Avoid):
    MOV EAX, 10h
    MOV [EAX], BH            ;high-byte store
    ...
    MOV DL, [EAX]            ;load cannot forward from high-byte store
One Supported Store-to-Load Forwarding Case
There is one case of mismatched store-to-load forwarding that is supported by the AMD Athlon processor. The lower 32 bits from an aligned QWORD write feeding into a DWORD read is allowed.
Example (Allowed):
    MOVQ [AlignedQword], MM0
    ...
    MOV  EAX, [AlignedQword]
Summary of Store-to-Load Forwarding Pitfalls to Avoid
To avoid store-to-load forwarding pitfalls, conform code to the following guidelines:
• Maintain consistent use of operand size across all loads and stores. Preferably, use doubleword or quadword operand sizes.
• Avoid misaligned data references.
• Avoid narrow-to-wide and wide-to-narrow forwarding cases.
• When using word or byte stores, avoid loading data from anywhere in the same doubleword of memory other than the identical start addresses of the stores.
Stack Alignment Considerations
Make sure the stack is suitably aligned for the local variable with the largest base type. Then, using the technique described in "C Language Structure Component Considerations", all variables can be properly aligned with no padding.
Extend to 32 Bits Before Pushing onto Stack
Function arguments smaller than 32 bits should be extended to 32 bits before being pushed onto the stack, which ensures that the stack is always doubleword aligned on entry to a function. If a function has no local variables with a base type larger than a doubleword, no further work is necessary. If the function does have local variables whose base type is larger than a doubleword, insert additional code to ensure proper alignment of the stack. For example, the following code achieves quadword alignment:
Example (Preferred):
    Prologue:
        PUSH EBP
        MOV  EBP, ESP
        SUB  ESP, SIZE_OF_LOCALS   ;size of local variables
        AND  ESP, -8
        ;push registers that need to be preserved
    Epilogue:
        ;pop registers that needed to be preserved
        LEAVE
        RET
With this technique, function arguments can be accessed via EBP, and local variables can be accessed via ESP. In order to free EBP for general use, it needs to be saved and restored between the prologue and the epilogue.
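Rounding a stack pointer down to a quadword boundary amounts to clearing its low three bits, which is what masking with -8 (all ones except the low three bits) does. The C sketch below shows the arithmetic with a hypothetical helper; it rounds down because the x86 stack grows toward lower addresses, so rounding down only ever reserves extra space.

```c
#include <assert.h>
#include <stdint.h>

/* Round `value` down to a multiple of `alignment` (a power of two).
   For alignment 8 this is the same as value & -8. */
uintptr_t align_down(uintptr_t value, uintptr_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return value & ~(alignment - 1);
}
```

Because the amount subtracted by the rounding is not known at assembly time, locals are addressed relative to the aligned ESP while arguments are still reached through EBP.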
Align TBYTE Variables on Quadword Aligned Addresses
Align variables of type TBYTE on quadword aligned addresses. In order to make an array of TBYTE variables that are aligned, place the array elements 16 bytes apart. In general, TBYTE variables should be avoided. Use double-precision variables instead.
C Language Structure Component Considerations
Structures ("struct" in C language) should be made the size of a multiple of the largest base type of any of their components. To meet this requirement, use padding where necessary. This ensures that all elements of an array of structures are properly aligned, provided the array itself is properly aligned. To minimize padding, sort and allocate structure components (language definitions permitting) such that components with a larger base type are allocated ahead of those with a smaller base type. For example, consider the following code:
Example:
  struct {
    char a[5];
    long k;
    double x;
  } baz;
Allocate the structure components (lowest to highest address) as follows:
  x, k, a[4], a[3], a[2], a[1], a[0], padbyte6, ..., padbyte0
See "C Language Structure Component Considerations" for more information from a C source code perspective.
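The effect of sorting components by base type size can be checked at the C level with `sizeof` and `offsetof`. The sketch below uses two hypothetical structures (the names `unsorted` and `sorted` and their members are illustrative, not from the guide) that hold the same 15 bytes of payload; on common x86 ABIs the sorted layout needs less padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layouts for illustration; same members, different order. */
struct unsorted {
    uint8_t a;
    double  x;   /* padding is typically inserted before x */
    uint8_t b;
    int32_t k;   /* and before k */
    uint8_t c;   /* and after c */
};

struct sorted {
    double  x;   /* largest base type first */
    int32_t k;
    uint8_t a;
    uint8_t b;
    uint8_t c;   /* only tail padding remains */
};

/* The structure size is a multiple of its largest base type's alignment,
   so every element of an aligned array of structures stays aligned. */
static_assert(sizeof(struct sorted) % _Alignof(double) == 0,
              "sorted struct size must be a multiple of double alignment");

/* Bytes spent on padding in each layout (payload is 8 + 4 + 3 = 15). */
size_t padding_unsorted(void) { return sizeof(struct unsorted) - 15; }
size_t padding_sorted(void)   { return sizeof(struct sorted) - 15; }
```

The exact sizes depend on the compiler and ABI, so the sketch compares the two layouts rather than assuming specific byte counts.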
Sort Variables According to Base Type Size
Sort local variables according to their base type size and allocate variables with a larger base type size ahead of those with a smaller base type size. Assuming the first variable allocated is naturally aligned, all other variables are naturally aligned without any padding. The following example is a declaration of local variables in a C function:
Example:
  short ga, gu, gi;
  long foo, bar;
  double x, y, z[3];
  char a, b;
  float baz;
Allocate the variables in the following order from left to right (from higher to lower addresses):
  x, y, z[2], z[1], z[0], foo, bar, baz, ga, gu, gi, a, b
See "Sort Local Variables According to Base Type Size" for more information from a C source code perspective.
Branch Optimizations
While the AMD Athlon processor contains a very sophisticated branch unit, certain optimizations increase the effectiveness of the branch prediction unit. This chapter discusses rules that improve branch prediction and minimize branch penalties. Guidelines are listed in order of importance.
Avoid Branches Dependent on Random Data
Avoid conditional branches depending on random data, as these are difficult to predict. For example, a piece of code receives a random stream of characters "A" through "Z" and branches if the character is before "M" in the collating sequence. Data-dependent branches acting upon basically random data cause the branch prediction logic to mispredict the branch about 50% of the time. If possible, design branch-free alternative code sequences, which result in shorter average execution time. This technique is especially important if the branch body is small. The examples under "AMD Athlon Processor Specific Code" illustrate this concept using the CMOV instruction. Note that the AMD-K6 processor does not support the CMOV instruction. Therefore, blended AMD-K6 and AMD Athlon processor code should use the examples under "Blended AMD-K6 and AMD Athlon Processor Code".
AMD Athlon Processor Specific Code
Example: Signed integer ABS function (X = labs(X)):
  MOV ECX, [X] ;load value
  MOV EBX, ECX ;save value
  NEG ECX ;-value
  CMOVS ECX, EBX ;if -value is negative, select value
  MOV [X], ECX ;save labs result
Example: Unsigned integer min function (z = x < y ? x : y):
  MOV EAX, [X] ;load X value
  MOV EBX, [Y] ;load Y value
  CMP EAX, EBX ;EBX<=EAX ? CF=0 : CF=1
  CMOVNC EAX, EBX ;EAX=(EBX<=EAX) ? EBX:EAX
  MOV [Z], EAX ;save min (X,Y)
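Blended AMD-K6 and AMD Athlon Processor Code
Example: Signed integer ABS function (X = labs(X)):
  MOV ECX, [X] ;load value
  MOV EBX, ECX ;save value
  SAR ECX, 31 ;x < 0 ? 0xffffffff : 0
  XOR EBX, ECX ;x < 0 ? ~x : x
  SUB EBX, ECX ;x < 0 ? (~x)+1 : x
  MOV [X], EBX ;x < 0 ? -x : x
Example: Unsigned integer min function (z = x < y ? x : y):
  MOV EAX, [x] ;load x
  MOV EBX, [y] ;load y
  SUB EAX, EBX ;x < y ? CF : NC ; x - y
  SBB ECX, ECX ;x < y ? 0xffffffff : 0
  AND ECX, EAX ;x < y ? x - y : 0
  ADD ECX, EBX ;x < y ? x - y + y : y
  MOV [z], ECX ;x < y ? x : y
Example: Hexadecimal to ASCII conversion (y = x < 10 ? x + 0x30 : x - 10 + 0x41):
  MOV AL, [X] ;load X value
  CMP AL, 10 ;if x is less than 10, set carry flag
  SBB AL, 69h ;0..9 -> 96h..9Fh, Ah..Fh -> A1h..A6h
  DAS ;0..9: subtract 66h, Ah..Fh: subtract 60h
  MOV [Y], AL ;save conversion in y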
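The blended sequences rely on turning a comparison into a mask that is either all ones (0xffffffff) or zero and then combining values with that mask. The following is a minimal C sketch of the same idea, assuming 32-bit two's complement integers; the function names are hypothetical, and `hex_digit` uses a mask rather than the SBB/DAS sequence. A compiler is free to translate these expressions however it sees fit, so this shows the arithmetic, not guaranteed instruction selection.

```c
#include <assert.h>
#include <stdint.h>

/* abs via sign mask: mask is 0xffffffff when x is negative, else 0.
   For INT32_MIN the final conversion is implementation-defined,
   just as labs() cannot represent that result. */
int32_t abs_branchless(int32_t x)
{
    uint32_t ux   = (uint32_t)x;
    uint32_t mask = (uint32_t)0 - (uint32_t)(x < 0);
    return (int32_t)((ux ^ mask) - mask);       /* (~x)+1 or x */
}

/* Unsigned min: (x - y) is kept only when x < y, then y is added back. */
uint32_t min_branchless(uint32_t x, uint32_t y)
{
    uint32_t diff = x - y;                      /* wraps when x < y */
    uint32_t mask = (uint32_t)0 - (uint32_t)(x < y);
    return (diff & mask) + y;                   /* x-y+y or 0+y */
}

/* Hex digit: add 0x30, plus 7 more for values 10..15 so that 10 -> 'A'. */
char hex_digit(uint32_t x)
{
    assert(x < 16);
    uint32_t ge10 = (uint32_t)0 - (uint32_t)(x >= 10);
    return (char)(x + 0x30 + (ge10 & 7));
}
```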
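Example: Increment Ring Buffer Offset:
  //C Code
  char buf[BUFSIZE];
  int a;
  if (a < (BUFSIZE-1)) {
    a++;
  } else {
    a = 0;
  }
  ;-------------
  ;Assembly Code
  MOV EAX, [a] ;old offset
  CMP EAX, (BUFSIZE-1) ;a < (BUFSIZE-1) ? CF : NC
  INC EAX ;a++
  SBB EDX, EDX ;a < (BUFSIZE-1) ? 0xffffffff : 0
  AND EAX, EDX ;a < (BUFSIZE-1) ? a++ : 0
  MOV [a], EAX ;store new offset
Example: Integer Signum Function:
  //C Code
  int a, s;
  if (!a) {
    s = 0;
  } else if (a < 0) {
    s = -1;
  } else {
    s = 1;
  }
  ;-------------
  ;Assembly Code
  MOV EAX, [a] ;load a
  CDQ ;t = a < 0 ? 0xffffffff : 0
  CMP EDX, EAX ;a > 0 ? CF : NC
  ADC EDX, 0 ;a > 0 ? t+1 : t
  MOV [s], EDX ;signum(x)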
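Both the ring buffer increment and the signum function can also be written at the C level without an if/else chain. The sketch below is one such formulation; `BUFSIZE` is given an assumed value of 16 and the function names are hypothetical. Whether the compiler emits branch-free code for it is not guaranteed.

```c
#include <assert.h>
#include <stdint.h>

#define BUFSIZE 16u   /* assumed buffer size for this sketch */

/* Next ring buffer offset: a+1, or 0 when a is the last valid offset. */
uint32_t ring_next(uint32_t a)
{
    assert(a < BUFSIZE);
    uint32_t mask = (uint32_t)0 - (uint32_t)(a < (BUFSIZE - 1u));
    return (a + 1u) & mask;                     /* a+1 or 0 */
}

/* Signum: -1, 0 or 1, built from two comparison results. */
int32_t signum(int32_t a)
{
    return (int32_t)(a > 0) - (int32_t)(a < 0);
}
```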
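Example: Conditional Write:
  //C Code
  int a, b, i, dummy, c[BUFSIZE];
  if (a < b) {
    c[i++] = a;
  }
  ;-------------
  ;Assembly Code
  LEA ESI, [dummy] ;&dummy
  XOR ECX, ECX ;i = 0
  ...
  LEA EDI, [c+ECX*4] ;&c[i]
  LEA EDX, [ECX+1] ;i++
  CMP EAX, EBX ;a < b ?
  CMOVGE EDI, ESI ;ptr = (a >= b) ? &dummy : &c[i]
  CMOVL ECX, EDX ;i = (a < b) ? i+1 : i
  MOV [EDI], EAX ;*ptr = a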
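The conditional write works by always performing the store and only choosing where it lands: the real array slot when the condition holds, a dummy location otherwise. The following is a minimal C sketch of that pattern; the function name `append_if_less` and its signature are hypothetical, and the caller is assumed to guarantee that `c` has room for one more element.

```c
#include <assert.h>
#include <stddef.h>

/* Stores a into c[*i] and advances *i only when a < b.
   The store itself is unconditional; an unwanted store goes to dummy. */
void append_if_less(int a, int b, int *c, size_t *i)
{
    assert(c != NULL && i != NULL);
    int  dummy;                             /* sink for the unwanted store */
    int  take = (a < b);                    /* 1 or 0 */
    int *target[2] = { &dummy, &c[*i] };    /* [0] = discard, [1] = keep */
    *target[take] = a;
    *i += (size_t)take;
}
```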
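Always Pair CALL and RETURN
When the 12-entry return-address stack gets out of synchronization, the latency of returns increases. The return-address stack becomes out of sync when:
  calls and returns do not match
  the depth of the return-address stack is exceeded because of too many levels of nested function calls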
Replace Branches with Computation in 3DNow! Code
Branches negatively impact the performance of 3DNow! code. Branches can operate only on one data item at a time, i.e., they are inherently scalar and inhibit the SIMD processing that makes 3DNow! code superior. Also, branches based on 3DNow! comparisons require data to be passed to the integer units, which requires either transport through memory or the use of "MOVD reg, MMreg" instructions. If the body of the branch is small, one can achieve higher performance by replacing the branch with computation. The computation simulates predicated execution or conditional moves. The principal tools for this are the following instructions: PCMPGT, PFCMPGT, PFCMPGE, PFMIN, PFMAX, PAND, PANDN, POR, PXOR.
Muxing Constructs
The most important construct for avoiding branches in 3DNow! and MMX code is a 2-way muxing construct that is equivalent to the ternary operator "?:" in C and C++. It is implemented using the PCMP/PFCMP, PAND, PANDN, and POR instructions. To maximize performance, it is important to apply the PAND and PANDN instructions in the proper order.
Example (Avoid):
  ; r = (x < y) ? a : b
  ;
  ; in:  MM0 = a
  ;      MM1 = b
  ;      MM2 = x
  ;      MM3 = y
  ; out: MM0 = r
  PCMPGTD MM3, MM2 ; y > x ? 0xffffffff : 0
  MOVQ MM4, MM3 ; duplicate mask
  PANDN MM3, MM1 ; y > x ? 0 : b
  PAND MM0, MM4 ; y > x ? a : 0
  POR MM0, MM3 ; r = y > x ? a : b
Because the use of PANDN destroys the mask created by PCMP, the mask needs to be saved, which requires an additional register. This adds an instruction, lengthens the dependency chain, and increases register pressure. Therefore, write 2-way muxing constructs as follows.
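Example (Preferred):
  ; r = (x < y) ? a : b
  ;
  ; in:  MM0 = a
  ;      MM1 = b
  ;      MM2 = x
  ;      MM3 = y
  ; out: MM0 = r
  PCMPGTD MM3, MM2 ; y > x ? 0xffffffff : 0
  PAND MM0, MM3 ; y > x ? a : 0
  PANDN MM3, MM1 ; y > x ? 0 : b
  POR MM0, MM3 ; r = y > x ? a : b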
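The 2-way mux can be modeled on a single 32-bit lane in C: a compare produces an all-ones or all-zeros mask, and AND, AND-NOT, and OR select between the two inputs. This is a scalar sketch for illustration only (the function names are hypothetical); the MMX instructions apply the same operations to every lane of a register at once.

```c
#include <assert.h>
#include <stdint.h>

/* Select a where mask is all ones, b where mask is all zeros. */
uint32_t mux(uint32_t mask, uint32_t a, uint32_t b)
{
    assert(mask == 0u || mask == 0xffffffffu);
    return (a & mask) | (b & ~mask);            /* PAND, PANDN, POR */
}

/* r = (x < y) ? a : b, with the mask built from a signed compare
   in the manner of PCMPGTD (y > x ? 0xffffffff : 0). */
uint32_t mux_lt(int32_t x, int32_t y, uint32_t a, uint32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(y > x);
    return mux(mask, a, b);
}
```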
Sample Code Translated into 3DNow! Code
The following examples use scalar code translated into 3DNow! code. Note that it is not recommended to use 3DNow! SIMD instructions for scalar code, because the advantage of 3DNow! instructions lies in their "SIMDness". These examples are meant to demonstrate general techniques for translating source code with branches into branchless 3DNow! code. Scalar source code was chosen to keep the examples simple. These techniques work in an identical fashion for vector code. Each example shows the C code and the resulting 3DNow! code.
Example:
C code:
  float x,y,z;
  if (x < y) {
    z += 1.0;
  } else {
    z -= 1.0;
  }
3DNow! code:
  ;in:  MM0 = x
  ;     MM1 = y
  ;     MM2 = z
  ;out: MM0 = z
  MOVQ MM3, MM0 ;save x
  MOVQ MM4, one ;1.0
  PFCMPGE MM0, MM1 ;x < y ? 0 : 0xffffffff
  PSLLD MM0, 31 ;x < y ? 0 : 0x80000000
  PXOR MM0, MM4 ;x < y ? 1.0 : -1.0
  PFADD MM0, MM2 ;x < y ? z+1.0 : z-1.0
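The key step in this translation is that negating a floating-point constant only requires flipping its sign bit, so the comparison result can be shifted into bit 31 and XORed into 1.0. The following scalar C sketch reproduces that step, assuming `float` is an IEEE-754 binary32 value; the function name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static_assert(sizeof(float) == sizeof(uint32_t), "binary32 float assumed");

/* Returns z + 1.0 when x < y, otherwise z - 1.0, by conditionally
   flipping the sign bit of the constant 1.0 before a single add. */
float add_or_sub_one(float x, float y, float z)
{
    float    one = 1.0f;
    uint32_t bits;
    memcpy(&bits, &one, sizeof bits);          /* bit pattern of 1.0 */
    bits ^= (uint32_t)(x >= y) << 31;          /* 0x80000000 when x >= y */
    memcpy(&one, &bits, sizeof one);           /* now 1.0 or -1.0 */
    return z + one;
}
```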
Example:
C code:
  float x,z;
  z = abs(x);
  if (z >= 1) {
    z = 1/z;
  }
3DNow! code:
  ;in:  MM0 = x
  ;out: MM0 = z
  MOVQ MM5, mabs ;0x7fffffff
  PAND MM0, MM5 ;z=abs(x)
  PFRCP MM2, MM0 ;1/z approx
  MOVQ MM1, MM0 ;save z
  PFRCPIT1 MM0, MM2 ;1/z step
  PFRCPIT2 MM0, MM2 ;1/z final
  PFMIN MM0, MM1 ;z = z < 1 ? z : 1/z
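Example:
C code:
  float x,z,r,res;
  z = fabs(x);
  if (z < 0.575) {
    res = r;
  } else {
    res = PI/2 - 2*r;
  }
3DNow! code:
  ;in:  MM0 = x
  ;     MM1 = r
  ;out: MM0 = res
  MOVQ MM7, mabs ;mask for absolute value
  PAND MM0, MM7 ;z = abs(x)
  MOVQ MM2, bnd ;0.575
  PCMPGTD MM2, MM0 ;z < 0.575 ? 0xffffffff : 0
  MOVQ MM3, pio2 ;pi/2
  MOVQ MM0, MM1 ;save r
  PFADD MM1, MM1 ;2*r
  PFSUBR MM1, MM3 ;pi/2 - 2*r
  PAND MM0, MM2 ;z < 0.575 ? r : 0
  PANDN MM2, MM1 ;z < 0.575 ? 0 : pi/2 - 2*r
  POR MM0, MM2 ;z < 0.575 ? r : pi/2 - 2*r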
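Example:
C code:
  #define PI 3.14159265358979323
  float x,z,r,res;
  /* 0 <= r <= PI/4 */
  z = abs(x);
  if (z < 1) {
    res = r;
  } else {
    res = PI/2-r;
  }
3DNow! code:
  ;in:  MM0 = x
  ;     MM1 = r
  ;out: MM1 = res
  MOVQ MM5, mabs ;mask to clear sign bit
  MOVQ MM6, one ;1.0
  PAND MM0, MM5 ;z=abs(x)
  PCMPGTD MM6, MM0 ;z < 1 ? 0xffffffff : 0
  MOVQ MM4, pio2 ;pi/2
  PFSUB MM4, MM1 ;pi/2-r
  PANDN MM6, MM4 ;z < 1 ? 0 : pi/2-r
  PFMAX MM1, MM6 ;res = z < 1 ? r : pi/2-r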
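Example:
C code:
  #define PI 3.14159265358979323
  float x,y,xa,ya,r,res;
  int xs,df;
  xs = x < 0 ? 1 : 0;
  xa = fabs(x);
  ya = fabs(y);
  df = (xa < ya);
  if (xs && df) {
    res = PI/2 + r;
  } else if (xs) {
    res = PI - r;
  } else if (df) {
    res = PI/2 - r;
  } else {
    res = r;
  }
3DNow! code:
  ;in:  MM0 = r
  ;     MM1 = y
  ;     MM2 = x
  ;out: MM0 = res
  MOVQ MM7, sgn ;mask to extract sign bit
  MOVQ MM6, sgn ;mask to extract sign bit
  MOVQ MM5, mabs ;mask to clear sign bit
  PAND MM7, MM2 ;xs = sign(x)
  PAND MM1, MM5 ;ya = abs(y)
  PAND MM2, MM5 ;xa = abs(x)
  MOVQ MM6, MM1 ;ya
  PCMPGTD MM6, MM2 ;df = (xa < ya) ? 0xffffffff : 0
  PSLLD MM6, 31 ;df = bit<31>
  MOVQ MM5, MM7 ;xs
  PXOR MM7, MM6 ;xs^df ? 0x80000000 : 0
  MOVQ MM3, npio2 ;-pi/2
  PXOR MM5, MM3 ;xs ? pi/2 : -pi/2
  PSRAD MM6, 31 ;df ? 0xffffffff : 0
  PANDN MM6, MM5 ;xs ? (df ? 0 : pi/2) : (df ? 0 : -pi/2)
  PFSUB MM6, MM3 ;pr = pi/2 + (xs ? (df ? 0 : pi/2) : (df ? 0 : -pi/2))
  POR MM0, MM7 ;ar = xs^df ? -r : r
  PFADD MM0, MM6 ;res = ar + pr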
Avoid the Loop Instruction
The LOOP instruction in the AMD Athlon processor requires eight cycles to execute. Use the preferred code shown below:
Example (Avoid):
  LOOP LABEL
Example (Preferred):
  DEC ECX
  JNZ LABEL
Avoid Far Control Transfer Instructions
Avoid using far control transfer instructions. Far control transfer branches cannot be predicted by the branch target buffer.
Avoid Recursive Functions
Avoid recursive functions due to the danger of overflowing the return address stack. Convert end-recursive functions to iterative code. A recursive function is called end-recursive when the function call to itself is at the end of the code.
Example (Avoid):
  long fac(long a)
  {
    if (a==0) {
      return (1);
    } else {
      return (a*fac(a-1));
    }
  }
Example (Preferred):
  long fac(long a)
  {
    long t=1;
    while (a > 0) {
      t *= a;
      a--;
    }
    return (t);
  }
