3DNow! Instruction Porting Guide

Publication 22621   Rev: B   Issue Date: August 1999
22621B/0-August 1999

© 1999 Advanced Micro Devices, Inc. All rights reserved.

The contents of this document are provided in connection with Advanced Micro Devices, Inc. ("AMD") products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD's Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.

AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.

Trademarks

AMD, the AMD logo, AMD Athlon, 3DNow!, and combinations thereof are trademarks, and AMD-K6 is a registered trademark, of Advanced Micro Devices, Inc. Microsoft is a registered trademark of Microsoft Corporation. Metrowerks and CodeWarrior are trademarks of Metrowerks, Inc. MMX is a trademark and Pentium is a registered trademark of Intel Corporation. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Contents

Revision History
3DNow! Instruction Porting Guide
    Introduction
    Detecting 3DNow! Technology Support
    Related Documents
3DNow! Instruction Porting
    Code Support Considerations
        Separate Executables
        Separate DLLs
        Different Optimized Versions
        Conditional Code Paths
    3DNow! Porting Preparations
        Perform High-Level Optimizations
        Profile Existing Code
    Port Major Hotspots
    Compiler Optimizations
    Use MASM Code for Critical Code
    Port Code Blocks
    3DNow! Code versus x87 Code
    Optimize Register Allocation
    Schedule Instructions
    3DNow! Code Debugging
    Decode Degradation Checking
        [ESI] Inhibits Short Decode
        Instructions Longer Than Seven Bytes
        Crossing a Cache Line Boundary
        Instruction Length Determination
        Align Loops on a 32-Byte Boundary
Blended Code Guidelines
    Introduction
    Data Alignment
        Alignment of Structures
        Alignment of Structure Components
        Alignment of Dynamically Allocated Memory
        Alignment of Stack Data
    Maximize SIMD Processing
    Use PREFETCH and PREFETCHW Instructions
    Take Advantage of Write Combining
    Use the FEMMS Instruction
    Load-Execute Instruction Usage
    Scheduling Instructions
    Instruction and Addressing Mode Selection
General Porting Guidelines
    Minimize AMD-K6-2 Processor Switching Overhead
    Using PREFETCH
        PREFETCH on the AMD-K6 Processor
        PREFETCH on the AMD Athlon Processor
        PREFETCHW Usage
        Multiple Prefetches
        Determining Prefetch Distance
        Prefetch at Least 64 Bytes Away from Surrounding Stores
    Use the PFSUBR Instruction When Needed
    Using PAND and PXOR
    Swapping MMX Register Halves
    PUNPCKL* and PUNPCKH* Instructions
    Storing the Upper Bits of a Register
    PFMIN and PFMAX
    Precision Considerations
    Moving Data Between MMX and Integer Registers
    Store-to-Load Forwarding
    Block Copies
    Instruction Cache
    Branch Prediction
    Effects of the Linker on Code Alignment
    Software Write Combining
    Addressing Modes on AMD-K6-2 and AMD-K6-III Processors

Revision History

August 1999: Initial public release.
Application Note

3DNow! Instruction Porting Guide

Introduction

This document contains information to assist programmers in creating optimized code for AMD processors with 3DNow! technology. Compiler and assembler designers, and assembly language programmers writing execution-sensitive code sequences, as well as high-level programmers, will also find these guidelines useful.

This document assumes that the reader possesses in-depth knowledge of the x86 instruction set, the x86 architecture (registers, programming modes, etc.), and the PC-AT platform.

This document has three sections of guidelines for 3DNow! porting:

  - 3DNow! Instruction Porting
  - Blended Code Guidelines
  - General Porting Guidelines

The 3DNow! Instruction Porting section describes the actual process of converting existing code to 3DNow! code. The Blended Code Guidelines section deals specifically with the creation of blended code: 3DNow! code that provides high performance on AMD-K6 processors as well as on the AMD Athlon processor. All applications should use blended code to ensure optimal performance on current and future platforms. The General Porting Guidelines section describes a number of important issues in 3DNow! code optimization, mainly for the family of AMD-K6 processors, while also addressing the Athlon processor.

Detecting 3DNow! Technology Support

3DNow! technology is an open standard that has been adopted by multiple x86 processor vendors. Therefore, checking for 3DNow! technology capability should not be limited to AMD processors. All 3DNow! technology licensees have agreed to indicate 3DNow! technology capability through the extended feature flags. Checks for 3DNow! technology support can be made without first checking the processor vendor. This allows current detection code to also detect future 3DNow! technology licensees.

The basic steps for 3DNow! technology capability detection are as follows:

  1. Test that the processor has the CPUID instruction.
  2. Check that the CPUID instruction also supports extended function 8000_0001h.
  3. Execute CPUID extended function 8000_0001h to retrieve the extended feature flags in the EDX register.
  4. If bit 31 of the EDX register is set, the processor supports the 3DNow! instruction set.

The following assembly language code shows how this can be implemented:

    ;;check whether CPUID is supported (bit 21 of Eflags can be toggled)
    pushfd                    ;save Eflags
    pop     eax               ;transfer Eflags into EAX
    mov     edx, eax          ;save original Eflags
    xor     eax, 00200000h    ;toggle bit 21
    push    eax               ;put new value on stack
    popfd                     ;transfer new value to Eflags
    pushfd                    ;save updated Eflags
    pop     eax               ;transfer Eflags to EAX
    xor     eax, edx          ;updated Eflags and original differ?
    jz      NO_CPUID          ;no diff, bit 21 can't be toggled

    ;;test whether extended function 80000001h is supported
    mov     eax, 80000000h    ;call extended function 80000000h
    cpuid                     ;reports back highest supported ext. function
    cmp     eax, 80000000h    ;supports functions > 80000000h?
    jbe     NO_EXTENDED       ;no 3DNow! support, either

    ;;test if function 80000001h indicates 3DNow! support
    mov     eax, 80000001h    ;call extended function 80000001h
    cpuid                     ;reports back extended feature flags
    test    edx, 80000000h    ;bit 31 in extended features
    jnz     YES_3DNow!        ;if set, 3DNow! is supported

Related Documents

Related documents are available for download from AMD, including:

  - AMD-K6® Processor Code Optimization Application Note, order# 21924
  - 3DNow! Technology Manual, order# 21928
  - AMD-K6® Processor Multimedia Technology, order# 20726
  - Implementation of Write Allocate Application Note, order# 21326
  - AMD Athlon Processor Code Optimization Guide, order# 22007
  - Extensions to the 3DNow! and MMX Instruction Sets Manual, order# 22466
  - Processor Recognition Application Note, order# 20734

3DNow! Instruction Porting

Code Support Considerations

Consider having your software support several paths through different code optimized for various processors. Choices include the following methods:

Separate Executables

Build separate executables optimized for each platform. This is probably the highest-performance option, but it may be impractical because of code distribution issues and other problems.
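The methods in this section all depend on a run-time answer to whether 3DNow! is available. As a rough C counterpart to the assembly detection sequence, the sketch below assumes a GCC- or Clang-compatible compiler that ships <cpuid.h> with its __get_cpuid helper (which returns 0 when CPUID or the requested leaf is unsupported). The names edx_has_3dnow and cpu_has_3dnow are hypothetical and are not part of any AMD SDK.

```c
#include <stdbool.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>          /* assumption: GCC/Clang-style CPUID helpers */
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

/* Bit 31 of EDX from CPUID extended function 8000_0001h signals 3DNow!. */
#define EXT_FEATURE_3DNOW (1u << 31)

/* Pure helper: interpret an extended-feature EDX value. */
bool edx_has_3dnow(uint32_t ext_edx)
{
    return (ext_edx & EXT_FEATURE_3DNOW) != 0;
}

/* Hypothetical wrapper: query the running CPU without checking the vendor. */
bool cpu_has_3dnow(void)
{
#if HAVE_X86_CPUID
    unsigned int eax, ebx, ecx, edx;

    /* Fails (returns 0) if CPUID or leaf 8000_0001h is not supported. */
    if (!__get_cpuid(0x80000001u, &eax, &ebx, &ecx, &edx))
        return false;
    return edx_has_3dnow(edx);
#else
    return false;           /* not an x86 build: no 3DNow! */
#endif
}
```

A program would typically call cpu_has_3dnow() once at startup and cache the result, then use it to pick an executable, a DLL, a set of function pointers, or a conditional code path.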
Separate DLLs

Place performance-sensitive code into a separate DLL, providing several DLLs optimized for each target platform supported. This is a high-performance solution, because the overhead is typically no more than selecting and loading the version most appropriate to the platform detected at run time. One problem with this approach is that performance-sensitive code may come from different, unrelated parts of the source tree, yet becomes grouped together in a single DLL.

Different Optimized Versions

Provide optimized versions of each performance-critical function for each target platform, and call the functions through pointers that are initialized at run time based on the processor of the system the software is running on. This has a negative performance impact on AMD-K6® processors, because function calls through pointers are slower than regular function calls.

Conditional Code Paths

Inside performance-critical parts of the code, conditionally select code paths based on capability flags. On AMD-K6 processors, this is faster than the approach using function pointers, because the branches will be well predicted, since the capabilities do not change during run time. On the other hand, this approach can make the code less clear and more difficult to maintain.

3DNow! Porting Preparations

Perform High-Level Optimizations

Before starting the 3DNow! porting effort, perform all high-level optimizations that can be done at the source-code level. This primarily affects loops, which can be transformed in a variety of ways for better performance: loop unrolling, loop splitting, loop merging, loop inversion, loop switching, and hoisting of loop-invariant expressions and conditionals. Function calls can also be optimized by inlining. It is much more difficult to perform high-level transformations once the code has been ported to the assembly-language level.

Profile Existing Code

Before starting the actual porting process, profile the existing code on the target platform to identify the hotspots that merit manual porting work.

Profilers come in various types. Some require source code, some instrument binaries, and others use a sampling approach.
These profilers should work on AMD processors. VTune works too, but it doesn't have event-based profiling or the capability to disassemble 3DNow! code. Some profilers, like Metrowerks CATS, have built-in support for 3DNow! instructions, which makes things easier when reprofiling code during the porting process.

Port Major Hotspots

Candidates for 3DNow! porting are hotspots that frequently use x87 floating-point unit (FPU) instructions. AMD-K6 processors incur a penalty called switching overhead whenever the instruction flow changes between x87 instructions and MMX/3DNow! instructions (or vice versa). For full 3DNow! optimization, port code all the way down to hotspots that take only a small percentage of total execution time. Because of the switching overhead, porting only a few small functions to 3DNow! is often detrimental to overall performance. The goal is to keep the processor operating in 3DNow!/MMX code for long periods of time, with only occasional x87 code.

Some manual porting work can be saved by compiling the code that contains fewer hotspots with a compiler that can generate 3DNow! code, such as a later release of CodeWarrior Professional.

Compiler Optimizations

To achieve the best performance from hotspots that are floating-point intensive but do not lend themselves to 3DNow! porting, experiment with compiler flags to find which flag settings provide the best code for AMD processors. Most compilers allow processor-specific optimizations based on the capabilities of Intel processors. Since AMD processors are different from Intel processors, the available processor-specific settings may not be fully optimal for AMD processors. The microarchitecture of AMD processors most closely resembles the Pentium® II and Pentium III microarchitecture, so in most cases selecting the P6/PII/PIII-specific optimization results in the highest performance on AMD processors (for example, in Microsoft Visual C/C++). The Metrowerks CodeWarrior compiler has a specific optimization setting for AMD processors.

Use MASM Code for Critical Code

Use standalone MASM code for the performance-critical parts of the code that are ported to 3DNow!. This gives the best control over the code (for example, over code alignment). To assemble 3DNow! code, use MASM 6.13 or MASM 6.14.
Upgrade an existing installation of MASM 6.11 to 6.13 by downloading and applying the ML613.EXE patch. To enable MMX instructions, use the .MMX directive. To enable 3DNow! instructions, use the .K3D directive after using the .MMX directive. The order of the directives matters. MASM 6.14 supports most of the 3DNow! extensions introduced with the Athlon processor. Use the .XMM directive to enable these extensions. Note that two instructions, PFNACC and PFPNACC, are not accessible in MASM 6.14. Also, in order to use the PSWAPD instruction, users need to define a text macro as follows:

    pswapd TEXTEQU <pswapw>

For some functions where only a small part of the code is replaced, use inline assembly. Since Microsoft Visual C does not have native inline assembly support for the 3DNow! instruction set, download the instruction macros from AMD. The macros are in the amd3d.h file in the 3DNow! SDK, which is also available for download from AMD.

To get started with the assembly language code, have the compiler generate an assembly language listing that serves as the initial assembly language version. Make sure to compile with maximum optimizations, to have the compiler perform high-level optimizations up front. The compiler will convert symbolic constants in the source code to "magic numbers"; the programmer may therefore need a mechanism to extract symbolic constants from the C code and import them into the assembly code, in order to maintain the assembly code as well as the C code.

Port Code Blocks

Most functions contain several more-or-less self-contained blocks. Port the blocks one by one to 3DNow!, and surround the 3DNow! code with FEMMS. This block-by-block approach minimizes debug time. After each block is ported, run the code to verify that it is still working. If the code is not working, it's usually easy to locate errors because they are isolated to one block. With this approach, a debugger is only rarely necessary for 3DNow! porting work.

The commenting conventions for 3DNow! code show the most significant half of an operand on the left-hand side and the least significant half of the operand on the right-hand side, with the halves separated by a vertical bar.

3DNow! Code versus x87 Code

When porting, most programmers find that 3DNow! code is much easier to write than x87 code, because the register file is flat and because, with the 3DNow! single instruction multiple data (SIMD) capability, twice as many operands can be manipulated. It is often possible to remove local temporary variables.

  - Maximize SIMD: always do useful work in both parts of the operands. It can be advantageous to accept the overhead of packing and unpacking operands in order to use SIMD arithmetic. Consider modifying existing data structures to a data layout more conducive to SIMD processing, thereby eliminating the need for additional pack and unpack instructions.

  - Replace integer code with MMX code where possible.

  - Unroll small loops completely. This can free integer registers, and branches that do not exist cannot be mispredicted. Even with its large number of global history bits, the AMD-K6 processor does not predict many short loops well.

  - If possible, use MMX computations to replace branches caused by "if...then...else" constructs acting on 3DNow! data. Branching on 3DNow! data is slower, since 3DNow! instructions don't affect the integer flags. Also, branching is disruptive to SIMD code because it is inherently a scalar operation, which diminishes the advantages of SIMD processing.

  - Avoid moving data between MMX and integer registers on Athlon processors. If you must move data between MMX and integer registers, use the MOVD instruction.

  - Write 3DNow! code in a load/store construction; do not use load-execute instructions such as PFADD MM0, [FOO]. Using the load/store construct enables aggressive scheduling, which is essential for good performance. (See "Schedule Instructions".)

  - Maximize the use of instructions that guarantee high decode bandwidth. These are called short-decode instructions on AMD-K6 family processors and DirectPath instructions on Athlon family processors. The optimization guides for both processors list the short-decode and DirectPath instructions. Maintaining high decode bandwidth is essential for high-performance code. Using short-decoded instructions, AMD-K6 family processors can decode two instructions per cycle. Using DirectPath instructions, Athlon family processors can decode three instructions per cycle. On AMD-K6 family processors, the only 3DNow!/MMX instructions that are not short-decoded are EMMS, FEMMS, and PREFETCH.

  - Avoid indirect calls and jumps, because AMD-K6 processors do not apply branch prediction to these control-transfer instructions.
At the source code level, this affects functions called through a function pointer (such as entry points into DLLs). The latency of JMP DWORD PTR [] is eight cycles, and the latency of CALL DWORD PTR [] is seven cycles. Note that AMD-K6 processors do use the return stack for indirect calls, so the return from an indirectly called routine is still accelerated. The Athlon processor applies branch prediction to indirect calls and jumps.

Optimize Register Allocation

After porting a complete function, optimize register allocation across the function. Keep as much data as possible in registers to reduce overall memory traffic.

Make sure data is aligned on natural boundaries: QWORDs on QWORD boundaries, DWORDs on DWORD boundaries. Note that data that is accessed by 3DNow! code as QWORDs is not necessarily declared as QWORDs in the C program, and may therefore not be properly aligned even if compiler switches are used to force data alignment on natural boundaries. Ensuring alignment may require slight changes to the padding of data structures outside the ported code, and may require manual QWORD alignment of pointers returned by dynamic memory allocation routines such as malloc(), calloc(), etc. The /Zp8 switch in Microsoft Visual C aligns structs on QWORD boundaries. Note, however, that /Zp8 doesn't always do a perfect job, so a small amount of manual padding may still be needed.

Schedule Instructions

Schedule code according to instruction latencies. Scheduling is important on AMD-K6-2 and AMD-K6-III processors because their scheduler is six deep and four wide, so it holds 24 OPs. OPs are pushed into the scheduler four (an op-quad) at a time. They come in at the top, and the previous lines shift down. When a line reaches the bottom of the scheduler and its OPs haven't completed yet, the scheduler stalls: no new OPs can be pushed in at the top. If the OPs in the line at the bottom of the scheduler have completed, the results are committed to architectural state (retired) and the op-quad is discarded from the scheduler, allowing the following lines to shift down.

In the best possible case, the decoders push an op-quad every cycle. The OPs must then complete by the time their line reaches the bottom of the scheduler, or else the processor loses performance. In x86 terms, the scheduler's capacity is equivalent to only a limited number of short-decoded instructions.
Because the out-of-order window is not very big, an instruction that doesn't get its source operands right away can end up at the bottom of the scheduler without having completed, and this prevents the scheduler from shifting.

The basic scheduling rules are as follows. 3DNow! instructions on AMD-K6-2 and AMD-K6-III processors have a two-cycle latency. MMX instructions have a one-cycle latency, except for multiplies, which take more than one cycle. Loads have a two-cycle latency.

To guarantee a smooth flow of code through the machine, group instructions into pairs that decode together, issue together, and retire together. To achieve this, observe the following rules:

  - No dependencies between the instructions in a decode pair
  - No resource conflicts between the instructions in a decode pair

In each cycle, AMD-K6-2 and AMD-K6-III processors can perform the following:

  - a load
  - a store
  - integer operations, including integer shift operations
  - an MMX shift operation, and operations in the two MMX/3DNow! pipes
  - a branch

Note that some instructions count as a store for this purpose, and that PUNPCK* instructions are treated as MMX shift instructions.

A scheduling method is to first group the code following the above rules, marking the empty slots, and then move instructions to fill the slots. For example (mm0 is assumed here to hold zero; the second operands were reconstructed from the comments):

    movd       mm1, [foo_var]   ; 0,0,0,0       | v[3],v[2],v[1],v[0]
    punpcklbw  mm1, mm0         ; 0,v[3],0,v[2] | 0,v[1],0,v[0]
    movq       mm2, mm1         ; 0,v[3],0,v[2] | 0,v[1],0,v[0]
    punpcklwd  mm1, mm0         ; 0,0,0,v[1]    | 0,0,0,v[0]
    punpckhwd  mm2, mm0         ; 0,0,0,v[3]    | 0,0,0,v[2]
    pi2fd      mm1, mm1         ; float(v[1])   | float(v[0])
    pi2fd      mm2, mm2         ; float(v[3])   | float(v[2])

3DNow! Code Debugging

To debug 3DNow! code, it is best to have a debugger that both supports disassembly of 3DNow! instructions and allows MMX registers to be viewed as pairs of single-precision floating-point values. NuMega SoftICE version 3.24 or later has both of these capabilities. Microsoft Visual C/C++ can also disassemble 3DNow! instructions; however, it does not provide convenient viewing of MMX registers as pairs of floating-point numbers.

Decode Degradation Checking

After the code has been scheduled and thoroughly tested, the last step is to check for decode degradation. AMD-K6 processors use a technique called predecode to speed up decoding.
In certain instances, the predecode information is degraded, resulting in the decode of only one instruction per cycle (long decode) or even one instruction over multiple cycles (vector decode), even though the instruction itself is listed as short decoded. The following guidelines apply to AMD-K6 family processors:

[ESI] Inhibits Short Decode

The [ESI] addressing mode inhibits short decode. Note that [ESI+disp], [ESI+reg], etc. are acceptable. Also, note that specifying [ESI+0] is optimized by most assemblers to [ESI].

Instructions Longer Than Seven Bytes

If the length of an instruction exceeds seven bytes, short decode is inhibited; such an instruction is never short decoded.

Crossing a Cache Line Boundary

If an instruction crosses a cache line boundary such that the opcode byte and the modR/M byte are not in the same cache line, short decode is inhibited. Instruction cache lines are 32 bytes long on AMD-K6 family processors. If the code segment is only paragraph (16-byte) aligned, check all 16-byte boundaries for occurrences of this case.

These cases can be remedied as follows:

  - Swap the instructions in a decode pair.
  - Choose alternative instructions to move the code (for example, an equivalent instruction on EAX instead of TEST EAX, EAX).
  - Insert filler instructions like NOPs. Since an instruction degraded to vector decode takes multiple cycles, it's better to add an additional instruction and have both short decoded.
  - Hand code an instruction with a zero displacement to change the size of its displacement field.

Instruction Length Determination

Short decode is inhibited if more than three instruction bytes are required to determine the length of the instruction. This happens with certain addressing modes where the decoder needs to look at the SIB byte to determine the instruction length, and 0Fh, opcode, and modR/M already make up the maximum of three bytes. Avoid these addressing modes. For more information, see the AMD-K6 Processor Code Optimization Application Note, order# 21924.

AMD-K6-2 processors with the CXT core (CPUIDs 588h through 58Fh) and AMD-K6-III processors eliminate this particular form of degraded predecode.

Align Loops on a 32-Byte Boundary

Align important loops on a 32-byte cache line boundary. At a minimum, make sure that after the start of the loop there is room for the first instructions of the loop before the next 32-byte boundary.
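The cache-line checks in this section reduce to simple address arithmetic that can be run over an assembler listing. The following C sketch is only an illustration under stated assumptions: the 32-byte line size is the AMD-K6 instruction cache line length given above, and the function names and the idea of passing opcode and modR/M byte offsets (as read from a listing) are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* AMD-K6 family instruction cache line length, per the text above. */
#define K6_ICACHE_LINE 32u

/* True when two code addresses fall into the same 32-byte cache line. */
bool same_cache_line(uint32_t a, uint32_t b)
{
    return (a / K6_ICACHE_LINE) == (b / K6_ICACHE_LINE);
}

/*
 * True when the opcode byte and the modR/M byte of an instruction land in
 * different cache lines, the case in which short decode is inhibited.
 * opcode_off and modrm_off are byte offsets from the instruction start,
 * assumed to be known from the assembler listing.
 */
bool opcode_modrm_split(uint32_t insn_addr, unsigned opcode_off,
                        unsigned modrm_off)
{
    return !same_cache_line(insn_addr + opcode_off, insn_addr + modrm_off);
}

/* Bytes available from addr up to (not including) the next line boundary. */
unsigned bytes_to_next_line(uint32_t addr)
{
    return K6_ICACHE_LINE - (addr % K6_ICACHE_LINE);
}
```

For a loop entry point, bytes_to_next_line(loop_start) returning 32 means the loop starts exactly on a cache line boundary; a small value suggests padding or realigning the loop.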
Blended Code Guidelines

Introduction

Blended code is 3DNow!-optimized code that runs well on both AMD-K6® and AMD Athlon processor platforms. The basic approach to blended code optimization is to address the AMD-K6 processor requirements first, and then look for specific Athlon processor improvements or issues that do not adversely affect AMD-K6 processor performance. With much larger buffers and a much larger out-of-order instruction window than other x86 processors, the Athlon processor is good at automatically extracting performance from existing executables, even if they were specifically optimized for a different processor. Of course, the best Athlon performance is achieved by optimizing code to exploit the specific strengths of the Athlon processor. To learn more about Athlon code optimization, refer to the AMD Athlon Processor Code Optimization Guide, order# 22007.

Data Alignment

Data alignment is very important for both AMD-K6 and Athlon processor performance. Standard processor designs will not work to their full potential if data is not aligned. Alignment is especially important for data that is written by one instruction and subsequently read by another instruction. Three typical areas to watch for data alignment are:

  - Alignment of structures and structure components
  - Alignment of dynamically allocated memory
  - Alignment of stack data

Alignment of Structures

With regard to the alignment of structures, many compilers offer switches to automatically align structures. These switches do not always work perfectly. It is best to check alignment manually, and pad if necessary.

Alignment of Structure Components

Arranging structure components in order of decreasing size can help. For example, declare components with a larger base type (e.g., DWORD) ahead of components with smaller base types (e.g., BYTE).

Alignment of Dynamically Allocated Memory

With regard to the alignment of dynamically allocated memory, if your programming environment does not guarantee that pointers returned by dynamic memory allocators, such as malloc(), are suitably aligned, allocate a slightly larger chunk of memory and align the pointer manually.
For example, QWORD alignment should be done as follows (size_in_bytes is a placeholder for the requested allocation size, which is not preserved in this listing):

    p  = (QWORD *)malloc(size_in_bytes + 7L);
    np = (QWORD *)((((long)(p)) + 7L) & (-8L));

Alignment of Stack Data

Alignment of stack data is hard to control unless complete functions are written in assembly language. In this case, code like the following example can keep local 3DNow! data QWORD aligned (the mnemonics and the -8 mask are reconstructed from the surviving operands; the size of the local variables is assumed to be a multiple of 8):

    Prolog:
        PUSH  EBP
        MOV   EBP, ESP
        AND   ESP, -8
        SUB   ESP, size_of_local_variable

    Note: Use EBP to access arguments, and ESP to access local variables.

    Epilog:
        MOV   ESP, EBP
        POP   EBP
        RET

Maximize SIMD Processing

Maximize the amount of SIMD processing in your code. Use MMX instructions aggressively, because using MMX and 3DNow! instructions in the code can provide significant performance benefits compared to x87 code. Using PUNPCK instructions to combine scalar data for SIMD processing can create significant overhead and should be avoided where possible. It is best to rearrange the computations and data structures in the source such that the amount of SIMD computation is maximized.

Example (Avoid):

    float Xscale, Xoffset, Yscale, Yoffset;
    xnew = x*Xscale+Xoffset;
    ynew = y*Yscale+Yoffset;

Example (Better):

    float Xscale, Yscale, Xoffset, Yoffset;
    xnew = x*Xscale+Xoffset;
    ynew = y*Yscale+Yoffset;

The second example can be efficiently implemented using 3DNow! instructions:

    MOVQ   mm0, x          ;y | x
    MOVQ   mm1, Xscale     ;Yscale | Xscale
    MOVQ   mm2, Xoffset    ;Yoffset | Xoffset
    PFMUL  mm0, mm1        ;y*Yscale | x*Xscale
    PFADD  mm0, mm2        ;y*Yscale+Yoffset | x*Xscale+Xoffset
    MOVQ   xnew, mm0       ;store ynew | xnew

As a rough goal, strive to use as many as possible of the available computational slots provided by the SIMD instructions.

Use PREFETCH and PREFETCHW Instructions

Use PREFETCH and PREFETCHW as aggressively as possible. On the AMD-K6-2 processor, PREFETCH results in only small performance improvements, because prefetches share frontside bus (FSB) bandwidth with other cache and memory traffic. With high bus utilization and high core-clock multipliers, prefetches are often bumped because they have the lowest priority among memory accesses. This situation improves with the AMD-K6-III processor, where cache traffic is redirected to a separate backside bus, which frees up FSB bandwidth.
With the large amount of bandwidth available, application-level improvements have been observed from using PREFETCH(W) aggressively. Examine code carefully to find opportunities for using PREFETCH(W). Good PREFETCH use requires that essentially all prefetched data is actually used, and therefore works best on data accessed with unit stride in ascending order. Sometimes algorithms can be rewritten to create such a data access pattern.

On the AMD-K6 processor, PREFETCH creates a small overhead, since it is a vector decode instruction. On the Athlon processor, PREFETCH is DirectPath. Use PREFETCH as aggressively as possible without decreasing AMD-K6 processor performance through the overhead of the PREFETCH instruction. This is possible in almost all cases.

PREFETCH on the Athlon processor brings in 64 bytes per PREFETCH. Despite the cache line length having doubled over the AMD-K6 processor (64 bytes versus 32 bytes), it is acceptable to have overlapping (on the Athlon processor) prefetches to account for the shorter 32-byte cache lines of the AMD-K6 processor. Make sure prefetch addresses are at least 64 bytes apart from the target addresses of stores in the vicinity of the PREFETCH(W) instruction. Also, for best Athlon performance, prefetch about three cache lines (192 bytes) ahead of the current loads. For a more detailed formula, see the PREFETCH usage guideline in the AMD Athlon Processor Code Optimization Guide, order# 22007.

Take Advantage of Write Combining

Use the write-combining mechanisms provided by the hardware. On the AMD-K6 processor, best performance can be achieved using software write combining. (See "Software Write Combining".) Also enable the write-combining features provided by the hardware on the AMD-K6-2 processor with the CXT core and on the AMD-K6-III processor. Aggressive software write combining is often better than the AMD-K6 processor's hardware write-combining mechanism, but enabling the hardware write-combining mechanism provides the additional benefit of shorter latency for writes to non-cacheable memory areas.

The Athlon processor has a very powerful write-combining mechanism that achieves even better acceleration of writes to non-cacheable space than is possible with write combining on the AMD-K6 processor.
Specifically, the Athlon write-combining buffer can combine writes of any size. The write-combining hardware is programmed through model-specific registers (MSRs), which have been implemented compatibly with Intel's Pentium processor family. In addition to accelerating writes to write-combining (WC) regions, Athlon write combining can also accelerate writes to write-through (WT) memory areas if they occur in strictly ascending order. (Writes to WC areas are combined regardless of the order of the writes.) See the Write Combining chapter in the AMD Athlon Processor Code Optimization Guide, order# 22007, for more details.

Use the FEMMS Instruction

The Athlon processor does not have a switching overhead when switching between 3DNow!/MMX instructions and x87 instructions. Also, the FEMMS and EMMS instructions are essentially free, because they execute with an apparent zero-cycle latency. However, in blended code it is important to avoid frequent switching between x87 and 3DNow!/MMX code blocks, and to use FEMMS before entering and after leaving a block of 3DNow!/MMX code; otherwise performance on AMD-K6 processors can suffer significantly.

Load-Execute Instruction Usage

The Athlon processor performs well when load-execute instructions (i.e., instructions that have a register and a memory source, where the result goes to the register) are used. In fact, load-execute instructions are recommended for the Athlon processor because they improve code density. However, in blended code, avoid load-execute instructions in 3DNow!/MMX code, to enable proper scheduling of loads and to avoid potential problems with load-execute instructions (degradation to vector decode because of instruction length) on AMD-K6 family processors. The Athlon processor has a built-in mechanism that enables a sequence of a load and a dependent 3DNow!/MMX instruction to execute just as quickly as the corresponding load-execute instruction. Avoiding load-execute instructions therefore does not cause performance degradation on the Athlon processor, and it can help the AMD-K6 processor.

Scheduling Instructions

Schedule instructions for the AMD-K6 processor. (See "Schedule Instructions".) Because of the relatively small instruction re-order buffer of the AMD-K6 processor, scheduling is important for performance on AMD-K6 processors.
However, the Athlon processor is a very aggressive out-of-order machine with a huge instruction re-order buffer. Therefore, instruction scheduling for the Athlon processor is of minor importance, because it can extract the available parallelism automatically. Scheduling code for the AMD-K6 processor has no adverse side effects on Athlon performance.

Instruction and Addressing Mode Selection

As far as instruction selection is concerned, only a few issues require attention. On the Athlon processor, transferring data between MMX and integer registers is somewhat slower than on the AMD-K6 processor. Therefore, such transfers should be minimized. Usually, this is not difficult.

Among integer instructions, avoid the LOOP instruction. While it is very fast on the AMD-K6 processor, it is somewhat slower on the Athlon processor. It should be replaced with the sequence DEC ECX; JNZ. This will, in most cases, reduce AMD-K6 performance, but only by a very limited amount.

The Athlon processor uses a different instruction predecode scheme than the AMD-K6 processor, and therefore has no sub-optimal addressing modes. However, since this is a real performance issue on the AMD-K6 processor, addressing modes considered sub-optimal on the AMD-K6 processor should be avoided in blended code. Sub-optimal addressing modes are described in "Addressing Modes on AMD-K6-2 and AMD-K6-III Processors".

General Porting Guidelines

Minimize AMD-K6®-2 Processor Switching Overhead

Minimize the 3DNow! and MMX switching overhead by porting hotspots containing x87 code to 3DNow! code. Even if FEMMS is used, switching incurs about 25 cycles in each direction, or 50 cycles round-trip. Always use FEMMS, not EMMS, because the switching overhead with EMMS is higher.

Always bracket 3DNow! code with FEMMS to ensure proper operation and to minimize switching overhead. If there are function calls to functions that contain x87 code, bracket the function call with FEMMS. It is also beneficial simply to minimize the number of FEMMS instructions.
One technique, when there are multiple calls (where the functions are _stdcall), is to perform the following in order:

  1. Push all arguments first
  2. Execute FEMMS
  3. Call all the functions (which unload the stack)
  4. Execute another FEMMS

Since FEMMS is a three-cycle vector path instruction, functions bracketed this way should not be made very small, to avoid adding significant overhead (functions have been observed in OpenGL that consist of just five instructions).

Note: On the Athlon processor, it is important that CALLs are not spaced too closely together; limiting the number of CALLs within any short stretch of code is recommended.

The switching overhead occurs on the first x87 floating-point unit instruction after a piece of 3DNow!/MMX code, or on the first MMX or 3DNow! instruction after a piece of x87 code. FEMMS and EMMS are 3DNow!/MMX instructions. Thus, looking at the following sample code:

    <FPU instructions>
    FEMMS                        ;switching overhead occurs here
    <MMX/3DNow! instructions>
    FEMMS
    <1st x87 instruction>        ;switching overhead occurs here

Note that PREFETCH(W), although introduced as part of the 3DNow! instruction extension, is treated like an ordinary integer instruction and therefore never incurs switching overhead. PREFETCH(W) can be used to accelerate integer, x87, MMX, or 3DNow! code.

Using PREFETCH

Use PREFETCH judiciously. PREFETCH on AMD-K6-2 and AMD-K6-III processors is microcoded, which adds some overhead. Also, on the AMD-K6-2 processor, cache and memory accesses have to flow through the same frontside bus. Do not waste bandwidth on the frontside bus by executing useless prefetches.

Opportunities for using PREFETCH are typically inside loops that process large amounts of data. If a loop goes through less than a cache line of data per iteration, partially unroll the loop. Make sure that close to 100% of the prefetched data is actually being used. This usually requires unit-stride access: all accesses are to contiguous memory locations.

PREFETCH on the AMD-K6® Processor

The usefulness of PREFETCH on AMD-K6-III processors is limited by hardware constraints, the most important being that the AMD-K6-III processor allows only one load miss to be outstanding at a time.
The cases where PREFETCH is most likely to provide benefits are characterized as follows:

- The bandwidth requirements of the code are moderate: there is a relatively large amount of computation for relatively few memory accesses. Bandwidth requirements are judged by how many Mbytes per second worth of data the code consumes when running out of the cache on a 400-MHz processor.
- Stores in the code that access cacheable memory write to a small area of memory only: the working set of the stores is small or empty. Due to the write-allocate feature of the AMD-K6-2 and AMD-K6-III processors, stores bring lines into the cache which are subsequently dirtied and must be written back from the cache when the cache line is replaced with data brought in by PREFETCH. Cache writebacks use bandwidth on the front-side bus.
- PREFETCHes do not overlap: no two PREFETCH instructions bring in the same data.
- The number of distinct memory regions being prefetched is small, preferably only one region. If there are multiple memory regions being prefetched (like multiple source arrays), the density of loads must be low compared to the amount of computation, such that computation can be overlapped with each PREFETCH. The PREFETCH instructions should be scheduled separately in such cases to allow each to overlap with computation, and to avoid the first PREFETCH blocking subsequent PREFETCHes due to the limit of one load miss in the machine at a time.

PREFETCH on the Athlon Processor

PREFETCH on the Athlon processor is a very powerful tool, both because of the much larger available bandwidth that it can exploit and because of the ability to have multiple outstanding load misses.

PREFETCHW Usage

Code that intends to modify the cache line brought in through prefetching should use the PREFETCHW instruction. While PREFETCHW works the same as PREFETCH on the AMD-K6-2 and AMD-K6-III processors, PREFETCHW gives a hint to the Athlon processor of an intent to modify the cache line. The Athlon processor will mark the cache line being brought in by PREFETCHW as modified. Using PREFETCHW can save an additional 15-25 cycles compared to a PREFETCH followed by the cache state change caused by a write to the prefetched cache line.

Multiple Prefetches

Programmers can initiate multiple outstanding prefetches. The AMD-K6-2 and AMD-K6-III processors can have only one outstanding prefetch, but the Athlon processor can have several outstanding prefetches.
For example, when traversing more than one array, the programmer should initiate multiple prefetches.

Example (Multiple Prefetches):

    double a[A_REALLY_LARGE_NUMBER];
    double b[A_REALLY_LARGE_NUMBER];
    double c[A_REALLY_LARGE_NUMBER];

    for (i=0; i<A_REALLY_LARGE_NUMBER/4; i++) {
        prefetchw (a[i*4+64]);   // will be modifying a
        prefetch (b[i*4+64]);
        prefetch (c[i*4+64]);
        a[i*4]   = b[i*4]   * c[i*4];
        a[i*4+1] = b[i*4+1] * c[i*4+1];
        a[i*4+2] = b[i*4+2] * c[i*4+2];
        a[i*4+3] = b[i*4+3] * c[i*4+3];
    }

Determining Prefetch Distance

To make sure code with PREFETCH works well on the Athlon processor, prefetch three cache lines ahead (Athlon cache lines are 64 bytes each), that is, 192 bytes ahead of the current loads. If the code is currently operating on data at address X, prefetch X+192.

Given the latency of a typical Athlon processor system and expected processor speeds, the following formula should be used to determine the prefetch distance in bytes:

    Prefetch Distance = k * (DS/C) bytes

Round up to the nearest 64-byte cache line.

- k is a numeric constant that is based upon expected Athlon processor clock frequencies and typical system memory latencies.
- DS is the data stride in bytes per loop iteration.
- C is the number of cycles for one loop iteration to execute entirely from the cache.

Prefetch at Least 64 Bytes Away from Surrounding Stores

The PREFETCH and PREFETCHW instructions can suffer from false dependencies on stores. If there is a store to an address that matches a request in bits 14-6, that request (the PREFETCH or PREFETCHW instruction) is blocked until the store is written to the cache. Therefore, code should prefetch data that is located at least 64 bytes (one cache line) away from any surrounding store's data address.

If PREFETCH helps a piece of code, or at least doesn't affect it, on the AMD-K6-2 and AMD-K6-III processors, keep the PREFETCH in the code anyway. There is a good chance that it will help on the Athlon processor, as its implementation of PREFETCH is very aggressive. If an Athlon processor is available, check that it benefits from the PREFETCH, then make sure that the PREFETCH doesn't hurt the AMD-K6-2 or AMD-K6-III processor.

Use PFSUBR Instruction When Needed

Note that there is a PFSUBR instruction, so for a subtraction the programmer can choose which operand to destroy.
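The operand choice that PFSUBR offers can be illustrated with a small scalar model in C. This is only a sketch: the type `v2f` and the functions `pfsub_model` and `pfsubr_model` are hypothetical names introduced here, not part of any AMD header, and they merely mimic the two-lane semantics (PFSUB computes dst = dst - src, PFSUBR computes dst = src - dst).

```c
/* Two packed single-precision lanes, standing in for one 64-bit
   3DNow! register. Hypothetical model type, for illustration only. */
typedef struct { float lo, hi; } v2f;

/* Models "pfsub dst, src": dst = dst - src, so dst is overwritten. */
static v2f pfsub_model(v2f dst, v2f src)
{
    v2f r = { dst.lo - src.lo, dst.hi - src.hi };
    return r;
}

/* Models "pfsubr dst, src": dst = src - dst, so dst is overwritten. */
static v2f pfsubr_model(v2f dst, v2f src)
{
    v2f r = { src.lo - dst.lo, src.hi - dst.hi };
    return r;
}
```

To compute b - a, either operand can serve as the destination: `pfsub_model(b, a)` overwrites b, while `pfsubr_model(a, b)` produces the same value but overwrites a, leaving b available for later use.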
Using PAND and PXOR

Use PAND and PXOR to perform FABS and FCHS work on 3DNow! operands. For example:

    mabs  DQ    07fffffff7fffffffh
    sgn   DQ    08000000080000000h

          movq  mm0, [mabs]
          movq  mm1, [sgn]
          pxor  mm2, mm1          ;change sign
          pand  mm2, mm0          ;absolute value

Use the PXOR MMreg, MMreg instruction to clear all bits in an MMX register. Use the PCMPEQD MMreg, MMreg instruction to set all bits in an MMX register.

Swapping MMX Register Halves

To swap the halves of an MMX register (which should be avoided), use the following:

    ;mm1 = swapd (mm0), mm0 destroyed
    movq      mm1, mm0
    punpckldq mm0, mm0
    punpckhdq mm1, mm0

    ;mm1 = swapd (mm0), mm0 preserved
    movq      mm1, mm0
    punpckhdq mm1, mm1
    punpckldq mm1, mm0

If the code is being used only on Athlon family processors, use the PSWAPD instruction. See the AMD Extensions to the 3DNow! and MMX Instruction Sets Manual, order# 22466, for instruction usage.

PUNPCKL* and PUNPCKH* Instructions

PUNPCKL* and PUNPCKH* are essential facilities. Next to MOVQ/MOVD, these are the most frequently used MMX instructions in 3DNow! code. For example, converting a stream of unsigned bytes into 3DNow! floating-point operands:

    ;outside loop:
    pxor      mm0, mm0
    ;inside loop:
    movd      mm1, [foo_var]   ; v[3],v[2],v[1],v[0]
    punpcklbw mm1, mm0         ;0,v[3],0,v[2] | 0,v[1],0,v[0]
    movq      mm2, mm1         ;0,v[3],0,v[2] | 0,v[1],0,v[0]
    punpcklwd mm1, mm0         ;0,0,0,v[1] | 0,0,0,v[0]
    punpckhwd mm2, mm0         ;0,0,0,v[3] | 0,0,0,v[2]
    pi2fd     mm1, mm1         ;float(v[1]) | float(v[0])
    pi2fd     mm2, mm2         ;float(v[3]) | float(v[2])

Storing the Upper 32 Bits of an MMX Register

To store the upper 32 bits of an MMX register using MOVD, use either the PSRLQ or the PUNPCKHDQ instruction to move the high-order 32 bits of the register to the low-order 32 bits of the register. In this situation, it is optimal to use the PUNPCKHDQ instruction. The AMD-K6-III processor has only one shifter (which can execute PSRLQ), but more than one ALU (which can execute PUNPCKHDQ). Using PUNPCKHDQ therefore maximizes the likelihood of an execution unit being available.

Use PFMIN and PFMAX

Use PFMIN and PFMAX where possible. They are much faster than equivalent code using other 3DNow! instructions. PFMIN and PFMAX can be used for clamping. They can also be used in SIMD code that avoids branching by replacing it with computation.
For example:

    float x,z;
    z = abs(x);
    if (z >= 1) {
        z = 1/z;
    }

can be coded using branchless SIMD code as follows:

    ;;in:  mm0 = x
    ;;out: mm0 = z
    movq     mm5, mabs     ;0x7fffffff
    movq     mm6, one      ;1.0
    pand     mm0, mm5      ;z=abs(x)
    pcmpgtd  mm6, mm0      ;z < 1 ? 0xffffffff : 0
    pfrcp    mm2, mm0      ;1/z approx
    movq     mm1, mm0      ;save z
    pfrcpit1 mm0, mm2      ;1/z step
    pfrcpit2 mm0, mm2      ;1/z final
    pfmin    mm0, mm1      ;z = z < 1 ? z : 1/z

Another example. The following code:

    #define PI 3.14159265358979323f
    float x,z,r,res;
    z = abs(x)
    if (z < 1) {
        res = r;
    }
    else {
        res = PI/2-r;
    }

This code can be written as branchless SIMD code as follows:

    ;;in:  mm0 = x
    ;;     mm1 = r
    ;;out: mm1 = res
    movq     mm5, mabs     ;0x7fffffff
    movq     mm6, one      ;1.0
    pand     mm0, mm5      ;z=abs(x)
    pcmpgtd  mm6, mm0      ;z < 1 ? 0xffffffff : 0
    movq     mm4, pio2     ;pi/2
    pfsub    mm4, mm1      ;pi/2-r
    pandn    mm6, mm4      ;z < 1 ? 0 : pi/2-r
    pfmax    mm1, mm6      ;res = z < 1 ? r : pi/2-r

Precision Considerations

Carefully consider whether reciprocals, divides, square roots, and reciprocal square roots need full precision. If full precision is not required, accelerate the code by using just the approximations returned by PFRCP or PFRSQRT instead of coding the full reciprocal or reciprocal square root sequence with the Newton-Raphson step instructions. For lighting computations, the accuracy of the approximation instructions often suffices, but geometry transforms typically require full precision.

Moving Data Between MMX and Integer Registers

On the Athlon processor, avoid moving data from MMX to integer registers and vice versa. If this cannot be avoided, use the MOVD instruction to accomplish the transfer; do not pass the data manually through memory (except where the store can be scheduled well ahead of the load).

Store-to-Load Forwarding

Avoid store-to-load forwarding (a store feeding into a load) that does not have an address and size match. The only exception is a wide store feeding into a small load where the addresses match:

    movq [foo], mm0
    mov  eax, [foo]

Here are some cases to avoid:

    ;two small stores feeding one wide load
    mov  [foo], eax
    mov  [foo+4], edx
    movq mm0, [foo]

    ;load address does not match the store address
    movq [foo], mm0
    mov  eax, [foo+4]

    ;load straddles two stores
    movq [foo], mm0
    movq [foo+8], mm1
    movq mm2, [foo+4]

Block Copies

For memory block copies on the AMD-K6-2 or AMD-K6-III processor, most code will have very similar performance for large blocks, because of the limited bus interface.
On the AMD-K6-2 processor, this was verified by comparing multiple block copy implementations, which showed no significant performance differences. This is also true for block copies inside the L2 cache (for an off-chip L2). However, for L1-to-L1 block copies there is a difference.

Measured on an AMD-K6-2/300, Epox motherboard with MVP3 chipset and PC100 DRAM. Data blocks are QWORD aligned.

                             L1-to-L1     L2-to-L2     mem-to-mem
    memcpy()                 ... MB/s     ... MB/s     ... MB/s
    aggressive MOVQ loop     1718 MB/s    ... MB/s     ... MB/s

L2-to-L2 and mem-to-mem throughput increases with the Athlon processor. The aggressive MOVQ loop performs at a minimum as well as memcpy(), and does much better on L1-to-L1 transfers. It is also preferable for copies to non-cacheable areas on the AMD-K6-2 and AMD-K6-III processors, due to the doubled chunk size over the MOVSD inside the memcpy() function. For this reason, consider using MMX block copies. The code is as follows:

    _asm {
          mov  eax, [src]
          mov  edx, [dst]
          mov  ecx, (SIZE >> 6)    ;64 bytes are copied per iteration
    xfer:
          movq mm0, [eax]
          add  edx, 64
          movq mm1, [eax+8]
          add  eax, 64
          movq mm2, [eax-48]
          movq [edx-64], mm0
          movq mm3, [eax-40]
          movq [edx-56], mm1
          movq mm4, [eax-32]
          movq [edx-48], mm2
          movq mm5, [eax-24]
          movq [edx-40], mm3
          movq mm6, [eax-16]
          movq [edx-32], mm4
          movq mm7, [eax-8]
          movq [edx-24], mm5
          movq [edx-16], mm6
          movq [edx-8], mm7
          dec  ecx
          jnz  xfer
    }

Care should be taken to make the label xfer: 32-byte-aligned for maximum performance. As a side note, Microsoft Visual C++ 5.0 without the Service Pack appears to ignore align directives in inline assembly. This problem does not occur after applying the Service Pack to Microsoft Visual C++ 5.0.

Instruction Cache and Branch Prediction Effects

There are sometimes interesting performance differences, amounting to several frames per second in graphics applications. Instruction cache thrashing is one suspect for this. The other is branch prediction, which has a global history component where branches can influence the prediction of other branches. Most of the time this helps. (Two branches might be closely correlated: if one is taken, the other is always taken.) It can also hurt, like all heuristic algorithms.

In order to reduce potential instruction cache thrashing, group a program's hotspots close together. For example, extract all performance-critical functions into a single file.
Linker

There is another effect of function ordering that is more desirable. The linker allows the programmer to specify the exact order of every function in a DLL/executable as follows:

- The source code must be compiled with the /Gy switch. This creates packaged functions: a COMDAT record is emitted into the object file for each function.
- At link time, use the /ORDER:@filename switch to order the functions in the DLL/executable. The term filename refers to a file that lists the function names in the order in which they are to be emitted, one function name per line. For C code, it's simply the function name as it appears in the source with a pre-pended underscore (and a suffix for the Pascal calling convention).

This does not work for object files produced by MASM. MASM doesn't have a switch to create packaged functions, nor does it allow the user to create a COMDAT entry manually by putting COMDAT func into the source.

To reduce potential problems with branch prediction, eliminate as many branches as possible. The AMD-K6-III and Athlon processors have large instruction caches, so aggressive loop unrolling (which increases code size) helps. It is also worthwhile to eliminate branches that guard only a small amount of inline computation.

Code Alignment

To get 32-byte alignment in MASM 6.13, forgo the convenience of the new-style segment declarations, and do something like the following:

    _TEXT SEGMENT PAGE PUBLIC USE32 'CODE'
    ASSUME CS:FLAT, DS:FLAT, SS:FLAT, ES:FLAT
    ALIGN 32
    _TEXT ENDS

MASM will not allow an ALIGN more restrictive than the SEGMENT alignment. If .CODE is used, the result is a PARA aligned segment, that is, a 16-byte aligned segment.

For inline assembly in Microsoft Visual C++, the best alignment available is 16-byte alignment, using align 16 in the inline assembly code. Microsoft Visual C++ without the Service Pack ignores this directive, so check whether the alignment is actually there. Microsoft Visual C++ with the Service Pack applied seems to work in this regard. At present, correct operation of align has not been verified for every Microsoft Visual C++ and Service Pack combination.

For inline assembly in Metrowerks CodeWarrior, align is accepted and works. See the specific vendor for more information.

Software Write Combining

Writes to non-cacheable space are an important issue for low-level drivers.
Processors communicate with graphics chips through a command buffer on the graphics card, which is mapped to non-cacheable (PCI or AGP) space. On the Pentium processor this is made high-performance by setting that space to UCWC (non-cacheable write-combining), in which case the Pentium does write-combining and even bursting to that space. AMD-K6-2 processors with an earlier core (CPUID less than 588h) do not support the UCWC memory type, so they neither perform write combining, nor do they burst to memory. AMD-K6-2 processors with the later core (CPUID 588h to 58Fh) and AMD-K6-III processors support write combining to non-cacheable space, but they are not able to burst transfers when writing to non-cacheable memory areas. Also, AMD-K6-2 processors with the earlier core do not pipeline writes to non-cacheable space well.

This can create a bottleneck when data needs to be transferred to the graphics card, which in graphics drivers happens predominantly in texture download and triangle download code. (These cover the large majority of all writes.) Therefore, to get good performance with the millions of existing AMD-K6-2 processors, and even with AMD-K6-2 processors with the later core and AMD-K6-III processors, software needs to organize its writes carefully, which can achieve a noticeable performance gain in the process. This technique is called software write combining.

The basic technique is to collect writes to non-cacheable space into aligned QWORDs as much as possible. This is accomplished by using an MMX register as a write buffer and collecting DWORD writes using PUNPCK. Then store the data using aligned MOVQ stores.

The following are two basic approaches to align QWORD writes:

1. If there is a no-op command consisting of a single DWORD, which takes no processing time in the graphics chip, issue that command if the command buffer pointer is not QWORD aligned, then continue writing QWORDs. This works if all DWORDs in the command buffer are at least DWORD aligned. The drawback is wasting some bandwidth on no-op commands.
2. Split the code into two code streams. If the buffer pointer is not QWORD aligned, take the path that writes the first chunk as a DWORD, then continue writing QWORDs. If the buffer pointer is aligned, take the path that starts writing QWORDs immediately.
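The second approach can be sketched in portable C. This is a minimal illustration under stated assumptions, not driver code: `swc_write` is a hypothetical helper name, the destination is ordinary memory rather than a real non-cacheable command buffer, and an 8-byte `memcpy` stands in for the aligned MOVQ store that a 3DNow!/MMX implementation would issue from an MMX register. The return value (the number of QWORD stores) exists only to make the behaviour observable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch of software write combining: copy n DWORDs from
   src to dst, pairing them so that most stores are aligned QWORDs.
   dst must be at least DWORD aligned. Returns the QWORD store count. */
static size_t swc_write(uint32_t *dst, const uint32_t *src, size_t n)
{
    size_t i = 0;
    size_t qword_stores = 0;

    /* Unaligned path: write the first chunk as a single DWORD so that
       everything after it starts on a QWORD boundary. */
    if (n > 0 && ((uintptr_t)dst & 7u) != 0) {
        dst[0] = src[0];
        i = 1;
    }

    /* Collect two DWORDs in a small buffer (the role played by the MMX
       register and PUNPCK), then emit them with one 8-byte store. */
    while (i + 2 <= n) {
        uint32_t pair[2] = { src[i], src[i + 1] };
        memcpy(&dst[i], pair, sizeof pair);
        qword_stores++;
        i += 2;
    }

    /* Flush: an odd DWORD may be left over after the write loop. */
    if (i < n) {
        dst[i] = src[i];
    }
    return qword_stores;
}
```

With five DWORDs, an aligned destination produces two QWORD stores plus one trailing DWORD, while a destination offset by 4 bytes produces one leading DWORD followed by two QWORD stores.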
In both cases there is a case where the write buffer (the MMX register) needs to be flushed after the write loop. The two-stream approach is recommended for the highest possible performance, but the no-op approach is often easier to implement and often provides similar performance. On AMD-K6-III processors, use both software write combining and the hardware write-combining features of these processors.

Addressing Modes on AMD-K6-2 and AMD-K6-III Processors

The addressing modes listed below are sub-optimal for all instructions. They degrade short-decoded instructions to vector decode (or degrade to long decode in the case of 3DNow! instructions). This is due to a lack of on-the-fly corrections of the instruction length that was computed during predecode.

    16-bit addressing: [SI], [SI+disp8], [SI+disp16], [DI]
    32-bit addressing: [ESI]

The following addressing modes are sub-optimal for instructions with a 0F prefix (including MMX/3DNow! instructions). Again, this degrades short-decoded instructions to vector decode (long decode in the case of the 3DNow! instruction set). This is due to the inability to determine the instruction length from the first three bytes (0F prefix, opcode, ModR/M).

Note: This category has been eliminated in AMD-K6-2 processors with the later core revision and in the AMD-K6-III processor. However, as millions of existing AMD-K6-2 processors are affected by this issue, it is highly recommended to avoid these addressing modes.

ModR/M 00_xxx_100b is the only ModR/M encoding that requires the SIB value to determine the instruction length. For this ModR/M, the processor doesn't know whether there is a disp32 until it looks at the SIB (which predecode cannot do in the case of MMX/3DNow!). For ModR/M 01_xxx_100b there is always a disp8, and for ModR/M 10_xxx_100b there is always a disp32, so the length can be determined from looking at the ModR/M without looking at the SIB byte.

This ModR/M encoding is encountered with the following source-level addressing modes:

    [base+index]

The following example demonstrates the ModR/M byte and SIB byte resulting from several addressing modes; note that no MOV instruction is affected by the issue described here.
    ;opcode  ModR/M  SIB  disp
    8B       04      F2                mov eax, [edx+8*esi]
    8B       04      B3                mov eax, [4*esi+ebx]
    8B       04      D5   00000000     mov eax, [8*edx]
    8B       04      1A                mov eax, [edx+ebx]

Note that the third mode is actually identical to the second as far as the actual encoding is concerned.

Also, there is a length restriction. An instruction longer than seven bytes cannot be short decoded. For MMX instructions, avoid addressing modes with a 32-bit displacement. For 3DNow! instructions, avoid addressing modes with a 32-bit displacement.
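The ModR/M rule described in this section reduces to two bit tests, sketched below in C. The function names are hypothetical and introduced only for illustration. The second test relies on the general x86 encoding rule that, with mod = 00 and an SIB byte present, a base field of 101b means "no base register, disp32 follows"; that rule comes from the x86 instruction format, not from this guide.

```c
#include <stdbool.h>
#include <stdint.h>

/* True for ModR/M 00_xxx_100b: mod (bits 7..6) is 00 and r/m
   (bits 2..0) is 100, while the reg field (bits 5..3) is ignored.
   This is the one encoding where the SIB byte is needed to know
   the instruction length. */
static bool modrm_needs_sib_for_length(uint8_t modrm)
{
    return (modrm & 0xC7u) == 0x04u;
}

/* For a ModR/M of the form above, the SIB base field (bits 2..0)
   equal to 101b means a disp32 follows the SIB byte. */
static bool sib_implies_disp32(uint8_t modrm, uint8_t sib)
{
    return modrm_needs_sib_for_length(modrm) && (sib & 0x07u) == 0x05u;
}
```

For instance, `mov eax, [8*edx]` encodes as ModR/M 04h with SIB D5h, so `sib_implies_disp32(0x04, 0xD5)` is true, whereas `mov eax, [edx+8*esi]` (ModR/M 04h, SIB F2h) carries no displacement.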