AMD-K6® Processor Code Optimization Application Note

Publication 21924 Rev: D Amendment/0
Issue Date: January 2000

© 2000 Advanced Micro Devices, Inc. All rights reserved.

The contents of this document are provided in connection with Advanced Micro Devices, Inc. ("AMD") products. AMD makes no representations or warranties with respect to the accuracy or completeness of the contents of this publication and reserves the right to make changes to specifications and product descriptions at any time without notice. No license, whether express, implied, arising by estoppel or otherwise, to any intellectual property rights is granted by this publication. Except as set forth in AMD's Standard Terms and Conditions of Sale, AMD assumes no liability whatsoever, and disclaims any express or implied warranty, relating to its products including, but not limited to, the implied warranty of merchantability, fitness for a particular purpose, or infringement of any intellectual property right.

AMD's products are not designed, intended, authorized or warranted for use as components in systems intended for surgical implant into the body, or in other applications intended to support or sustain life, or in any other application in which the failure of AMD's product could create a situation where personal injury, death, or severe property or environmental damage may occur. AMD reserves the right to discontinue or make changes to its products at any time without notice.

Trademarks

AMD, the AMD logo, 3DNow!, and combinations thereof, K86, Super7, and AMD-K5 are trademarks, and RISC86 and AMD-K6 are registered trademarks of Advanced Micro Devices, Inc. MMX is a trademark and Pentium is a registered trademark of Intel Corporation. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.

Contents

Revision History
Introduction
  Purpose
  AMD-K6® Family Processors
  AMD-K6-2 and AMD-K6-III Processors
RISC86 Microarchitecture
  Overview
  Enhanced RISC86® Microarchitecture
AMD-K6-2 and AMD-K6-III Processors Execution Units and Dependency Latencies
  Execution Unit Terminology
  Six-Stage Pipeline
  Register X and Y Execution Units
  Load Unit
  Store Unit
  Branch Condition Unit
  Floating-Point Unit
  Latencies and Throughput
  Resource Constraints
  Code Sample Analysis
Instruction Dispatch
Optimization Coding Guidelines
  General Optimization Techniques
  General AMD-K6 Family Coding Optimizations
  AMD-K6 Family Integer Coding Optimizations
  AMD-K6-2 and AMD-K6-III Processors Multimedia Coding Optimizations
  AMD-K6-2 and AMD-K6-III Processors Floating Point Coding Optimizations
Considerations for Other Processors

List of Figures

Figure 1. AMD-K6®-III Processor Block Diagram
Figure 2. Processor Pipeline
Figure 3. Register X and Y Functional Units
Figure 4. Register X and Y Execution Stages
Figure 5. Microarchitecture Execution Resources
Figure 6. Load Execution Unit
Figure 7. Store Unit Execution Pipeline

List of Tables

Table 1. RISC86® Execution Latencies and Throughput
Table 2. Sample 1: Integer Register Operations
Table 3. Sample 2: Integer Register and Memory Load Operations
Table 4. Sample 3: Integer Register and Memory Load/Store Operations
Table 5. Sample 4: Integer, MMX, and Memory Load/Store Operations
Table 6. Integer Instructions
Table 7. MMX Instructions
Table 8. Floating-Point Instructions
Table 9. 3DNow! Instructions
Table 10. Decode Accumulation and Serialization
Table 11. Specific Optimizations and Guidelines for AMD-K6® and AMD-K5 Processors
Table 12. AMD-K6 Processor Versus Pentium® Processor-Specific Optimizations and Guidelines
Table 13. AMD-K6 Processor and Pentium Processor with Optimizations for MMX Instructions
Table 14. AMD-K6 Processor and Pentium Pro/Pentium II Specific Optimizations
Table 15. AMD-K6 Processor and Pentium II with Optimizations for MMX Instructions

Revision History

1998: Initial release.
1998: Added instructions to the "Integer Instructions" table.
1998: Clarified address modes.
August 1999: Changed the title and the Introduction to reflect that the information in this document applies to AMD-K6® family processors, mainly the AMD-K6-2 and AMD-K6-III processors.
August 1999: Revised address mode information.
August 1999: Revised the examples in "Division and Square Root".
2000: Changed mem64 to mem32 for PUNPCKLBW, PUNPCKLWD, and PUNPCKLDQ in the MMX instruction table.

Introduction

Purpose

The AMD K86 family of processors can efficiently execute code written for previous-generation x86 processors. However, to obtain the highest performance from the unique microarchitecture of the AMD-K6® family of processors, certain code optimization techniques should be applied. This document contains information to assist programmers in creating optimized code for the AMD-K6 family. It is targeted at compiler/assembler designers and assembly language programmers writing high-performance code sequences. It is assumed that the reader possesses an in-depth knowledge of the x86 architecture.

The information in this application note pertains to all AMD-K6 family processors; information specific to the AMD-K6-2 processor (Model 8) and the AMD-K6-III processor (Model 9) is noted. For information about recognition of the processor model numbers, see the AMD Processor Recognition Application Note, order# 20734.
AMD-K6® Family Processors

Processors in the AMD-K6 family use a decoupled instruction decode and superscalar execution microarchitecture to deliver sixth-generation performance with x86 binary software compatibility. An x86 binary-compatible processor implements the industry-standard x86 instruction set by decoding and executing the x86 instruction set as its native mode of operation. Only this native mode permits delivery of maximum performance when running PC software.

AMD-K6®-2 and AMD-K6®-III Processors

The AMD-K6-2 and AMD-K6-III processors (hereafter both referred to as "the processor") deliver this performance to desktop systems running industry-standard x86 software. The processor implements advanced design techniques such as:

- Instruction pre-decoding
- Multiple x86 opcode decoding
- Single-cycle internal RISC operations
- Multiple parallel execution units
- Out-of-order execution
- Data-forwarding
- Register renaming
- Dynamic branch prediction

The processor is capable of issuing, executing, and retiring multiple x86 instructions per cycle, resulting in superior scaleable performance. Although the processor can extract parallelism from off-the-shelf, commercially available x86 software, specific code optimizations for the processor can result in significantly higher delivered performance. This document describes the RISC86 microarchitecture in the processor and makes recommendations for optimizing the execution of x86 software on the processor.

The coding techniques for achieving peak performance on the processor include, but are not limited to, those recommended for the Pentium®, Pentium Pro, and Pentium II processors. However, many of these optimizations are not necessary for the processor to achieve maximum performance. For example, due to the more flexible pipeline control of the AMD-K6 microarchitecture, the processor is less sensitive to instruction selection and the scheduling of code. This flexibility is one of the distinct advantages of the AMD-K6 processor microarchitecture.

In addition to the ability to execute MMX instructions, the processor includes an implementation of the 3DNow! instruction set. 3DNow! technology was created based on suggestions from leading graphics software vendors.
Utilizing a data format and single instruction multiple data (SIMD) operations based on the MMX instruction model, the processor can produce as many as four 32-bit, single-precision floating-point results per clock cycle. 3DNow! technology also includes a multimedia instruction, an instruction that allows prefetching of data under software control, and a faster enter/exit multimedia-state instruction.

The 3DNow! units provide support for high-performance, floating-point vector operations, which can replace x87 instructions and enhance the performance of graphics and other floating-point-intensive applications. The complete multimedia processing unit in the processor combines the existing MMX instructions with the 3DNow! instructions. 3DNow! instructions share registers with the MMX multimedia unit. By mixing 3DNow! instructions with MMX instructions, it becomes possible to write programs containing both integer and floating-point instructions without the performance penalty that would have been incurred if x87 floating-point instructions were intermixed.

All of these improvements have been carefully designed to bring a better multimedia experience to mainstream PC users while maintaining backwards compatibility with existing x86 software.

RISC86 Microarchitecture

Overview

When discussing processor design, it is important to understand the terms architecture, microarchitecture, and design implementation.

The term architecture refers to the instruction set and features of a processor that are visible to software programs running on the processor. The architecture determines what software the processor can run. The architecture of the AMD-K6 processor is the industry-standard x86 instruction set.

The term microarchitecture refers to the design techniques used in the processor to reach the target cost, performance, and functionality goals. The AMD-K6 processor is based on a sophisticated RISC core known as the Enhanced RISC86 microarchitecture. The Enhanced RISC86 microarchitecture is an advanced decoupled decode/execution design approach that enables industry-leading performance for x86-based software.
The term design implementation refers to the actual logic and circuit designs from which the processor is created according to the microarchitecture specifications.

Enhanced RISC86® Microarchitecture

The Enhanced RISC86 microarchitecture defines the characteristics of the AMD-K6 family of processors. The innovative RISC86 microarchitecture approach implements the x86 instruction set by internally translating x86 instructions into RISC86 operations. These RISC86 operations were specially designed to include direct support for the x86 instruction set while observing the RISC performance principles of fixed-length encoding, regularized instruction fields, and a large register set. The Enhanced RISC86 microarchitecture used in the AMD-K6 processor enables straightforward extensions in future designs. Instead of directly executing complex x86 instructions, which have lengths of 1 to 15 bytes, the AMD-K6 processor executes the simpler fixed-length RISC86 operations, while maintaining the instruction coding efficiencies found in x86 programs.

The AMD-K6 processor includes parallel x86 instruction decoders, a centralized RISC86 operation scheduler, and several execution resources that support superscalar execution (multiple decode, execution, and retirement) of x86 instructions. These elements are packed into an aggressive and highly efficient six-stage processing pipeline.

Decoding of the x86 instructions into RISC86 operations begins when the on-chip level-one instruction cache is filled. Predecode logic determines the length of an x86 instruction on a byte-by-byte basis. This predecode information is stored, along with the x86 instructions, in a dedicated level-one predecode cache to be used later by the decoders. The predecode data is essential to the ability of the short decoders to operate.

The AMD-K6 processor categorizes x86 instructions into three types of decodes: short, long, and vector. The decoders process either two short, one long, or one vector decode at a time. The three types of decodes have the following characteristics:

- Short decodes: common x86 instructions less than or equal to 7 bytes in length that produce one or two RISC86 operations. The short decoders work in parallel, resulting in a maximum of four RISC86 operations per clock with no additional latency.
- Long decodes: more complex and somewhat common x86 instructions less than or equal to 11 bytes in length that produce up to four RISC86 operations.
- Vector decodes: complex x86 instructions requiring long sequences of RISC86 operations.

Short and long decodes are processed completely within the decoders. Vector decodes are started by the vector decoder with the generation of an initial set of four RISC86 operations, and then completed by fetching a sequence of additional operations from an on-chip ROM (at a rate of four operations per clock). RISC86 operations, whether produced by the decoders or fetched from ROM, are then loaded into a buffer in the centralized scheduler for dispatch to the execution units.

AMD-K6®-2 and AMD-K6®-III Processor-Specific Microarchitecture

The internal RISC86 instruction set consists of the following seven categories or types of operations (the execution unit that handles each type of operation is shown in parentheses):

- Memory load operations (load)
- Load immediate (instruction control unit)
- Memory store operations (store)
- Integer register operations (alu/alux)
- MMX/3DNow! register operations (multimedia execution unit (meu))
- x87 floating-point register operations (float)
- Branch condition evaluations (branch)

The following example shows a series of x86 instructions and the corresponding decoded RISC86 operations.

x86 Instructions      RISC86 Operations
MOV CX, [SP+4]        Load
ADD AX, BX            Alu (Add)
CMP CX, [AX]          Load
                      Alu (Sub)
JZ foo                Branch

The MOV instruction converts to a RISC86 load operation that requires indirect data to be loaded from memory. The ADD instruction converts to a register operation that can be sent to either of the integer units. The CMP instruction converts into two RISC86 operations. The first is a RISC86 load operation that requires indirect data to be loaded from memory. That value is then compared (alu function) with CX.

Once the RISC86 operations are placed in the centralized scheduler buffer, they can be issued immediately to the appropriate execution pipeline.
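The decode example above can be pictured as a lookup from each x86 instruction to its RISC86 operations, and from each operation type to the unit that handles it. The following is a minimal sketch only: the mnemonics (MOV, ADD, CMP, JZ), the `DECODE` table, and the `decode` helper are assumptions made for illustration and are not part of any AMD tool or of the processor's actual decode logic.

```python
# Illustrative model only; names and table layout are hypothetical.
# Unit names follow the seven RISC86 operation categories listed in the text.
RISC86_UNIT = {
    "load": "load",
    "limm": "instruction control unit",
    "store": "store",
    "alu": "alu/alux",
    "meu": "multimedia execution unit",
    "float": "float",
    "branch": "branch",
}

# Assumed decode results for the short example sequence.
DECODE = {
    "MOV CX, [SP+4]": ["load"],
    "ADD AX, BX": ["alu"],
    "CMP CX, [AX]": ["load", "alu"],  # memory operand: load first, then compare
    "JZ foo": ["branch"],
}


def decode(instructions):
    """Return (instruction, operation, unit) tuples in program order."""
    ops = []
    for insn in instructions:
        for op in DECODE[insn]:
            ops.append((insn, op, RISC86_UNIT[op]))
    return ops


ops = decode(list(DECODE))
for insn, op, unit in ops:
    print(f"{insn:16s} -> {op:6s} ({unit})")
```

Four x86 instructions expand to five RISC86 operations here, which is why the scheduler is sized in RISC86 operations rather than in x86 instructions.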
The processor contains ten execution pipelines: store, load, integer X ALU, integer Y ALU, MMX ALU (X), MMX ALU (Y), MMX/3DNow! multiplier, 3DNow! ALU, floating-point, and branch. Figure 1 shows a block diagram of these units within the processor. The Register X and Y Functional Units contain several execution resources, which are described in the chapter on execution units and dependency latencies.

[Figure 1. AMD-K6®-III Processor Block Diagram. Blocks shown: level-one instruction cache with predecode logic, ITLB, and predecode cache; level-one cache controller; Super7 bus interface; level-two cache (256 KByte); branch logic (8192-entry BHT, 16-entry BTC, 16-entry RAS); dual instruction decoders (x86 to RISC86); out-of-order execution engine with scheduler buffer, instruction control unit, and six RISC86 operation issue; load unit; store unit with store queue; Register X and Register Y units (integer/multimedia/3DNow!); floating-point unit; branch resolution unit; level-one dual-port data cache with DTLB.]

The centralized scheduler buffer, in conjunction with the instruction control unit (ICU), buffers and manages up to 24 RISC86 operations at a time (which equals up to 12 x86 instructions). This buffer size is matched to the processor's six-stage RISC86 pipeline and its decode rate of four RISC86 operations per clock. On every clock, the centralized scheduler buffer can accept up to four RISC86 operations from the decoders, issue up to six RISC86 operations to the corresponding execution unit pipelines, and retire up to four RISC86 operations. The Register X and Y execution units are shared between some of the execution pipelines. A maximum of two of these register operations can be issued at a time.

When managing the RISC86 operations, the ICU uses 69 physical registers contained within the RISC86 microarchitecture. Forty-eight of the physical registers are located in a general register file and are grouped as 24 committed or architectural registers plus 24 rename registers. The 24 architectural registers consist of 16 scratch registers and 8 registers that correspond to the x86 general-purpose registers EAX, EBX, ECX, EDX, EBP, ESP, ESI, and EDI. There is an analogous set of registers specifically for MMX and 3DNow! operations. There are 9 MMX/3DNow! committed or architectural registers plus 12 MMX/3DNow! rename registers. The 9 architectural registers consist of one scratch register and 8 registers that correspond to the MMX registers (mm0-mm7).

The processor offers sophisticated dynamic branch logic that includes the following elements:

- Branch history/prediction table
- Branch target cache
- Return address stack

These components serve to minimize or eliminate the delays due to the branch instructions (jumps, calls, returns) common in x86 software.

The processor implements a two-level branch prediction scheme based on an 8192-entry branch history table. The branch history table stores prediction information that is used for predicting the direction of conditional branches. The target addresses of conditional and unconditional branches are not predicted, but instead are calculated on-the-fly during instruction decode by special branch target address ALUs.

The branch target cache augments the performance of taken branches by avoiding a one-cycle cache-fetch penalty. This specialized target cache does this by supplying the first 16 bytes of target instructions to the decoders when a branch is taken.

The return address stack serves to optimize CALL and RETURN instruction pairs by remembering the return address of each CALL within a nested series of subroutines and supplying it as the predicted target of the corresponding RETURN instruction.

As shown in Figure 1, the high-performance, out-of-order execution engine is mated to a split 64-Kbyte writeback level-one cache (Harvard architecture) with 32 Kbytes of instruction cache and 32 Kbytes of data cache. The level-one instruction cache feeds the decoders and, in turn, the decoders feed the scheduler. The ICU controls the issue and retirement of RISC86 operations contained in the centralized scheduler buffer. The level-one data cache satisfies most memory reads and writes by the load and store execution units. The store queue temporarily buffers memory writes from the store unit until they can safely be committed into the cache (that is, when all preceding operations have been found to be free of faults and branch mispredictions). The system interface is an industry-standard Super7 and Socket 7 interface.
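The behavior of the return address stack can be illustrated with a small software model. This is a sketch under stated assumptions, not a description of the hardware: the 16-entry depth comes from the block diagram, while the class name, the method names, and the policy of discarding the oldest entry on overflow are hypothetical choices made here.

```python
from collections import deque


class ReturnAddressStack:
    """Toy model of a fixed-depth return address stack (hypothetical API)."""

    def __init__(self, depth=16):  # 16 entries, as labeled in the block diagram
        # Assumption: when the stack is full, the oldest entry is discarded.
        self._stack = deque(maxlen=depth)

    def on_call(self, return_address):
        """Remember the address of the instruction following a CALL."""
        self._stack.append(return_address)

    def on_return(self):
        """Predict the target of a RETURN; None means no prediction."""
        return self._stack.pop() if self._stack else None


ras = ReturnAddressStack()
ras.on_call(0x1005)        # outer CALL
ras.on_call(0x2010)        # nested CALL
inner = ras.on_return()    # innermost return address comes back first
outer = ras.on_return()
empty = ras.on_return()    # stack exhausted: no prediction available
print(hex(inner), hex(outer), empty)
```

In this model, call nesting deeper than the stack depth loses the outermost return addresses, so only the innermost returns would be predicted.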
AMD-K6-2 and AMD-K6-III Processors Execution Units and Dependency Latencies

The AMD-K6-2 and AMD-K6-III processors contain several specialized execution pipelines: store, load, register X, register Y, floating-point, and branch condition. Each pipeline operates independently and handles a specific subset of the RISC86 instruction set. The register X and register Y pipelines each contain integer, multimedia, and 3DNow! technology execution resources, some of which are shared between the two. This chapter describes the operation of these units, their execution latencies, and how these latencies affect concurrent dependency chains.

Note: The multimedia execution units (meu) execute MMX and 3DNow! instructions.

A dependency occurs when data needed in one execution unit/resource is being processed in another unit/resource (or in a different stage of the same unit/resource). Additional latencies can occur because the dependent execution unit must wait for the data from the supplying unit. Table 1 provides a summary of the execution units, the operations performed within these units, the operation latency, and the operation throughput.

Execution Unit Terminology

Introduction

The execution units operate with two different types of register values: operands and results. Of these, there are three types of operands and two types of results.

Operands

The three types of operands are as follows:

- Address register operands: used for address calculations of load and store operations
- Data register operands: used for register operations
- Store data register operands: used for memory stores

Results

The two types of results are as follows:

- Data register results: produced by load or register operations
- Address register results: produced by Lea or Push operations

The following examples illustrate the operand and result definitions:

- Register: The register operation has data register operands and a data register result (AX).
- Load: The Load operation has address register operands (the base and index registers, respectively) and a data register result (BX).
- Store: The Store operation has a store data register operand (AX) and address register operands (the base and index registers, respectively).
- Lea: The Lea operation (a type of store operation) has address register operands (the base and index registers, respectively), and an address register result (SI).

Six-Stage Pipeline

To help visualize the operations within the processor, Figure 2 illustrates the effective pipeline stages. This is a simplified illustration in that the processor contains multiple parallel pipelines (starting after the common instruction fetch and x86 decode pipe stages), and these pipelines often execute operations out-of-order with respect to each other. This view of the processor execution pipeline illustrates the effect of execution latencies for various types of operations.

For many instructions, the effective pipeline is seven stages. For register operations that require only one execution stage, the effective pipeline is six stages.

[Figure 2. Processor Pipeline. Stages: Instruction Fetch, x86->RISC86® Decode, RISC86 Issue, Operand Fetch, Execution Stage 1, Execution Stage 2, Commit. Note: Execution Stage 2 is optional.]

Register X and Y Execution Units

The register X and Y execution resources are attached to the register X unit execution pipeline and the register Y unit execution pipeline. Each register execution pipeline has dedicated resources that consist of an integer execution unit and an MMX ALU execution unit. In addition, both pipelines have shared execution units for 3DNow! operations and for MMX shift and multiply operations. Figure 3 shows the details of the register X and Y execution pipelines.

[Figure 3. Register X and Y Functional Units. The scheduler buffer issues RISC86® operations to the register X and register Y execution pipelines. Each pipeline has its own integer unit and MMX ALU; the MMX/3DNow! multiplier, the MMX shifter, and the 3DNow! ALU are shared.]

The register X integer execution resource can execute all integer operations including ALU, multiply, divide (signed and unsigned), shift, and rotate. Data register results are available after a minimum of one clock of execution latency.
The dedicated integer execution unit contained within the register Y execution pipeline can execute the basic word and doubleword operations (ADD, AND, CMP, SUB, XOR), as well as zero-extend and sign-extend operations. Data register results are available after one clock.

The register X and Y execution pipelines each contain a dedicated MMX multimedia execution unit that handles MMX add/subtract, logical, and pack/unpack instructions. The two units can operate simultaneously, which means that the processor can execute two of these operations each clock cycle.

A number of execution resources are available to both the register X and Y execution pipelines. These shared resources include an MMX shifter, a 3DNow! ALU, and a combined MMX/3DNow! multiplier. Figure 5 shows which instruction types are associated with the various execution pipelines.

Any combination of two operations that do not utilize the same shared execution resource can be issued and executed simultaneously. For example, the following pairs of register operations can execute together: an MMX logical and a 3DNow! add, a 3DNow! ALU operation and a 3DNow! multiply, an MMX multiply and a 3DNow! add, etc.

If issued simultaneously, the following examples result in resource contention and a stall of one RISC86 operation: an MMX multiply and a 3DNow! multiply, two MMX multiplies, two 3DNow! multiplies, two 3DNow! adds, etc.

Figure 4 shows the data flow architecture of the single-stage and double-stage integer execution unit pipeline. There are operations (such as integer multiply) that require a second execution stage. The operation issue and operand fetch stages that precede the execution stage are not part of the execution pipeline. The data register result is produced near the end of the execution pipe stage.

[Figure 4. Register X and Y Execution Stages. Data register operands enter Execution Stage 1 (integer ALU, MMX ALU, etc.), followed by Execution Stage 2 (if necessary), producing the data register result.]

[Figure 5. Microarchitecture Execution Resources.
Register X execution pipeline, dedicated resources: integer ALU; integer shift; integer multiply and divide; integer byte operations; integer special registers; integer segment register loads; MMX add/subtract, compare, logical, pack, unpack.
Shared register X and Y resources: 3DNow! add/subtract, compare, integer conversion, reciprocal and reciprocal square root table lookup; MMX and 3DNow! multiply, reciprocal and reciprocal square root iteration; MMX shifter.
Register Y execution pipeline, dedicated resources: integer ALU; MMX add/subtract, compare, logical, pack, unpack.]

Load Unit

The load unit is a two-stage pipelined design that performs data memory reads. It has a two-clock latency from the time it receives the address register operands until it produces a data register result on a Dcache hit. A cache miss produces longer latencies. The load unit and the Dcache support hit-under-miss operations, where a load operation bypasses a previous load operation that is stalled waiting for a cache line refill.

This unit uses address register operands and a memory data value as inputs, and produces a data register result. Memory read data can come from either the data cache or from a store queue entry (for a recent store). If the data is forwarded from the store queue, there is zero additional execution latency, which means that a dependent load operation can complete its execution one clock after a store operation completes execution.

Figure 6 shows the architecture of the two-stage load execution pipeline. The address register operands are received at the end of the operand fetch pipe stage, and the data register result is produced near the end of the second execution pipe stage. The operation issue and fetch stages that precede these execution stages are not shown.
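The choice between data-cache data and forwarded store-queue data can be sketched in software. This is a simplified model under assumptions: the seven-entry capacity matches the store queue size given in the store unit description, but the class, the function names, and the exact-address matching rule (no partial overlaps, no operand sizes) are hypothetical simplifications rather than the hardware's actual lookup logic.

```python
class StoreQueue:
    """Toy store queue holding pending (address, value) pairs, oldest first."""

    def __init__(self, capacity=7):  # seven entries, per the store unit text
        self.capacity = capacity
        self.entries = []

    def push(self, address, value):
        """Enter a completed store; a full queue would stall the store."""
        if len(self.entries) >= self.capacity:
            raise RuntimeError("store queue full")
        self.entries.append((address, value))

    def commit_oldest(self, cache):
        """Write the oldest pending store into the (modeled) data cache."""
        address, value = self.entries.pop(0)
        cache[address] = value


def load(address, store_queue, cache):
    """Return (value, source), preferring the most recent matching store."""
    for pending_address, value in reversed(store_queue.entries):
        if pending_address == address:
            return value, "store queue"
    return cache.get(address), "data cache"


cache = {0x200: 7}
sq = StoreQueue()
sq.push(0x100, 42)
forwarded = load(0x100, sq, cache)   # satisfied by the pending store
from_cache = load(0x200, sq, cache)  # no pending store to this address
sq.commit_oldest(cache)
after_commit = load(0x100, sq, cache)
print(forwarded, from_cache, after_commit)
```

The load sees the same value before and after the store commits; only the source changes, which is the property that lets a dependent load complete right behind the store.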
[Figure 6. Load Execution Unit. Address register operands (base and index) enter Execution Stage 1 (address calculation stage); Execution Stage 2 (data cache/store queue lookup) takes memory data from the data cache or the store queue and produces the data register result.]

Store Unit

The store execution unit is a two-stage pipelined design that performs data memory writes and, in some cases, produces an address register result. For inputs, the store unit uses address register operands and, during actual memory writes, a store data register operand. This unit also produces an address register result for some store unit operations. For most store operations, for example those that write data to memory, the store unit produces a physical memory address and the associated data bytes to be written. After execution completes, these results are entered in a new store queue entry. The store queue can hold up to seven data results, each of which can be up to 64 bits.

The store unit has a one-clock execution latency from the time it receives address register operands until the time it produces an address register result. The most common examples are the Load Effective Address (Lea) and Store and Update (Push) RISC86 operations, which are produced from the x86 LEA and PUSH instructions, respectively. Most store operations do not produce an address register result and only perform a memory write. The Push operation is unique because it produces an address register result and performs a memory write.

The store unit has a one-clock execution latency from the time it receives a store data operand until it enters a store memory address and data pair into the store queue.

The store unit can have a three-clock latency from the time it receives address register operands and a store data register operand until it enters a store memory address and data pair into the store queue.

Note: Address register operands are required at the start of execution, but register store data is not required until the end of execution.

Figure 7 shows the architecture of the two-stage store execution pipeline. The operation issue and fetch stages that precede this execution stage are not part of the execution pipeline.
The address register operands are received at the end of the operand fetch pipe stage, and the new store queue entry is created upon completion of the second execution pipe stage.

[Figure 7. Store Unit Execution Pipeline. Address register operands (base and index) enter Execution Stage 1 (address calculation stage), which produces the address register result; Execution Stage 2 takes the store data register operand and produces the address and data for the store queue entry.]

Branch Condition Unit

The branch condition unit is separate from the branch prediction logic, which is utilized at x86 instruction decode time. This unit resolves conditional branches, such as LOOP instructions, at a rate of up to one per clock cycle. This unit has its own dedicated RISC86 issue from the scheduler.

Floating-Point Unit

The floating-point unit (FPU) handles all register operations for x87 instructions. The execution unit is a single-stage design that takes data register operands as inputs and produces a data register result as an output. The most common floating-point instructions have a two-clock execution latency from the time the unit receives the data register operands until it produces a data register result. The FPU has its own RISC86 issue from the scheduler.

Latencies and Throughput

Table 1 summarizes the static latencies and throughput of each execution unit.

Table 1. RISC86® Execution Latencies and Throughput

Execution Unit                      Operations
Register X Integer Unit             Integer ALU
                                    Integer Multiply
                                    Integer Shift
Register X Multimedia Unit          MMX ALU Add/Subtract, Logical, Pack, Unpack
Register Y Integer Unit             Integer ALU (16- and 32-bit operands)
Register Y Multimedia Unit          MMX ALU Add/Subtract, Logical, Pack, Unpack
Multimedia/3DNow! Shared            MMX Shifter
Execution Units                     MMX/3DNow! Multiply, Reciprocal, and Reciprocal Square Root Iteration
                                    3DNow! Add, Compare, Integer Conversion, Reciprocal, and Reciprocal Square Root Table Lookup
Load                                From Address Register Operands to Data Register Result
                                    Memory Read Data from Data Cache/Store Queue to Data Register Result
Store                               From Address Register Operands to Address Register Result
                                    From Store Data Register Operand to Store Queue Entry
                                    From Address Register Operands to Store Queue Entry
Branch                              Resolves Branch Conditions
FPU                                 FADD, FSUB
                                    FMUL

Note: No additional latency exists between the execution of dependent operations. Bypassing of register results directly from producing execution units to the operand inputs of dependent units is fully supported. Similarly, forwarding of memory store values from the store queue to dependent load operations is supported.

Resource Constraints

To optimize code effectively, execution resource constraints must be considered. Because the number of execution units is fixed, even with multiple RISC86 operations issued per cycle, optimal execution parallelism should be carefully scheduled. For example, if an IMUL is decoded and issued to the register X pipeline, then for the next three cycles integer, MMX, and 3DNow! technology RISC86 operations can only be issued to the register Y pipeline. Another example is two instructions that require the load unit. Only one load can occur each cycle; therefore, one instruction would stall for a cycle.

Contention for execution resources causes delays in the issuing and execution of instructions. In addition, stalls due to resource constraints can increase dependency latencies and cause or exacerbate stalls due to dependencies. In general, constraints that delay non-critical instructions do not impact performance because such stalls typically overlap with the execution of critical operations.
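The pairing rule for the shared register X and Y resources (two operations can issue together unless both need the same shared unit) can be written as a small check. This is a minimal sketch: the operation labels and the `SHARED_RESOURCE` table are names invented here for illustration, and the model covers only shared-resource contention, not decode limits or operand dependencies.

```python
# Hypothetical operation labels mapped to the shared resource they occupy.
# Operations handled by dedicated per-pipeline units map to no shared resource.
SHARED_RESOURCE = {
    "mmx_shift": "mmx shifter",
    "mmx_multiply": "mmx/3dnow multiplier",
    "3dnow_multiply": "mmx/3dnow multiplier",
    "3dnow_add": "3dnow alu",
}


def can_issue_together(op_a, op_b):
    """True if the two register operations do not contend for a shared unit."""
    res_a = SHARED_RESOURCE.get(op_a)  # None: dedicated integer or MMX ALU
    res_b = SHARED_RESOURCE.get(op_b)
    return res_a is None or res_b is None or res_a != res_b


pairs = [
    ("mmx_logical", "3dnow_add"),
    ("mmx_multiply", "3dnow_add"),
    ("mmx_multiply", "3dnow_multiply"),
    ("3dnow_add", "3dnow_add"),
]
for a, b in pairs:
    verdict = "issue together" if can_issue_together(a, b) else "one op stalls"
    print(f"{a} + {b}: {verdict}")
```

When scheduling multimedia code by hand, a check like this suggests interleaving multiplies with adds or logical operations rather than placing two multiplies back to back.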
Code Sample Analysis

The samples in this section show the execution behavior of several series of instructions as a function of decode constraints, dependencies, and execution resource constraints.

The sample tables show the x86 instructions, the RISC86 operation equivalents, the clock counts, and a description of the events occurring within the processor. The following nomenclature is used to describe the current location of a RISC86 operation:

- Decode stage
- Issue stage of the register X unit
- Operand fetch stage of the register X unit
- Execution stage 1 of the register X unit
- Execution stage 2 of the register X unit
- Issue stage of the register Y unit
- Operand fetch stage of the register Y unit
- Execution stage 1 of the register Y unit
- Execution stage 2 of the register Y unit
- Issue stage of the load unit
- Operand fetch stage of the load unit
- Execution stage 1 of the load unit
- Execution stage 2 of the load unit
- Issue stage of the store unit
- Operand fetch stage of the store unit
- Execution stage 1 of the store unit
- Execution stage 2 of the store unit

Note: Instructions execute more efficiently (that is, without delays) when scheduled apart by suitable distances based on dependencies. In general, the samples in this section show poorly scheduled code in order to illustrate the resultant effects.

Table 2. Sample 1: Integer Register Operations

Columns: instruction number, instruction, RISC86® operation (alux, alu, limm), clocks. Surviving instruction fragments: IMUL EAX, ...; ... EAX, 0x0F; ... ESI, ...; ... EDI, ...; ... EDI, 0x07F4.

Comments for each instruction number, in order:

- This instruction takes additional decode cycles because IMUL is vector decoded. The IMUL instruction is executable only in the integer X unit. It is a non-pipelined, multiple-cycle latency register operation that is equivalent to three serially-dependent register operations (the results of the second and third operations are EAX and EDX, respectively).
- This simple operation ends up in the Y pipe.
- A load immediate (limm) RISC86 operation does not require execution. The result value is immediately available to dependent operations.
- Shift instructions are only executable in the integer X unit. Issue is delayed by the preceding IMUL operations due to the resource constraint of the integer X unit.
- The register operation is bumped out of its integer unit for a clock because it must wait more than a cycle for its dependencies to resolve. It is reissued in the next cycle to an integer unit (just in time for the availability of its operand).
- This operation falls through to the integer unit right behind the first issuance of the preceding instruction without delay (as a result of that instruction being bumped out of the way).
- The issuance of the subtract register operation is delayed for a clock due to resource constraints of the integer unit.

Table 3. Sample 2: Integer Register and Memory Load Operations

Columns: instruction number, instruction, RISC86® operation (load, alu, alux), clocks. Surviving instruction fragments: ... EDI, [ECX]; ... EAX, [EDX+20]; ... EAX, ECX; ... [EDI+4]; ... EBX, 0x1F; ... ESI, [0x0F100]; ... ECX, [ESI+EAX*4+8].

Comments for each instruction number, in order:

- This simple operation ends up in the Y pipe.
- This operation occupies the load execution unit.
- The register operand for the load operation is bypassed, without delay, from the result of instruction #1's register operation. For one clock, the register operation is bumped out of the integer unit while waiting for the previous load operation result to complete. It is reissued just in time to receive the bypassed result of the load.
- Shift instructions are only executable in the integer X unit. The register operation is bumped for a clock while waiting for the result of the preceding instruction.
- The register operand for the load operation is bypassed, without delay, from the result of instruction #2's register operation. This and most surrounding load operations are generated by the instruction decoders, issued smoothly, and executed in the load unit at a rate of one per clock cycle. For one clock, the register operation is bumped out of the integer unit while waiting for the previous load operation result to complete.
- The register operation falls through into the integer unit right behind instruction #5's register operation.
This operation falls into load unit behind load instruction operand fetch load operation delayed because needs result immediately preceding load operation well results from earlier instruction Execution Units Dependency Latencies Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table Sample Integer Register Memory Load/Store Operations RISC86® Operation load load store Clocks Instruction Number Instruction EDX, [0xA0008F00] [EDX+16], EAX, [EDX+16] load PUSH EBX, [ECX+EAX*4+3] EDI, store store Comments Each Instruction Number This operation occupies load unit. This long-decoded instruction takes single clock decode. operand fetch load operation delayed waiting result previous load operation from instruction store operation completes concurrent with register operation. result register operation bypassed directly into store queue entry created store operation. issue load operation delayed because operand fetch preceding load operation from instruction delayed. completion load operation held memory dependency preceding store operation instruction load operation completes immediately after store operation, with store data being forwarded from store queue entry. Completion store operation held data dependency preceding instruction store data bypassed directly into store queue entry from result instruction #3's register operation. RISC86 operation executed store unit. operand fetch delayed waiting result instruction register result value produced first execution stage store unit. This simple operation stalled dependency result instruction Chapter Execution Units Dependency Latencies AMD-K6® Processor Code Optimization Table Inst. Num. 
21924D/0-January 2000 Sample Integer, MMXTM, Memory Load/Store Operations RISC86® Operation mload mload Clocks Instruction PADDSWMM0, PADDSWMM1, PSRAW MM0, MOVQ MM2, [EAX+EBX] PAND MM0, PMULLWMM2, [EDI+8] MOVQ [ESP+4], EBX, mstore PMULLWMM6, PMADDWDMM2, Comments Each Instruction Number Instructions decoded, issued, executed simultaneously parallel decode restrictions, dependency delays, execution resource constraints. This instruction decoded, issued, executed without delay, cycle behind preceding one-cycle execution latency instruction which dependent. This multimedia operation occupies load unit. This instruction decoded, issued, executed without delay, right behind preceding operations which dependent. This preceding instruction decoded issued together without delay. operand fetch register operation delayed because dependency associated load. result, register operation bumped register unit clock reissued next cycle register unit happens), just time availability operands. Completion this store operation held data dependency preceding multiply register operation (which two-cycle execution latency). store data bypassed directly into store queue entry from result register operation. This operation issued register unit executes without delay out-of-order with respect preceding register operation from instruction (which bumped while waiting operands). This multiply register operation issues starts execution register unit parallel with multiply register operation from instruction which simultaneously issues starts execution register unit execution resource constraint, this operation delayed cycle first execution pipe stage then executes completes normally, cycle behind other contending register operation. (This takes advantage pipelined nature multiply execution logic.) issue this operation delayed clock cycle earlier register operations being selected issue. 
It is then delayed further during operand fetch while waiting for the preceding two-cycle latency multiply register operations to complete execution.

Instruction Dispatch

This chapter describes the RISC86 operations executed by each instruction. The tables that follow define the integer, MMX, floating-point, and 3DNow! instructions. Only the AMD-K6-2 and AMD-K6-III processors support the instructions in the "3DNow! Instructions" table.

The first column in these tables indicates the instruction mnemonic and operand types with the following notations:

- reg8 - byte integer register defined by instruction byte(s) or by bits of the modR/M byte
- mreg8 - byte integer register or byte integer value in memory defined by the modR/M byte
- reg16/32 - word or doubleword integer register defined by instruction byte(s) or by bits of the modR/M byte
- mreg16/32 - word or doubleword integer register, or word or doubleword integer value in memory, defined by the modR/M byte
- mem8 - byte integer value in memory
- mem16/32 - word or doubleword integer value in memory
- mem32/48 - doubleword or 48-bit integer value in memory
- mem48 - 48-bit integer value in memory
- mem64 - 64-bit value in memory
- imm8 - 8-bit immediate value
- imm16/32 - 16-bit or 32-bit immediate value
- disp8 - 8-bit displacement value
- disp16/32 - 16-bit or 32-bit displacement value
- disp32/48 - doubleword or 48-bit displacement value
- eXX - register width depending on the operand size
- mem32real - 32-bit floating-point value in memory
- mem64real - 64-bit floating-point value in memory
- mem80real - 80-bit floating-point value in memory
- mmreg - MMX/3DNow! register
- mmreg1 - MMX/3DNow! register defined by bits of the modR/M byte
- mmreg2 - MMX/3DNow! register defined by bits of the modR/M byte

The second and third columns list all applicable encoding opcode bytes. The fourth column lists the modR/M byte when it is used by the instruction. The modR/M byte defines the instruction as a register or memory form. If the mod bits are documented as mm (memory form), mm can only be 10b, 01b, or 00b. The fifth column lists the type of instruction decode: short, long, or vector.
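The modR/M patterns in the fourth column (for example 11-010-xxx or mm-010-xxx) can be read mechanically. The following is a minimal sketch, assuming the standard x86 modR/M layout (mod in bits 7-6, reg/opcode extension in bits 5-3, r/m in bits 2-0). The function names `split_modrm`, `modrm_form`, and `modrm_pattern` are hypothetical helpers written for illustration; they do not come from this application note.

```python
def split_modrm(byte):
    """Split an x86 modR/M byte into its (mod, reg, rm) bit fields."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("modR/M must fit in one byte")
    mod = (byte >> 6) & 0b11    # bits 7-6: 11b = register form, else memory form
    reg = (byte >> 3) & 0b111   # bits 5-3: register or opcode extension
    rm = byte & 0b111           # bits 2-0: register or memory operand selector
    return mod, reg, rm


def modrm_form(byte):
    """Classify the byte the way the tables do: '11' rows vs. 'mm' rows."""
    mod, _, _ = split_modrm(byte)
    return "register" if mod == 0b11 else "memory"


def modrm_pattern(byte):
    """Render the byte in the tables' dashed notation, with concrete bits."""
    mod, reg, rm = split_modrm(byte)
    return f"{mod:02b}-{reg:03b}-{rm:03b}"


if __name__ == "__main__":
    for value in (0xD1, 0x51):
        print(hex(value), modrm_pattern(value), modrm_form(value))
```

A byte whose mod field is 11b matches a register-form row; any of 00b, 01b, or 10b matches the corresponding mm (memory-form) row.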
The processor decode logic processes short, long, or vector decodes each clock. A pair of short decodable instructions (integer, floating-point, MMX, or 3DNow!) can be decoded simultaneously. 3DNow! instructions are short decodable, except for the EMMS, FEMMS, and PREFETCH instructions.

The sixth column lists the type of RISC86 operation(s) required for the instruction. The operation types and corresponding execution units are as follows:

- load, fload, mload - load unit
- store, fstore, mstore - store unit
- alu - either of the integer register execution units
- alux - integer register X execution unit only
- branch - branch condition unit
- float - floating-point execution unit
- meu - multimedia execution units for 3DNow! instructions
- limm - load immediate, instruction control unit only

The operation(s) of most instructions form a single dependency chain. For instructions whose operations form parallel dependency chains, each chain is shown on a separate row.

Table Integer Instructions Instruction Mnemonic mreg8, reg8 mem8, reg8 mreg16/32, reg16/32 mem16/32, reg16/32 reg8, mreg8 reg8, mem8 reg16/32, mreg16/32 reg16/32, mem16/32 imm8 EAX, imm16/32 mreg8, imm8 mem8, imm8 mreg16/32, imm16/32 mem16/32, imm16/32 mreg16/32, imm8 (signed ext.) mem16/32, imm8 (signed ext.) 
mreg8, reg8 mem8, reg8 mreg16/32, reg16/32 mem16/32, reg16/32 reg8, mreg8 reg8, mem8 reg16/32, mreg16/32 reg16/32, mem16/32 imm8 First Byte 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx Second Byte ModR/M Byte Decode Type vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector short long short long short short short short short alux load, alux, store load, alu, store alux load, alux load, alux RISC86® Operations Chapter Instruction Dispatch AMD-K6® Processor Code Optimization 21924D/0-January 2000 Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-xxx-xxx mm-xxx-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx Second Byte ModR/M Byte Decode Type short short long short long short long short long short long short short short short short short short long short long short long vector vector vector vector vector vector vector long long long alux load, alux, store load, alu, store alux load, alux, store alux load, alux, store load, alu, store alux load, alux load, alux alux load, alux, store load, alu, store alux load, alux, store RISC86® Operations EAX, imm16/32 mreg8, imm8 mem8, imm8 mreg16/32, imm16/32 mem16/32, imm16/32 mreg16/32, imm8 (signed ext.) mem16/32, imm8 (signed ext.) mreg8, reg8 mem8, reg8 mreg16/32, reg16/32 mem16/32, reg16/32 reg8, mreg8 reg8, mem8 reg16/32, mreg16/32 reg16/32, mem16/32 imm8 EAX, imm16/32 mreg8, imm8 mem8, imm8 mreg16/32, imm16/32 mem16/32, imm16/32 mreg16/32, imm8 (signed ext.) mem16/32, imm8 (signed ext.) 
ARPL mreg16, reg16 ARPL mem16, reg16 BOUND reg16/32, mreg16/32 reg16/32, mem16/32 reg16/32, mreg16/32 reg16/32, mem16/32 BSWAP BSWAP BSWAP Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-xxx-xxx 11-011-xxx 11-010-xxx mm-010-xxx Second Byte 11-xxx-xxx mm-xxx-xxx 11-100-xxx mm-100-xxx 11-xxx-xxx mm-xxx-xxx 11-111-xxx mm-111-xxx 11-xxx-xxx mm-xxx-xxx 11-110-xxx mm-110-xxx 11-xxx-xxx mm-xxx-xxx 11-101-xxx mm-101-xxx ModR/M Byte Decode Type long long long long long vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector short vector vector vector vector vector vector vector vector vector short alux store RISC86® Operations BSWAP BSWAP BSWAP BSWAP BSWAP mreg16/32, reg16/32 mem16/32, reg16/32 mreg16/32, imm8 mem16/32, imm8 mreg16/32, reg16/32 mem16/32, reg16/32 mreg16/32, imm8 mem16/32, imm8 mreg16/32, reg16/32 mem16/32, reg16/32 mreg16/32, imm8 mem16/32, imm8 mreg16/32, reg16/32 mem16/32, reg16/32 mreg16/32, imm8 mem16/32, imm8 CALL full pointer CALL near imm16/32 CALL mem16:16/32 CALL near mreg32 (indirect) CALL near mem32 (indirect) CBW/CWDE CLTS mreg8, reg8 Chapter Instruction Dispatch AMD-K6® Processor Code Optimization 21924D/0-January 2000 Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx Second Byte ModR/M Byte mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx Decode Type short short short short short short short short short short short short short long long vector vector vector vector vector vector vector vector vector vector vector vector vector short short short short short load, alux load, alux load, alux alux load, alux load, load, load, RISC86® Operations load, alux mem8, reg8 mreg16/32, reg16/32 mem16/32, 
reg16/32 reg8, mreg8 reg8, mem8 reg16/32, mreg16/32 reg16/32, mem16/32 imm8 EAX, imm16/32 mreg8, imm8 mem8, imm8 mreg16/32, imm16/32 mem16/32, imm16/32 mreg16/32, imm8 (signed ext.) mem16/32, imm8 (signed ext.) CMPSB mem8,mem8 CMPSW mem16, mem32 CMPSD mem32, mem32 CMPXCHG mreg8, reg8 CMPXCHG mem8, reg8 CMPXCHG mreg16/32, reg16/32 CMPXCHG mem16/32, reg16/32 CMPXCHG8B EDX:EAX CMPXCHG8B mem64 CPUID CWD/CDQ EDX, Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-110-xxx mm-110-xxx 11-110-xxx mm-110-xxx 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx 11-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 11-xxx-xxx mm-xxx-xxx Second Byte ModR/M Byte Decode Type short short short vector long vector long vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector load, alu, store load, alux, store RISC86® Operations mreg8 mem8 mreg16/32 mem16/32 mreg8 mem8 EAX, mreg16/32 EAX, mem16/32 IDIV mreg8 IDIV mem8 IDIV EAX, mreg16/32 IDIV EAX, mem16/32 IMUL reg16/32, imm16/32 IMUL reg16/32, mreg16/32, imm16/32 IMUL reg16/32, mem16/32, imm16/32 IMUL reg16/32, imm8 (sign extended) IMUL reg16/32, mreg16/32, imm8 (signed) IMUL reg16/32, mem16/32, imm8 (signed) IMUL mreg8 IMUL mem8 IMUL EDX:EAX, EAX, mreg16/32 IMUL EDX:EAX, EAX, mem16/32 IMUL reg16/32, mreg16/32 IMUL reg16/32, mem16/32 imm8 imm8 EAX, imm8 EAX, Chapter Instruction Dispatch AMD-K6® Processor Code Optimization 21924D/0-January 2000 Table Integer Instructions (continued) Instruction Mnemonic First Byte mm-111-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx Second Byte ModR/M Byte Decode Type short short short short short short short short vector long vector long vector vector short short short short 
short short short short short short short short short short short short vector short short branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch load, alu, store load, alux, store RISC86® Operations mreg8 mem8 mreg16/32 mem16/32 INVD INVLPG short disp8 JB/JNAE short disp8 short disp8 JNB/JAE short disp8 JZ/JE short disp8 JNZ/JNE short disp8 JBE/JNA short disp8 JNBE/JA short disp8 short disp8 short disp8 JP/JPE short disp8 JNP/JPO short disp8 JL/JNGE short disp8 JNL/JGE short disp8 JLE/JNG short disp8 JNLE/JG short disp8 JCXZ/JEC short disp8 near disp16/32 near disp16/32 Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table Integer Instructions (continued) Instruction Mnemonic First Byte mm-011-xxx 11-010-xxx mm-010-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx mm-xxx-xxx mm-xxx-xxx 11-101-xxx mm-101-xxx 11-100-xxx mm-100-xxx Second Byte ModR/M Byte Decode Type short short short short short short short short short short short short short short short vector short vector vector vector vector vector vector vector vector short long vector vector vector vector vector vector load, load, alu, branch RISC86® Operations branch branch branch branch branch branch branch branch branch branch branch branch branch branch branch JB/JNAE near disp16/32 JNB/JAE near disp16/32 JZ/JE near disp16/32 JNZ/JNE near disp16/32 JBE/JNA near disp16/32 JNBE/JA near disp16/32 near disp16/32 near disp16/32 JP/JPE near disp16/32 JNP/JPO near disp16/32 JL/JNGE near disp16/32 JNL/JGE near disp16/32 JLE/JNG near disp16/32 JNLE/JG near disp16/32 near disp16/32 (direct) disp32/48 (direct) disp8 (short) mreg32 (indirect) mem32 (indirect) near mreg16/32 (indirect) near mem16/32 (indirect) LAHF reg16/32, mreg16/32 reg16/32, mem16/32 reg16/32, mem32/48 reg16/32, mem16/32 LEAVE reg16/32, mem32/48 reg16/32, mem32/48 LGDT mem48 reg16/32, mem32/48 LIDT mem48 LLDT mreg16 Chapter Instruction Dispatch AMD-K6® 
Processor Code Optimization 21924D/0-January 2000 Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-xxx-xxx mm-xxx-xxx mm-xxx-xxx 11-011-xxx mm-011-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx Second Byte ModR/M Byte mm-010-xxx 11-100-xxx mm-100-xxx Decode Type vector vector vector long long long short vector vector vector vector vector vector vector short short short short short short short short long vector vector vector short short short short short short short load load store store limm limm limm alux store store alux load load load load, alux load, load, alu, branch RISC86® Operations LLDT mem16 LMSW mreg16 LMSW mem16 LODSB mem8 LODSW mem16 LODSD EAX, mem32 LOOP disp8 LOOPE/LOOPZ disp8 LOOPNE/LOOPNZ disp8 reg16/32, mreg16/32 reg16/32, mem16/32 reg16/32, mem32/48 mreg16 mem16 mreg8, reg8 mem8, reg8 mreg16/32, reg16/32 mem16/32, reg16/32 reg8, mreg8 reg8, mem8 reg16/32, mreg16/32 reg16/32, mem16/32 mreg16, segment mem16, segment segment reg, mreg16 segment reg, mem16 mem8 EAX, mem16/32 mem8, mem16/32, imm8 imm8 imm8 Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-011-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx Second Byte ModR/M Byte Decode Type short short short short short short short short short short short short short short long short long long long long short short short short short short short short vector vector vector vector short alux RISC86® Operations limm limm limm limm limm limm limm limm limm limm limm limm limm limm store limm store load, store, alux, alux load, store, alu, load, store, alu, load, load, load, load, imm8 imm8 imm8 imm8 imm8 EAX, imm16/32 ECX, imm16/32 EDX, imm16/32 EBX, 
imm16/32 ESP, imm16/32 EBP, imm16/32 ESI, imm16/32 EDI, imm16/32 mreg8, imm8 mem8, imm8 reg16/32, imm16/32 mem16/32, imm16/32 MOVSB mem8,mem8 MOVSD mem16, mem16 MOVSW mem32, mem32 MOVSX reg16/32, mreg8 MOVSX reg16/32, mem8 MOVSX reg32, mreg16 MOVSX reg32, mem16 MOVZX reg16/32, mreg8 MOVZX reg16/32, mem8 MOVZX reg32, mreg16 MOVZX reg32, mem16 mreg8 mem8 EAX, mreg16/32 EAX, mem16/32 mreg8 Chapter Instruction Dispatch AMD-K6® Processor Code Optimization 21924D/0-January 2000 Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx Second Byte ModR/M Byte mm-011-xxx 11-011-xxx mm-011-xxx Decode Type vector short vector short short vector short vector short long short long short short short short short short short long short long short long vector vector vector vector vector vector vector vector vector alux load, alux, store load, alu, store alux load, alux load, alux alux load, alux, store load, alu, store alux load, alux, store limm alux RISC86® Operations mem8 mreg16/32 mem16/32 (XCHG mreg8 mem8 mreg16/32 mem16/32 mreg8, reg8 mem8, reg8 mreg16/32, reg16/32 mem16/32, reg16/32 reg8, mreg8 reg8, mem8 reg16/32, mreg16/32 reg16/32, mem16/32 imm8 EAX, imm16/32 mreg8, imm8 mem8, imm8 mreg16/32, imm16/32 mem16/32, imm16/32 mreg16/32, imm8 (signed ext.) mem16/32, imm8 (signed ext.) 
imm8, imm8, imm8, Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-110-xxx mm-110-xxx 11-000-xxx mm-000-xxx Second Byte ModR/M Byte Decode Type vector vector short short short short short short short short short long vector vector long vector vector vector vector long short short short short short short short short long long vector long vector load, store load, store store store store store store store store store store store load, store load, load, load, load, load, load, load, load, load, load, store, RISC86® Operations mreg 16/32 16/32 POPA/POPAD POPF/POPFD PUSH PUSH PUSH PUSH PUSH PUSH PUSH PUSH PUSH PUSH PUSH PUSH PUSH PUSH PUSH imm8 PUSH imm16/32 PUSH mreg16/32 PUSH mem16/32 PUSHA/PUSHAD Chapter Instruction Dispatch AMD-K6® Processor Code Optimization 21924D/0-January 2000 Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-010-xxx mm-010-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx Second Byte ModR/M Byte Decode Type vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector RISC86® Operations PUSHF/PUSHFD mreg8, imm8 mem8, imm8 mreg16/32, imm8 mem16/32, imm8 mreg8, mem8, mreg16/32, mem16/32, mreg8, mem8, mreg16/32, mem16/32, mreg8, imm8 mem8, imm8 mreg16/32, imm8 mem16/32, imm8 mreg8, mem8, mreg16/32, mem16/32, mreg8, mem8, mreg16/32, mem16/32, near imm16 near imm16 mreg8, imm8 mem8, imm8 mreg16/32, imm8 mem16/32, imm8 Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table 
Integer Instructions (continued) Instruction Mnemonic First Byte 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx 11-111-xxx mm-111-xxx Second Byte ModR/M Byte 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx 11-001-xxx mm-001-xxx Decode Type vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector short vector short vector short vector short vector short vector short vector alux alux alux RISC86® Operations mreg8, mem8, mreg16/32, mem16/32, mreg8, mem8, mreg16/32, mem16/32, mreg8, imm8 mem8, imm8 mreg16/32, imm8 mem16/32, imm8 mreg8, mem8, mreg16/32, mem16/32, mreg8, mem8, mreg16/32, mem16/32, SAHF mreg8, imm8 mem8, imm8 mreg16/32, imm8 mem16/32, imm8 mreg8, mem8, mreg16/32, mem16/32, mreg8, mem8, mreg16/32, mem16/32, Chapter Instruction Dispatch AMD-K6® Processor Code Optimization 21924D/0-January 2000 Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx 11-011-xxx mm-011-xxx Second Byte ModR/M Byte 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx Decode Type vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector RISC86® Operations mreg8, reg8 mem8, reg8 mreg16/32, reg16/32 mem16/32, reg16/32 reg8, mreg8 reg8, mem8 reg16/32, mreg16/32 reg16/32, mem16/32 imm8 EAX, imm16/32 mreg8, imm8 mem8, imm8 mreg16/32, imm16/32 mem16/32, imm16/32 mreg8, imm8 
(signed ext.) mem8, imm8 (signed ext.) SCASB mem8 SCASW mem16 SCASD EAX, mem32 SETO mreg8 SETO mem8 SETNO mreg8 SETNO mem8 SETB/SETNAE mreg8 SETB/SETNAE mem8 SETNB/SETAE mreg8 SETNB/SETAE mem8 SETZ/SETE mreg8 SETZ/SETE mem8 SETNZ/SETNE mreg8 SETNZ/SETNE mem8 SETBE/SETNA mreg8 SETBE/SETNA mem8 Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table Integer Instructions (continued) Instruction Mnemonic First Byte Second Byte ModR/M Byte 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx mm-000-xxx mm-001-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-100-xxx mm-100-xxx 11-101-xxx Decode Type vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector short vector short vector short vector short vector short vector short vector short alux alux alux alux RISC86® Operations SETNBE/SETA mreg8 SETNBE/SETA mem8 SETS mreg8 SETS mem8 SETNS mreg8 SETNS mem8 SETP/SETPE mreg8 SETP/SETPE mem8 SETNP/SETPO mreg8 SETNP/SETPO mem8 SETL/SETNGE mreg8 SETL/SETNGE mem8 SETNL/SETGE mreg8 SETNL/SETGE mem8 SETLE/SETNG mreg8 SETLE/SETNG mem8 SETNLE/SETG mreg8 SETNLE/SETG mem8 SGDT mem48 SIDT mem48 SHL/SAL mreg8, imm8 SHL/SAL mem8, imm8 SHL/SAL mreg16/32, imm8 SHL/SAL mem16/32, imm8 SHL/SAL mreg8, SHL/SAL mem8, SHL/SAL mreg16/32, SHL/SAL mem16/32, SHL/SAL mreg8, SHL/SAL mem8, SHL/SAL mreg16/32, SHL/SAL mem16/32, mreg8, imm8 Chapter Instruction Dispatch AMD-K6® Processor Code Optimization 21924D/0-January 2000 Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-001-xxx mm-001-xxx 11-xxx-xxx mm-xxx-xxx Second Byte ModR/M Byte mm-101-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 
11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-000-xxx mm-000-xxx 11-100-xxx mm-100-xxx Decode Type vector short vector short vector short vector short vector short vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector vector long long long vector vector short long alux load, alux, store store, alux store, store, alux alux RISC86® Operations mem8, imm8 mreg16/32, imm8 mem16/32, imm8 mreg8, mem8, mreg16/32, mem16/32, mreg8, mem8, mreg16/32, mem16/32, SHLD mreg16/32, reg16/32, imm8 SHLD mem16/32, reg16/32, imm8 SHLD mreg16/32, reg16/32, SHLD mem16/32, reg16/32, SHRD mreg16/32, reg16/32, imm8 SHRD mem16/32, reg16/32, imm8 SHRD mreg16/32, reg16/32, SHRD mem16/32, reg16/32, SLDT mreg16 SLDT mem16 SMSW mreg16 SMSW mem16 STOSB mem8, STOSW mem16, STOSD mem32, mreg16 mem16 mreg8, reg8 mem8, reg8 Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-100-xxx mm-100-xxx 11-101-xxx mm-101-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx 11-101-xxx mm-101-xxx Second Byte ModR/M Byte 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx Decode Type short long short short short short short short short long short long short long vector vector short vector short vector long long long long long long vector vector vector vector vector alux alux load, alux load, alux load, alu, store alux load, alux load, alux alux load, alux, store load, alu, store alux load, alux, store RISC86® Operations mreg16/32, reg16/32 mem16/32, reg16/32 reg8, mreg8 reg8, mem8 reg16/32, mreg16/32 reg16/32, mem16/32 imm8 EAX, imm16/32 mreg8, imm8 mem8, imm8 mreg16/32, imm16/32 mem16/32, imm16/32 mreg16/32, imm8 (signed ext.) mem16/32, imm8 (signed ext.) 
SYSCALL (only supported AMD-K6-2 AMD-K6-III processors) SYSRET (only supported AMD-K6-2 AMD-K6-III processors) TEST mreg8, reg8 TEST mem8, reg8 TEST mreg16/32, reg16/32 TEST mem16/32, reg16/32 TEST imm8 TEST EAX, imm16/32 TEST mreg8, imm8 TEST mem8, imm8 TEST mreg16/32, imm16/32 TEST mem16/32, imm16/32 VERR mreg16 VERR mem16 VERW mreg16 VERW mem16 WAIT Chapter Instruction Dispatch AMD-K6® Processor Code Optimization 21924D/0-January 2000 Table Integer Instructions (continued) Instruction Mnemonic First Byte 11-110-xxx mm-110-xxx 11-110-xxx mm-110-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx Second Byte 11-100-xxx mm-100-xxx 11-101-xxx mm-101-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx ModR/M Byte Decode Type vector vector vector vector vector vector vector vector vector short long long long long long long long vector short long short long short short short short short short short long short long alux load, alux, store load, alu, store alux load, alux load, alux alux load, alux, store load, alu, store limm alu, alu, alu, alu, alu, alu, alu, alu, alu, alu, alu, alu, alu, alu, RISC86® Operations WBINVD XADD mreg8, reg8 XADD mem8, reg8 XADD mreg16/32, reg16/32 XADD mem16/32, reg16/32 XCHG reg8, mreg8 XCHG reg8, mem8 XCHG reg16/32, mreg16/32 XCHG reg16/32, mem16/32 XCHG EAX, XCHG EAX, XCHG EAX, XCHG EAX, XCHG EAX, XCHG EAX, XCHG EAX, XCHG EAX, XLAT mreg8, reg8 mem8, reg8 mreg16/32, reg16/32 mem16/32, reg16/32 reg8, mreg8 reg8, mem8 reg16/32, mreg16/32 reg16/32, mem16/32 imm8 EAX, imm16/32 mreg8, imm8 mem8, imm8 mreg16/32, imm16/32 mem16/32, imm16/32 Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table Integer Instructions (continued) Instruction Mnemonic First Byte Second Byte ModR/M Byte 11-110-xxx mm-110-xxx Decode Type short long alux load, alux, store RISC86® Operations mreg16/32, imm8 (signed ext.) mem16/32, imm8 (signed ext.) 
Table MMXInstructions Prefix First Byte(s) Byte 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx ModR/M Byte Decode Type vector short short short short short short short short short short short short short short short short short short short short short short short short mload mstore, load mstore mload mstore mload, mload, mload, mload, mload, mload, mload, mload, RISC86® Operations Note Instruction Mnemonic EMMS MOVD mmreg, mreg32 MOVD mmreg, mem32 MOVD mreg32, mmreg MOVD mem32, mmreg MOVQ mmreg1, mmreg2 MOVQ mmreg, mem64 MOVQ mmreg2, mmreg1 MOVQ mem64, mmreg PACKSSDW mmreg1, mmreg2 PACKSSDW mmreg, mem64 PACKSSWB mmreg1, mmreg2 PACKSSWB mmreg, mem64 PACKUSWB mmreg1, mmreg2 PACKUSWB mmreg, mem64 PADDB mmreg1, mmreg2 PADDB mmreg, mem64 PADDD mmreg1, mmreg2 PADDD mmreg, mem64 PADDSB mmreg1, mmreg2 PADDSB mmreg, mem64 PADDSW mmreg1, mmreg2 PADDSW mmreg, mem64 PADDUSB mmreg1, mmreg2 PADDUSB mmreg, mem64 Notes: Bits modR/M byte select integer register. 
Chapter Instruction Dispatch AMD-K6® Processor Code Optimization 21924D/0-January 2000 Table MMXInstructions (continued) Prefix First Byte(s) Byte ModR/M Byte 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-110-xxx Decode Type short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short RISC86® Operations mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, Note Instruction Mnemonic PADDUSW mmreg1, mmreg2 PADDUSW mmreg, mem64 PADDW mmreg1, mmreg2 PADDW mmreg, mem64 PAND mmreg1, mmreg2 PAND mmreg, mem64 PANDN mmreg1, mmreg2 PANDN mmreg, mem64 PCMPEQB mmreg1, mmreg2 PCMPEQB mmreg, mem64 PCMPEQD mmreg1, mmreg2 PCMPEQD mmreg, mem64 PCMPEQW mmreg1, mmreg2 PCMPEQW mmreg, mem64 PCMPGTB mmreg1, mmreg2 PCMPGTB mmreg, mem64 PCMPGTD mmreg1, mmreg2 PCMPGTD mmreg, mem64 PCMPGTW mmreg1, mmreg2 PCMPGTW mmreg, mem64 PMADDWD mmreg1, mmreg2 PMADDWD mmreg, mem64 PMULHW mmreg1, mmreg2 PMULHW mmreg, mem64 PMULLW mmreg1, mmreg2 PMULLW mmreg, mem64 mmreg1, mmreg2 mmreg, mem64 PSLLW mmreg1, mmreg2 PSLLW mmreg, mem64 PSLLW mmreg, imm8 Notes: Bits modR/M byte select integer register. 
Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table MMXInstructions (continued) Prefix First Byte(s) Byte ModR/M Byte 11-xxx-xxx mm-xxx-xxx 11-110-xxx 11-xxx-xxx mm-xxx-xxx 11-110-xxx 11-xxx-xxx mm-xxx-xxx 11-100-xxx 11-xxx-xxx mm-xxx-xxx 11-100-xxx 11-xxx-xxx mm-xxx-xxx 11-010-xxx 11-xxx-xxx mm-xxx-xxx 11-010-xxx 11-xxx-xxx mm-xxx-xxx 11-010-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx Decode Type short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short short RISC86® Operations mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, Note Instruction Mnemonic PSLLD mmreg1, mmreg2 PSLLD mmreg, mem64 PSLLD mmreg, imm8 PSLLQ mmreg1, mmreg2 PSLLQ mmreg, mem64 PSLLQ mmreg, imm8 PSRAW mmreg1, mmreg2 PSRAW mmreg, mem64 PSRAW mmreg, imm8 PSRAD mmreg1, mmreg2 PSRAD mmreg, mem64 PSRAD mmreg, imm8 PSRLW mmreg1, mmreg2 PSRLW mmreg, mem64 PSRLW mmreg, imm8 PSRLD mmreg1, mmreg2 PSRLD mmreg, mem64 PSRLD mmreg, imm8 PSRLQ mmreg1, mmreg2 PSRLQ mmreg, mem64 PSRLQ mmreg, imm8 PSUBB mmreg1, mmreg2 PSUBB mmreg, mem64 PSUBD mmreg1, mmreg2 PSUBD mmreg, mem64 PSUBSB mmreg1, mmreg2 PSUBSB mmreg, mem64 PSUBSW mmreg1, mmreg2 PSUBSW mmreg, mem64 PSUBUSB mmreg1, mmreg2 PSUBUSB mmreg, mem64 Notes: Bits modR/M byte select integer register. 
Chapter Instruction Dispatch AMD-K6® Processor Code Optimization 21924D/0-January 2000 Table MMXInstructions (continued) Prefix First Byte(s) Byte ModR/M Byte 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx Decode Type short short short short short short short short short short short short short short short short short short RISC86® Operations mload, mload, mload, mload, mload, mload, mload, mload, mload, Note Instruction Mnemonic PSUBUSW mmreg1, mmreg2 PSUBUSW mmreg, mem64 PSUBW mmreg1, mmreg2 PSUBW mmreg, mem64 PUNPCKHBW mmreg1, mmreg2 PUNPCKHBW mmreg, mem64 PUNPCKHWD mmreg1, mmreg2 PUNPCKHWD mmreg, mem64 PUNPCKHDQ mmreg1, mmreg2 PUNPCKHDQ mmreg, mem64 PUNPCKLBW mmreg1, mmreg2 PUNPCKLBW mmreg, mem32 PUNPCKLWD mmreg1, mmreg2 PUNPCKLWD mmreg, mem32 PUNPCKLDQ mmreg1, mmreg2 PUNPCKLDQ mmreg, mem32 PXOR mmreg1, mmreg2 PXOR mmreg, mem64 Notes: Bits modR/M byte select integer register. 
Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table Floating-Point Instructions First Byte 11-110-xxx 11-110-xxx 11-110-xxx 11-111-xxx 11-111-xxx 11-111-xxx mm-110-xxx mm-110-xxx 11-111-xxx 11-110-xxx 11-111-xxx 11-010-xxx mm-010-xxx mm-010-xxx 11-011-xxx mm-011-xxx mm-011-xxx 11-011-001 Second Byte 11-000-xxx mm-000-xxx 11-000-xxx mm-000-xxx 11-000-xxx mm-100-xxx mm-110-xxx ModR/M Byte Decode Type short short short short short short short vector vector short vector short short short short short short short short short short short short short short short short short short short short float fload, float fload, float float fload, float fload, float float float float float float float float float float fload, float fload, float float float float float RISC86® Operations float float float fload, float float fload, float float Note Instruction Mnemonic F2XM1 FABS FADD ST(0), ST(i) FADD ST(0), mem32real FADD ST(i), ST(0) FADD ST(0), mem64real FADDP ST(i), ST(0) FBLD FBSTP FCHS FCLEX FCOM ST(0), ST(i) FCOM ST(0), mem32real FCOM ST(0), mem64real FCOMP ST(0), ST(i) FCOMP ST(0), mem32real FCOMP ST(0), mem64real FCOMPP FCOS FDECSTP FDIV ST(0), ST(i) (single precision) FDIV ST(0), ST(i) (double precision) FDIV ST(0), ST(i) (extended precision) FDIV ST(i), ST(0) (single precision) FDIV ST(i), ST(0) (double precision) FDIV ST(i), ST(0) (extended precision) FDIV ST(0), mem32real FDIV ST(0), mem64real FDIVP ST(0), ST(i) FDIVR ST(0), ST(i) FDIVR ST(i), ST(0) Notes: last three bits modR/M byte select stack entry ST(i). 
Chapter Instruction Dispatch AMD-K6® Processor Code Optimization 21924D/0-January 2000 Table Floating-Point Instructions (continued) First Byte mm-010-xxx mm-010-xxx mm-011-xxx mm-011-xxx mm-111-xxx mm-100-xxx mm-100-xxx mm-101-xxx mm-101-xxx 11-000-xxx Second Byte ModR/M Byte mm-111-xxx mm-111-xxx 11-110-xxx 11-000-xxx mm-000-xxx mm-000-xxx mm-010-xxx mm-010-xxx mm-011-xxx mm-011-xxx mm-110-xxx mm-110-xxx mm-111-xxx mm-111-xxx mm-000-xxx mm-000-xxx mm-101-xxx mm-001-xxx mm-001-xxx Decode Type short short short short short short short short short short short short short short short short short short short short vector short short short short short short short short short short fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float RISC86® Operations fload, float fload, float float float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float fload, float float Note Instruction Mnemonic FDIVR ST(0), mem32real FDIVR ST(0), mem64real FDIVRP ST(i), ST(0) FFREE ST(i) FIADD ST(0), mem32int FIADD ST(0), mem16int FICOM ST(0), mem32int FICOM ST(0), mem16int FICOMP ST(0), mem32int FICOMP ST(0), mem16int FIDIV ST(0), mem32int FIDIV ST(0), mem16int FIDIVR ST(0), mem32int FIDIVR ST(0), mem16int FILD mem16int FILD mem32int FILD mem64int FIMUL ST(0), mem32int FIMUL ST(0), mem16int FINCSTP FINIT FIST mem16int FIST mem32int FISTP mem16int FISTP mem32int FISTP mem64int FISUB ST(0), mem32int FISUB ST(0), mem16int FISUBR ST(0), mem32int FISUBR ST(0), mem16int ST(i) Notes: last three bits modR/M byte select stack entry ST(i). 
Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table Floating-Point Instructions (continued) First Byte mm-100-xxx mm-110-xxx 11-001-xxx 11-001-xxx mm-001-xxx mm-001-xxx 11-001-xxx mm-101-xxx mm-100-xxx Second Byte ModR/M Byte mm-000-xxx mm-000-xxx mm-101-xxx Decode Type short short vector short vector short short short short short short short short short short short short short short short short vector short vector vector short short vector short short short float float float float float float fload, float float float float float float float float float fload, float fload, float float float float float float fload, float RISC86® Operations fload, float fload, float Note Instruction Mnemonic mem32real mem64real mem80real FLD1 FLDCW FLDENV FLDL2E FLDL2T FLDLG2 FLDLN2 FLDPI FLDZ FMUL ST(0), ST(i) FMUL ST(i), ST(0) FMUL ST(0), mem32real FMUL ST(0), mem64real FMULP ST(0), ST(i) FNOP FPATAN FPREM FPREM1 FPTAN FRNDINT FRSTOR FSAVE FSCALE FSIN FSINCOS FSQRT (single precision) FSQRT (double precision) FSQRT (extended precision) Notes: last three bits modR/M byte select stack entry ST(i). 
Chapter Instruction Dispatch AMD-K6® Processor Code Optimization 21924D/0-January 2000 Table Floating-Point Instructions (continued) First Byte 11-001-xxx 11-100-xxx 11-101-xxx mm-111-xxx mm-100-xxx mm-100-xxx 11-100-xxx 11-101-xxx 11-101-xxx mm-101-xxx mm-101-xxx 11-100-xxx 11-101-xxx 11-100-xxx Second Byte ModR/M Byte mm-010-xxx mm-010-xxx 11-010-xxx mm-111-xxx mm-110-xxx mm-011-xxx mm-011-xxx mm-111-xxx 11-011-xxx Decode Type short short short vector vector short short vector short vector vector short short short short short short short short short short short short short short short short vector short short vector float float fload, float fload, float float float float fload, float fload, float float float float float float float float float float float fstore fstore RISC86® Operations fstore fstore fstore Note Instruction Mnemonic mem32real mem64real ST(i) FSTCW FSTENV FSTP mem32real FSTP mem64real FSTP mem80real FSTP ST(i) FSTSW FSTSW mem16 FSUB ST(0), mem32real FSUB ST(0), mem64real FSUB ST(0), ST(i) FSUB ST(i), ST(0) FSUBP ST(0), ST(i) FSUBR ST(0), mem32real FSUBR ST(0), mem64real FSUBR ST(0), ST(i) FSUBR ST(i), ST(0) FSUBRP ST(i), ST(0) FTST FUCOM FUCOMP FUCOMPP FXAM FXCH FXTRACT FYL2X FYL2XP1 FWAIT Notes: last three bits modR/M byte select stack entry ST(i). 
Instruction Dispatch Chapter 21924D/0-January 2000 AMD-K6® Processor Code Optimization Table 3DNow!Instructions Prefix Opcode Byte(s) Byte 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 0Fh, 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx 11-xxx-xxx mm-xxx-xxx ModR/M Byte Decode Type vector short short short short short short short short short short short short short short short short short short short short short short short short short short short short mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, mload, RISC86® Operations Note Instruction Mnemonic FEMMS PAVGUSB mmreg1, mmreg2 PAVGUSB mmreg, mem64 PFADD mmreg1, mmreg2 PFADD mmreg, mem64 PFSUB mmreg1, mmreg2 PFSUB mmreg, mem64 PFSUBR mmreg1, mmreg2 PFSUBR mmreg, mem64 PFACC mmreg1, mmreg2 PFACC mmreg, mem64 PFMUL mmreg1, mmreg2 PFMUL mmreg, mem64 PFCMPGE mmreg1, mmreg2 PFCMPGE mmreg, mem64 PFCMPGT mmreg1, mmreg2 PFCMPGT mmreg, mem64 PFCMPEQ mmreg1, mmreg2 PFCMPEQ mmreg, mem64 PFMIN mmreg1, mmreg2 PFMIN mmreg, mem64 PFMAX mmreg1, mmreg2 PFMAX mmreg, mem64 PI2FD mmreg1, mmreg2 PI2FD mmreg, mem64 PF2ID mmreg1, mmreg2 PF2ID mmreg, mem64 PFRCP mmreg1, mmreg2 PFRCP mmreg, mem64 Notes: more information about FEMMS instruction, "AMD-K6®-2 AMD-K6®-III Processors Multimedia Coding Optimizations" page PREFETCH PREFETCHW, mem8 value refers address 32-byte line that will prefetched. PREFETCHW will implemented future K86processor. AMD-K6-2 processor, this instruction performs same manner PREFETCH instruction. 
Table: 3DNow! Instructions (continued)

Instruction Mnemonic          ModR/M Byte   Decode Type   RISC86® Operations
PFRSQRT mmreg1, mmreg2        11-xxx-xxx    short
PFRSQRT mmreg, mem64          mm-xxx-xxx    short         mload,
PFRCPIT1 mmreg1, mmreg2       11-xxx-xxx    short
PFRCPIT1 mmreg, mem64         mm-xxx-xxx    short         mload,
PFRSQIT1 mmreg1, mmreg2       11-xxx-xxx    short
PFRSQIT1 mmreg, mem64         mm-xxx-xxx    short         mload,
PFRCPIT2 mmreg1, mmreg2       11-xxx-xxx    short
PFRCPIT2 mmreg, mem64         mm-xxx-xxx    short         mload,
PMULHRW mmreg1, mmreg2        11-xxx-xxx    short
PMULHRW mmreg1, mem64         mm-xxx-xxx    short         mload,
PREFETCH mem8                 mm-000-xxx    vector        load
PREFETCHW mem8                mm-001-xxx    vector        load

Notes:
For more information about the FEMMS instruction, see "AMD-K6®-2 and AMD-K6®-III Processors Multimedia Coding Optimizations".
For PREFETCH and PREFETCHW, the mem8 value refers to an address in the 32-byte line that will be prefetched.
PREFETCHW will be implemented in a future K86 processor. On the AMD-K6-2 processor, this instruction performs in the same manner as the PREFETCH instruction.

Optimization Coding Guidelines

General Optimization Techniques

This section describes general code optimization techniques specific to superscalar processors (that is, techniques common to the AMD-K6 family processors, the AMD-K5 processor, and the Pentium family processors). In general, all optimization techniques used for the AMD-K5 processor, Pentium, and Pentium Pro processors either improve the performance of the AMD-K6 family or are not required and have a neutral effect (usually due to fewer coding restrictions with the AMD-K6 family).

Short Forms - Use the shorter forms of instructions to increase the effective number of instructions that can be examined for decoding at any one time. Use 8-bit displacements and jump offsets where possible.

Simple Instructions - Use simple instructions with hardwired decode (pairable, short, or fast) because they perform more efficiently.
This includes the "register = register op memory" as well as the "register = register op register" forms of instructions.

Dependencies - Spread out true dependencies to increase the opportunities for parallel execution. Anti-dependencies and output dependencies do not impact performance.

Memory Operands - Instructions that operate on data in memory (load/operation/store) can inhibit parallelism. Using separate move and operation instructions allows better code scheduling for independent operations. However, if there are few opportunities for parallel execution, use the load/operation/store forms to reduce the number of register spills (storing values in memory to free registers for other uses).

Register Operands - Maintain frequently used values in registers rather than in memory.

Stack References - Use ESP for stack references so that EBP remains available.

Stack Allocation - When allocating space for local variables and/or outgoing parameters within a procedure, adjust the stack pointer and use moves rather than pushes. This method of allocation allows random access to the outgoing parameters, so that they can be set up when they are calculated instead of being held somewhere else until the procedure call. This method also reduces ESP dependencies and uses fewer execution resources.

Data Embedding - When data is embedded in the code segment, align it in separate cache blocks from nearby code. This technique avoids some overhead when maintaining coherency between the instruction and data caches.

Loops - Unroll loops to get more parallelism and reduce loop overhead, even with branch prediction. Inline small routines to avoid procedure-call overhead. For both techniques, however, consider the cost of possible increased register usage, which might add load/store instructions for register spilling. Unrolling large code loops can result in inefficient use of the instruction caches.

Code Alignment - Aligning subroutines at 0-mod-16 (or ideally, 0-mod-32) address boundaries optimizes instruction cache-fill efficiency. Keeping the starting point of loops away from the end of 32-byte cache lines optimizes branch-target instruction fetch and decode efficiency.
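The alignment guideline above comes down to simple address arithmetic. The following C sketch is illustrative only: the helper names `padding_to_boundary` and `bytes_left_in_line` are hypothetical and do not come from this application note. It shows how many filler bytes are needed to reach a 0-mod-16 or 0-mod-32 boundary, and how close a given address is to the end of its 32-byte cache line.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: bytes of filler needed to move addr up to the
   next multiple of boundary. boundary must be a power of two
   (16 or 32 for the alignment guideline above). */
static size_t padding_to_boundary(uintptr_t addr, size_t boundary)
{
    size_t offset = (size_t)(addr & (uintptr_t)(boundary - 1));
    return (boundary - offset) & (boundary - 1);
}

/* Hypothetical helper: bytes remaining in the 32-byte cache line that
   contains addr, counting the byte at addr itself. A small result
   means the address sits close to the end of its line. */
static size_t bytes_left_in_line(uintptr_t addr)
{
    return (size_t)32 - (size_t)(addr & (uintptr_t)31);
}
```

A subroutine that would otherwise start at address 0x1004 needs `padding_to_boundary(0x1004, 16)` bytes of filler to become 0-mod-16. In practice the assembler's alignment directive does this work; the sketch only makes the arithmetic explicit.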
General AMD-K6® Family Coding Optimizations

This section describes general code optimization techniques specific to the AMD-K6 family processors.

Use short-decodable instructions - To increase decode bandwidth and minimize the number of RISC86 operations per instruction, use short-decodable instructions. The decode type of each instruction is listed in the instruction tables of the Instruction Dispatch chapter.

Pair short-decodable instructions - Two short-decodable instructions can be decoded per clock, using the full decode bandwidth of the AMD-K6 family.

Note: On the AMD-K6-2 and AMD-K6-III processors, all MMX and 3DNow! instructions are short-decodable except the EMMS, FEMMS, and PREFETCH instructions.

Avoid using complex instructions - The more complex and uncommon instructions are vector decoded and can generate a larger ratio of RISC86 operations per instruction compared with short-decodable or long-decodable instructions.

Avoid multiple and accumulated prefixes - In order to accomplish an instruction decode, the decoders require sufficient predecode information. When an instruction has multiple prefixes and this cannot be deduced by the decoders (due to a lack of data in the instruction decode buffer), the first decoder retires and accumulates one prefix per cycle until the instruction is completely decoded. The following table shows when prefixes are accumulated and decoding is serialized.

Table: Decode Accumulation and Serialization

Decoder 1     Decoder 2     Results
Instruction   (none)        Single instruction decoded.
Instruction   Instruction   Dual instruction decode.
Instruction   Prefix        Single instruction decode and prefix is accumulated.
Prefix        Instruction   Instruction prefix accumulation and a single instruction (modified by Prefix) is decoded.
PrefixA       PrefixB       Accumulate PrefixA and cancel the decode of the second prefix.
PrefixB       Instruction   If a prefix has already been accumulated in the previous decode cycle, accumulate PrefixB and cancel the instruction decode; wait for the next decode cycle to decode the instruction.

Note: Usage of the 0Fh prefix does not count as a prefix in the decoder accumulation rules (that is, it does not cause accumulation).
Avoid long instruction length - Use instructions that are less than eight bytes in length. An instruction that is longer than seven bytes cannot be short-decoded.

Use read-modify-write instructions over discrete equivalents - No advantage is gained by splitting read-modify-write instructions into a load-execute-store instruction group. Both read-modify-write instructions and load-execute-store instruction groups decode and execute in one cycle, but read-modify-write instructions promote better code density.

Move rarely used code and data to separate pages - Placing code, such as exception handlers, in separate pages and data, such as error text messages, in separate pages maximizes the use of the TLBs and prevents table pollution with rarely used items.

Avoid mixing code size types - Size prefixes that affect the length of an instruction can sometimes inhibit dual decoding.

Always pair CALL and RETURN - If CALLs and RETs are not paired, the return address stack gets out of synchronization, increasing the latency of returns and decreasing performance.

Exploit parallel execution of integer and floating-point multiplies - The AMD-K6 processor allows simultaneous integer and floating-point multiplies using low-latency multipliers.

Avoid deeply nested subroutines - Nesting subroutine calls more deeply than the return address stack can hold overflows the return address stack, leading to lower performance. While this is not a problem for most code, recursive subroutines might easily exceed that depth. If the recursive subroutine is tail recursive, it can usually be mechanically transformed into an iterative version, which leads to increased performance.

Place frequently used stack data within 128 bytes of EBP - The statically most-referenced data items in a function's stack frame should be located from -128 to +127 bytes from EBP. This technique improves code density by enabling the use of an 8-bit sign-extended displacement instead of a 32-bit displacement.

Avoid superset dependencies - Using the larger form of a register immediately after an instruction uses the smaller form creates a superset dependency and prevents parallel execution.
For example, avoid following an instruction that writes AH (operands AH,07h) with one that immediately uses EAX (operands EAX,1555555h). One method of avoiding superset dependencies is to schedule the instruction with the superset dependency (in the example, the second instruction) some instructions later than would normally be preferable. Another method, useful in some cases, is to use the MOVZX instruction to efficiently convert a byte-size value to a doubleword-size value, which can then be combined with other values in 32-bit operations.

Avoid excessive loop unrolling or code inlining - Excessive loop unrolling or code inlining increases code size and reduces locality, which leads to lower cache hit rates and reduced performance.

Avoid splitting a 16-bit memory access in 32-bit code - Although splitting avoids the operand size override, no advantage is gained by splitting a 16-bit memory access in 32-bit code into two byte-sized accesses.

Avoid data dependent branches around a single instruction - Data dependent branches acting upon basically random data cause the branch prediction logic to mispredict the branch about half of the time. Design branch-free alternative code sequences. The effect is shorter average execution time. The following examples illustrate this concept:

Signed integer ABS function (y = labs(x))
    MOV  ECX, [x]     ;load value
    MOV  EBX, ECX     ;save value
    SAR  ECX, 31      ;ECX = 0xFFFFFFFF if x is negative, else 0
    XOR  EBX, ECX     ;1's complement if x is negative, else don't modify
    SUB  EBX, ECX     ;2's complement if x is negative, else don't modify
    MOV  [y], EBX     ;save labs result

Unsigned integer min function (z = min(x, y))
    MOV  EAX, [x]     ;load x value
    MOV  EBX, [y]     ;load y value
    SUB  EAX, EBX     ;x-y; sets the carry flag if y is greater than x
    SBB  ECX, ECX     ;get borrow out from previous SUB (0xFFFFFFFF or 0)
    AND  ECX, EAX     ;x-y if y is greater than x, else 0
    ADD  ECX, EBX     ;x-y+y = x if y is greater than x, else y
    MOV  [z], ECX     ;save min(x,y)

Hexadecimal to ASCII conversion (y = x is less than 10 ? x + 0x30 : x + 0x41 - 10)
    MOV  AL, [x]      ;load x value
    CMP  AL, 10       ;if x is less than 10, set carry flag
    SBB  AL, 69h      ;0..9 become 96h..9Fh, Ah..Fh become A1h..A6h
    DAS               ;0..9: subtract 66h, Ah..Fh: subtract 60h
    MOV  [y], AL      ;save conversion in y

Avoid using the [ESI] addressing mode - This addressing mode forces instructions using it to become vector decoded. There are two ways to avoid this problem. The first is to use another register. The second is to alter the addressing mode by explicitly coding [ESI+0].
Assemblers may optimize this back to [ESI] by removing the zero displacement.

AMD-K6® Family Integer Coding Optimizations

This section describes integer code optimization techniques specific to the AMD-K6 family processors.

Neutral code filler - Use the XCHG EAX, EAX instruction when aligning instructions. XCHG EAX, EAX consumes a decode slot but requires no execution resources. Essentially, the scheduler absorbs the equivalent RISC86 operation without requiring any of the execution units.

Inline REP string instructions with small counts - Expand REP string instructions into equivalent sequences of simple instructions. This technique eliminates the setup overhead of these instructions and increases instruction throughput.

Use ADD reg, reg instead of SHL reg, 1 - This optimization technique allows the scheduler to use either of the integer adders rather than the single shifter and effectively increases overall throughput. The only difference between these instructions is the setting of a flag (ADD defines the auxiliary carry flag, which a shift leaves undefined).

Use MOVZX and MOVSX - Use the MOVZX and MOVSX instructions to zero-extend and sign-extend byte-size and word-size operands to doubleword length. For example, typical code for zero extension creates a superset dependency when the zero-extended value is used, as in the following code:

    XOR   EAX, EAX
    MOV   AL, [mem]

Instead, use the following code:

    MOVZX EAX, BYTE PTR [mem]

Use load-execute integer instructions - Most load-execute integer instructions are short-decodable and can be decoded at the rate of two per cycle. Splitting a load-execute instruction into two separate instructions (a load instruction and a reg, reg instruction) reduces decoding bandwidth and increases register pressure. The split-instruction form can be used to avoid scheduler stalls for longer executing instructions and to explicitly schedule the load and execute operations.

Use the accumulator forms to improve code density - In many cases, instructions using the accumulator register can be encoded in one less byte than those using another general-purpose register. For example, an operation on the accumulator with the immediate 0x5555 should use the shorter accumulator-specific encoding.

Clear registers using MOV reg, 0 instead of XOR reg, reg - Executing XOR reg, reg requires additional overhead due to register dependency checking and flag generation.
Using MOV reg, 0 produces a limm (load immediate) RISC86 operation that is completed when placed in the scheduler and does not consume execution resources.

Use 8-bit sign-extended immediates - Using 8-bit sign-extended immediates improves code density with no negative effects on the AMD-K6 processor. For example, an immediate that fits in a signed byte should be encoded in the 8-bit sign-extended form.

Use 8-bit sign-extended displacements for conditional branches - Using short, 8-bit sign-extended displacements for conditional branches improves code density with no negative effects on the AMD-K6 processor.

Use integer multiply over shift-add sequences when it is advantageous - The AMD-K6 processor features a low-latency integer multiplier. Therefore, almost any shift-add sequence can have higher latency than the IMUL instructions. An exception is the trivial case of multiplication by powers of two by means of left shifts. In general, replacements should be made when the shift-add sequence has the higher latency.

Carefully choose the best method for pushing memory data - To reduce register pressure and code dependency, use PUSH [mem] rather than MOV EAX, [mem] followed by PUSH EAX.

Balance the use of CWD, CBW, CDQ, and CWDE - These instructions require special attention to avoid either decreased decode or decreased execution bandwidth. The following code illustrates the possible trade-offs.

The following code replacement trades decode bandwidth (CWD is vector decoded, but with few RISC86 operations) for execution bandwidth (the replacement requires additional RISC86 operations, including a shift):

    Replace: CWD
    With:    MOV  DX, AX
             SAR  DX, 15

The following code replacement improves decode bandwidth (CBW is vector decoded while MOVSX is short decoded):

    Replace: CBW
    With:    MOVSX AX, AL

The following code replacement trades decode bandwidth (CDQ is vector decoded, but with few RISC86 operations) for execution bandwidth (the replacement requires additional RISC86 operations, including a shift):

    Replace: CDQ
    With:    MOV  EDX, EAX
             SAR  EDX, 31

The following code replacement improves decode bandwidth (CWDE is vector decoded while MOVSX is short decoded):

    Replace: CWDE
    With:    MOVSX EAX, AX

Replace integer division by constants with multiplication by the reciprocal - This optimization is commonly used on RISC processors.
Because the AMD-K6 processor has an extremely fast integer multiply (two cycles), while integer division delivers only a limited number of quotient bits per cycle, the equivalent multiply-based code is much faster. The following examples illustrate integer division by constants:

Unsigned division by 10 using multiplication by the reciprocal
    ;IN:  EAX = dividend
    ;OUT: EDX = quotient
    MOV  EDX, 0CCCCCCCDh   ;0.1 * 2^32 * 8, rounded up
    MUL  EDX
    SHR  EDX, 3            ;divide by 2^32 * 8

Unsigned division by 3 using multiplication by the reciprocal
    ;IN:  EAX = dividend
    ;OUT: EDX = quotient
    MOV  EDX, 0AAAAAAABh   ;1/3 * 2^32 * 2, rounded up
    MUL  EDX
    SHR  EDX, 1            ;divide by 2^32 * 2

Signed division by 2
    ;IN:  EAX = dividend
    ;OUT: EAX = quotient
    CMP  EAX, 80000000h    ;CY = 1 if dividend is non-negative
    SBB  EAX, -1           ;increment dividend if it is negative
    SAR  EAX, 1            ;perform right shift

Signed division by 2^n
    ;IN:  EAX = dividend
    ;OUT: EAX = quotient
    MOV  EDX, EAX          ;sign extend into EDX
    SAR  EDX, 31           ;EDX = 0xFFFFFFFF if dividend is negative
    AND  EDX, (2^n-1)      ;mask correction (use divisor - 1)
    ADD  EAX, EDX          ;apply correction if necessary
    SAR  EAX, (n)          ;perform right shift by log2(divisor)

Signed division by -2
    ;IN:  EAX = dividend
    ;OUT: EAX = quotient
    CMP  EAX, 80000000h    ;CY = 1 if dividend is non-negative
    SBB  EAX, -1           ;increment dividend if it is negative
    SAR  EAX, 1            ;perform right shift
    NEG  EAX               ;use (x/-2) == -(x/2)

Signed division by -(2^n)
    ;IN:  EAX = dividend
    ;OUT: EAX = quotient
    MOV  EDX, EAX          ;sign extend into EDX
    SAR  EDX, 31           ;EDX = 0xFFFFFFFF if dividend is negative
    AND  EDX, (2^n-1)      ;mask correction (-divisor - 1)
    ADD  EAX, EDX          ;apply correction if necessary
    SAR  EAX, (n)          ;right shift by log2(-divisor)
    NEG  EAX               ;use (x/-(2^n)) == (-(x/2^n))

Remainder of signed integer division by 2 or (-2)
    ;IN:  EAX = dividend
    ;OUT: EAX = remainder
    MOV  EDX, EAX          ;sign extend into EDX
    SAR  EDX, 31           ;EDX = 0xFFFFFFFF if dividend is negative
    AND  EAX, 1            ;compute remainder
    XOR  EAX, EDX          ;negate remainder if
    SUB  EAX, EDX          ;dividend was negative
    MOV  [quotient], EAX   ;store the result

Remainder of signed integer division by 2^n or (-(2^n))
    ;IN:  EAX = dividend
    ;OUT: EAX = remainder
    MOV  EDX, EAX          ;sign extend into EDX
    SAR  EDX, 31           ;EDX = 0xFFFFFFFF if dividend is negative
    AND  EDX, (2^n-1)      ;mask correction (abs(divisor) - 1)
    ADD  EAX, EDX          ;apply pre-correction
    AND  EAX, (2^n-1)      ;mask out remainder (abs(divisor) - 1)
    SUB  EAX, EDX          ;remove pre-correction if necessary
    MOV  [quotient], EAX   ;store the result
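The branch-free and division-by-constant sequences in the integer guidelines above can be cross-checked against ordinary C arithmetic. The sketch below is not taken from the application note, and all function names are hypothetical. It assumes a compiler that implements `>>` on negative signed values as an arithmetic shift, which the C standard leaves implementation-defined. A compiler is also free to emit branches for this code, so it documents the arithmetic rather than guaranteeing the instruction sequence.

```c
#include <stdint.h>

/* Branch-free absolute value: mask is all ones when x is negative.
   Unsigned arithmetic avoids signed overflow for INT32_MIN. */
static int32_t abs_branchfree(int32_t x)
{
    uint32_t mask = (uint32_t)(x >> 31);   /* assumes arithmetic shift */
    return (int32_t)(((uint32_t)x ^ mask) - mask);
}

/* Branch-free unsigned minimum, mirroring the SUB/SBB/AND/ADD idea. */
static uint32_t min_branchfree(uint32_t x, uint32_t y)
{
    uint32_t diff = x - y;
    uint32_t mask = (uint32_t)0 - (uint32_t)(x < y);  /* all ones if x < y */
    return y + (diff & mask);
}

/* Unsigned x / 10: multiply by ceil(2^35 / 10), keep bits 35 and up. */
static uint32_t div10_u32(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}

/* Unsigned x / 3: multiply by ceil(2^33 / 3), keep bits 33 and up. */
static uint32_t div3_u32(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xAAAAAAABu) >> 33);
}

/* Signed x / 2^n rounding toward zero, for 1 <= n <= 30. */
static int32_t div_pow2_s32(int32_t x, unsigned n)
{
    int32_t mask = (int32_t)((1u << n) - 1u);
    int32_t bias = (x >> 31) & mask;       /* divisor - 1 if x is negative */
    return (x + bias) >> n;
}

/* Signed x % 2^n with the sign of the dividend, for 1 <= n <= 30. */
static int32_t rem_pow2_s32(int32_t x, unsigned n)
{
    int32_t mask = (int32_t)((1u << n) - 1u);
    int32_t bias = (x >> 31) & mask;       /* pre-correction */
    return ((x + bias) & mask) - bias;
}
```

The shift counts 35 and 33 correspond to the "2^32 * 8" and "2^32 * 2" scale factors in the assembly comments: MUL leaves the upper 32 bits of the product in EDX, and the following SHR supplies the remaining 3 or 1 bits.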
AMD-K6®-2 and AMD-K6®-III Processors Multimedia Coding Optimizations

This section describes multimedia optimization techniques for the AMD-K6-2 and AMD-K6-III processors.

Use 3DNow! instructions for optimal floating-point performance - Wherever possible, use the packed single-precision, floating-point capability of the 3DNow! instructions instead of the double-precision and extended-precision floating-point capabilities of the x87 floating-point unit. The 3DNow! units are fully pipelined, allow vectorized optimizations, are not stack based, and provide faster inverse, square root, and inverse square root calculations.

Issues to ensure optimal predecode of MMX and 3DNow! instructions - Attention must be paid to coding issues that affect the predecoding of MMX and 3DNow! instructions. Instructions are predecoded during instruction cache line fills. The predecode information that is produced is then stored in the predecode cache and later used by the instruction decoders to quickly find consecutive instructions and, therefore, enable dual-instruction decode. (The predecode information, in particular, reflects the length of instructions.) The processor predecode scheme is based on a number of assumptions and constraints that have been mentioned previously, which are repeated here for convenience:

Only the subset of instructions that are short decodable require predecode information. These include all MMX and 3DNow! instructions except the EMMS, FEMMS, and PREFETCH instructions.

Predecodable instructions can be up to seven bytes in length.

The processor predecoders only examine the first three bytes of an instruction to determine the length of the instruction and generate predecode information. To determine instruction length, non-modR/M instructions require examination of only the opcode byte, and modR/M instructions require examination of the opcode byte plus the modR/M byte. Instructions with a 0Fh prefix require examination of this byte in addition to the opcode byte and modR/M byte. Finally, modR/M address modes of the form modR/M = 00_xxx_100b, in which a SIB byte follows the modR/M byte, require examination of that additional byte.
Instructions in this last category that also require a 0Fh prefix violate the three-byte predecode constraint and, therefore, cannot be predecoded. These are instructions that use the [disp32 + index], [disp32 + scale*index], [base + index], and [base + scale*index] address modes and, therefore, require examination of four bytes to determine instruction length. Note that the [base], [disp32], and [base + disp32] address modes are not affected by this.

The 32-bit modR/M address mode [ESI] cannot be predecoded.

For instructions starting within the last two bytes of a cache line, the predecode logic is not able to scan past the end of the cache line when it needs to examine more bytes to determine the length of the instruction. This constraint limits the type of instructions that can be predecoded at the end of a cache line. For example, a modR/M instruction that starts in the last byte of a 32-byte cache line, or a 0Fh-prefix plus modR/M instruction that starts within the last two bytes of a cache line, cannot be predecoded.

All MMX and 3DNow! instructions have a 0Fh-prefix byte, an opcode byte, and a modR/M byte, all of which must be examined by the predecode logic. These constraints result in the following recommendations for successful predecode of multimedia instructions:

With 3DNow! instructions, avoid address modes with large (32-bit) displacements. Large displacements result in a total instruction length of eight bytes (including the additional suffix byte used as the instruction sub-opcode byte).

With MMX and 3DNow! instructions, avoid the [disp32 + index], [disp32 + scale*index], [base + index], [base + scale*index], and [ESI] address modes. Instead use the [base], [base + disp32], [disp32], or [ESI+0] (with a byte offset) address modes.

Avoid placing the start of MMX and 3DNow! instructions in the last two bytes of a cache line.

If not successfully predecoded, MMX instructions default to vector decodes and 3DNow! instructions default to long decodes. A comparison of the instruction decode clock-cycle count for optimized code follows:

One cycle for a short decode as part of a dual decode.
One cycle for a single long decode.
More than one cycle for a single vector decode (for simple instructions such as MMX and 3DNow! instructions).

Avoid using MMX registers to move double-precision floating-point data - Although using an MMX/3DNow! register to move floating-point data appears fast, using these registers requires the use of the EMMS or FEMMS instruction when switching from MMX or 3DNow! instructions to x87 instructions.

Use the FEMMS instruction instead of the EMMS instruction - The processor implements an improved version of the EMMS instruction, called FEMMS. Because the MMX/3DNow! registers are mapped onto the x87 stack, the EMMS or FEMMS instruction must be executed when switching from MMX or 3DNow! code to x87 code. Execution of the EMMS or FEMMS instruction marks the floating-point tag word as empty (all 1s), which guarantees correct results and ensures that no exceptions occur in the subsequent x87 code due to stack overflow.

Each time the processor encounters a switch between MMX or 3DNow! code and x87 code, in either direction, a significant clock-cycle count penalty occurs. The FEMMS instruction was created to reduce this penalty. The FEMMS instruction sets the floating-point tag word to empty (like EMMS), but it also sets the register values to undefined. If a switch is required following a FEMMS instruction, it executes in less than half the cycles required after an EMMS instruction. The switch overhead occurs when the next instruction of the other type is encountered, not during the execution of the EMMS or FEMMS instructions. In addition, the FEMMS instruction itself executes in fewer clock cycles than the EMMS instruction. For more information on the operation and advantages of the FEMMS instruction, see the 3DNow! Technology Manual, order# 21928.

Use the FEMMS instruction at the beginning of an MMX or 3DNow! routine - While the FEMMS instruction is not necessary for correct program functionality at the beginning of MMX or 3DNow! routines, its usage reduces the clock-cycle count penalty when entering such routines from preceding x87 code. If no switch occurs, FEMMS costs only its own execution clock cycles. If a switch is necessary, FEMMS reduces the clock cycles required by over half.

Practice the following general rules when using MMX or 3DNow! code mixed with x87 code:

Always use the FEMMS instruction (instead of EMMS) at the end of an MMX or 3DNow! instruction routine when x87 instructions or unknown code follows.

Use the FEMMS instruction at the beginning of an MMX or 3DNow! instruction routine that is preceded by x87 instructions or unknown code. FEMMS serves to reduce the switch penalty.
Group or partition MMX or 3DNow! code separate from x87 code to minimize the frequency of switching between MMX or 3DNow! operations and x87 operations.

Use the 3DNow! PAVGUSB instruction for MPEG-2 motion compensation - In MPEG-2 decoding, motion compensation performs a lot of byte averaging between and within macroblocks. The PAVGUSB instruction helps speed up these operations. In addition, PAVGUSB can free up some registers and make unrolling the averaging loops possible.

The following code fragment uses original MMX code to perform averaging between the source macroblock and the destination macroblock:

    mov     esi, DWORD PTR Src_MB
    mov     edi, DWORD PTR Dst_MB
    mov     edx, DWORD PTR SrcStride
    mov     ebx, DWORD PTR DstStride
    movq    mm7, QWORD PTR [ConstFEFE]
    movq    mm6, QWORD PTR [Const0101]
    mov     ecx, 16            ;rows per macroblock
L1:
    movq    mm0, [esi]         ;mm0 = qword1
    movq    mm1, [edi]         ;mm1 = qword3
    movq    mm2, mm0
    movq    mm3, mm1
    pand    mm2, mm6
    pand    mm3, mm6
    pand    mm0, mm7           ;mm0 = qword1 & 0xfefefefe
    pand    mm1, mm7           ;mm1 = qword3 & 0xfefefefe
    por     mm2, mm3           ;calculate adjustment
    psrlq   mm0, 1             ;mm0 = (qword1 & 0xfefefefe)/2
    psrlq   mm1, 1             ;mm1 = (qword3 & 0xfefefefe)/2
    pand    mm2, mm6
    paddb   mm0, mm1           ;mm0 = qw1/2 + qw3/2 without adjustment
    paddb   mm0, mm2           ;add adjustment
    movq    [edi], mm0
    movq    mm4, [esi+8]       ;mm4 = qword2
    movq    mm5, [edi+8]       ;mm5 = qword4
    movq    mm2, mm4
    movq    mm3, mm5
    pand    mm2, mm6
    pand    mm3, mm6
    pand    mm4, mm7           ;mm4 = qword2 & 0xfefefefe
    pand    mm5, mm7           ;mm5 = qword4 & 0xfefefefe
    por     mm2, mm3           ;calculate adjustment
    psrlq   mm4, 1             ;mm4 = (qword2 & 0xfefefefe)/2
    psrlq   mm5, 1             ;mm5 = (qword4 & 0xfefefefe)/2
    pand    mm2, mm6
    paddb   mm4, mm5           ;mm4 = qw2/2 + qw4/2 without adjustment
    paddb   mm4, mm2           ;add adjustment
    movq    [edi+8], mm4
    add     esi, edx
    add     edi, ebx
    loop    L1

The following code fragment uses the 3DNow! PAVGUSB instruction to perform averaging between the source macroblock and the destination macroblock:

    mov     eax, DWORD PTR Src_MB
    mov     edi, DWORD PTR Dst_MB
    mov     edx, DWORD PTR SrcStride
    mov     ebx, DWORD PTR DstStride
    mov     ecx, 16            ;rows per macroblock
L1:
    movq    mm0, [eax]         ;mm0 = qword1
    movq    mm1, [eax+8]       ;mm1 = qword2
    pavgusb mm0, [edi]         ;(qw1+qw3)/2 with adjustment
    pavgusb mm1, [edi+8]       ;(qw2+qw4)/2 with adjustment
    add     eax, edx
    movq    [edi], mm0
    movq    [edi+8], mm1
    add     edi, ebx
    loop    L1

3DNow! Matrix Multiplication Optimization Example

The examples that follow contain non-optimized and optimized samples of a matrix multiplied by a vector. This type of code is often used in 3D graphics for geometry transformation. This routine serves to translate, scale, rotate, and apply perspective to 3D coordinates represented in homogeneous coordinates.

The code samples contain many addition and multiplication instructions that can be implemented in three ways. For high-end 3D graphics programs, x87 instructions supply only moderate performance, are not superscalar, and cannot be efficiently intermixed with MMX or 3DNow! instructions. Integer instructions and MMX instructions, while fast and superscalar, do not have the accuracy or dynamic range that is required for these programs. Therefore, 3DNow! instructions, providing the benefit of packed, floating-point data precision and parallel execution, can be used in order to write software that outperforms standard floating-point code, with no switching overhead when intermixed with MMX code.

The following code samples illustrate non-optimized and optimized code. This chapter also describes the steps a programmer should take when optimizing code for the AMD-K6-2 processor.

Non-Optimized Code Sample:

void Transform4x4(Vertex *firstv, int cnt, const Matrix *m)

NON-OPTIMIZED VERSION. Full matrix transform of an array of vertices starting from the vertex pointed to by firstv, using the transform matrix pointed to by m. Each vertex data structure is assumed to have a fixed size, part of which contains the vertex coordinates to be transformed.
    new_x = x*m[0][0] + y*m[0][1] + z*m[0][2] + w*m[0][3];
    new_y = x*m[1][0] + y*m[1][1] + z*m[1][2] + w*m[1][3];
    new_z = x*m[2][0] + y*m[2][1] + z*m[2][2] + w*m[2][3];
    new_w = x*m[3][0] + y*m[3][1] + z*m[3][2] + w*m[3][3];

The code uses the following symbolic offsets: Vrtx_X, Vrtx_Y, Vrtx_Z, and Vrtx_W for the vertex coordinates, and Mat_00, Mat_01, Mat_02, Mat_03, Mat_10, Mat_11, Mat_12, Mat_13, Mat_20, Mat_21, Mat_22, Mat_23, Mat_30, Mat_31, Mat_32, and Mat_33 for the matrix elements.

    ;EAX -> transform matrix
    ;EBX -> firstv, first vertex to be transformed
    ;EDX -> lastv, last vertex to be transformed

Comments appear after the code lines.

TransformLoop:
    ;All multiplies for XResult:
    movq    mm0, QWORD PTR [ebx + Vrtx_X]   ;mm0 = y | x
    movq    mm2, mm0                        ;copy vector

Right at the beginning there is a dependency on mm0, which stalls the second MOVQ for additional clock cycles, even though both instructions are short-decodable and decode together as an instruction pair.

    pfmul   mm0, QWORD PTR [eax + Mat_00]   ;mm0 = y*a21 | x*a11

The PFMUL instruction leads to another dependency, but because of the previous stall, the PFMUL instruction executes when Mat_00 loads from memory. The PFMUL instruction translates into a 3DNow! operation and a Load unit operation.

    movq    mm1, QWORD PTR [ebx + Vrtx_Z]   ;mm1 = w | z

This MOVQ instruction decodes with the previous PFMUL, but there is a resource constraint, with both instructions trying to use the Load unit. This contention causes one of the instructions to stall an extra cycle.

    movq    mm3, mm1                        ;copy vector

Another stall while waiting for mm1.

    pfmul   mm1, QWORD PTR [eax + Mat_20]   ;mm1 = w*a41 | z*a31

Same as the previous PFMUL instruction. Note that the tasks in this code are serialized, with no opportunity to overlap the execution resources. Even though the instructions are short-decode pairs, other constraints are causing stalls. In addition, a scheduler stall occurs when an instruction cannot retire at the bottom of the scheduler because dependency or resource stalls have delayed the instruction too many cycles.

    ;All multiplies for YResult:
    movq    mm4, mm2                        ;copy vector
    pfmul   mm2, QWORD PTR [eax + Mat_01]   ;mm2 = y*a22 | x*a12

These instructions are paired. PFMUL instructions decode into a Load unit operation followed by a 3DNow! Multiply unit operation.
    movq    mm5, mm3                        ;copy vector
    pfmul   mm3, QWORD PTR [eax + Mat_21]   ;mm3 = w*a42 | z*a32

These instructions are paired. Same comments as before.

    ;All multiplies for ZResult:
    movq    mm6, mm4                        ;copy vector
    pfmul   mm4, QWORD PTR [eax + Mat_02]   ;mm4 = y*a23 | x*a13

These instructions are paired. Same comments as before.

    movq    mm7, mm5                        ;copy vector
    pfmul   mm5, QWORD PTR [eax + Mat_22]   ;mm5 = w*a43 | z*a33

These instructions are paired. Same comments as before.

    ;All multiplies for WResult:
    pfmul   mm6, QWORD PTR [eax + Mat_03]   ;mm6 = y*a24 | x*a14
    pfmul   mm7, QWORD PTR [eax + Mat_23]   ;mm7 = w*a44 | z*a34

These instructions are paired. However, this pair causes a conflict for both the Load unit and the 3DNow! Multiplier resources, which stalls one instruction in the scheduler for a clock cycle. The instructions execute in a staggered fashion. The goal for short-decodable pairs is simultaneous execution.

    ;All first sums:
    pfadd   mm0, mm1    ;XResult: mm0 = w*a41 + y*a21 | z*a31 + x*a11
    pfadd   mm2, mm3    ;YResult: mm2 = w*a42 + y*a22 | z*a32 + x*a12

These instructions are paired. However, this pair causes a conflict for the 3DNow! ALU, which delays one instruction.

    pfadd   mm4, mm5    ;ZResult: mm4 = w*a43 + y*a23 | z*a33 + x*a13
    pfadd   mm6, mm7    ;WResult: mm6 = w*a44 + y*a24 | z*a34 + x*a14

These instructions are paired, but there is a conflict for the 3DNow! ALU with the PFADD instruction from the previous pair that was delayed a cycle. These dual-decodable operations serialize in execution, eventually stalling the scheduler because RISC86 instructions can no longer retire.

    ;All final sums:
    pfacc   mm0, mm0    ;XResult
    pfacc   mm2, mm2    ;YResult

These instructions are paired, but there is a conflict for the 3DNow! ALU. See comments above.

    pfacc   mm4, mm4    ;ZResult
    pfacc   mm6, mm6    ;WResult

These instructions are paired, but there is a conflict for the 3DNow! ALU. See comments above.

    ;All result stores:
    movd    DWORD PTR [ebx + Vrtx_X], mm0   ;XResult
    movd    DWORD PTR [ebx + Vrtx_Y], mm2   ;YResult

These instructions are paired, but there is a conflict for the Store unit.

    movd    DWORD PTR [ebx + Vrtx_Z], mm4   ;ZResult
    movd    DWORD PTR [ebx + Vrtx_W], mm6   ;WResult

These instructions are paired, but there is a conflict for the Store unit as well as a delayed store operation from the previous instruction pair.

    add     ebx, Vertex_Stride              ;advance to next vertex
    cmp     ebx, edx                        ;compare with last vertex

A conditional branch then returns to TransformLoop until the last vertex is done. These instructions are paired, but the dependency on the ebx value delays the second instruction one cycle.

Optimized Code Sample:

    void Transform4x4(Vertex *firstv, int cnt, const Matrix *m)

OPTIMIZED VERSION. Full matrix transform of an array of vertices starting from the vertex pointed to by firstv, using the transform matrix pointed to by m. Each vertex data structure is assumed to occupy a fixed number of bytes, part of which contains the vertex coordinates to be transformed.

    new_x = x*m[0][0] + y*m[0][1] + z*m[0][2] + w*m[0][3];
    new_y = x*m[1][0] + y*m[1][1] + z*m[1][2] + w*m[1][3];
    new_z = x*m[2][0] + y*m[2][1] + z*m[2][2] + w*m[2][3];
    new_w = x*m[3][0] + y*m[3][1] + z*m[3][2] + w*m[3][3];

The code uses the same symbolic offsets as the non-optimized version: Vrtx_X, Vrtx_Y, Vrtx_Z, and Vrtx_W for the vertex coordinates, and Mat_00 through Mat_33 for the matrix elements.

    ;EAX -> transform matrix
    ;EBX -> firstv, first vertex to be transformed
    ;ECX =  count of vertices to be transformed

The code begins here, but this section is not in the loop. The initial loads conflict and stall while waiting to load the first vertex values and the first four values from the matrix. However, once the loop begins, this code runs efficiently. Note that most of these instructions are four bytes long, which helps make them short-decodable.

    ;Load first vertex:
    movq    mm6, QWORD PTR [ebx]            ;mm6 = y | x
    movq    mm7, QWORD PTR [ebx + Vrtx_Z]   ;mm7 = w | z

These instructions decode together, but cause a conflict for the Load unit.

    ;Start load of matrix:
    movq    mm0, QWORD PTR [eax + Mat_00]   ;mm0
    movq    mm1, QWORD PTR [eax + Mat_20]   ;mm1

Decode together, but conflict for the Load unit.

TransformLoop:
    prefetchw [ebx + 128]                   ;prefetch next vertex

The PREFETCHW instruction is vector decoded and takes several cycles. However, this instruction increases efficiency because it begins to preload the data cache with the next vertex. A vertex is four dwords, or half a cache line. However, the stride (the distance from one vertex data structure to the next within the vertex array) is large enough in this example that each vertex is in a separate cache line. It is assumed that vertex data starts on cache line boundaries.
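The role of the prefetch can be illustrated with a minimal C sketch of the same transform. Everything named here is an assumption for illustration: the `Vertex` layout (four coordinate dwords padded to 32 bytes, so that each vertex occupies its own line when a vertex is half a cache line), the `Matrix` layout, the function name `transform4x4_prefetch`, and the use of the GCC/Clang builtin `__builtin_prefetch` as a stand-in for PREFETCHW. It is not the routine from the application note.

```c
#include <stddef.h>

/* Hypothetical layouts, assumed for illustration only. */
typedef struct {
    float x, y, z, w;   /* four dwords of coordinates (half a 32-byte line) */
    float pad[4];       /* assumed remaining per-vertex data */
} Vertex;

typedef struct { float m[4][4]; } Matrix;

/* Prefetch-for-write hint; compiles to nothing on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#define PREFETCH_FOR_WRITE(p) __builtin_prefetch((p), 1)
#else
#define PREFETCH_FOR_WRITE(p) ((void)(p))
#endif

/* Transform cnt vertices in place, hinting the next vertex into the
   cache while the current one is being computed. */
void transform4x4_prefetch(Vertex *firstv, int cnt, const Matrix *mat)
{
    for (int i = 0; i < cnt; i++) {
        Vertex *v = &firstv[i];

        if (i + 1 < cnt)
            PREFETCH_FOR_WRITE(&firstv[i + 1]);

        /* Copy the inputs first: the results overwrite them. */
        float x = v->x, y = v->y, z = v->z, w = v->w;

        v->x = x*mat->m[0][0] + y*mat->m[0][1] + z*mat->m[0][2] + w*mat->m[0][3];
        v->y = x*mat->m[1][0] + y*mat->m[1][1] + z*mat->m[1][2] + w*mat->m[1][3];
        v->z = x*mat->m[2][0] + y*mat->m[2][1] + z*mat->m[2][2] + w*mat->m[2][3];
        v->w = x*mat->m[3][0] + y*mat->m[3][1] + z*mat->m[3][2] + w*mat->m[3][3];
    }
}
```

The hint is issued before the arithmetic so that the memory access for the next vertex overlaps the current computation, which is the same reason the assembly places PREFETCHW at the top of the loop.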
From this point forward, the instructions form instruction pairs that both decode into one Opquad. An Opquad is a line in the instruction scheduler that is composed of four RISC86 operations.

    movq    mm2, QWORD PTR [eax + Mat_01]   ;mm2

This MOVQ instruction continues to fill the matrix. Separating the matrix load from the multiply instruction avoids serializing the load and the multiply, which can lead to a stall in the scheduler. The load and the multiply each take multiple cycles to execute. Including the operand fetch stage, this almost fills the six-stage length of the scheduler.

    pfmul   mm0, mm6                        ;mm0 = y*m01 | x*m00

This PFMUL instruction is paired with the MOVQ instruction. These instructions use different resources (Load unit and 3DNow! ALU, respectively). There are no resource conflicts and no dependencies (mm0 should be loaded from three cycles earlier), so the instructions execute together.

    movq    mm3, QWORD PTR [eax + Mat_21]   ;mm3
    pfmul   mm1, mm7                        ;mm1 = w*m03 | z*m02

Same comments as the previous instruction pair.

    movq    mm4, QWORD PTR [eax + Mat_02]   ;mm4
    pfmul   mm2, mm6                        ;mm2 = y*m11 | x*m10

Same comments as the previous instruction pair, except that the load started only a few cycles earlier and should be forwarded from the Load unit to the 3DNow! unit just in time.

    movq    mm5, QWORD PTR [eax + Mat_22]   ;mm5
    pfmul   mm3, mm7                        ;mm3 = w*m13 | z*m12

In this pair of instructions, the last free register is loaded (mm5). Because there are only eight MMX registers, registers must be reused and then reloaded with matrix values for the next vertex calculation.

    ;First sum for XResult:
    pfadd   mm0, mm1
    pfmul   mm4, mm6
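The packed data flow that both listings rely on can be checked with a small C model. This is only a sketch: `Packed2` and the helper functions named after PFMUL, PFADD, and PFACC are hypothetical C stand-ins for a 64-bit register holding two single-precision floats, and `packed_dot4` is an illustrative name for the multiply, add, accumulate sequence that yields one output coordinate.

```c
/* Model of one 64-bit 3DNow! register: two packed single-precision floats. */
typedef struct { float lo, hi; } Packed2;

/* PFMUL: element-wise multiply. */
static Packed2 pfmul(Packed2 d, Packed2 s)
{
    return (Packed2){ d.lo * s.lo, d.hi * s.hi };
}

/* PFADD: element-wise add. */
static Packed2 pfadd(Packed2 d, Packed2 s)
{
    return (Packed2){ d.lo + s.lo, d.hi + s.hi };
}

/* PFACC: low result is the sum of both halves of the destination,
   high result is the sum of both halves of the source. */
static Packed2 pfacc(Packed2 d, Packed2 s)
{
    return (Packed2){ d.lo + d.hi, s.lo + s.hi };
}

/* One output coordinate: x*c0 + y*c1 + z*c2 + w*c3, where xy = (x, y),
   zw = (z, w), c01 = (c0, c1), and c23 = (c2, c3). */
float packed_dot4(Packed2 xy, Packed2 zw, Packed2 c01, Packed2 c23)
{
    Packed2 a = pfmul(xy, c01);   /* y*c1 | x*c0 */
    Packed2 b = pfmul(zw, c23);   /* w*c3 | z*c2 */
    Packed2 s = pfadd(a, b);      /* w*c3 + y*c1 | z*c2 + x*c0 */
    Packed2 r = pfacc(s, s);      /* horizontal sum in both halves */
    return r.lo;
}
```

Each coordinate needs two multiplies, one add, and one accumulate, so a full vertex needs sixteen such operations in total. The optimized listing spreads them out so that loads and arithmetic for different coordinates overlap.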