John Fremlin's blog: Operands to NOP on AMD64

Posted 2010-02-22 23:00:00 GMT

I was looking at a disassembly. It contained this:

nopl   0x0(%rax)

What is the point of passing an operand to NOP? NOP is the instruction that does nothing. That is not quite the whole story: Intel's US Patent 5,701,442, "Method of modifying an instruction set architecture of a computer processor to maintain backward compatibility", suggests that more complex NOP instructions could be used to provide hints such as memory prefetch requests. On processors without the prefetch logic the operations would do nothing, but processors with prefetch would initiate memory requests to bring the data into cache. The extended NOPs that take operands in the amd64 and x86 instruction sets are called hinting NOPs for this reason, but as far as I know they don't yet hint anything (see the PREFETCH instructions at nearby opcodes).

It turns out that these interesting nopl, nopw, etc. instructions are generated by GAS, the GNU assembler (that is, when using GCC). As the instruction set is encoded in quite a uniform way to simplify the decode stage, there are many instructions that achieve nothing: for example, xchg %ax,%ax (the standard two-byte nop) or, on 32-bit x86, leal 0(%esi),%esi, a three-byte nop. Dedicating opcodes to longer NOPs is a sensible way to simplify the CPU's own optimizations.
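To make the role of the operand concrete, here is a minimal Python sketch that works out how long a 0F 1F NOP is from its ModRM byte. The function name hint_nop_length is made up for illustration. The sketch assumes 32-bit or 64-bit addressing, accepts only an optional 66 prefix, and does not check the reg field of the ModRM byte. The memory operand is never accessed; it only determines how many bytes the instruction occupies.

```python
def hint_nop_length(code: bytes) -> int:
    """Length in bytes of a 0F 1F NOP (illustrative sketch, not a full decoder)."""
    i = 0
    if code[i] == 0x66:              # operand-size prefix: nopw instead of nopl
        i += 1
    if code[i:i + 2] != b"\x0f\x1f":
        raise ValueError("not a 0F 1F NOP")
    i += 2
    modrm = code[i]
    i += 1
    mod, rm = modrm >> 6, modrm & 7
    base = None
    if mod != 3 and rm == 4:         # rm == 4 means a SIB byte follows
        base = code[i] & 7
        i += 1
    if mod == 1:
        i += 1                       # 8-bit displacement
    elif mod == 2:
        i += 4                       # 32-bit displacement
    elif mod == 0 and (rm == 5 or base == 5):
        i += 4                       # 32-bit displacement, no base register
    return i


# nopl 0x0(%rax), the instruction from the disassembly above
four = hint_nop_length(bytes.fromhex("0f1f4000"))
# the same opcode with a SIB byte and a 32-bit zero displacement
eight = hint_nop_length(bytes.fromhex("0f1f840000000000"))
```

Here the first call returns 4 and the second returns 8: the only difference between the two instructions is how much addressing information is attached to the same do-nothing opcode.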

Aligning jump targets in code to a 16-byte boundary is important, so that the target can be fetched in a single cacheline request. However, the padding that execution flows through to reach this aligned boundary should be encoded as efficiently as possible: using a single instruction to take up eight bytes, nopl 0L(%[re]ax,%[re]ax,1), is more efficient than repeatedly exercising the decode logic on eight one-byte nops. The code to do this is in binutils/gas/config/tc-i386.c:i386_align_code.
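The padding idea can be sketched in a few lines of Python. This is not the algorithm in i386_align_code, which picks its patterns according to the target CPU; it is a simplified greedy version that fills the gap with the longest NOP available. The names NOPS and pad_to_alignment are invented for this example, and the byte sequences are the multi-byte NOPs recommended in Intel's manual.

```python
# Multi-byte NOP encodings, indexed by length in bytes.
NOPS = {
    1: bytes.fromhex("90"),                  # nop
    2: bytes.fromhex("6690"),                # xchg %ax,%ax
    3: bytes.fromhex("0f1f00"),              # nopl (%rax)
    4: bytes.fromhex("0f1f4000"),            # nopl 0x0(%rax)
    5: bytes.fromhex("0f1f440000"),          # nopl 0x0(%rax,%rax,1)
    6: bytes.fromhex("660f1f440000"),        # nopw 0x0(%rax,%rax,1)
    7: bytes.fromhex("0f1f8000000000"),      # nopl 0L(%rax)
    8: bytes.fromhex("0f1f840000000000"),    # nopl 0L(%rax,%rax,1)
    9: bytes.fromhex("660f1f840000000000"),  # nopw 0L(%rax,%rax,1)
}


def pad_to_alignment(offset: int, align: int = 16) -> bytes:
    """Return NOP bytes that advance `offset` to the next multiple of `align`."""
    remaining = -offset % align      # bytes needed to reach the boundary
    out = bytearray()
    while remaining > 0:
        size = min(remaining, max(NOPS))   # greedy: longest NOP that fits
        out += NOPS[size]
        remaining -= size
    return bytes(out)


padding = pad_to_alignment(0x1008)   # 8 bytes short of the next 16-byte boundary
```

With an offset of 0x1008 the padding is the single eight-byte nopl, so the decoder sees one instruction instead of eight.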

Goes to show how mature the AMD64 instruction set has become, and how far it has come from the days of the Binary Coded Decimal nonsense.

Hi John,

I searched for NOP and AMD64 and found your post.

Two important points: looking at some code output I had that includes many (hundreds of) conditional jumps, I saw a nopl every 16 bytes.

Looking closer, and knowing of the target alignment optimization that you mentioned, I could see that it could NOT be the reason. Therefore, I searched some more to really understand what they're trying to achieve with all those nops. Obviously an optimization, but which one?!

Then I found this document that explains to the letter what they're trying to do:

Paragraph 7.1 says what is happening:

"As described in Chapter 2, “Microarchitecture of AMD Family 15h Processors” on page 29, each pair of integer execution units and one floating-point unit shares one instruction fetch, branch predictor, decode and dispatch unit, also known as the shared frontend. Code layout and alignment optimizations relative to these shared resources can improve performance. The function of this shared front end and related code optimizations are discussed below."

The processor has the capability to memorize ONE jump instruction for every 16 bytes (32 bytes in newer processors). This is the reason for all the additional nopw/nopl instructions: they are trying to limit the number of jumps in each block, not just align the targets. (As far as I recall, target alignment has existed since the early Pentium processors.)
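As a rough illustration of that rule, the following Python sketch takes the addresses of the jump instructions in a piece of code and reports which 16-byte blocks hold more than one. The function name crowded_blocks and the sample addresses are made up, and the one-jump-per-block limit is taken from the description above rather than checked against any particular processor.

```python
from collections import Counter


def crowded_blocks(branch_addrs, block=16):
    """Map block start address -> jump count, for blocks holding more than one jump."""
    counts = Counter(addr // block * block for addr in branch_addrs)
    return {start: n for start, n in sorted(counts.items()) if n > 1}


# Hypothetical jump addresses: two fall in the block starting at 0x00.
crowded = crowded_blocks([0x02, 0x0A, 0x14])
```

Padding with nopw/nopl between the two jumps at 0x02 and 0x0A would push the second one into the next block, which is the effect described above.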

Thank you for the post though. It forced me to look a little further into the matter.

Alexis Wilke

Posted 2011-07-05 00:57:29 GMT by Anonymous from
