Monday 10 March 2008

What the heck does 'rep ret' mean?

I was looking at what the .NET x64 JIT compiler generates for some code, and saw something very odd at the end of the routine: the last instruction of the function was rep ret. Looking a bit further, this is the same at the end of every JIT-compiled routine.

The thing is, the rep prefix to an instruction is supposed to tell it to be repeated. Repeat the return? How do I do that?

The Intel architecture software developer’s manual set says it’s only defined for the ‘string’ instructions like movs (which moves a byte/word/dword from the address pointed to by ESI to the address pointed to by EDI). The rep prefix repeats the string instruction ECX times. Yes, this means that you can implement memcpy in a single instruction. (You can do memset with a single rep stos instruction, once AL is loaded with the value to be stored.) It’s explicitly undefined for anything else.
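To make the "memcpy in a single instruction" point concrete, here is a minimal sketch in C using GCC/Clang inline assembly. The function name rep_movsb_copy is made up for illustration, and the sketch assumes an x86-64 target (where the registers are RSI, RDI and RCX rather than ESI, EDI and ECX) with the direction flag clear, as the usual calling conventions guarantee on function entry. On anything else it falls back to a plain loop that does the same work one byte at a time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copy n bytes from src to dst with one `rep movsb`. */
static void *rep_movsb_copy(void *dst, const void *src, size_t n)
{
    void *ret = dst;
    assert(n == 0 || (dst != NULL && src != NULL));
#if defined(__GNUC__) && defined(__x86_64__)
    /* RDI = destination, RSI = source, RCX = byte count. The instruction
       updates all three, so they are read-write ("+") operands. Assumes
       the direction flag is clear, so the copy runs forwards. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Portable fallback mirroring what rep movsb does: copy a byte,
       advance both pointers, decrement the count, repeat until zero. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
#endif
    return ret;
}
```

Calling rep_movsb_copy(dst, src, len) behaves like memcpy for non-overlapping buffers. A real memcpy is usually more elaborate than this, so treat it as a demonstration of the prefix, not as a replacement.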

So where the heck has this illegal usage come from? I followed a couple of clues and found this patch notification for glibc on x64. And indeed the current version of AMD’s optimization guide [PDF] for Athlon 64 processors says that you should do this. The reason? The branch predictor gets it wrong if the ret instruction is jumped to directly by a branch instruction, or if the ret directly follows a branch instruction.

I’m not sure it’s necessary to do this throughout, though, even when you’ve got epilog code in there which prevents the bug.

AMD have now published a new optimization guide for their Family 10h processors [PDF] and guess what, the advice has changed. Instead of using a two-byte illegal instruction, they now recommend the three-byte instruction ret 0. The difference between a plain ret and a ret imm16 (where imm16 is an immediate 16-bit value) is that ret imm16 pops the return address, then the specified number of bytes from the stack, before jumping to the return address. It’s common to see this in 32-bit Windows WINAPI (__stdcall) code as this calling convention requires the called function to clean up the parameters from the stack. (64-bit Windows has only one calling convention and it mainly passes parameters in registers, so stack cleanup is not required.)
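If the stack behaviour of ret imm16 is hard to picture, a toy model in C may help. Everything here (ToyStack, push64, ret_imm16) is invented for illustration: it simulates a small 64-bit stack in an array, not the real processor. The encodings mentioned in the comments are the standard ones: C3 for plain ret, F3 C3 for rep ret, and C2 followed by a 16-bit immediate for ret imm16, so ret 0 is C2 00 00.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stack: sp is a byte offset into mem, and the stack grows downwards. */
typedef struct {
    uint8_t mem[64];
    size_t  sp;
} ToyStack;

/* Push a 64-bit value, as a call or push instruction would. */
static void push64(ToyStack *s, uint64_t value)
{
    assert(s->sp >= 8);
    s->sp -= 8;
    memcpy(&s->mem[s->sp], &value, 8);
}

/* Model of `ret imm16` (C2 iw): pop the return address, then discard
   imm16 further bytes. Plain `ret` (C3) is the imm16 == 0 case. */
static uint64_t ret_imm16(ToyStack *s, uint16_t imm16)
{
    uint64_t target;
    assert(s->sp + 8 + imm16 <= sizeof s->mem);
    memcpy(&target, &s->mem[s->sp], 8);
    s->sp += 8;      /* the return address is popped first */
    s->sp += imm16;  /* then the callee's stack arguments are released */
    return target;   /* execution would continue at this address */
}
```

With two 8-byte arguments and a return address pushed, ret_imm16(&s, 16) leaves the stack pointer where it was before the arguments went on, which is the callee-cleans-up behaviour described above. ret_imm16(&s, 0) pops only the return address, exactly like a plain ret.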

Still, it’s a shame to see the JIT generating this on my Core 2 Duo laptop, which doesn’t have the bug (as far as I know; there’s no mention of it in Intel’s optimization guide). And it’s an even bigger shame on AMD that they a) didn’t fix the damn bug in the processor and b) recommended an illegal instruction as a way round it.
