Optimising integer division
by Richard Russell, July 2010
The integer division instructions div (unsigned) and idiv (signed) are the slowest of all the integer instructions, sometimes taking tens of clock cycles. When the divisor is a constant (that is, it is known when the code is assembled) a significantly faster execution can be achieved by using a multiplication (effectively by the reciprocal of the divisor) followed by a shift.
A difficulty with this approach is that it is non-trivial to devise an instruction sequence which is guaranteed to produce the identical result to the division instruction it replaces. Listed at the end of this article are two assembler macro functions FN_udiv and FN_sdiv which automatically generate the optimum code sequence for any divisor, giving identical results to the div and idiv instructions respectively.
It is probably easiest to illustrate the use of these macros by example. Suppose you want to divide the unsigned value in the ebx register by 123; the conventional code would be something like this:
00B4174C 8B C3 mov eax,ebx 00B4174E 33 D2 xor edx,edx 00B41750 B9 7B 00 00 00 mov ecx,123 00B41755 F7 F1 div ecx
This will return the unsigned quotient in the eax register.
To speed up the operation use the FN_udiv macro as follows:
00B423E2 B8 53 08 34 85 F7 E3 05 53 08 34 85 83 D2 00 C1 EA 06 OPT FN_udiv("ebx", 123)
Although the code is longer (18 bytes compared with 11) the execution will be significantly faster. Note that the quotient is returned in edx not in eax.
The actual code generated by the macro is as follows:
00B423E2 B8 53 08 34 85 mov eax,&85340853 00B423E7 F7 E3 mul ebx 00B423E9 05 53 08 34 85 add eax,&85340853 00B423EE 83 D2 00 adc edx,0 00B423F1 C1 EA 06 shr edx,6
Suppose instead that one wishes to divide the signed value stored in the memory location [edi] by 123. The conventional code would be something like this:
00B42424 8B 07 mov eax,[edi] 00B42426 99 cdq 00B42427 B9 7B 00 00 00 mov ecx,123 00B4242C F7 F9 idiv ecx
This will return the signed quotient in the eax register.
To speed up the operation use the FN_sdiv macro as follows:
00B423EE B8 15 02 4D 21 F7 2F 8B 07 C1 FA 04 C1 E8 1F 03 D0 OPT FN_sdiv("dword [edi]", 123)
Again the the quotient is returned in edx not in eax.
The actual code generated by the macro in this case is as follows:
00B423EE B8 15 02 4D 21 mov eax,&214D0215 00B423F3 F7 2F imul dword [edi] 00B423F5 8B 07 mov eax,dword [edi] 00B423F7 C1 FA 04 sar edx,4 00B423FA C1 E8 1F shr eax,31 00B423FD 03 D0 add edx,eax
In the case of FN_udiv the dividend can be any register (except eax) or a memory location. For FN_sdiv the dividend can be any register (except eax and edx) or a memory location. The divisor can be any 32-bit value except zero.
Here is the code of the macros:
REM Generate assembler code for fast unsigned division by a constant REM Dividend may be a memory location or a register other than eax DEF FN_udiv(dividend$, d%) LOCAL opt%, i%, l%, f#, i# opt% = ?419 >> 4 l% = FNlog2(d%) f# = 2.0#^(l%+32) / FNuint(d%) i# = INT(f#) IF f# = i# THEN PROCasm(opt%, "mov edx,"+dividend$) ELSEIF (f# - i#) < 0.5# THEN; i% = FNintu(i#) WHILE (i% AND 1)=0 AND l%>0 i% = i% >>> 1 l% -= 1 ENDWHILE PROCasm(opt%, "mov eax,&"+STR$~i%) PROCasm(opt%, "mul "+dividend$) PROCasm(opt%, "add eax,&"+STR$~i%) PROCasm(opt%, "adc edx,0") ELSE i% = FNintu(i#+1) WHILE (i% AND 1)=0 AND l%>0 i% = i% >>> 1 l% -= 1 ENDWHILE PROCasm(opt%, "mov eax,&"+STR$~i%) PROCasm(opt%, "mul "+dividend$) ENDIF IF l% PROCasm(opt%, "shr edx,"+STR$l%) = opt% REM Generate assembler code for fast *signed* division by a constant REM Dividend may be a memory location or a register other than eax or edx DEF FN_sdiv(dividend$, d%) LOCAL opt%, e%, i%, l%, j#, k#, l#, m# opt% = ?419 >> 4 e% = d% d% = ABS(d%) l% = FNlog2(d%) IF (d% AND (d%-1)) = 0 THEN PROCasm(opt%, "mov eax,"+dividend$) PROCasm(opt%, "cdq") PROCasm(opt%, "and edx,&"+STR$~(d%-1)) PROCasm(opt%, "add edx,eax") IF l% PROCasm(opt%, "sar edx,"+STR$l%) IF e%<0 PROCasm(opt%, "neg edx") = opt% ENDIF j# = 2.0#^31 / d% j# = INT(d% * (j# - INT(j#))) k# = INT(2.0#^(l%+32) / (2.0#^31 - j#)) l# = INT(2.0#^(l%+32) / d%) m# = INT((2.0#^(l%+32) + k#) / d%) WHILE INT(l#/2)<INT(m#/2) AND l%>0 l# = INT(l#/2) m# = INT(m#/2) l% -= 1 ENDWHILE i% = FNintu(m#) PROCasm(opt%, "mov eax,&"+STR$~i%) PROCasm(opt%, "imul "+dividend$) PROCasm(opt%, "mov eax,"+dividend$) IF (i% AND &80000000) PROCasm(opt%, "add edx,eax") IF l% PROCasm(opt%, "sar edx,"+STR$l%) PROCasm(opt%, "shr eax,31") PROCasm(opt%, "add edx,eax") IF e%<0 PROCasm(opt%, "neg edx") = opt% DEF FNlog2(i%) LOCAL t% REPEAT i% = i% >>> 1 t% += 1 UNTIL i% = 0 = t% - 1 DEF PROCasm(opt%, code$) LOCAL chan% code$ = "[OPT " + STR$(3) + " : " + code$ + " : ]" chan% = OPENOUT(@tmp$+"ASMDIV") PRINT #chan%, CHR$(LEN(code$)+4) + CHR$0 + CHR$0 + code$ : BPUT #chan%,0 CLOSE #chan% CALL @tmp$+"ASMDIV" ENDPROC DEF FNintu(N) = ((N / 2) << 1) - (INT(N / 2) <> N / 2) DEF FNuint(N%) = (N% >>> 1) * 2 + (N% AND 1)