Using SSE2 instructions
by Richard Russell, updated May 2015
SSE2 instructions are supported by the ASMLIB2 library, and that will generally be the most appropriate way to incorporate them in a program. However, using the library has one significant disadvantage: the resulting program cannot be straightforwardly compiled, because the SSE2 instructions will not be accepted by the cruncher. To work around this issue the assembler code must be placed in a separate file (with an extension other than .BBC) which is executed at run time, for example:
CALL "mysse2code.bba"
(The file should have RETURN as its last statement.)
Whilst this solution is relatively straightforward, it is arguably inconvenient, especially if the amount of assembler code is small. There is an alternative way of assembling many of the SSE2 instructions which does not require the use of a library and which allows the program to be compiled conventionally: add a word qualifier to the equivalent MMX instruction. So, for example, the instruction:
punpcklbw xmm0,xmm1
can be assembled as follows:
punpcklbw word mm0,mm1 ; punpcklbw xmm0,xmm1
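One way to check what the word qualifier does is to assemble the instruction and list the bytes produced. The following is a minimal sketch only: the names code% and I% are arbitrary, and it assumes the 32-bit x86 assembler built into BBC BASIC for Windows. If the substitution works as described, the bytes listed should be 66 0F 60 C1, which is the encoding Intel documents for punpcklbw xmm0,xmm1 (the MMX opcode preceded by the 66 hex operand-size prefix):

```bbcbasic
REM Sketch: assemble one instruction and list the generated bytes
REM (code% and I% are arbitrary names chosen for this example)
DIM code% 15
P% = code%
[OPT 2
punpcklbw word mm0,mm1    ; intended as punpcklbw xmm0,xmm1
]
REM P% now points just past the last byte assembled
FOR I% = code% TO P%-1
  PRINT RIGHT$("0"+STR$~(?I%),2);" ";
NEXT
PRINT
```

Nothing is executed here; the instruction is only assembled and inspected, so the sketch does not itself need an SSE2-capable processor.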
The full set of SSE2 instructions which can be assembled in this way is as follows:
punpcklbw word mm0,mm1    ; punpcklbw xmm0,xmm1
punpcklwd word mm0,mm1    ; punpcklwd xmm0,xmm1
punpckldq word mm0,mm1    ; punpckldq xmm0,xmm1
punpckhbw word mm0,mm1    ; punpckhbw xmm0,xmm1
punpckhwd word mm0,mm1    ; punpckhwd xmm0,xmm1
punpckhdq word mm0,mm1    ; punpckhdq xmm0,xmm1
packsswb word mm0,mm1     ; packsswb xmm0,xmm1
packssdw word mm0,mm1     ; packssdw xmm0,xmm1
packuswb word mm0,mm1     ; packuswb xmm0,xmm1
pcmpgtb word mm0,mm1      ; pcmpgtb xmm0,xmm1
pcmpgtw word mm0,mm1      ; pcmpgtw xmm0,xmm1
pcmpgtd word mm0,mm1      ; pcmpgtd xmm0,xmm1
pcmpeqb word mm0,mm1      ; pcmpeqb xmm0,xmm1
pcmpeqw word mm0,mm1      ; pcmpeqw xmm0,xmm1
pcmpeqd word mm0,mm1      ; pcmpeqd xmm0,xmm1
pshufw word mm0,mm1,5     ; pshufd xmm0,xmm1,5
psrlw word mm0,5          ; psrlw xmm0,5
psrld word mm0,5          ; psrld xmm0,5
psrlq word mm0,5          ; psrlq xmm0,5
psrlw word mm0,mm1        ; psrlw xmm0,xmm1
psrld word mm0,mm1        ; psrld xmm0,xmm1
psrlq word mm0,mm1        ; psrlq xmm0,xmm1
psraw word mm0,5          ; psraw xmm0,5
psrad word mm0,5          ; psrad xmm0,5
psraw word mm0,mm1        ; psraw xmm0,xmm1
psrad word mm0,mm1        ; psrad xmm0,xmm1
psllw word mm0,5          ; psllw xmm0,5
pslld word mm0,5          ; pslld xmm0,5
psllq word mm0,5          ; psllq xmm0,5
psllw word mm0,mm1        ; psllw xmm0,xmm1
pslld word mm0,mm1        ; pslld xmm0,xmm1
psllq word mm0,mm1        ; psllq xmm0,xmm1
pinsrw word mm0,[esi],5   ; pinsrw xmm0,[esi],5
pextrw word [esi],mm0,5   ; pextrw [esi],xmm0,5
pavgb word mm0,mm1        ; pavgb xmm0,xmm1
pavgw word mm0,mm1        ; pavgw xmm0,xmm1
pmullw word mm0,mm1       ; pmullw xmm0,xmm1
pmulhuw word mm0,mm1      ; pmulhuw xmm0,xmm1
pmulhw word mm0,mm1       ; pmulhw xmm0,xmm1
movntq word [edi],mm1     ; movntdq [edi],xmm1
pmaddwd word mm0,mm1      ; pmaddwd xmm0,xmm1
psadbw word mm0,mm1       ; psadbw xmm0,xmm1
maskmovq word mm0,mm1     ; maskmovdqu xmm0,xmm1
movd word mm0,[esi]       ; movd xmm0,[esi]
movd word [edi],mm0       ; movd [edi],xmm0
movq word mm0,[esi]       ; movdqa xmm0,[esi]
movq word [edi],mm0       ; movdqa [edi],xmm0
psubusb word mm0,mm1      ; psubusb xmm0,xmm1
psubusw word mm0,mm1      ; psubusw xmm0,xmm1
psubsb word mm0,mm1       ; psubsb xmm0,xmm1
psubsw word mm0,mm1       ; psubsw xmm0,xmm1
psubb word mm0,mm1        ; psubb xmm0,xmm1
psubw word mm0,mm1        ; psubw xmm0,xmm1
psubd word mm0,mm1        ; psubd xmm0,xmm1
paddusb word mm0,mm1      ; paddusb xmm0,xmm1
paddusw word mm0,mm1      ; paddusw xmm0,xmm1
paddsb word mm0,mm1       ; paddsb xmm0,xmm1
paddsw word mm0,mm1       ; paddsw xmm0,xmm1
paddb word mm0,mm1        ; paddb xmm0,xmm1
paddw word mm0,mm1        ; paddw xmm0,xmm1
paddd word mm0,mm1        ; paddd xmm0,xmm1
pminub word mm0,mm1       ; pminub xmm0,xmm1
pminsw word mm0,mm1       ; pminsw xmm0,xmm1
pmaxub word mm0,mm1       ; pmaxub xmm0,xmm1
pmaxsw word mm0,mm1       ; pmaxsw xmm0,xmm1
pand word mm0,mm1         ; pand xmm0,xmm1
pandn word mm0,mm1        ; pandn xmm0,xmm1
por word mm0,mm1          ; por xmm0,xmm1
pxor word mm0,mm1         ; pxor xmm0,xmm1
In addition the MOVDQU instruction (unaligned move) may be assembled as follows:
repe movq mm0,[esi]       ; movdqu xmm0,[esi]
repe movq [edi],mm0       ; movdqu [edi],xmm0
In all cases, where mm0 or mm1 (xmm0 or xmm1) is shown, any of the eight registers may be used instead. In many cases a memory reference can be used instead of the mm1 register in the 'source' field.
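Putting the two techniques together, the following sketch adds two 16-byte buffers with a single paddb. It is illustrative only: the names code%, src%, dst% and addbytes are made up for the example, and the buffer addresses are assembled in as immediate values. The unaligned movdqu form is used because memory reserved with DIM is not assumed here to lie on a 16-byte boundary, which movdqa (and SSE2 memory operands generally) require. It assumes a 32-bit x86 version of BBC BASIC and a processor that supports SSE2.

```bbcbasic
REM Sketch: add two 16-byte buffers using SSE2 via the MMX mnemonics
REM (code%, src%, dst% and addbytes are hypothetical names)
DIM code% 63, src% 15, dst% 15
FOR pass% = 0 TO 2 STEP 2
  P% = code%
  [OPT pass%
  .addbytes
  mov edx,src%              ; address of first buffer (immediate)
  mov ecx,dst%              ; address of second buffer (immediate)
  repe movq mm0,[edx]       ; movdqu xmm0,[edx]
  repe movq mm1,[ecx]       ; movdqu xmm1,[ecx]
  paddb word mm0,mm1        ; paddb xmm0,xmm1
  repe movq [ecx],mm0       ; movdqu [ecx],xmm0
  ret
  ]
NEXT pass%

REM Fill the buffers: src% holds 0 to 15, dst% holds 100 in every byte
FOR I% = 0 TO 15
  src%?I% = I%
  dst%?I% = 100
NEXT

CALL addbytes

REM List the sixteen result bytes
FOR I% = 0 TO 15
  PRINT " ";dst%?I%;
NEXT
PRINT
```

If the routine behaves as intended, each byte of dst% should hold 100 plus its index after the CALL, that is the values 100 to 115. Only eax-style scratch registers (ecx and edx) are used for addressing, so the sketch does not depend on how the other registers are treated on entry to or exit from the routine.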