PROWAREtech
Intel IA-32 Assembly Tutorial - A Guide to the Basics of x86 Assembly - Page 14
The assembly code on this page was compiled and tested with Visual Studio 2022. Disabling the option "Image Has Safe Exception Handlers" under Project Properties → Configuration Properties → Linker → All Options may be required to compile most of this code.
Be familiar with SSE/SSE2 before reading this part of the tutorial.
AVX/AVX2: Advanced Vector Extensions
AVX is the latest generation of Single Instruction, Multiple Data (SIMD) extensions added to the x86 architecture, and it is available only on newer CPUs, such as Intel's Sandy Bridge and later. It is a large extension to the x86 instruction set and the biggest since SSE. AVX2 is also known as "Haswell New Instructions" because it first appeared in Intel's Haswell processors.
AVX requires operating system support. Windows 7 requires Service Pack 1 for AVX; Windows 8.1, 10 and 11 support it out of the box. Linux supports AVX as of kernel 2.6.30. The OS support is needed because the register file is larger: when the operating system performs a context switch, it must save and restore the wider YMM registers.
Check for the CPUID instruction:
TITLE 'extern "C" int __cdecl isCPUID();'
.686P
.model FLAT
PUBLIC _isCPUID
_TEXT SEGMENT
_isCPUID PROC NEAR
push ebx ; save ebx for the caller
pushfd ; push eflags on the stack
pop eax ; pop them into eax
mov ebx, eax ; save to ebx for restoring afterwards
xor eax, 200000h ; toggle bit 21
push eax ; push the toggled eflags
popfd ; pop them back into eflags
pushfd ; push eflags
pop eax ; pop them back into eax
cmp eax, ebx ; if eax still equals ebx then bit 21 could not be toggled
jz not_supported ; so the CPUID instruction is not supported
mov eax, 1
jmp exit
not_supported:
xor eax, eax
exit:
pop ebx
ret 0
_isCPUID ENDP
_TEXT ENDS
END
Detecting AVX
TITLE 'extern "C" int __cdecl isAVX();'
.686P
.model FLAT
PUBLIC _isAVX
_TEXT SEGMENT
_isAVX PROC NEAR
push ebx
mov eax, 1
cpuid
shr ecx, 28 ; bit 28 of the ecx register
and ecx, 1
mov eax, ecx
pop ebx
ret 0
_isAVX ENDP
_TEXT ENDS
END
Detecting AVX2
TITLE 'extern "C" int __cdecl isAVX2();'
.686P
.model FLAT
PUBLIC _isAVX2
_TEXT SEGMENT
_isAVX2 PROC NEAR
push ebx
mov eax, 7
xor ecx, ecx
cpuid
shr ebx, 5 ; bit 5 of the ebx register
and ebx, 1
mov eax, ebx
pop ebx
ret 0
_isAVX2 ENDP
_TEXT ENDS
END
AVX Compared to SSE
Now, the exciting changes in AVX as compared to SSE. Registers are 256 bits (32 bytes) wide, twice the width of SSE registers. The AVX instructions work on eight packed singles (the C/C++ float data-type) or four packed doubles (the C/C++ double data-type). Many AVX instructions take three operands and are non-destructive, so the source operands are not overwritten as they are in SSE! This is a big departure from how most of the x86 architecture works. AVX introduced the YMMWORD, a 256-bit value. In 64-bit mode there are 16 registers (YMM0 to YMM15); in 32-bit mode only YMM0 to YMM7 are available. The lower 128 bits of each YMM register are aliased to the corresponding XMM (SSE) register.
One 256-bit YMM register can be viewed as:
1 ymmword = 2 xmmwords = 4 doubles = 8 singles
Some SSE2 instructions deal with integers. The AVX instructions deal only with singles and doubles, but the AVX registers can of course be used to store many different data-types. AVX2 expands most SSE2 integer instructions to 256 bits (32 bytes) and introduces new instructions.
Most SSE instructions have AVX forms. The mnemonics of AVX instructions have a V prefix, so the SSE instruction SUBPS becomes VSUBPS, for example. MMX instructions have no AVX forms. Scalar SSE instructions such as ADDSS do have AVX forms (VADDSS). Most SSE2 integer instructions gained 256-bit forms with AVX2.
There are new broadcasting (fill AVX register with a value), extraction/insertion, masked moves (imitates SIMD branching), shuffling and zeroing instructions.
Consider the following code:
vsubps ymm0, ymm1, ymm2
This operation subtracts eight packed singles. It will subtract register ymm2 from register ymm1 and store the eight results in register ymm0.
When copying to and from memory, AVX is bottlenecked by the speed of the memory and as a result is only about as fast as SSE. The difference shows in calculations, so if an application primarily moves a lot of data in and out of memory then SSE may be a better choice because of wider CPU and OS support.
In this complete example, 32 bytes are copied from one location in memory to another:
TITLE 'extern "C" int __cdecl avx_copy_example(char dest[], char src[]);'
.686P
.xmm
.model FLAT
PUBLIC _avx_copy_example
_TEXT SEGMENT
_avx_copy_example PROC NEAR
mov eax, [esp+(2+0)*4] ; src - should be at least 32 bytes long
vmovupd ymm0, YMMWORD PTR [eax] ; copy src to ymm0
mov eax, [esp+(1+0)*4] ; dest - should be at least 32 bytes long
vmovupd YMMWORD PTR [eax], ymm0 ; copy ymm0 to dest
vzeroupper ; clear the upper YMM state to avoid SSE/AVX transition penalties in the caller
mov eax, 1
ret 0
_avx_copy_example ENDP
_TEXT ENDS
END
Here is the C/C++ driver code:
extern "C" int __cdecl avx_copy_example(char dest[], char src[]);
int main()
{
char src[32], dest[32];
for (int i = 0; i < 32; i++)
src[i] = 'a' + i;
avx_copy_example(dest, src);
return 0;
}
Data Alignment
Just as aligned SSE instructions require data aligned to 16 bytes, data must be 32-byte aligned to use aligned AVX instructions like VMOVAPD and VMOVAPS. If the data are not 32-byte aligned then these instructions raise a fault, which crashes the program.
New Instructions
AVX
As already stated, most SSE instructions are available with a V prefix added to the beginning of the mnemonic. The following AVX instructions are completely new:
Instruction | Description |
---|---|
VBROADCASTSS VBROADCASTSD VBROADCASTF128 | Copy a 32-bit, 64-bit or 128-bit memory operand to all elements of an XMM or YMM vector register. |
VINSERTF128 | Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged. |
VEXTRACTF128 | Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand. |
VMASKMOVPS VMASKMOVPD | Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero, or, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged. |
VPERMILPS VPERMILPD | Permute In-Lane. Shuffle the 32-bit or 64-bit vector elements of one input operand. These are in-lane 256-bit instructions, meaning that they operate on all 256 bits as two separate 128-bit shuffles, so they cannot shuffle across the 128-bit lanes. |
VPERM2F128 | Shuffle the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector. See example below. |
VTESTPS VTESTPD | Packed bit test of the packed single-precision or double-precision floating-point sign bits, setting or clearing the ZF flag based on AND and CF flag based on ANDN. |
VZEROALL | Set all YMM registers to zero and tag them as unused. Used when switching between 128-bit use and 256-bit use. |
VZEROUPPER | Set the upper half of all YMM registers to zero. Used when switching between 128-bit use and 256-bit use. |
VPERM2F128 Instruction Example
Like the SSE shuffle instructions, this is a somewhat complex one, so the example is rather long to showcase some of what this instruction can do.
VPERM2F128
— parameters: [ymm], [ymm], [ymm/256-bits-memory], [imm8] — first two must be YMM registers
TITLE 'extern "C" int __cdecl avx_vperm2f128_example(double src1[], double src2[], double dest[], void (*print)(double d[], unsigned char imm8));'
.686P
.xmm
.model FLAT
PUBLIC _avx_vperm2f128_example
_TEXT SEGMENT
_avx_vperm2f128_example PROC NEAR
push esi
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 00b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 00b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 01b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 01b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 10b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 10b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 11b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 11b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 010000b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 010000b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 100000b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 100000b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 110000b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 110000b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 010001b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 010001b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 100010b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 100010b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 110011b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 110011b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 010010b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 010010b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 110010b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 110010b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 010011b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 010011b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 00011000b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 00011000b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 10000010b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 10000010b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
mov esi, [esp+(1+1)*4] ; parameter: src1
vmovupd ymm1, YMMWORD PTR [esi]
mov esi, [esp+(2+1)*4] ; parameter: src2
vperm2f128 ymm0, ymm1, YMMWORD PTR [esi], 10001000b
mov esi, [esp+(3+1)*4] ; parameter: dest
vmovupd YMMWORD PTR [esi], ymm0
push 10001000b
push esi ; parameter: dest
mov eax,[esp+(4+3)*4] ; parameter: print function
call eax ; call print function
add esp, 8
vzeroupper ; clear the upper YMM state to avoid SSE/AVX transition penalties in the caller
mov eax, 1
pop esi
ret 0
_avx_vperm2f128_example ENDP
_TEXT ENDS
END
The driver code:
#include <iostream>
using namespace std;
extern "C" int __cdecl avx_vperm2f128_example(double src1[], double src2[], double dest[], void (*print)(double d[], unsigned char imm8));
void print(double d[], unsigned char imm8)
{
unsigned char printbits = 0;
for (int i = 0; i < 8; i++)
{
printbits <<= 1;
printbits += (imm8 >> i & 1);
}
cout << endl << "imm8 = ";
for (int i = 0; i < 8; i++)
{
cout << (printbits >> i & 1);
}
cout << "b" << " = " << (int)imm8 << endl << "dest = [";
for (int i = 0; i < 4; i++)
{
if (i > 0)
cout << ", ";
cout << d[i];
}
cout << "]" << endl;
}
int main()
{
double src1[4], src2[4], dest[4];
for (int i = 0; i < 4; i++)
src1[i] = (i + 1) * 1.1;
for (int i = 0; i < 4; i++)
src2[i] = (i + 1) * 10.1;
cout << "src1 = [";
for (int i = 0; i < 4; i++)
{
if (i > 0)
cout << ", ";
cout << src1[i];
}
cout << "]" << endl << "src2 = [";
for (int i = 0; i < 4; i++)
{
if (i > 0)
cout << ", ";
cout << src2[i];
}
cout << "]" << endl;
avx_vperm2f128_example(src1, src2, dest, print);
cout << endl;
cin.get();
return 0;
}
The program output:
src1 = [1.1, 2.2, 3.3, 4.4]
src2 = [10.1, 20.2, 30.3, 40.4]

imm8 = 00000000b = 0
dest = [1.1, 2.2, 1.1, 2.2]

imm8 = 00000001b = 1
dest = [3.3, 4.4, 1.1, 2.2]

imm8 = 00000010b = 2
dest = [10.1, 20.2, 1.1, 2.2]

imm8 = 00000011b = 3
dest = [30.3, 40.4, 1.1, 2.2]

imm8 = 00010000b = 16
dest = [1.1, 2.2, 3.3, 4.4]

imm8 = 00100000b = 32
dest = [1.1, 2.2, 10.1, 20.2]

imm8 = 00110000b = 48
dest = [1.1, 2.2, 30.3, 40.4]

imm8 = 00010001b = 17
dest = [3.3, 4.4, 3.3, 4.4]

imm8 = 00100010b = 34
dest = [10.1, 20.2, 10.1, 20.2]

imm8 = 00110011b = 51
dest = [30.3, 40.4, 30.3, 40.4]

imm8 = 00010010b = 18
dest = [10.1, 20.2, 3.3, 4.4]

imm8 = 00110010b = 50
dest = [10.1, 20.2, 30.3, 40.4]

imm8 = 00010011b = 19
dest = [30.3, 40.4, 3.3, 4.4]

imm8 = 00011000b = 24
dest = [0, 0, 3.3, 4.4]

imm8 = 10000010b = 130
dest = [10.1, 20.2, 0, 0]

imm8 = 10001000b = 136
dest = [0, 0, 0, 0]
This figure shows how the destination is affected by the bit fields in [imm8]:

IMM8 BITS 0 and 1 (select DEST[127:0]):
00000000b: SRC1[127:0] → DEST[127:0]
00000001b: SRC1[255:128] → DEST[127:0]
00000010b: SRC2[127:0] → DEST[127:0]
00000011b: SRC2[255:128] → DEST[127:0]
IMM8 BIT 2:
00000100b: reserved
IMM8 BIT 3 (zero the low half):
00001000b: 0 → DEST[127:0]
IMM8 BITS 4 and 5 (select DEST[255:128]):
00000000b: SRC1[127:0] → DEST[255:128]
00010000b: SRC1[255:128] → DEST[255:128]
00100000b: SRC2[127:0] → DEST[255:128]
00110000b: SRC2[255:128] → DEST[255:128]
IMM8 BIT 6:
01000000b: reserved
IMM8 BIT 7 (zero the high half):
10000000b: 0 → DEST[255:128]
Copy Memory Using 256-bit AVX Registers
This code will copy memory from one location to another using the 32-byte YMM registers (it is not really faster than the SSE 128-bit copy example because memory speed is the bottleneck):
TITLE 'extern "C" char * memcopy256(char *dest, const char *src, unsigned int length);'
.686P
.xmm
.model FLAT
PUBLIC _memcopy256
_TEXT SEGMENT
_memcopy256 PROC NEAR
push ebx
push esi
mov esi, DWORD PTR [esp+(3+3)*4] ; move length to esi
mov ebx, esi ; move length to ebx
shr esi, 5 ; divide by 32 - holds quotient now
shl esi, 5 ; multiply by 32
sub ebx, esi ; find the remainder and store in ebx
shr esi, 5 ; divide by 32 - holds quotient now
mov ecx, esi ; move the quotient to ecx
mov esi, DWORD PTR [esp+(2+3)*4] ; move the src pointer to esi
mov eax, DWORD PTR [esp+(1+3)*4] ; move the dest pointer to eax
cmp ecx, 0 ; make sure there are at least 32 bytes to copy
je compare_remainder
copy32bytes:
vmovdqu ymm0, YMMWORD PTR [esi] ; copy 32 bytes from src to ymm0
vmovdqu YMMWORD PTR [eax], ymm0 ; copy 32 bytes from ymm0 to dest
add eax, 32
add esi, 32
loop copy32bytes ; decrement ecx and loop while ecx > 0
compare_remainder:
cmp ebx, 0 ; if there is no remainder then finished
je exit
mov ecx, ebx ; move the remainder to ecx
shr ecx, 1 ; divide by 2
cmp ecx, 0 ; make sure there are at least 2 bytes to copy
je check_odd_byte
copy2bytes:
mov dx, WORD PTR [esi] ; copy 2 bytes from src to dx
mov WORD PTR [eax], dx ; copy 2 byte from dx to dest
add eax, 2
add esi, 2
loop copy2bytes ; decrement ecx and loop while ecx > 0
check_odd_byte:
test ebx, 1
jz exit
mov dl, BYTE PTR [esi] ; copy 1 byte from src to dl
mov BYTE PTR [eax], dl ; copy 1 byte from dl to dest
inc eax
exit:
vzeroupper ; clear the upper YMM state to avoid SSE/AVX transition penalties in the caller
pop esi
pop ebx
ret 0
_memcopy256 ENDP
_TEXT ENDS
END
The driver code:
#include <cstring> // for strlen
#include <iostream>
using namespace std;
extern "C" char* memcopy256(char* dest, const char* src, unsigned int length);
int main()
{
char dest[1024];
const char* src = "abcdefghijklmnopqrstuvwxyz? ";
int len = (int)strlen(src);
char* end;
end = memcopy256(dest, src, len);
end = memcopy256(end, dest, len * 1);
end = memcopy256(end, dest, len * 2);
end = memcopy256(end, dest, len * 4);
memcopy256(end, "", 1); // copy the null terminator so dest is a valid C string
cout << dest << endl;
return 0;
}
Multiply Large Array with 256-bit AVX Registers
Math operations performed with the 32-byte registers are very fast. This is useful for processing images, for example, because bitmap data are stored in a large array.
This code multiplies every element of a large array by one multiplier using the 32-byte (256-bit) registers:
TITLE 'extern "C" void avx_multiply_array(float product[], float multiplicand[], float multiplier, int length);'
.686P
.xmm
.model FLAT
PUBLIC _avx_multiply_array
_TEXT SEGMENT
_avx_multiply_array PROC NEAR
push ebx
push edi
push esi
mov eax, DWORD PTR [esp+(4+4)*4] ; move length (number of floats) to eax
mov edx, eax ; copy length to edx
shr edx, 3 ; divide by 8 - eight floats fit in one YMM register
shl edx, 3 ; multiply by 8 - number of floats handled by the vector loop
mov ebx, eax ; copy length to ebx
sub ebx, edx ; find the remainder (0 to 7 floats) and store in ebx
shr edx, 3 ; divide by 8 - edx now holds the number of 8-float blocks
mov edi, DWORD PTR [esp+(2+4)*4] ; move multiplicand array to edi
mov esi, DWORD PTR [esp+(1+4)*4] ; move product array (destination array) to esi
cmp edx, 0 ; make sure there is at least one 8-float block to multiply
je compare_remainder
mov ecx, edx ; move the block count to ecx
vbroadcastss ymm1, DWORD PTR [esp+(3+4)*4] ; broadcast multiplier to all 8 lanes of ymm1
multiply32bytes:
vmulps ymm0, ymm1, YMMWORD PTR [edi] ; multiply 8 packed singles by the multiplier
vmovups YMMWORD PTR [esi], ymm0 ; move products to product array
add edi, 32
add esi, 32
loop multiply32bytes ; decrement ecx and loop while ecx > 0
compare_remainder:
cmp ebx, 0 ; if there is no remainder then finished
je exit
mov ecx, ebx ; move the remainder to ecx
vmovss xmm1, DWORD PTR [esp+(3+4)*4] ; move multiplier to xmm1 (VEX form avoids SSE/AVX transition penalties)
multi4bytes:
vmulss xmm0, xmm1, DWORD PTR [edi] ; multiply one 4-byte multiplicand by the multiplier
vmovss DWORD PTR [esi], xmm0 ; move the product to the product array
add edi, 4
add esi, 4
loop multi4bytes ; decrement ecx and loop while ecx > 0
exit:
vzeroupper ; clear the upper YMM state to avoid SSE/AVX transition penalties in the caller
pop esi
pop edi
pop ebx
ret 0
_avx_multiply_array ENDP
_TEXT ENDS
END
The driver code:
#include <cstdlib> // for srand and rand
#include <ctime> // for clock
#include <iostream>
using namespace std;
extern "C" void avx_multiply_array(float product[], float multiplicand[], float multiplier, int length);
const int length = 10000000; // very large array
float product[length], multiplicand[length], multiplier = 1.5f;
int main()
{
srand(clock());
for (int i = 0; i < length; i++)
multiplicand[i] = rand() * (rand() % 2 ? -0.33333f : 0.33333f);
avx_multiply_array(product, multiplicand, multiplier, length);
return 0;
}
AVX2
AVX2 adds vector shifts with per-element shift counts, DWORD-granularity and QWORD-granularity "any to any" permutes, gather instructions that load vector elements from non-contiguous memory locations, and 256-bit versions of most vector integer SSE and AVX instructions. AVX2 first shipped in 2013 processors, so support was somewhat limited at the time this section of the tutorial was written.
These AVX2 instructions are completely new:
Instruction | Description |
---|---|
VBROADCASTSS VBROADCASTSD | Copy a 32-bit or 64-bit register operand to all elements of an XMM or YMM vector register. These are register-source versions of the same instructions in AVX. There is no 128-bit version, but the same effect can be achieved using VINSERTF128 . |
VPBROADCASTB VPBROADCASTW VPBROADCASTD VPBROADCASTQ | Copy an 8, 16, 32 or 64-bit integer register or memory operand to all elements of an XMM or YMM vector register. |
VBROADCASTI128 | Copy a 128-bit memory operand to all elements of a YMM vector register. |
VINSERTI128 | Replaces either the lower half or the upper half of a 256-bit YMM register with the value of a 128-bit source operand. The other half of the destination is unchanged. |
VEXTRACTI128 | Extracts either the lower half or the upper half of a 256-bit YMM register and copies the value to a 128-bit destination operand. |
VGATHERDPD VGATHERQPD VGATHERDPS VGATHERQPS | Gathers single or double precision floating point values using either 32 or 64-bit indices and scale. |
VPGATHERDD VPGATHERDQ VPGATHERQD VPGATHERQQ | Gathers 32 or 64-bit integer values using either 32 or 64-bit indices and scale. |
VPMASKMOVD VPMASKMOVQ | Conditionally reads any number of elements from a SIMD vector memory operand into a destination register, leaving the remaining vector elements unread and setting the corresponding elements in the destination register to zero. Alternatively, conditionally writes any number of elements from a SIMD vector register operand to a vector memory operand, leaving the remaining elements of the memory operand unchanged. |
VPERMPS VPERMD | Shuffle the eight 32-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with a register or memory operand as selector. |
VPERMPD VPERMQ | Shuffle the four 64-bit vector elements of one 256-bit source operand into a 256-bit destination operand, with a register or memory operand as selector. |
VPERM2I128 | Shuffle (two of) the four 128-bit vector elements of two 256-bit source operands into a 256-bit destination operand, with an immediate constant as selector. |
VPBLENDD | Doubleword immediate version of the PBLEND instructions from SSE4. |
VPSLLVD VPSLLVQ | Shift left logical. Allows variable shifts where each element is shifted according to the packed input. |
VPSRLVD VPSRLVQ | Shift right logical. Allows variable shifts where each element is shifted according to the packed input. |
VPSRAVD | Shift right arithmetically. Allows variable shifts where each element is shifted according to the packed input. |