/* ******************************************************************************
* Copyright (c) 2010-2022 Google, Inc. All rights reserved.
* ******************************************************************************/
/*
* Redistribution and use in source and binary forms, with or without
* modification, are permitted provided that the following conditions are met:
*
* * Redistributions of source code must retain the above copyright notice,
* this list of conditions and the following disclaimer.
*
* * Redistributions in binary form must reproduce the above copyright notice,
* this list of conditions and the following disclaimer in the documentation
* and/or other materials provided with the distribution.
*
* * Neither the name of Google, Inc. nor the names of its contributors may be
* used to endorse or promote products derived from this software without
* specific prior written permission.
*
* THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
* AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
* IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
* ARE DISCLAIMED. IN NO EVENT SHALL VMWARE, INC. OR CONTRIBUTORS BE LIABLE
* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
* DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
* SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
* CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
* LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
* OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
* DAMAGE.
*/
/**
****************************************************************************
\page page_scatter_gather_emulation Emulating x86 Scatter and Gather Instructions
\tableofcontents
# Background
The x86 gather and scatter instructions were introduced in the AVX2 and AVX512
instruction set extensions. They allow loading or storing a subset of elements in a
vector from/to multiple non-contiguous addresses.
AVX2 has only gather instructions, no scatter instructions, whereas AVX512 has
both. AVX2 is limited to 256-bit length vectors, whereas AVX512 has 512-bit
support. Both support masking of individual memory accesses, using either a special
mask register in AVX512 or another vector in AVX2.
Examples of these instructions are (in DR’s IR):
```
vpgatherdd (%rax,%ymm11,4)[4byte] %ymm13 -> %ymm12 %ymm13
```
Above is an AVX2 gather instruction that reads 32-bit doublewords into the 256-bit
`ymm12` vector from addresses generated by adding the base address in `rax` to the
corresponding index element in `ymm11` scaled by 4, conditionally based on the
mask elements in `ymm13`. Elements may be gathered in any order. When an element is
read, its mask is cleared. If a load faults, all elements to its right (closer to
the LSB) will be complete.
```
vpscatterdd {%k1} %xmm10 -> (%rax,%xmm11,4)[4byte] %k1
```
Above is an AVX512 scatter instruction that writes 32-bit doublewords from the
128-bit `xmm10` vector to addresses generated by adding the base address in `rax` to
the corresponding index element in `xmm11` scaled by 4, conditionally based on the
mask register `k1`. Elements may be scattered in any order. When an element is
stored, its mask bit is cleared. If a store faults, all elements to its right
(closer to the LSB) will be complete.
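The semantics of the two example instructions can be sketched as scalar loops. The following is a minimal model, not DR code: the function names `model_gather_dd` and `model_scatter_dd` are hypothetical, memory is modeled as an array of 32-bit elements (so the scale of 4 is implicit in the array indexing), and faults are not modeled. The hardware may access elements in any order; the model simply walks from the LSB upward.

```c
#include <assert.h>
#include <stdint.h>

// Scalar model of an AVX2 vpgatherdd. "mem" stands in for the memory at the
// base address, so mem[index[i]] corresponds to base + index[i] * 4.
static void
model_gather_dd(const int32_t *mem, const int32_t *index, int32_t *dst,
                uint32_t *mask, int num_elems)
{
    assert(num_elems > 0);
    for (int i = 0; i < num_elems; i++) {
        // AVX2 uses the most significant bit of each mask element.
        if ((mask[i] >> 31) & 1)
            dst[i] = mem[index[i]];
        // On completion without a fault, the whole mask vector is zero.
        mask[i] = 0;
    }
}

// Scalar model of an AVX512 vpscatterdd. Bit i of the mask register guards
// the store of element i.
static void
model_scatter_dd(int32_t *mem, const int32_t *index, const int32_t *src,
                 uint16_t *kmask, int num_elems)
{
    assert(num_elems > 0 && num_elems <= 16);
    for (int i = 0; i < num_elems; i++) {
        uint16_t bit = (uint16_t)(1u << i);
        if (*kmask & bit) {
            mem[index[i]] = src[i];
            // The mask bit is cleared once the element has been stored.
            *kmask = (uint16_t)(*kmask & ~bit);
        }
    }
    // On completion without a fault, the whole mask register is zero.
    *kmask = 0;
}
```

Elements whose mask is not set leave the destination (for a gather) or memory (for a scatter) untouched.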
# Problem Statement
Scatter and gather instructions pose a challenge to DynamoRIO clients that observe
memory addresses, such as address tracing tools (e.g., drcachesim, which collects
memory address and control flow traces) and taint tracking tools. They are complex
to handle because:
- A single gather or scatter instruction loads from or stores to multiple
addresses
- Accessed addresses may be non-contiguous
- Each access is conditional based on some mask
DynamoRIO clients only see the scatter or gather instruction and need to do more
work to extract all accessed addresses. This is unlike regular scalar loads or
stores, where the accessed address is readily available. The goal of this work is
to make it easier for DR clients to observe these addresses. We achieve this by
expanding the scatter and gather instructions into a functionally equivalent
sequence of scalar stores and loads. This way, DR clients will see regular store
and load instructions, which they can instrument as usual. This is similar to what
DR does for repeat string operations (like `rep movs` and `repnz cmps`): it converts
them into a loop so that each memory access is made by a separate dynamic
instruction. This method has worked well for instructions that implicitly issue
multiple memory accesses.
Original issue: [DynamoRIO/dynamorio#2985](https://github.com/DynamoRIO/dynamorio/issues/2985).
# Design
Implementing this expansion required adding new support to various DynamoRIO
components, including drreg, drx, drmgr, and core DR. Multiple contributors worked on
designing and implementing the required changes.
We expect the same approach to work for other platforms too, such as for the
AArch64 SVE scatter/gather instructions.
## Scatter/gather Instruction Expansion
Owner: [Hendrik Greving](https://github.com/hgreving2304)
As described above, we can simplify work for DR clients by replacing each scatter
and gather instruction with a functionally equivalent sequence of scalar stores and
loads. The expanded sequence is the unrolled version of the following loop:
```
num_accesses = vector_size / element_size
for i = 0, 1, ..., (num_accesses-1), do
    extract mask for the ith access from mask reg or mask vector
    if mask is set, then
        extract ith element of index vector
        compute address = base + (ith index element * scale) + displacement
        if instr_is_gather, then
            load data from address into a scalar reg
            insert scalar data into destination vector
        else // instr_is_scatter
            extract scalar data from source vector to scalar reg
            store data from scalar reg to address
        done
        clear ith mask in mask reg or mask vector
    done
done
```
Due to the x86 ISA, the extraction/insertion of the scalar value from/to the vector
may involve multiple steps, e.g. to extract a 32-bit scalar value from a 512-bit
`zmm` reg, we first need to extract a 128-bit `xmm` from it.
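The two-step extraction can be illustrated with plain byte arrays. This is a sketch under the assumption of 32-bit elements and 128-bit lanes; `extract_dword` is a hypothetical helper, not a drx function. The `lane` and `sub` values play the role of the immediates passed to the lane-extract instruction (such as `vextracti128` or `vextracti32x4`) and to `vpextrd`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LANE_BYTES = 16, ELEM_BYTES = 4, ELEMS_PER_LANE = LANE_BYTES / ELEM_BYTES };

// Returns element i of a vector of 32-bit elements stored as raw bytes,
// using a 128-bit lane extract followed by a dword extract.
static uint32_t
extract_dword(const uint8_t *vec, int vec_bytes, int i)
{
    assert(i >= 0 && i < vec_bytes / ELEM_BYTES);
    int lane = i / ELEMS_PER_LANE; // Immediate for the lane extract.
    int sub = i % ELEMS_PER_LANE;  // Immediate for vpextrd.
    uint8_t xmm[LANE_BYTES];
    // Step 1: copy the 128-bit lane that holds the element.
    memcpy(xmm, vec + lane * LANE_BYTES, LANE_BYTES);
    uint32_t elem;
    // Step 2: copy the 32-bit element out of that lane.
    memcpy(&elem, xmm + sub * ELEM_BYTES, ELEM_BYTES);
    return elem;
}
```

Inserting an element into a vector follows the same index arithmetic in reverse: extract the lane, insert the element, and write the lane back.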
drmgr in DR provides multiple phases of instrumentation. Our expansion is done in
the first phase, known as app2app. As the name suggests, this phase is intended for
transforming app instructions into equivalent instructions. For simplicity, we also
separate out the scatter and gather instructions from their basic block and create a
separate fragment with only the expanded sequence. The logic for expanding scatter
and gather instructions is implemented in the drx extension library as
[`drx_expand_scatter_gather`](https://github.com/DynamoRIO/dynamorio/blob/eb5d5af8e3444912c9f3f70e5ebf7969252ee4d6/ext/drx/drx.h#L538), and can be used by
any client that needs it, including drcachesim. This support was added by
[commit](https://github.com/DynamoRIO/dynamorio/commit/4359ef134e47942004c09db04b54593579763186).
As an example, the following are the expansions of some instructions.
Expansion for
```
vpgatherdd 0x00402039(,%xmm11,4)[4byte] %xmm13 -> %xmm12 %xmm13
```
```
+0 m4 @0x00007fdb2ac5e6a0 65 48 a3 e0 00 00 00 mov %rax -> %gs:0x000000e0[8byte]
00 00 00 00
+11 m4 @0x00007fdb2ac5eac0 9f lahf -> %ah
+12 m4 @0x00007fdb2ac5ea40 0f 90 c0 seto -> %al // Spill aflags using drreg.
+15 m4 @0x00007fdb2ac5efa8 65 48 89 0c 25 e8 00 mov %rcx -> %gs:0x000000e8[8byte] // Spill the scratch GPR using drreg.
00 00
+24 m4 @0x00007fdb2ac0ec70 65 48 8b 0c 25 20 00 mov %gs:0x20[8byte] -> %rcx
00 00
+33 m4 @0x00007fdb2ac0ebf0 48 8b 89 f0 0a 00 00 mov 0x00000af0(%rcx)[8byte] -> %rcx
+40 m4 @0x00007fdb2ac5f0a8 48 8b 09 mov (%rcx)[8byte] -> %rcx
+43 m4 @0x00007fdb2ac5e7f0 48 8b 49 10 mov 0x10(%rcx)[8byte] -> %rcx
+47 m4 @0x00007fdb2ac0f488 48 8b 49 08 mov 0x08(%rcx)[8byte] -> %rcx
+51 m4 @0x00007fdb2ac0ed38 62 f1 7c 48 29 01 vmovaps {%k0} %zmm0 -> (%rcx)[64byte] // Manually spill the scratch zmm reg.
+57 m4 @0x00007fdb2ac0ee00 <label>
+57 L4 @0x00007fdb2ac0efb0 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0 // Expansion for the first vector element starts here.
+63 L4 @0x00007fdb2ac0bee8 c4 e3 79 16 c1 00 vpextrd %xmm0 $0x00 -> %ecx // Extract mask for the first element.
+69 L4 @0x00007fdb2ac0c750 c1 e9 1f shr $0x0000001f %ecx -> %ecx
+72 L4 @0x00007fdb2ac0be68 81 e1 01 00 00 00 and $0x00000001 %ecx -> %ecx
+78 L4 @0x00007fdb2ac0bca0 0f 84 fa ff ff ff jz @0x00007fdb2ac0ee98[8byte] // Check whether to load the first element based on mask.
+84 L4 @0x00007fdb2ac0c6d0 c4 63 7d 39 d8 00 vextracti128 %ymm11 $0x00 -> %xmm0
+90 L4 @0x00007fdb2ac0be00 c4 e3 79 16 c1 00 vpextrd %xmm0 $0x00 -> %ecx // Extract index for the first load address.
+96 L4 @0x00007fdb2ac0b8b8 48 63 c9 movsxd %ecx -> %rcx
+99 L4 @0x00007fdb2ac0cc20 8b 0c 8d 39 20 40 00 mov 0x00402039(,%rcx,4)[4byte] -> %ecx // Load the first element into a scalar reg.
+106 L4 @0x00007fdb2ac5e5b8 c4 63 7d 39 e0 00 vextracti128 %ymm12 $0x00 -> %xmm0
+112 L4 @0x00007fdb2ac5e870 c4 e3 79 22 c1 00 vpinsrd %xmm0 %ecx $0x00 -> %xmm0
+118 L4 @0x00007fdb2ac5e9c0 c4 63 1d 38 e0 00 vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12 // Insert the first element into the destination vector reg
+124 L4 @0x00007fdb2ac5eb40 33 c9 xor %ecx %ecx -> %ecx
+126 L4 @0x00007fdb2ac5ebc0 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0
+132 L4 @0x00007fdb2ac5eda8 c4 e3 79 22 c1 00 vpinsrd %xmm0 %ecx $0x00 -> %xmm0
+138 L4 @0x00007fdb2ac5ec40 c4 63 15 38 e8 00 vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13 // Clear the mask bit for the first element.
+144 m4 @0x00007fdb2ac0ee98 <label>
+144 L4 @0x00007fdb2ac5ed40 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0 // Repeat for the second vector element.
+150 L4 @0x00007fdb2ac5ee28 c4 e3 79 16 c1 01 vpextrd %xmm0 $0x01 -> %ecx
+156 L4 @0x00007fdb2ac5eea8 c1 e9 1f shr $0x0000001f %ecx -> %ecx
+159 L4 @0x00007fdb2ac5e638 81 e1 01 00 00 00 and $0x00000001 %ecx -> %ecx
+165 L4 @0x00007fdb2ac5e788 0f 84 fa ff ff ff jz @0x00007fdb2ac5ecc0[8byte]
+171 L4 @0x00007fdb2ac5e720 c4 63 7d 39 d8 00 vextracti128 %ymm11 $0x00 -> %xmm0
+177 L4 @0x00007fdb2ac5e958 c4 e3 79 16 c1 01 vpextrd %xmm0 $0x01 -> %ecx
+183 L4 @0x00007fdb2ac5e8f0 48 63 c9 movsxd %ecx -> %rcx
+186 L4 @0x00007fdb2ac5f028 8b 0c 8d 39 20 40 00 mov 0x00402039(,%rcx,4)[4byte] -> %ecx
+193 L4 @0x00007fdb2ac5ef28 c4 63 7d 39 e0 00 vextracti128 %ymm12 $0x00 -> %xmm0
+199 L4 @0x00007fdb2ac0baa0 c4 e3 79 22 c1 01 vpinsrd %xmm0 %ecx $0x01 -> %xmm0
+205 L4 @0x00007fdb2ac5f128 c4 63 1d 38 e0 00 vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12
+211 L4 @0x00007fdb2ac0ea60 33 c9 xor %ecx %ecx -> %ecx
+213 L4 @0x00007fdb2ac0e9c8 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0
+219 L4 @0x00007fdb2ac0e930 c4 e3 79 22 c1 01 vpinsrd %xmm0 %ecx $0x01 -> %xmm0
+225 L4 @0x00007fdb2ac0cbb8 c4 63 15 38 e8 00 vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13
+231 m4 @0x00007fdb2ac5ecc0 <label>
+231 L4 @0x00007fdb2ac5e538 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0 // Repeat for the third vector element.
+237 L4 @0x00007fdb2ac5e4b8 c4 e3 79 16 c1 02 vpextrd %xmm0 $0x02 -> %ecx
+243 L4 @0x00007fdb2ac5df68 c1 e9 1f shr $0x0000001f %ecx -> %ecx
+246 L4 @0x00007fdb2ac5e438 81 e1 01 00 00 00 and $0x00000001 %ecx -> %ecx
+252 L4 @0x00007fdb2ac5e3b8 0f 84 fa ff ff ff jz @0x00007fdb2ac0f018[8byte]
+258 L4 @0x00007fdb2ac5e050 c4 63 7d 39 d8 00 vextracti128 %ymm11 $0x00 -> %xmm0
+264 L4 @0x00007fdb2ac5e338 c4 e3 79 16 c1 02 vpextrd %xmm0 $0x02 -> %ecx
+270 L4 @0x00007fdb2ac5e2b8 48 63 c9 movsxd %ecx -> %rcx
+273 L4 @0x00007fdb2ac5e238 8b 0c 8d 39 20 40 00 mov 0x00402039(,%rcx,4)[4byte] -> %ecx
+280 L4 @0x00007fdb2ac5e1b8 c4 63 7d 39 e0 00 vextracti128 %ymm12 $0x00 -> %xmm0
+286 L4 @0x00007fdb2ac5e138 c4 e3 79 22 c1 02 vpinsrd %xmm0 %ecx $0x02 -> %xmm0
+292 L4 @0x00007fdb2ac5e0b8 c4 63 1d 38 e0 00 vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12
+298 L4 @0x00007fdb2ac5dfd0 33 c9 xor %ecx %ecx -> %ecx
+300 L4 @0x00007fdb2ac5dee8 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0
+306 L4 @0x00007fdb2ac0f080 c4 e3 79 22 c1 02 vpinsrd %xmm0 %ecx $0x02 -> %xmm0
+312 L4 @0x00007fdb2ac0f0e8 c4 63 15 38 e8 00 vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13
+318 m4 @0x00007fdb2ac0f018 <label>
+318 L4 @0x00007fdb2ac0f220 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0 // Repeat for the fourth vector element.
+324 L4 @0x00007fdb2ac0f2a0 c4 e3 79 16 c1 03 vpextrd %xmm0 $0x03 -> %ecx
+330 L4 @0x00007fdb2ac0f150 c1 e9 1f shr $0x0000001f %ecx -> %ecx
+333 L4 @0x00007fdb2ac0f388 81 e1 01 00 00 00 and $0x00000001 %ecx -> %ecx
+339 L4 @0x00007fdb2ac0f408 0f 84 fa ff ff ff jz @0x00007fdb2ac0f1b8[8byte]
+345 L4 @0x00007fdb2ac0ba20 c4 63 7d 39 d8 00 vextracti128 %ymm11 $0x00 -> %xmm0
+351 L4 @0x00007fdb2ac0f320 c4 e3 79 16 c1 03 vpextrd %xmm0 $0x03 -> %ecx
+357 L4 @0x00007fdb2ac0f508 48 63 c9 movsxd %ecx -> %rcx
+360 L4 @0x00007fdb2ac0f588 8b 0c 8d 39 20 40 00 mov 0x00402039(,%rcx,4)[4byte] -> %ecx
+367 L4 @0x00007fdb2ac0ef30 c4 63 7d 39 e0 00 vextracti128 %ymm12 $0x00 -> %xmm0
+373 L4 @0x00007fdb2ac0c050 c4 e3 79 22 c1 03 vpinsrd %xmm0 %ecx $0x03 -> %xmm0
+379 L4 @0x00007fdb2ac0c1c8 c4 63 1d 38 e0 00 vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12
+385 L4 @0x00007fdb2ac0c3c8 33 c9 xor %ecx %ecx -> %ecx
+387 L4 @0x00007fdb2ac0bfd0 c4 63 7d 39 e8 00 vextracti128 %ymm13 $0x00 -> %xmm0
+393 L4 @0x00007fdb2ac5de68 c4 e3 79 22 c1 03 vpinsrd %xmm0 %ecx $0x03 -> %xmm0
+399 L4 @0x00007fdb2ac5dde8 c4 63 15 38 e8 00 vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13
+405 m4 @0x00007fdb2ac0f1b8 <label>
+405 L4 @0x00007fdb2ac5d898 c4 41 11 ef ed vpxor %xmm13 %xmm13 -> %xmm13 // Zero the mask reg.
+410 m4 @0x00007fdb2ac5dd68 65 48 8b 0c 25 20 00 mov %gs:0x20[8byte] -> %rcx
00 00
+419 m4 @0x00007fdb2ac5dce8 48 8b 89 f0 0a 00 00 mov 0x00000af0(%rcx)[8byte] -> %rcx
+426 m4 @0x00007fdb2ac5d980 48 8b 09 mov (%rcx)[8byte] -> %rcx
+429 m4 @0x00007fdb2ac5dc68 48 8b 49 10 mov 0x10(%rcx)[8byte] -> %rcx
+433 m4 @0x00007fdb2ac5dbe8 48 8b 49 08 mov 0x08(%rcx)[8byte] -> %rcx
+437 m4 @0x00007fdb2ac5db68 62 f1 7c 48 28 01 vmovaps {%k0} (%rcx)[64byte] -> %zmm0 // Manually restore the scratch zmm reg.
+443 m4 @0x00007fdb2ac5dae8 65 48 8b 0c 25 e8 00 mov %gs:0x000000e8[8byte] -> %rcx // Restore the scratch GPR using drreg.
00 00
+452 m4 @0x00007fdb2ac5da68 3c 81 cmp %al $0x81
+454 m4 @0x00007fdb2ac5d9e8 9e sahf %ah
+455 m4 @0x00007fdb2ac5d900 65 48 a1 e0 00 00 00 mov %gs:0x000000e0[8byte] -> %rax // Restore aflags using drreg.
00 00 00 00
+466 m4 @0x00007fdb2ac5d818 <label>
```
Expansion for
```
vpscatterdd {%k1} %xmm10 -> 0x00402039(,%xmm11,4)[4byte] %k1
```
```
+0 m4 @0x00007fdb2ac106e0 65 48 a3 e0 00 00 00 mov %rax -> %gs:0x000000e0[8byte]
00 00 00 00
+11 m4 @0x00007fdb2ac10760 9f lahf -> %ah
+12 m4 @0x00007fdb2ac107e0 0f 90 c0 seto -> %al // Spill aflags using drreg.
+15 m4 @0x00007fdb2ac100a8 65 48 89 0c 25 e8 00 mov %rcx -> %gs:0x000000e8[8byte] // Spill the first scratch GPR using drreg.
00 00
+24 m4 @0x00007fdb2ac10110 65 48 89 14 25 f0 00 mov %rdx -> %gs:0x000000f0[8byte] // Spill the second scratch GPR using drreg.
00 00
+33 m4 @0x00007fdb2ac0fed8 65 48 8b 0c 25 20 00 mov %gs:0x20[8byte] -> %rcx
00 00
+42 m4 @0x00007fdb2ac0ff40 48 8b 89 f0 0a 00 00 mov 0x00000af0(%rcx)[8byte] -> %rcx
+49 m4 @0x00007fdb2ac0f608 48 8b 09 mov (%rcx)[8byte] -> %rcx
+52 m4 @0x00007fdb2ac10860 48 8b 49 10 mov 0x10(%rcx)[8byte] -> %rcx
+56 m4 @0x00007fdb2ac108e0 48 8b 49 08 mov 0x08(%rcx)[8byte] -> %rcx
+60 m4 @0x00007fdb2ac0ca50 62 f1 7c 48 29 01 vmovaps {%k0} %zmm0 -> (%rcx)[64byte] // Manually spill the scratch zmm reg.
+66 m4 @0x00007fdb2ac10660 <label>
+66 L4 @0x00007fdb2ac10560 c5 f8 93 c9 kmovw %k1 -> %ecx // Expansion for the first vector element starts here.
+70 L4 @0x00007fdb2ac104f8 f7 c1 01 00 00 00 test %ecx $0x00000001
+76 L4 @0x00007fdb2ac10478 0f 84 fa ff ff ff jz @0x00007fdb2ac105e0[8byte] // Check whether to store the first element based on mask.
+82 L4 @0x00007fdb2ac103f8 62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
+89 L4 @0x00007fdb2ac10378 c4 e3 79 16 c1 00 vpextrd %xmm0 $0x00 -> %ecx // Extract index for the first store address.
+95 L4 @0x00007fdb2ac102f8 62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
+102 L4 @0x00007fdb2ac10278 c4 e3 79 16 c2 00 vpextrd %xmm0 $0x00 -> %edx // Extract the element for the first store.
+108 L4 @0x00007fdb2ac101f8 48 63 c9 movsxd %ecx -> %rcx
+111 L4 @0x00007fdb2ac10178 89 14 8d 39 20 40 00 mov %edx -> 0x00402039(,%rcx,4)[4byte] // Store the first element.
+118 L4 @0x00007fdb2ac10028 b9 01 00 00 00 mov $0x00000001 -> %ecx
+123 m4 @0x00007fdb2ac0ffa8 65 48 89 1c 25 f8 00 mov %rbx -> %gs:0x000000f8[8byte] // Spill the third scratch GPR using drreg.
00 00
+132 m4 @0x00007fdb2ac0fe58 c5 f8 93 d8 kmovw %k0 -> %ebx // Manually spill the scratch mask reg k0 to the scratch GPR.
+136 L4 @0x00007fdb2ac0fdd8 c5 f8 92 c1 kmovw %ecx -> %k0
+140 L4 @0x00007fdb2ac0fbf0 c5 fc 42 c9 kandnw %k0 %k1 -> %k1 // Clear bit for the first element in the mask reg.
+144 m4 @0x00007fdb2ac0fd58 c5 f8 92 c3 kmovw %ebx -> %k0 // Manually restore the scratch mask reg from the scratch GPR.
+148 m4 @0x00007fdb2ac0fcd8 65 48 8b 1c 25 f8 00 mov %gs:0x000000f8[8byte] -> %rbx // Restore the third scratch GPR using drreg.
00 00
+157 m4 @0x00007fdb2ac105e0 <label>
+157 L4 @0x00007fdb2ac0fb70 c5 f8 93 c9 kmovw %k1 -> %ecx // Repeat for the second vector element.
+161 L4 @0x00007fdb2ac0faf0 f7 c1 02 00 00 00 test %ecx $0x00000002
+167 L4 @0x00007fdb2ac0fa70 0f 84 fa ff ff ff jz @0x00007fdb2ac0fc58[8byte]
+173 L4 @0x00007fdb2ac0f9f0 62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
+180 L4 @0x00007fdb2ac0f970 c4 e3 79 16 c1 01 vpextrd %xmm0 $0x01 -> %ecx
+186 L4 @0x00007fdb2ac0f8f0 62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
+193 L4 @0x00007fdb2ac0f870 c4 e3 79 16 c2 01 vpextrd %xmm0 $0x01 -> %edx
+199 L4 @0x00007fdb2ac0f7f0 48 63 c9 movsxd %ecx -> %rcx
+202 L4 @0x00007fdb2ac0f770 89 14 8d 39 20 40 00 mov %edx -> 0x00402039(,%rcx,4)[4byte]
+209 L4 @0x00007fdb2ac0f6f0 b9 02 00 00 00 mov $0x00000002 -> %ecx
+214 m4 @0x00007fdb2ac0f670 65 48 89 1c 25 f8 00 mov %rbx -> %gs:0x000000f8[8byte]
00 00
+223 m4 @0x00007fdb2ac0c448 c5 f8 93 d8 kmovw %k0 -> %ebx
+227 L4 @0x00007fdb2ac0c2c8 c5 f8 92 c1 kmovw %ecx -> %k0
+231 L4 @0x00007fdb2ac0c9e8 c5 fc 42 c9 kandnw %k0 %k1 -> %k1
+235 m4 @0x00007fdb2ac0c980 c5 f8 92 c3 kmovw %ebx -> %k0
+239 m4 @0x00007fdb2ac0c818 65 48 8b 1c 25 f8 00 mov %gs:0x000000f8[8byte] -> %rbx
00 00
+248 m4 @0x00007fdb2ac0fc58 <label>
+248 L4 @0x00007fdb2ac0c518 c5 f8 93 c9 kmovw %k1 -> %ecx // Repeat for the third vector element.
+252 L4 @0x00007fdb2ac0c668 f7 c1 04 00 00 00 test %ecx $0x00000004
+258 L4 @0x00007fdb2ac0cab8 0f 84 fa ff ff ff jz @0x00007fdb2ac0ba20[8byte]
+264 L4 @0x00007fdb2ac0cb20 62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
+271 L4 @0x00007fdb2ac0c248 c4 e3 79 16 c1 02 vpextrd %xmm0 $0x02 -> %ecx
+277 L4 @0x00007fdb2ac0bd98 62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
+284 L4 @0x00007fdb2ac0c0d0 c4 e3 79 16 c2 02 vpextrd %xmm0 $0x02 -> %edx
+290 L4 @0x00007fdb2ac0c900 48 63 c9 movsxd %ecx -> %rcx
+293 L4 @0x00007fdb2ac0bfd0 89 14 8d 39 20 40 00 mov %edx -> 0x00402039(,%rcx,4)[4byte]
+300 L4 @0x00007fdb2ac0c3c8 b9 04 00 00 00 mov $0x00000004 -> %ecx
+305 m4 @0x00007fdb2ac0c1c8 65 48 89 1c 25 f8 00 mov %rbx -> %gs:0x000000f8[8byte]
00 00
+314 m4 @0x00007fdb2ac0c050 c5 f8 93 d8 kmovw %k0 -> %ebx
+318 L4 @0x00007fdb2ac0ef30 c5 f8 92 c1 kmovw %ecx -> %k0
+322 L4 @0x00007fdb2ac0f588 c5 fc 42 c9 kandnw %k0 %k1 -> %k1
+326 m4 @0x00007fdb2ac0f508 c5 f8 92 c3 kmovw %ebx -> %k0
+330 m4 @0x00007fdb2ac0f320 65 48 8b 1c 25 f8 00 mov %gs:0x000000f8[8byte] -> %rbx
00 00
+339 m4 @0x00007fdb2ac0ba20 <label>
+339 L4 @0x00007fdb2ac0f408 c5 f8 93 c9 kmovw %k1 -> %ecx // Repeat for the fourth vector element.
+343 L4 @0x00007fdb2ac0f388 f7 c1 08 00 00 00 test %ecx $0x00000008
+349 L4 @0x00007fdb2ac0f150 0f 84 fa ff ff ff jz @0x00007fdb2ac0f488[8byte]
+355 L4 @0x00007fdb2ac0f2a0 62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
+362 L4 @0x00007fdb2ac0f220 c4 e3 79 16 c1 03 vpextrd %xmm0 $0x03 -> %ecx
+368 L4 @0x00007fdb2ac0f1b8 62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
+375 L4 @0x00007fdb2ac0f0e8 c4 e3 79 16 c2 03 vpextrd %xmm0 $0x03 -> %edx
+381 L4 @0x00007fdb2ac0f080 48 63 c9 movsxd %ecx -> %rcx
+384 L4 @0x00007fdb2ac0f018 89 14 8d 39 20 40 00 mov %edx -> 0x00402039(,%rcx,4)[4byte]
+391 L4 @0x00007fdb2ac0cbb8 b9 08 00 00 00 mov $0x00000008 -> %ecx
+396 m4 @0x00007fdb2ac0e930 65 48 89 1c 25 f8 00 mov %rbx -> %gs:0x000000f8[8byte]
00 00
+405 m4 @0x00007fdb2ac0e9c8 c5 f8 93 d8 kmovw %k0 -> %ebx
+409 L4 @0x00007fdb2ac0ea60 c5 f8 92 c1 kmovw %ecx -> %k0
+413 L4 @0x00007fdb2ac0eb28 c5 fc 42 c9 kandnw %k0 %k1 -> %k1
+417 m4 @0x00007fdb2ac0ebf0 c5 f8 92 c3 kmovw %ebx -> %k0
+421 m4 @0x00007fdb2ac0ec70 65 48 8b 1c 25 f8 00 mov %gs:0x000000f8[8byte] -> %rbx
00 00
+430 m4 @0x00007fdb2ac0f488 <label>
+430 L4 @0x00007fdb2ac0ed38 c4 e1 f4 47 c9 kxorq %k1 %k1 -> %k1 // Clear the mask reg.
+435 m4 @0x00007fdb2ac0ee00 65 48 8b 0c 25 20 00 mov %gs:0x20[8byte] -> %rcx
00 00
+444 m4 @0x00007fdb2ac0ee98 48 8b 89 f0 0a 00 00 mov 0x00000af0(%rcx)[8byte] -> %rcx
+451 m4 @0x00007fdb2ac0efb0 48 8b 09 mov (%rcx)[8byte] -> %rcx
+454 m4 @0x00007fdb2ac0bee8 48 8b 49 10 mov 0x10(%rcx)[8byte] -> %rcx
+458 m4 @0x00007fdb2ac0c750 48 8b 49 08 mov 0x08(%rcx)[8byte] -> %rcx
+462 m4 @0x00007fdb2ac0be68 62 f1 7c 48 28 01 vmovaps {%k0} (%rcx)[64byte] -> %zmm0 // Manually restore the scratch zmm reg.
+468 m4 @0x00007fdb2ac0bca0 65 48 8b 0c 25 e8 00 mov %gs:0x000000e8[8byte] -> %rcx // Restore the first scratch GPR using drreg.
00 00
+477 m4 @0x00007fdb2ac0c6d0 65 48 8b 14 25 f0 00 mov %gs:0x000000f0[8byte] -> %rdx // Restore the second scratch GPR using drreg.
00 00
+486 m4 @0x00007fdb2ac0be00 3c 81 cmp %al $0x81
+488 m4 @0x00007fdb2ac0b8b8 9e sahf %ah
+489 m4 @0x00007fdb2ac0cc20 65 48 a1 e0 00 00 00 mov %gs:0x000000e0[8byte] -> %rax // Restore aflags using drreg.
00 00 00 00
+500 m4 @0x00007fdb2ac0baa0 <label>
```
As shown by the above expanded scatter and gather sequences, we require scratch
registers for the expansion. The GPR scratch registers are obtained using
drreg, whereas the scratch `zmm` register and the scratch mask register are
obtained by manually spilling them.
We need to make sure that we restore the application state correctly when a state
restoration event occurs, which can be a fault in one of the scalar loads or stores
in the expanded sequence, a fault in instrumentation added by some other DR client,
or some async event like DR detach. While the spilled registers obtained from drreg are
restored by the drreg state restoration logic, drx still needs to restore the
scratch mask register that is spilled manually to a GPR, and the scratch `zmm` register
that is spilled manually to a drx spill slot. We also need to ensure that the mask bit
for the previous access is cleared if the state restore event happened after the load
or store completed but before we could reflect it in the mask. When a state restore
event occurs, we walk the expanded sequence using a state machine until we reach the
faulting pc, keeping track of the state that needs to be restored
([commit](https://github.com/DynamoRIO/dynamorio/commit/de511a72197e9b04a9732f33622fe7bc7fd12623),
[commit](https://github.com/DynamoRIO/dynamorio/commit/e54913b4845794b96e8682ba6a214d54e10741a0)).
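The mask-bit part of this walk can be sketched as follows. This is a heavily simplified model and not the drx state machine: the instruction kinds, `seq_instr_t`, and `find_pending_mask_clear` are all hypothetical, each element's mask update is collapsed into a single instruction, and the scratch `zmm` and mask register handling is left out. The sketch assumes that every instruction before the faulting index has completed.

```c
#include <assert.h>

typedef enum {
    OP_MASK_CHECK,    // Tests the mask for one element.
    OP_SCALAR_ACCESS, // The scalar load or store for one element.
    OP_MASK_UPDATE,   // Final instruction that clears the element's mask.
    OP_OTHER
} op_kind_t;

typedef struct {
    op_kind_t kind;
    int elem; // Vector element this instruction belongs to.
} seq_instr_t;

// Walks the expanded sequence up to (not including) the faulting instruction.
// Returns the element whose access completed without its mask being cleared,
// or -1 if no mask fix-up is needed.
static int
find_pending_mask_clear(const seq_instr_t *seq, int len, int fault_idx)
{
    assert(fault_idx >= 0 && fault_idx <= len);
    int pending = -1;
    for (int i = 0; i < fault_idx; i++) {
        if (seq[i].kind == OP_SCALAR_ACCESS)
            pending = seq[i].elem; // Access done, mask not yet updated.
        else if (seq[i].kind == OP_MASK_UPDATE)
            pending = -1; // Mask now reflects the completed access.
    }
    return pending;
}
```

If the returned element is not -1, the restoration code would clear that element's mask in the restored machine context, so that the application observes a mask consistent with the memory accesses already performed.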
As pointed out above, this expansion is done in the app2app phase. DR clients
may use drreg to get scratch registers for their instrumentation in later phases
(like insertion or instru2instru). While drreg already supported some basic usage
outside of the insertion phase, it did not mitigate bad interactions caused by such
multi-phase use. The following section describes the changes made in drreg to
support multi-phase use.
## Drreg Support For Multi-phase Reservations
Owner: [Abhinav Sharma](https://github.com/abhinav92003)
Upstream issue: [DynamoRIO/dynamorio#3823](https://github.com/DynamoRIO/dynamorio/issues/3823)
Drreg is DynamoRIO’s register reservation framework. It allows users to reserve a
register to use as scratch. Internally, drreg automatically handles the following
tasks so that the user does not need to. Drreg:
- keeps all required bookkeeping, such as the mapping from spill slots to spilled
registers.
- restores spilled registers to their application value before they are read by an
application instruction, and re-spills them if they are written by an application
instruction.
- performs application state restoration on state restore events, such as an
application fault or DR detach.
While expanding a scatter or gather instruction in the app2app phase, we need a
scratch register to hold the scalar values and masks. In later phases (like the
insertion or the instru2instru phase), drcachesim and other DR clients may also use
drreg to get scratch registers for their instrumentation.
Drreg initially supported only insertion phase use, with some basic support in other
phases. Importantly, it did not attempt to avoid any bad interactions between the
multiple phases. To support multi-phase use of drreg, we needed to solve the following:
- avoid spill slot conflict across multiple phases: multi-phase use can potentially
lead to spill slot conflicts if the same slot is selected in multiple phases. This
may clobber the spilled application value and cause the application to crash or
otherwise fail.
- allow aflags spill to any slot: drreg hardcoded the aflags spill slot as the
zeroth slot to simplify some logic. To support the ability to spill aflags in
multiple phases, drreg should be able to use any spill slot for aflags.
- application state restore logic: on a state restore event, we should be able to
figure out which slot contains each spilled register's app value. This is
complicated by the fact that registers may be spilled by instrumentation added by
multiple phases, and the spill regions may overlap, which causes the spilled
application value to be moved between spill slots.
We explored the following ideas to avoid spill slot conflicts in drreg:
__Disjoint slot spaces or arenas__
We can ask drreg to create slot spaces or arenas at init time, which are assigned
disjoint spill slots. When reserving a register, the user passes in a "space/arena
Id" to instruct drreg to pick free slots only from that arena. This requires
keeping some global drreg state. This also requires the user to guess the best
configuration for assigning slots to the arenas, and passing the correct arena Id
before each reservation. It may artificially make some spill slots unavailable for
use, thereby reducing efficiency.
__Assign phase Id to slots__
Instead of creating slot spaces at init time with a best-guess assignment of slots,
we can assign a phase Id to slots when they are requested in that phase. We
then avoid using a slot that was already assigned a phase Id unless we are in
the phase where the slot was used before. This also requires keeping some global
drreg state, and it does not help in avoiding spill slot conflicts between multiple
clients in the same phase.
__Preferred: Scan fragment to determine eligible slots__
When picking a spill slot, we can determine whether using it will cause a slot
conflict by scanning for its uses in the current fragment after the current
instruction. We pick only a spill slot that has no later uses in the
current fragment. This requires neither init-time guesses nor any global
drreg state. It does not impose any additional responsibilities on the users, and
it also works for multiple clients in the same phase. This was implemented to pick
spill slots for GPRs
([commit](https://github.com/DynamoRIO/dynamorio/commit/238bb25de02d4741f52ec368e2118244329d76f0))
and aflags too
([commit](https://github.com/DynamoRIO/dynamorio/commit/a3d0419e4d05113ca8c665f4ef0edb970e3bcf58)).
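The scan can be sketched as below. This is an illustrative model rather than drreg's implementation: `frag_instr_t` and `pick_spill_slot` are hypothetical, and each instruction is reduced to the single spill slot it reads or writes (or -1 if it touches none). In practice drreg would have to recognize slot accesses by decoding the instructions in the fragment.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int slot; // Spill slot accessed by this instruction, or -1 for none.
} frag_instr_t;

// Picks the first slot that is not currently reserved and that no instruction
// after index "cur" in the fragment uses. Returns -1 if there is none.
static int
pick_spill_slot(const frag_instr_t *frag, int len, int cur, const bool *slot_in_use,
                int num_slots)
{
    assert(cur >= 0 && cur < len);
    for (int s = 0; s < num_slots; s++) {
        if (slot_in_use[s])
            continue;
        bool has_later_use = false;
        for (int i = cur + 1; i < len; i++) {
            if (frag[i].slot == s) {
                has_later_use = true;
                break;
            }
        }
        if (!has_later_use)
            return s;
    }
    return -1;
}
```

A later use of a slot typically comes from spills and restores that an earlier phase already inserted, which is why skipping such slots avoids clobbering the value that phase saved.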
### State Restoration For Drreg
Owner: [Abhinav Sharma](https://github.com/abhinav92003)
Upstream issues: [DynamoRIO/dynamorio#3823](https://github.com/DynamoRIO/dynamorio/issues/3823), [DynamoRIO/dynamorio#3801](https://github.com/DynamoRIO/dynamorio/issues/3801)
On a state restore event, drreg should be able to restore all spilled registers to
their application values.
Unfortunately, when a state restore event happens, we only have the encoded
fragment, and none of the drreg state, like the register to spill slot mappings. We
need to reconstruct this state based on the faulting pc and the encoded fragment.
It is complex to determine which registers need to be restored and from which spill
slot. This is because drreg automatically adds spill and restore instructions to
handle various complex cases like automatic re-spilling of reserved registers after
their application write instruction, and automatic restore of reserved registers
before their application read instruction. Drreg also uses various optimisations
like lazy restores for application values in case the register is reserved again.
This is even more complex for aflags, for which spill and restore require at least
two steps (spilling aflags involves reading aflags into a register using `lahf` and
then writing that register to a spill slot; restoring aflags involves reading
aflags from its spill slot to a register, and then writing aflags from that
register using `sahf`); and an additional step for reading or writing the overflow
flag if needed. In some cases, aflags are even kept in a register as an optimisation.
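These aflags steps appear in the expansions shown earlier as `lahf` and `seto` on the spill side, and `cmp %al $0x81` and `sahf` on the restore side. The sketch below models why that works, using hypothetical names (`aflags_model_t`, `model_lahf`, and so on); it only models the six arithmetic flags and is not drreg code. The key point is that the 8-bit comparison `al - 0x81` overflows exactly when `al` is 1, which recreates the overflow flag saved by `seto`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool cf, pf, af, zf, sf, of;
} aflags_model_t;

// lahf: AH = SF:ZF:0:AF:0:PF:1:CF (bit 7 down to bit 0).
static uint8_t
model_lahf(const aflags_model_t *f)
{
    return (uint8_t)((f->sf << 7) | (f->zf << 6) | (f->af << 4) | (f->pf << 2) |
                     (1 << 1) | (f->cf << 0));
}

// seto: AL = 1 if OF is set, else 0.
static uint8_t
model_seto(const aflags_model_t *f)
{
    return f->of ? 1 : 0;
}

// OF produced by an 8-bit cmp of a against b, which computes a - b.
static bool
cmp8_overflow(uint8_t a, uint8_t b)
{
    uint8_t r = (uint8_t)(a - b);
    // Signed overflow: operand signs differ and the result sign differs from a's.
    return (((a ^ b) & (a ^ r)) & 0x80) != 0;
}

// Restore: the cmp recreates OF, then sahf loads the other five flags from AH.
static aflags_model_t
model_restore(uint8_t ah, uint8_t al)
{
    assert(al == 0 || al == 1);
    aflags_model_t f;
    f.of = cmp8_overflow(al, 0x81);
    f.sf = (ah >> 7) & 1;
    f.zf = (ah >> 6) & 1;
    f.af = (ah >> 4) & 1;
    f.pf = (ah >> 2) & 1;
    f.cf = (ah >> 0) & 1;
    return f;
}
```

The order matters: `cmp` writes all the arithmetic flags, so it has to run before `sahf`, which then overwrites every flag except the overflow flag.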
Additionally, in multi-phase use, a register may be spilled by multiple phases,
with a separate spill slot for each phase. The application value for the register
may reside in one or more spill slots, and may also move between spill slots based
on how the spill regions from different phases overlap. See various tricky
scenarios in
[drreg-test.c](https://github.com/DynamoRIO/dynamorio/blob/f1d496b451eaa6e9aaff7125617030164c6cfdff/suite/tests/client-interface/drreg-test.c#L637).
We explored two ways to adapt drreg’s state restoration logic to multi-phase use.
This also fixed some known existing issues with drreg:
[DynamoRIO/dynamorio#4933](https://github.com/DynamoRIO/dynamorio/issues/4933),
[DynamoRIO/dynamorio#4939](https://github.com/DynamoRIO/dynamorio/issues/4939).
__Track app values as they are moved between slots and registers__
At a state restoration event, we walk the faulting fragment from the beginning to the
faulting instruction, keeping track of where the native value of each register
resides. At any point, it may be present in the register itself, a spill slot,
or both. We track `gpr_is_native` to denote whether a register contains its native
app value or not; and `spill_slot_to_reg`, to denote which register’s app value a
spill slot contains.
- When a register is written by an application instruction, we invalidate all
`spill_slot_to_reg` entries that are mapped to that register, and also set
`gpr_is_native` for that register.
- When a register is written by a non-drreg meta instruction, we clear
`gpr_is_native` for that reg.
- When a register is loaded by drreg from the slot it was spilled to, we set
`gpr_is_native`.
- When a register is spilled to some spill slot, we set `spill_slot_to_reg` for
that spill slot to that reg.
This strategy allows us to robustly keep track of the various corner cases that can
arise in drreg, like spill regions from different phases overlapping (nesting or
just overlapping), and the other known issues linked above. This was implemented in
this
[commit](https://github.com/DynamoRIO/dynamorio/commit/f62441bc2a0ced41263e2229deb3691433d6abb9).
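The four tracking rules can be sketched as a small forward walk. Only `gpr_is_native` and `spill_slot_to_reg` come from the description above; the event encoding, `tracker_t`, and the helper functions are hypothetical, and the model assumes the fragment has already been classified into these four kinds of events.

```c
#include <assert.h>
#include <stdbool.h>

enum { NUM_REGS = 4, NUM_SLOTS = 4, NO_REG = -1 };
enum { RESTORE_NOT_NEEDED = -1, RESTORE_UNKNOWN = -2 };

typedef enum { EV_APP_WRITE, EV_META_WRITE, EV_DRREG_SPILL, EV_DRREG_RESTORE } ev_kind_t;

typedef struct {
    ev_kind_t kind;
    int reg;
    int slot; // Only meaningful for spill and restore events.
} event_t;

typedef struct {
    bool gpr_is_native[NUM_REGS];
    int spill_slot_to_reg[NUM_SLOTS];
} tracker_t;

static void
tracker_init(tracker_t *t)
{
    for (int r = 0; r < NUM_REGS; r++)
        t->gpr_is_native[r] = true;
    for (int s = 0; s < NUM_SLOTS; s++)
        t->spill_slot_to_reg[s] = NO_REG;
}

// Applies one of the four rules for a single instruction.
static void
tracker_step(tracker_t *t, const event_t *e)
{
    assert(e->reg >= 0 && e->reg < NUM_REGS);
    switch (e->kind) {
    case EV_APP_WRITE:
        for (int s = 0; s < NUM_SLOTS; s++) {
            if (t->spill_slot_to_reg[s] == e->reg)
                t->spill_slot_to_reg[s] = NO_REG;
        }
        t->gpr_is_native[e->reg] = true;
        break;
    case EV_META_WRITE: t->gpr_is_native[e->reg] = false; break;
    case EV_DRREG_RESTORE:
        assert(e->slot >= 0 && e->slot < NUM_SLOTS);
        if (t->spill_slot_to_reg[e->slot] == e->reg)
            t->gpr_is_native[e->reg] = true;
        break;
    case EV_DRREG_SPILL:
        assert(e->slot >= 0 && e->slot < NUM_SLOTS);
        t->spill_slot_to_reg[e->slot] = e->reg;
        break;
    }
}

// Returns the slot to restore "reg" from, RESTORE_NOT_NEEDED if the register
// already holds its app value, or RESTORE_UNKNOWN if the value was lost track of.
static int
tracker_restore_source(const tracker_t *t, int reg)
{
    if (t->gpr_is_native[reg])
        return RESTORE_NOT_NEEDED;
    for (int s = 0; s < NUM_SLOTS; s++) {
        if (t->spill_slot_to_reg[s] == reg)
            return s;
    }
    return RESTORE_UNKNOWN;
}
```

The `RESTORE_UNKNOWN` outcome is where this scheme becomes fragile: a meta instruction that writes a register without a spill the tracker recognizes leaves it with no known location for the app value.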
The drawback of this approach is that it needs to be aware of other methods of
spilling and restoring registers outside drreg ([dropped PR](https://github.com/DynamoRIO/dynamorio/pull/4987)).
DynamoRIO uses various such methods internally (spilling to the stack or to slots not
managed by drreg), and clients may also use their own methods. So, some
non-drreg meta instructions may actually restore an application value to a
register, but this approach will not be able to recognize that. This may cause it
to lose track of some register’s application value. We dropped this approach on
encountering
[DynamoRIO/dynamorio#4963](https://github.com/DynamoRIO/dynamorio/issues/4963).
__Preferred: Pairing restores with spills (instead of the other way)__
The key observation behind this approach is that it is easier to find the matching
spill for a given restore, than to find the matching restore for a given spill.
This is because there may be other restores besides the final restore, e.g.
restores for app read, user prompted restores, etc. This makes it hard to find
exactly where the spill region for a register/aflags ends. Additional complexities
include the fact that aflags re-spills may not use the same slot, which makes
differentiating spills from multiple phases difficult.
Each restore must have a matching spill. Based on this observation, we scan the
faulting fragment from end to beginning, matching register restores to their
spills. When we reach the faulting instruction, any restore for which we did not
see the matching spill yet must be performed by the drreg state restoration. This
was implemented in this
[commit](https://github.com/DynamoRIO/dynamorio/commit/be78c124eb787601182e3c73f4be8bb859c50ef8).
This algorithm does not need to be aware of non-drreg methods of spilling/restoring
registers. Note that, like the general drreg operation, this method does not
restore the application value of a spilled GPR/aflags if it is dead at the
faulting instruction. However, even dead registers need to be restored when
`drreg_options_t.conservative` is set. This can be handled if there is additional
metadata available to the drreg state restore callback
([DynamoRIO/dynamorio#3801](https://github.com/DynamoRIO/dynamorio/issues/3801)).
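The backward scan used by the preferred approach can be sketched as the following
standalone model. It is a simplification under stated assumptions, not the drreg
implementation: `toy_instr_t`, the instruction kinds and
`pair_restores_with_spills` are hypothetical names. The sketch also ignores
liveness, aflags, and the `conservative` option.

```c
#include <assert.h>

// Toy model only: these types are hypothetical, not drreg's.
#define NUM_REGS 4
#define NO_SLOT (-1)

typedef enum {
    INSTR_OTHER,   // Anything that is not a drreg spill or restore.
    INSTR_SPILL,   // drreg stores the register into a spill slot.
    INSTR_RESTORE, // drreg loads the register from a spill slot.
} instr_kind_t;

typedef struct {
    instr_kind_t kind;
    int reg;
    int slot;
} toy_instr_t;

// Scans the fragment backward from its end down to (but excluding) the
// faulting instruction at fault_idx. On return, restore_from[r] is the slot
// from which register r must be restored, or NO_SLOT if no restore is needed.
static void
pair_restores_with_spills(const toy_instr_t *frag, int num_instrs, int fault_idx,
                          int restore_from[NUM_REGS])
{
    for (int r = 0; r < NUM_REGS; r++)
        restore_from[r] = NO_SLOT;
    for (int i = num_instrs - 1; i > fault_idx; i--) {
        const toy_instr_t *in = &frag[i];
        if (in->kind == INSTR_RESTORE) {
            // A restore whose spill we have not seen yet: remember its slot.
            restore_from[in->reg] = in->slot;
        } else if (in->kind == INSTR_SPILL) {
            // This spill pairs with the restores seen after it, so that
            // spill region lies entirely after the fault.
            restore_from[in->reg] = NO_SLOT;
        }
    }
}
```

Any restore left unpaired when the scan reaches the fault had its spill at or
before the faulting instruction, so the slot still holds the app value. The scan
only looks at drreg spills and restores, so other spilling mechanisms do not
affect the result.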
## Simplifying Instrumentation For Emulated Instructions
Owner: [Derek Bruening](https://github.com/derekbruening)
Upstream Issue: [DynamoRIO/dynamorio#4865](https://github.com/DynamoRIO/dynamorio/issues/4865)
Emulated sequences like the expanded scatter and gather sequence described above
pose another challenge for clients that need to observe both instructions and
memory references. For observing instructions, these clients should see the original
application instruction (that is, the scatter or gather instruction), whereas for
observing memory references, they should see the emulated sequence (that is, all
the individual scalar stores or loads). DynamoRIO should absorb this complexity and
provide the required events to the client.
We implemented `drmgr_orig_app_instr_for_fetch`,
`drmgr_orig_app_instr_for_operands` and `drmgr_in_emulation_region` APIs
([commit](https://github.com/DynamoRIO/dynamorio/commit/eb5d5af8e3444912c9f3f70e5ebf7969252ee4d6),
[commit](https://github.com/DynamoRIO/dynamorio/commit/6d84fea04a036038db5a3af2e979e77d2cd356c0))
that return the appropriate instruction
to the client to be used for either instruction instrumentation or memory reference
instrumentation. These were subsequently used in drcachesim as well
([commit](https://github.com/DynamoRIO/dynamorio/commit/8b9be0fd04e40deb41d993e8d846b69160fb4f04)).
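The split between the two views can be illustrated with a toy model. This sketch
does not call drmgr and does not reproduce its implementation: `toy_instr_t`,
`toy_instr_for_fetch`, `toy_instr_for_operands` and `toy_count` are hypothetical
names. They only mirror the behavior described above, under the assumption that
the original instruction is reported once, at the start of its emulation
sequence.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Toy stand-in for one instruction in an instrumented block.
typedef struct {
    const char *name;        // The instruction actually present in the block.
    const char *emulated;    // Original app instruction, or NULL if not emulated.
    bool first_in_emulation; // True only for the first instr of a sequence.
    bool is_mem_ref;         // Whether this instruction reads or writes memory.
} toy_instr_t;

// Instruction view: outside emulation this is the instruction itself. Inside
// an emulation sequence the original is reported once, at the first
// instruction, and nothing (NULL) is reported for the rest.
static const char *
toy_instr_for_fetch(const toy_instr_t *in)
{
    if (in->emulated == NULL)
        return in->name;
    return in->first_in_emulation ? in->emulated : NULL;
}

// Memory-reference view: always the instruction actually present, so each
// scalar load or store of the expansion is visible.
static const char *
toy_instr_for_operands(const toy_instr_t *in)
{
    return in->name;
}

// Counts what a tracing client would record for a block under each view.
static void
toy_count(const toy_instr_t *block, int num_instrs, int *fetches, int *mem_refs)
{
    *fetches = 0;
    *mem_refs = 0;
    for (int i = 0; i < num_instrs; i++) {
        if (toy_instr_for_fetch(&block[i]) != NULL)
            (*fetches)++;
        if (toy_instr_for_operands(&block[i]) != NULL && block[i].is_mem_ref)
            (*mem_refs)++;
    }
}
```

For a block holding one ordinary instruction, a gather expanded into several
instructions with two scalar loads, and another ordinary instruction, this model
counts three instruction fetches and two memory references.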
## Support For Vector Reservation
Owner: [Abhinav Sharma](https://github.com/abhinav92003)
The scatter and gather expansions require a scratch `xmm` register, for which we
need the capability to spill and restore vector registers. We considered the
following design choices:
- Extend drreg to support reservation for vector registers.
[DynamoRIO/dynamorio#3844](https://github.com/DynamoRIO/dynamorio/issues/3844)
aims to add this support.
- Use custom spill and restore logic in drx. We can do this by reserving memory in
TLS to use as a spill slot.
Some observations about this use-case for vector reservation:
- We need to spill only one vector register, so we do not need sophisticated spill
slot management logic.
- The spilled vector register will not need to be restored for app reads, or
re-spilled after app writes. Note that we will not encounter any application
instructions that use the spilled vector register, because it needs to be spilled
only for the duration of the expanded scatter or gather sequence.
Extending drreg to support vector spilling is a complex task. Given the above
observations, the current use case does not justify the effort. Therefore, we chose
to implement custom spill logic in drx
([commit](https://github.com/DynamoRIO/dynamorio/commit/88cba2817fef7a4d4ba3e8c2375784ea4165c133),
[commit](https://github.com/DynamoRIO/dynamorio/commit/84bf9288a12e8e38abab04d4d8273cc3226fa13c)).
## Using The Expansion In DR Clients
Owner: [Abhinav Sharma](https://github.com/abhinav92003)
Clients that need to observe each memory reference must use the
`drx_expand_scatter_gather` API. We added it to the app2app phase of drcachesim
and other DynamoRIO clients
([commit](https://github.com/DynamoRIO/dynamorio/commit/cf9d6a95262015581a5184e75ce599cc66ac4df4)).
This also required fixing some issues (crashes and correctness problems) that
surfaced when all pieces were integrated
([commit](https://github.com/DynamoRIO/dynamorio/commit/ced6e253b2e6bb7f1402398798d4bd8a988dacd0),
[commit](https://github.com/DynamoRIO/dynamorio/commit/e9f05212a7c1f2ff969cabdacd92914865345303)).
# Testing On Large Apps
Owner: [Abhinav Sharma](https://github.com/abhinav92003)
drcachesim was successfully used to trace an application with scatter and gather
instructions. The resulting trace was observed to have millions of such
instructions. We also verified correctness by comparing application output with and
without tracing.
***************************************************************************
*/