/* ******************************************************************************
 * Copyright (c) 2010-2022 Google, Inc.  All rights reserved.
 * ******************************************************************************/

/*
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 * * Redistributions of source code must retain the above copyright notice,
 *   this list of conditions and the following disclaimer.
 *
 * * Redistributions in binary form must reproduce the above copyright notice,
 *   this list of conditions and the following disclaimer in the documentation
 *   and/or other materials provided with the distribution.
 *
 * * Neither the name of Google, Inc. nor the names of its contributors may be
 *   used to endorse or promote products derived from this software without
 *   specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL VMWARE, INC. OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
 * DAMAGE.
 */

/**
 ****************************************************************************
\page page_scatter_gather_emulation Emulating x86 Scatter and Gather Instructions

\tableofcontents

# Background

The x86 gather and scatter instructions were introduced in the AVX2 and AVX512
instruction set extensions. They allow loading or storing a subset of elements in a
vector from/to multiple non-contiguous addresses.

AVX2 has only gather instructions, no scatter instructions, whereas AVX512 has
both. AVX2 is limited to 256-bit length vectors, whereas AVX512 has 512-bit
support. Both support masking of individual memory accesses, using either a special
mask register in AVX512 or another vector in AVX2.

Examples of these instructions are (in DR’s IR):
```
vpgatherdd %rax(,%ymm11,4)[4byte] %ymm13 -> %ymm12 %ymm13
```

Above is an AVX2 gather instruction that reads 32-bit doublewords into the 256-bit
`ymm12` vector from addresses generated by adding the base address in `rax` to the
corresponding index element in `ymm11` scaled by 4, conditionally based on the
masks in `ymm13`. Elements may be gathered in any order. When an element is read,
its mask is cleared. If a load faults, all elements to its right (closer to the
LSB) will be complete.
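
The semantics described above can be sketched as a scalar C model. This is only an illustration: `model_vpgatherdd`, `GATHER_LANES`, and the `fault_lane` parameter are names invented here, and the model processes lanes from the LSB upward, which is one order consistent with the fault guarantee (hardware may use any order). It assumes the AVX2 convention that the mask lives in the most significant bit of each mask element.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define GATHER_LANES 8 // 256-bit ymm register / 32-bit elements

// Scalar model of: vpgatherdd base(,index,4) mask -> dst mask
// fault_lane is a hypothetical knob: the lane whose load "faults", or -1.
// Returns the lane at which the gather stopped (GATHER_LANES on completion).
static int
model_vpgatherdd(const uint8_t *base, const int32_t *index, uint32_t *mask,
                 uint32_t *dst, int fault_lane)
{
    assert(base != NULL && index != NULL && mask != NULL && dst != NULL);
    assert(fault_lane < GATHER_LANES);
    for (int i = 0; i < GATHER_LANES; i++) {
        // AVX2 keeps the mask in the most significant bit of each element.
        if ((mask[i] >> 31) == 0)
            continue; // Masked off: dst[i] keeps its previous value.
        if (i == fault_lane)
            return i; // Lanes below i are complete; mask[i] is still set.
        // Address = base + index * scale, with scale 4 as in the example.
        memcpy(&dst[i], base + (int64_t)index[i] * 4, sizeof(dst[i]));
        mask[i] = 0; // Record that this element has been read.
    }
    // After a complete gather the whole mask register is zero.
    memset(mask, 0, GATHER_LANES * sizeof(mask[0]));
    return GATHER_LANES;
}
```

Calling the model a second time with the mask left behind by a simulated fault resumes with only the remaining elements, which is why clearing each mask as its element completes matters.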


```
vpscatterdd {%k1} %xmm10 -> %rax(,%xmm11,4)[4byte] %k1
```

Above is an AVX512 scatter instruction that writes 32-bit doublewords from the
128-bit `xmm10` vector to addresses generated by adding the base address in `rax`
to the corresponding index element in `xmm11` scaled by 4, conditionally based on
the mask register `k1`. Elements may be scattered in any order. When an element is
stored, its mask bit is cleared. If a store faults, all elements to its right
(closer to the LSB) will be complete.
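
A comparable scalar sketch for the scatter case follows, this time with the AVX512 convention of one mask bit per element in a `k` register. The names `model_vpscatterdd`, `SCATTER_LANES`, and `fault_lane` are hypothetical, and the LSB-first order is an assumption of this model rather than something the instruction promises.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SCATTER_LANES 4 // 128-bit xmm register / 32-bit elements

// Scalar model of: vpscatterdd {k1} src -> base(,index,4) k1
// fault_lane is a hypothetical knob: the lane whose store "faults", or -1.
// Returns the value left in the mask register.
static uint16_t
model_vpscatterdd(uint8_t *base, const int32_t *index, const uint32_t *src,
                  uint16_t k1, int fault_lane)
{
    assert(base != NULL && index != NULL && src != NULL);
    assert(fault_lane < SCATTER_LANES);
    for (int i = 0; i < SCATTER_LANES; i++) {
        uint16_t bit = (uint16_t)(1u << i);
        if ((k1 & bit) == 0)
            continue; // Masked off: no store for this element.
        if (i == fault_lane)
            return k1; // Bits of completed lower lanes are already clear.
        // Address = base + index * scale, with scale 4 as in the example.
        memcpy(base + (int64_t)index[i] * 4, &src[i], sizeof(src[i]));
        k1 = (uint16_t)(k1 & ~bit); // Record that this element was stored.
    }
    return 0; // After a complete scatter the mask register is zero.
}
```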


# Problem Statement

Scatter and gather instructions pose a challenge to DynamoRIO clients that observe
memory addresses, such as address tracing tools (e.g., drcachesim, which collects
memory address and control flow traces) and taint tracking tools. They are complex
to handle because:

-  A single gather or scatter instruction loads from or stores to multiple
addresses
-  Accessed addresses may be non-contiguous
-  Each access is conditional based on some mask

DynamoRIO clients only see the scatter or gather instruction and need to do more
work to extract all accessed addresses. This is unlike regular scalar loads or
stores, where the accessed address is readily available. The goal of this work is
to make it easier for DR clients to observe these addresses. We achieve this by
expanding the scatter and gather instructions into a functionally equivalent
sequence of scalar stores and loads. This way, DR clients will see regular store
and load instructions which they can instrument as usual. This is similar to what
DR does for repeat string operations (like `rep movs`, `repnz cmps`): it converts
each one into a loop so that each memory access is made by a separate dynamic
instruction. This method has worked well for such instructions that implicitly
issue multiple memory accesses.

Original issue: [DynamoRIO/dynamorio#2985](https://github.com/DynamoRIO/dynamorio/issues/2985).

# Design

This required the addition of new support in various DynamoRIO components, like
drreg, drx, drmgr and core DR. Multiple contributors worked on designing and
implementing the required changes.

Note that we expect the same approach to work for other platforms too, such as the
AArch64 SVE scatter/gather instructions.


## Scatter/gather Instruction Expansion

Owner: [Hendrik Greving](https://github.com/hgreving2304)

As described above, we can simplify work for DR clients by replacing each scatter
and gather instruction with a functionally equivalent sequence of scalar stores and
loads. The expanded sequence is the unrolled version of the following loop:
```
num_accesses = vector_size / element_size
for i = 0, 1, ..., (num_accesses-1), do
  extract mask for the ith access from mask reg or mask vector
  if mask is set, then
    extract ith element of index vector
    compute address = base + displacement + (ith index element * scale)
    if instr_is_gather, then
      load data from address into a scalar reg
      insert scalar data into destination vector
    else // instr_is_scatter
      extract scalar data from source vector to scalar reg
      store data from scalar reg to address
    done
    clear ith mask in mask reg or mask vector
  done
done
```

Due to the x86 ISA, the extraction/insertion of the scalar value from/to the vector
may involve multiple steps, e.g. to extract a 32-bit scalar value from a 512-bit
`zmm` reg, we first need to extract a 128-bit `xmm` from it.
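
As a rough illustration of that two-step extraction, the sketch below models a 512-bit register as sixteen 32-bit elements and picks out element `i` by first copying the 128-bit lane `i / 4` (the role played by `vextracti32x4` in the listings further down) and then selecting doubleword `i % 4` within it (the role played by `vpextrd`). The type and function names are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct { uint32_t dword[16]; } zmm_model_t; // 512 bits
typedef struct { uint32_t dword[4]; } xmm_model_t;  // 128 bits

// Step 1: models vextracti32x4 by copying 128-bit lane `lane` (0..3).
static xmm_model_t
extract_lane128(const zmm_model_t *z, int lane)
{
    xmm_model_t x;
    assert(z != NULL && lane >= 0 && lane < 4);
    memcpy(x.dword, &z->dword[lane * 4], sizeof(x.dword));
    return x;
}

// Step 2: models vpextrd by selecting doubleword `pos` (0..3) of the lane.
static uint32_t
extract_dword(const xmm_model_t *x, int pos)
{
    assert(x != NULL && pos >= 0 && pos < 4);
    return x->dword[pos];
}

// Element i of the 512-bit vector lives in lane i / 4 at position i % 4.
static uint32_t
extract_element32(const zmm_model_t *z, int i)
{
    xmm_model_t x = extract_lane128(z, i / 4);
    return extract_dword(&x, i % 4);
}
```

Insertion into a destination vector follows the same path in reverse: extract the lane, replace one doubleword, and write the lane back.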

drmgr in DR provides multiple phases of instrumentation. Our expansion is done in
the first phase known as app2app. As the name suggests, this phase is intended to
transform app instructions to equivalent instructions. For simplicity, we also
separate out the scatter and gather instructions from their basic block and create a
separate fragment with only the expanded sequence. The logic for expanding scatter
and gather instructions is implemented in the drx extension library as
[`drx_expand_scatter_gather`](https://github.com/DynamoRIO/dynamorio/blob/eb5d5af8e3444912c9f3f70e5ebf7969252ee4d6/ext/drx/drx.h#L538), and can be used by
any client that needs it, including drcachesim. This support was added in this
[commit](https://github.com/DynamoRIO/dynamorio/commit/4359ef134e47942004c09db04b54593579763186).

As an example, the following are the expansions of some instructions.

Expansion for
```
vpgatherdd 0x00402039(,%xmm11,4)[4byte] %xmm13 -> %xmm12 %xmm13
```


```
 +0    m4 @0x00007fdb2ac5e6a0  65 48 a3 e0 00 00 00 mov    %rax -> %gs:0x000000e0[8byte]
                               00 00 00 00
 +11   m4 @0x00007fdb2ac5eac0  9f                   lahf    -> %ah
 +12   m4 @0x00007fdb2ac5ea40  0f 90 c0             seto    -> %al                              // Spill aflags using drreg.
 +15   m4 @0x00007fdb2ac5efa8  65 48 89 0c 25 e8 00 mov    %rcx -> %gs:0x000000e8[8byte]        // Spill the scratch GPR using drreg.
                               00 00
 +24   m4 @0x00007fdb2ac0ec70  65 48 8b 0c 25 20 00 mov    %gs:0x20[8byte] -> %rcx
                               00 00
 +33   m4 @0x00007fdb2ac0ebf0  48 8b 89 f0 0a 00 00 mov    0x00000af0(%rcx)[8byte] -> %rcx
 +40   m4 @0x00007fdb2ac5f0a8  48 8b 09             mov    (%rcx)[8byte] -> %rcx
 +43   m4 @0x00007fdb2ac5e7f0  48 8b 49 10          mov    0x10(%rcx)[8byte] -> %rcx
 +47   m4 @0x00007fdb2ac0f488  48 8b 49 08          mov    0x08(%rcx)[8byte] -> %rcx
 +51   m4 @0x00007fdb2ac0ed38  62 f1 7c 48 29 01    vmovaps {%k0} %zmm0 -> (%rcx)[64byte]       // Manually spill the scratch zmm reg.
 +57   m4 @0x00007fdb2ac0ee00                       <label>
 +57   L4 @0x00007fdb2ac0efb0  c4 63 7d 39 e8 00    vextracti128 %ymm13 $0x00 -> %xmm0          // Expansion for the first vector element starts here.
 +63   L4 @0x00007fdb2ac0bee8  c4 e3 79 16 c1 00    vpextrd %xmm0 $0x00 -> %ecx                 // Extract mask for the first element.
 +69   L4 @0x00007fdb2ac0c750  c1 e9 1f             shr    $0x0000001f %ecx -> %ecx
 +72   L4 @0x00007fdb2ac0be68  81 e1 01 00 00 00    and    $0x00000001 %ecx -> %ecx
 +78   L4 @0x00007fdb2ac0bca0  0f 84 fa ff ff ff    jz     @0x00007fdb2ac0ee98[8byte]           // Check whether to load the first element based on mask.
 +84   L4 @0x00007fdb2ac0c6d0  c4 63 7d 39 d8 00    vextracti128 %ymm11 $0x00 -> %xmm0
 +90   L4 @0x00007fdb2ac0be00  c4 e3 79 16 c1 00    vpextrd %xmm0 $0x00 -> %ecx                 // Extract index for the first load address.
 +96   L4 @0x00007fdb2ac0b8b8  48 63 c9             movsxd %ecx -> %rcx
 +99   L4 @0x00007fdb2ac0cc20  8b 0c 8d 39 20 40 00 mov    0x00402039(,%rcx,4)[4byte] -> %ecx   // Load the first element into a scalar reg.
 +106  L4 @0x00007fdb2ac5e5b8  c4 63 7d 39 e0 00    vextracti128 %ymm12 $0x00 -> %xmm0
 +112  L4 @0x00007fdb2ac5e870  c4 e3 79 22 c1 00    vpinsrd %xmm0 %ecx $0x00 -> %xmm0
 +118  L4 @0x00007fdb2ac5e9c0  c4 63 1d 38 e0 00    vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12    // Insert the first element into the destination vector reg
 +124  L4 @0x00007fdb2ac5eb40  33 c9                xor    %ecx %ecx -> %ecx
 +126  L4 @0x00007fdb2ac5ebc0  c4 63 7d 39 e8 00    vextracti128 %ymm13 $0x00 -> %xmm0
 +132  L4 @0x00007fdb2ac5eda8  c4 e3 79 22 c1 00    vpinsrd %xmm0 %ecx $0x00 -> %xmm0
 +138  L4 @0x00007fdb2ac5ec40  c4 63 15 38 e8 00    vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13    // Clear the mask bit for the first element.
 +144  m4 @0x00007fdb2ac0ee98                       <label>
 +144  L4 @0x00007fdb2ac5ed40  c4 63 7d 39 e8 00    vextracti128 %ymm13 $0x00 -> %xmm0          // Repeat for the second vector element.
 +150  L4 @0x00007fdb2ac5ee28  c4 e3 79 16 c1 01    vpextrd %xmm0 $0x01 -> %ecx
 +156  L4 @0x00007fdb2ac5eea8  c1 e9 1f             shr    $0x0000001f %ecx -> %ecx
 +159  L4 @0x00007fdb2ac5e638  81 e1 01 00 00 00    and    $0x00000001 %ecx -> %ecx
 +165  L4 @0x00007fdb2ac5e788  0f 84 fa ff ff ff    jz     @0x00007fdb2ac5ecc0[8byte]
 +171  L4 @0x00007fdb2ac5e720  c4 63 7d 39 d8 00    vextracti128 %ymm11 $0x00 -> %xmm0
 +177  L4 @0x00007fdb2ac5e958  c4 e3 79 16 c1 01    vpextrd %xmm0 $0x01 -> %ecx
 +183  L4 @0x00007fdb2ac5e8f0  48 63 c9             movsxd %ecx -> %rcx
 +186  L4 @0x00007fdb2ac5f028  8b 0c 8d 39 20 40 00 mov    0x00402039(,%rcx,4)[4byte] -> %ecx
 +193  L4 @0x00007fdb2ac5ef28  c4 63 7d 39 e0 00    vextracti128 %ymm12 $0x00 -> %xmm0
 +199  L4 @0x00007fdb2ac0baa0  c4 e3 79 22 c1 01    vpinsrd %xmm0 %ecx $0x01 -> %xmm0
 +205  L4 @0x00007fdb2ac5f128  c4 63 1d 38 e0 00    vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12
 +211  L4 @0x00007fdb2ac0ea60  33 c9                xor    %ecx %ecx -> %ecx
 +213  L4 @0x00007fdb2ac0e9c8  c4 63 7d 39 e8 00    vextracti128 %ymm13 $0x00 -> %xmm0
 +219  L4 @0x00007fdb2ac0e930  c4 e3 79 22 c1 01    vpinsrd %xmm0 %ecx $0x01 -> %xmm0
 +225  L4 @0x00007fdb2ac0cbb8  c4 63 15 38 e8 00    vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13
 +231  m4 @0x00007fdb2ac5ecc0                       <label>
 +231  L4 @0x00007fdb2ac5e538  c4 63 7d 39 e8 00    vextracti128 %ymm13 $0x00 -> %xmm0          // Repeat for the third vector element.
 +237  L4 @0x00007fdb2ac5e4b8  c4 e3 79 16 c1 02    vpextrd %xmm0 $0x02 -> %ecx
 +243  L4 @0x00007fdb2ac5df68  c1 e9 1f             shr    $0x0000001f %ecx -> %ecx
 +246  L4 @0x00007fdb2ac5e438  81 e1 01 00 00 00    and    $0x00000001 %ecx -> %ecx
 +252  L4 @0x00007fdb2ac5e3b8  0f 84 fa ff ff ff    jz     @0x00007fdb2ac0f018[8byte]
 +258  L4 @0x00007fdb2ac5e050  c4 63 7d 39 d8 00    vextracti128 %ymm11 $0x00 -> %xmm0
 +264  L4 @0x00007fdb2ac5e338  c4 e3 79 16 c1 02    vpextrd %xmm0 $0x02 -> %ecx
 +270  L4 @0x00007fdb2ac5e2b8  48 63 c9             movsxd %ecx -> %rcx
 +273  L4 @0x00007fdb2ac5e238  8b 0c 8d 39 20 40 00 mov    0x00402039(,%rcx,4)[4byte] -> %ecx
 +280  L4 @0x00007fdb2ac5e1b8  c4 63 7d 39 e0 00    vextracti128 %ymm12 $0x00 -> %xmm0
 +286  L4 @0x00007fdb2ac5e138  c4 e3 79 22 c1 02    vpinsrd %xmm0 %ecx $0x02 -> %xmm0
 +292  L4 @0x00007fdb2ac5e0b8  c4 63 1d 38 e0 00    vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12
 +298  L4 @0x00007fdb2ac5dfd0  33 c9                xor    %ecx %ecx -> %ecx
 +300  L4 @0x00007fdb2ac5dee8  c4 63 7d 39 e8 00    vextracti128 %ymm13 $0x00 -> %xmm0
 +306  L4 @0x00007fdb2ac0f080  c4 e3 79 22 c1 02    vpinsrd %xmm0 %ecx $0x02 -> %xmm0
 +312  L4 @0x00007fdb2ac0f0e8  c4 63 15 38 e8 00    vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13
 +318  m4 @0x00007fdb2ac0f018                       <label>
 +318  L4 @0x00007fdb2ac0f220  c4 63 7d 39 e8 00    vextracti128 %ymm13 $0x00 -> %xmm0          // Repeat for the fourth vector element.
 +324  L4 @0x00007fdb2ac0f2a0  c4 e3 79 16 c1 03    vpextrd %xmm0 $0x03 -> %ecx
 +330  L4 @0x00007fdb2ac0f150  c1 e9 1f             shr    $0x0000001f %ecx -> %ecx
 +333  L4 @0x00007fdb2ac0f388  81 e1 01 00 00 00    and    $0x00000001 %ecx -> %ecx
 +339  L4 @0x00007fdb2ac0f408  0f 84 fa ff ff ff    jz     @0x00007fdb2ac0f1b8[8byte]
 +345  L4 @0x00007fdb2ac0ba20  c4 63 7d 39 d8 00    vextracti128 %ymm11 $0x00 -> %xmm0
 +351  L4 @0x00007fdb2ac0f320  c4 e3 79 16 c1 03    vpextrd %xmm0 $0x03 -> %ecx
 +357  L4 @0x00007fdb2ac0f508  48 63 c9             movsxd %ecx -> %rcx
 +360  L4 @0x00007fdb2ac0f588  8b 0c 8d 39 20 40 00 mov    0x00402039(,%rcx,4)[4byte] -> %ecx
 +367  L4 @0x00007fdb2ac0ef30  c4 63 7d 39 e0 00    vextracti128 %ymm12 $0x00 -> %xmm0
 +373  L4 @0x00007fdb2ac0c050  c4 e3 79 22 c1 03    vpinsrd %xmm0 %ecx $0x03 -> %xmm0
 +379  L4 @0x00007fdb2ac0c1c8  c4 63 1d 38 e0 00    vinserti128 %ymm12 %xmm0 $0x00 -> %ymm12
 +385  L4 @0x00007fdb2ac0c3c8  33 c9                xor    %ecx %ecx -> %ecx
 +387  L4 @0x00007fdb2ac0bfd0  c4 63 7d 39 e8 00    vextracti128 %ymm13 $0x00 -> %xmm0
 +393  L4 @0x00007fdb2ac5de68  c4 e3 79 22 c1 03    vpinsrd %xmm0 %ecx $0x03 -> %xmm0
 +399  L4 @0x00007fdb2ac5dde8  c4 63 15 38 e8 00    vinserti128 %ymm13 %xmm0 $0x00 -> %ymm13
 +405  m4 @0x00007fdb2ac0f1b8                       <label>
 +405  L4 @0x00007fdb2ac5d898  c4 41 11 ef ed       vpxor  %xmm13 %xmm13 -> %xmm13              // Zero the mask reg.
 +410  m4 @0x00007fdb2ac5dd68  65 48 8b 0c 25 20 00 mov    %gs:0x20[8byte] -> %rcx
                               00 00
 +419  m4 @0x00007fdb2ac5dce8  48 8b 89 f0 0a 00 00 mov    0x00000af0(%rcx)[8byte] -> %rcx
 +426  m4 @0x00007fdb2ac5d980  48 8b 09             mov    (%rcx)[8byte] -> %rcx
 +429  m4 @0x00007fdb2ac5dc68  48 8b 49 10          mov    0x10(%rcx)[8byte] -> %rcx
 +433  m4 @0x00007fdb2ac5dbe8  48 8b 49 08          mov    0x08(%rcx)[8byte] -> %rcx
 +437  m4 @0x00007fdb2ac5db68  62 f1 7c 48 28 01    vmovaps {%k0} (%rcx)[64byte] -> %zmm0       // Manually restore the scratch zmm reg.
 +443  m4 @0x00007fdb2ac5dae8  65 48 8b 0c 25 e8 00 mov    %gs:0x000000e8[8byte] -> %rcx        // Restore the scratch GPR using drreg.
                               00 00
 +452  m4 @0x00007fdb2ac5da68  3c 81                cmp    %al $0x81
 +454  m4 @0x00007fdb2ac5d9e8  9e                   sahf   %ah
 +455  m4 @0x00007fdb2ac5d900  65 48 a1 e0 00 00 00 mov    %gs:0x000000e0[8byte] -> %rax        // Restore aflags using drreg.
                               00 00 00 00
 +466  m4 @0x00007fdb2ac5d818                       <label>
```


Expansion for
```
vpscatterdd {%k1} %xmm10 -> 0x00402039(,%xmm11,4)[4byte] %k1
```


```
 +0    m4 @0x00007fdb2ac106e0  65 48 a3 e0 00 00 00 mov    %rax -> %gs:0x000000e0[8byte]
                               00 00 00 00
 +11   m4 @0x00007fdb2ac10760  9f                   lahf    -> %ah
 +12   m4 @0x00007fdb2ac107e0  0f 90 c0             seto    -> %al                              // Spill aflags using drreg.
 +15   m4 @0x00007fdb2ac100a8  65 48 89 0c 25 e8 00 mov    %rcx -> %gs:0x000000e8[8byte]        // Spill the first scratch GPR using drreg.
                               00 00
 +24   m4 @0x00007fdb2ac10110  65 48 89 14 25 f0 00 mov    %rdx -> %gs:0x000000f0[8byte]        // Spill the second scratch GPR using drreg.
                               00 00
 +33   m4 @0x00007fdb2ac0fed8  65 48 8b 0c 25 20 00 mov    %gs:0x20[8byte] -> %rcx
                               00 00
 +42   m4 @0x00007fdb2ac0ff40  48 8b 89 f0 0a 00 00 mov    0x00000af0(%rcx)[8byte] -> %rcx
 +49   m4 @0x00007fdb2ac0f608  48 8b 09             mov    (%rcx)[8byte] -> %rcx
 +52   m4 @0x00007fdb2ac10860  48 8b 49 10          mov    0x10(%rcx)[8byte] -> %rcx
 +56   m4 @0x00007fdb2ac108e0  48 8b 49 08          mov    0x08(%rcx)[8byte] -> %rcx
 +60   m4 @0x00007fdb2ac0ca50  62 f1 7c 48 29 01    vmovaps {%k0} %zmm0 -> (%rcx)[64byte]       // Manually spill the scratch zmm reg.
 +66   m4 @0x00007fdb2ac10660                       <label>
 +66   L4 @0x00007fdb2ac10560  c5 f8 93 c9          kmovw  %k1 -> %ecx                          // Expansion for the first vector element starts here.
 +70   L4 @0x00007fdb2ac104f8  f7 c1 01 00 00 00    test   %ecx $0x00000001
 +76   L4 @0x00007fdb2ac10478  0f 84 fa ff ff ff    jz     @0x00007fdb2ac105e0[8byte]           // Check whether to store the first element based on mask.
 +82   L4 @0x00007fdb2ac103f8  62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
 +89   L4 @0x00007fdb2ac10378  c4 e3 79 16 c1 00    vpextrd %xmm0 $0x00 -> %ecx                 // Extract index for the first store address.
 +95   L4 @0x00007fdb2ac102f8  62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
 +102  L4 @0x00007fdb2ac10278  c4 e3 79 16 c2 00    vpextrd %xmm0 $0x00 -> %edx                 // Extract the element for the first store.
 +108  L4 @0x00007fdb2ac101f8  48 63 c9             movsxd %ecx -> %rcx
 +111  L4 @0x00007fdb2ac10178  89 14 8d 39 20 40 00 mov    %edx -> 0x00402039(,%rcx,4)[4byte]   // Store the first element.
 +118  L4 @0x00007fdb2ac10028  b9 01 00 00 00       mov    $0x00000001 -> %ecx
 +123  m4 @0x00007fdb2ac0ffa8  65 48 89 1c 25 f8 00 mov    %rbx -> %gs:0x000000f8[8byte]        // Spill the third scratch GPR using drreg.
                               00 00
 +132  m4 @0x00007fdb2ac0fe58  c5 f8 93 d8          kmovw  %k0 -> %ebx                          // Manually spill the scratch mask reg k0 to the scratch GPR.
 +136  L4 @0x00007fdb2ac0fdd8  c5 f8 92 c1          kmovw  %ecx -> %k0
 +140  L4 @0x00007fdb2ac0fbf0  c5 fc 42 c9          kandnw %k0 %k1 -> %k1                       // Clear bit for the first element in the mask reg.
 +144  m4 @0x00007fdb2ac0fd58  c5 f8 92 c3          kmovw  %ebx -> %k0                          // Manually restore the scratch mask reg from the scratch GPR.
 +148  m4 @0x00007fdb2ac0fcd8  65 48 8b 1c 25 f8 00 mov    %gs:0x000000f8[8byte] -> %rbx        // Restore the third scratch GPR using drreg.
                               00 00
 +157  m4 @0x00007fdb2ac105e0                       <label>
 +157  L4 @0x00007fdb2ac0fb70  c5 f8 93 c9          kmovw  %k1 -> %ecx                          // Repeat for the second vector element.
 +161  L4 @0x00007fdb2ac0faf0  f7 c1 02 00 00 00    test   %ecx $0x00000002
 +167  L4 @0x00007fdb2ac0fa70  0f 84 fa ff ff ff    jz     @0x00007fdb2ac0fc58[8byte]
 +173  L4 @0x00007fdb2ac0f9f0  62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
 +180  L4 @0x00007fdb2ac0f970  c4 e3 79 16 c1 01    vpextrd %xmm0 $0x01 -> %ecx
 +186  L4 @0x00007fdb2ac0f8f0  62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
 +193  L4 @0x00007fdb2ac0f870  c4 e3 79 16 c2 01    vpextrd %xmm0 $0x01 -> %edx
 +199  L4 @0x00007fdb2ac0f7f0  48 63 c9             movsxd %ecx -> %rcx
 +202  L4 @0x00007fdb2ac0f770  89 14 8d 39 20 40 00 mov    %edx -> 0x00402039(,%rcx,4)[4byte]
 +209  L4 @0x00007fdb2ac0f6f0  b9 02 00 00 00       mov    $0x00000002 -> %ecx
 +214  m4 @0x00007fdb2ac0f670  65 48 89 1c 25 f8 00 mov    %rbx -> %gs:0x000000f8[8byte]
                               00 00
 +223  m4 @0x00007fdb2ac0c448  c5 f8 93 d8          kmovw  %k0 -> %ebx
 +227  L4 @0x00007fdb2ac0c2c8  c5 f8 92 c1          kmovw  %ecx -> %k0
 +231  L4 @0x00007fdb2ac0c9e8  c5 fc 42 c9          kandnw %k0 %k1 -> %k1
 +235  m4 @0x00007fdb2ac0c980  c5 f8 92 c3          kmovw  %ebx -> %k0
 +239  m4 @0x00007fdb2ac0c818  65 48 8b 1c 25 f8 00 mov    %gs:0x000000f8[8byte] -> %rbx
                               00 00
 +248  m4 @0x00007fdb2ac0fc58                       <label>
 +248  L4 @0x00007fdb2ac0c518  c5 f8 93 c9          kmovw  %k1 -> %ecx                          // Repeat for the third vector element.
 +252  L4 @0x00007fdb2ac0c668  f7 c1 04 00 00 00    test   %ecx $0x00000004
 +258  L4 @0x00007fdb2ac0cab8  0f 84 fa ff ff ff    jz     @0x00007fdb2ac0ba20[8byte]
 +264  L4 @0x00007fdb2ac0cb20  62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
 +271  L4 @0x00007fdb2ac0c248  c4 e3 79 16 c1 02    vpextrd %xmm0 $0x02 -> %ecx
 +277  L4 @0x00007fdb2ac0bd98  62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
 +284  L4 @0x00007fdb2ac0c0d0  c4 e3 79 16 c2 02    vpextrd %xmm0 $0x02 -> %edx
 +290  L4 @0x00007fdb2ac0c900  48 63 c9             movsxd %ecx -> %rcx
 +293  L4 @0x00007fdb2ac0bfd0  89 14 8d 39 20 40 00 mov    %edx -> 0x00402039(,%rcx,4)[4byte]
 +300  L4 @0x00007fdb2ac0c3c8  b9 04 00 00 00       mov    $0x00000004 -> %ecx
 +305  m4 @0x00007fdb2ac0c1c8  65 48 89 1c 25 f8 00 mov    %rbx -> %gs:0x000000f8[8byte]
                               00 00
 +314  m4 @0x00007fdb2ac0c050  c5 f8 93 d8          kmovw  %k0 -> %ebx
 +318  L4 @0x00007fdb2ac0ef30  c5 f8 92 c1          kmovw  %ecx -> %k0
 +322  L4 @0x00007fdb2ac0f588  c5 fc 42 c9          kandnw %k0 %k1 -> %k1
 +326  m4 @0x00007fdb2ac0f508  c5 f8 92 c3          kmovw  %ebx -> %k0
 +330  m4 @0x00007fdb2ac0f320  65 48 8b 1c 25 f8 00 mov    %gs:0x000000f8[8byte] -> %rbx
                               00 00
 +339  m4 @0x00007fdb2ac0ba20                       <label>
 +339  L4 @0x00007fdb2ac0f408  c5 f8 93 c9          kmovw  %k1 -> %ecx                          // Repeat for the fourth vector element.
 +343  L4 @0x00007fdb2ac0f388  f7 c1 08 00 00 00    test   %ecx $0x00000008
 +349  L4 @0x00007fdb2ac0f150  0f 84 fa ff ff ff    jz     @0x00007fdb2ac0f488[8byte]
 +355  L4 @0x00007fdb2ac0f2a0  62 73 7d 48 39 d8 00 vextracti32x4 {%k0} $0x00 %zmm11 -> %xmm0
 +362  L4 @0x00007fdb2ac0f220  c4 e3 79 16 c1 03    vpextrd %xmm0 $0x03 -> %ecx
 +368  L4 @0x00007fdb2ac0f1b8  62 73 7d 48 39 d0 00 vextracti32x4 {%k0} $0x00 %zmm10 -> %xmm0
 +375  L4 @0x00007fdb2ac0f0e8  c4 e3 79 16 c2 03    vpextrd %xmm0 $0x03 -> %edx
 +381  L4 @0x00007fdb2ac0f080  48 63 c9             movsxd %ecx -> %rcx
 +384  L4 @0x00007fdb2ac0f018  89 14 8d 39 20 40 00 mov    %edx -> 0x00402039(,%rcx,4)[4byte]
 +391  L4 @0x00007fdb2ac0cbb8  b9 08 00 00 00       mov    $0x00000008 -> %ecx
 +396  m4 @0x00007fdb2ac0e930  65 48 89 1c 25 f8 00 mov    %rbx -> %gs:0x000000f8[8byte]
                               00 00
 +405  m4 @0x00007fdb2ac0e9c8  c5 f8 93 d8          kmovw  %k0 -> %ebx
 +409  L4 @0x00007fdb2ac0ea60  c5 f8 92 c1          kmovw  %ecx -> %k0
 +413  L4 @0x00007fdb2ac0eb28  c5 fc 42 c9          kandnw %k0 %k1 -> %k1
 +417  m4 @0x00007fdb2ac0ebf0  c5 f8 92 c3          kmovw  %ebx -> %k0
 +421  m4 @0x00007fdb2ac0ec70  65 48 8b 1c 25 f8 00 mov    %gs:0x000000f8[8byte] -> %rbx
                               00 00
 +430  m4 @0x00007fdb2ac0f488                       <label>
 +430  L4 @0x00007fdb2ac0ed38  c4 e1 f4 47 c9       kxorq  %k1 %k1 -> %k1                       // Clear the mask reg.
 +435  m4 @0x00007fdb2ac0ee00  65 48 8b 0c 25 20 00 mov    %gs:0x20[8byte] -> %rcx
                               00 00
 +444  m4 @0x00007fdb2ac0ee98  48 8b 89 f0 0a 00 00 mov    0x00000af0(%rcx)[8byte] -> %rcx
 +451  m4 @0x00007fdb2ac0efb0  48 8b 09             mov    (%rcx)[8byte] -> %rcx
 +454  m4 @0x00007fdb2ac0bee8  48 8b 49 10          mov    0x10(%rcx)[8byte] -> %rcx
 +458  m4 @0x00007fdb2ac0c750  48 8b 49 08          mov    0x08(%rcx)[8byte] -> %rcx
 +462  m4 @0x00007fdb2ac0be68  62 f1 7c 48 28 01    vmovaps {%k0} (%rcx)[64byte] -> %zmm0       // Manually restore the scratch zmm reg.
 +468  m4 @0x00007fdb2ac0bca0  65 48 8b 0c 25 e8 00 mov    %gs:0x000000e8[8byte] -> %rcx        // Restore the first scratch GPR using drreg.
                               00 00
 +477  m4 @0x00007fdb2ac0c6d0  65 48 8b 14 25 f0 00 mov    %gs:0x000000f0[8byte] -> %rdx        // Restore the second scratch GPR using drreg.
                               00 00
 +486  m4 @0x00007fdb2ac0be00  3c 81                cmp    %al $0x81
 +488  m4 @0x00007fdb2ac0b8b8  9e                   sahf   %ah
 +489  m4 @0x00007fdb2ac0cc20  65 48 a1 e0 00 00 00 mov    %gs:0x000000e0[8byte] -> %rax        // Restore aflags using drreg.
                               00 00 00 00
 +500  m4 @0x00007fdb2ac0baa0                       <label>
```

As shown by the above expanded scatter and gather sequences, we require scratch
registers for the expansion. The GPR scratch registers are obtained using
drreg, whereas the scratch `zmm` register and the scratch mask register are
obtained by manually spilling them.

We need to make sure that we restore the application state correctly when a state
restoration event occurs, which can be a fault in one of the scalar loads or stores
in the expanded sequence, a fault in instrumentation added by some other DR client,
or some async event like DR detach. While the spilled registers obtained from drreg are
restored by the drreg state restoration logic, drx still needs to restore the
scratch mask register that is spilled manually to a GPR, and the scratch `zmm` register
that is spilled manually to a drx spill slot. We also need to ensure that the bit
for the previous access is cleared if the state restore event happened after the load
or store completed but before we could reflect it in the mask. When a state restore
event occurs, we walk the expanded sequence using a state machine until we reach the
faulting pc, keeping track of the state that needs to be restored
([commit](https://github.com/DynamoRIO/dynamorio/commit/de511a72197e9b04a9732f33622fe7bc7fd12623),
[commit](https://github.com/DynamoRIO/dynamorio/commit/e54913b4845794b96e8682ba6a214d54e10741a0)).
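
The walk can be pictured with a much-simplified model. Everything in the sketch below is hypothetical: the instruction kinds, the `model_instr_t` and `restore_plan_t` types, and `plan_restore` are invented for illustration and do not mirror the actual drx code, which has to decode real instructions. The sketch assumes that every instruction before `fault_idx` has executed and that the instruction at `fault_idx` has not.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Hypothetical classification of instructions in an expanded sequence.
typedef enum {
    K_OTHER,
    K_ZMM_SPILL,     // scratch zmm written to its spill slot
    K_ZMM_RESTORE,   // scratch zmm reloaded from its spill slot
    K_KREG_SPILL,    // scratch mask reg copied to a GPR
    K_KREG_RESTORE,  // scratch mask reg copied back from the GPR
    K_SCALAR_ACCESS, // the scalar load or store for one element
    K_MASK_CLEARED   // the instruction that clears that element's mask
} instr_kind_t;

typedef struct {
    instr_kind_t kind;
    int lane; // element number for access/mask kinds, otherwise unused
} model_instr_t;

typedef struct {
    bool restore_zmm;    // scratch zmm must be reloaded from its slot
    bool restore_kreg;   // scratch mask reg must be reloaded from the GPR
    int clear_mask_lane; // element whose mask must still be cleared, or -1
} restore_plan_t;

// Walks seq[0 .. fault_idx) and reports what a restore would need to undo.
static restore_plan_t
plan_restore(const model_instr_t *seq, int count, int fault_idx)
{
    restore_plan_t plan = { false, false, -1 };
    assert(seq != NULL && fault_idx >= 0 && fault_idx <= count);
    for (int i = 0; i < fault_idx; i++) {
        switch (seq[i].kind) {
        case K_ZMM_SPILL: plan.restore_zmm = true; break;
        case K_ZMM_RESTORE: plan.restore_zmm = false; break;
        case K_KREG_SPILL: plan.restore_kreg = true; break;
        case K_KREG_RESTORE: plan.restore_kreg = false; break;
        // The access completed, but the mask does not reflect it yet.
        case K_SCALAR_ACCESS: plan.clear_mask_lane = seq[i].lane; break;
        case K_MASK_CLEARED: plan.clear_mask_lane = -1; break;
        default: break;
        }
    }
    return plan;
}
```

In this model a fault in the scalar access itself leaves `clear_mask_lane` at -1, because the access did not complete, whereas an event that arrives between the access and the mask update reports the element whose mask still has to be cleared.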

As pointed out above, this expansion is done in the app2app phase. DR clients
may use drreg to get scratch registers for their instrumentation in later phases
(like insertion or instru2instru). Although drreg already supported some basic
usage outside of the insertion phase, it did not guard against bad interactions
caused by such multi-phase use. The following section describes the changes made
to drreg to support multi-phase use.


## Drreg Support For Multi-phase Reservations

Owner: [Abhinav Sharma](https://github.com/abhinav92003)

Upstream issue: [DynamoRIO/dynamorio#3823](https://github.com/DynamoRIO/dynamorio/issues/3823)

Drreg is DynamoRIO’s register reservation framework. It allows users to reserve a
register to use as scratch. Internally, drreg automatically does the following so
that the user does not need to:
- maintains all required bookkeeping, such as the mapping from spill slots to
spilled registers;
- restores spilled registers to their application values before they are read by an
application instruction, and re-spills them if they are written by an application
instruction;
- performs application state restoration on state restore events, such as an
application fault or DR detach.

While expanding a scatter or gather instruction in the app2app phase, we need a
scratch register to hold the scalar values and masks. In later phases (like the
insertion or the instru2instru phase), drcachesim and other DR clients may also use
drreg to get scratch registers for their instrumentation.

Drreg initially supported only insertion phase use, with some basic support in other
phases. Importantly, it did not attempt to avoid any bad interactions between the
multiple phases. To support multi-phase use of drreg, we needed to solve the following:
- avoid spill slot conflict across multiple phases: multi-phase use can potentially
lead to spill slot conflicts if the same slot is selected in multiple phases. This
may clobber the spilled application value and cause the application to crash or
otherwise fail.
- allow aflags spill to any slot: drreg hardcoded the aflags spill slot as the
zero-th slot, to simplify some logic. To support the ability to spill aflags in
multiple phases, drreg should be able to use any spill slot for aflags.
- application state restore logic: on a state restore event, we should be able to
figure out which slot contains each spilled register's app value. This is
complicated by the fact that registers may be spilled by instrumentation added by
multiple phases, and the spill regions may overlap which causes the spilled
application value to be moved between spill slots.

We explored the following ideas to avoid spill slot conflicts in drreg:

__Disjoint slot spaces or arenas__

We can ask drreg to create slot spaces or arenas at init time, which are assigned
disjoint spill slots. When reserving a register, the user passes in a "space/arena
Id" to instruct drreg to pick free slots only from that arena. This requires
keeping some global drreg state. This also requires the user to guess the best
configuration for assigning slots to the arenas, and passing the correct arena Id
before each reservation. It may artificially make some spill slots unavailable for
use, thereby reducing efficiency.


__Assign phase Id to slots__

Instead of creating slot spaces at init time with a best-guess assignment of slots,
we can instead assign a phase Id to slots when they are requested in that phase. We
then avoid using slots that are already assigned a phase Id, when we are not in
that phase where the slot was used before. This also requires keeping some global
drreg state. This does not help in avoiding spill slot conflicts between multiple
clients in the same phase.


__Preferred: Scan fragment to determine eligible slots__

When picking a spill slot, we can determine whether using it will cause a slot
conflict by scanning for its uses in the current fragment after the current
instruction. We pick only a spill slot that has no later uses in the
current fragment. This does not require any init-time guesses or keeping any global
drreg state. It does not impose any additional responsibilities on the users, and
it also works for multiple clients in the same phase. This was implemented to pick
spill slots for GPRs
([commit](https://github.com/DynamoRIO/dynamorio/commit/238bb25de02d4741f52ec368e2118244329d76f0))
and aflags too
([commit](https://github.com/DynamoRIO/dynamorio/commit/a3d0419e4d05113ca8c665f4ef0edb970e3bcf58)).
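
The scan can be sketched as follows. This is a toy model under stated assumptions, not drreg's code: the fragment is reduced to an array in which each entry names the spill slot that instruction references (or `NO_SLOT`), and `pick_spill_slot` is a hypothetical helper. Real drreg must additionally skip slots that its own live reservations occupy, which this sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NO_SLOT (-1)

// slot_used_by_instr[i] is the spill slot referenced by instruction i of the
// fragment, or NO_SLOT. Returns the lowest-numbered slot with no use after
// cur_instr, or NO_SLOT if every slot is used later in the fragment.
static int
pick_spill_slot(const int *slot_used_by_instr, int num_instrs, int cur_instr,
                int num_slots)
{
    assert(slot_used_by_instr != NULL);
    assert(cur_instr >= 0 && cur_instr < num_instrs);
    for (int slot = 0; slot < num_slots; slot++) {
        bool used_later = false;
        for (int i = cur_instr + 1; i < num_instrs; i++) {
            if (slot_used_by_instr[i] == slot) {
                used_later = true;
                break;
            }
        }
        if (!used_later)
            return slot;
    }
    return NO_SLOT;
}
```

For instance, if an earlier phase already placed uses of slot 0 and slot 1 later in the fragment, a reservation made before them would be steered to slot 2, while a reservation made after the last of those uses can safely take slot 0 again.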


### State Restoration For Drreg

Owner: [Abhinav Sharma](https://github.com/abhinav92003)

Upstream issue: [DynamoRIO/dynamorio#3823](https://github.com/DynamoRIO/dynamorio/issues/3823), [DynamoRIO/dynamorio#3801](https://github.com/DynamoRIO/dynamorio/issues/3801)

On a state restore event, drreg should be able to restore all spilled registers to
their application values.

Unfortunately, when a state restore event happens, we only have the encoded
fragment, and none of the drreg state, like the register to spill slot mappings. We
need to reconstruct this state based on the faulting pc and the encoded fragment.


It is complex to determine which registers need to be restored and from which spill
slot. This is because drreg automatically adds spill and restore instructions to
handle various cases, like automatic re-spilling of reserved registers after an
application instruction writes them, and automatic restore of reserved registers
before an application instruction reads them. Drreg also uses various optimisations
like lazy restores for application values in case the register is reserved again.
This is even more complex for aflags, for which spill and restore each require at
least two steps: spilling aflags involves reading aflags into a register using
`lahf` and then writing that register to a spill slot; restoring aflags involves
reading aflags from its spill slot to a register, and then writing aflags from that
register using `sahf`. An additional step is needed for reading or writing the
overflow flag if required. In some cases, aflags are even kept in a register as an
optimisation.

Additionally, in multi-phase use, a register may be spilled by multiple phases,
with a separate spill slot for each phase. The application value for the register
may reside in one or more spill slots, and may also move between spill slots based
on how the spill regions from different phases overlap. See various tricky
scenarios in
[drreg-test.c](https://github.com/DynamoRIO/dynamorio/blob/f1d496b451eaa6e9aaff7125617030164c6cfdff/suite/tests/client-interface/drreg-test.c#L637).

We explored two ways to adapt drreg’s state restoration logic to multi-phase use.
This also fixed some known existing issues with drreg:
[DynamoRIO/dynamorio#4933](https://github.com/DynamoRIO/dynamorio/issues/4933),
[DynamoRIO/dynamorio#4939](https://github.com/DynamoRIO/dynamorio/issues/4939).

__Track app values as they are moved between slots and registers__

At a state restoration event, we walk the faulting fragment from the beginning to
the faulting instruction, and we keep track of where the native value of each
register resides. At any point, it may be in the register itself, in a spill slot,
or in both. We track `gpr_is_native` to denote whether a register contains its
native app value or not; and `spill_slot_to_reg`, to denote which register’s app
value a spill slot contains.

- When a register is written by an application instruction, we invalidate all
`spill_slot_to_reg` entries that are mapped to that register, and also set
`gpr_is_native` for that register.
- When a register is written by a non-drreg meta instruction, we clear
`gpr_is_native` for that reg.
- When a register is loaded by drreg from the slot it was spilled to, we set
`gpr_is_native`.
- When a register is spilled to some spill slot, we set `spill_slot_to_reg` for
that spill slot to that reg.

This strategy allows us to robustly keep track of the various corner cases that can
arise in drreg, like spill regions from different phases overlapping (nesting or
just overlapping), and the other known issues linked above. This was implemented by
this
[commit](https://github.com/DynamoRIO/dynamorio/commit/f62441bc2a0ced41263e2229deb3691433d6abb9).
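
The four tracking rules can be sketched over a toy event stream. This is a
simplified illustration, not the code from the linked commit: `toy_event_t`,
`track_walk` and `slot_with_app_value` are hypothetical names. Two details go
beyond the rules listed above and are assumptions of this sketch: a spill of a
register that does not hold its app value clears the slot's mapping, and a load
from a slot that is not mapped to the register leaves the register non-native.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_REGS 4
#define NUM_SLOTS 4
#define NONE (-1)

typedef enum { EV_APP_WRITE, EV_META_WRITE, EV_DRREG_SPILL, EV_DRREG_RESTORE } ev_kind_t;

// Toy event: `slot` is only meaningful for spill and restore events.
typedef struct {
    ev_kind_t kind;
    int reg;
    int slot;
} toy_event_t;

typedef struct {
    bool gpr_is_native[NUM_REGS];
    int spill_slot_to_reg[NUM_SLOTS];
} track_state_t;

static void
track_apply(track_state_t *st, const toy_event_t *ev)
{
    switch (ev->kind) {
    case EV_APP_WRITE:
        // Slots holding the old app value of this register are now stale.
        for (int s = 0; s < NUM_SLOTS; s++) {
            if (st->spill_slot_to_reg[s] == ev->reg)
                st->spill_slot_to_reg[s] = NONE;
        }
        st->gpr_is_native[ev->reg] = true;
        break;
    case EV_META_WRITE: st->gpr_is_native[ev->reg] = false; break;
    case EV_DRREG_SPILL:
        // Assumption: only a native register puts an app value in the slot.
        st->spill_slot_to_reg[ev->slot] = st->gpr_is_native[ev->reg] ? ev->reg : NONE;
        break;
    case EV_DRREG_RESTORE:
        // Native only if the slot holds this register's app value.
        st->gpr_is_native[ev->reg] = (st->spill_slot_to_reg[ev->slot] == ev->reg);
        break;
    }
}

// Replays events [0, fault_idx) starting from an all-native state.
static void
track_walk(track_state_t *st, const toy_event_t *ev, size_t fault_idx)
{
    for (int r = 0; r < NUM_REGS; r++)
        st->gpr_is_native[r] = true;
    for (int s = 0; s < NUM_SLOTS; s++)
        st->spill_slot_to_reg[s] = NONE;
    for (size_t i = 0; i < fault_idx; i++)
        track_apply(st, &ev[i]);
}

// Returns the slot to restore `reg` from, or NONE if no restore is needed
// (or possible) for that register.
static int
slot_with_app_value(const track_state_t *st, int reg)
{
    assert(reg >= 0 && reg < NUM_REGS);
    if (st->gpr_is_native[reg])
        return NONE;
    for (int s = 0; s < NUM_SLOTS; s++) {
        if (st->spill_slot_to_reg[s] == reg)
            return s;
    }
    return NONE;
}
```

With two nested spill regions for the same register, a fault inside the inner
region resolves to the outer region's slot, because only that slot was written
while the register was still native.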

The drawback of this approach is that it needs to be aware of other methods of
spilling and restoring registers outside drreg ([dropped PR](https://github.com/DynamoRIO/dynamorio/pull/4987)).
DynamoRIO uses various such methods internally (spilling to the stack, or to slots
not managed by drreg), and clients may also use their own methods. So, some
non-drreg meta instructions may actually restore an application value to a
register, but this approach cannot recognize that, which may cause it
to lose track of some register’s application value. We dropped this approach on
encountering
[DynamoRIO/dynamorio#4963](https://github.com/DynamoRIO/dynamorio/issues/4963).

__Preferred: Pairing restores with spills (instead of the other way around)__

The key observation behind this approach is that it is easier to find the matching
spill for a given restore than to find the matching restore for a given spill.
This is because there may be other restores besides the final restore, e.g.
restores for app reads, user-prompted restores, etc. This makes it hard to find
exactly where the spill region for a register/aflags ends. An additional complexity
is that aflags re-spills may not use the same slot, which makes it difficult to
differentiate spills from multiple phases.

Each restore must have a matching spill. Based on this observation, we scan the
faulting fragment from end to beginning, matching register restores to their
spills. When we reach the faulting instruction, any restore for which we have not
yet seen the matching spill must be performed by the drreg state restoration. This
was implemented by this
[commit](https://github.com/DynamoRIO/dynamorio/commit/be78c124eb787601182e3c73f4be8bb859c50ef8).
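
The reverse scan can be sketched on a toy fragment. This is a deliberately
simplified model, not the code from the linked commit: `toy_instr_t` and
`find_pending_restores` are hypothetical names, aflags are ignored, and a spill
is assumed to match a pending restore when both name the same register and slot.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_REGS 4
#define NO_SLOT (-1)

typedef enum { TOY_OTHER, TOY_SPILL, TOY_RESTORE } toy_kind_t;

// Toy instruction: `reg` and `slot` are only meaningful for spills and restores.
typedef struct {
    toy_kind_t kind;
    int reg;
    int slot;
} toy_instr_t;

// Scans from the end of the fragment back to the faulting instruction.
// On return, pending_slot[r] is the slot that register r must be restored
// from, or NO_SLOT if r needs no restore.
static void
find_pending_restores(const toy_instr_t *frag, size_t len, size_t fault_idx,
                      int pending_slot[NUM_REGS])
{
    assert(fault_idx <= len);
    for (int r = 0; r < NUM_REGS; r++)
        pending_slot[r] = NO_SLOT;
    for (size_t i = len; i > fault_idx; i--) {
        const toy_instr_t *in = &frag[i - 1];
        if (in->kind == TOY_RESTORE) {
            // The last restore in program order is seen first; keep it.
            if (pending_slot[in->reg] == NO_SLOT)
                pending_slot[in->reg] = in->slot;
        } else if (in->kind == TOY_SPILL) {
            // The matching spill lies after the fault, so it never executed
            // and the restore has nothing to undo.
            if (pending_slot[in->reg] == in->slot)
                pending_slot[in->reg] = NO_SLOT;
        }
    }
}
```

Any restore still pending when the scan stops has its spill before the faulting
instruction, so that spill already executed and the register must be reloaded
from the recorded slot.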

This algorithm does not need to be aware of non-drreg methods of spilling/restoring
registers. Note that, like the general drreg operation, this method does not
restore the application value of a spilled GPR/aflags if it is dead at the
faulting instruction. However, even dead registers need to be restored when
`drreg_options_t.conservative` is set. This can be handled if there is additional
metadata available to the drreg state restore callback
([DynamoRIO/dynamorio#3801](https://github.com/DynamoRIO/dynamorio/issues/3801)).


## Simplifying Instrumentation For Emulated Instructions

Owner: [Derek Bruening](https://github.com/derekbruening)

Upstream Issue: [DynamoRIO/dynamorio#4865](https://github.com/DynamoRIO/dynamorio/issues/4865)

Emulated sequences like the expanded scatter and gather sequence described above
pose another challenge for clients that need to observe both instructions and
memory references. For observing instructions, these clients should see the
original application instruction (that is, the scatter or gather instruction),
whereas for observing memory references, they should see the emulated sequence
(that is, all the individual scalar stores or loads). DynamoRIO should absorb this
complexity and provide the required events to the client.

We implemented the `drmgr_orig_app_instr_for_fetch`,
`drmgr_orig_app_instr_for_operands` and `drmgr_in_emulation_region` APIs
([commit](https://github.com/DynamoRIO/dynamorio/commit/eb5d5af8e3444912c9f3f70e5ebf7969252ee4d6),
[commit](https://github.com/DynamoRIO/dynamorio/commit/6d84fea04a036038db5a3af2e979e77d2cd356c0)),
which let the client obtain the appropriate instruction
for either instruction instrumentation or memory reference
instrumentation. These were subsequently used in drcachesim as well
([commit](https://github.com/DynamoRIO/dynamorio/commit/8b9be0fd04e40deb41d993e8d846b69160fb4f04)).


## Support For Vector Reservation

Owner: [Abhinav Sharma](https://github.com/abhinav92003)

The scatter and gather expansions require a scratch `xmm` register, for which we
need the capability to spill and restore vector registers. The design choices are
as follows:

- Extend drreg to support reservation for vector registers.
  [DynamoRIO/dynamorio#3844](https://github.com/DynamoRIO/dynamorio/issues/3844)
  aims to add this support.

- Use custom spill and restore logic in drx. We can do this by reserving memory in
  TLS to use as a spill slot.


Some observations about this use case for vector reservation:

- We need to spill only one vector register, so we do not need sophisticated spill
  slot management logic.
- The spilled vector register will not need to be restored for app reads, or
  re-spilled after app writes. We will not encounter any application
  instructions that use the spilled vector register, because it needs to be spilled
  only for the duration of the expanded scatter or gather sequence.

Extending drreg to support vector spilling is a complex task. Given the above
observations, the current use case does not justify the effort. Therefore, we chose
to implement custom spill logic in drx
([commit](https://github.com/DynamoRIO/dynamorio/commit/88cba2817fef7a4d4ba3e8c2375784ea4165c133),
[commit](https://github.com/DynamoRIO/dynamorio/commit/84bf9288a12e8e38abab04d4d8273cc3226fa13c)).


## Using The Expansion In DR Clients

Owner: [Abhinav Sharma](https://github.com/abhinav92003)

Clients that need to observe each memory reference must use the
`drx_expand_scatter_gather` API. This was added in the app2app phase of drcachesim
and other DynamoRIO clients
([commit](https://github.com/DynamoRIO/dynamorio/commit/cf9d6a95262015581a5184e75ce599cc66ac4df4)).
This also required fixing some issues (crashes and correctness problems) that
surfaced when all pieces were integrated
([commit](https://github.com/DynamoRIO/dynamorio/commit/ced6e253b2e6bb7f1402398798d4bd8a988dacd0),
[commit](https://github.com/DynamoRIO/dynamorio/commit/e9f05212a7c1f2ff969cabdacd92914865345303)).

# Testing On Large Apps

Owner: [Abhinav Sharma](https://github.com/abhinav92003)

drcachesim was successfully used to trace an application with scatter and gather
instructions. The resulting trace was observed to have millions of such
instructions. We also verified correctness by comparing application output with and
without tracing.


 ***************************************************************************
 */