/* **********************************************************
 * Copyright (c) 2015-2023 Google, Inc.  All rights reserved.
 * **********************************************************/

/*
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions are met:
 *
 * * Redistributions of source code must retain the above copyright notice,
 *   this list of conditions and the following disclaimer.
 *
 * * Redistributions in binary form must reproduce the above copyright notice,
 *   this list of conditions and the following disclaimer in the documentation
 *   and/or other materials provided with the distribution.
 *
 * * Neither the name of Google, Inc. nor the names of its contributors may be
 *   used to endorse or promote products derived from this software without
 *   specific prior written permission.
 *
 * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
 * AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
 * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
 * ARE DISCLAIMED. IN NO EVENT SHALL GOOGLE, INC. OR CONTRIBUTORS BE LIABLE
 * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
 * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
 * SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
 * CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
 * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
 * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH
 * DAMAGE.
 */

/* We've upgraded these sections to pages, but kept the "sec_" names to
 * keep them off the Available Tools top-level menu.
 */

/**
***************************************************************************
***************************************************************************
\page page_drcachesim Tracing and Analysis Framework

\p drcachesim is a DynamoRIO client that collects instruction and
memory access traces using its \p drmemtrace component and
feeds them to either an online or offline tool for analysis.  The default
analysis tool is a CPU cache simulator, while other provided tools compute
metrics such as reuse distance.  The trace collector and simulator support
multiple processes each with multiple threads.  The analysis tool framework
is extensible, supporting the creation of new tools which can operate both
online and offline.

\b News: There are some new features in drmemtrace traces: conditional
branches are now marked as taken or untaken, and indirect branch
targets are provided up front.  These join the recent features of
[embedded instruction encodings and fast seeking](docs/new-features-encodings-seek.pdf).

 - \subpage sec_drcachesim
 - \subpage sec_drcachesim_format
 - \subpage sec_drcachesim_run
 - \subpage sec_drcachesim_tools
 - \subpage google_workload_traces
 - \subpage sec_drcachesim_config_file
 - \subpage sec_drcachesim_offline
 - \subpage sec_drcachesim_filter
 - \subpage sec_drcachesim_partial
 - \subpage sec_drcachesim_sim
 - \subpage sec_drcachesim_analyzer
 - \subpage sec_drcachesim_phys
 - \subpage sec_drcachesim_core
 - \subpage sec_drcachesim_extend
 - \subpage sec_drcachesim_tracer
 - \subpage sec_drcachesim_funcs
 - \subpage sec_drcachesim_newtool
 - \subpage sec_drcachesim_ops
 - \subpage sec_drcachesim_limit
 - \subpage sec_drcachesim_cmp

****************************************************************************
\page sec_drcachesim Overview

\p drcachesim consists of two components: a tracer \p drmemtrace and an analyzer.
The tracer collects a memory access trace from each thread within each
application process.
The analyzer consumes the traces (online or offline) and performs
customized analysis.
It is designed to be extensible, allowing users to easily implement a
simulator for different devices, such as CPU caches, TLBs, page
caches, etc. (see \ref sec_drcachesim_extend), or to build arbitrary trace
analysis tools (see \ref sec_drcachesim_newtool).
The default analyzer simulates the architectural behavior of caching
devices for a target application (or multiple applications).

****************************************************************************
\page sec_drcachesim_format Trace Format

\p drmemtrace traces are records of all user mode retired instructions
and memory accesses during the traced window.

A trace is presented to analysis tools as a stream of records.  Each
record entry is of type #dynamorio::drmemtrace::memref_t and
represents one instruction or data reference or a metadata operation
such as a thread exit or marker.  There are built-in scheduling
markers providing the timestamp and cpu identifier periodically and before and
after each system call.  Other built-in markers indicate disruptions in user mode
control flow such as signal handler entry and exit.

Each entry contains the common fields \p type, \p pid, and \p tid. The
\p type field is used to identify the kind of each entry via a value
of type #dynamorio::drmemtrace::trace_type_t.  The \p pid and
\p tid identify the process and software thread owning the entry.  By
default, all traced software threads are interleaved together, but
with offline traces (see \ref sec_drcachesim_offline) each thread's
trace can easily be analyzed separately as they are stored in separate
files.

\section sec_drcachesim_format_instrs Instruction Records

Executed instructions are stored in
#dynamorio::drmemtrace::_memref_instr_t.  The program counter and
length of the encoded instruction are provided.  The length can be
used to compute the address of the subsequent instruction.
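
As a minimal sketch of that computation, assuming a simplified stand-in for
the instruction record (the struct \p instr_record and the function
\p fallthrough_pc below are hypothetical names for illustration and are not
part of the drmemtrace API), the address of the subsequent instruction is
the program counter plus the encoded length:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the PC and length carried by an instruction record.
struct instr_record {
    std::uint64_t pc; // Program counter of the instruction.
    std::size_t size; // Length of the encoded instruction in bytes.
};

// Returns the address of the instruction that follows this one in memory.
// This is the next executed instruction only when control falls through.
inline std::uint64_t
fallthrough_pc(const instr_record &instr)
{
    return instr.pc + instr.size;
}
```

For instance, a 3-byte instruction at 0x7fc205a61940 is followed in memory
by the instruction at 0x7fc205a61943.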

The raw encoding of the instruction is provided.  This can be decoded
using the drdecode decoder or any other decoder.  An additional field
`encoding_is_new` is provided to indicate when any cached decoding
information should be invalidated due to possibly changed application
code.  (For online traces, encodings are not provided unless the
option `-instr_encodings` is passed, as encodings add overhead and
are not needed for many tools.)

Older legacy traces may not contain instruction encodings.  For those
traces, encodings for static code can be obtained by
disassembling the application and
library binaries.  The provided interfaces
module_mapper_t::get_loaded_modules() and
module_mapper_t::find_mapped_trace_address() facilitate loading in
copies of the binaries and reading the raw bytes for each instruction
in order to obtain the opcode and full operand information.
See also \ref sec_drcachesim_core.

Whether conditional branches are taken or untaken is indicated by the
instruction types #dynamorio::drmemtrace::TRACE_TYPE_INSTR_TAKEN_JUMP
and #dynamorio::drmemtrace::TRACE_TYPE_INSTR_UNTAKEN_JUMP.  The target
of each indirect branch is explicitly provided by the "indirect_branch_target"
field in #dynamorio::drmemtrace::memref_t.
If the program flow is changed by the kernel such as by
signal delivery, the branch target is explicitly recorded in the trace in a metadata
marker entry of type #dynamorio::drmemtrace::TRACE_MARKER_TYPE_KERNEL_EVENT.
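
A tool could use the taken or untaken information to predict where the next
instruction record of a thread should start.  The sketch below is only an
illustration under stated assumptions: the types \p branch_kind and
\p instr_record and the function \p expected_next_pc are hypothetical, the
branch target is assumed to have been obtained already (by decoding a direct
branch or from the indirect branch target field), and kernel-initiated
transfers are ignored because markers announce them separately.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical classification of an instruction record for this sketch.
enum class branch_kind { none, taken, untaken };

// Hypothetical, simplified view of an instruction record.
struct instr_record {
    std::uint64_t pc;     // Program counter of the instruction.
    std::size_t size;     // Length of the encoded instruction in bytes.
    branch_kind kind;     // Whether this is a taken or untaken branch.
    std::uint64_t target; // Only meaningful when kind == branch_kind::taken.
};

// Predicts the PC of the next instruction record in the same thread.
// A taken branch continues at its target; everything else falls through.
inline std::uint64_t
expected_next_pc(const instr_record &instr)
{
    if (instr.kind == branch_kind::taken)
        return instr.target;
    return instr.pc + instr.size;
}
```

Under these assumptions, an untaken 2-byte conditional branch at
0x7fc2c3aa5c70 is expected to be followed by the instruction at
0x7fc2c3aa5c72, while the same branch when taken continues at its target.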

\section sec_drcachesim_format_data Memory Access Records

Memory accesses (data loads and stores) are stored in
#dynamorio::drmemtrace::_memref_data_t.  The program counter of the instruction
performing the memory access, the virtual address (convertible to physical: see \ref
sec_drcachesim_phys), and the size are provided.

\section sec_drcachesim_format_other Other Records

Besides instruction and memory records, other trace entry types include
#dynamorio::drmemtrace::_memref_marker_t, #dynamorio::drmemtrace::_memref_flush_t,
#dynamorio::drmemtrace::_memref_thread_exit_t, etc. These records provide specific
information about events that can alter the program flow or the system's states.

Trace markers are particularly important to allow reconstruction of the program
execution.  Marker records in #dynamorio::drmemtrace::_memref_marker_t provide metadata
identifying some event that occurred at this point in the trace.  Each marker record
contains two additional fields:

- \p marker_type - identifies the type of marker
- \p marker_value - carries the value of the marker

Some of the more important markers are:

- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_KERNEL_EVENT - This identifies kernel-initiated control transfers such as signal delivery.  The next instruction record is the start of the handler for a kernel-initiated event.  The value of this type of marker contains the program counter at the kernel interruption point.  If the interruption point is just after a branch, this value is the target of that branch.

- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_KERNEL_XFER - This identifies a system call that changes control flow, such as a signal return.

- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_TIMESTAMP - The marker value provides a timestamp for this point of the trace (in units of microseconds since Jan 1, 1601 UTC). This value can be used to synchronize records from different threads as well as analyze latencies (however, tracing overhead inflates time unevenly, so time deltas should not be considered perfectly representative). It is used in the sequential analysis of a multi-threaded trace.

- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_CPU_ID - The marker value contains the CPU identifier on which the subsequent records were collected. It is useful to help track thread migrations occurring during execution. This marker is written to the header of each trace buffer when the buffer is flushed. Note that if the thread migrates to a different CPU due to preemption by the kernel before a buffer is full, we do not output a separate #dynamorio::drmemtrace::TRACE_MARKER_TYPE_CPU_ID marker to capture the previous CPU identifier. However, we expect such cases to be rare.

- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FUNC_ID, #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FUNC_RETADDR, #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FUNC_ARG, #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FUNC_RETVAL - These markers are used to capture information about function calls.  Which functions to capture must be explicitly selected at tracing time.  Typical candidates are heap allocation and freeing functions.  See \ref sec_drcachesim_funcs.

- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_WINDOW_ID - The marker value contains the ordinal of the trace burst window when the subsequent entries until the next #dynamorio::drmemtrace::TRACE_MARKER_TYPE_WINDOW_ID or end-of-trace were collected (see \ref sec_drcachesim_partial).

- #dynamorio::drmemtrace::TRACE_MARKER_TYPE_SYSCALL - This identifies a system call.  A timestamp is inserted in the trace before and after each marker of this type.  This marker should be considered to be the actual system call invocation by the kernel, rather than the prior system call gateway instruction fetch record.  Thus, these timestamps provide the system call latency.

The full set of markers is listed under the enum #dynamorio::drmemtrace::trace_marker_type_t.
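
Because timestamp marker values count microseconds since Jan 1, 1601 UTC, a
tool that wants Unix time has to subtract the offset between the two epochs
(11,644,473,600 seconds).  The following is a minimal sketch of that
arithmetic; the constant and function names are hypothetical and are not
part of the drmemtrace API.

```cpp
#include <cstdint>

// Seconds between 1601-01-01 and 1970-01-01, both UTC.
constexpr std::uint64_t kSecondsFrom1601To1970 = 11644473600ULL;
constexpr std::uint64_t kMicrosPerSecond = 1000000ULL;

// Converts a timestamp marker value to microseconds since the Unix epoch.
// Assumes the marker value is not earlier than 1970.
inline std::uint64_t
marker_timestamp_to_unix_micros(std::uint64_t marker_value)
{
    return marker_value - kSecondsFrom1601To1970 * kMicrosPerSecond;
}

// Returns the microseconds elapsed between two timestamp marker values.
// Assumes later >= earlier.
inline std::uint64_t
elapsed_micros(std::uint64_t earlier, std::uint64_t later)
{
    return later - earlier;
}
```

As noted above, tracing overhead inflates time unevenly, so a delta computed
this way is an approximation of the untraced latency.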

****************************************************************************
\page sec_drcachesim_run Running the Simulator

To launch \p drcachesim, use the \p -t flag to \p drrun:

\code
$ bin64/drrun -t drcachesim -- /path/to/target/app <args> <for> <app>
\endcode

The target application will be launched under a DynamoRIO tracer
client that gathers all of its memory references and passes them to
the simulator via a pipe.  (See \ref sec_drcachesim_offline for how to dump
a trace for offline analysis.)
Any child processes are followed and profiled as well, with their
memory references also passed to the simulator.

Here is an example:

\code
$ bin64/drrun -t drcachesim -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Cache simulation results:
Core #0 (1 thread(s))
  L1I stats:
    Hits:                          258,433
    Misses:                          1,148
    Miss rate:                        0.44%
  L1D stats:
    Hits:                           93,654
    Misses:                          2,624
    Prefetch hits:                     458
    Prefetch misses:                 2,166
    Miss rate:                        2.73%
Core #1 (1 thread(s))
  L1I stats:
    Hits:                            8,895
    Misses:                             99
    Miss rate:                        1.10%
  L1D stats:
    Hits:                            3,448
    Misses:                            156
    Prefetch hits:                      26
    Prefetch misses:                   130
    Miss rate:                        4.33%
Core #2 (1 thread(s))
  L1I stats:
    Hits:                            4,150
    Misses:                            101
    Miss rate:                        2.38%
  L1D stats:
    Hits:                            1,578
    Misses:                            130
    Prefetch hits:                      25
    Prefetch misses:                   105
    Miss rate:                        7.61%
Core #3 (0 thread(s))
LL stats:
    Hits:                            1,414
    Misses:                          2,844
    Prefetch hits:                     824
    Prefetch misses:                 1,577
    Local miss rate:                 66.79%
    Child hits:                    370,667
    Total miss rate:                  0.76%
\endcode
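
The rates in the listing above are consistent with the definitions sketched
below.  This is an inference from the printed numbers rather than a
description of the simulator's implementation, and the function names are
hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Local miss rate: misses relative to the accesses that reached this level.
inline double
local_miss_rate(std::uint64_t hits, std::uint64_t misses)
{
    assert(hits + misses > 0); // At least one access is required.
    return static_cast<double>(misses) / static_cast<double>(hits + misses);
}

// Total miss rate: misses relative to all accesses, including those that
// hit in a child cache and therefore never reached this level.
inline double
total_miss_rate(std::uint64_t hits, std::uint64_t misses, std::uint64_t child_hits)
{
    assert(hits + misses + child_hits > 0); // At least one access is required.
    return static_cast<double>(misses) /
        static_cast<double>(hits + misses + child_hits);
}
```

With the LL figures shown (1,414 hits, 2,844 misses, 370,667 child hits),
the local rate is 2,844 / 4,258, or about 66.79%, and the total rate is
2,844 / 374,925, or about 0.76%.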

****************************************************************************
\page sec_drcachesim_tools Analysis Tool Suite

In addition to a CPU cache simulator, other analysis tools are
available that operate on memory address traces.  Which tool is used
can be selected with the \p -simulator_type parameter.  New, custom
tools can also be created, as described in \ref sec_drcachesim_newtool.

- \ref sec_tool_cache_sim
- \ref sec_tool_TLB_sim
- \ref sec_tool_reuse_distance
- \ref sec_tool_reuse_time
- \ref sec_tool_basic_counts
- \ref sec_tool_opcode_mix
- \ref sec_tool_view
- \ref sec_tool_func_view
- \ref sec_tool_histogram
- \ref sec_tool_invariant_checker
- \ref sec_tool_syscall_mix

\section sec_tool_cache_sim Cache Simulator

This is the default tool.  Here is an example of running it on an offline trace:
\code
$ bin64/drrun -t drcachesim -offline -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
$ bin64/drrun -t drcachesim -indir drmemtrace.*.dir
Cache simulation results:
Core #0 (1 thread(s))
  L1I stats:
    Hits:                          258,433
    Misses:                          1,148
    Miss rate:                        0.44%
  L1D stats:
    Hits:                           93,654
    Misses:                          2,624
    Prefetch hits:                     458
    Prefetch misses:                 2,166
    Miss rate:                        2.73%
Core #1 (1 thread(s))
  L1I stats:
    Hits:                            8,895
    Misses:                             99
    Miss rate:                        1.10%
  L1D stats:
    Hits:                            3,448
    Misses:                            156
    Prefetch hits:                      26
    Prefetch misses:                   130
    Miss rate:                        4.33%
Core #2 (1 thread(s))
  L1I stats:
    Hits:                            4,150
    Misses:                            101
    Miss rate:                        2.38%
  L1D stats:
    Hits:                            1,578
    Misses:                            130
    Prefetch hits:                      25
    Prefetch misses:                   105
    Miss rate:                        7.61%
Core #3 (0 thread(s))
LL stats:
    Hits:                            1,414
    Misses:                          2,844
    Prefetch hits:                     824
    Prefetch misses:                 1,577
    Local miss rate:                 66.79%
    Child hits:                    370,667
    Total miss rate:                  0.76%
\endcode

\section sec_tool_TLB_sim TLB Simulator

To simulate TLB devices instead of caches, pass \p TLB to \p -simulator_type:

\code
$ bin64/drrun -t drcachesim -simulator_type TLB -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
TLB simulation results:
Core #0 (1 thread(s))
  L1I stats:
    Hits:                          252,412
    Misses:                            401
    Miss rate:                        0.16%
  L1D stats:
    Hits:                           87,132
    Misses:                          9,127
    Miss rate:                        9.48%
  LL stats:
    Hits:                            9,315
    Misses:                            213
    Local miss rate:                  2.24%
    Child hits:                    339,544
    Total miss rate:                  0.06%
Core #1 (1 thread(s))
  L1I stats:
    Hits:                            8,709
    Misses:                             20
    Miss rate:                        0.23%
  L1D stats:
    Hits:                            3,544
    Misses:                             55
    Miss rate:                        1.53%
  LL stats:
    Hits:                               15
    Misses:                             60
    Local miss rate:                 80.00%
    Child hits:                     12,253
    Total miss rate:                  0.49%
Core #2 (1 thread(s))
  L1I stats:
    Hits:                            1,622
    Misses:                             21
    Miss rate:                        1.28%
  L1D stats:
    Hits:                              689
    Misses:                             35
    Miss rate:                        4.83%
  LL stats:
    Hits:                                3
    Misses:                             53
    Local miss rate:                 94.64%
    Child hits:                      2,311
    Total miss rate:                  2.24%
Core #3 (0 thread(s))
\endcode

\section sec_tool_reuse_distance Reuse Distance

To compute reuse distance metrics:

\code
$ bin64/drrun -t drcachesim -simulator_type reuse_distance -reuse_distance_histogram -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Reuse distance tool aggregated results:
Total accesses: 349632
Unique accesses: 196603
Unique cache lines accessed: 4235

Reuse distance mean: 14.64
Reuse distance median: 1
Reuse distance standard deviation: 104.10
Reuse distance histogram:
Distance       Count  Percent  Cumulative
       0      153029   44.36%   44.36%
       1      101294   29.37%   73.73%
       2       14116    4.09%   77.82%
       3       14248    4.13%   81.95%
       4        8894    2.58%   84.53%
       5        2733    0.79%   85.32%
...
==================================================
Reuse distance tool results for shard 29327 (thread 29327):
Total accesses: 335084
Unique accesses: 187927
Unique cache lines accessed: 4148

Reuse distance mean: 14.77
Reuse distance median: 1
Reuse distance standard deviation: 106.02
Reuse distance histogram:
Distance       Count  Percent  Cumulative
       0      147157   44.47%   44.47%
       1       96820   29.26%   73.72%
       2       13613    4.11%   77.84%
       3       13834    4.18%   82.02%
       4        8666    2.62%   84.64%
       5        2552    0.77%   85.41%
...
    3658          29    0.01%  100.00%
    3851           1    0.00%  100.00%

Reuse distance threshold = 100 cache lines
Top 10 frequently referenced cache lines
        cache line:     #references   #distant refs
    0x7f2a86b3fd80:        27980,            0
    0x7f2a86b3fdc0:        18823,            0
    0x7f2a88388fc0:        16409,          111
    0x7f2a8838abc0:        15176,            6
    0x7f2a883884c0:         9930,           20
    0x7f2a88388480:         7944,           20
    0x7f2a88388500:         7574,           20
    0x7f2a88398d00:         7390,          100
    0x7f2a86b3fd40:         6668,            0
    0x7f2a88388440:         5717,           20
Top 10 distant repeatedly referenced cache lines
        cache line:     #references   #distant refs
    0x7f2a885a4180:          246,          132
    0x7f2a87504ec0:          202,          128
    0x7f2a875044c0:          323,          126
    0x7f2a885a4480:          220,          126
    0x7f2a87504f00:          293,          124
    0x7f2a86fd7e00:          289,          124
    0x7f2a875049c0:          221,          124
    0x7f2a875053c0:          270,          122
    0x7f2a86db9c00:          269,          122
    0x7f2a875047c0:          201,          122

==================================================
Reuse distance tool results for shard 29328 (thread 29328):
Total accesses: 12216
Unique accesses: 7251
Unique cache lines accessed: 319

Reuse distance mean: 12.98
Reuse distance median: 1
Reuse distance standard deviation: 38.19
Reuse distance histogram:
Distance       Count  Percent  Cumulative
       0        4965   41.73%   41.73%
       1        3758   31.59%   73.32%
       2         411    3.45%   76.78%
       3         348    2.93%   79.70%
       4         179    1.50%   81.21%
       5         152    1.28%   82.48%
...
\endcode

\section sec_tool_reuse_time Reuse Time

A reuse time tool is also provided, which counts the total number of memory
accesses (without considering uniqueness) between accesses to the same
address:

\code
$ bin64/drrun -t drcachesim -simulator_type reuse_time -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Reuse time tool aggregated results:
Total accesses: 88281
Total instructions: 261315
Mean reuse time: 433.47
Reuse time histogram:
Distance       Count  Percent  Cumulative
       1       27893   32.84%      32.84%
       2       10948   12.89%      45.73%
       3        5789    6.82%      52.54%
...
==================================================
Reuse time tool results for shard 29482 (thread 29482):
Total accesses: 84194
Total instructions: 250854
Mean reuse time: 450.01
Reuse time histogram:
Distance       Count  Percent  Cumulative
       1       26677   32.86%      32.86%
       2       10508   12.95%      45.81%
       3        5427    6.69%      52.50%
...
==================================================
Reuse time tool results for shard 29483 (thread 29483):
Total accesses: 3411
Total instructions: 8805
Mean reuse time: 86.36
Reuse time histogram:
Distance       Count  Percent  Cumulative
       1        1014   31.56%      31.56%
       2         363   11.30%      42.86%
       3         308    9.59%      52.44%
\endcode
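
To make the difference between the two metrics concrete, here is a small
sketch that computes both over a sequence of addresses.  It is an
illustration only and not the tools' implementation: the function names are
hypothetical, reuse time is taken to be the difference in access ordinals
(so a back-to-back repeat has time 1), reuse distance is taken to be the
number of distinct other addresses in between (so a back-to-back repeat has
distance 0), and the distance computation is deliberately simple rather
than efficient.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Reuse time per access: accesses elapsed since the previous access to the
// same address, or -1 if the address has not been seen before.
inline std::vector<long long>
reuse_times(const std::vector<std::uint64_t> &addrs)
{
    std::unordered_map<std::uint64_t, std::size_t> last_seen;
    std::vector<long long> result;
    for (std::size_t i = 0; i < addrs.size(); ++i) {
        auto it = last_seen.find(addrs[i]);
        result.push_back(it == last_seen.end()
                             ? -1
                             : static_cast<long long>(i - it->second));
        last_seen[addrs[i]] = i;
    }
    return result;
}

// Reuse distance per access: distinct other addresses touched since the
// previous access to the same address, or -1 on first use.
inline std::vector<long long>
reuse_distances(const std::vector<std::uint64_t> &addrs)
{
    std::unordered_map<std::uint64_t, std::size_t> last_seen;
    std::vector<long long> result;
    for (std::size_t i = 0; i < addrs.size(); ++i) {
        auto it = last_seen.find(addrs[i]);
        if (it == last_seen.end()) {
            result.push_back(-1);
        } else {
            // Collect the distinct addresses strictly between the two accesses.
            auto first = addrs.begin() + static_cast<std::ptrdiff_t>(it->second + 1);
            auto last = addrs.begin() + static_cast<std::ptrdiff_t>(i);
            std::unordered_set<std::uint64_t> distinct(first, last);
            result.push_back(static_cast<long long>(distinct.size()));
        }
        last_seen[addrs[i]] = i;
    }
    return result;
}
```

For the access sequence A, B, B, C, A, the final access to A has a reuse
time of 4 but a reuse distance of 2, because only B and C were touched in
between.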

\section sec_tool_basic_counts Event Counts

To simply see the counts of instructions and memory references broken down
by thread, use the basic counts tool:

\code
$ bin64/drrun -t drcachesim -simulator_type basic_counts -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Basic counts tool results:
Total counts:
      134566 total (fetched) instructions
       13838 total unique (fetched) instructions
         423 total non-fetched instructions
           0 total prefetches
       30919 total data loads
       13122 total data stores
           0 total icache flushes
           0 total dcache flushes
           3 total threads
        1134 total scheduling markers
           5 total transfer markers
          10 total function id markers
           0 total function return address markers
          30 total function argument markers
           5 total function return value markers
           0 total physical address + virtual address marker pairs
           0 total physical address unavailable markers
          75 total system call number markers
           0 total blocking system call markers
          12 total other markers
           0 total encodings
Thread 64951 counts:
      130900 (fetched) instructions
       13469 unique (fetched) instructions
         423 non-fetched instructions
           0 prefetches
       29844 data loads
       12594 data stores
           0 icache flushes
           0 dcache flushes
        1072 scheduling markers
           5 transfer markers
           4 function id markers
           0 function return address markers
          12 function argument markers
           2 function return value markers
           0 physical address + virtual address marker pairs
           0 physical address unavailable markers
          56 system call number markers
           0 blocking system call markers
           4 other markers
           0 encodings
Thread 64958 counts:
        1861 (fetched) instructions
        1009 unique (fetched) instructions
           0 non-fetched instructions
           0 prefetches
         538 data loads
         262 data stores
           0 icache flushes
           0 dcache flushes
          30 scheduling markers
           0 transfer markers
           2 function id markers
           0 function return address markers
           6 function argument markers
           1 function return value markers
           0 physical address + virtual address marker pairs
           0 physical address unavailable markers
           9 system call number markers
           0 blocking system call markers
           4 other markers
           0 encodings
Thread 64959 counts:
        1805 (fetched) instructions
         978 unique (fetched) instructions
           0 non-fetched instructions
           0 prefetches
         537 data loads
         266 data stores
           0 icache flushes
           0 dcache flushes
          32 scheduling markers
           0 transfer markers
           4 function id markers
           0 function return address markers
          12 function argument markers
           2 function return value markers
           0 physical address + virtual address marker pairs
           0 physical address unavailable markers
          10 system call number markers
           0 blocking system call markers
           4 other markers
           0 encodings
\endcode

The non-fetched instructions are x86 string loop instructions, where
subsequent iterations do not incur a fetch.  They are included in the trace
as a different type of trace entry to support core simulators in addition
to cache simulators.

\section sec_tool_opcode_mix Opcode Mix

The opcode_mix tool uses the non-fetched instruction information along with
the preserved libraries and binaries from the traced execution to gather
more information on each executed instruction than was stored in the trace.
To run on online traces, pass the `-instr_encodings` option.
The results are broken
down by the opcodes used in DR's IR, where for x86 \p mov is split into a separate
opcode for load and store but both have the same public string "mov":

\code
$ bin64/drrun -t drcachesim -offline -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
$ bin64/drrun -t drcachesim -simulator_type opcode_mix -indir drmemtrace.*.dir
Opcode mix tool results:
         267271 : total executed instructions
          36432 :       mov
          31075 :       mov
          24715 :       add
          22579 :      test
          22539 :       cmp
          12137 :       lea
          11136 :       jnz
          10568 :     movzx
          10243 :        jz
           9056 :       and
           8064 :       jnz
           7279 :        jz
           5659 :      push
           4528 :       sub
           4357 :       pop
           4001 :       shr
           3427 :      jnbe
           2634 :       mov
           2469 :       shl
           2344 :        jb
           2291 :       ret
           2178 :       xor
           2164 :      call
           2111 :   pcmpeqb
           1472 :    movdqa
...
\endcode

\section sec_tool_view Human-Readable View

The view tool prints out the contents of the trace for human viewing, including
disassembling instructions in AT&T, Intel, Arm, or DR format (to see
disassembly for online traces, pass the `-instr_encodings` option). The
\p -skip_refs and \p -sim_refs flags can be used to
set a start point and end point for the disassembled view. Note that these
flags count trace entry records, which include markers and memory accesses
as well as instructions, so their values are distinct from the number of
instruction records.

The tool displays loads and stores, as well as metadata marker entries for
timestamps, on which core and thread the subsequent instruction sequence was
executed, and kernel and system call transfers (these correspond to
signal or event handler interruptions of the regular execution flow).

In its first two columns, the tool displays the trace record ordinal
and the instruction fetch ordinal.

\code
$ bin64/drrun -t drcachesim -simulator_type view -indir drmemtrace.*.dir -sim_refs 20
Output format:
<--record#-> <--instr#->: <---tid---> <record details>
------------------------------------------------------------
           1           0:     3256418 <marker: version 6>
           2           0:     3256418 <marker: filetype 0x240>
           3           0:     3256418 <marker: cache line size 64>
           4           0:     3256418 <marker: chunk instruction count 1024>
           5           0:     3256418 <marker: page size 4096>
           6           0:     3256418 <marker: timestamp 13312410768080478>
           7           0:     3256418 <marker: tid 3256418 on core 7>
           8           1:     3256418 ifetch       3 byte(s) @ 0x00007fc205a61940 48 89 e7             mov    %rsp, %rdi
           9           2:     3256418 ifetch       5 byte(s) @ 0x00007fc205a61943 e8 b8 0c 00 00       call   $0x00007fc205a62600
          10           2:     3256418 write        8 byte(s) @ 0x00007fff9a9e3528 by PC 0x00007fc205a61943
          11           3:     3256418 ifetch       1 byte(s) @ 0x00007fc205a62600 55                   push   %rbp
          12           3:     3256418 write        8 byte(s) @ 0x00007fff9a9e3520 by PC 0x00007fc205a62600
          13           4:     3256418 ifetch       3 byte(s) @ 0x00007fc205a62601 48 89 e5             mov    %rsp, %rbp
          14           5:     3256418 ifetch       2 byte(s) @ 0x00007fc205a62604 41 57                push   %r15
          15           5:     3256418 write        8 byte(s) @ 0x00007fff9a9e3518 by PC 0x00007fc205a62604
          16           6:     3256418 ifetch       2 byte(s) @ 0x00007fc205a62606 41 56                push   %r14
          17           6:     3256418 write        8 byte(s) @ 0x00007fff9a9e3510 by PC 0x00007fc205a62606
          18           7:     3256418 ifetch       2 byte(s) @ 0x00007fc205a62608 41 55                push   %r13
          19           7:     3256418 write        8 byte(s) @ 0x00007fff9a9e3508 by PC 0x00007fc205a62608
          20           8:     3256418 ifetch       2 byte(s) @ 0x00007fc205a6260a 41 54                push   %r12
View tool results:
              8 : total instructions
\endcode
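
The relationship between the two ordinal columns can be sketched as follows.
This is a simplified model inferred from the listing above, not the view
tool's code: the enum \p record_kind and the function \p assign_ordinals are
hypothetical, and non-fetched instructions are not modeled.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical classification of trace records for this sketch.
enum class record_kind { marker, ifetch, read, write };

// Returns the (record ordinal, instruction ordinal) pair for each record.
// Every record advances the record ordinal; only instruction fetches
// advance the instruction ordinal.
inline std::vector<std::pair<std::uint64_t, std::uint64_t>>
assign_ordinals(const std::vector<record_kind> &records)
{
    std::vector<std::pair<std::uint64_t, std::uint64_t>> out;
    std::uint64_t record_ord = 0;
    std::uint64_t instr_ord = 0;
    for (record_kind kind : records) {
        ++record_ord;
        if (kind == record_kind::ifetch)
            ++instr_ord;
        out.emplace_back(record_ord, instr_ord);
    }
    return out;
}
```

This is why the 20 records displayed above contain only 8 instructions: the
markers and memory accesses advance the record ordinal but not the
instruction ordinal.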

An example of thread switches:

\code
------------------------------------------------------------
       46        0:  3264758 <marker: timestamp 13312413437398055>
       47        0:  3264758 <marker: tid 3264758 on core 2>
       48        1:  3264758 ifetch       3 byte(s) @ 0x00007f4ea89e4940 48 89 e7             mov    %rsp, %rdi
       49        2:  3264758 ifetch       5 byte(s) @ 0x00007f4ea89e4943 e8 b8 0c 00 00       call   $0x00007f4ea89e5600
       50        2:  3264758 write        8 byte(s) @ 0x00007ffd93a0cf18 by PC 0x00007f4ea89e4943
...
  2854543  2149665:  3264758 ifetch       5 byte(s) @ 0x00007f4ea7c87f8c b8 0e 00 00 00       mov    $0x0000000e, %eax
  2854544  2149666:  3264758 ifetch       2 byte(s) @ 0x00007f4ea7c87f91 0f 05                syscall
------------------------------------------------------------
  2854545  2149666:  3264760 <marker: timestamp 13312413438835999>
  2854546  2149666:  3264760 <marker: tid 3264760 on core 11>
  2854547  2149667:  3264760 ifetch       3 byte(s) @ 0x00007f4ea7d0b099 48 85 c0             test   %rax, %rax
  2854548  2149668:  3264760 ifetch       2 byte(s) @ 0x00007f4ea7d0b09c 7c 18                jl     $0x00007f4ea7d0b0b6
...
\endcode


Here is an example of a signal handler interrupting the regular flow,
with metadata showing that the signal was delivered just after an
untaken conditional branch:

\code
      801343      601827:     1159769 ifetch       2 byte(s) @ 0x00007fc2c3aa5c70 75 57                jnz    $0x00007fc2c3aa5cc9 (untaken)
      801344      601827:     1159769 <marker: kernel xfer from 0x7fc2c3aa5c72 to handler>
      801345      601827:     1159769 <marker: timestamp 13335923552684013>
      801346      601827:     1159769 <marker: tid 1159769 on core 7>
      801347      601828:     1159769 ifetch       1 byte(s) @ 0x00007fc2c03fa259 55                   push   %rbp
      801348      601828:     1159769 write        8 byte(s) @ 0x00007fff8044e930 by PC 0x00007fc2c03fa259
      801349      601829:     1159769 ifetch       3 byte(s) @ 0x00007fc2c03fa25a 48 89 e5             mov    %rsp, %rbp
      801350      601830:     1159769 ifetch       3 byte(s) @ 0x00007fc2c03fa25d 89 7d fc             mov    %edi, -0x04(%rbp)
      801351      601830:     1159769 write        4 byte(s) @ 0x00007fff8044e92c by PC 0x00007fc2c03fa25d
      801352      601831:     1159769 ifetch       4 byte(s) @ 0x00007fc2c03fa260 48 89 75 f0          mov    %rsi, -0x10(%rbp)
      801353      601831:     1159769 write        8 byte(s) @ 0x00007fff8044e920 by PC 0x00007fc2c03fa260
      801354      601832:     1159769 ifetch       4 byte(s) @ 0x00007fc2c03fa264 48 89 55 e8          mov    %rdx, -0x18(%rbp)
      801355      601832:     1159769 write        8 byte(s) @ 0x00007fff8044e918 by PC 0x00007fc2c03fa264
      801356      601833:     1159769 ifetch       4 byte(s) @ 0x00007fc2c03fa268 83 7d fc 1a          cmp    -0x04(%rbp), $0x1a
      801357      601833:     1159769 read         4 byte(s) @ 0x00007fff8044e92c by PC 0x00007fc2c03fa268
      801358      601834:     1159769 ifetch       2 byte(s) @ 0x00007fc2c03fa26c 75 0f                jnz    $0x00007fc2c03fa27d (untaken)
      801359      601835:     1159769 ifetch       6 byte(s) @ 0x00007fc2c03fa26e 8b 05 c0 3e 00 00    mov    <rel> 0x00007fc2c03fe134, %eax
      801360      601835:     1159769 read         4 byte(s) @ 0x00007fc2c03fe134 by PC 0x00007fc2c03fa26e
      801361      601836:     1159769 ifetch       3 byte(s) @ 0x00007fc2c03fa274 83 c0 01             add    $0x01, %eax
      801362      601837:     1159769 ifetch       6 byte(s) @ 0x00007fc2c03fa277 89 05 b7 3e 00 00    mov    %eax, <rel> 0x00007fc2c03fe134
      801363      601837:     1159769 write        4 byte(s) @ 0x00007fc2c03fe134 by PC 0x00007fc2c03fa277
      801364      601838:     1159769 ifetch       1 byte(s) @ 0x00007fc2c03fa27d 90                   nop
      801365      601839:     1159769 ifetch       1 byte(s) @ 0x00007fc2c03fa27e 5d                   pop    %rbp
      801366      601839:     1159769 read         8 byte(s) @ 0x00007fff8044e930 by PC 0x00007fc2c03fa27e
      801367      601840:     1159769 ifetch       1 byte(s) @ 0x00007fc2c03fa27f c3                   ret (target 0x7fc2c3a5af90)
      801368      601840:     1159769 read         8 byte(s) @ 0x00007fff8044e938 by PC 0x00007fc2c03fa27f
      801369      601841:     1159769 ifetch       7 byte(s) @ 0x00007fc2c3a5af90 48 c7 c0 0f 00 00 00 mov    $0x0000000f, %rax
      801370      601842:     1159769 ifetch       2 byte(s) @ 0x00007fc2c3a5af97 0f 05                syscall
      801371      601842:     1159769 <marker: system call 15>
      801372      601842:     1159769 <marker: timestamp 13335923552684023>
      801373      601842:     1159769 <marker: tid 1159769 on core 7>
      801374      601842:     1159769 <marker: syscall xfer from 0x7fc2c3a5af99>
      801375      601842:     1159769 <marker: timestamp 13335923552684029>
      801376      601842:     1159769 <marker: tid 1159769 on core 7>
      801377      601843:     1159769 ifetch       4 byte(s) @ 0x00007fc2c3aa5c72 48 83 c4 48          add    $0x48, %rsp
\endcode

\section sec_tool_func_view View Function Calls

The func_view tool records function argument and return values for
function names specified at tracing time. See \ref sec_drcachesim_funcs for
more information.

\code
$ bin64/drrun -t drcachesim -offline -record_function 'fib|1' -- ~/test/fib 5
Estimation of pi is 3.142425985001098
$ bin64/drrun -t drcachesim -simulator_type func_view -indir drmemtrace.*.dir
0x7fc06d2288eb => common.fib!fib(0x5)
    0x7fc06d22888e => common.fib!fib(0x4)
        0x7fc06d22888e => common.fib!fib(0x3)
            0x7fc06d22888e => common.fib!fib(0x2)
                0x7fc06d22888e => common.fib!fib(0x1) => 0x1
                0x7fc06d22889d => common.fib!fib(0x0) => 0x1
            => 0x2
            0x7fc06d22889d => common.fib!fib(0x1) => 0x1
        => 0x3
        0x7fc06d22889d => common.fib!fib(0x2)
            0x7fc06d22888e => common.fib!fib(0x1) => 0x1
            0x7fc06d22889d => common.fib!fib(0x0) => 0x1
        => 0x2
    => 0x5
    0x7fc06d22889d => common.fib!fib(0x3)
        0x7fc06d22888e => common.fib!fib(0x2)
            0x7fc06d22888e => common.fib!fib(0x1) => 0x1
            0x7fc06d22889d => common.fib!fib(0x0) => 0x1
        => 0x2
        0x7fc06d22889d => common.fib!fib(0x1) => 0x1
    => 0x3
=> 0x8
Function view tool results:
Function id=0: common.fib!fib
       15 calls
       15 returns
\endcode
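
The call and return counts in the summary can be cross-checked against the
call tree.  The sketch below is a hypothetical Python re-implementation of the
recursion that the output above implies (both fib(0) and fib(1) return 1); it
is not the source of the traced test program, and the \p counter dictionary is
purely illustrative.

```python
def fib(n, counter):
    """Recursive Fibonacci using the base cases visible in the output above."""
    counter["calls"] += 1
    if n <= 1:
        result = 1
    else:
        # The call tree shows fib(n - 1) being invoked before fib(n - 2).
        result = fib(n - 1, counter) + fib(n - 2, counter)
    counter["returns"] += 1
    return result


counter = {"calls": 0, "returns": 0}
value = fib(5, counter)
print(hex(value), counter)
```

Under these assumptions the result is 0x8 with 15 calls and 15 returns,
which agrees with the tool summary.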

\section sec_tool_histogram Cache Line Histogram

The top referenced cache lines are displayed by the \p histogram tool:

\code
$ bin64/drrun -t drcachesim -simulator_type histogram -- ~/test/pi_estimator
Estimation of pi is 3.142425985001098
---- <application exited with code 0> ----
Cache line histogram tool results:
icache: 1134 unique cache lines
dcache: 3062 unique cache lines
icache top 10
    0x7facdd013780: 30929
    0x7facdb789fc0: 27664
    0x7facdb78a000: 18629
    0x7facdd003e80: 18176
    0x7facdd003500: 11121
    0x7facdd0034c0: 9763
    0x7facdd005940: 8865
    0x7facdd003480: 8277
    0x7facdb789f80: 6660
    0x7facdd003540: 5888
dcache top 10
    0x7ffcc35e7d80: 4088
    0x7ffcc35e7d40: 3497
    0x7ffcc35e7e00: 3478
    0x7ffcc35e7f40: 2919
    0x7ffcc35e7dc0: 2837
    0x7facdbe2e980: 2452
    0x7facdbe2ec80: 2273
    0x7ffcc35e7e80: 2194
    0x7facdb6625c0: 2016
    0x7ffcc35e7e40: 1997
\endcode
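
The addresses reported above are all multiples of 0x40, which is consistent
with counting references per 64-byte cache line.  A minimal sketch of that
idea follows; the function names and the 64-byte default are assumptions for
illustration, and the real tool may differ (for example, in how it treats
accesses that span two lines).

```python
from collections import Counter

LINE_SIZE = 64  # Assumed cache line size in bytes; must be a power of 2.


def line_histogram(addresses, line_size=LINE_SIZE):
    """Count references per cache line by masking off the line-offset bits."""
    mask = ~(line_size - 1)
    return Counter(addr & mask for addr in addresses)


def top_lines(hist, n=10):
    """Return the n most-referenced (line_address, count) pairs."""
    return hist.most_common(n)


refs = [0x7FFCC35E7D80, 0x7FFCC35E7D88, 0x7FFCC35E7DBF, 0x7FFCC35E7DC0]
for line, count in top_lines(line_histogram(refs)):
    print(f"    {line:#x}: {count}")
```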

\section sec_tool_invariant_checker Invariant Checker

The invariant_checker tool performs sanity checks on a trace, focusing
on program counter continuity and guarantees around kernel control
transfer interruptions.  It optionally checks for restricted behavior
that technically is legal but is not expected to happen in the target
trace, helping to identify tracing problems and suitability for use of
a trace for core simulation.

\section sec_tool_syscall_mix System Call Mix

The system call mix tool counts the frequency of system calls in a trace. It works for
both online and offline traces. It uses the
#dynamorio::drmemtrace::TRACE_MARKER_TYPE_SYSCALL markers that store the system call
number. This works only with traces that have these markers; that is, offline traces
must have #dynamorio::drmemtrace::OFFLINE_FILE_TYPE_SYSCALL_NUMBERS in their file type.

\code
$ bin64/drrun -t drcachesim -indir drmemtrace.ls.*.dir -simulator_type syscall_mix
Syscall mix tool results:
          count : syscall_num
             17 :         9
             16 :         1
              8 :         3
              7 :       262
              6 :       257
              5 :         0
              5 :        10
              3 :        12
              2 :       217
              2 :        16
              2 :        17
              2 :       137
              2 :        21
              1 :       158
              1 :       302
              1 :       334
              1 :       218
              1 :       231
              1 :       318
              1 :        11
              1 :       273
\endcode

****************************************************************************
\page google_workload_traces Google Workload Traces

With the rapid growth of internet services and cloud computing,
workloads on warehouse-scale computers (WSCs) have become an important
segment of today’s computing market. These workloads differ from
others in their requirements of on-demand scalability, elasticity and
availability. They have fundamentally different characteristics from
traditional benchmarks and require changes to modern computer
architecture to achieve optimal efficiency.  Google is sharing
instruction and memory address traces from workloads running in Google
data centers so that computer architecture researchers can study and
develop new architecture ideas to improve the performance and
efficiency of this important class of workloads.

\section sec_google_format Trace Format

The Google workload traces are captured using DynamoRIO's
[drmemtrace](@ref page_drcachesim).  The traces are records of
instruction and memory accesses as described at \ref
sec_drcachesim_format.  We separate instruction and memory access
records from each software thread into a separate file
(.memtrace.gz). In addition, for each software thread, we also provide
a branch_trace which contains execution data (taken/not taken, branch
target) about each branch instruction (conditional, non-conditional,
calls, etc.).  Finally, for each workload trace, we provide a thread
statistics file (.threadstats.csv) which contains the thread ID (tid),
instruction count, non-fetched instruction count (e.g. implicit
instructions generated from microcode), load count, store count, and
prefetch count.

\section sec_google_get Getting the Traces

The Google Workload Traces can be downloaded from:

 - [Google workload trace folder](https://console.cloud.google.com/storage/browser/external-traces)

Directory convention:
- \verbatim
  workload/trace-X/
  \endverbatim
  where X is sequential starting from 1

Filename convention:
- Memory trace file:
  \verbatim
  <uuid>.<tid>.memtrace.gz
  \endverbatim
- Branch trace file:
  \verbatim
  <uuid>.branch_trace.<tid>.csv.gz
  \endverbatim
- Thread statistics summary:
  \verbatim
  <uuid>.threadstats.csv
  \endverbatim
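
When processing a downloaded directory, the filename conventions above can be
matched with regular expressions.  The sketch below is illustrative only: the
character classes assumed for the uuid and tid fields (no dots in the uuid, a
decimal tid) and the \p classify helper are assumptions, not part of any
published specification.

```python
import re

# Patterns follow the filename conventions listed above.  The character
# classes used for the uuid and tid fields are assumptions.
MEMTRACE_RE = re.compile(r"^(?P<uuid>[^.]+)\.(?P<tid>\d+)\.memtrace\.gz$")
BRANCH_RE = re.compile(r"^(?P<uuid>[^.]+)\.branch_trace\.(?P<tid>\d+)\.csv\.gz$")
STATS_RE = re.compile(r"^(?P<uuid>[^.]+)\.threadstats\.csv$")


def classify(filename):
    """Return (kind, uuid, tid) for a trace filename, or None if unrecognized."""
    m = MEMTRACE_RE.match(filename)
    if m:
        return ("memtrace", m.group("uuid"), int(m.group("tid")))
    m = BRANCH_RE.match(filename)
    if m:
        return ("branch_trace", m.group("uuid"), int(m.group("tid")))
    m = STATS_RE.match(filename)
    if m:
        return ("threadstats", m.group("uuid"), None)
    return None
```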

\section sec_google_help Getting Help and Reporting Bugs

The Google Workload Traces are essentially inputs to drive third party
tools (such as analyzers or simulators, including those provided here:
\ref sec_drcachesim_tools).  If you encounter a crash in a tool
provided by a third party, please locate the issue tracker for the
tool you are using and report the crash there.  If you believe the
issue is with the Google Workload Traces or with DynamoRIO or tools
provided with DynamoRIO, you can file an issue as described at \ref
page_bug_reporting.

For general questions, or if you are not sure whether the problem you
hit is a bug in your own code or in provided code, use the
[DynamoRIO users group mailing list/discussion forum]
(http://groups.google.com/group/dynamorio-users) rather than
opening an issue in the tracker. The users list will reach a wider
audience of people who might have an answer, and it will reach other
users who may find the information beneficial.

\section sec_google_contrib Contributing

We welcome contributions to the Google workload trace project. The
goal of providing the Google workload traces is to enable computer
architecture researchers to develop insights and new architecture
ideas to improve the performance and efficiency of workloads that run
on warehouse-scale computers.

You can contribute to the project in many ways:

- Providing suggestions for improving trace formats.
- Sharing and collaborating on architecture research.
- Reporting issues: see \ref sec_google_help

****************************************************************************
\page sec_drcachesim_config_file Configuration File

\p drcachesim supports reconfigurable cache hierarchies defined in
a configuration file. The configuration file is a text file with the following
formatting rules.

- A comment starts with two slashes followed by one or more spaces. Anything
after the '// ' until the end of the line is considered a comment and ignored.
- A parameter's name and its value are listed consecutively with white space
(spaces, tabs, or a new line) between them.
- Parameters must be separated by white space. Including one parameter per line
helps keep the configuration file more human-readable.
- A cache's parameters must be enclosed inside braces and preceded by the
cache's user-chosen unique name.
- Parameters can be listed in any order.
- Parameters not included in the configuration file take their default values.
- String values must not be enclosed in quotation marks.

Supported common parameters and their value types (each of these parameters
sets the corresponding option with the same name described in \ref sec_drcachesim_ops):
- num_cores \<unsigned int\>
- line_size \<unsigned int\>
- skip_refs \<unsigned int\>
- warmup_refs \<unsigned int\>
- warmup_fraction \<float in [0,1]\>
- sim_refs \<unsigned int\>
- cpu_scheduling \<bool\>
- verbose \<unsigned int\>
- coherence \<bool\>
- use_physical \<bool\>

Supported cache parameters and their value types:
- type \<string, one of "instruction", "data", or "unified"\>
- core \<unsigned int in [0, num_cores)\>
- size \<unsigned int, power of 2\>
- assoc \<unsigned int, power of 2\>
- inclusive \<bool\>
- parent \<string\>
- replace_policy \<string, one of "LRU", "LFU", or "FIFO"\>
- prefetcher \<string, one of "nextline" or "none"\>
- miss_file \<string\>

Example:
\code
// Configuration for a single-core CPU.

// Common params.
num_cores       1
line_size       64
cpu_scheduling  true
sim_refs        8888888
warmup_fraction 0.8

// Cache params.
P0L1I {                        // P0 L1 instruction cache
  type            instruction
  core            0
  size            65536        // 64K
  assoc           8
  parent          P0L2
  replace_policy  LRU
}
P0L1D {                        // P0 L1 data cache
  type            data
  core            0
  size            65536        // 64K
  assoc           8
  parent          P0L2
  replace_policy  LRU
}
P0L2 {                         // P0 L2 unified cache
  size            512K
  assoc           16
  inclusive       true
  parent          LLC
  replace_policy  LRU
}
LLC {                          // LLC
  size            1M
  assoc           16
  inclusive       true
  parent          memory
  replace_policy  LRU
  miss_file       misses.txt
}
\endcode
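
To make the formatting rules concrete, here is a minimal sketch of a reader
for this file format.  It is not the parser used by \p drcachesim: the
\p parse_config function is hypothetical, it keeps every value as a string
(so a size such as 512K is not expanded), and it does not validate parameter
names or apply default values.

```python
def parse_config(text):
    """Parse configuration text into (common_params, caches) dictionaries."""
    tokens = []
    for line in text.splitlines():
        # A comment starts with two slashes followed by at least one space.
        comment = line.find("// ")
        if comment != -1:
            line = line[:comment]
        # Treat braces as standalone tokens even when not space-separated.
        tokens.extend(line.replace("{", " { ").replace("}", " } ").split())

    common, caches = {}, {}
    i = 0
    while i < len(tokens):
        name = tokens[i]
        if name in ("{", "}"):
            raise ValueError(f"unexpected brace at token {i}")
        if i + 1 >= len(tokens):
            raise ValueError(f"missing value for {name}")
        if tokens[i + 1] == "{":
            # A cache block: a unique name followed by braced parameters.
            if name in caches:
                raise ValueError(f"duplicate cache name: {name}")
            params = {}
            i += 2
            while i < len(tokens) and tokens[i] != "}":
                if i + 1 >= len(tokens) or tokens[i + 1] in ("{", "}"):
                    raise ValueError(f"missing value for {tokens[i]}")
                params[tokens[i]] = tokens[i + 1]
                i += 2
            if i >= len(tokens):
                raise ValueError(f"unterminated block for cache {name}")
            caches[name] = params
            i += 1  # Skip the closing brace.
        else:
            if tokens[i + 1] == "}":
                raise ValueError(f"missing value for {name}")
            common[name] = tokens[i + 1]
            i += 2
    return common, caches
```

Because name and value tokens are separated by any white space, this sketch
accepts one parameter per line as well as several parameters on one line.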

****************************************************************************
\page sec_drcachesim_offline Offline Traces and Analysis

To dump a trace for future offline analysis, use the \p -offline option:
\code
$ bin64/drrun -t drcachesim -offline -- /path/to/target/app <args> <for> <app>
\endcode

The collected traces will be dumped into a newly created directory,
which can be passed to drcachesim for offline cache simulation with the \p
-indir option:
\code
$ bin64/drrun -t drcachesim -indir drmemtrace.app.pid.xxxx.dir/
\endcode

The direct results of the \p -offline run are raw, compacted files, stored
in a \p raw/ subdirectory of the \p drmemtrace.app.pid.xxxx.dir directory.
The \p -indir option both converts the data to a canonical trace form and
passes the resulting data to the cache simulator.  The canonical trace data
is stored by \p -indir in a \p trace/ subdirectory inside the \p
drmemtrace.app.pid.xxxx.dir/ directory.  For both the raw and canonical
data, a separate file per application thread is used.  If the canonical
data already exists, future runs will use that data rather than
re-converting it.  Either the top-level directory or the \p trace/
subdirectory may be pointed at with \p -indir:

\code
$ bin64/drrun -t drcachesim -indir drmemtrace.app.pid.xxxx.dir/trace
\endcode

If built with the zlib library, the canonical trace files are
automatically compressed with zip or gzip.  The trace reader supports
reading zip, gzip, or snappy compressed files.

The raw files are also compressed, controlled by the \p -raw_compress
option.  If built with lz4 support and not statically linked with the
application, lz4 is used by default.  Whether compressing the raw
files is a net reduction in overhead depends on the storage medium and
compression scheme.  The "lz4" and "snappy_nocrc" schemes are
generally performance wins even for an SSD, while "gzip" or "zlib" slow
things down.  For a spinning disk, any compression should be a net
win.

Older versions of the simulator produced a single trace file containing all threads
interleaved.  The \p -infile option supports reading these legacy files:
\code
$ gzip drmemtrace.app.pid.xxxx.dir/drmemtrace.trace
$ bin64/drrun -t drcachesim -infile drmemtrace.app.pid.xxxx.dir/drmemtrace.trace.gz
\endcode

The same analysis tools used online are available for offline: the trace
format is identical.

For details on the offline trace format and how to diagnose problems
with offline traces, see \ref page_debug_memtrace.

****************************************************************************
\page sec_drcachesim_filter Filtered Traces

Filtered traces are \p drcachesim traces filtered by an online first-level
cache.

- The \p -L0I_filter and \p -L0D_filter options can be used to enable the
  filter. These caches are direct-mapped with size equal to \p
  -L0I_size/-L0D_size.  They use virtual addresses regardless of \p -use_physical.
  The dynamic (pre-filtered) per-thread instruction count is tracked and
  supplied via a #dynamorio::drmemtrace::TRACE_MARKER_TYPE_INSTRUCTION_COUNT
  marker at thread buffer boundaries and at thread exit.
- The \p -L0I_size and \p -L0D_size options specify the cache sizes. Each size
  must be a power of 2 and a multiple of \p -line_size, unless it is set to 0,
  in which case the corresponding entries do not appear in the trace.
- The \p -L0_filter_until_instrs option is used to collect filtered traces
  together with a full trace (see \ref sec_drcachesim_partial).
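
The effect of a direct-mapped first-level filter can be sketched as follows.
This is an illustrative model only, not the tracer's implementation: the
\p DirectMappedFilter class is hypothetical, it assumes a power-of-2 line
size, and it does not model the special size of 0.

```python
class DirectMappedFilter:
    """Direct-mapped cache used only to decide which accesses to keep."""

    def __init__(self, size, line_size=64):
        if size <= 0 or size & (size - 1) or size % line_size:
            raise ValueError("size must be a power of 2 and a multiple of line_size")
        self.line_size = line_size
        self.num_lines = size // line_size
        self.tags = [None] * self.num_lines  # One resident line per slot.

    def access(self, addr):
        """Return True if the access misses and so stays in the filtered trace."""
        line = addr // self.line_size
        index = line % self.num_lines
        if self.tags[index] == line:
            return False  # Hit: the access is filtered out.
        self.tags[index] = line  # Miss: install the line and keep the record.
        return True


filt = DirectMappedFilter(size=128, line_size=64)
kept = [hex(a) for a in (0x0, 0x8, 0x40, 0x80, 0x0) if filt.access(a)]
print(kept)
```

In this example the access to 0x8 hits the line installed by 0x0 and is
dropped, while 0x80 maps to the same slot as 0x0 and evicts it, so the final
access to 0x0 misses again and is kept.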

****************************************************************************
\page sec_drcachesim_partial Tracing a Subset of Execution

While the cache simulator supports skipping references, for large
applications the overhead of the tracing itself is too high to conveniently
trace the entire execution.  There are several methods of tracing only
during a desired window of execution.

The \p -trace_after_instrs option delays tracing by the specified number of
dynamic instruction executions.  This can be used to skip initialization
and arrive at the desired starting point.  The trace's length can be
limited in several ways:

- The \p -trace_for_instrs option stops tracing after the specified number
  of dynamic instructions in the current window (since the last \p
  -retrace_every_instrs trigger, if set).
- The \p -retrace_every_instrs option augments \p -trace_for_instrs by
  executing its specified instruction count without tracing and then
  re-enabling tracing for \p -trace_for_instrs again, resulting in
  tracing windows repeated at regular intervals throughout the execution.
  There are two options for how these windows are stored for offline traces.
  If the \p -split_windows option is set (which is the default), each window
  produces a separate set of output files inside a window.NNNN subdirectory.
  Post-processing by default targets the first window; the others must be explicitly
  passed to separate post-processing invocations.  If \p -no_split_windows is set,
  a single trace is created with #dynamorio::drmemtrace::TRACE_MARKER_TYPE_WINDOW_ID
  markers (see \ref sec_drcachesim_format_other) identifying the trace window
  transitions.
- The \p -max_global_trace_refs option causes the recording of trace
  data to cease once the specified threshold is exceeded by the sum of
  all trace references across all threads.  One trace reference entry
  equals one recorded address, but due to post-processing expansion the
  final offline trace will be larger.  Once recording ceases, the
  application will continue to run.  Threads that are newly created after
  the threshold is reached will not appear in the trace.
- The \p -exit_after_tracing option similarly specifies a global trace
  reference count, but once it is exceeded, the process is terminated.
- The \p -max_trace_size option sets a cap on the number of bytes written
  by each thread.  This is a per-thread limit, and if one thread hits the
  limit it does not affect the trace recording of other threads.
- The \p -L0_filter_until_instrs option collects a filtered trace before
  transitioning to a full trace. It is compatible with the
  tracing options listed above. The filter trace and full trace are stored in a
  single file separated by a
  #dynamorio::drmemtrace::TRACE_MARKER_TYPE_FILTER_ENDPOINT marker. When used
  with windows (i.e., \p -retrace_every_instrs), each window contains a filter
  trace and a full trace. The
  #dynamorio::drmemtrace::TRACE_MARKER_TYPE_WINDOW_ID markers indicate the
  start of the filtered records.
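
The interaction of \p -trace_after_instrs, \p -trace_for_instrs, and \p
-retrace_every_instrs can be modeled with simple arithmetic.  The sketch below
is a simplified model under stated assumptions: it uses a single global
0-based instruction index and exact window boundaries, whereas the real
tracer may switch modes only approximately at these counts.  The \p is_traced
function and its parameter names are hypothetical.

```python
def is_traced(instr_index, trace_after=0, trace_for=None, retrace_every=None):
    """Return True if the given dynamic instruction falls in a tracing window."""
    if instr_index < trace_after:
        return False  # Still inside the initial delay.
    offset = instr_index - trace_after
    if trace_for is None:
        return True  # No length limit: trace until the end.
    if retrace_every is None:
        return offset < trace_for  # A single window.
    # Each period is one traced window followed by one untraced gap.
    period = trace_for + retrace_every
    return offset % period < trace_for


# Delay 10 instructions, then repeatedly trace 5 and skip 20.
windows = [i for i in range(50) if is_traced(i, 10, 5, 20)]
print(windows)
```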

If the application can be modified, it can be linked with the \p drcachesim
tracer and use DynamoRIO's start/stop API routines dr_app_setup_and_start()
and dr_app_stop_and_cleanup() to delimit the desired trace region.  As an
example, see <a
href="https://github.com/DynamoRIO/dynamorio/blob/master/clients/drcachesim/tests/burst_static.cpp">our
burst_static test application</a>.

****************************************************************************
\page sec_drcachesim_sim Simulator Details

The simulator can be extended to model a variety of
caching devices.  Currently, CPU caches and TLBs are implemented.  The type of
device to simulate is specified with the \p -simulator_type option
(see \ref sec_drcachesim_ops).

The CPU cache simulator models a configurable number of cores,
each with an L1 data cache and an L1 instruction cache.
Currently there is a single shared L2 unified cache, but we would like to
extend support to arbitrary cache hierarchies (see \ref sec_drcachesim_limit).
The cache line size and each cache's total size and associativity are
user-specified (see \ref sec_drcachesim_ops).

The TLB simulator models a configurable number of cores, each with an
L1 instruction TLB, an L1 data TLB, and an L2 unified TLB.  Each TLB's
entry number and associativity, and the virtual/physical page size,
are user-specified (see \ref sec_drcachesim_ops).

Neither simulator has a simple way to know which core any particular thread
executed on for each of its instructions.  The tracer records which core a
thread is on each time it writes out a full trace buffer, giving an
approximation of the actual scheduling (at the granularity of the trace
buffer size).  By default, these cache and TLB simulators ignore that
information and schedule threads to simulated cores in a static round-robin
fashion with load balancing to fill in gaps with new threads after threads
exit.  The option "-cpu_scheduling" (see \ref sec_drcachesim_ops) can be
used to instead map each physical cpu to a simulated core and use the
recorded cpu that each segment of thread execution occurred on to schedule
execution in a manner that more closely resembles the traced execution on
the physical machine.  Below is an example of the output using this option
running an application with many threads on a physical machine with 8 cpus.
The 8 cpus are mapped to the 4 simulated cores:

\code
$ bin64/drrun -t drcachesim -cpu_scheduling -- ~/test/pi_estimator 20
Estimation of pi is 3.141592653798125
<Stopping application /home/bruening/dr/test/threadsig (213517)>
---- <application exited with code 0> ----
Cache simulation results:
Core #0 (2 traced CPU(s): #2, #5)
  L1I stats:
    Hits:                        2,756,429
    Misses:                          1,190
    Miss rate:                        0.04%
  L1D stats:
    Hits:                        1,747,822
    Misses:                         13,511
    Prefetch hits:                   2,354
    Prefetch misses:                11,157
    Miss rate:                        0.77%
Core #1 (2 traced CPU(s): #4, #0)
  L1I stats:
    Hits:                          472,948
    Misses:                            299
    Miss rate:                        0.06%
  L1D stats:
    Hits:                          895,099
    Misses:                          1,224
    Prefetch hits:                     253
    Prefetch misses:                   971
    Miss rate:                        0.14%
Core #2 (2 traced CPU(s): #1, #7)
  L1I stats:
    Hits:                          448,581
    Misses:                            649
    Miss rate:                        0.14%
  L1D stats:
    Hits:                          811,483
    Misses:                          1,723
    Prefetch hits:                     378
    Prefetch misses:                 1,345
    Miss rate:                        0.21%
Core #3 (2 traced CPU(s): #6, #3)
  L1I stats:
    Hits:                          275,192
    Misses:                            154
    Miss rate:                        0.06%
  L1D stats:
    Hits:                          522,655
    Misses:                            850
    Prefetch hits:                     173
    Prefetch misses:                   677
    Miss rate:                        0.16%
LL stats:
    Hits:                           12,491
    Misses:                          7,109
    Prefetch hits:                   8,922
    Prefetch misses:                 5,228
    Local miss rate:                 36.27%
    Child hits:                  7,933,367
    Total miss rate:                  0.09%
\endcode

The memory access traces contain some optimizations that combine references
for one basic block together.  This may result in not considering some
thread interleavings that could occur natively.  There are no other
disruptions to thread ordering, however, and the application runs with all
of its threads concurrently just like it would natively (although slower).

Once every process has exited, the simulator prints cache miss statistics
for each cache to stderr.  The simulator is designed to be extensible,
allowing for different cache studies to be carried out: see \ref
sec_drcachesim_extend.

For L2 caching devices, the L1 caching devices are considered its _children_.
Two separate miss rates are computed, one (the "Local miss rate") considering
just requests that reach L2 while the other (the "Total miss rate")
includes the child hits.  This generalizes to deeper hierarchies:
lower level caches are children and reported child hits are cumulative
across all lower levels.
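
As a worked example, the two rates can be recomputed from the LL statistics in
the sample output above.  The formulas in this sketch are inferred from the
description and from those sample numbers rather than taken from the simulator
source, and the \p miss_rates helper is hypothetical.

```python
def miss_rates(hits, misses, child_hits):
    """Return (local, total) miss rates as fractions."""
    # Local: only the requests that actually reached this cache.
    local = misses / (hits + misses)
    # Total: also counts requests satisfied by the child caches.
    total = misses / (hits + misses + child_hits)
    return local, total


local, total = miss_rates(hits=12_491, misses=7_109, child_hits=7_933_367)
print(f"Local miss rate: {local:.2%}")
print(f"Total miss rate: {total:.2%}")
```

With these inputs the sketch yields 36.27% and 0.09%, matching the sample
output.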

For memory requests that cross blocks, each block touched is
considered separately, resulting in separate hit and miss statistics.  This
can be changed by implementing a custom statistics gatherer (see \ref
sec_drcachesim_extend).
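
A request crosses blocks when its first and last bytes fall in different
cache lines.  The following sketch shows how such a request can be split into
the blocks it touches; the \p blocks_touched helper and the 64-byte default
line size are assumptions for illustration.

```python
def blocks_touched(addr, size, line_size=64):
    """Return the start address of each cache block covered by an access.

    Assumes size >= 1.
    """
    last_byte = addr + size - 1
    first_block = addr - (addr % line_size)
    last_block = last_byte - (last_byte % line_size)
    return list(range(first_block, last_block + 1, line_size))


# An 8-byte access starting 4 bytes before a line boundary touches two blocks.
print([hex(b) for b in blocks_touched(0x103C, 8)])
```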

Software and hardware prefetches are combined in the prefetch hit and miss
statistics, which are reported separately from regular loads and stores.
To isolate software prefetch statistics, disable the hardware prefetcher by
running with "-data_prefetcher none" (see \ref sec_drcachesim_ops).
While misses from software prefetches are included in cache miss files,
misses from hardware prefetches are not.


****************************************************************************
\page sec_drcachesim_analyzer Cache Miss Analyzer

The cache simulator can be used to analyze the stream of last-level cache (LLC)
miss addresses. This can be useful when looking for patterns that can be utilized
in software prefetching. The current analyzer can only identify simple stride
patterns, but it can be extended to search for more complex patterns.
To invoke the miss analyzer, pass \p miss_analyzer to the \p -simulator_type
parameter. To write the prefetching hints to a file, use the \p -LL_miss_file
parameter to specify the file's path and name.

For example, to run the analyzer on a benchmark called "my_benchmark" and store
the prefetching recommendations in a file called "rec.csv", run the following:

\code
$ bin64/drrun -t drcachesim -simulator_type miss_analyzer -LL_miss_file rec.csv -- my_benchmark
\endcode
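
To illustrate what a simple stride pattern means, the sketch below scans a
sequence of miss addresses for a constant nonzero difference that repeats
several times in a row.  It is not the analyzer's algorithm: the
\p detect_stride function and its \p min_repeats threshold are hypothetical,
and a real analyzer may, for instance, track patterns separately for each
instruction.

```python
def detect_stride(miss_addrs, min_repeats=3):
    """Return the first constant nonzero stride seen min_repeats times in a row.

    Returns None if no such run exists.
    """
    run_stride, run_len = None, 0
    for prev, cur in zip(miss_addrs, miss_addrs[1:]):
        delta = cur - prev
        if delta != 0 and delta == run_stride:
            run_len += 1  # The current run continues.
        else:
            run_stride, run_len = delta, 1  # Start a new run.
        if run_stride != 0 and run_len >= min_repeats:
            return run_stride
    return None


print(detect_stride([0x1000, 0x1040, 0x1080, 0x10C0]))
```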


****************************************************************************
\page sec_drcachesim_phys Physical Addresses

The memory access tracing client gathers virtual addresses.  On Linux, if
the kernel allows user-mode applications access to the \p
/proc/self/pagemap file or the application can be run with root
privileges, information to translate virtual addresses to physical addresses may be included in the trace.  This can be
requested via the \p -use_physical runtime option (see \ref
sec_drcachesim_ops).  On older kernels the pagemap file was readable without
privileges:
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=ab676b7d6fbf4b294bf198fb27ade5b0e865c7ce.

When \p -use_physical is enabled, the regular trace entries remain
virtual, with a pair of markers of types
#dynamorio::drmemtrace::TRACE_MARKER_TYPE_PHYSICAL_ADDRESS and
#dynamorio::drmemtrace::TRACE_MARKER_TYPE_VIRTUAL_ADDRESS inserted at some prior point for
each new page mapping to show the corresponding physical
addresses.  If translation fails, a
#dynamorio::drmemtrace::TRACE_MARKER_TYPE_PHYSICAL_ADDRESS_NOT_AVAILABLE is inserted.
Limited support for detecting changes in page mappings is provided via
the \p -virt2phys_freq option to periodically clear cached
translations.

Each analysis tool must decide whether to use this translation
information.  The cache and TLB simulators provided are equipped to
read these markers and they use the marker data when \p -use_physical
is specified.
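
A tool that wants to use the translation information can keep a table of the
page mappings it has seen and apply it to later virtual addresses.  The sketch
below shows one way to do this.  It is a simplified model, not the logic of
the provided simulators: the \p AddressTranslator class is hypothetical, a
4096-byte page size is assumed, and the values passed to \p add_mapping stand
in for the values carried by a physical/virtual marker pair.

```python
PAGE_SIZE = 4096  # Assumed page size in bytes.


class AddressTranslator:
    """Tracks virtual-to-physical page mappings learned from marker pairs."""

    def __init__(self, page_size=PAGE_SIZE):
        self.page_size = page_size
        self.page_map = {}  # Virtual page number -> physical page number.

    def add_mapping(self, virtual_addr, physical_addr):
        """Record one mapping, as carried by a physical/virtual marker pair."""
        vpage = virtual_addr // self.page_size
        self.page_map[vpage] = physical_addr // self.page_size

    def translate(self, virtual_addr):
        """Return the physical address, or None if no mapping has been seen."""
        ppage = self.page_map.get(virtual_addr // self.page_size)
        if ppage is None:
            return None
        return ppage * self.page_size + virtual_addr % self.page_size


translator = AddressTranslator()
translator.add_mapping(0x7FFF8044E000, 0x12345000)  # Made-up example values.
print(hex(translator.translate(0x7FFF8044E930)))
```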

****************************************************************************
\page sec_drcachesim_core Core Simulation Support

The \p drcachesim trace format includes information intended for use by
core simulators as well as pure cache simulators.  For traces that are not
filtered by an online first-level cache, each data reference is preceded by
the instruction fetch entry for the instruction that issued the data
request, which includes the instruction encoding with the opcode and operands.
Additionally, on x86, string loop
instructions involve a single instruction fetch followed by a loop of loads
and/or stores.  A \p drcachesim trace includes a special "no-fetch"
instruction entry per iteration so that core simulators have the
instruction information to go along with each load and store, while cache
simulators can ignore these "no-fetch" entries and avoid incorrectly
inflating instruction fetch statistics.

Traces include scheduling markers providing the timestamp and hardware
thread identifier on each thread transition, allowing a simulator to more
closely match the actual hardware if so desired.

Traces also include markers indicating disruptions in user mode control
flow such as signal handler entry and exit.

Offline traces explicitly identify whether each conditional branch was
taken or not, and include the actual target of indirect
branches, for convenience to avoid having to read either the
subsequent entry or the kernel transfer event marker (or infer branch
behavior for rseq aborts):

```
      801394      601853:     1159769 ifetch       2 byte(s) @ 0x00007fc2c3aa91e3 7f 1b                jnle   $0x00007fc2c3aa9200 (untaken)
      801395      601854:     1159769 ifetch       4 byte(s) @ 0x00007fc2c3aa91e5 48 83 c4 10          add    $0x10, %rsp
      801396      601855:     1159769 ifetch       1 byte(s) @ 0x00007fc2c3aa91e9 5b                   pop    %rbx
      801397      601855:     1159769 read         8 byte(s) @ 0x00007fff8044f6c0 by PC 0x00007fc2c3aa91e9
      801398      601856:     1159769 ifetch       1 byte(s) @ 0x00007fc2c3aa91ea c3                   ret (target 0x7fc2c3aa81c1)
      801399      601856:     1159769 read         8 byte(s) @ 0x00007fff8044f6c8 by PC 0x00007fc2c3aa91ea
      801400      601857:     1159769 ifetch       2 byte(s) @ 0x00007fc2c3aa81c1 89 c5                mov    %eax, %ebp
```

Filtered traces (filtered via -L0_filter) include the dynamic
(pre-filtered) per-thread instruction count in a
#dynamorio::drmemtrace::TRACE_MARKER_TYPE_INSTRUCTION_COUNT marker at
each thread buffer boundary and at thread exit.

****************************************************************************
\page sec_drcachesim_extend Extending the Simulator

The \p drcachesim tool was designed to be extensible, allowing users to
easily model different caching devices, implement different models, and
gather custom statistics.

To model different caching devices, subclass the \p simulator_t,
\p caching_device_t, \p caching_device_block_t, and \p caching_device_stats_t classes.

To implement a different cache model, subclass the \p cache_t class and
override the \p request(), \p access_update(), and/or \p
replace_which_way() method(s).

Statistics gathering is separated out into the \p caching_device_stats_t
class.  To implement custom statistics, subclass \p caching_device_stats_t
and override the \p access(), \p child_access(), \p flush(), and/or
\p print_stats() methods.
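
The subclass-and-override pattern for custom statistics can be sketched as
follows.  To keep the example self-contained, \p stats_base_t is a
hypothetical stand-in for the statistics base class: the real
\p caching_device_stats_t has a richer interface with different method
signatures, so treat this only as an illustration of the structure:

```cpp
#include <cstdint>
#include <map>

// Hypothetical stand-in for a statistics base class; the real
// caching_device_stats_t has a richer interface with different signatures.
class stats_base_t {
public:
    virtual ~stats_base_t() = default;
    virtual void
    access(uint64_t addr, bool hit)
    {
        if (hit)
            ++num_hits_;
        else
            ++num_misses_;
    }
    uint64_t num_hits_ = 0;
    uint64_t num_misses_ = 0;
};

// Custom statistics: additionally tracks misses per 64-byte line.
class per_line_miss_stats_t : public stats_base_t {
public:
    void
    access(uint64_t addr, bool hit) override
    {
        stats_base_t::access(addr, hit); // Keep the default counters.
        if (!hit)
            ++misses_per_line_[addr >> 6];
    }
    std::map<uint64_t, uint64_t> misses_per_line_;
};
```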

****************************************************************************
\page sec_drcachesim_tracer Customizing the Tracer

The tracer supports customization for special-purpose i/o via
drmemtrace_replace_file_ops(), allowing traces to be written to locations
not supported by simple UNIX file operations.  One option for using this
function is to create a new client which links with the provided
\p drmemtrace_static library and includes the \p drmemtrace/drmemtrace.h header via:

\code
use_DynamoRIO_drmemtrace_tracer(mytool)
\endcode

The client also provides its own dr_client_main(), which calls
drmemtrace_client_main().

The tracer also supports storing custom data with each module (i.e.,
library or executable) such as a build identifier via
drmemtrace_custom_module_data().  The custom data may be retrieved by
creating a custom offline trace post-processor and using the
#dynamorio::drmemtrace::module_mapper_t class.

****************************************************************************
\page sec_drcachesim_funcs Tracing Function Calls

The tracer supports recording argument and return values for specified
functions.  This feature is currently limited to offline mode only
(\ref sec_drcachesim_offline).  The -record_function parameter lists
which function names to trace.  Requested names will be located per
library and each instance traced separately.  The number of arguments
to record is specified for each name, using a bar character to
separate them.  An ampersand separates functions.  Here is an example:

\code
$ bin64/drrun -t drcachesim -offline -record_function 'fib|1&calloc|2'
\endcode
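
When the list of functions is generated programmatically, the option value can
be assembled from (name, argument count) pairs.  The helper below is a small
sketch of that syntax; \p make_record_function_value() is a hypothetical name
and is not part of the tracer:

```cpp
#include <string>
#include <utility>
#include <vector>

// Builds a -record_function value such as "fib|1&calloc|2" from
// (function name, argument count) pairs.
inline std::string
make_record_function_value(const std::vector<std::pair<std::string, int>> &funcs)
{
    std::string value;
    for (const auto &func : funcs) {
        if (!value.empty())
            value += '&'; // An ampersand separates functions.
        // A bar separates the name from its argument count.
        value += func.first + '|' + std::to_string(func.second);
    }
    return value;
}
```

Because both the bar and the ampersand are shell metacharacters, the resulting
value should be quoted when passed on a command line, as in the example above.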

Within the trace, each function is identified by a numeric identifier.
The list of recorded functions, each with its identifier, is placed
into a file "funclist.log" in the trace directory, where the sample
tool \p func_view uses it to provide a linear function call trace as
well as summary statistics as shown above.

The -record_heap parameter requests recording of a pre-determined set
of functions related to heap allocation.  The -record_heap_value
parameter controls the contents of this set.

****************************************************************************
\page sec_drcachesim_newtool Creating New Analysis Tools

\p drcachesim provides a \p drmemtrace analysis tool framework to make it
easy to create new trace analysis tools.  A new tool should subclass
#dynamorio::drmemtrace::analysis_tool_t.

Concurrent processing of traces is supported by logically splitting a trace into
"shards" which are each processed sequentially.  The default shard is a traced
application thread, but the tool interface also supports using physical cores as shards
with each containing an interleaved mix of application threads provided by the \ref
sec_drcachesim_sched.  The shard type is available to a tool by overriding the
initialize_shard_type() function.

For tools that support concurrent processing of shards and do not need to see a single
time-sorted interleaved merged trace, the interface functions with the parallel_
prefix should be overridden, and parallel_shard_supported() should return true.
parallel_shard_init_stream() will be invoked for each shard prior to invoking
parallel_shard_memref() for each entry in that shard; the data structure returned
from parallel_shard_init_stream() will be passed to parallel_shard_memref() for each
trace entry for that shard.  The concurrency model used guarantees that all
entries from any one shard are processed by the same single worker thread, so no
synchronization is needed inside the parallel_ functions.  A single worker thread
invokes print_results() as well.

For core-sharded analysis, if the thread-to-core scheduling occurs dynamically (this
depends on the options passed to the analyzer: see the `-core_sharding` option
documentation under \ref sec_drcachesim_ops), the speed of each parallel analysis thread
affects the actual schedule.  If a tool has a significant asymmetry and does not wish this
to affect the schedule, a desired schedule should be recorded without the tool and then
replayed with the tool.  In replay mode the tool's speed will not affect the schedule.

For serial operation, process_memref() operates on a trace entry in a single, sorted,
interleaved stream of trace entries.  In the default mode of operation, the
#dynamorio::drmemtrace::analyzer_t class iterates over the trace and calls the
process_memref() function of each tool.  An alternative mode is supported which exposes
the iterator and allows a separate control infrastructure to be built.  This alternative
mode does not support parallel operation at this time.

Both parallel and serial operation can be supported by a tool, typically by having
process_memref() create data on a newly seen traced thread and invoking
parallel_shard_memref() to do its work.
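
The following is a minimal, single-threaded sketch of that delegation pattern.
It mirrors the shape of the tool interface without deriving from the real
#dynamorio::drmemtrace::analysis_tool_t, so \p record_t, \p instr_counter_t,
and \p total_instrs() are hypothetical names and the method signatures are
simplified assumptions:

```cpp
#include <cstdint>
#include <memory>
#include <unordered_map>

// Hypothetical, simplified stand-in for a trace record (not the real memref_t).
struct record_t {
    int64_t tid;
    bool is_instr;
};

// Mirrors the shape of an analysis tool without using the real base class.
// This sketch omits any locking a real tool may need around shared containers.
class instr_counter_t {
public:
    struct shard_data_t {
        uint64_t instr_count = 0;
    };

    // Parallel mode: the framework would call this once per shard.
    void *
    parallel_shard_init(int shard_index)
    {
        auto &slot = shards_[shard_index];
        slot.reset(new shard_data_t);
        return slot.get();
    }

    // Parallel mode: called per record; touches only that shard's data.
    bool
    parallel_shard_memref(void *shard_data, const record_t &record)
    {
        auto *shard = static_cast<shard_data_t *>(shard_data);
        if (record.is_instr)
            ++shard->instr_count;
        return true;
    }

    // Serial mode: create shard data the first time a thread is seen, then
    // reuse the parallel code path.
    bool
    process_memref(const record_t &record)
    {
        auto it = tid_to_shard_.find(record.tid);
        if (it == tid_to_shard_.end()) {
            int index = static_cast<int>(tid_to_shard_.size());
            it = tid_to_shard_.emplace(record.tid, parallel_shard_init(index)).first;
        }
        return parallel_shard_memref(it->second, record);
    }

    // Aggregation across shards, as print_results() would perform.
    uint64_t
    total_instrs() const
    {
        uint64_t total = 0;
        for (const auto &entry : shards_)
            total += entry.second->instr_count;
        return total;
    }

private:
    std::unordered_map<int, std::unique_ptr<shard_data_t>> shards_;
    std::unordered_map<int64_t, void *> tid_to_shard_;
};
```

Keeping all counting logic in the parallel path means the serial and parallel
modes cannot drift apart in what they measure.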

For both parallel and serial operation, the function print_results() should be
overridden.  It is called just once after processing all trace data and it should
present the results of the analysis.  For parallel operation, any desired
aggregation across the whole trace should occur here as well, while shard-specific
results can be presented in parallel_shard_exit().

Tools can also perform trace analysis by intervals, e.g. to generate a time series of
their results, using the \p -interval_microseconds option. The
generate_interval_snapshot() API allows the tool to create a snapshot of its internal
state when a trace interval ends. These snapshots are then passed to the tool in a later
print_interval_results() API call where the tool can generate and print results for each
trace interval. The length of a trace interval is defined by the \p
-interval_microseconds option, measured using the
#dynamorio::drmemtrace::TRACE_MARKER_TYPE_TIMESTAMP marker values. Trace interval
analysis is also supported in parallel mode: the tool implements
generate_shard_interval_snapshot() to generate a snapshot for shard-local intervals, and
the framework automatically combines the shard-local interval snapshots to create the
whole-trace interval snapshots using the tool's combine_interval_snapshots() API.
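
To make the idea concrete, here is a rough sketch of interval bucketing and
snapshot combination.  Everything in it is an assumption made for
illustration: \p snapshot_t, \p interval_id_for(), and \p combine_snapshots()
are hypothetical names, the 1-based interval numbering measured from the first
timestamp is assumed rather than taken from the framework, and summing
per-interval counts is just one possible combination policy:

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical snapshot type; a real tool defines its own state.
struct snapshot_t {
    uint64_t interval_id;
    uint64_t instr_count; // Instructions seen during this interval.
};

// Maps a timestamp to a 1-based interval index, assuming timestamps are in
// microseconds and intervals are measured from the first timestamp.
inline uint64_t
interval_id_for(uint64_t first_timestamp, uint64_t timestamp, uint64_t interval_us)
{
    return (timestamp - first_timestamp) / interval_us + 1;
}

// Combines shard-local snapshots into whole-trace snapshots by summing the
// counts of snapshots that fall in the same interval.
inline std::vector<snapshot_t>
combine_snapshots(const std::vector<std::vector<snapshot_t>> &per_shard)
{
    std::map<uint64_t, uint64_t> totals; // interval_id -> summed count.
    for (const auto &shard : per_shard) {
        for (const snapshot_t &snap : shard)
            totals[snap.interval_id] += snap.instr_count;
    }
    std::vector<snapshot_t> merged;
    for (const auto &entry : totals)
        merged.push_back({ entry.first, entry.second });
    return merged;
}
```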

Today, parallel analysis is only supported for offline traces.
Support for online traces may be added in the future.

In the default mode of operation, the #dynamorio::drmemtrace::analyzer_t class iterates
over the trace and calls the appropriate #dynamorio::drmemtrace::analysis_tool_t
functions for each tool.  An alternative mode is supported which exposes the iterator
and allows a separate control infrastructure to be built.

As explained in \ref sec_drcachesim_format, each trace entry is of
type #dynamorio::drmemtrace::memref_t and represents one instruction or data reference or a
metadata operation such as a thread exit or marker.  There are
built-in scheduling markers providing the timestamp and CPU identifier
on each thread transition.  Other built-in markers indicate
disruptions in user mode control flow such as signal handler entry and
exit.

The absolute ordinals for trace records and instruction fetches are
available via the #dynamorio::drmemtrace::memtrace_stream_t interface passed to the
initialize_stream() function for serial operation and
parallel_shard_init_stream() for parallel operation.  If the iterator
skips over some records that are not passed to the tools, these
ordinals will include those skipped records.  If a tool wishes to
count only those records or instructions that it sees, it can add its
own counters.

In some cases, a tool may want to observe the exact sequence of
#dynamorio::drmemtrace::trace_entry_t in an offline trace stored on disk. To support
such use cases, the #dynamorio::drmemtrace::trace_entry_t specialization of
#dynamorio::drmemtrace::analysis_tool_tmpl_t and #dynamorio::drmemtrace::analyzer_tmpl_t
can be used. Specifically, such tools should subclass
#dynamorio::drmemtrace::record_analysis_tool_t, and use the
#dynamorio::drmemtrace::record_analyzer_t class.

CMake support is provided for including the headers and linking the
libraries of the \p drmemtrace framework.  A new CMake function is defined
in the DynamoRIO package which sets the include directory for using the \p
drmemtrace/ headers:

\code
use_DynamoRIO_drmemtrace(mytool)
\endcode

The \p drmemtrace_analyzer library exported by the DynamoRIO package is the main
library to link when building a new tool.  The tools described above are also
exported as the libraries \p drmemtrace_basic_counts, \p drmemtrace_view, \p
drmemtrace_opcode_mix, \p drmemtrace_histogram, \p drmemtrace_reuse_distance, \p
drmemtrace_reuse_time, \p drmemtrace_simulator, \p drmemtrace_func_view, and
\p drmemtrace_syscall_mix and can be created using the basic_counts_tool_create(),
opcode_mix_tool_create(), histogram_tool_create(), reuse_distance_tool_create(),
reuse_time_tool_create(), view_tool_create(), cache_simulator_create(),
tlb_simulator_create(), func_view_create(), and syscall_mix_tool_create()
functions.

\section sec_drcachesim_sched Scheduler

In addition to the analysis tool framework, which targets running
multiple tools at once either in parallel across all traced threads or
in a serial fashion, we provide a scheduler which will map inputs to a
given set of outputs in a specified manner.  This allows a tool such
as a core simulator, or just a tool wanting its own control over
advancing the trace stream (unlike the analysis tool framework where
the framework controls the iteration), to request the next trace
record for each output on its own.  This scheduling is also available to any analysis tool
when the input traces are sharded by core (see the `-core_sharding` option documentation
under \ref sec_drcachesim_ops as well as \ref sec_drcachesim_newtool).

Here is a simple example of a single-output, serial stream.  This also
serves as an example of how to replace the now-removed old analysis
tool framework's "external iterator" interface:

\code
    scheduler_t scheduler;
    std::vector<scheduler_t::input_workload_t> sched_inputs;
    sched_inputs.emplace_back(trace_directory);
    if (scheduler.init(sched_inputs, 1, scheduler_t::make_scheduler_serial_options()) !=
        scheduler_t::STATUS_SUCCESS) {
        FATAL_ERROR("failed to initialize scheduler: %s",
                    scheduler.get_error_string().c_str());
    }
    auto *stream = scheduler.get_stream(0);
    memref_t record;
    for (scheduler_t::stream_status_t status = stream->next_record(record);
         status != scheduler_t::STATUS_EOF; status = stream->next_record(record)) {
        if (status != scheduler_t::STATUS_OK)
            FATAL_ERROR("scheduler failed to advance: %d", status);
        if (!my_tool->process_memref(record)) {
            FATAL_ERROR("tool failed to process entire trace: %s",
                        my_tool->get_error_string().c_str());
        }
    }
\endcode

****************************************************************************
\page sec_drcachesim_ops Simulator Parameters

\p drcachesim's behavior can be controlled through options passed after the
\p -t \p drcachesim but prior to the "--" delimiter on the command line:

\code
$ bin64/drrun -t drcachesim <options> <to> <drcachesim> -- /path/to/target/app <args> <for> <app>
\endcode

Boolean options can be disabled using a "-no_" prefix.

The parameters available are described below:

REPLACEME_WITH_OPTION_LIST


****************************************************************************
\page sec_drcachesim_limit Current Limitations

The \p drcachesim tool is a work in progress.  We welcome contributions in
these areas of missing functionality:

- Multi-process online application simulation on Windows
  (https://github.com/DynamoRIO/dynamorio/issues/1727)
- Offline traces do not currently accurately record instruction fetches in
  dynamically generated code (https://github.com/DynamoRIO/dynamorio/issues/2062).
  All data references are included, but instruction fetches may be skipped.
  This problem is limited to offline traces.
- If an instruction with multiple memory accesses faults on the
  non-final access, the trace may incorrectly contain subsequent
  accesses which did not actually happen
  (https://github.com/DynamoRIO/dynamorio/issues/3958).
- Online traces may skip instructions immediately prior to
  non-load-or-store-related kernel transfer events
  (https://github.com/DynamoRIO/dynamorio/issues/3937).
- Online traces may include the committing store in the trace when a
  restartable sequence abort happened prior to that store
  (https://github.com/DynamoRIO/dynamorio/issues/4041).
- Application phase marking is not yet implemented
  (https://github.com/DynamoRIO/dynamorio/issues/2478).

****************************************************************************
\page sec_drcachesim_cmp Comparison to Other Simulators

\p drcachesim is one of the few simulators to support multiple processes.
This feature requires an out-of-process simulator and inter-process
communication.  A single-process design would incur less overhead.  Thus,
we expect \p drcachesim to pay for its multi-process support with
potentially unfavorable performance versus single-process simulators.

When comparing cache hits, misses, and miss rates across simulators, the
details can vary substantially.  For example, some other simulators (such
as \p cachegrind) do not split memory references that cross cache lines
into multiple hits or misses, while \p drcachesim does split them.
Instructions that reference multiple memory words on the same cache line
(such as \p ldm on ARM) are considered to be single accesses by \p
drcachesim, while other simulators (such as \p cachegrind) may split the
accesses into separate pieces.  A final example involves string loop
instructions on x86.  \p drcachesim considers only the first iteration to
involve an instruction fetch (presenting subsequent iterations as
"non-fetched instruction" entries, which the simulator ignores; the
basic_counts tool does show these as a separate statistic), while other
simulators (incorrectly) issue a fetch to the instruction cache on every
iteration of the string loop.
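
The effect of splitting line-crossing references can be quantified with simple
arithmetic.  The helper below is an illustrative sketch, not code from
\p drcachesim, and \p lines_touched() is a hypothetical name; it computes how
many cache lines a single memory reference spans, which is the number of
separate hits or misses a splitting simulator would record for it:

```cpp
#include <cstdint>

// Number of cache lines touched by an access of `size` bytes at `addr`.
// Assumes size >= 1 and line_size >= 1.
inline uint64_t
lines_touched(uint64_t addr, uint64_t size, uint64_t line_size)
{
    uint64_t first_line = addr / line_size;
    uint64_t last_line = (addr + size - 1) / line_size;
    return last_line - first_line + 1;
}
```

For instance, with 64-byte lines an 8-byte read at address 0x3c spans two
lines, while the same read at 0x38 stays within one.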

*/