Analyzing Memory Performance with Tinymembench
I. Overview
tinymembench is a lightweight, cross-platform benchmarking tool that you can use to precisely measure your system's memory bandwidth and random-access latency. This tool provides critical data for analyzing and optimizing the performance of an embedded system's memory subsystem.
The main features of tinymembench include:
- Official Source Code: https://github.com/ssvb/tinymembench
- Supported Processor Architectures:
  - AArch64
  - ARM
  - amd64
  - MIPS32
- Supported Vector Instruction Sets:
  - SSE2 (Streaming SIMD Extensions 2)
  - NEON
II. Usage Instructions
You can enable and run tinymembench in the openvela environment with simple configuration and commands.
1. Enabling tinymembench
Enable the tinymembench application in your project's configuration.
- Enter the openvela configuration menu (e.g., by running make menuconfig).
- Navigate to Application Configuration -> BenchMarks.
- Select the tinymembench option:
  CONFIG_BENCHMARKS_TINYMEMBENCH=y
- Save the configuration and recompile your project.
The source code for tinymembench is located in the apps/benchmarks/tinymembench directory.
2. Running the Benchmark
At the NuttShell (NSH) prompt, run the tinymembench command to start the test. The command requires no arguments.
nsh> tinymembench
III. Analyzing Test Results
The output of tinymembench is divided into two main parts: memory bandwidth tests and memory latency tests.
1. Interpreting Core Metrics
The basic principles for performance evaluation are straightforward:
- Higher memory bandwidth is better: It indicates that more data can be transferred per unit of time.
- Lower memory latency is better: It indicates that a single memory access takes less time.
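To make these two metrics concrete, the following Python sketch converts a bandwidth figure into the time needed to move a buffer, and a latency figure into CPU cycles. It is only an illustration meant to run on a development host. The helper names transfer_time_ms and latency_cycles, the 4 MiB buffer, and the 1 GHz clock are assumptions made for the example, and the sketch treats 1 MB/s as 10^6 bytes per second. The input figures are taken from the example report in this section.

```python
def transfer_time_ms(num_bytes: int, bandwidth_mb_s: float) -> float:
    """Milliseconds needed to move num_bytes at bandwidth_mb_s.

    Assumes 1 MB = 10**6 bytes (an assumption of this sketch).
    """
    if bandwidth_mb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return num_bytes / (bandwidth_mb_s * 1e6) * 1e3


def latency_cycles(latency_ns: float, cpu_hz: float) -> float:
    """CPU cycles that elapse during one access of latency_ns at clock cpu_hz."""
    return latency_ns * 1e-9 * cpu_hz


if __name__ == "__main__":
    buf = 4 * 1024 * 1024  # hypothetical 4 MiB buffer
    for name, mb_s in [("C copy", 7153.3), ("standard memcpy", 13278.0)]:
        print(f"{name}: {transfer_time_ms(buf, mb_s):.3f} ms")
    # Hypothetical 1 GHz core clock
    print(f"69.4 ns at 1 GHz = {latency_cycles(69.4, 1e9):.1f} cycles")
```

Expressing latency in cycles shows how long the core would stall on a single uncached access, which is often easier to reason about than nanoseconds when comparing platforms with different clock rates.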
2. Example Output
After the test is complete, tinymembench will print a detailed performance report, as shown below:
nsh> tinymembench
tinymembench v0.4.9 (simple benchmark for memory throughput and latency)
==========================================================================
== Memory bandwidth tests ==
... (Detailed output for bandwidth tests omitted) ...
C copy : 7153.3 MB/s (3.7%)
standard memcpy : 13278.0 MB/s (7.2%)
standard memset : 5833.2 MB/s (1.0%)
SSE2 copy : 12823.8 MB/s (6.8%)
==========================================================================
== Memory latency test ==
... (Detailed output for latency tests omitted) ...
==========================================================================
block size : single random read / dual random read
1024 : 0.1 ns / 0.1 ns
...
16777216 : 66.7 ns / 84.9 ns
33554432 : 73.6 ns / 99.8 ns
67108864 : 69.4 ns / 94.7 ns
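When comparing runs, it can help to extract the numbers from a captured report instead of reading them by eye. The following Python sketch parses lines in the shape shown above. It is a host-side illustration, not part of tinymembench or openvela, and the function name parse_report and the regular expressions are assumptions based only on this sample output.

```python
import re

# Matches e.g. "C copy : 7153.3 MB/s (3.7%)"
BANDWIDTH_RE = re.compile(
    r"^\s*(?P<name>.+?)\s*:\s*(?P<mb_s>\d+(?:\.\d+)?)\s*MB/s"
)
# Matches e.g. "16777216 : 66.7 ns / 84.9 ns"
LATENCY_RE = re.compile(
    r"^\s*(?P<size>\d+)\s*:\s*(?P<single>\d+(?:\.\d+)?)\s*ns"
    r"\s*/\s*(?P<dual>\d+(?:\.\d+)?)\s*ns"
)


def parse_report(text):
    """Return (bandwidth, latency) dicts parsed from a captured report."""
    bandwidth = {}  # test name -> MB/s
    latency = {}    # block size in bytes -> (single, dual) random read in ns
    for line in text.splitlines():
        m = LATENCY_RE.match(line)
        if m:
            latency[int(m["size"])] = (float(m["single"]), float(m["dual"]))
            continue
        m = BANDWIDTH_RE.match(line)
        if m:
            bandwidth[m["name"]] = float(m["mb_s"])
    return bandwidth, latency


# A few lines copied from the sample report above
SAMPLE = """\
C copy : 7153.3 MB/s (3.7%)
standard memcpy : 13278.0 MB/s (7.2%)
block size : single random read / dual random read
1024 : 0.1 ns / 0.1 ns
16777216 : 66.7 ns / 84.9 ns
"""

if __name__ == "__main__":
    bw, lat = parse_report(SAMPLE)
    fastest = max(bw, key=bw.get)
    print(f"fastest copy: {fastest} ({bw[fastest]} MB/s)")
    for size, (single, dual) in sorted(lat.items()):
        print(f"{size:>10} bytes: {single} ns / {dual} ns")
```

Lines that match neither pattern, such as banners and the table header, are skipped, so the full console log can be passed in unchanged.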
3. Key Factors Affecting Memory Performance
Memory performance is influenced by a combination of hardware and software configurations. When analyzing the results, consider the following key factors:
- Data Cache: Enabling the data cache can significantly reduce average memory access latency, thereby improving overall performance.
- Memory Management Unit (MMU): Enabling the MMU introduces overhead for virtual-to-physical address translation, increasing both average and worst-case memory access times. With multi-level page tables (e.g., 4-level tables), a translation miss requires additional memory accesses for the page-table walk, so the worst-case memory access latency can increase significantly. In contrast, a Memory Protection Unit (MPU) checks region-based access attributes without translating addresses, so it has a smaller impact on performance.
- Translation Lookaside Buffer (TLB): The TLB is the MMU's address translation cache. A TLB hit avoids the page-table walk, which reduces the average memory access latency when the MMU is active.
- Cacheable Attribute of Page Table Entries: If a memory region is configured as Non-Cacheable, the CPU bypasses the cache and accesses main memory directly. This increases the average access time but may slightly reduce worst-case latency jitter.
- Virtualization Environment: Running in a virtualized environment typically introduces an additional layer of address translation (e.g., Intermediate Physical Address to Host Physical Address), which slightly increases average access time and can significantly increase worst-case access time.
- DDR Memory Timings: The physical characteristics of DDR (Double Data Rate) SDRAM directly impact performance.
  - Refresh Cycle: During a memory refresh period (defined by parameters like t_REF and t_REFI), the memory controller pauses responses to access requests, which directly affects the worst-case access time.
  - Access Timings: Other key timing parameters, such as t_CL (CAS Latency) and t_RCD (RAS to CAS Delay), also affect average and worst-case access times.
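The combined effect of the data cache, the MMU, and the TLB on latency can be approximated with a simple weighted-average model. The Python sketch below is a back-of-the-envelope illustration, not a description of any particular CPU. The hit rates, the 1 ns cache latency, the 70 ns DRAM latency, and the assumption that each page-table level costs one uncached memory access on a TLB miss are all hypothetical inputs chosen for the example.

```python
def average_access_ns(cache_hit_rate, cache_ns, dram_ns,
                      tlb_hit_rate=1.0, page_table_levels=0):
    """Estimate the average latency of one memory access in nanoseconds.

    Simplified model (assumptions, not measured behaviour):
      - a data access costs cache_ns on a cache hit and dram_ns on a miss;
      - a TLB miss adds one DRAM access per page-table level;
      - with the MMU off, use tlb_hit_rate=1.0 and page_table_levels=0.
    """
    for rate in (cache_hit_rate, tlb_hit_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("hit rates must be between 0 and 1")
    data_ns = cache_hit_rate * cache_ns + (1.0 - cache_hit_rate) * dram_ns
    walk_ns = (1.0 - tlb_hit_rate) * page_table_levels * dram_ns
    return data_ns + walk_ns


def worst_case_access_ns(dram_ns, page_table_levels=0):
    """Worst case in the same model: a cache miss plus a full page-table walk."""
    return dram_ns * (1 + page_table_levels)


if __name__ == "__main__":
    # All figures below are hypothetical.
    print(average_access_ns(0.0, 1.0, 70.0))   # cache disabled or non-cacheable
    print(average_access_ns(0.95, 1.0, 70.0))  # cache enabled, MMU off
    print(average_access_ns(0.95, 1.0, 70.0,
                            tlb_hit_rate=0.99, page_table_levels=4))
    print(worst_case_access_ns(70.0, page_table_levels=4))
```

The model ignores DDR refresh stalls, cached page-table entries, and the second translation stage used under virtualization, so treat its output as a rough guide for reasoning about measured results rather than as a prediction.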
IV. openvela Porting Notes
During the process of porting tinymembench to openvela, a key modification was made:
- The __attribute__((weak)) modifier was added to the fmin function to resolve potential symbol conflicts with a function of the same name in the standard library.