Understanding Linux perf: stat and record
In this post we will walk through what actually happens under the hood when you run:
perf stat -e cycles,instructions ./my_program
perf record -F 99 -g ./my_program
perf looks like a simple user-space tool, but most of the heavy lifting is done by the Linux kernel through the perf events subsystem.
1. Quick mental model
At a high level:
- perf (the CLI) is just a frontend that opens special file descriptors via perf_event_open(2).
- The kernel connects those file descriptors to hardware performance counters or software counters.
- The kernel counts events or takes periodic samples while your workload runs.
- When perf exits, it reads aggregated counts or a stream of samples and formats them for you.
You can think of it as:
user-space perf CLI → perf_event_open syscalls → kernel perf subsystem → PMU / tracepoints → data comes back via file descriptors
2. perf stat -e – counting events
Example:
perf stat -e cycles,instructions,branches,branch-misses ./my_program
2.1 Argument parsing and event description
First, the perf binary parses your arguments and turns each -e entry into an event descriptor:
- cycles → hardware PMU event (CPU core cycles)
- instructions → hardware PMU event (retired instructions)
- branches, branch-misses → hardware PMU events related to the branch unit
Each descriptor eventually becomes a struct perf_event_attr filled with:
- event type (hardware, software, tracepoint, raw, etc.)
- config (which specific event within that type)
- scope (per-task vs per-CPU)
- flags (inherit across children, pinned, exclude-kernel, exclude-user, etc.)
2.2 perf_event_open() syscalls and perf_event_attr
For each event, perf calls the perf_event_open(2) system call, which returns a file descriptor:
int fd = perf_event_open(&attr, pid, cpu, group_fd, flags);
Key parameters:
- pid = target task, or -1 (all tasks)
- cpu = specific CPU, or -1 (any CPU the task runs on)
- group_fd = groups multiple events so they start/stop together
- flags = things like PERF_FLAG_FD_CLOEXEC, PERF_FLAG_FD_NO_GROUP, etc.
The perf_event_attr that perf passes looks roughly like:
struct perf_event_attr attr = {
.type = PERF_TYPE_HARDWARE,
.config = PERF_COUNT_HW_CPU_CYCLES,
.size = sizeof(struct perf_event_attr),
.disabled = 1,
.exclude_kernel = 0,
.exclude_user = 0,
.inherit = 1,
.read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
PERF_FORMAT_TOTAL_TIME_RUNNING |
PERF_FORMAT_ID,
};
Important knobs you often tweak via perf CLI flags:
- inherit – whether children inherit the event (e.g., perf stat --no-inherit).
- pinned / exclusive – whether the event must stay scheduled at all times vs being multiplexed.
- exclude_{kernel,user,hypervisor,guest} – filter which privilege levels contribute to counts.
- precise_ip – for sampling events, request PEBS/IBS-style precise IPs where the PMU supports it.
Internally, the kernel:
- Validates your event (do you have permission? is it supported on this PMU?).
- Allocates a struct perf_event object in the kernel, linked into PMU-specific lists.
- Sets up a buffer for counts (and possibly a ring buffer for sampling if requested).
- Programs the PMU registers if this is a hardware event and the PMU has room.
The return value is a file descriptor that represents an active counter inside the kernel.
2.3 Attaching counters to the workload
Depending on how you invoke perf stat:
perf stat -e ... ./my_program # spawn child process
perf stat -p <pid> -e ... # attach to existing process
perf stat -a -e ... # system-wide, per-CPU
perf will either:
- fork() and execve() your program, then enable the events just before the child runs, or
- attach events to an existing pid, or
- create per-CPU events for system-wide collection.
The kernel tracks:
- Which task or CPU each perf_event belongs to.
- When the task is scheduled in/out.
- When to start/stop counting (e.g., when your command finishes).
For per-task events (pid != -1):
- Each perf_event is linked into the task’s perf-event context.
- On context switch in, perf_event_sched_in() programs the PMU with that task’s active events.
- On context switch out, perf_event_sched_out() reads the current counter value, accumulates the delta into the perf_event’s software state, and counters may then be programmed for the next runnable task.
For per-CPU events (pid == -1, specific cpu):
- Events are tied to a given CPU context; any task running on that CPU contributes to the counts.
2.4 Counting vs sampling and multiplexing
perf stat is pure counting mode by default:
- The kernel maintains running totals in its perf_event structures.
- No samples are generated, no stack traces, no large buffers.
- At the end, perf reads each FD with read(2); the returned buffer is laid out according to read_format (count, time enabled, time running, id).
When you ask for more concurrent hardware events than the PMU can support, the kernel multiplexes them:
- Each perf_event has time_enabled and time_running fields.
- While an event is scheduled on the PMU, time_running accumulates.
- When it is descheduled to make room for another event, time_running stops, but time_enabled keeps growing.
- perf scales counts by time_enabled / time_running to approximate what the count would have been had the event always been scheduled.
This is why perf stat has very low overhead—it just programs some counters and asks for the final numbers.
2.5 Printing results
Finally, perf:
- Scales the counts (e.g., if counters were multiplexed between events).
- Computes derived metrics (IPC, miss rates, etc.).
- Prints the pretty table you see on the terminal.
3. perf record – sampling and profiles
Now consider:
perf record -F 99 -g ./my_program
Here you are asking perf to sample your program ~99 times per second and capture call stacks.
3.1 Event setup for sampling
perf record still uses perf_event_open(2), but with additional fields in perf_event_attr:
- sample_type – what to capture in each sample (IP, TID, time, call chain, registers, etc.).
- sample_freq or sample_period – how often to sample (here, frequency = 99 Hz).
- wakeup_events / wakeup_watermark – when to wake up user space to drain the buffer.
- sample_stack_user / exclude_kernel / exclude_user – how deep and where to sample.
- precise_ip – request PEBS/IBS or similar hardware-assisted sampling when available.
The kernel allocates a ring buffer per event (or group leader), implemented as an mmap-able region shared between kernel and the perf process. It stores a sequence of struct perf_event_header records followed by payloads.
3.2 Periodic interrupts, overflow handling and callchains
When the event is hardware-based (e.g., cycles, instructions) and configured for sampling, the kernel:
- Programs the PMU with an initial period (e.g., N events until overflow).
- Each time the counter overflows, a PMU interrupt fires.
- The perf PMU interrupt handler (the perf_event_overflow() path):
  - Identifies which perf_event overflowed (via PMU-specific code).
  - Checks throttle limits (per-task and global) to avoid a DoS from too many samples.
  - Builds a sample: IP, pid/tid, CPU, time, and whatever else sample_type requests.
  - For callchains, it may use hardware call stack facilities (e.g., LBR on Intel via branch_stack) or walk the user stack using frame pointers or DWARF unwind info (slower, more expensive).
  - Writes a PERF_RECORD_SAMPLE into the ring buffer if not throttled.
  - Reloads the counter for the next period.
If you specified -F 99, the kernel uses frequency mode: it adjusts the underlying sample_period dynamically so that, on average, you get ~99 samples per second, even if the CPU frequency or workload intensity changes.
3.3 Software & tracepoint events
perf record is not limited to hardware events:
- software events (cpu-clock, task-clock, page-faults, sched events, etc.).
- tracepoints (syscalls:sys_enter_*, sched:sched_switch, etc.).
For these, the interrupt or hook comes from the kernel code path itself (not PMU), but the record flow is the same: build a PERF_RECORD_SAMPLE and push it into the buffer.
3.4 User-space reading the ring buffer and perf.data layout
While your workload runs, the perf process:
- mmaps the ring buffer for each event FD.
- Uses a producer/consumer protocol with data_head and data_tail indices shared between kernel and user space.
- Periodically wakes up (based on wakeup_events / wakeup_watermark or signals).
- Consumes all pending records and writes them into a data file, usually perf.data in the current directory.
Records can include:
- Sample records (IP + stack + registers).
- MMAP/MMAP2 records (when code regions are mapped, with build-id and page offset info).
- COMM records (process/thread name changes).
- FORK/EXIT records.
perf.data starts with a header section (feature bits, machine info, event descriptions, etc.), followed by the raw stream of records. The MMAP/COMM/FORK metadata is exactly what lets later tools (perf report, perf script) reconstruct symbol context and attribute samples to the correct binaries and functions.
3.5 perf report and call graphs
perf record only collects data. To inspect it you usually run:
perf report
perf report:
- Opens perf.data and reads all records.
- Uses symbol information (from the binary + debug info) to map IPs to functions/files/lines.
- Aggregates samples by symbol, DSO, or call path.
- Shows you the familiar TUI with percentages and stack traces.
The important point: perf record decouples sampling from visualization. The kernel just logs events; user-space tools do the heavy post-processing later.
4. Comparing perf stat and perf record
Both commands use the same kernel perf event API but in different modes:
perf stat:
- Counting mode
- read() of final values at the end
- Very low overhead
- Good for high-level metrics (IPC, miss rates, bandwidth)

perf record:
- Sampling mode
- Continuous stream of records in a ring buffer
- More overhead (interrupts + stack walking)
- Good for finding where time is spent (hot functions, lines, call stacks)
You can even combine ideas (e.g., perf stat -I for periodic stats, or perf record with various event types), but the core mechanism—perf_event_open, kernel perf, PMU/tracepoints, ring buffers—remains the same.
5. Where to go next
If you want to dig deeper:
- Man pages: man perf_event_open, man perf-stat, man perf-record, man perf-report
- Kernel source: kernel/events/ in the Linux tree
- Experiment:
  - Vary events: perf stat -e cache-misses,cache-references.
  - Try tracepoints: perf record -e sched:sched_switch -a -- sleep 1.
  - Look into the raw dump: perf script on your perf.data file.
Understanding this pipeline makes perf much less magical—you are really just driving a generic kernel facility that can count and sample almost anything the CPU or kernel can expose.