Understanding Linux perf: stat and record

In this post we will walk through what actually happens under the hood when you run:

perf stat -e cycles,instructions ./my_program
perf record -F 99 -g ./my_program

perf looks like a simple user-space tool, but most of the heavy lifting is done by the Linux kernel through the perf events subsystem.


1. Quick mental model

At a high level:

  • perf (the CLI) is just a frontend that opens special file descriptors via perf_event_open(2).
  • The kernel connects those file descriptors to hardware performance counters or software counters.
  • The kernel counts events or takes periodic samples while your workload runs.
  • When perf exits, it reads aggregated counts or a stream of samples and formats them for you.

You can think of it as:

user-space perf CLI → perf_event_open syscalls → kernel perf subsystem → PMU / tracepoints → data comes back via file descriptors


2. perf stat -e – counting events

Example:

perf stat -e cycles,instructions,branches,branch-misses ./my_program

2.1 Argument parsing and event description

First, the perf binary parses your arguments and turns each -e entry into an event descriptor:

  • cycles → hardware PMU event (CPU core cycles)
  • instructions → hardware PMU event (retired instructions)
  • branches, branch-misses → hardware PMU events related to branch unit

Each descriptor eventually becomes a struct perf_event_attr filled with:

  • event type (hardware, software, tracepoint, raw, etc.)
  • config (which specific event within that type)
  • scope (per-task vs per-CPU)
  • flags (inherit across children, pinned, exclude-kernel, exclude-user, etc.)
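
For reference, here is roughly how the generic names above map to (type, config) pairs. The constants are the standard ones from <linux/perf_event.h>, but this is a simplified sketch of what perf's event parser (tools/perf/util/parse-events.c) produces; the real parser also handles raw events, PMU-specific syntax, and tracepoints:

#include <linux/perf_event.h>

/* Simplified sketch: generic event names and their (type, config) pairs. */
struct event_desc {
    const char *name;
    __u32       type;
    __u64       config;
};

static const struct event_desc generic_events[] = {
    { "cycles",        PERF_TYPE_HARDWARE, PERF_COUNT_HW_CPU_CYCLES },
    { "instructions",  PERF_TYPE_HARDWARE, PERF_COUNT_HW_INSTRUCTIONS },
    { "branches",      PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_INSTRUCTIONS },
    { "branch-misses", PERF_TYPE_HARDWARE, PERF_COUNT_HW_BRANCH_MISSES },
};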

2.2 perf_event_open() syscalls and perf_event_attr

For each event, perf calls the perf_event_open(2) system call, which returns a file descriptor:

int fd = perf_event_open(&attr, pid, cpu, group_fd, flags);

Key parameters:

  • pid = target task or -1 (all tasks)
  • cpu = specific CPU or -1 (any CPU the task runs on)
  • group_fd = allows grouping multiple events so they start/stop together
  • flags = things like PERF_FLAG_FD_CLOEXEC, PERF_FLAG_FD_NO_GROUP, etc.

The perf_event_attr that perf passes looks roughly like:

struct perf_event_attr attr = {
  .type           = PERF_TYPE_HARDWARE,
  .config         = PERF_COUNT_HW_CPU_CYCLES,
  .size           = sizeof(struct perf_event_attr),
  .disabled       = 1,
  .exclude_kernel = 0,
  .exclude_user   = 0,
  .inherit        = 1,
  .read_format    = PERF_FORMAT_TOTAL_TIME_ENABLED |
                    PERF_FORMAT_TOTAL_TIME_RUNNING |
                    PERF_FORMAT_ID,
};

Important knobs you often tweak via perf CLI flags:

  • inherit – whether children inherit the event (e.g., perf stat --no-inherit).
  • pinned / exclusive – whether the event must stay on the PMU at all times or may be multiplexed with other events.
  • exclude_{kernel,user,hypervisor,guest} – filter which privilege levels contribute to counts.
  • precise_ip – for sampling events, request PEBS/IBS-style precise IPs where the PMU supports it.

Internally, the kernel:

  • Validates your event (do you have permission? is it supported on this PMU?).
  • Allocates a struct perf_event object in the kernel, linked into PMU-specific lists.
  • Sets up the in-kernel count state (the ring buffer used for sampling is allocated later, when the fd is mmap'd).
  • Programs the PMU registers if this is a hardware event and the PMU has room.

The return value is a file descriptor that represents an active counter inside the kernel.
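
To make the flow concrete, here is a minimal counting example in the spirit of the perf_event_open(2) man page. glibc provides no wrapper for this syscall, so it goes through syscall(2); error handling is kept deliberately thin:

#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdint.h>
#include <stdio.h>

/* No glibc wrapper exists, so call the raw syscall. */
static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags)
{
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type           = PERF_TYPE_HARDWARE;
    attr.size           = sizeof(attr);
    attr.config         = PERF_COUNT_HW_CPU_CYCLES;
    attr.disabled       = 1;   /* start disabled; enable explicitly */
    attr.exclude_kernel = 1;   /* count user-space cycles only */

    /* pid = 0 (this task), cpu = -1 (any CPU), no group, no flags */
    int fd = perf_event_open(&attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    for (volatile int i = 0; i < 1000000; i++) ;  /* workload to measure */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof(count));  /* plain u64: no read_format set */
    printf("cycles: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}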

2.3 Attaching counters to the workload

Depending on how you invoke perf stat:

perf stat -e ... ./my_program      # spawn child process
perf stat -p <pid> -e ...          # attach to existing process
perf stat -a -e ...                # system-wide, per-CPU

perf will either:

  • fork() and execve() your program, enabling the events just as the child execs (a sketch follows at the end of this subsection), or
  • Attach events to an existing pid, or
  • Create per-CPU events for system-wide collection.

The kernel tracks:

  • Which task or CPU each perf_event belongs to.
  • When the task is scheduled in/out.
  • When to start/stop counting (e.g., when your command finishes).

For per-task events (pid != -1):

  • Each perf_event is linked into the task’s perf-event context.
  • On context switch in, perf_event_sched_in() programs the PMU with that task’s active events.
  • On context switch out, perf_event_sched_out() reads the current counter value, accumulates deltas into the perf_event’s software state, and may program counters for the next runnable task.

For per-CPU events (pid == -1, specific cpu):

  • Events are tied to a given CPU context; any task running on that CPU contributes to the counts.
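
A simplified sketch of the spawn-child path, reusing the perf_event_open() wrapper from the previous sketch; prog_argv is a hypothetical stand-in for your program's argv. Real perf closes the race between open and exec by blocking the child on a pipe until the events are in place:

/* Sketch: count a child from execve() to exit. enable_on_exec makes
 * the kernel turn on the (initially disabled) counter exactly at
 * execve(), so the fork/exec plumbing itself is not counted. */
struct perf_event_attr attr = {0};
attr.type           = PERF_TYPE_HARDWARE;
attr.size           = sizeof(attr);
attr.config         = PERF_COUNT_HW_CPU_CYCLES;
attr.disabled       = 1;
attr.enable_on_exec = 1;
attr.inherit        = 1;   /* threads/children of the child count too */

pid_t child = fork();
if (child == 0) {
    /* real perf: block here on a pipe until the parent's
     * perf_event_open() calls have completed */
    execvp(prog_argv[0], prog_argv);
    _exit(127);
}

int fd = perf_event_open(&attr, child, -1, -1, 0);
waitpid(child, NULL, 0);   /* counts accumulate while the child runs */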

2.4 Counting vs sampling and multiplexing

perf stat is pure counting mode by default:

  • The kernel maintains running totals in its perf_event structures.
  • No samples are generated, no stack traces, no large buffers.
  • At the end, perf reads each FD with read(2); the returned bytes are laid out according to the read_format requested at open time (count, time enabled, time running, etc.), as sketched at the end of this subsection.

When you ask for more concurrent hardware events than the PMU can support, the kernel multiplexes them:

  • Each perf_event has time_enabled and time_running fields.
  • When an event is scheduled on the PMU, time_running starts accumulating.
  • When it is descheduled to make room for another event, time_running stops accumulating, while time_enabled keeps growing for as long as the event stays enabled.
  • perf scales counts by time_enabled / time_running to approximate what the count would have been if always scheduled.

This is why perf stat has very low overhead—it just programs some counters and asks for the final numbers.
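
Here is a sketch of that final read and the multiplexing correction. It assumes the read_format from section 2.2 minus PERF_FORMAT_ID for brevity, i.e. the kernel returns three u64s for a single, ungrouped event:

/* Layout for read_format = PERF_FORMAT_TOTAL_TIME_ENABLED |
 *                          PERF_FORMAT_TOTAL_TIME_RUNNING
 * on a single (ungrouped) event. */
struct read_result {
    uint64_t value;         /* raw count while actually on the PMU */
    uint64_t time_enabled;  /* ns the event was enabled */
    uint64_t time_running;  /* ns it was really scheduled on a counter */
};

struct read_result r;
read(fd, &r, sizeof(r));

double scaled = (double)r.value;
if (r.time_running > 0 && r.time_running < r.time_enabled)
    scaled = (double)r.value * r.time_enabled / r.time_running;

/* time_running / time_enabled is the fraction perf prints as the
 * "(xx.x%)" multiplexing annotation next to scaled counts */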

2.5 Printing results

Finally, perf:

  • Scales the counts (e.g., if events were multiplexed on the PMU).
  • Computes derived metrics (IPC, miss rates, etc.).
  • Prints the pretty table you see on the terminal.

3. perf record – sampling and profiles

Now consider:

perf record -F 99 -g ./my_program

Here you are asking perf to sample your program ~99 times per second and capture call stacks. (99 Hz rather than a round 100 is the conventional choice: it avoids sampling in lockstep with periodic timers and other regular activity, which would bias the profile.)

3.1 Event setup for sampling

perf record still uses perf_event_open(2), but with additional fields in perf_event_attr:

  • sample_type – what to capture in each sample (IP, TID, time, call chain, registers, etc.).
  • sample_freq or sample_period – how often to sample (here, frequency = 99 Hz).
  • wakeup_events / wakeup_watermark – when to wake up user-space to drain the buffer.
  • sample_stack_user / exclude_kernel / exclude_user – how deep and where to sample.
  • precise_ip – request PEBS/IBS or similar hardware-assisted sampling when available.

The kernel allocates a ring buffer per event (or per group leader), implemented as an mmap-able region shared between the kernel and the perf process. Each record in the buffer starts with a struct perf_event_header followed by a type-specific payload.
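
For the perf record -F 99 -g example, the attr might look roughly like this; the exact sample_type perf chooses varies with options and kernel version, so treat it as an illustrative sketch:

struct perf_event_attr attr = {
    .type          = PERF_TYPE_HARDWARE,
    .config        = PERF_COUNT_HW_CPU_CYCLES,
    .size          = sizeof(struct perf_event_attr),
    .freq          = 1,     /* interpret sample_freq as a frequency */
    .sample_freq   = 99,    /* -F 99: aim for ~99 samples/sec */
    .sample_type   = PERF_SAMPLE_IP | PERF_SAMPLE_TID |
                     PERF_SAMPLE_TIME | PERF_SAMPLE_CALLCHAIN, /* -g */
    .wakeup_events = 1,     /* wake the reader after each sample */
    .disabled      = 1,
    .inherit       = 1,
};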

3.2 Periodic interrupts, overflow handling and callchains

When the event is hardware-based (e.g., cycles, instructions) and configured for sampling, the kernel:

  • Programs the PMU with an initial period (e.g., N events until overflow).
  • Each time the counter overflows, a PMU interrupt fires.
  • The perf PMU interrupt handler (perf_event_overflow() path):
    • Identifies which perf_event overflowed (via PMU-specific code).
    • Checks throttle limits (per-task and global) to avoid DoS from too many samples.
    • Builds a sample: IP, pid/tid, CPU, time, and whatever else sample_type requests.
    • For callchains, it may:
      • Use hardware call stack facilities (e.g., LBR on Intel with branch_stack).
      • Or walk the user stack using frame pointers or DWARF unwind info (slower, more expensive).
    • Writes a PERF_RECORD_SAMPLE into the ring buffer if not throttled.
    • Reloads the counter for the next period.

If you specified -F 99, the kernel uses frequency mode: it adjusts the underlying sample_period dynamically so that, on average, you get ~99 samples per second, even if the CPU frequency or workload intensity changes.

3.3 Software & tracepoint events

perf record is not limited to hardware events:

  • software events (cpu-clock, task-clock, page-faults, sched events, etc.).
  • tracepoints (syscalls:sys_enter_*, sched:sched_switch, etc.).

For these, the trigger comes from a hook in the kernel code path itself rather than from a PMU interrupt, but the record flow is the same: build a PERF_RECORD_SAMPLE and push it into the ring buffer.
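
For example, opening a tracepoint event means looking up its numeric id in tracefs and passing it as config. A sketch (the tracefs mount point may be /sys/kernel/tracing or /sys/kernel/debug/tracing, depending on the system):

/* Read the tracepoint id the kernel assigned to sched:sched_switch. */
FILE *f = fopen("/sys/kernel/tracing/events/sched/sched_switch/id", "r");
long long id = 0;
if (f) { fscanf(f, "%lld", &id); fclose(f); }

struct perf_event_attr attr = {
    .type          = PERF_TYPE_TRACEPOINT,
    .config        = (unsigned long long)id,  /* which tracepoint */
    .size          = sizeof(struct perf_event_attr),
    .sample_type   = PERF_SAMPLE_RAW | PERF_SAMPLE_TIME,
    .sample_period = 1,   /* record every single occurrence */
};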

3.4 User-space reading the ring buffer and perf.data layout

While your workload runs, the perf process:

  • mmaps the ring buffer for each event FD.
  • Uses a producer/consumer protocol with data_head and data_tail indices shared between kernel and user space.
  • Periodically wakes up (based on wakeup_events/wakeup_watermark or signals).
  • Consumes all pending records and writes them into a data file, usually perf.data in the current directory.
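
On the user-space side, a sketch of that consumer loop; it assumes the standard 1 + 2^n page mapping (here n = 3) and glosses over records that wrap around the buffer end, which real readers must copy out in two pieces:

/* Map 1 metadata page + 2^n data pages over the event fd. */
size_t page  = sysconf(_SC_PAGESIZE);
void  *base  = mmap(NULL, (1 + 8) * page, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);

struct perf_event_mmap_page *meta = base;   /* control page  */
char     *data      = (char *)base + page;  /* sample data   */
uint64_t  data_size = 8 * page;

uint64_t head = meta->data_head;   /* producer (kernel) position */
__sync_synchronize();              /* read barrier: head before data */
uint64_t tail = meta->data_tail;   /* consumer (our) position */

while (tail < head) {
    struct perf_event_header *hdr =
        (struct perf_event_header *)(data + (tail % data_size));
    /* hdr->type: PERF_RECORD_SAMPLE, PERF_RECORD_MMAP, ... ;
     * perf serializes these straight into perf.data */
    tail += hdr->size;
}
meta->data_tail = tail;   /* tell the kernel this space can be reused */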

Records can include:

  • Sample records (IP + stack + registers).
  • MMAP/MMAP2 records (when code regions are mapped, with build-id and page offset info).
  • COMM records (process/thread name changes).
  • FORK/EXIT records.

perf.data starts with a header section (feature bits, machine info, event descriptions, etc.), followed by the raw stream of records. The MMAP/COMM/FORK metadata is exactly what lets later tools (perf report, perf script) reconstruct symbol context and attribute samples to the correct binaries and functions.

3.5 perf report and call graphs

perf record only collects data. To inspect it you usually run:

perf report

perf report:

  • Opens perf.data and reads all records.
  • Uses symbol information (from the binary + debug info) to map IPs to functions/files/lines.
  • Aggregates samples by symbol, DSO, or call path.
  • Shows you the familiar TUI with percentages and stack traces.
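
A couple of common variations (both are standard perf report flags):

perf report --stdio                      # plain-text output instead of the TUI
perf report -g graph --sort dso,symbol   # choose call-graph style and sort keys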

The important point: perf record decouples sampling from visualization. The kernel just logs events; user-space tools do the heavy post-processing later.


4. Comparing perf stat and perf record

Both commands use the same kernel perf event API but in different modes:

  • perf stat
    • Counting mode
    • read() final values at the end
    • Very low overhead
    • Good for high-level metrics (IPC, miss rates, bandwidth)
  • perf record
    • Sampling mode
    • Continuous stream of records in a ring buffer
    • More overhead (interrupts + stack walking)
    • Good for where time is spent (hot functions, lines, call stacks)

You can even combine ideas (e.g., perf stat -I for periodic stats, or perf record with various event types), but the core mechanism—perf_event_open, kernel perf, PMU/tracepoints, ring buffers—remains the same.


5. Where to go next

If you want to dig deeper:

  • Man pages: man perf_event_open, man perf-stat, man perf-record, man perf-report
  • Kernel source: kernel/events/ in the Linux tree
  • Experiment:
    • Vary events: perf stat -e cache-misses,cache-references.
    • Try tracepoints: perf record -e sched:sched_switch -a -- sleep 1.
    • Look into the raw dump: perf script on your perf.data file.

Understanding this pipeline makes perf much less magical—you are really just driving a generic kernel facility that can count and sample almost anything the CPU or kernel can expose.