Adding a counter to the proc interface

11 minute read

Kernel Instrumentation

Introduction

The proc interface of the Linux kernel is a pseudo-filesystem that provides a window into the kernel’s runtime state. Unlike regular files, the data in /proc is generated on the fly by the kernel and reflects real-time system information.

The proc interface exposes information about all major kernel subsystems:

  • Memory management (/proc/meminfo, /proc/vmstat)
  • Process information (/proc/[pid]/)
  • System interrupts (/proc/interrupts)
  • CPU information (/proc/cpuinfo)
  • Kernel modules (/proc/modules)

This tutorial demonstrates how to add custom performance counters to track kernel behavior, which is essential for:

  • Performance analysis - Understanding system bottlenecks
  • Debugging - Tracking code path execution
  • Research - Validating hypotheses about kernel behavior

Why Add Custom Counters?

When the built-in counters don’t provide the metrics you need, custom counters let you track specific kernel events. Benefits include:

  • Low overhead - Counter increments use atomic operations with minimal performance impact
  • Real-time visibility - Instantly see counter values without rebooting
  • Flexible placement - Add counters anywhere in kernel code

Performance Considerations:

  • Counters in hot paths (e.g., pte_alloc) can impact performance if the function is called millions of times per second
  • Use judiciously in critical sections
  • Consider using tracepoints for more detailed analysis if needed

Counter Scopes

Counters can track data at different granularities:

Global-level counters (system-wide):

  • Located in /proc/vmstat, /proc/meminfo, etc.
  • Aggregate metrics across all processes
  • Example: Total page faults, swap operations

Process-level counters (per-process):

  • Located in /proc/[pid]/status, /proc/[pid]/stat
  • Track individual process behavior
  • Example: Memory usage, CPU time for specific processes

How the Kernel Differentiates Counter Scopes

The kernel uses different data structures and APIs for global vs. process-level counters:

Global counters:

  • Stored in per-CPU arrays (vm_event_states)
  • Incremented using count_vm_events() or count_vm_event()
  • Aggregated across all CPUs when read from /proc/vmstat (see the sketch after this list)
  • No process context needed
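
When /proc/vmstat is read, the kernel sums every CPU’s copy of each event counter. This is the aggregation step, lightly abridged from mm/vmstat.c (v5.9):

// mm/vmstat.c (v5.9, lightly abridged): sum each event counter
// across all online CPUs when /proc/vmstat is generated
static void sum_vm_events(unsigned long *ret)
{
    int cpu;
    int i;

    memset(ret, 0, NR_VM_EVENT_ITEMS * sizeof(unsigned long));

    for_each_online_cpu(cpu) {
        struct vm_event_state *this = &per_cpu(vm_event_states, cpu);

        for (i = 0; i < NR_VM_EVENT_ITEMS; i++)
            ret[i] += this->event[i];
    }
}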

Process-level counters:

  • Stored in task_struct (the process descriptor)
  • Incremented via direct field updates on the current process
  • Accessed through /proc/[pid]/ interfaces
  • Require process context

Key difference: Global counters use atomic per-CPU operations, while process-level counters update fields in the current task’s task_struct.

Example: Process-Level Counter Increment

Consider the min_flt counter (minor page faults per process) in /proc/[pid]/stat:

// In mm/memory.c - page-fault handling (simplified illustration)
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
    vm_fault_t ret;

    // ... page fault handling logic sets ret ...

    // Increment the process-level counter
    current->min_flt++;  // current points to the task_struct of the running process

    // Also increment the global counter
    count_vm_event(PGFAULT);

    return ret;
}

Key points:

  • current is a macro that returns a pointer to the current process’s task_struct
  • task_struct contains fields like min_flt, maj_flt, utime, stime, etc.
  • Direct field access (no locks needed as each process updates its own counters)
  • Both global and process counters can be updated for the same event

Accessing Process-Level Counters

Process counters are read from task_struct fields:

// In fs/proc/array.c - generates /proc/[pid]/stat
static int do_task_stat(struct seq_file *m, struct pid_namespace *ns,
                       struct pid *pid, struct task_struct *task, int whole)
{
    // ... other fields ...
    
    seq_put_decimal_ull(m, " ", task->min_flt);  // Minor page faults
    seq_put_decimal_ull(m, " ", task->maj_flt);  // Major page faults
    seq_put_decimal_ull(m, " ", task->utime);    // User CPU time (converted to clock ticks in the real code)
    seq_put_decimal_ull(m, " ", task->stime);    // System CPU time (likewise converted)
    
    // ... more fields ...
}

Location of task_struct definition: include/linux/sched.h
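
For reference, the fields read above live in task_struct itself (abridged from include/linux/sched.h; the many surrounding fields are elided):

struct task_struct {
    // ...

    // Page-fault accounting:
    unsigned long min_flt;
    unsigned long maj_flt;

    // CPU time accounting:
    u64 utime;
    u64 stime;

    // ...
};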

Adding a Custom Process-Level Counter

To add a process-level counter:

  1. Add field to task_struct in include/linux/sched.h:
    struct task_struct {
     // ... existing fields ...
     unsigned long my_custom_counter;
     // ... more fields ...
    };
    
  2. Initialize in process creation (kernel/fork.c):
    static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
    {
     // ... allocation code ...
     tsk->my_custom_counter = 0;
     // ... rest of initialization ...
    }
    
  3. Increment in your code:
    // Anywhere in kernel code with process context
    current->my_custom_counter++;
    
  4. Expose via /proc/[pid]/status in fs/proc/array.c (and wire the helper into proc_pid_status(), as sketched after this list):
    static inline void task_custom_stats(struct seq_file *m, struct task_struct *task)
    {
     seq_printf(m, "MyCustomCounter:\t%lu\n", task->my_custom_counter);
    }
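
For the new line to actually appear, the helper from step 4 must be called from proc_pid_status(), the function in fs/proc/array.c that generates /proc/[pid]/status. A minimal sketch, with the existing output calls elided:

// In fs/proc/array.c - generates /proc/[pid]/status
int proc_pid_status(struct seq_file *m, struct pid_namespace *ns,
                    struct pid *pid, struct task_struct *task)
{
    // ... existing output (Name, State, VmPeak, ...) ...

    task_custom_stats(m, task);  // our helper from step 4

    // ... remaining output ...
    return 0;
}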
    

Important: Process-level counters require rebuilding the kernel and are more invasive than global counters since they modify core kernel structures.

This is global-level data:

sandeep@sandeep-Precision-3630-Tower:~$ cat /proc/meminfo 
MemTotal:       32585780 kB
MemFree:        28039452 kB
MemAvailable:   29530716 kB
Buffers:           99868 kB
Cached:          2151404 kB
SwapCached:            0 kB
Active:           649700 kB
Inactive:        3242992 kB
Active(anon):       6512 kB
Inactive(anon):  2164592 kB
Active(file):     643188 kB
Inactive(file):  1078400 kB
Unevictable:      345384 kB
Mlocked:              64 kB
SwapTotal:       2097148 kB
...

This is process-level data:

sandeep@sandeep-Precision-3630-Tower:~$ cat /proc/1645/status 
Name:	gsd-media-keys
Umask:	0002
State:	S (sleeping)
Tgid:	1645
Ngid:	0
Pid:	1645
PPid:	1329
TracerPid:	0
Uid:	1000	1000	1000	1000
Gid:	1000	1000	1000	1000
FDSize:	64
Groups:	4 24 27 30 46 120 131 132 1000 
NStgid:	1645
NSpid:	1645
NSpgid:	1645
NSsid:	1645
VmPeak:	  844100 kB

Understanding the Counter Architecture

This guide focuses on adding global-level counters to /proc/vmstat. The same principles apply to other proc interfaces with minor variations.

Target: Linux kernel v5.9 (steps are similar for v5.x kernels)

Key files involved:

  • mm/vmstat.c - Counter name definitions (display layer)
  • include/linux/vm_event_item.h - Counter enum declarations
  • mm/migrate.c - Example usage of counters

cat /proc/vmstat
...
pgmigrate_success 0
...

When we list the statistics from /proc/vmstat, we see a counter called pgmigrate_success. Its display name is defined in mm/vmstat.c:

...
#ifdef CONFIG_MIGRATION
        "pgmigrate_success",
        "pgmigrate_fail",
        "thp_migration_success",
        "thp_migration_fail",
        "thp_migration_split",
#endif
...

Understanding pgmigrate_success

The pgmigrate_success counter tracks successfully migrated pages; migrations occur during NUMA balancing, memory compaction, and similar operations. On a system where no migration takes place, the counter remains at 0.

Important: The string name in vmstat.c is just the display name. The actual counter is an enum constant defined elsewhere.

Two-part definition:

  1. Display name - String in mm/vmstat.c (what you see in /proc/vmstat)
  2. Counter enum - Constant in include/linux/vm_event_item.h (what code uses)

This separation allows the kernel to efficiently use integer enums internally while presenting readable names to users.

The actual counter enum is defined in include/linux/vm_event_item.h:

#ifdef CONFIG_MIGRATION
                PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
                THP_MIGRATION_SUCCESS,
                THP_MIGRATION_FAIL,
                THP_MIGRATION_SPLIT,
#endif

Searching for PGMIGRATE_SUCCESS (case-sensitive), we see that it is used in two other places, both in the same file, mm/migrate.c:

int migrate_pages(...)
{
    ...
    count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
    ...
}

int migrate_misplaced_transhuge_page(...)
{
    ...
    count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
    ...
}

How Counters are Incremented

Base page migration (migrate_pages):

  • Migrates standard 4KB pages
  • Increments by nr_succeeded (total number of successfully migrated pages)

Huge page migration (migrate_misplaced_transhuge_page):

  • Migrates 2MB transparent huge pages
  • Increments by HPAGE_PMD_NR (512 base pages per huge page)

The API count_vm_events(COUNTER_NAME, count) adds count to the current CPU’s copy of the counter; the per-CPU copies are summed when /proc/vmstat is read.
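
Under the hood, these helpers are thin wrappers around per-CPU operations (abridged from include/linux/vmstat.h, v5.9):

// include/linux/vmstat.h (v5.9, abridged)
static inline void count_vm_event(enum vm_event_item item)
{
    this_cpu_inc(vm_event_states.event[item]);
}

static inline void count_vm_events(enum vm_event_item item, long delta)
{
    this_cpu_add(vm_event_states.event[item], delta);
}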

Step-by-Step: Adding a Custom Counter

We’ll create a counter called custom_test to track system call invocations. This demonstrates the complete workflow from definition to usage.

Prerequisites

  • Linux kernel source (v5.9 recommended)
  • Development tools: build-essential, libncurses-dev, bison, flex, libssl-dev, libelf-dev
  • Root access for kernel installation

Step 1: Add Display Name

In mm/vmstat.c, add the human-readable string that will appear in /proc/vmstat:

...
        "numa_hint_faults_local",
        "numa_pages_migrated",
#endif
        "custom_test", // <-- here
#ifdef CONFIG_MIGRATION
        "pgmigrate_success",
        "pgmigrate_fail",
...

Note: The array indices must match between vmstat.c and vm_event_item.h, so add your counter in the same relative position.

Step 2: Add Counter Enum

In include/linux/vm_event_item.h, add the enum constant that code will reference:

		NUMA_HINT_FAULTS_LOCAL,
		NUMA_PAGE_MIGRATE,
#endif
	CUSTOM_TEST, // <-- Here
#ifdef CONFIG_MIGRATION
		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
		THP_MIGRATION_SUCCESS,

Important: Use uppercase for the enum (e.g., CUSTOM_TEST) and lowercase with underscores for the display string (e.g., "custom_test").

Step 3: Instrument the Code

Now use the counter to track the migrate_pages system call. This syscall moves process memory pages between NUMA nodes.

Location: mm/mempolicy.c

Usage: We’ll track every invocation of the syscall, regardless of success or failure:

SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
        const unsigned long __user *, old_nodes,
        const unsigned long __user *, new_nodes)
{
    count_vm_events(CUSTOM_TEST, 1); // <-- Our counter
    return kernel_migrate_pages(pid, maxnode, old_nodes, new_nodes);
}

Key points:

  • count_vm_events(CUSTOM_TEST, 1) increments the current CPU’s copy of the counter by 1
  • Placed before kernel_migrate_pages() to count every attempt
  • Counter increments even if the syscall fails (useful for debugging)

Building and Installing the Kernel

Compilation

# Configure kernel (if needed)
make menuconfig

# Build kernel with parallel jobs
make -j$(nproc)

# Install modules
sudo make modules_install

# Install kernel
sudo make install

# Update bootloader
sudo update-grub  # For Ubuntu/Debian
# OR
sudo grub2-mkconfig -o /boot/grub2/grub.cfg  # For RHEL/CentOS

Build time: Expect 30-60 minutes depending on your system.

Reboot

sudo reboot

Verify kernel version after reboot:

uname -r

Testing the Counter

Initial State

After booting into the new kernel, verify the counter exists and starts at 0:

cat /proc/vmstat | grep custom_test

Expected output:

custom_test 0

Triggering the Counter

Use the migratepages command to invoke the migrate_pages syscall:

# Install numactl if not present
sudo apt-get install numactl  # Ubuntu/Debian
# OR
sudo yum install numactl      # RHEL/CentOS

# Attempt to migrate pages (PID doesn't need to be valid)
migratepages 1234 0 1

What’s happening:

  • migratepages calls the migrate_pages() syscall
  • Our counter increments before the actual migration logic
  • Even invalid PIDs trigger the counter (by design); the sketch below makes the same call directly
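
If you prefer to trigger the syscall without numactl, it can be invoked directly via syscall(2). A minimal sketch; the PID 1234 and the node numbers are arbitrary:

// trigger.c - build with: gcc -o trigger trigger.c
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
    unsigned long old_nodes = 1UL << 0;  // node 0
    unsigned long new_nodes = 1UL << 1;  // node 1

    // maxnode: how many bits of the node masks the kernel should read
    long ret = syscall(SYS_migrate_pages, 1234, 8 * sizeof(unsigned long),
                       &old_nodes, &new_nodes);

    // The call fails for a bogus PID, but custom_test still increments
    printf("migrate_pages returned %ld\n", ret);
    return 0;
}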

Verify Increment

cat /proc/vmstat | grep custom_test

Expected output:

custom_test 1

Success! The counter incremented.

Continuous Monitoring

Watch the counter in real-time:

watch -n 1 'cat /proc/vmstat | grep custom_test'

Or use a script to trigger and monitor:

for i in {1..5}; do
    migratepages 1234 0 1
    echo "Attempt $i:"
    grep custom_test /proc/vmstat
    sleep 1
done

Important Considerations

Counter Persistence

Limitation: Counters reset to 0 on reboot. They are not persistent across boots.

Workaround: Write a monitoring script that:

  • Polls /proc/vmstat periodically
  • Logs values to a file or time-series database
  • Calculates deltas for rate-based metrics

Example monitoring script:

#!/bin/bash
while true; do
    timestamp=$(date +%s)
    value=$(grep custom_test /proc/vmstat | awk '{print $2}')
    echo "$timestamp $value" >> /var/log/custom_counter.log
    sleep 60
done

Performance Impact

  • Per-CPU updates: count_vm_events adds to a per-CPU counter, so the fast path is a single cheap per-CPU addition
  • Cache effects: a single shared counter updated from many cores would bounce its cache line between caches; the per-CPU vm_event counters avoid this
  • Critical paths: even so, avoid instrumenting functions called millions of times per second; a cheaper variant for preempt-disabled contexts is sketched below
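
For call sites where preemption is already disabled (under a spinlock or in interrupt context, for example), the kernel provides a slightly cheaper variant (abridged from include/linux/vmstat.h, v5.9). In such a context, __count_vm_event(CUSTOM_TEST) replaces count_vm_events(CUSTOM_TEST, 1):

// include/linux/vmstat.h (v5.9, abridged): no preemption protection,
// for contexts where preemption is already disabled
static inline void __count_vm_event(enum vm_event_item item)
{
    raw_cpu_inc(vm_event_states.event[item]);
}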

Debugging Tips

If the counter doesn’t appear:

  1. Check array alignment in vmstat.c and vm_event_item.h
  2. Verify kernel compiled and installed correctly: uname -r
  3. Check kernel boot messages for related errors: dmesg | grep -i error

If the counter doesn’t increment:

  1. Verify code path is actually executed
  2. Add printk() statements for debugging (see the sketch after this list)
  3. Check syscall is being invoked: strace migratepages 1234 0 1
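
A hypothetical debug line placed next to the increment; KERN_DEBUG messages appear in dmesg when the console log level allows:

// Temporary debug instrumentation (remove before production use)
count_vm_events(CUSTOM_TEST, 1);
printk(KERN_DEBUG "custom_test: hit by pid %d (%s)\n",
       current->pid, current->comm);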

Advanced Usage

Multiple Counters

Add multiple related counters to track success/failure separately:

// In vm_event_item.h
CUSTOM_TEST_SUCCESS,
CUSTOM_TEST_FAIL,

// In vmstat.c
"custom_test_success",
"custom_test_fail",

// In code
if (result >= 0)
    count_vm_events(CUSTOM_TEST_SUCCESS, 1);
else
    count_vm_events(CUSTOM_TEST_FAIL, 1);

Per-CPU Counters

The vm_event counters are already per-CPU; for custom counters outside that framework, raw per-CPU variables likewise avoid contention in extremely hot paths:

// In a header: declaration
DECLARE_PER_CPU(unsigned long, custom_counter);

// In exactly one .c file: definition
DEFINE_PER_CPU(unsigned long, custom_counter);

// Increment the current CPU's copy
this_cpu_inc(custom_counter);
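
Reading such a counter means summing every CPU’s copy, mirroring what /proc/vmstat does for vm_event counters. A minimal sketch:

// Sum the per-CPU copies to get the system-wide total
static unsigned long read_custom_counter(void)
{
    unsigned long total = 0;
    int cpu;

    for_each_possible_cpu(cpu)
        total += per_cpu(custom_counter, cpu);
    return total;
}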

Analysis Tools

Parse /proc/vmstat programmatically:

#!/usr/bin/env python3
import time

def read_vmstat():
    # Parse /proc/vmstat into a dict of counter name -> integer value
    stats = {}
    with open('/proc/vmstat') as f:
        for line in f:
            key, value = line.split()
            stats[key] = int(value)
    return stats

# Monitor rate of change over a 1-second window
prev = read_vmstat()
time.sleep(1)
curr = read_vmstat()
rate = curr['custom_test'] - prev['custom_test']
print(f"Custom counter rate: {rate} events/sec")

Conclusion

Adding custom counters to /proc provides lightweight, real-time instrumentation for kernel analysis. This technique is invaluable for:

  • Performance tuning and bottleneck identification
  • Validating kernel modifications
  • Understanding system behavior under specific workloads

The low overhead makes it suitable for production environments, unlike more invasive debugging techniques. Combined with tools like perf, ftrace, and eBPF, custom counters form a complete kernel observability toolkit.