Adding a counter to the proc interface

The proc interface of the Linux kernel exposes a huge amount of information about what is going on in the system. This information is published by the kernel and can be accessed from userspace like a regular part of the file system. It covers all the sub-systems of the kernel: the memory system, interrupts, and allocators, to name a few.

If some counter we need is not present, we can add it ourselves in the kernel code. We can add as many counters as we want, and since each one is just a counter increment (which we will do using existing kernel APIs), the overhead is kept to a minimum. Nevertheless, be wary of adding counters to hot functions like pte_alloc, where even a slight delay can slow down the whole system because the function is called so frequently.

This information can exist either at a global level or at a process level.

This is global-level data:

sandeep@sandeep-Precision-3630-Tower:~$ cat /proc/meminfo 
MemTotal:       32585780 kB
MemFree:        28039452 kB
MemAvailable:   29530716 kB
Buffers:           99868 kB
Cached:          2151404 kB
SwapCached:            0 kB
Active:           649700 kB
Inactive:        3242992 kB
Active(anon):       6512 kB
Inactive(anon):  2164592 kB
Active(file):     643188 kB
Inactive(file):  1078400 kB
Unevictable:      345384 kB
Mlocked:              64 kB
SwapTotal:       2097148 kB
...

This is process-level data:

sandeep@sandeep-Precision-3630-Tower:~$ cat /proc/1645/status 
Name:	gsd-media-keys
Umask:	0002
State:	S (sleeping)
Tgid:	1645
Ngid:	0
Pid:	1645
PPid:	1329
TracerPid:	0
Uid:	1000	1000	1000	1000
Gid:	1000	1000	1000	1000
FDSize:	64
Groups:	4 24 27 30 46 120 131 132 1000 
NStgid:	1645
NSpid:	1645
NSpgid:	1645
NSsid:	1645
VmPeak:	  844100 kB

Where are the Counters?

Here, we will see how to add a new counter to the global-level data. For this, we need to look at the kernel source code. We are going to use v5.9, but the steps should be the same for versions around it.
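
If you want to follow along, one way to fetch the source is from the stable tree (any mirror with the v5.9 tag works):

git clone https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git
cd linux
git checkout v5.9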

We will be adding a new counter to /proc/vmstat, but the steps are similar for the other proc files as well.

cat /proc/vmstat
...
pgmigrate_success 0
...

When we list the statistics from /proc/vmstat, we see a counter named pgmigrate_success. Its name is defined in the vmstat_text array in mm/vmstat.c:

...
#ifdef CONFIG_MIGRATION
        "pgmigrate_success",
        "pgmigrate_fail",
        "thp_migration_success",
        "thp_migration_fail",
        "thp_migration_split",
#endif
...

pgmigrate_success counts the number of successfully migrated pages in a NUMA system. As I am working on a non-NUMA system, the value of this counter will always be 0. Nevertheless, we can still see where and when in the kernel source code this counter is incremented.
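
Incidentally, you can check how many NUMA nodes your machine has with numactl --hardware, from the same numactl package we use later (output abridged; this is what a single-node machine reports):

> numactl --hardware
available: 1 nodes (0)
...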

If you just search for "pgmigrate_success" in the kernel source, you won't find any other references to it. That is because what we saw in mm/vmstat.c only defines how the counter is printed. The counter itself is an entry of enum vm_event_item in include/linux/vm_event_item.h:

#ifdef CONFIG_MIGRATION
                PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
                THP_MIGRATION_SUCCESS,
                THP_MIGRATION_FAIL,
                THP_MIGRATION_SPLIT,
#endif

Searching for PGMIGRATE_SUCCESS (case-sensitive), we see that it is used in two other places, both in mm/migrate.c:

int migrate_pages(... {
    ...
    count_vm_events(PGMIGRATE_SUCCESS, nr_succeeded);
    ...
}

int migrate_misplaced_transhuge_page(...
{
    ...
    count_vm_events(PGMIGRATE_SUCCESS, HPAGE_PMD_NR);
    ...
}

migrate_pages migrates base 4 KB pages, and the counter is incremented by the number of successful migrations (nr_succeeded). migrate_misplaced_transhuge_page migrates huge pages; a single huge page consists of 512 base pages (HPAGE_PMD_NR), so the counter is incremented by that amount on success.
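
For reference, count_vm_event and count_vm_events are small helpers defined in include/linux/vmstat.h. With CONFIG_VM_EVENT_COUNTERS enabled they boil down to a per-CPU add (shown slightly abridged), which is why such a counter adds so little overhead:

/* Each event is a per-CPU counter, so an increment is a cheap
 * per-CPU add with no locking. */
static inline void count_vm_event(enum vm_event_item item)
{
	this_cpu_inc(vm_event_states.event[item]);
}

static inline void count_vm_events(enum vm_event_item item, long delta)
{
	this_cpu_add(vm_event_states.event[item], delta);
}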

Adding a New Counter

Once we know where the counters live, adding a new one is easy. We will call our counter custom_test.

In the file mm/vmstat.c, add this line:

...
        "numa_hint_faults_local",
        "numa_pages_migrated",
#endif
        "custom_test", // <-- here
#ifdef CONFIG_MIGRATION
        "pgmigrate_success",
        "pgmigrate_fail",
...

Then, in the file include/linux/vm_event_item.h, add the matching enum entry at the same relative position. The enum order must match the string order in mm/vmstat.c, since the printing code indexes the name array by the enum value:

		NUMA_HINT_FAULTS_LOCAL,
		NUMA_PAGE_MIGRATE,
#endif
	CUSTOM_TEST, // <-- Here
#ifdef CONFIG_MIGRATION
		PGMIGRATE_SUCCESS, PGMIGRATE_FAIL,
		THP_MIGRATION_SUCCESS,

Now we need to use the counter somewhere to see whether it works. We will use it to track how many times the migrate_pages system call has been invoked. This system call moves the pages of an application from one NUMA node to another. Specifically, we will use the migratepages command from the numactl package, which invokes this system call.
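
As an aside, if numactl is not installed, the same system call can be issued directly with syscall(2). Here is a minimal, hypothetical test program; the PID and node numbers are arbitrary:

/* Hypothetical stand-in for the migratepages tool: call the
 * migrate_pages syscall directly. The call may fail for an invalid
 * PID, which is fine for our test. */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <sys/syscall.h>

int main(void)
{
	unsigned long old_nodes = 1UL << 0;	/* from node 0 */
	unsigned long new_nodes = 1UL << 1;	/* to node 1 */
	unsigned long maxnode = 8 * sizeof(unsigned long);

	if (syscall(SYS_migrate_pages, 1234, maxnode,
		    &old_nodes, &new_nodes) < 0)
		perror("migrate_pages");
	return 0;
}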

The system call is defined in mm/mempolicy.c. If we look at its definition:

SYSCALL_DEFINE4(migrate_pages, pid_t, pid, unsigned long, maxnode,
		const unsigned long __user *, old_nodes,
		const unsigned long __user *, new_nodes)
{
	count_vm_events(CUSTOM_TEST, 1); // <-- our counter
	return kernel_migrate_pages(pid, maxnode, old_nodes, new_nodes);
}

it calls kernel_migrate_pages internally. We increment our counter just before this call, so every invocation bumps the count by 1. Since the delta is always 1 here, the singular helper count_vm_event(CUSTOM_TEST) would work just as well. And that is it.

Building the kernel

make -j12
sudo make modules_install -j12
sudo make install

and reboot your machine.
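
Once the machine is back up, it is worth confirming that the new kernel actually booted (the exact version string depends on your tree and configuration):

> uname -r
5.9.0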

Then, check /proc/vmstat:

> cat /proc/vmstat
...
drop_pagecache 0
drop_slab 0
oom_kill 0
custom_test 0 <-- Our counter
pgmigrate_success 0
pgmigrate_fail 0
thp_migration_success 0
thp_migration_fail 0
....

Let us test whether the increment works. The arguments to migratepages are a PID, a source node, and a destination node. Since we added the counter before the call to kernel_migrate_pages, the count is incremented even if the actual migration of the pages fails:

migratepages 1234 0 1

It does not matter that the PID is not valid.

> cat /proc/vmstat
...
drop_pagecache 0
drop_slab 0
oom_kill 0
custom_test 1 <-- Our counter value incremented
pgmigrate_success 0
pgmigrate_fail 0
thp_migration_success 0
thp_migration_fail 0
....

Things to know

As far as I know, it is not possible to decrement these counter values or reset them to 0 from userspace; to start from zero again, we need to reboot the machine.
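
A common workaround is to snapshot the counter before an experiment and subtract that from the value afterwards. A minimal sketch (the read_vmstat helper below is hypothetical, not a kernel or libc API):

/* Read one named counter from /proc/vmstat. Snapshot it before and
 * after an experiment and take the difference, since the kernel
 * offers no way to reset these counters. */
#include <stdio.h>
#include <string.h>

static long read_vmstat(const char *name)
{
	FILE *f = fopen("/proc/vmstat", "r");
	char key[64];
	long val;

	if (!f)
		return -1;
	while (fscanf(f, "%63s %ld", key, &val) == 2) {
		if (strcmp(key, name) == 0) {
			fclose(f);
			return val;
		}
	}
	fclose(f);
	return -1;
}

int main(void)
{
	long before = read_vmstat("custom_test");
	/* ... run the workload under test here ... */
	long after = read_vmstat("custom_test");

	printf("custom_test delta: %ld\n", after - before);
	return 0;
}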