Technical Blog Posts

The VNNI layout, and why your AMX kernel is silently wrong

Notes from getting Intel AMX to do a matrix multiply: the VNNI memory layout, why skipping it gives silently wrong answers, and LIBXSMM's dispatch model.

March 08, 2026 Updated July 10, 2026

amx

Everything on Precision: Numeric Formats in Modern ML and HPC

A walkthrough of numeric formats that matter today — FP32, TF32, FP16, BF16, FP8, INT8, INT4 — how they behave, when to use each, and where the pitfalls hide.

March 03, 2026 Updated March 03, 2026

amx

Cache Coherence Demystified: The MESI Protocol on Intel Granite Rapids

Deep dive into cache coherence and the MESI protocol on a 256-core Intel Granite Rapids system: microbenchmarks for ping-pong latency, false sharing, atomic contention, and lock primitives.

March 03, 2026 Updated March 03, 2026

mesi cpu performance

LLM Inference Introduction: the end-to-end flow and vector math

LLM basics.

February 10, 2026 Updated February 10, 2026

llm

GPU Memory Profiling Tools (NVIDIA and Intel)

A practical guide to observing GPU memory stats on NVIDIA and Intel GPUs (monitors, profilers, and attribution).

February 10, 2026 Updated February 10, 2026

gpu profiling performance memory nvidia intel

Library Calling in C

May 01, 2019 Updated December 04, 2025

binary-structure libraries libc

Adding a counter to the proc interface

A practical guide to instrumenting the Linux kernel by adding custom counters to /proc/vmstat.

April 08, 2021 Updated December 04, 2025

kernel instrumentation

Understanding Linux perf: stat and record

What really happens in the kernel when you run perf stat or perf record.

December 04, 2025 Updated December 04, 2025

performance profiling linux perf kernel

Showing 13 of 13 technical blog posts.

Sandeep Kumar

Installing Intel SGX on Ubuntu 18.04

Attacks on the control flow integrity of applications

Intel SGX Internals

Fat File System

Install Graphene SGX
