GPU Memory Profiling Tools (NVIDIA and Intel)


What this post is (and isn’t)

This is a Linux-first, practical guide to observing GPU memory behavior.

  • If you want to answer “How much VRAM is used right now?” → use the monitoring tools.
  • If you want “Which kernel / line of code causes the spike?” → use profilers + application instrumentation.
  • If you want “Why is memory high even after tensors are freed?” → you likely need allocator-level details (PyTorch/TF) and an understanding of fragmentation/retention.

I’ll cover NVIDIA (CUDA) and Intel (iGPU + Intel discrete GPUs via oneAPI/Level Zero) and focus specifically on memory-related stats: allocated/used memory, memory bandwidth, and where possible per-process attribution.


Memory stats: what you can and cannot observe

Before tools, it helps to align on terminology:

  • Device memory / VRAM: dedicated memory on the GPU (HBM/GDDR). Most “GPU memory used” counters refer to this.
  • Unified / shared memory: on integrated GPUs, the “GPU memory” may be shared with system RAM.
  • Allocated vs used: frameworks often keep memory in a pool (allocated/reserved) even after you “free” tensors.
  • Driver overhead: page tables, runtime bookkeeping, caching, and compilation artifacts may consume memory.
  • Bandwidth: memory traffic (GB/s) is often as important as capacity. Bottlenecks can appear even with plenty of free VRAM.

A common pattern:

  • “VRAM used” is easy to monitor.
  • “Which tensor/allocator keeps it” requires framework-specific instrumentation.
  • “Which kernel causes traffic” requires kernel-level profilers.

Quick start: one command per vendor

NVIDIA

  • Instant snapshot:
nvidia-smi
  • Machine-readable memory stats:
nvidia-smi --query-gpu=name,memory.total,memory.used,memory.free --format=csv
  • Per-process memory (when supported):
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
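To turn the CSV output above into something a script can consume, a small parser along these lines works; the sample output below is made up for illustration:

```python
import csv
import io

def parse_smi_csv(text: str) -> list:
    """Parse `nvidia-smi --query-... --format=csv` output into dicts.

    nvidia-smi emits a header row followed by one row per GPU/process,
    with values like "2048 MiB"; we strip the unit so callers get ints.
    """
    rows = list(csv.reader(io.StringIO(text), skipinitialspace=True))
    header, body = rows[0], rows[1:]
    out = []
    for row in body:
        rec = {}
        for key, val in zip(header, row):
            # Keep only the numeric part of values like "2048 MiB".
            num = val.split()[0] if val else val
            rec[key] = int(num) if num.isdigit() else val
        out.append(rec)
    return out

# Illustrative sample (values are invented):
sample = """pid, process_name, used_memory [MiB]
12345, python, 2048 MiB
"""
print(parse_smi_csv(sample))
```

The same parser handles `--query-gpu=...` output, since nvidia-smi uses the same CSV shape for both.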

Intel

Intel tooling depends on whether you have integrated graphics (iGPU) or Intel discrete GPUs, and which driver stack you’re using.

  • If available, try the top-like monitor:
intel_gpu_top
  • For newer Intel GPU stacks, a common option is an “XPU SMI” style tool (name may vary by distro/driver):
xpu-smi

If those tools aren’t installed, the rest of this post lists alternatives and what metrics you can realistically get.


NVIDIA: memory observability toolbox

1) nvidia-smi: the baseline operational monitor

What it’s good for:

  • Current VRAM used/total/free
  • GPU utilization and clocks
  • Per-process used memory for compute contexts (commonly works, but can vary with permissions/MIG)

Useful commands:

# Continuous updates
nvidia-smi -l 1

# “dmon” streaming monitor; include memory-related fields
nvidia-smi dmon -s mu

# Memory + utilization over time (quick sanity)
watch -n 1 "nvidia-smi --query-gpu=utilization.gpu,utilization.memory,memory.used,memory.total --format=csv,noheader,nounits"

Notes:

  • nvidia-smi reports device memory used as seen by the driver; it does not reflect what your framework thinks it has allocated in tensors.
  • In containerized environments, make sure the container runtime exposes NVML (e.g., via NVIDIA Container Toolkit).

2) nvtop: interactive view (like htop for GPUs)

If installed (nvtop package), it provides:

  • Per-GPU graphs for memory and utilization
  • Per-process views (depending on driver support)

This is a good “operator dashboard” when you’re debugging memory regressions during development.

3) NVML: programmatic metrics (Python/C/C++)

The NVIDIA Management Library (NVML) is what backs many nvidia-smi counters. Use NVML when you want:

  • periodic logging to a file,
  • integration into an experiment runner,
  • exporting metrics to Prometheus.

In Python, the typical route is the pynvml module (distributed as nvidia-ml-py or pynvml, depending on packaging). At a conceptual level:

  • query memory totals/used,
  • query per-process accounting (if supported),
  • query ECC / retired pages (datacenter GPUs).
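As a sketch of the “periodic logging” use case: the sampling callable is injected so the loop itself is GPU-agnostic, and the commented snippet in the docstring shows how it would typically be wired to NVML (assuming pynvml is installed):

```python
import time
from typing import Callable, Tuple

def log_gpu_memory(sample: Callable[[], Tuple[int, int]],
                   interval_s: float = 1.0,
                   iterations: int = 5) -> list:
    """Periodically call `sample()` -> (used_bytes, total_bytes) and record it.

    In practice `sample` would wrap NVML, e.g.:

        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        def sample():
            info = pynvml.nvmlDeviceGetMemoryInfo(handle)
            return info.used, info.total
    """
    samples = []
    for _ in range(iterations):
        used, total = sample()
        samples.append((time.time(), used, total))
        time.sleep(interval_s)
    return samples
```

From here, writing the tuples to a CSV file or exposing them as Prometheus gauges is straightforward.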

4) DCGM: datacenter-grade monitoring + exporters

If you care about fleet monitoring:

  • DCGM (Data Center GPU Manager) provides richer metric collection and health checks.
  • A common pattern is dcgm-exporter → Prometheus → Grafana.

Memory-related metrics can include:

  • frame buffer memory used,
  • memory bandwidth/utilization,
  • ECC error counters (hardware-dependent).

This is the best option when you need long time-series plots across nodes.

5) Nsight Systems: “timeline first” profiler

Use Nsight Systems when the question is:

  • “When does memory spike relative to CPU work, I/O, kernel launches, and memcpy?”
  • “Are we stalling on page faults, H2D/D2H copies, or synchronization?”

What you can extract:

  • CUDA API calls timing (allocations, memcpys)
  • Kernel launch timelines
  • Unified memory migrations (if used)

This won’t directly tell you “which tensor,” but it’s excellent for seeing allocation churn and copy bursts.

6) Nsight Compute: kernel-level counters (including memory traffic)

Use Nsight Compute when the question is:

  • “Is this kernel limited by memory bandwidth?”
  • “What is the achieved memory throughput vs theoretical?”

Memory-related counters include:

  • global load/store throughput,
  • L2 cache hit rates,
  • DRAM throughput,
  • shared memory behavior.

This is where you measure “the model is slow because attention kernels are memory-bound” vs “it’s compute-bound.”
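The memory-bound vs compute-bound question reduces to a roofline-style comparison: arithmetic intensity (FLOPs per byte of DRAM traffic) against the machine balance point (peak FLOPS / peak bandwidth). A back-of-envelope sketch, with all hardware numbers hypothetical (Nsight Compute measures the real counters):

```python
def bound_by(flops: float, bytes_moved: float,
             peak_flops: float, peak_bw: float) -> str:
    """Classify a kernel with the roofline model.

    A kernel is memory-bound when its arithmetic intensity is below
    the machine balance point (peak_flops / peak_bw).
    """
    intensity = flops / bytes_moved   # FLOP per byte of DRAM traffic
    balance = peak_flops / peak_bw    # FLOP per byte at the roofline knee
    return "memory-bound" if intensity < balance else "compute-bound"

# Hypothetical kernel: 2 TFLOP of work over 8 TB of traffic, on a GPU
# with 100 TFLOP/s peak compute and 2 TB/s of DRAM bandwidth.
print(bound_by(flops=2e12, bytes_moved=8e12,
               peak_flops=100e12, peak_bw=2e12))
# -> memory-bound (intensity 0.25 FLOP/B vs balance 50 FLOP/B)
```

This is only the first-order picture; cache hit rates and shared memory usage (which Nsight Compute also reports) shift the effective traffic.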

7) Framework-level memory: PyTorch / TensorFlow

To connect “VRAM used” to “what my program thinks it allocated,” you need framework tools.

PyTorch (examples):

  • torch.cuda.memory_allocated() vs torch.cuda.memory_reserved()
  • memory summary functions that show allocator state and fragmentation

This is often the fastest way to explain why memory usage stays high after freeing tensors.
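As an illustration of how these two counters are interpreted, here is a small helper. In a real session the inputs would come from torch.cuda.memory_allocated() and torch.cuda.memory_reserved(); the 2x ratio is an arbitrary heuristic for this sketch, not a PyTorch constant:

```python
def explain_allocator(allocated: int, reserved: int,
                      cache_ratio_threshold: float = 2.0) -> str:
    """Interpret PyTorch caching-allocator counters.

    allocated ~ torch.cuda.memory_allocated(): bytes in live tensors.
    reserved  ~ torch.cuda.memory_reserved(): bytes the allocator holds
    from the driver, including cached blocks kept for reuse.
    """
    if reserved == 0:
        return "no device memory reserved"
    cached = reserved - allocated
    if allocated and reserved / allocated >= cache_ratio_threshold:
        return (f"{cached} bytes cached: the allocator is retaining freed "
                "blocks; torch.cuda.empty_cache() would return them")
    return f"{cached} bytes cached; usage looks proportional to live tensors"
```

The point of the sketch: nvidia-smi reports the reserved number, so a large allocated/reserved gap is the usual answer to “why is VRAM still high?”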


Intel: memory observability toolbox

Intel GPUs show up in multiple environments:

  • Integrated GPUs (common on laptops/desktops)
  • Intel discrete GPUs (workstations/servers), increasingly via oneAPI/Level Zero

The exact tool availability depends on:

  • kernel/driver (the i915 driver vs the newer xe driver for recent GPUs),
  • user permissions, and
  • distro packages.

1) intel_gpu_top: live engine + client activity

intel_gpu_top (from intel-gpu-tools) is the closest analog to “top for Intel GPU.”

What it’s good for:

  • engine utilization (render/compute/copy engines, depending on GPU)
  • often, memory-bandwidth-related activity or residency hints (varies by generation)
  • listing “clients” (processes using the GPU) — note this is not always the precise per-process VRAM accounting that NVIDIA can provide

Run:

intel_gpu_top

If you’re trying to map memory issues to a specific process, this can still help identify “who is active” even if exact per-process memory isn’t exposed.

2) xpu-smi / Intel GPU SMI tools (datacenter/discrete focus)

On some systems, Intel provides an SMI-style utility for Intel GPUs (the name and features depend on your driver stack and distro).

When available, these tools can expose:

  • device memory usage (for discrete GPUs),
  • temperature, power,
  • engine utilization,
  • sometimes per-process accounting.

If xpu-smi is present, it’s often the quickest path to “how much device memory is in use?”

3) Intel VTune Profiler: system + GPU analysis

Intel VTune Profiler can analyze heterogeneous workloads and can help when you need:

  • CPU/GPU concurrency view,
  • offload overhead,
  • memory bandwidth pressure (system-level + sometimes device-level, depending on platform).

For memory-focused analysis, VTune is most useful to answer:

  • “Is the GPU waiting on memory or the CPU?”
  • “Does host-side memory bandwidth become the bottleneck for an integrated GPU?”

4) Intel Graphics Performance Analyzers (GPA)

Intel GPA is traditionally used for graphics workloads, but can still be relevant when:

  • the workload is graphics/compute mixed,
  • you need frame/dispatch-level inspection,
  • you’re using APIs that GPA supports well.

For pure compute/ML workloads, oneAPI + VTune/Level Zero tooling is often a better fit.

5) Level Zero / oneAPI low-level telemetry (advanced)

If you are using Intel’s oneAPI ecosystem, the runtime is often Level Zero under the hood.

At the conceptual level, Level Zero exposes device properties and management telemetry (e.g., memory properties). The exact CLI utilities vary, but this is the layer you would use if you’re building:

  • a custom metrics exporter,
  • per-job telemetry collection,
  • or deeper “what memory is available/used” integration.

6) Sysfs / DRM counters (last resort)

When high-level tools aren’t available, Linux exposes some GPU information via /sys/class/drm/ and related driver-specific paths. The exact files differ by GPU generation and driver.

This route is best for:

  • confirming device presence,
  • reading simple counters,
  • scripting quick checks in constrained environments.
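A minimal sketch of such a quick check — enumerating DRM card nodes. The per-device files underneath vary too much by driver and generation to list generically, but finding the cards is portable:

```python
import os

def list_drm_cards(base: str = "/sys/class/drm") -> list:
    """Return DRM card nodes (card0, card1, ...) found under `base`.

    Entries like card0-HDMI-A-1 are connectors, not devices, so keep
    only names that are exactly 'card' followed by digits.
    """
    if not os.path.isdir(base):
        return []
    return sorted(
        name for name in os.listdir(base)
        if name.startswith("card") and name[4:].isdigit()
    )

print(list_drm_cards())  # e.g. ['card0'] on a machine with one GPU
```

From each card node you can follow the `device` symlink to the PCI device and read whatever driver-specific counters your generation exposes.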

A practical workflow for “memory debugging”

Here is a vendor-agnostic flow that works well in practice.

Step 1: confirm it’s a device-memory problem

  • NVIDIA: nvidia-smi memory.used trends
  • Intel: xpu-smi (if present) or intel_gpu_top activity + system RAM trends (for iGPU)

If the GPU is integrated and shares RAM, you may need to track system memory and swap too.

Step 2: find the owner (process attribution)

  • NVIDIA: nvidia-smi --query-compute-apps=...
  • Intel: attribution can be harder; intel_gpu_top may at least show active clients

If you can’t attribute via GPU tooling, fall back to:

  • container/job metadata (scheduler),
  • application-level logging,
  • and framework allocators.

Step 3: explain the allocator behavior

This is where framework tools usually matter most:

  • PyTorch allocator reserved vs allocated
  • TensorFlow BFC allocator stats

Typical explanations for “why memory doesn’t go down”:

  • caching allocator keeps blocks for reuse,
  • fragmentation prevents returning large contiguous blocks,
  • multiple CUDA contexts / processes hold memory,
  • graph compilation artifacts remain resident.
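The fragmentation point is easy to see with a toy free-list model: enough free bytes in total does not mean any single request can be satisfied.

```python
def largest_contiguous(free_blocks) -> int:
    """Largest single allocation this pool could satisfy."""
    return max(free_blocks, default=0)

# Toy pool: 4 GiB free in total, but scattered as 1 GiB holes between
# live allocations, so a 2 GiB request still fails.
GiB = 1 << 30
free_blocks = [1 * GiB, 1 * GiB, 1 * GiB, 1 * GiB]
total_free = sum(free_blocks)
request = 2 * GiB

print(total_free >= request)                       # True: enough bytes overall
print(largest_contiguous(free_blocks) >= request)  # False: no block fits
```

This is why an out-of-memory error can coexist with a seemingly comfortable “free” number: the counters track bytes, not contiguity.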

Step 4: attribute memory traffic to kernels

  • NVIDIA: Nsight Compute for per-kernel memory throughput, cache hit rates
  • Intel: use VTune/GPA/oneAPI tooling depending on your stack

Tool cheat-sheet (memory-focused)

NVIDIA

  • Operational snapshot: nvidia-smi
  • Interactive monitor: nvtop
  • Programmatic metrics: NVML
  • Fleet monitoring: DCGM (+ exporter)
  • Timeline + API: Nsight Systems
  • Kernel counters (bandwidth/cache): Nsight Compute

Intel

  • Live activity: intel_gpu_top
  • SMI-style telemetry (if available): xpu-smi
  • CPU/GPU analysis: Intel VTune Profiler
  • Graphics-focused deep dive: Intel GPA
  • Advanced/custom telemetry: Level Zero / sysfs

What I can add next (if you want)

If you tell me your exact setup (NVIDIA model + driver/CUDA version, or Intel GPU model + driver stack), I can add a short “recommended commands” appendix tailored to your environment, including the exact metric fields to query.