When 'df' lies, 'du' swears it’s innocent and Loki eats your disk: a forensic walk-through

Some months ago, our production cluster started paging us with the classic "Volume /var/loki ≥ 90 %" alert.
A quick df -h /var/loki showed the volume completely full, yet du -hs /var/loki insisted only a few hundred MiB were actually there. The only thing that reliably fixed the situation was bouncing the Loki pods – clearly not a sustainable strategy.
Below is the investigation path, the aha-moment and the permanent fix. I am documenting every step because (a) searching with Kagi turns up a lot of “my df and du disagree” posts with no Loki-specific closure, and (b) the root cause is still, in my opinion, a Loki bug that deserves more visibility in 2025.
Why df and du sometimes disagree
• df: asks the filesystem how many blocks are allocated – even blocks pointed to by a file that has already been unlinked (deleted).
• du: walks the directory tree and sums the space of every path it can see. A file that has been deleted is not part of the tree anymore, so du is blissfully unaware of its existence.
If a process deletes a file after opening it, the directory entry disappears, but the inode stays alive for as long as at least one file descriptor keeps it open. The space is therefore counted by df but not by du.
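You can reproduce the mismatch in isolation with a throwaway file – a minimal sketch, where the path and size are arbitrary and /tmp is assumed to be a filesystem you can watch:
# create a 100 MiB file and keep it open with a background reader
dd if=/dev/zero of=/tmp/ghost.bin bs=1M count=100
tail -f /tmp/ghost.bin > /dev/null &
rm /tmp/ghost.bin       # directory entry gone, inode still referenced
df -h /tmp              # the 100 MiB are still counted as used
du -sh /tmp             # …but du no longer sees them
lsof +L1 | grep ghost   # the "(deleted)" entry gives the game away
kill %1                 # closing the last fd finally frees the blocks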
Loki 101 – where all that I/O comes from
A single-binary Loki instance uses local disk mainly for three things:
- WAL (/var/loki/wal/) – write-ahead log used by the ingester. Rotated segments should be truncated/removed by the ingester once their data is safely in the object store.
- TSDB blocks (/var/loki/tsdb-shipper-active/…) – fresh blocks that have not yet been shipped to the object store.
- Compactor scratch & deletion requests (/var/loki/data/retention/…) – temp area for compactions and the new retention API.
If something prevents Loki from successfully shipping or compacting, it will happily delete the local file and keep the handle open. Voilà: invisible disk usage.
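A quick way to see which of these areas du can still account for – a sketch, assuming the default /var/loki data directory from above:
# what the directory tree shows…
du -sh /var/loki/wal /var/loki/tsdb-shipper-active /var/loki/data 2>/dev/null
# …versus what the filesystem says is allocated
df -h /var/loki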
The smoking gun – hunting for open but deleted files
# inside an almost full Loki container
lsof +L1 | head
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NLINK NODE NAME
/usr/bin/l 10 loki .. REG 0,34 89.3M 0 39299311 /var/loki/wal/00061847 (deleted)
/usr/bin/l 42 loki .. REG 0,34 68.0K 0 39491756 /var/loki/tsdb-shipper-active/wal/s3_2024-09-24/…/00000000 (deleted)
/usr/bin/l 54 loki .. REG 0,34 64.0K 0 39492350 /var/loki/tsdb-shipper-active/multitenant/index_20041/… (deleted)
Bingo – several hundred megabytes worth of WAL and shipper files are “deleted”, yet Loki is still holding them open.
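To put a number on it per process, something like this sums the SIZE/OFF column for every "(deleted)" entry – a sketch that assumes the default lsof column layout (SIZE/OFF is field 7 when +L1 adds the NLINK column) and plain byte values:
lsof +L1 2>/dev/null \
  | awk '/\(deleted\)/ { held[$1" "$2] += $7 } END { for (p in held) printf "%-22s %8.1f MiB\n", p, held[p]/1048576 }'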
Why only production?
Dev and prod run the same Helm values and version. The difference is therefore not in the code or the config; it must be in the state – i.e. the object store.
Things became interesting when I noticed two sets of WAL/shipper paths:
s3_2024-04-16/…
s3_2024-09-24/…
Those dates match two schema entries in schema_config:
configs:
  - from: "2024-04-16"
    store: tsdb
    schema: v13
  - from: "2024-09-24"
    store: tsdb
    schema: v13
In prod we had earlier experimented with a different store (boltdb-shipper) and later migrated to pure tsdb. Some legacy boltdb metadata and half-migrated blocks were still sitting in the bucket. The compactor/ingester logic was tripping over them, failing to mark blocks as uploaded, and therefore never truncating the associated WAL segments. Every restart released the file handles, which explained why "reboot fixes it".
The fix
- Stop the world – scale Loki (and promtail/agents) down to zero to avoid new writes (this and the backup step are sketched after this list).
- Backup – copy the entire bucket to a temporary location (versioning or S3 object-lock makes this easy).
- Remove stale prefixes – in our case anything below index_* / chunks_* that still referenced the old schema or boltdb layout.
- Start Loki – with an empty local disk the ingesters rebuilt state from the cleaned bucket, immediately started truncating WAL, and no ghost files appeared.
- Observe – 72 h later df and du still match; no manual restarts necessary.
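For the first two steps, a minimal sketch assuming a Helm-deployed single-binary Loki running as a StatefulSet – release name, namespace and bucket names are placeholders:
# 1. stop new writes; promtail is a DaemonSet, so pause it separately
#    (e.g. with a temporary nodeSelector patch) or accept a short ingestion gap
kubectl -n logging scale statefulset/loki --replicas=0
# 2. copy the whole bucket aside before touching anything
aws s3 sync s3://loki-boltdb s3://loki-boltdb-backup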
A two-liner if you are absolutely sure about the prefixes (aws s3 rm only accepts one S3 path per call, so run it once per stale prefix):
aws s3 rm --recursive s3://loki-boltdb/boltdb-shipper
aws s3 rm --recursive s3://loki-boltdb/index_*/old_schema_prefix   # …and so on for each stale prefix
(Use --exclude/--include surgically, or your S3 bill will punish you.)
Why I still think Loki is at fault
- A leftover, read-only object in the bucket should not block WAL truncation locally.
- Loki never logged an explicit error (“cannot ship block because schema mismatch”).
- There is no watchdog that GC’s deleted-but-open files after a grace period.
I initially opened a post on the forum, but in the end filed a GitHub issue (Grafana/loki #14914) with a minimal reproducer. Hopefully the compactor/ingester interaction will be hardened in a future release.
Hard-learned lessons & preventive tips
• Always watch inode usage as well as bytes – df -ih will show you leaking file handles long before the volume fills.
• lsof +L1 is your friend; wrap it in a periodic cronjob that pages you when > N MiB are in "(deleted)" state (see the sketch after this list).
• Keep your schema_config history tidy; delete obsolete store prefixes right after migrations.
• Consider setting ingester.wal_cleanup_duration = 24h (3.3+ only) to force WAL truncation even when shipping is stuck.
• Prefer dedicated buckets per env; "prod accidentally keeps dev's garbage alive" is a real thing.
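A minimal sketch of such a check – the threshold, the log tag and the alerting mechanism are placeholders, so wire the last line into whatever actually pages you:
#!/usr/bin/env bash
# warn when deleted-but-open files hold more than THRESHOLD_MIB on this node
THRESHOLD_MIB=512
held=$(lsof +L1 2>/dev/null | awk '/\(deleted\)/ { s += $7 } END { printf "%d", s / 1048576 }')
if [ "${held:-0}" -gt "$THRESHOLD_MIB" ]; then
  echo "WARNING: ${held} MiB held by deleted-but-open files" | logger -t ghost-files
fi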
Wrapping up
df told me the disk was full, du said "not my problem", and Loki was quietly holding on to ghost WAL segments referencing ancient schema blocks in S3. Deleting the stale data immediately stopped the leak, but the episode exposed an edge case in Loki's shipper/compactor path that still needs fixing upstream.
Hope this saves you a few evenings of head-scratching. If you hit the same symptom but your bucket looks clean, drop me a comment – I am genuinely curious about other root causes.
Happy logging, and may your df and du always agree!