I'm testing a Kafka producer application and noticed something strange about the disk usage of the Kafka logs. When looking at the total size of a certain partition's log directory, while the application is writing to Kafka, I see this:

$ ls -l --block-size=kB kafka-logs/mytopic-0
total 52311kB
-rw-rw-r-- 1 app-data app-data 10486kB Oct 29 12:45 00000000000000000000.index
-rw-rw-r-- 1 app-data app-data 46505kB Oct 29 12:45 00000000000000000000.log
-rw-rw-r-- 1 app-data app-data 10486kB Oct 29 12:45 00000000000000000000.timeindex
-rw-rw-r-- 1 app-data app-data     1kB Oct 29 11:55 leader-epoch-checkpoint

Then I stop my application, and a few minutes later I repeat the above command, and get this:

$ ls -l --block-size=kB kafka-logs/mytopic-0
total 46519kB
-rw-rw-r-- 1 app-data app-data 10486kB Oct 29 12:45 00000000000000000000.index
-rw-rw-r-- 1 app-data app-data 46505kB Oct 29 12:45 00000000000000000000.log
-rw-rw-r-- 1 app-data app-data 10486kB Oct 29 12:45 00000000000000000000.timeindex
-rw-rw-r-- 1 app-data app-data     1kB Oct 29 11:55 leader-epoch-checkpoint

Questions: Why does the ls total figure not represent the sum of sizes of all the files in that directory? Why does the total decrease a few minutes after stopping the producer application, even though all the files in the directory remain the same size?

Klitos Kyriacou

1 Answer

The files might have holes (that is, they may be sparse). Can you run the following command?

du --apparent-size *
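
To see the effect in isolation, here is a minimal Python sketch that creates a file with a hole and compares its apparent size (what `ls -l` prints per file) with the space actually allocated on disk (what the `ls` "total" line and plain `du` count). The file name `demo.index` and the helper names are made up for illustration, and the 10485760-byte length is chosen only because it matches the 10486kB index files in the question. It assumes a Unix-like system where `st_blocks` is reported in 512-byte units; whether the hole stays unallocated depends on the filesystem.

```python
import os
import tempfile


def make_sparse_file(path, length):
    """Create a file of the given length without writing any data to it."""
    with open(path, "wb") as f:
        # Extending a file with truncate() leaves a hole on filesystems
        # that support sparse files; no data blocks need to be allocated.
        f.truncate(length)


def sizes(path):
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    # st_size is the apparent size; st_blocks counts 512-byte units
    # actually allocated (Unix only).
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.index")  # hypothetical file name
    make_sparse_file(demo_path, 10 * 1024 * 1024)
    apparent, allocated = sizes(demo_path)

print("apparent size :", apparent, "bytes")
print("allocated size:", allocated, "bytes")
```

On a filesystem with sparse-file support the allocated figure should come out far smaller than the apparent one, which is the same kind of gap as between the per-file sizes and the `total` line in the listings above.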
Philippe
  • Yes, the output of `du --apparent-size` is the sum of the individual file sizes. The manpage says "although the apparent size is usually smaller, it may be larger due to holes in ('sparse') files, internal fragmentation, indirect blocks, and the like." – Klitos Kyriacou Oct 29 '20 at 15:39
  • Thanks for pointing me in the right direction. Your answer has led me to some more searching that told me that Kafka index files are sparse memory-mapped files. So that would explain it. What I'm still not sure about is why the disk usage goes down even further once I stop producing data into the Kafka broker. – Klitos Kyriacou Oct 29 '20 at 16:03
  • @KlitosKyriacou - you should **accept** the answer, then ... – tink Oct 29 '20 at 17:07