
I have a small (testing) HDFS cluster which I use as snapshot backup space for Flink. Flink creates and deletes roughly 1000 (small) files per second. The namenode seems to handle this without problems at first, but over time the Number of Blocks Pending Deletion builds up until the file system is full. When I stop my Flink job (i.e. no further create/delete/… operations), the number of pending blocks only decreases by about 1.2e6 per hour.

What I'd like to know is: which part is responsible for this slowness? The namenode, the datanodes, or the journal nodes? Is this speed to be expected, or is there some configuration I can tune to make it orders of magnitude faster?
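For reference, the backlog shown as "Number of Blocks Pending Deletion" in the NameNode UI is exposed as the PendingDeletionBlocks metric on the NameNode's FSNamesystem JMX bean. Below is a minimal polling sketch; the host name is a placeholder, and the port assumes the Hadoop 3.x default of 9870 (Hadoop 2.x uses 50070).

# Minimal sketch: poll PendingDeletionBlocks from the NameNode's JMX JSON
# servlet. namenode.example.com is a placeholder; port 9870 assumes Hadoop 3.x
# defaults (Hadoop 2.x uses 50070).
import json
import time
import urllib.request

JMX_URL = ("http://namenode.example.com:9870/jmx"
           "?qry=Hadoop:service=NameNode,name=FSNamesystem")

def pending_deletion_blocks():
    with urllib.request.urlopen(JMX_URL) as resp:
        return json.load(resp)["beans"][0]["PendingDeletionBlocks"]

if __name__ == "__main__":
    prev = pending_deletion_blocks()
    while True:
        time.sleep(60)
        cur = pending_deletion_blocks()
        # A negative rate means the backlog is draining.
        print(f"pending={cur}  rate={(cur - prev) / 60:+.1f} blocks/s")
        prev = cur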

Caesar
  • Not really an answer to your question, hence just a comment. HDFS is not built for many small files created quickly. This sounds like an anti-pattern and should be avoided at all costs. Your files should ideally be at least around 64 MB in size. – tcurdt Nov 07 '19 at 12:21
  • Sadly, I have little control over the size of the files. They're coming from ~700 independently operating RocksDBs, and may need to be deleted individually. (Flink does allow globbing files smaller than 1 MB, and I'm doing that.) Either way, only being able to handle around 300 deletions per second on fast SSDs sounds like a misconfiguration, and I wonder where to start searching. – Caesar Nov 08 '19 at 02:37
  • Sounds like a backup scenario. Then you should come up with a good scheme and tar the files before ingestion. Other than that, all I can say is that this is not what HDFS is built for. People run compaction jobs to reduce the number of files in the cluster for a reason. – tcurdt Nov 08 '19 at 13:27
  • Let me stress this again: I cannot do anything to significantly reduce the rate of file deletions. (Unless you or I can propose and implement a significantly improved checkpointing scheme to Flink. Yeah…) – Caesar Nov 09 '19 at 02:50
  • I don't have enough information to help with that. But let *me* stress this again: the way you are describing the situation, you are using the wrong tool for the job. – tcurdt Nov 09 '19 at 03:07
  • You know what… I know that. The people who built Flink probably also know that. But now what? – Caesar Jan 07 '20 at 01:29
  • You know what ... do what other people do. Aggregate the data before you put them in HDFS or use something more suitable for your situation. Did you check with the Flink community if your approach is OK? Hard to help without knowing all the details. – tcurdt Jan 08 '20 at 09:52
  • I did. Many have similar problems, but my combination of cluster size, required checkpointing frequency, state descriptor (=column family) count, and CPU load over state size seems to be especially unfavourable. AWS' Flink-as-a-Service thingie has a special gateway that globs files and limits the write rate before putting state to S3. But it's far out of my reach to write something similar for HDFS. Meanwhile, I suspect that the deletion speed is artificially limited (because it's much slower than creation), and hence I'd like to know whether it's configurable. Is that unreasonable? – Caesar Jan 08 '20 at 12:19
  • Oh, and I forgot: globbing is tricky, because the files may be deleted individually, and may have wildly differing lifetimes. – Caesar Jan 08 '20 at 12:20
  • Get the funds and people to write this for you. That's my advice. Depending on the requirements it can be tricky, but it's not rocket science. We built something like that, too. Or use AWS' Flink-as-a-Service. Trying to solve this by tuning HDFS will not get you far (enough). Especially not in the long run. That's my experience. – tcurdt Jan 09 '20 at 14:07
  • That said - it does not sound like HDFS is even the right tool for that kind of data. But again - I would need to know more details to make a recommendation. – tcurdt Jan 09 '20 at 14:10
  • HDFS operation is annoying anyway and I'd like to move away from it eventually. Putting that level of investment into it… does not sound like good advice. Until I can get rid of HDFS, I'd like to know whether there's an easy fix, on the level of changing one configuration parameter. – Caesar Jan 10 '20 at 05:06
  • Well, it means your cluster cannot handle the amount of transactions and is falling behind. You could add nodes or try to improve the write speed. For the latter, send an email to the Hadoop users list at Apache. Check the replication options. But the bottom line is: I would rather work on the connector or a plan B than an unscalable and temporary workaround. Good luck! – tcurdt Jan 10 '20 at 22:14

1 Answer


I just suffered from this. You should change this parameter in hdfs-site.xml:

<property>
    <name>dfs.block.invalidate.limit</name>
    <value>50000</value>
</property>

The default value is 1000, which is too low for this kind of workload.
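As far as I understand the mechanism, this limit caps how many block invalidations the namenode hands to each datanode per heartbeat, and the heartbeat interval (dfs.heartbeat.interval) defaults to 3 seconds, so the cluster can retire roughly limit / 3 blocks per second per datanode. A rough back-of-the-envelope sketch; only the two defaults above come from the Hadoop configuration, the rest is illustration:

# Rough estimate of the block deletion throughput, assuming the namenode
# sends at most dfs.block.invalidate.limit invalidations to a datanode per
# heartbeat, and dfs.heartbeat.interval keeps its default of 3 seconds.
def max_deletions_per_hour(invalidate_limit, heartbeat_interval_s=3.0):
    per_second = invalidate_limit / heartbeat_interval_s  # per datanode
    return per_second * 3600

print(max_deletions_per_hour(1000))   # 1200000.0, matching the ~1.2e6/hour drain rate in the question
print(max_deletions_per_hour(50000))  # 60000000.0 with the raised limit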

ico001
  • I consider this to be enough context. Sadly, I had to hand over my experimentation cluster to somebody else, so I can't try it for a few weeks. I'll give you the green checkmark as soon as I've tried it out. – Caesar Jul 29 '21 at 09:16
  • The namenode is responsible for invalidating blocks: datanodes report their block info to the namenode, and the namenode then tells the datanodes which blocks to invalidate. – ico001 Jul 30 '21 at 02:08
  • I finally got around to verifying this and yes: it helps, a lot. Thank you. – Caesar Jun 09 '22 at 08:10