
We have a Hadoop cluster with 265 RHEL Linux machines.

Of those 265 machines, 230 are data nodes running HDFS.

Each data node has 128 GB of RAM, and we run many Spark applications on these machines.

Last month we added more Spark applications, so the processes now take more memory on the data nodes.

We noticed that cache memory is a very important part of this, and with more processes running on the machines, the right conclusion is to add more RAM.

Since we can't upgrade the memory to 256 GB within the next 5-6 months, we are thinking about how to improve the performance of the RHEL machines and their memory cache as much as possible.

From our experience, memory cache is very important for application stability.

One option is to clear the RAM cache and buffers as follows:

1. Clear PageCache only.

# sync; echo 1 > /proc/sys/vm/drop_caches
2. Clear dentries and inodes.

# sync; echo 2 > /proc/sys/vm/drop_caches
3. Clear PageCache, dentries and inodes.

# sync; echo 3 > /proc/sys/vm/drop_caches 

and run them from cron as follows (from https://www.wissenschaft.com.ng/blog/how-to-clear-ram-memory-cache-buffer-and-swap-space-on-linux/):

#!/bin/bash
# Note: we are using "echo 3", but it is not recommended in production; use "echo 1" instead.
# Flush dirty pages to disk first, then write the value to drop_caches (must run as root).
sync; echo 3 > /proc/sys/vm/drop_caches
Set execute permission on the clearcache.sh file:

# chmod 755 clearcache.sh
You can now run the script whenever you need to clear the RAM cache.

Now set a cron job to clear the RAM cache every day at 2 AM. Open the crontab for editing:

# crontab -e
Append the line below, then save and exit, to run it daily at 2 AM:

0  2  *  *  *  /path/to/clearcache.sh

But since these are production data nodes, I am not sure the settings above are safe, or whether they provide any real benefit until we can increase the memory from 128 GB to 256 GB.

Can we get your thoughts on what I wrote?

And is clearing the RAM cache the right temporary solution until the memory upgrade?

King David

1 Answer


Do not do this at all.

There is no point in touching drop_caches on production workloads. On Linux, file cache is among the first things the kernel reclaims, automatically, when memory is needed. Dropping it by hand would likely throw out data held in fast DRAM and force more reads from slower storage.
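To see this on a data node, a minimal sketch like the following reads /proc/meminfo (values are in kB) and compares MemFree with MemAvailable. MemAvailable is the kernel's estimate of memory that new workloads can use without swapping, and it already counts reclaimable page cache and slab, so the gap between the two is roughly the cache the kernel will hand back on its own. The helper names `meminfo_kb` and `to_mib` are hypothetical, and this assumes a kernel that reports MemAvailable (3.14 and later; RHEL 7 and later kernels include it).

```shell
#!/usr/bin/env bash
# Sketch: show how much RAM is cache vs. how much the kernel considers available.
# Reads only /proc/meminfo, so it needs no root and changes nothing.
set -euo pipefail

# Hypothetical helper: print the kB value of one /proc/meminfo field, or 0 if absent.
meminfo_kb() {
  awk -v key="$1:" '$1 == key { print $2; found = 1; exit }
                    END { if (!found) print 0 }' /proc/meminfo
}

# Hypothetical helper: convert kB to MiB (integer division).
to_mib() { echo $(( $1 / 1024 )); }

total_kb=$(meminfo_kb MemTotal)
free_kb=$(meminfo_kb MemFree)
avail_kb=$(meminfo_kb MemAvailable)
cached_kb=$(meminfo_kb Cached)
buffers_kb=$(meminfo_kb Buffers)
sreclaim_kb=$(meminfo_kb SReclaimable)

# MemFree is RAM nobody is using; MemAvailable also counts cache and reclaimable
# slab that the kernel can free by itself when an application asks for memory.
printf '%-14s %8s MiB\n' \
  MemTotal     "$(to_mib "$total_kb")" \
  MemFree      "$(to_mib "$free_kb")" \
  MemAvailable "$(to_mib "$avail_kb")" \
  Cached       "$(to_mib "$cached_kb")" \
  Buffers      "$(to_mib "$buffers_kb")" \
  SReclaimable "$(to_mib "$sreclaim_kb")"

if (( total_kb > 0 )); then
  echo "Available as % of total: $(( avail_kb * 100 / total_kb ))%"
fi
```

A large Cached value next to a healthy MemAvailable means the cache is doing its job; a low MemAvailable is the number that actually signals a memory shortage.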

Get good monitoring tools in place that collect memory pressure stall information (PSI). PSI quantifies the time tasks lose waiting on memory when it is over-utilized, and it has been used effectively to size fleets of hosts. Use it to inform your own capacity planning: find where the limit for a safe workload is, ahead of the memory upgrade.
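As a starting point, a minimal sketch that reads memory PSI from /proc/pressure/memory and prints the 10 s, 60 s and 300 s averages. "some" is the percentage of time at least one task was stalled on memory; "full" is the percentage of time all non-idle tasks were stalled at once. This assumes a kernel with PSI (mainline 4.20 or later, or a distribution backport; RHEL 8 kernels carry it, though it may need the `psi=1` boot parameter, and RHEL 7 kernels do not have it). The function `psi_parse`, the variables `PSI_FILE` and `SOME_AVG60_WARN`, and the 10% threshold are all hypothetical choices for illustration.

```shell
#!/usr/bin/env bash
# Sketch: report memory pressure stall information (PSI) and flag high "some" pressure.
set -euo pipefail

PSI_FILE=${PSI_FILE:-/proc/pressure/memory}
# Hypothetical alert threshold in percent of wall time stalled; tune from your own data.
SOME_AVG60_WARN=${SOME_AVG60_WARN:-10}

# Hypothetical helper: print "<kind> <avg10> <avg60> <avg300>" for each line of a PSI file,
# where lines look like: some avg10=0.00 avg60=0.00 avg300=0.00 total=0
psi_parse() {
  awk '{
    kind = $1; a10 = a60 = a300 = ""
    for (i = 2; i <= NF; i++) {
      split($i, kv, "=")
      if (kv[1] == "avg10")  a10  = kv[2]
      if (kv[1] == "avg60")  a60  = kv[2]
      if (kv[1] == "avg300") a300 = kv[2]
    }
    print kind, a10, a60, a300
  }' "$1"
}

if [[ -r $PSI_FILE ]] && psi_out=$(psi_parse "$PSI_FILE" 2>/dev/null); then
  while read -r kind a10 a60 a300; do
    printf '%-4s avg10=%s%% avg60=%s%% avg300=%s%%\n' "$kind" "$a10" "$a60" "$a300"
    # Bash arithmetic is integer-only, so let awk do the floating-point comparison.
    if [[ $kind == some ]] && awk -v v="$a60" -v t="$SOME_AVG60_WARN" 'BEGIN { exit !(v > t) }'; then
      echo "WARNING: memory 'some' pressure avg60=${a60}% exceeds ${SOME_AVG60_WARN}%"
    fi
  done <<< "$psi_out"
else
  echo "PSI not available at $PSI_FILE (kernel without PSI support, or PSI disabled)" >&2
fi
```

A one-off run only shows the current state; sampling this every minute into your monitoring system and watching how "some" and "full" move as Spark jobs are added is what makes it useful for finding the safe limit.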

John Mahowald