
I ran a nodetool repair command on one node. The node went down, and its log files showed the following error message:

INFO  [STREAM-IN-/192.168.2.100] 2015-02-13 21:36:23,077 StreamResultFuture.java:180 - [Stream #8fb54551-b3bd-11e4-9620-4b92877f0505] Session with /192.168.2.100 is complete
INFO  [STREAM-IN-/192.168.2.100] 2015-02-13 21:36:23,078 StreamResultFuture.java:212 - [Stream #8fb54551-b3bd-11e4-9620-4b92877f0505] All sessions completed
INFO  [STREAM-IN-/192.168.2.100] 2015-02-13 21:36:23,078 StreamingRepairTask.java:96 - [repair #508bd650-b3bd-11e4-9620-4b92877f0505] streaming task succeed, returning response to node4/192.168.2.104
INFO  [AntiEntropyStage:1] 2015-02-13 21:38:52,795 RepairSession.java:237 - [repair #508bd650-b3bd-11e4-9620-4b92877f0505] repcode is fully synced
INFO  [AntiEntropySessions:27] 2015-02-13 21:38:52,795 RepairSession.java:299 - [repair #508bd650-b3bd-11e4-9620-4b92877f0505] session completed successfully
INFO  [AntiEntropySessions:27] 2015-02-13 21:38:52,795 RepairSession.java:260 - [repair #03858e40-b3be-11e4-9620-4b92877f0505] new session: will sync node4/192.168.2.104, /192.168.2.100, /192.168.2.101 on range (8805399388216156805,8848902871518111273] for data.[repcode]
INFO  [AntiEntropySessions:27] 2015-02-13 21:38:52,795 RepairJob.java:145 - [repair #03858e40-b3be-11e4-9620-4b92877f0505] requesting merkle trees for repcode (to [/192.168.2.100, /192.168.2.101, node4/192.168.2.104])
WARN  [StreamReceiveTask:74] 2015-02-13 21:41:58,544 CLibrary.java:231 - open(/user/jlor/apache-cassandra/data/data/data/repcode-398f26f0b11511e49faf195596ed1fd9, O_RDONLY) failed, errno (23).
WARN  [STREAM-IN-/192.168.2.101] 2015-02-13 21:41:58,672 CLibrary.java:231 - open(/user/jlor/apache-cassandra/data/data/data/repcode-398f26f0b11511e49faf195596ed1fd9, O_RDONLY) failed, errno (23).
WARN  [STREAM-IN-/192.168.2.101] 2015-02-13 21:41:58,871 CLibrary.java:231 - open(/user/jlor/apache-cassandra/data/data/data/repcode-398f26f0b11511e49faf195596ed1fd9, O_RDONLY) failed, errno (23).
ERROR [StreamReceiveTask:74] 2015-02-13 21:41:58,986 CassandraDaemon.java:153 - Exception in thread Thread[StreamReceiveTask:74,5,main]
org.apache.cassandra.io.FSWriteError: java.io.FileNotFoundException: /user/jlor/apache-cassandra/data/data/data/repcode-398f26f0b11511e49faf195596ed1fd9/data-repcode-tmp-ka-245139-TOC.txt (Too many open files in system)
        at org.apache.cassandra.io.sstable.SSTable.appendTOC(SSTable.java:282) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.io.sstable.SSTableWriter.close(SSTableWriter.java:483) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:434) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:429) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.io.sstable.SSTableWriter.closeAndOpenReader(SSTableWriter.java:424) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at org.apache.cassandra.streaming.StreamReceiveTask$OnCompletionRunnable.run(StreamReceiveTask.java:120) ~[apache-cassandra-2.1.2.jar:2.1.2]
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_31]
        at java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_31]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) ~[na:1.8.0_31]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [na:1.8.0_31]
        at java.lang.Thread.run(Thread.java:745) [na:1.8.0_31]
Caused by: java.io.FileNotFoundException: /usr/jlo/apache-cassandra/data/data/data/repcode-398f26f0b11511e49faf195596ed1fd9/data-repcode-tmp-ka-245139-TOC.txt (Too many open files in system)
        at java.io.FileOutputStream.open(Native Method) ~[na:1.8.0_31]
        at java.io.FileOutputStream.<init>(FileOutputStream.java:213) ~[na:1.8.0_31]
        at java.io.FileWriter.<init>(FileWriter.java:107) ~[na:1.8.0_31]
        at org.apache.cassandra.io.sstable.SSTable.appendTOC(SSTable.java:276) ~[apache-cassandra-2.1.2.jar:2.1.2]
        ... 10 common frames omitted
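
For context when reading this log: on Linux, errno 23 is ENFILE, i.e. the system-wide "too many open files in system" condition rather than the per-process one. A minimal sketch of how the kernel-wide handle usage can be checked on a node (assuming standard /proc and sysctl):

cat /proc/sys/fs/file-nr    # allocated handles, free handles, and the system-wide maximum
sysctl fs.file-max          # the ceiling itself; it can be raised with: sysctl -w fs.file-max=<value>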

We have a small cluster of 5 nodes (node0-node4). I have one table with about 3.4 billion rows and a replication factor of 3. Here is the table description:

CREATE TABLE data.repcode (
    rep int,
    type text,
    code text,
    yyyymm int,
    trd int,
    eq map<text, bigint>,
    iq map<text, bigint>,
    PRIMARY KEY ((rep, type, code), yyyymm, trd))
WITH CLUSTERING ORDER BY (yyyymm ASC, trd ASC)
AND bloom_filter_fp_chance = 0.1
AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
AND comment = ''
AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 'max_threshold': '32'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';

I'm using Cassandra 2.1.2. I have set the max open files limit to 200'000 on all my nodes.
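
To double-check that the 200'000 limit is really what the running Cassandra process gets (a login shell's ulimit can differ from the daemon's), something like the following can be used; matching on the CassandraDaemon main class with pgrep -f is an assumption that may need adjusting for your install:

# Pid of the Cassandra JVM via its main class name (assumes one instance per node)
CASS_PID=$(pgrep -f CassandraDaemon | head -n 1)
# The limit the kernel actually applied to that process
grep "Max open files" /proc/${CASS_PID}/limits
# File descriptors the process currently holds
ls /proc/${CASS_PID}/fd | wc -l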

Before I issued the nodetool repair command, I had counted the files in my data directories (a sketch of the counting command follows the two lists below). Here is the count on each of my nodes before the crash:

node0: 27'099 
node1: 27'187 
node2: 36'131 
node3: 26'635 
node4: 26'371 

Now after the crash:

node0:   946'555 
node1:   973'531 
node2:   844'211 
node3: 1'024'147 
node4: 1'971'772 
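
For reference, these numbers are plain file counts under the data directory; a sketch of the kind of command that produces them (the path is taken from the log above, adjust for your layout):

# Total files under the data directory (path as seen in the log; adjust per node)
find /user/jlor/apache-cassandra/data/data -type f | wc -l
# Per-table breakdown to see which directory grew
for d in /user/jlor/apache-cassandra/data/data/data/*/; do echo -n "$d: "; find "$d" -type f | wc -l; done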

Is it normal for the number of files in a single Unix directory to grow to this extent? What can I do to avoid this problem in the future? Should I increase the open files limit even further? It already seems very high to me. Is my cluster too small for this number of records? Should I use another compaction strategy?

Thanks for your help.

BuckBazooka
  • What kind of files are these numbers referring to? Are those all regular sstables or snapshots? Do they match the number of SSTables reported by `nodetool cfstats`? – Stefan Podkowinski Feb 16 '15 at 12:49

2 Answers


What is the output of `ulimit -a | grep "open files"`?

The recommended resource limits (ulimit) for Cassandra should be set as follows (this example is for RHEL 6):

cassandra - memlock unlimited
cassandra - nofile 100000
cassandra - nproc 32768
cassandra - as unlimited

The exact file and username will differ based on your install type and which user you run Cassandra as. The above assumes the lines go in /etc/security/limits.d/cassandra.conf for a packaged install, running Cassandra as the "cassandra" user (for a tarball install you'll want /etc/security/limits.conf).

If your setup differs from that, check the DataStax recommended-settings documentation (see the link in the comment below). Note that if you run Cassandra as the root user, some distros require the limits to be set explicitly for root.
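
If it helps, here is a sketch of applying and verifying these settings, assuming a packaged install on RHEL 6 running Cassandra as the "cassandra" user (file name as described above; check the DataStax docs for your own setup):

# Write the recommended limits for the cassandra user (assumed location for a packaged install)
cat <<'EOF' | sudo tee /etc/security/limits.d/cassandra.conf
cassandra - memlock unlimited
cassandra - nofile 100000
cassandra - nproc 32768
cassandra - as unlimited
EOF
# Verify what a fresh session for that user would get (should print 100000)
sudo su - cassandra -s /bin/bash -c 'ulimit -n'
# The new limits only apply to processes started afterwards, so restart Cassandra,
# e.g. with: sudo service cassandra restart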

Edit 20180330

Note that the above /etc/security/limits.conf adjustment works for CentOS/RHEL 6 systems. Otherwise the adjustment should be made in /etc/security/limits.d/cassandra.conf.

Aaron
  • Slightly newer link related to resource limits: https://docs.datastax.com/en/landing_page/doc/landing_page/recommendedSettings.html (pay careful attention to the information regarding RHEL 6 vs RHEL 7). – dustmachine Mar 30 '18 at 15:38
  • Good call, @dustmachine ! Edit made. – Aaron Mar 30 '18 at 16:04

The number of open files does not need to be tightly limited for Cassandra installations; it could be 10,000,000 with no problem at all. The real issue is that too many open files means too many SSTables, which leads to very long restart times. To prevent this, run nodetool compact on all nodes where the number of open files significantly exceeds its initial value. Run the following as the Cassandra software owner user to track the number of open files during the repair:

for i in {1..1000}; do echo -n "Open files, sample $i: "; lsof -p $(ps -ef | grep "/var/log/cassandra/gc.log" | grep -v grep | awk '{print $2}') | wc -l; sleep 500; done
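
As a complement to the loop above, the descriptor count can be compared with the SSTable count Cassandra itself reports; a small sketch using the keyspace and table names from the question (the pgrep pattern for finding the JVM is an assumption):

# SSTable count as reported by Cassandra for the table in question
nodetool cfstats data.repcode | grep -i "SSTable count"
# Open file descriptors currently held by the Cassandra JVM
lsof -p $(pgrep -f CassandraDaemon | head -n 1) | wc -l
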
Yuri Levinsky
  • Assuming you mean `nodetool compact`. I wouldn't do that, as this will likely prevent compaction from ever running again, putting you right back in the same situation. – Aaron Sep 02 '20 at 16:36
  • It depends on the number of CPUs each node has and the number of compaction processes allowed in Cassandra. I decreased the number of open files on each node from 1.2M to ~20k. The cluster is 15 nodes / 10 TB, version 3.6. So it doesn't need to go to 100k anyway. – Yuri Levinsky Sep 07 '20 at 10:55