
We run SolrCloud 4.7.1 with 3 collections on 8 servers. Each collection is split into 4 shards, with 4 servers holding a primary replica of each collection and 4 different servers holding the other replicas. Last week the servers holding the replicas for shard 2 of one of the collections started behaving strangely. Files were being written to one of the collections until they filled the partition. When the partition hit 100%, the files were deleted and the collection returned to its usual size, but then the process started again. This would go on for a few hours and then stop for a few hours. The issue occurred from Wednesday into Thursday afternoon, then stopped from Thursday until early Monday morning.

In the directory holding the replica's files I see a single file growing to fill the drive's capacity: ????.nvd. From my reading this is a norms file. I see that in the schema.xml file for this collection omitNorms is set to true.

Nothing else is standing out in the logs and my searches are striking out. Any thoughts please?

1 Answer


Tried to comment, but couldn't....

Please provide the size of the shard and total disk size. Are all replicas for the culprit shard marked as green in cloud view? What does the log file say? Are other shards exhibiting this behavior as well?
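One way to collect those numbers repeatedly while the problem is happening is a small script like the one below. This is only a sketch: `index_report` is a hypothetical helper name, and the directory you pass in is whatever path holds the replica's index files on your servers.

```python
import os
import shutil


def index_report(index_dir, top_n=5):
    """Return partition usage and the largest files under index_dir."""
    usage = shutil.disk_usage(index_dir)
    files = []
    for root, _dirs, names in os.walk(index_dir):
        for name in names:
            path = os.path.join(root, name)
            try:
                files.append((os.path.getsize(path), path))
            except OSError:
                # A file may be deleted between listing and stat; skip it.
                continue
    files.sort(reverse=True)  # largest first
    return {
        "partition_total": usage.total,
        "partition_free": usage.free,
        "index_bytes": sum(size for size, _ in files),
        "largest": files[:top_n],
    }
```

Running it every minute or so and saving the output with a timestamp should show which file is growing and let you line the growth up against the Solr log.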

I've seen Solr try to duplicate the index in preparation for recovery or something else... (I never know what it's doing up there)....

Also, have you tried to restart the replica?

EDIT: Also, did you click the Optimize button?

nick_v1
  • The shard is about 8.5 GB. The partition size is 20GB with 70% free space when the problem is not occurring. In the cloud view both the replicas for the culprit shard are green (active). None of the other three shards are showing this problem. I have tried restarting the replica and that has not solved the problem. I have not clicked the optimize button and to the best of my knowledge I don't think anyone else has either. The only thing I see in the log file is when the partition fills up and it reports out of disk space. – user3884624 Jul 29 '14 at 16:52
  • I stand corrected, the shard size is 5.5 GB, not 8.5 GB. – user3884624 Jul 29 '14 at 17:01
  • What does the log file say? You may need to increase the log level? – nick_v1 Jul 29 '14 at 17:40
  • If I increase logging to DEBUG I see the flurry of requests to the server. But the only thing that seems to get logged regarding this problem is when the partition fills to 100%, and then the error is: – user3884624 Aug 01 '14 at 15:28
  • org.apache.solr.common.SolrException: Error logging add..... Caused by: java.io.IOException: No space left on device – user3884624 Aug 01 '14 at 15:47
  • Interestingly, when the partition fills up the error is not always thrown. I suspect the error only occurs if the partition fills up and a change is made to the collection at the same time. I have set the logging to DEBUG and watched the logs up until the partition reached 100% full and then dropped back to 35%, and the error was not thrown. – user3884624 Aug 01 '14 at 15:48
  • We had a similar problem recently where two shards kept trying to recover, but couldn't. The symptom was that they were constantly duplicating index files on disk. It turned out to be a zookeeper connectivity issue. We had to stop the replica, fix zookeeper, clean up the filesystem and let it recover manually. – nick_v1 Aug 04 '14 at 16:13
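
To rule out the ZooKeeper connectivity issue mentioned in the last comment, a quick probe from each Solr server can help. The sketch below sends ZooKeeper's `ruok` four-letter command and expects `imok` back. The function name `zk_ruok` is hypothetical, port 2181 is only the usual default, and newer ZooKeeper releases may require four-letter commands to be explicitly enabled, so treat a `False` result as a prompt to investigate and not as proof of an outage.

```python
import socket


def zk_ruok(host, port=2181, timeout=3.0):
    """Send 'ruok' to a ZooKeeper server; True only if it answers 'imok'."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(b"ruok")
            chunks = []
            while True:
                data = sock.recv(64)
                if not data:
                    # Server closed the connection after replying.
                    break
                chunks.append(data)
            return b"".join(chunks) == b"imok"
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

Calling it for every host in the ensemble from the servers that hold the shard 2 replicas would show whether those machines in particular have trouble reaching ZooKeeper.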