I created a ~3TB binary file (located on an AWS EBS volume) intended to store an MxN matrix of doubles representing uniform financial time series across multiple days. There are M=37932 different time series, each with N=10415118 elements (37932 * 10415118 * 8 bytes ≈ 3.16 TB).
I have a C++ program that reads in the financial market data for a specific date, creates M file pointers pointing to the appropriate starting locations within that binary file, and then writes the desired time-series data at each file pointer's position as it processes the market data.
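Roughly, the per-process setup looks like the sketch below. This is simplified and illustrative only: the file path, the starting column for the date (date_col), and the exact offset formula are placeholders, not my actual code.

    #include <cstdio>
    #include <cstdint>
    #include <vector>

    // Simplified sketch: one FILE* per time series, each positioned at that
    // series' starting element for the current date inside the M x N matrix.
    int main() {
        const std::int64_t M = 37932;           // number of time series (rows)
        const std::int64_t N = 10415118;        // elements per series (columns)
        const std::int64_t date_col = 0;        // starting column for this date (hypothetical)
        const char* path = "/data/matrix.bin";  // hypothetical path to the ~3TB file

        std::vector<FILE*> fps(M, nullptr);
        for (std::int64_t row = 0; row < M; ++row) {
            FILE* fp = std::fopen(path, "rb+");
            if (!fp) { std::perror("fopen"); return 1; }
            // 64-bit seek to this series' starting element for the date
            off_t offset = (row * N + date_col) * static_cast<off_t>(sizeof(double));
            if (fseeko(fp, offset, SEEK_SET) != 0) { std::perror("fseeko"); return 1; }
            fps[row] = fp;
        }

        // ... while processing the market data, doubles are written at the
        // current position of the appropriate file pointer, e.g.:
        double value = 101.25;
        std::fwrite(&value, sizeof(double), 1, fps[0]);

        for (FILE* fp : fps) std::fclose(fp);
        return 0;
    }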
I am using a 72-core AWS EC2 instance running Ubuntu 16.04, and I run 54 of these processes in parallel at a time (with a total of several hundred dates to go through overall). So in total, about 54*37932=2048328 file pointers are open at once on the system.
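I'm not sure whether the open-file limits are related to the hangs, but for reference, this is a minimal sketch of how I check the per-process open-file limit with getrlimit (the system-wide ceiling in /proc/sys/fs/file-max is a separate setting):

    #include <sys/resource.h>
    #include <cstdio>

    // Minimal sketch: print the soft/hard per-process open-file limits.
    int main() {
        struct rlimit rl;
        if (getrlimit(RLIMIT_NOFILE, &rl) != 0) {
            std::perror("getrlimit");
            return 1;
        }
        std::printf("RLIMIT_NOFILE soft=%llu hard=%llu\n",
                    (unsigned long long)rl.rlim_cur,
                    (unsigned long long)rl.rlim_max);
        return 0;
    }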
After some time, the processes began to get stuck in uninterruptible sleep (the "D" state) and simply hung. Does anyone know why this could be happening? The issue comes up less often when I run fewer of these processes in parallel.
I also noticed the following fragmentation report for the EBS volume; maybe it is causing a problem? I'm not sure whether fragmentation is even meaningful for an EBS volume, or if/how it should be fixed.
$ sudo xfs_db -c frag -r /dev/nvme2n1
actual 1468060, ideal 16154, fragmentation factor 98.90%
(not sure if this would be more appropriate for ServerFault instead)