
Here is the description of the problem:

I have a large number of small log files in a directory, assuming:

  1. all files follow the naming convention yyyy-mm-dd.log, for example 2013-01-01.log, 2013-01-02.log.
  2. there are roughly 1,000,000 small files.
  3. the combined size of all the files is several terabytes.

Now I have to prepend a line number to each line in each file, and the line number is cumulative, running across all files in the folder (files are ordered by timestamp). For example:

  • in 2013-01-01.log, line number from 1~2500
  • in 2013-01-02.log, line number from 2501~7802
  • ...
  • in 2016-03-26.log, line number from 1590321~3280165

All the files are overwritten to include the line number.

The constraints are:

  1. the storage device is an SSD and can handle multiple IO requests simultaneously.
  2. the CPU is powerful enough.
  3. the total memory you can use is 100MB.
  4. try to maximize the performance of the application.
  5. implement and test in Java.

After thinking and searching, here is the best solution I've thought of. The code is a little long, so I'll just give a brief description of each step:

  1. count the number of lines in each file concurrently and save the mapping to a ConcurrentSkipListMap; the key is the file name, the value is the number of lines in the file, and the keys are kept in sorted order.

  2. compute the start line number of each file by traversing the ConcurrentSkipListMap; for example, if the start line number and line count of 2013-01-01.log are 1 and 1500 respectively, then the start line number of 2013-01-02.log is 1501 (a sketch of steps 1 and 2 follows this list).

  3. prepend the line number to each line of each file: read each file line by line using a BufferedReader, prepend the line number, and write to a corresponding tmp file using a BufferedWriter. Create a thread pool and process the files concurrently.

  4. rename all the tmp files back to their original names concurrently using the thread pool.
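
To make steps 1 and 2 concrete, a stripped-down sketch could look roughly like the following; the names (`LineCounter`, `countLines`, `startNumbers`) and the fixed thread count are only for illustration, not my actual code:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LineCounter {

    // Step 1: count the lines of every *.log file concurrently; the map is
    // keyed by file name, so iteration order matches the date order.
    static ConcurrentSkipListMap<String, Long> countLines(Path dir, int threads)
            throws IOException, InterruptedException {
        ConcurrentSkipListMap<String, Long> counts = new ConcurrentSkipListMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.log")) {
            for (Path file : stream) {
                pool.submit(() -> {
                    long lines = 0;
                    try (java.io.BufferedReader r = Files.newBufferedReader(file)) {
                        while (r.readLine() != null) lines++;
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                    counts.put(file.getFileName().toString(), lines);
                });
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        return counts;
    }

    // Step 2: one pass over the sorted map yields the first line number of each file.
    static Map<String, Long> startNumbers(ConcurrentSkipListMap<String, Long> counts) {
        Map<String, Long> starts = new LinkedHashMap<>();
        long next = 1;
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            starts.put(e.getKey(), next);
            next += e.getValue();
        }
        return starts;
    }
}
```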

I've tested the program on my MBP; steps 1 and 3 are the bottlenecks, as expected. Do you have a better solution, or some optimization of my solution? Thanks in advance!
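
Since step 3 dominates the run time together with step 1, here is what the per-file work in steps 3 and 4 could look like in a stripped-down form; `numberFile`, `renameBack` and the single space after the number are illustrative choices, not necessarily what my code does. Each call is submitted to the thread pool with the file's precomputed start number, and the renames run as a separate concurrent pass afterwards:

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LineNumberer {

    // Step 3: read the original file line by line, prepend the cumulative
    // line number, and write the result to a temporary sibling file.
    static void numberFile(Path file, long startLineNumber) throws IOException {
        long lineNumber = startLineNumber;
        try (BufferedReader reader = Files.newBufferedReader(file);
             BufferedWriter writer = Files.newBufferedWriter(tmpFor(file))) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(lineNumber + " " + line);
                writer.newLine();
                lineNumber++;
            }
        }
    }

    // Step 4 (run later as its own concurrent pass): swap the tmp file in.
    static void renameBack(Path file) throws IOException {
        Files.move(tmpFor(file), file, StandardCopyOption.REPLACE_EXISTING);
    }

    private static Path tmpFor(Path file) {
        return file.resolveSibling(file.getFileName() + ".tmp");
    }
}
```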

Michael
  • With a 100MB limit and 1M files, line 34 will likely bomb out already at `logPath.toFile().listFiles();`. 100 MB means you only have 100 bytes to use per file if you dare to keep information for all files in memory at the same time. – Harald Mar 26 '16 at 17:19
  • @Harald, thank you. Maybe `logPath.toFile().list()` consumes less memory. Also, someone suggested that `Files.walkFileTree` in Java 7 may work. I'll try both of them, but the problem is that I cannot create so many test logs. – Michael Mar 27 '16 at 00:31

1 Answer


Not sure if this question fits the SO model of Q&A, but I'll try some hints towards an answer.

Fact 1) Given 1M files and a 100MB limit, there is almost no way to keep information for all files in memory at the same time, except perhaps by doing a lot of bit fiddling like in the old days when we programmed in C.

Fact 2) I don't see a way around reading all files once to count the lines and then rewriting them all, which means reading them all again.

A) Is this a homework question? There may be a way to produce the file names from a folder lazily, one by one, in Java 7 or 8, but I am not aware of it. If there is, use it. If not, you might need to generate the file names instead of listing them. This would require that you can supply a start and an end date as input. Not sure whether this is possible.

B) Given a lazy Iterator<File>, whether from the JDK to list files or self-implemented to generate file names, create N of them to partition the work across N threads.

C) Now each thread takes care of its slice of files, reads them and keeps only the total number of lines of its slice.

D) From the totals of each slice, compute the starting line number for each slice.

E) Distribute the iterators over N threads again to do the line numbering. Rename each tmp file immediately after it is written; don't wait for everything to finish, so as not to have to iterate over all the files again.

At each point in time, the information kept in memory is rather small: one file name per thread, a line count over the whole slice, the current line of a file being read. 100MB is more than enough for this, if N is not outrageously large.
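
A rough sketch of points B) to D), assuming the file names have already been handed out as N ordered slices; the names `SliceCounter` and `sliceStarts` are made up for illustration:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.concurrent.Callable;

public class SliceCounting {

    // Points B)/C): each worker owns an ordered slice of file names and only
    // remembers the total line count of its slice.
    static class SliceCounter implements Callable<Long> {
        private final List<Path> slice;

        SliceCounter(List<Path> slice) { this.slice = slice; }

        @Override
        public Long call() {
            long total = 0;
            for (Path file : slice) {
                try (BufferedReader r = Files.newBufferedReader(file)) {
                    while (r.readLine() != null) total++;
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }
            return total;
        }
    }

    // Point D): the starting line number of slice i is 1 plus the sum of the
    // totals of slices 0..i-1 (slices are in file-name order).
    static long[] sliceStarts(long[] sliceTotals) {
        long[] starts = new long[sliceTotals.length];
        long next = 1;
        for (int i = 0; i < sliceTotals.length; i++) {
            starts[i] = next;
            next += sliceTotals[i];
        }
        return starts;
    }
}
```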

EDIT: Some say that Files.find() is lazily populated, yet I could not easily find the code behind it (some DirectoryStream in Java 8) to see whether the laziness only means reading the full contents of one folder at a time, whether one file name really is read at a time, or whether this even depends on the file system used.
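
For reference, this is how Files.find() (Java 8) or a DirectoryStream (Java 7) could be consumed entry by entry instead of materializing an array via listFiles(); whether the listing really is read lazily underneath is exactly the open question above, and the directory name here is hypothetical:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class LazyListing {
    public static void main(String[] args) throws IOException {
        Path dir = Paths.get("logs"); // hypothetical log directory

        // Java 8: Files.find() returns a Stream<Path> backed by a directory
        // stream, so entries can be consumed one at a time.
        try (Stream<Path> files = Files.find(dir, 1,
                (path, attrs) -> path.toString().endsWith(".log"))) {
            files.forEach(p -> System.out.println(p.getFileName()));
        }

        // Java 7 alternative: a DirectoryStream is itself an Iterable<Path>.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "*.log")) {
            for (Path p : stream) {
                // hand p to a worker here
            }
        }
    }
}
```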

Harald
  • **A)** this is homework based on a real problem; the file names may not be consecutive, so it's hard to generate them on demand. **B)/C)/D)** I actually do the same as your idea. **E)** I don't rename the tmp file immediately after it is written because I leave it to be done later concurrently, which I hope will be faster. Thank you, Harald! – Michael Mar 27 '16 at 00:55