Here is the description of the problem:
I have a large number of small log files in a directory. Assume that:
- all files follow the naming convention yyyy-mm-dd.log, for example 2013-01-01.log, 2013-01-02.log
- there are roughly 1,000,000 small files
- the combined size of all the files is several terabytes
Now I have to prepend a line number to each line of each file. The numbering is cumulative across all files in the folder, with the files ordered by timestamp. For example:
- in 2013-01-01.log, line number from 1~2500
- in 2013-01-02.log, line number from 2501~7802
- ...
- in 2016-03-26.log, line number from 1590321~3280165
All the files are overwritten to include the line number.
The constraints are:
- the storage device is an SSD and can handle multiple IO requests simultaneously.
- the CPU is powerful enough.
- the total memory you can use is 100MB.
- try to maximize the performance of the application.
- implement and test in Java.
After thinking and searching, here is the best solution I've come up with. The code is a little long, so I'll just give a brief description of each step:
1. Count the number of lines of each file concurrently and save the mapping to a ConcurrentSkipListMap: the key is the file name, the value is the file's line count, and the keys are ordered.
2. Compute the start line number of each file by traversing the ConcurrentSkipListMap. For example, if the start line number and line count of 2013-01-01.log are 1 and 1500 respectively, then the start line number of 2013-01-02.log is 1501.
3. Prepend the line number to each line of each file: read each file line by line using a BufferedReader, prepend the line number, and write to a corresponding tmp file using a BufferedWriter. Create a thread pool and process the files concurrently.
4. Rename all the tmp files back to their original names, again concurrently using the thread pool.
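Steps 1 and 2 might be sketched like this (a minimal sketch, not my actual code; the class and method names, the 64 KB buffer, and the pool size are choices I'm making up here for illustration, and it assumes newline-terminated files):

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentSkipListMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class LineCounter {

    // Step 1 helper: count '\n' bytes with a small fixed buffer, so memory
    // use per thread stays at 64 KB regardless of file size.
    static long countLines(Path file) throws IOException {
        byte[] buf = new byte[64 * 1024];
        long lines = 0;
        try (InputStream in = Files.newInputStream(file)) {
            int n;
            while ((n = in.read(buf)) > 0) {
                for (int i = 0; i < n; i++) {
                    if (buf[i] == '\n') lines++;
                }
            }
        }
        return lines;
    }

    // Steps 1 and 2: count all *.log files concurrently, then make one
    // ordered pass over the map to turn per-file counts into start numbers.
    // Because the yyyy-mm-dd.log names sort lexicographically, the skip
    // list's key order is also the timestamp order.
    static NavigableMap<String, Long> computeStarts(Path dir)
            throws IOException, InterruptedException {
        ConcurrentSkipListMap<String, Long> counts = new ConcurrentSkipListMap<>();
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, "*.log")) {
            for (Path f : files) {
                pool.submit(() -> {
                    try {
                        counts.put(f.getFileName().toString(), countLines(f));
                    } catch (IOException e) {
                        throw new UncheckedIOException(e);
                    }
                });
            }
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);

        // Step 2: a single sequential pass; this is just a prefix sum.
        NavigableMap<String, Long> starts = new TreeMap<>();
        long next = 1; // line numbers start at 1
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            starts.put(e.getKey(), next);
            next += e.getValue();
        }
        return starts;
    }
}
```

The map holds one entry per file, so even at 1,000,000 files it stays well under the 100 MB budget.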
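And steps 3 and 4 for a single file might look like this (again a sketch rather than my real code; the space separator between the number and the line, the UTF-8 charset, and the .tmp suffix are assumptions, and each call would be submitted to the thread pool with that file's start number from step 2):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class NumberPrepender {

    // Step 3: stream the file line by line into a tmp file, prepending the
    // running line number; memory use is bounded by the reader/writer buffers.
    // Step 4: rename the tmp file back over the original.
    // Returns the number of lines written (should match the step-1 count).
    static long prependNumbers(Path file, long start) throws IOException {
        Path tmp = file.resolveSibling(file.getFileName() + ".tmp");
        long n = start;
        try (BufferedReader reader = Files.newBufferedReader(file, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(tmp, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(Long.toString(n++));
                writer.write(' ');
                writer.write(line);
                writer.newLine();
            }
        }
        // An atomic move avoids a half-written file ever being visible
        // under the original name.
        Files.move(tmp, file, StandardCopyOption.REPLACE_EXISTING,
                StandardCopyOption.ATOMIC_MOVE);
        return n - start;
    }
}
```

Because each file's start number is known up front, the rewrite tasks are independent and can run on as many threads as the SSD can keep busy.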
I've tested the program on my MBP; steps 1 and 3 are the bottlenecks, as expected. Do you have a better solution, or any optimizations for mine? Thanks in advance!