
I am doing some unusual data manipulation. I have 36,000 input files, more than can be loaded into memory at once. I want to take the first byte of every file and put it in one output file, then do this again for the second byte, and so on. It does not need to be done in any specific order. Because the input files are compressed, loading them takes a bit longer, and they can't be read 1 byte at a time; I end up with a byte array for each input file.

The input files are about ~1-6MB uncompressed and ~0.3-1MB compressed (lossy compression). Each output file ends up being the number of input files in bytes, ~36KB in my example.

I know the ulimit can be raised on a Linux OS and the equivalent can be done on Windows. Even though this number can be raised, I don't think any OS will like millions of files being written to concurrently.

My current solution is to make 3000 or so BufferedWriter streams, load each input file in turn, write 1 byte to each of the 3000 files, then close the input and load the next one. With this system each input file needs to be opened about 500 times.
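Roughly, the current loop looks like this (a minimal sketch, not my real code; the class name, output naming, and `Files.readAllBytes` stand-in for decompression are placeholders, and it assumes every input is at least `windowStart + windowSize` bytes long):

```java
import java.io.*;
import java.nio.file.Files;
import java.util.List;

public class CurrentApproach {
    // Open a window of ~3000 output streams, then stream every input
    // through them one byte at a time.
    static void writeWindow(List<File> inputs, int windowStart, int windowSize, File outDir)
            throws IOException {
        BufferedOutputStream[] outs = new BufferedOutputStream[windowSize];
        for (int i = 0; i < windowSize; i++) {
            outs[i] = new BufferedOutputStream(
                    new FileOutputStream(new File(outDir, "byte_" + (windowStart + i) + ".bin")));
        }
        try {
            for (File input : inputs) {
                // In my case this is where the compressed file gets decoded into a byte[]
                byte[] data = Files.readAllBytes(input.toPath());
                for (int i = 0; i < windowSize; i++) {
                    outs[i].write(data[windowStart + i]);
                }
            }
        } finally {
            for (BufferedOutputStream out : outs) out.close();
        }
    }
}
```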

The whole operation takes 8 days to complete and is only a test case for a more practical application that would end up with larger input files, more of them, and more output files.

Caching all the compressed files in memory and decompressing them as needed does not sound practical, and would not scale to larger input files.

I think the solution would be to buffer what I can from the input files (memory constraints will not allow buffering it all), write to the output files sequentially, and then do it all over again.

However, I do not know if there is a better solution using something I am not read up on.

EDIT: I am grateful for the fast responses. I know I was being vague about the application of what I am doing, and I will try to correct that. I basically have a three-dimensional array [images][X][Y]. I want to iterate over every image and save the color of a specific pixel from each image, and do this for every pixel. The problem is memory constraints.

byte[] pixels = ((DataBufferByte) ImageIO.read( fileList.get(k) ).getRaster().getDataBuffer()).getData();

This is what I am using to load images, because it takes care of decompression and skipping the header.
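A minimal, self-contained sketch of that load (the file names are just examples, and it assumes the decoded image is backed by a byte buffer, e.g. TYPE_3BYTE_BGR, which is what ImageIO typically produces for JPEGs):

```java
import javax.imageio.ImageIO;
import java.awt.image.DataBufferByte;
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

public class LoadPixels {
    public static void main(String[] args) throws IOException {
        // Example file list; in practice this comes from scanning the input directory.
        List<File> fileList = Arrays.asList(new File("frame0001.jpg"), new File("frame0002.jpg"));
        for (File f : fileList) {
            // ImageIO handles decompression and the header; the raster's backing
            // buffer exposes the raw pixel bytes as one array.
            byte[] pixels = ((DataBufferByte) ImageIO.read(f)
                    .getRaster().getDataBuffer()).getData();
            System.out.println(f.getName() + ": " + pixels.length + " bytes of pixel data");
        }
    }
}
```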

I am not processing it as a video because I would have to get a frame, turn it into an image (a costly color space conversion), and then convert it to a byte[] just to get pixel data in the RGB color space.

I could load each image, split it into ~500 parts (the size of Y), and write each part to a separate file that I leave open and append to for every image. The outputs would easily be under a gig each. The resulting file could be loaded completely into memory and turned into an array for sequential file writing.
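A minimal sketch of that intermediate step, assuming all images share the same dimensions and a byte-backed raster (the file names and parameters are placeholders):

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.io.*;
import java.util.List;

public class SplitIntoRowFiles {
    // One intermediate file per row (Y value). Each image appends its bytes for that row,
    // so row file r ends up holding row r of image 0, then row r of image 1, and so on.
    static void split(List<File> images, File outDir, int width, int height, int bytesPerPixel)
            throws IOException {
        int rowBytes = width * bytesPerPixel;
        BufferedOutputStream[] rowOut = new BufferedOutputStream[height];
        for (int y = 0; y < height; y++) {
            rowOut[y] = new BufferedOutputStream(
                    new FileOutputStream(new File(outDir, "row_" + y + ".bin")));
        }
        try {
            for (File img : images) {
                BufferedImage bi = ImageIO.read(img);
                byte[] pixels = ((DataBufferByte) bi.getRaster().getDataBuffer()).getData();
                for (int y = 0; y < height; y++) {
                    rowOut[y].write(pixels, y * rowBytes, rowBytes);
                }
            }
        } finally {
            for (BufferedOutputStream out : rowOut) out.close();
        }
    }
}
```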

The intermediate step does mean I could split the load up across a network, but I am trying to get it done on a low-quality laptop with 4GB of RAM, no GPU, and a low-end i7.

I had not considered saving anything to a file as an intermediate step before reading davidbak's response. Size is the only thing making this problem not trivial, and I now see the size can be divided into smaller, more manageable chunks.

Audo Voice
  • Not sure what part 3 is. You need to uncompress a file and append the first few bytes to a file? Why to 3,000 files? If you have more than 8 servers you can use Hadoop. – tgkprog Apr 27 '16 at 21:33
  • The inputs are all the same size for a given run but could vary in size between runs, and also vary in the number of files. If it were 1MB each, with 36,000 files, that would be a 36GB file, and that is the low end of things. I could then read that file in a very predictable way: each byte I need would be exactly 1MB (the size of one input file) apart. But keeping in mind the amount of time to assemble it into one massive file, is this really much faster? It would load and then unload every byte of 36 gigs into memory just to complete 1 file, and it would do this 1 million times. – Audo Voice Apr 27 '16 at 21:36

2 Answers


Three-phase operation:

Phase one: read all input files, one at a time, and write them into a single output file. The output file will be record-oriented - say, 8-byte records: 4 bytes of "character offset" and 4 bytes of "character codepoint". As you're reading a file the character offset starts at 0, of course, so if the input file is "ABCD" you're writing (0, A) (1, B) (2, C) (3, D). Each input file is opened once, read sequentially, and closed. The output file is opened once, written sequentially throughout, then closed.
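A minimal sketch of phase one, assuming the inputs have already been decompressed to plain byte files (decompression would replace the `readAllBytes` call), using the 4-byte offset + 4-byte value layout described above:

```java
import java.io.*;
import java.nio.file.Files;
import java.util.List;

public class PhaseOne {
    // Write one 8-byte record per input byte: (offset within its file, the byte's value).
    static void writeRecords(List<File> inputs, File intermediate) throws IOException {
        try (DataOutputStream out = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(intermediate)))) {
            for (File input : inputs) {
                byte[] data = Files.readAllBytes(input.toPath()); // decompression would go here
                for (int offset = 0; offset < data.length; offset++) {
                    out.writeInt(offset);               // 4-byte "character offset"
                    out.writeInt(data[offset] & 0xFF);  // 4-byte value/"codepoint"
                }
            }
        }
    }
}
```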

Phase two: Use an external sort to sort the 8-byte records of the intermediate file on the 4-byte character offset field.
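Any external sort will do; as one illustration (a sketch, not a tuned implementation), here is a run-and-merge version that assumes the record format from phase one and offsets small enough to fit in a signed int:

```java
import java.io.*;
import java.util.*;

public class PhaseTwoExternalSort {
    // Split the intermediate file into sorted runs that fit in memory,
    // then merge the runs with a priority queue keyed on the offset field.
    static void sort(File in, File out, int recordsPerRun) throws IOException {
        mergeRuns(makeSortedRuns(in, recordsPerRun), out);
    }

    static List<File> makeSortedRuns(File in, int recordsPerRun) throws IOException {
        List<File> runs = new ArrayList<>();
        try (DataInputStream din = new DataInputStream(
                new BufferedInputStream(new FileInputStream(in)))) {
            long[] buf = new long[recordsPerRun]; // pack (offset, value) into one long
            while (true) {
                int n = 0;
                try {
                    while (n < recordsPerRun) {
                        long offset = din.readInt() & 0xFFFFFFFFL;
                        long value = din.readInt() & 0xFFFFFFFFL;
                        buf[n++] = (offset << 32) | value;
                    }
                } catch (EOFException eof) { /* last, possibly short, run */ }
                if (n == 0) break;
                Arrays.sort(buf, 0, n); // signed sort; fine while offsets stay below 2^31
                File run = File.createTempFile("run", ".bin");
                run.deleteOnExit();
                try (DataOutputStream dout = new DataOutputStream(
                        new BufferedOutputStream(new FileOutputStream(run)))) {
                    for (int i = 0; i < n; i++) {
                        dout.writeInt((int) (buf[i] >>> 32));
                        dout.writeInt((int) buf[i]);
                    }
                }
                runs.add(run);
                if (n < recordsPerRun) break;
            }
        }
        return runs;
    }

    static void mergeRuns(List<File> runs, File out) throws IOException {
        DataInputStream[] ins = new DataInputStream[runs.size()];
        // Heap entries are {run index, packed record}, ordered by the packed record.
        PriorityQueue<long[]> heap = new PriorityQueue<>(Comparator.comparingLong((long[] a) -> a[1]));
        for (int i = 0; i < runs.size(); i++) {
            ins[i] = new DataInputStream(new BufferedInputStream(new FileInputStream(runs.get(i))));
            long packed = readPacked(ins[i]);
            if (packed != -1) heap.add(new long[]{i, packed});
        }
        try (DataOutputStream dout = new DataOutputStream(
                new BufferedOutputStream(new FileOutputStream(out)))) {
            while (!heap.isEmpty()) {
                long[] top = heap.poll();
                dout.writeInt((int) (top[1] >>> 32));
                dout.writeInt((int) top[1]);
                long next = readPacked(ins[(int) top[0]]);
                if (next != -1) heap.add(new long[]{top[0], next});
            }
        }
        for (DataInputStream d : ins) d.close();
    }

    // Returns -1 at end of run; real records never collide with the sentinel
    // because the value field is at most 255.
    static long readPacked(DataInputStream din) throws IOException {
        try {
            long offset = din.readInt() & 0xFFFFFFFFL;
            long value = din.readInt() & 0xFFFFFFFFL;
            return (offset << 32) | value;
        } catch (EOFException eof) {
            return -1;
        }
    }
}
```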

Phase three: Open the sorted intermediate file and make one pass through it. Open a new output file every time the character offset field changes and write to that output file all the characters that belong to that offset. The input file is opened once and read sequentially. Each output file is opened, written to sequentially, then closed.
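And a minimal sketch of phase three, again assuming the same record layout and that each value fits in one byte (the output naming is arbitrary):

```java
import java.io.*;

public class PhaseThree {
    // One pass over the sorted intermediate file: whenever the offset field changes,
    // close the current output file and open a new one for the new offset.
    static void split(File sortedIntermediate, File outDir) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(new FileInputStream(sortedIntermediate)))) {
            int currentOffset = -1;
            BufferedOutputStream out = null;
            try {
                while (true) {
                    int offset;
                    try {
                        offset = in.readInt();
                    } catch (EOFException eof) {
                        break;
                    }
                    int value = in.readInt();
                    if (offset != currentOffset) {
                        if (out != null) out.close();
                        out = new BufferedOutputStream(
                                new FileOutputStream(new File(outDir, "offset_" + offset + ".bin")));
                        currentOffset = offset;
                    }
                    out.write(value); // writes the low 8 bits, which is all we stored
                }
            } finally {
                if (out != null) out.close();
            }
        }
    }
}
```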

Voilà! You need space for the intermediate file, and a good external sort (and space for its work files).

As @Jorge suggests, both phase 1 and phase 2 can be parallelized, and in fact, this sort of job as outlined (phases 1 to 3) is exactly in mapreduce/hadoop's sweet spot.

davidbak

You are being very vague there, but maybe a look at MapReduce could help. It seems like the kind of job that could be distributed.

With the additional info you provided, I really don't see how to execute that task on common hardware like the 4GB i7 you mentioned. Your problem looks like an image stacking algorithm to get a decent image from a lot of not-so-good images, a typical problem in astronomical image processing, and I'm sure it is applied to other areas. A good look into astronomical image processing may be a good use of your time; there is software called RegiStax (not sure if it still exists) that does something like that, but with video files.

Doing some napkin math: if you take 1 second to open a file, you get 10 hours' worth of just file opening.

An approach would be to get some FAST disk (SSD); then I'd decompress all the files into some raw format and store them on disk. From there on you can use file pointers to read directly from the files without loading them entirely into memory, and write the output to a file directly on the disk.
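For example, with RandomAccessFile you can seek to the same position in every decompressed raw file instead of reading whole files into memory; a sketch (the file naming and the idea of handling a few thousand positions per pass are just illustrative):

```java
import java.io.*;

public class SeekAcrossRawFiles {
    // One pass: open each raw (already-decompressed) file once, seek to startOffset,
    // read `count` consecutive bytes, and fan them out to `count` output files.
    // Doing a few thousand positions per pass keeps the number of open output files sane.
    static void collectRange(File[] rawFiles, long startOffset, int count, File outDir)
            throws IOException {
        BufferedOutputStream[] outs = new BufferedOutputStream[count];
        for (int i = 0; i < count; i++) {
            outs[i] = new BufferedOutputStream(
                    new FileOutputStream(new File(outDir, "pos_" + (startOffset + i) + ".bin")));
        }
        try {
            byte[] buf = new byte[count];
            for (File raw : rawFiles) {
                try (RandomAccessFile in = new RandomAccessFile(raw, "r")) {
                    in.seek(startOffset);   // jump straight to the slice we need
                    in.readFully(buf);      // only `count` bytes ever come into memory
                    for (int i = 0; i < count; i++) outs[i].write(buf[i]);
                }
            }
        } finally {
            for (BufferedOutputStream out : outs) out.close();
        }
    }
}
```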

  • Thanks for the pointer to RegiStax (which [still exists](http://www.astronomie.be/registax/)); I was completely unaware of that category of image processing software. – davidbak Apr 28 '16 at 15:52