
I have a program that generates a huge matrix, and once it is calculated I have to reuse it at later times. For that reason, I want to cache it to the local hard disk so that I can read it back later. Currently I simply write the data to a file and read it again later.

But is there anything special I should take into consideration for doing such tasks in Java? For example, do I need to serialize it, or perhaps do something else special? Is there anything I should take care of when storing important application data like this? Should it be plain ASCII/XML, or what?
The data is not sensitive, but its integrity is important.

Johnydep

5 Answers


You have a few options for storing your data. You can simply state the width in a header and write everything out as a list with a separator (e.g. '\n', '\t', ' ', etc.). Otherwise, you can use ObjectOutputStream to store your data. Be wary: this will likely be less efficient than a hand-rolled format, but it is easier to use.

Other than that, you're free to do as you choose. I usually use a FileWriter and just write all of my data in plain text. If you're after raw efficiency, FileOutputStream is what you need.
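To illustrate the ObjectOutputStream option mentioned above: since `double[][]` is serializable, the whole matrix can be written and read back in a single call each way. This is only a sketch; the class name and file path are made up for the example.

```java
import java.io.*;

public class MatrixCache {
    // Write the whole matrix in one call: double[][] is serializable
    static void save(double[][] matrix, File file) throws IOException {
        try (ObjectOutputStream oos = new ObjectOutputStream(
                new BufferedOutputStream(new FileOutputStream(file)))) {
            oos.writeObject(matrix);
        }
    }

    // Read it back and cast to the original array type
    static double[][] load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new BufferedInputStream(new FileInputStream(file)))) {
            return (double[][]) ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        double[][] matrix = {{1.0, 2.0}, {3.0, 4.0}};
        File cache = new File("matrix.cache"); // example path
        save(matrix, cache);
        double[][] restored = load(cache);
        System.out.println(restored[1][0]); // prints 3.0
    }
}
```

The serialized form carries class metadata as well as the raw values, which is part of the overhead the answers below warn about.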

Ryan Amos

If your data is really huge, I'd recommend a binary format: it will be smaller and faster to read, and especially to parse (XML or JSON are many times slower to parse than binary data). Serialization also brings a lot of overhead, so you might want to look at DataInputStream and DataOutputStream. If you know you will be writing only numbers of a specific type, or you know what sequence the data will be in, these are certainly the fastest options.

Do not forget to wrap the file streams in buffered streams; they will make your operations an order of magnitude faster still.

Something like the following (8192 is an example buffer size; tailor it to your needs):

    final File file = null; // get file somehow
    final DataOutputStream dos = new DataOutputStream(
        new BufferedOutputStream(new FileOutputStream(file), 8192));
    try {
        // loop through your matrix (might be different if matrix is sparse)
        for (int x = 0; x < xSize; x++) {
            for (int y = 0; y < ySize; y++) {
                if (matrix[x][y] != 0.0) {
                    dos.writeInt(x);
                    dos.writeInt(y);
                    dos.writeDouble(matrix[x][y]);
                }
            }
        }
        dos.writeInt(-1); // mark end (might be done differently)
    } finally {
        dos.close();
    }

and input:

    final File file = null; // get file somehow
    final DataInputStream dis = new DataInputStream(
        new BufferedInputStream(new FileInputStream(file), 8192));
    try {
        int x;
        while ((x = dis.readInt()) != -1) {
            int y = dis.readInt();
            double value = dis.readDouble();
            // store x, y, value in matrix
        }
    } finally {
        dis.close();
    }

As Ryan Amos correctly pointed out, if the matrix is not sparse it could be faster to just write the values (but all of them):

Out:

    dos.writeInt(xSize);
    dos.writeInt(ySize);
    for (int x = 0; x < xSize; x++) {
        for (int y = 0; y < ySize; y++) {
            dos.writeDouble(matrix[x][y]);
        }
    }

In:

    int xSize = dis.readInt();
    int ySize = dis.readInt();
    for (int x = 0; x < xSize; x++) {
        for (int y = 0; y < ySize; y++) {
            matrix[x][y] = dis.readDouble();
        }
    }

(Mind, I have not compiled this, so you might need to correct some things; it is off the top of my head.)

Without buffers you will read byte by byte, which will be slow.

One more comment: with such a huge dataset you should consider using a sparse matrix representation and writing/reading only the elements that are non-zero (unless you really have that many significant elements).
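A sparse representation can be as simple as a map from (x, y) to value; the class below is illustrative (the name SparseMatrix and the long-key packing are my own choices, not a standard library class):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sparse matrix: stores only non-zero entries.
// Keys pack (x, y) into one long, assuming indices fit in 32 bits each.
class SparseMatrix {
    private final Map<Long, Double> entries = new HashMap<>();

    private static long key(int x, int y) {
        return ((long) x << 32) | (y & 0xFFFFFFFFL);
    }

    void set(int x, int y, double value) {
        if (value == 0.0) entries.remove(key(x, y)); // zeros are not stored
        else entries.put(key(x, y), value);
    }

    double get(int x, int y) {
        return entries.getOrDefault(key(x, y), 0.0); // absent means zero
    }

    int nonZeroCount() {
        return entries.size();
    }
}
```

With a structure like this, the x/y/value write loop above can simply iterate over the map entries, and memory use scales with the number of non-zero elements rather than the full 25000 x 34000 size.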

As noted in the comments above, if you really want to write/read every single element in a matrix of that size, you are already talking about hours of writing rather than seconds.

Jarek Potiuk
  • Buffered streams? Can you refer me to some example of this? – Johnydep Jun 26 '11 at 16:55
  • This is too inefficient. What should be done is create a header that states the length of the matrix. This way, as you read each data point, you add it in to the matrix you have. When your x becomes larger than the length provided in the header, y++. When you run out of data, you'll run out of space. – Ryan Amos Jun 26 '11 at 18:32
  • That's very true - if the matrix is really full of data... If you have sparse matrix with lots of 0s (which I would expect with such huge array), storing x,y,value might be better. Depends on how sparse the matrix is. – Jarek Potiuk Jun 26 '11 at 18:37
  • @Jarek thank you for such a detailed explanation. Yes, I know it takes a lot of time, I tried it. It runs for about 2 hours, which is pretty OK for me, as once loaded it will be serving search queries from around the world. It's just working as a backup system. But thanks again for such a detailed response. – Johnydep Jun 27 '11 at 01:27
  • @Jarek @Johnydep For your example of my solution, you won't need to write one of the sizes, because you can calculate the other size as # of elements / other size. If you plan to have a 2D array of different lengths, I'd suggest allocating a specific size to each portion, as a "header" to each portion. Either that, or you could use an escape sequence for the line, such as -1. – Ryan Amos Jun 28 '11 at 15:40
  • Yep. Some optimisations might be implemented indeed (and in this case I think, taking into account the potential size of data these are not premature optimisations). It very much depends on the actual data - how sparse, how many repeated values etc. If the data is quite regular you could even think about wrapping it into ZipStream when writing/reading - because the data might turn to be very well compressable, further reducing the size/time needed to read/write.... You could choose between manual compression (by optimising some parts away) or automated (using zip compression). – Jarek Potiuk Jun 28 '11 at 21:20
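The zip wrapping mentioned in the last comment can be a one-line change around the existing streams. This sketch uses GZIPOutputStream/GZIPInputStream; the class name, file path, and stream ordering are just one reasonable choice for the example:

```java
import java.io.*;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipMatrixCache {
    // Layering: data layer on top, then buffering, then compression, then the file
    static void save(double[][] matrix, File file) throws IOException {
        try (DataOutputStream dos = new DataOutputStream(
                new BufferedOutputStream(
                    new GZIPOutputStream(new FileOutputStream(file))))) {
            dos.writeInt(matrix.length);       // xSize header
            dos.writeInt(matrix[0].length);    // ySize header
            for (double[] row : matrix)
                for (double v : row)
                    dos.writeDouble(v);
        }
    }

    static double[][] load(File file) throws IOException {
        try (DataInputStream dis = new DataInputStream(
                new BufferedInputStream(
                    new GZIPInputStream(new FileInputStream(file))))) {
            int xSize = dis.readInt();
            int ySize = dis.readInt();
            double[][] matrix = new double[xSize][ySize];
            for (int x = 0; x < xSize; x++)
                for (int y = 0; y < ySize; y++)
                    matrix[x][y] = dis.readDouble();
            return matrix;
        }
    }
}
```

For regular data (e.g. lots of zeros), the compressed file can be dramatically smaller, at the cost of some CPU time during read/write.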

If your entries are numbers, you could just save each row of your matrix as a line in your file, separated by some delimiter. You don't need special serialization then. :)
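A sketch of that approach, with one matrix row per line and tab as the delimiter (the class name and the fixed-size load signature are just for illustration):

```java
import java.io.*;

public class TextMatrixCache {
    // One matrix row per line, values separated by tabs
    static void save(double[][] matrix, File file) throws IOException {
        try (PrintWriter out = new PrintWriter(
                new BufferedWriter(new FileWriter(file)))) {
            for (double[] row : matrix) {
                StringBuilder sb = new StringBuilder();
                for (int j = 0; j < row.length; j++) {
                    if (j > 0) sb.append('\t');
                    sb.append(row[j]);
                }
                out.println(sb);
            }
        }
    }

    // Read back by splitting each line on the delimiter
    static double[][] load(File file, int rows, int cols) throws IOException {
        double[][] matrix = new double[rows][cols];
        try (BufferedReader in = new BufferedReader(new FileReader(file))) {
            for (int i = 0; i < rows; i++) {
                String[] parts = in.readLine().split("\t");
                for (int j = 0; j < cols; j++) {
                    matrix[i][j] = Double.parseDouble(parts[j]);
                }
            }
        }
        return matrix;
    }
}
```

This is the simplest format to debug (you can open the file in any editor), though as the comments below point out, parsing text is much slower than reading binary for a matrix of this size.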

Morten Kristensen
  • Yes, actually it's just matrix entries, which are numbers in double format. It would be a 25000 x 34000 matrix of doubles. So you think storing it as a plain file is the right thing? – Johnydep Jun 26 '11 at 16:46
  • @Johnydep As I stated in my answer, a plain file is fine :D – Ryan Amos Jun 26 '11 at 16:47
  • Yes, it is fine to save to a text file. – Morten Kristensen Jun 26 '11 at 16:53
  • Binary form will be many times faster. Trust me. Parsing text data is very slow, and in your case you certainly need to squeeze it. Quick calculation: even if a single number is parsed/written in 1 µs (which is highly unlikely), the whole read/write operation will take 14 minutes. – Jarek Potiuk Jun 26 '11 at 17:05
  • @Jarek Potiuk, can you give me an example how to do that? – Johnydep Jun 26 '11 at 17:36
  • Yep. Look at the answer I gave - using DataInputStream/DataOutputStream should do the job. Simply write your data in a loop: dos.writeInt(x); dos.writeInt(y); dos.writeDouble(value); ... and read it back in a loop as well: x = dis.readInt(); y = dis.readInt(); value = dis.readDouble(); ... – Jarek Potiuk Jun 26 '11 at 18:12
  • And I updated my answer with some example how it might look like. – Jarek Potiuk Jun 26 '11 at 18:27

It all depends on how you'll output it later, and on whether you'll also be storing it in a database or somewhere else. If you're never outputting it or storing it anywhere else, then a text file would work.

rkulla

If there's no need to persist the data (i.e. keep it after the Java program terminates), it would be faster to keep it in memory in a Java variable; plenty of types meet that requirement (HashMap, ArrayList, ...). If you need to keep the data for use in subsequent program executions, you can store it in a file using standard file read/write methods. Plain ASCII would be faster to read/write than XML. Regarding the integrity of the files, that is OS-related, because in the end it is a file on your local filesystem.

Jihed Amine