
I have a buffered reader that reads a large file line-by-line to remove duplicate lines.

Instead of loading the whole file into memory, I'd like to do this with two buffered readers. The first iterates over fixed-size portions of the file, loading each portion into memory one at a time.

In each iteration, the second buffered reader would read from where the first one stopped to the end of the file, to check that the loaded portion does not occur again later in the file.

The problem is that I can't create a new, independent BufferedReader object (not just another reference) that starts at the position where the first one stopped.

I need a way to find out the first buffered reader's file position so I can tell the second buffered reader where to begin.

What I've tried so far:

Passing the first object to the second's constructor.

This actually worked, but both shared the same underlying position, so the first one moved along with the second one to the end of the file:

BufferedReader cleanfilereader2 = new BufferedReader(cleanfilereader);

BufferedReader.mark() records a reader's current position so that reset() can return to it, but I still need a way to find out the position of the first reader.

Notes:

  • The number of lines is not constant
  • Can't load the whole file into memory
  • Both time and memory are issues
Dave Newton
Hady Elsahar

4 Answers

1

If the file is large and time is an issue, this may not be the best approach, because each portion requires re-reading the rest of the file, which adds up to O(n^2) line reads.

If you have enough memory for it, I would suggest reading the file line by line and storing the hash value of each line in a list. Stored as a primitive int, each hash needs only 4 bytes per line (an ArrayList<Integer> boxes every value and needs several times that). Then you can search this list for duplicates, which is fast because it is in memory. This gives you all potential duplicates, and you only have to check whether they are real duplicates while you remove them.
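The first pass described above could be sketched roughly as follows. This is only an illustration under some assumptions: the class and method names are made up, the hash is String.hashCode(), and a growable int[] is swapped in for the ArrayList so that each hash really costs 4 bytes. Equal hashes only mark candidates, so a second pass still has to compare the actual lines.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.Arrays;

// Hypothetical helper, not part of any library.
final class HashCandidates {

    /** Returns the sorted, distinct hash codes that occur on more than one line. */
    static int[] duplicateHashes(Reader source) throws IOException {
        int[] hashes = new int[1024];
        int count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (count == hashes.length) {
                    hashes = Arrays.copyOf(hashes, count * 2); // grow the array
                }
                hashes[count++] = line.hashCode();
            }
        }

        // Sorting puts equal hashes next to each other.
        int[] sorted = Arrays.copyOf(hashes, count);
        Arrays.sort(sorted);

        int[] result = new int[count];
        int found = 0;
        for (int i = 1; i < count; i++) {
            boolean repeats = sorted[i] == sorted[i - 1];
            boolean alreadyRecorded = found > 0 && result[found - 1] == sorted[i];
            if (repeats && !alreadyRecorded) {
                result[found++] = sorted[i];
            }
        }
        return Arrays.copyOf(result, found);
    }
}
```

Calling duplicateHashes(new FileReader(path)) would give the candidate hashes; lines whose hash is not in that array cannot be duplicates and can be copied through unchecked.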

Jannis Froese
  • Nice solution! Though it would be better to track the hashes in a map, keyed on hash with value of the duplicate line number. That way when he iterates the file again for writing he can skip lines with a particular hash if their line number matches the stored value. – Perception Jan 24 '13 at 17:35
  • For sure, I've thought of that. I implemented it and ran into Java heap problems. I could increase the heap size, but I intend to write something that works with large files of around 4 to 5 GB, so this won't be applicable. – Hady Elsahar Jan 24 '13 at 19:57
  • Then thumbs up for actually trying it out. There are sadly too many programmers out there who wouldn't even consider multiple approaches to that problem. – Jannis Froese Jan 24 '13 at 22:09
0

You need BufferedReader.skip, but there is no C-like tell to report the current position. Hence drop BufferedReader and use a plain RandomAccessFile or, with java.nio, a memory-mapped file buffer.
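A minimal sketch of the RandomAccessFile route, assuming two independent cursors on the same file: getFilePointer() plays the role of tell, and seek() positions the second cursor. The class and method names are hypothetical. Note that RandomAccessFile.readLine() is unbuffered and maps each byte directly to a char, so it is slow on large files and only suitable for single-byte text.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class, not part of any library.
final class TwoCursorDemo {

    /**
     * Consumes up to portionSize lines with one cursor, then returns the
     * remaining lines as read by a second, independent cursor.
     */
    static List<String> linesAfterPortion(String path, int portionSize) throws IOException {
        try (RandomAccessFile first = new RandomAccessFile(path, "r");
             RandomAccessFile second = new RandomAccessFile(path, "r")) {

            // A real implementation would keep these lines in memory.
            int consumed = 0;
            while (consumed < portionSize && first.readLine() != null) {
                consumed++;
            }

            // The equivalent of C's tell: where did the first cursor stop?
            long position = first.getFilePointer();

            // Start the second cursor there; the first cursor does not move.
            second.seek(position);
            List<String> rest = new ArrayList<>();
            String line;
            while ((line = second.readLine()) != null) {
                rest.add(line);
            }
            return rest;
        }
    }
}
```

Because each RandomAccessFile has its own file pointer, reading through the second one to the end of the file leaves the first one where it stopped.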

Joop Eggen
0

If you need to read the current position, you can use a FileChannel. As its documentation says:

"A file channel has a current position within its file which can be both queried and modified."

You can create an InputStream from the channel using Channels.newInputStream() (don't close that stream if you don't want to close the underlying channel).
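As a rough sketch of this idea, the following reads a few bytes through a stream created with Channels.newInputStream(), queries the channel's position, and starts a second channel at that offset. The class and method names are hypothetical, and the file is assumed to be ASCII text. One caveat: if the stream is wrapped in a BufferedReader, the reader fetches data ahead of what it has handed out, so the channel position will usually lie beyond the last line returned.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical demo class, not part of any library.
final class ChannelPositionDemo {

    /** Consumes bytesToConsume bytes via one channel, then reads the rest via a second one. */
    static String readRestFrom(Path file, int bytesToConsume) throws IOException {
        try (FileChannel first = FileChannel.open(file, StandardOpenOption.READ);
             FileChannel second = FileChannel.open(file, StandardOpenOption.READ)) {

            // Unbuffered stream: every byte read advances the channel position.
            InputStream in = Channels.newInputStream(first);
            int consumed = 0;
            while (consumed < bytesToConsume && in.read() != -1) {
                consumed++;
            }

            long position = first.position(); // query the current position
            second.position(position);        // start the second channel there

            InputStream restStream = Channels.newInputStream(second);
            ByteArrayOutputStream rest = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int n;
            while ((n = restStream.read(buffer)) != -1) {
                rest.write(buffer, 0, n);
            }
            return new String(rest.toByteArray(), StandardCharsets.US_ASCII);
        }
    }
}
```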

bwt
0

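Try this (if I understand you correctly):

import java.io.*;

class delete {
    public static void main(String[] args) throws IOException {
        // The program reads its own source, so delete.java must be in the working directory.
        FileInputStream fis1 = new FileInputStream("delete.java");
        // fis2 is only a second reference to the same stream, so both share one file position.
        FileInputStream fis2 = fis1;
        byte[] buff = new byte[100];
        int n;
        while (true) {
            if ((n = fis1.read(buff)) == -1) break;
            // Decode only the bytes actually read; the last chunk may be shorter than the buffer.
            System.out.print(new String(buff, 0, n));
            if ((n = fis2.read(buff)) == -1) break;
            System.out.print(new String(buff, 0, n));
        }
        fis1.close();
    }
}

Output: the source code above.

The question is really interesting, so please comment for discussion.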

Arpit