In an interview I was asked the following question.

There is a file named sourceFile.txt containing random numbers, one per line, like this:

608492
213420
23305
255572
64167
144737
81122
374768
535077
866831
496153
497059
931322

The same number can occur more than once. The size of sourceFile.txt is around 65GB.

I need to read that file and write the numbers, in sorted order, into a new file, say destinationFile.txt.

I wrote the following code for this:

/*
    Copy the numbers present in the file, store them in a
    list, sort it and then write it into another file.
*/
public static void readFileThanWrite(String sourceFileName,String destinationFileName) throws Exception{
    String line = null;
    BufferedReader reader = new BufferedReader(new FileReader(sourceFileName));
    List<Integer> list = new ArrayList<Integer>();
    do{         
        if(line != null){               
            list.add(Integer.parseInt(line));
        }

        line = reader.readLine();
    }while(line != null);

    Collections.sort(list);

    File file = new File(destinationFileName);
    FileWriter fileWriter = new FileWriter(file,true); // true means append to the end of the file
    BufferedWriter buff = new BufferedWriter(fileWriter);
    PrintWriter out = new PrintWriter(buff);

    for(Iterator<Integer> itr = list.iterator();itr.hasNext();){
        out.println(itr.next());
    }

    out.close();
    buff.close();
    fileWriter.close();
}

But the interviewer said the above program will fail to load and sort the numbers because the file is too large.

What would be a better solution?

Rahul Shivsharan
  • You tried to load a **65GB** file into _memory_?? What machine do you have that has over 100GB of RAM?? – Boris the Spider Nov 05 '16 at 10:02
  • Also, the resource handling is abysmal. If writing code for interview you _have_ to do better. – Boris the Spider Nov 05 '16 at 10:03
  • I tried the above code on a source file of size 1.8MB, and it worked well. I don't know how the above code would behave on a 65GB file. – Rahul Shivsharan Nov 05 '16 at 10:05
  • Is it by chance that the numbers are all positive integers below one million, or is this part of the problem specification? – Henry Nov 05 '16 at 10:06
  • What do you mean you don't know? You should be able to reason about this, and conclude that it won't work. This is what the interviewer is trying to determine. – Boris the Spider Nov 05 '16 at 10:06
  • What is the maximum limit of these numbers? – Sanket Makani Nov 05 '16 at 10:07
  • You went for the extremely naive approach, which fails for any significant workload. The good news is that this gave the interviewers a very clear idea of your skills. The bad news is that you've still got a lot to learn. – Kayaman Nov 05 '16 at 10:29
  • I think a better approach would be to create a `Stream` with `Files.lines(...)` to read it line by line and sort it afterwards. – QBrute Nov 05 '16 at 10:30
  • @QBrute Then you probably wouldn't get that job either. Sorting 65GB of numbers requires a *specific* mechanism, or you'll run out of memory. – Kayaman Nov 05 '16 at 10:42
  • @QBrute, how would that relieve the problem of not being able to hold all the numbers in memory? – Ole V.V. Nov 05 '16 at 10:43
  • If we may assume all the numbers are in the range 0 through 999,999, keep an array of counts (an array of one million `int`s should fit in memory). For each number encountered, increment its count in the array. For output, iterate through the array and print each number as many times as the count says. It even runs in linear time. – Ole V.V. Nov 05 '16 at 10:46
  • I'm *quite* certain that they were looking for the candidate to use a sorting algorithm that doesn't require having all the elements in the memory at the same time. It's a quite common interview question since it tests whether you're familiar with algorithms. – Kayaman Nov 05 '16 at 10:53
  • If @Kayaman is correct, they are after an algorithm for sorting on disk. Such algorithms can easily be found on the Internet. In this case I’m still a bit puzzled why they gave an example with only positive 5 and 6 digit numbers. – Ole V.V. Nov 05 '16 at 11:13
  • @OleV.V. You can't assume that it's a representative sample, since it's only stated that the numbers are *random*. Also a classic interview question. – Kayaman Nov 05 '16 at 11:16
  • @Kayaman Random numbers also have some limit! – Sanket Makani Nov 05 '16 at 11:17
  • You are correct, of course, @Kayaman, only the OP can know whether they specified a range for the numbers or not. – Ole V.V. Nov 05 '16 at 11:19
  • @SanketMakani, the only limit we know of at this time is that the number has to fit into a 65 GB file, so it cannot have much more than 65,000,000,000 digits. – Ole V.V. Nov 05 '16 at 11:19
  • Yes, and even if we assume that every number in the file has 1 digit, the maximum number of lines is `65,000,000,000`, so it will definitely not have any issue with space complexity. – Sanket Makani Nov 05 '16 at 11:25
  • @SanketMakani "_So It will definitely not have any issue with space complexity_". There is 65GB of data; unless you have at least 65GB of RAM you **do** have an issue with space complexity. Unless you can make some assumptions. – Boris the Spider Nov 05 '16 at 13:14
  • @BoristheSpider If you had read carefully, I clearly wrote about an input having only one-digit numbers, so in a `HashMap` the maximum possible number of entries is 10 and the maximum frequency is `65,000,000,000`, which is then stored in a `Long`. So the total required space is **10*(4+4) = 80Bytes**, which **definitely** **doesn't** have any issue with space complexity in that case! – Sanket Makani Nov 05 '16 at 14:45
  • @SanketMakani and if it has the same digit repeated then we don't even have to sort it. – Boris the Spider Nov 05 '16 at 15:30
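
The counting idea raised in the comments above can be sketched as follows. This is only a minimal illustration under a loud assumption: every line holds an integer between 0 and 999,999, which the question does not confirm. The class name `CountingFileSort` and the constant `MAX_VALUE` are hypothetical names chosen for the example. The counts are kept as `long` rather than `int`, since a 65GB file could contain more than `Integer.MAX_VALUE` copies of one value.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class CountingFileSort {

    // Assumption (not stated in the question): all values lie in [0, MAX_VALUE].
    static final int MAX_VALUE = 999_999;

    static void sort(Path source, Path destination) throws IOException {
        // One counter per possible value: about 8 MB, independent of file size.
        long[] counts = new long[MAX_VALUE + 1];

        // Pass 1: stream the input and count occurrences; no number is retained.
        try (BufferedReader reader = Files.newBufferedReader(source, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // tolerate blank lines
                }
                int value = Integer.parseInt(line);
                if (value < 0 || value > MAX_VALUE) {
                    throw new IllegalArgumentException("Value out of assumed range: " + value);
                }
                counts[value]++;
            }
        }

        // Pass 2: walk the values in ascending order and emit each one count times.
        try (BufferedWriter writer = Files.newBufferedWriter(destination, StandardCharsets.UTF_8)) {
            for (int value = 0; value <= MAX_VALUE; value++) {
                String text = Integer.toString(value);
                for (long i = 0; i < counts[value]; i++) {
                    writer.write(text);
                    writer.newLine();
                }
            }
        }
    }
}
```

If the range assumption does not hold, the sketch throws instead of producing wrong output, and a disk-based sort is needed instead.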

1 Answer

If you know that all the numbers are relatively small, keeping an array of occurrence counts would do just fine. If you don't have any information about the input, you're looking for external sorting. Here's a Java project that could help you, and here is the corresponding class.
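
To make the external sorting idea concrete, here is a minimal sketch of an external merge sort. It is not the code of the project mentioned above. The names `ExternalSortSketch`, `Run`, and `maxNumbersInMemory` are hypothetical, and the sketch assumes that each line parses as a `long`, that `maxNumbersInMemory` is positive, and that the number of sorted runs is small enough to keep one open reader per run.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

class ExternalSortSketch {

    /** One open sorted run plus the value currently at its head. */
    private static final class Run implements Comparable<Run> {
        final BufferedReader reader;
        long head;

        Run(BufferedReader reader) { this.reader = reader; }

        /** Loads the next value into head; returns false when the run is exhausted. */
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) return false;
            head = Long.parseLong(line);
            return true;
        }

        @Override
        public int compareTo(Run other) { return Long.compare(head, other.head); }
    }

    static void sort(Path source, Path destination, int maxNumbersInMemory) throws IOException {
        List<Path> runs = new ArrayList<>();

        // Phase 1: read chunks that fit in memory, sort each, write it out as a run.
        try (BufferedReader reader = Files.newBufferedReader(source)) {
            long[] buffer = new long[maxNumbersInMemory];
            int size = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) continue;
                buffer[size++] = Long.parseLong(line);
                if (size == buffer.length) {
                    runs.add(writeRun(buffer, size));
                    size = 0;
                }
            }
            if (size > 0) runs.add(writeRun(buffer, size));
        }

        // Phase 2: k-way merge; the heap always exposes the run with the smallest head.
        PriorityQueue<Run> heap = new PriorityQueue<>();
        try (BufferedWriter writer = Files.newBufferedWriter(destination)) {
            for (Path path : runs) {
                Run run = new Run(Files.newBufferedReader(path));
                if (run.advance()) heap.add(run); else run.reader.close();
            }
            while (!heap.isEmpty()) {
                Run smallest = heap.poll();
                writer.write(Long.toString(smallest.head));
                writer.newLine();
                // Re-insert the run only if it still has values left.
                if (smallest.advance()) heap.add(smallest); else smallest.reader.close();
            }
        } finally {
            for (Run run : heap) run.reader.close(); // only non-empty after a failure
            for (Path path : runs) Files.deleteIfExists(path);
        }
    }

    /** Sorts the first size entries of buffer and writes them to a temporary file. */
    private static Path writeRun(long[] buffer, int size) throws IOException {
        Arrays.sort(buffer, 0, size);
        Path run = Files.createTempFile("run-", ".txt");
        try (BufferedWriter writer = Files.newBufferedWriter(run)) {
            for (int i = 0; i < size; i++) {
                writer.write(Long.toString(buffer[i]));
                writer.newLine();
            }
        }
        return run;
    }
}
```

A call such as `ExternalSortSketch.sort(Paths.get("sourceFile.txt"), Paths.get("destinationFile.txt"), 50_000_000)` would hold roughly 400MB of `long` values at a time. The temporary runs need about as much free disk space as the input itself, and with very many runs the merge would have to proceed in several passes, which this sketch does not do.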

Eric Duminil
  • I gave the same approach but still can't figure out where it fails. The upvotes on your answer tell me that my approach is right and that someone misunderstood it. :) – Sanket Makani Nov 05 '16 at 11:37
  • I upvoted your post. I didn't test your code, but I dare say it should work when the input is right. – Eric Duminil Nov 05 '16 at 11:51