
So I have been trying to work this out for a while but have been unable to come to a "rapid" solution. I do have a solution in place, but it takes literally 3 days to complete, and unfortunately that is far too long.

  • What I am trying to do:

I have a text file (call it 1.txt) that contains unique timestamps, and a second text file (call it 2.txt) that contains mixed data. The intention is to read the first timestamp from 1.txt, find its match in 2.txt, output it to a new file, and repeat for every timestamp. There are approximately 100,000 timestamps in 1.txt and over 11 million lines in 2.txt.

  • What I have achieved:

So far, the code gets the first timestamp and uses a nested loop to run through the 11 million lines looking for a match. Once a match is found, it stores it in a variable until it moves on to the next timestamp, at which point it writes that data out. Solution below:

import java.io.*;
import java.util.concurrent.TimeUnit;

public class fedOrganiser5 {
    private static String directory = "C:\\Users\\xxx\\Desktop\\Files\\";
    private static String file = "combined.txt";
    private static int fileNo = 1;

    public static void main(String[] args) throws IOException {
        String sCurrentLine;
        String mapperValue = "";
        String outputFirst = "";
        String outputSecond = "";
        String outputThird = "";
        int i = 1;
        int counter = 0;
        int test = 0;
        long timer = System.currentTimeMillis();
        try {
            BufferedReader reader = new BufferedReader(new FileReader(directory + "newfile" + fileNo + ".txt"));
            BufferedWriter writer = new BufferedWriter(new FileWriter(directory + "final_" + fileNo + ".txt"));
            BufferedReader mapper = new BufferedReader(new FileReader(directory + file));
            for (sCurrentLine = reader.readLine(); sCurrentLine != null; sCurrentLine = reader.readLine()) {
                if (sCurrentLine.trim().length() > 2) {
                    sCurrentLine = sCurrentLine.replace(" ", "").replace(",", "").replace("[", "");
                    try {
                        if (counter > 0) {
                            // Flush the buffered matches for the previous timestamp.
                            writer.write(outputFirst + outputSecond + outputThird);
                            outputFirst = "";
                            outputSecond = "";
                            outputThird = "";
                            counter = 0;
                            test = 0;
                            i++;
                            System.out.println("Writing out details for " + sCurrentLine);
                        }
                        // Reopen the mapper for EVERY timestamp, not only when the
                        // previous one matched; otherwise an unmatched timestamp
                        // leaves the mapper at end-of-file and nothing after it
                        // is ever searched.
                        mapper.close();
                        mapper = new BufferedReader(new FileReader(directory + file));
                        for (mapperValue = mapper.readLine(); mapperValue != null; mapperValue = mapper.readLine()) {
                            test++;
                            // Printing progress for every one of the 11 million lines
                            // is itself a major cost, so it is commented out:
                            // System.out.println("Find match " + i + " - " + test);
                            if (mapperValue.contains(sCurrentLine)) {
                                System.out.println("Match found - Mapping " + sCurrentLine + i);
                                if (mapperValue.contains("[EVENT=agentStateEvent]")) {
                                    outputFirst += mapperValue.trim() + "\r\n";
                                } else if (mapperValue.contains("[EVENT=TerminalConnectionCreated]")) {
                                    outputSecond += mapperValue.trim() + "\r\n";
                                } else {
                                    outputThird += mapperValue.trim() + "\r\n";
                                }
                                counter++;
                            }
                        }
                    } catch (Exception e) {
                        System.err.println("Error: " + sCurrentLine + " " + mapperValue);
                    }
                }
            }
            System.out.println("writing final record out");
            writer.write(outputFirst + outputSecond + outputThird);
            writer.close();
            mapper.close();
            reader.close();
            System.out.println("complete!");
            System.out.print("Time taken: "
                + TimeUnit.MILLISECONDS.toMinutes(System.currentTimeMillis() - timer)
                + " minutes");
        } catch (Exception e) {
            System.err.println("Error: Target File Cannot Be Read");
        }
    }
}
  • The problem?

I have looked through other solutions on Google and forums but have been unable to find a suitable or faster approach (or it is beyond my depth of knowledge). Looping through 11 million lines for every timestamp takes approximately 10 minutes, and with 100,000 timestamps you can imagine how long the whole process will take. Can someone give me some friendly advice on where to look, or any APIs that can speed this process up?

Raziel
    Have you considered using a DB for the entries rather than text file? – Roman Pustylnikov Oct 14 '15 at 08:24
  • Hi Roman, the only concern I have with using a DB like SQL Server is that each of those 11 million+ lines can run well above 20,000 characters. I wonder if this would be an issue for SQL? – Raziel Oct 14 '15 at 08:26
  • create `index` for searched file? – chengpohi Oct 14 '15 at 08:27
  • Hi @MubeenHussain, hope this link will help you: http://codereview.stackexchange.com/questions/44021/fast-way-of-searching-for-a-string-in-a-text-file – soorapadman Oct 14 '15 at 08:29
  • Did you check Apache Commons IO, https://commons.apache.org/proper/commons-io/description.html? (see LineIterator) However, I think your problem is not the reading but the handling of every single line. What do you want to do with a line? – Thomas Oct 14 '15 at 08:31
  • http://stackoverflow.com/questions/6219141/searching-for-a-string-in-a-large-text-file-profiling-various-methods-in-pytho – Ank Oct 14 '15 at 08:31
  • As far as I know there is no issue with that; the servers are built to deal with much more than that, so you should be fine. You can use BLOB or TEXT. Or just save a line index if you want. But reaching that line in a large file will still be time-consuming. You can even consider some BigData DB, but I'm not an expert on this matter. – Roman Pustylnikov Oct 14 '15 at 08:32
  • @Thomas, I only wish to output that line into a new .txt file. Of course that file would end up massive in the end, but that's not so much of a concern. – Raziel Oct 14 '15 at 08:34
  • 1. Search for a good grep implementation which gives you only the matching lines. 2. I would write multi-threaded workers that process the lines in parallel. – Thomas Oct 14 '15 at 08:36
  • Regarding grep - here some interesting data http://www.inmotionhosting.com/support/website/ssh/speed-up-grep-searches-with-lc-all – Roman Pustylnikov Oct 14 '15 at 08:39
  • The timestamps should be small enough to fit in memory, shouldn't they? 100'000 entries, even if they are pretty big, should fit into for example a hashmap and be small enough to keep in memory. That way you only need to iterate over the really big file once. – Stig Tore Oct 14 '15 at 08:40
  • @StigTore this is true, but in the future that timestamp count may exceed the 2 million+ mark, which I believe may potentially cause an issue? Could be wrong... – Raziel Oct 14 '15 at 08:41
  • What is the format of the timestamp? Is it a string? a long? – Stig Tore Oct 14 '15 at 08:43
  • @StigTore, in string format – Raziel Oct 14 '15 at 08:43
  • @mubeen-hussain Even at 32 characters per entry and 5 million entries, that's 152.6MB plus storage overhead. Unwieldy, certainly, but not unmanageable. – Stig Tore Oct 14 '15 at 08:51
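The hashmap idea suggested in the comments can be sketched roughly as below. This is only an illustration, not the poster's code: it assumes each line in 2.txt begins with its timestamp followed by a space (the real line format is not shown in the question, so `extractTimestamp` is a hypothetical stand-in). All timestamps from 1.txt go into a `HashSet`, and the big file is then scanned exactly once with an O(1) lookup per line, instead of rescanning 11 million lines for every timestamp.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class SinglePassMatcher {

    // Pull the timestamp out of a data line. The real format isn't shown in
    // the question; as a stand-in, assume the timestamp is the first
    // whitespace-delimited token (hypothetical).
    static String extractTimestamp(String line) {
        int sp = line.indexOf(' ');
        return sp < 0 ? line : line.substring(0, sp);
    }

    // One pass over the big file: each line costs one O(1) HashSet lookup,
    // so total work is O(|1.txt| + |2.txt|) instead of O(|1.txt| * |2.txt|).
    static List<String> matchLines(Set<String> timestamps, Iterable<String> bigFile) {
        List<String> matched = new ArrayList<>();
        for (String line : bigFile) {
            if (timestamps.contains(extractTimestamp(line))) {
                matched.add(line);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Tiny in-memory stand-ins for 1.txt and 2.txt.
        Set<String> timestamps = new HashSet<>(
                Arrays.asList("09:15:01.100", "09:15:02.200"));
        List<String> bigFile = Arrays.asList(
                "09:15:01.100 [EVENT=agentStateEvent] ...",
                "09:15:01.500 [EVENT=other] ...",
                "09:15:02.200 [EVENT=TerminalConnectionCreated] ...");
        for (String hit : matchLines(timestamps, bigFile)) {
            System.out.println(hit);
        }
    }
}
```

In a real run the set would be filled from 1.txt with a `BufferedReader` and the big file streamed line by line, so only the timestamps (plus matched output) need to fit in memory, as Stig Tore estimates above.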

1 Answer


Want to thank everyone for their suggestions. I will certainly try the database method proposed by Roman, as it may be the quickest for the type of work I am trying to do, but if that has no success I will try the other solutions proposed :)
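Before (or instead of) moving the data into a database, a single-pass variant of the original program may be worth trying. The sketch below is an illustration under stated assumptions, not the poster's code: it keeps the original grouping (agentStateEvent lines first, then TerminalConnectionCreated, then everything else, per timestamp), assumes the timestamp is the first whitespace-delimited token of each line (hypothetical, since the real format isn't shown), and buffers all matches in memory, which is only viable if the matched output fits in RAM.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GroupedSinglePass {

    // Per-timestamp buckets: [0] agentStateEvent, [1] TerminalConnectionCreated, [2] other.
    static Map<String, StringBuilder[]> group(List<String> timestamps, Iterable<String> bigFile) {
        // LinkedHashMap preserves the order of 1.txt for the final write-out.
        Map<String, StringBuilder[]> buckets = new LinkedHashMap<>();
        for (String ts : timestamps) {
            buckets.put(ts, new StringBuilder[] {
                    new StringBuilder(), new StringBuilder(), new StringBuilder() });
        }
        for (String line : bigFile) {
            // Hypothetical format: timestamp is the first whitespace-delimited token.
            int sp = line.indexOf(' ');
            String ts = sp < 0 ? line : line.substring(0, sp);
            StringBuilder[] b = buckets.get(ts);   // O(1) lookup per line
            if (b == null) continue;               // this line's timestamp isn't in 1.txt
            if (line.contains("[EVENT=agentStateEvent]")) {
                b[0].append(line.trim()).append("\r\n");
            } else if (line.contains("[EVENT=TerminalConnectionCreated]")) {
                b[1].append(line.trim()).append("\r\n");
            } else {
                b[2].append(line.trim()).append("\r\n");
            }
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> ts = Arrays.asList("09:15:01.100");
        List<String> big = Arrays.asList(
                "09:15:01.100 x [EVENT=TerminalConnectionCreated]",
                "09:15:01.100 y [EVENT=agentStateEvent]");
        StringBuilder[] b = group(ts, big).get("09:15:01.100");
        // agentStateEvent lines come out before TerminalConnectionCreated ones,
        // matching the ordering in the question's program.
        System.out.print(b[0].toString() + b[1].toString() + b[2].toString());
    }
}
```

Writing each timestamp's three buckets to the output file in 1.txt order then reproduces the original program's result in one scan of 2.txt rather than 100,000 scans.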

Raziel