
I have a couple of files, each containing approximately 30,000 data points on average. Each data point takes 2 lines: a metadata line and a line containing the information.

For example:

    0) >data_point_number data_point_name
    1) information
    2) >data_point_number data_point_name
    3) information

I am writing code to search the file to find specific entries using the data point number. The data point numbers are not ordered, so they need to be sorted first. I want to use Array.Sort() and Array.BinarySearch().

After collecting all of the data_point_number values into an array and sorting it so that I can perform a binary search on the data, how do I then link each entry back to its original location in the file so that I can access the information?

I'd like to append the meta data and the information into an output file.

I've tried simply searching the files as-is, using what is effectively a line-by-line search of the file, but it takes roughly 20 minutes to run through a single file.

        // collect the data point number from every metadata line (every 2nd line)
        for (int i = 0; i <= linecount; i = i + 2)
        {
            string currentline = System.IO.File.ReadLines(datafile).Skip(i).Take(1).First();
            string[] splitline = currentline.Split(' ');
            array1[i] = splitline[0].Trim(new char[] { '>' });
        }
  • If you're only performing one search per file read I wouldn't bother parsing it into an array and sorting it; just search it as you read it, as you will on average quit reading it after 50% of the data is read and disk IO is going to be more expensive than CPU time – Caius Jard Sep 04 '19 at 18:19
  • The list of what I'm searching for is generated and stored in a text file; it's the output of a different program. And I am searching for 100s of lines within 10,000s of lines. The only way I can figure to search the file quickly is to use a binary search. It takes 20 minutes to search for 1, so it'll take 30+ hours to search for 100... and I need to do multiple searches using different inputs and outputs, and it will fail to locate occasionally (i.e., search 100% of the file before coming back with an error). –  Sep 05 '19 at 00:28
  • Binary search isn't as fast as Dictionary - take a look at my post edit; the technique will find 100s of things in 10000s of things in less than 5 seconds. BUT Your code is *incredibly* slow because you endlessly re-read the same file over and over again, not because of search. Never use File.Read[All]Lines twice on the same file in succession unless you're sure the file has changed since last read; IO costs. You even read a whole file into an array, then read it all again as an array just to get the length (line count) then toss the data away.. Read[All]Lines hits the disk; *do it seldom* – Caius Jard Sep 05 '19 at 07:06

2 Answers


Instead of sorting just the string, make a class consisting of the string and its original location:

    class DataPoint
    {
        public string Data { get; set; }
        public int OriginalLine { get; set; }
    }

Use the Array.Sort overload that takes a Comparison<T> parameter, or have the class implement IComparable<T> so you can work with just Array.Sort().

Can you use LINQ's OrderBy? That is another route.
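
A minimal sketch of the idea (made-up values; assumes the class above plus using System and System.Collections.Generic):

    // Each entry remembers the line it came from, so after sorting and searching
    // you can jump straight back to that location in the file.
    var points = new[]
    {
        new DataPoint { Data = "1007", OriginalLine = 2 },
        new DataPoint { Data = "0042", OriginalLine = 0 },
    };

    // Sort by the data point number; OriginalLine travels with it.
    Array.Sort(points, (a, b) => string.CompareOrdinal(a.Data, b.Data));

    // BinarySearch must use the same ordering, supplied here as an IComparer<T>.
    int hit = Array.BinarySearch(points,
        new DataPoint { Data = "0042" },
        Comparer<DataPoint>.Create((a, b) => string.CompareOrdinal(a.Data, b.Data)));

    if (hit >= 0)
        Console.WriteLine($"Found at original file line {points[hit].OriginalLine}");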

terrencep

Take a look at this:

private static void Search(string[] input)
{
    string datafile = @"C:\Users\User\Documents\text.txt";
    string inputfile = @"C:\Users\User\Documents\input.txt";
    string outputfile = @"C: \Users\User\Documents\output.txt";

    string[] parameters = System.IO.File.ReadAllLines(inputfile);

    string[] data = System.IO.File.ReadAllLines(datafile);

    var index = new Dictionary<string, int>();

    for (int i = 0; i < data.Length - 1; i += 2)
    {
        string currentline = data[i];
        string[] splitline = currentline.Split(' ');
        index[splitline[0].Trim('>')] = i;
    }

    foreach(var p in parameters)
    {
        if (index.ContainsKey(p))
          Console.WriteLine($"Found {p} at line {index[p]}");
        else 
          Console.WriteLine($"File doesn't contain {p}");
    }

}

You're looking for exact text matches, so this is ideally suited to loading the data into a Dictionary; the keys will be hashed. The dictionary provides fast lookups, and you can store the line number where each item was found.

You made far too many calls that read files, including inside loops; don't. You had a call that read the parameters file into an array, then you read the same file again just to get the count of lines (hint: use the array's length).

I stripped it down to a single read of each file, loading the dictionary with all the search data, then looking through it for each parameter in the parameters file and outputting whether it is found or not. I don't know what your intention is with the output file.
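
Since the question mentions appending the metadata and information lines to an output file, one possible sketch (dropped into the Search method above, reusing its data, index, parameters and outputfile variables) is:

    // Hypothetical addition to Search(): for each parameter that was found,
    // append its metadata line and the following information line to the output file.
    using (var writer = new System.IO.StreamWriter(outputfile, append: true))
    {
        foreach (var p in parameters)
        {
            if (index.TryGetValue(p, out int line))
            {
                writer.WriteLine(data[line]);      // metadata line
                writer.WriteLine(data[line + 1]);  // information line
            }
        }
    }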

Other hints: you could gain a bit by not doing a Split; it's more expensive to break a string into an array of strings when all you really want is the data between index 1 and the first occurrence of the ' ' space. As such, the first loop could be reduced to:

for (int i = 0; i < data.Length - 1; i += 2)
  index[data[i].Substring(1, data[i].IndexOf(' ') - 1)] = i;

You don't have to store just an int in the dictionary. You could upgrade it to store a class, a tuple or an anonymous type (A dictionary where value is an anonymous type in C#), and then you can track more than just the line number: you could track the whole data line, its number and the information line related to it, for example. Let's upgrade it. I've added a routine at the top to generate some fake data and some other data that is and is not in the file:

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Text;

    static void Main()
    {

        Console.WriteLine(DateTime.Now + " generate some fake data");
        StringBuilder datasb = new StringBuilder(100 * 1024 * 1024);//initialize for 100 megabytes
        var para = new List<Guid>();
        for (int i = 0; i < 500000; i++) {
            var g = Guid.NewGuid();
            datasb.AppendFormat(">{0} datapointname{1}\r\nInformation; generated at {2}\r\n", g, i, DateTime.Now);
            if (i % 20000 == 0) //25 items in 500,000
                para.Add(g);
            if (i % 40000 == 0) //~12 items not findable in 500,000
                para.Add(Guid.NewGuid());
        }
        var pfile = string.Join("\r\n", para.OrderBy(g => g.ToString()));


        string datafile = @"C:\temp\text.txt";
        string inputfile = @"C:\temp\input.txt";
        string outputfile = @"C:\temp\output.txt";

        //write fake files
        File.WriteAllText(datafile, datasb.ToString());
        File.WriteAllText(inputfile, pfile);


        var start = DateTime.Now;
        Console.WriteLine(DateTime.Now + " begin loading dictionary");

        //BEGIN USEFUL PART

        string[] parameters = System.IO.File.ReadAllLines(inputfile);

        string[] data = System.IO.File.ReadAllLines(datafile);

        var index = new Dictionary<string, Thing>();

        for (int i = 0; i < data.Length - 1; i += 2)
        {
            string currentline = data[i];
            string[] splitline = currentline.Split(' ');
            Thing t = new Thing()
            {
                DataPointNumber = splitline[0].Trim('>'),
                DataPointName = splitline[1],
                Information = data[i + 1],
                LineNumber = i
            };
            index[t.DataPointNumber] = t;
        }

        Console.WriteLine(DateTime.Now + " begin searching dictionary");

        int found = 0, notFound = 0;
        foreach (var p in parameters)
        {
            if (index.ContainsKey(p))
            {
                Console.WriteLine($" Found {p}: {index[p]}"); //ToString will be called
                found++;
            }
            else
            {
                Console.WriteLine($" File doesn't contain {p}");
                notFound++;
            }
        }


        Console.WriteLine($"{DateTime.Now } search complete, searched {index.Count} items looking for {parameters.Length} items, found {found}, didnt find {notFound}, took {(DateTime.Now-start).TotalSeconds} seconds");

    }
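
The Thing class isn't defined above; a minimal version consistent with how it's loaded and with the output shown below might look like this:

    // Not shown in the original listing: a minimal Thing class matching the loading
    // loop above and the "Line:<n>-<number> with info <information>" output below.
    class Thing
    {
        public string DataPointNumber { get; set; }
        public string DataPointName { get; set; }
        public string Information { get; set; }
        public int LineNumber { get; set; }

        public override string ToString() =>
            $"Line:{LineNumber}-{DataPointNumber} with info {Information}";
    }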

The bit you'll want for your program starts at //BEGIN USEFUL PART. Take a look at the timings when loading a file into a dictionary and searching it: on my machine it takes 1.5 seconds to find 38 items in half a million (~50 MB text file), and this includes the time taken to load the stuff into the dictionary in the first place:

2019-09-05 07:54:17 generate some fake data
2019-09-05 07:54:19 begin loading dictionary
2019-09-05 07:54:21 begin searching dictionary
 Found 0ae4b83a-95f0-46e1-acc2-fe802f51441b: Line:240000-0ae4b83a-95f0-46e1-acc2-fe802f51441b with info Information; generated at 2019-09-05 07:54:17
 Found 0d007ca2-f21c-4d3c-b52d-fcd3833d31a7: Line:480000-0d007ca2-f21c-4d3c-b52d-fcd3833d31a7 with info Information; generated at 2019-09-05 07:54:18
 Found 16849c07-c7a4-4b8b-b0fa-9ed8fd8dedde: Line:200000-16849c07-c7a4-4b8b-b0fa-9ed8fd8dedde with info Information; generated at 2019-09-05 07:54:17
 Found 1afdc959-297d-43fe-8106-58c648c25d76: Line:400000-1afdc959-297d-43fe-8106-58c648c25d76 with info Information; generated at 2019-09-05 07:54:18
 Found 21dcb6fd-1bd5-4920-b3fa-fd1a908f153d: Line:560000-21dcb6fd-1bd5-4920-b3fa-fd1a908f153d with info Information; generated at 2019-09-05 07:54:18
 File doesn't contain 2944f7f7-2fa8-425a-bbf9-f833cfdb1fd2
 Found 3b1c0712-2211-4a36-b6dd-739619142fa5: Line:80000-3b1c0712-2211-4a36-b6dd-739619142fa5 with info Information; generated at 2019-09-05 07:54:17
 Found 3b2fb141-61e9-4b2d-8ad5-44171648ac03: Line:840000-3b2fb141-61e9-4b2d-8ad5-44171648ac03 with info Information; generated at 2019-09-05 07:54:19
 File doesn't contain 487bc8d3-708d-40bc-9278-79ae34fb9732
 File doesn't contain 4a9b40b4-fe53-4ba8-9842-6f8d99dd405a
 Found 528943c2-d243-4963-b98d-6c60f9c5e118: Line:600000-528943c2-d243-4963-b98d-6c60f9c5e118 with info Information; generated at 2019-09-05 07:54:18
 Found 53f60bb6-cf12-4ac7-a0c0-c0d0daf9571f: Line:760000-53f60bb6-cf12-4ac7-a0c0-c0d0daf9571f with info Information; generated at 2019-09-05 07:54:18
 File doesn't contain 574d8611-eeec-4ea4-882e-9ea6f8c3a553
 Found 591c5ce9-c32f-4f88-a620-6f2b9f90de35: Line:120000-591c5ce9-c32f-4f88-a620-6f2b9f90de35 with info Information; generated at 2019-09-05 07:54:17
 Found 60ecbf50-c362-42d2-80e2-666c339b87cc: Line:0-60ecbf50-c362-42d2-80e2-666c339b87cc with info Information; generated at 2019-09-05 07:54:17
 Found 63a07cb7-e416-4da6-8a2f-a33fafc6c5d7: Line:720000-63a07cb7-e416-4da6-8a2f-a33fafc6c5d7 with info Information; generated at 2019-09-05 07:54:18
 File doesn't contain 69083432-4cb9-484c-8cd2-6b5412b1fccf
 Found 705dae03-54d8-48b0-a2ab-7d82d8afc59c: Line:40000-705dae03-54d8-48b0-a2ab-7d82d8afc59c with info Information; generated at 2019-09-05 07:54:17
 Found 7182cb17-5070-4801-92d4-bc01bc05e851: Line:960000-7182cb17-5070-4801-92d4-bc01bc05e851 with info Information; generated at 2019-09-05 07:54:19
 Found 71dbc2a3-4a40-4ce3-b3c2-1039aa866bf8: Line:360000-71dbc2a3-4a40-4ce3-b3c2-1039aa866bf8 with info Information; generated at 2019-09-05 07:54:18
 File doesn't contain 7cc9f35b-9524-4f95-b580-fbeef80c0557
 Found 8f8e89ae-3dcf-4a8a-bf34-36a1078c88c6: Line:800000-8f8e89ae-3dcf-4a8a-bf34-36a1078c88c6 with info Information; generated at 2019-09-05 07:54:18
 File doesn't contain 9807a242-48dc-47f2-8963-af323bf61b5c
 Found 9c8ccbfd-ff70-4fc5-b3a7-02872a9c731c: Line:680000-9c8ccbfd-ff70-4fc5-b3a7-02872a9c731c with info Information; generated at 2019-09-05 07:54:18
 File doesn't contain a3f6d083-588e-4337-b800-56af12bde5a9
 Found abe63355-6df4-452c-9b56-9879961cba38: Line:440000-abe63355-6df4-452c-9b56-9879961cba38 with info Information; generated at 2019-09-05 07:54:18
 File doesn't contain b709726c-e5f6-432e-8e22-4cea924ae29b
 Found b7c040c5-b5f9-4744-a0ec-61744c2f65d6: Line:640000-b7c040c5-b5f9-4744-a0ec-61744c2f65d6 with info Information; generated at 2019-09-05 07:54:18
 File doesn't contain bbdc590a-bbb6-42b0-8ba3-00c4bffa0a2a
 File doesn't contain bd0c1164-d754-41f4-afd3-bed92d4063f4
 Found c351a7ee-f4b8-449b-86d4-6d663942939f: Line:880000-c351a7ee-f4b8-449b-86d4-6d663942939f with info Information; generated at 2019-09-05 07:54:19
 Found c9296a3c-2167-4c40-b4dc-3d25b9fa285a: Line:320000-c9296a3c-2167-4c40-b4dc-3d25b9fa285a with info Information; generated at 2019-09-05 07:54:18
 Found cdfbce4a-cb9a-4617-a6c5-366bbbd6872f: Line:160000-cdfbce4a-cb9a-4617-a6c5-366bbbd6872f with info Information; generated at 2019-09-05 07:54:17
 File doesn't contain d057453e-c91b-4f11-8770-600780200835
 Found e0361b8a-25e1-4d6c-ae17-0f3ccb6f85fa: Line:280000-e0361b8a-25e1-4d6c-ae17-0f3ccb6f85fa with info Information; generated at 2019-09-05 07:54:18
 Found e38f4fb8-fc51-40af-bc06-60188139a0ba: Line:920000-e38f4fb8-fc51-40af-bc06-60188139a0ba with info Information; generated at 2019-09-05 07:54:19
 Found f4794410-7873-4fc5-adc2-7750667f88a7: Line:520000-f4794410-7873-4fc5-adc2-7750667f88a7 with info Information; generated at 2019-09-05 07:54:18
 File doesn't contain f736b5e2-acea-44f2-89eb-090fbe6cc50c
2019-09-05 07:54:21 search complete, searched 500000 items looking for 38 items, found 25, didnt find 13, took 1.5084395 seconds
Caius Jard
  • Only issue with this is that I'm trying to avoid storing the whole file in memory and want to access it on disk. Does this --> string[] data = System.IO.File.ReadAllLines(datafile); bring the whole file into memory? (Otherwise, it is a great solution and is a million times faster than my original idea) –  Sep 05 '19 at 08:37
  • Well, you've seen what trying to leave it on the disk does. ReadLines returns an enumerable that reads through the file as you enumerate it, and I think it's quite likely that doing your Skip(i).Take(1).First() will (towards the end) be reading nearly all of it anyway, potentially thousands of times, so it's something of a moot point. Really, if you want to leave it on disk you would read all the params into a dictionary, then gradually read the data file line by line (use a StreamReader and read it line by line) and for each line pull the datapointnumber from the file and... – Caius Jard Sep 05 '19 at 08:47
  • ..ask if the datapointnumber you just read is in the params rather than the other way round (don't ask if the params are in the data point file). But honestly, my file of half a million GUIDs (quite a long thing in itself) ended up at 50 megabytes. Even if it were 500 megabytes, it's still going to be something you can load into a dictionary with ease. It's one of those things where you shouldn't just assume that "reading the whole file into memory is going to be a problem" - do the read and test whether it's an actual problem or just an imagined one – Caius Jard Sep 05 '19 at 08:50
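
For completeness, a rough sketch of that streaming idea, assuming the same two-lines-per-point file layout, made-up paths, and using System.Collections.Generic and System.IO:

    // Sketch only: keep just the search parameters in memory and stream the data
    // file line by line, writing matches (meta line + information line) as we go.
    var wanted = new HashSet<string>(File.ReadAllLines(@"C:\temp\input.txt"));

    using (var reader = new StreamReader(@"C:\temp\text.txt"))
    using (var writer = new StreamWriter(@"C:\temp\output.txt"))
    {
        string meta;
        while ((meta = reader.ReadLine()) != null)
        {
            string info = reader.ReadLine();                       // paired information line
            string number = meta.Substring(1, meta.IndexOf(' ') - 1);

            // Ask "is this data point one I'm looking for?" rather than
            // re-searching the data file once per parameter.
            if (wanted.Contains(number))
            {
                writer.WriteLine(meta);
                writer.WriteLine(info);
            }
        }
    }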