Introduction

I have a C++ program called MyProcess that I call nbLines times, where nbLines is the number of lines in a big file called InputDataFile.txt that holds the input data. For example, the call

./MyProcess InputDataFile.txt 142

tells MyProcess that its input data is on line 142 of InputDataFile.txt.

Issue

The issue is that InputDataFile.txt is so big (~ 150 GB) that the time spent searching for the correct line is not negligible. Inspired by this post, here is my (possibly not optimal) code

int line = 142;
int N = line - 1;
std::ifstream inputDataFile(filename.c_str());
std::string inputData;
for(int i = 0; i < N; ++i)
    std::getline(inputDataFile, inputData);

std::getline(inputDataFile,inputData);

Goal

My goal is to make MyProcess find inputData faster.

Possible solution

It would be handy to build, once and in bash, a table that maps every line number to the index of the first character of that line. This way, instead of giving 142 to MyProcess, I could directly give it the index of the first character of interest. MyProcess could then jump straight to this position without having to search for and count the '\n' characters. It would then read the data until a '\n' character is encountered. Is something like this feasible? How could it be implemented?
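
A minimal sketch of the reading side of this idea, assuming the byte offset of the line is already known and handed to the program. The helper name readLineAt is hypothetical; seekg and std::getline are standard library calls.

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: read one line that starts at a known byte offset
// (counted from 0) instead of counting newlines from the top of the file.
std::string readLineAt(const std::string& path, std::uint64_t offset) {
    // Binary mode so that byte offsets are not altered by newline translation.
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);

    // Jump straight to the first character of the line.
    in.seekg(static_cast<std::streamoff>(offset));

    // Read up to (and discard) the next '\n'.
    std::string line;
    std::getline(in, line);
    return line;
}
```

MyProcess could then be called with an offset rather than a line number and pass it to readLineAt, for example after converting the argument with std::stoull.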

Of course, I welcome any other solution that would reduce the overall computational time for importing those input data.

Remi.b
  • Possible duplicate: [In C++ is there a way to go to a specific line in a text file?](http://stackoverflow.com/questions/5207550/in-c-is-there-a-way-to-go-to-a-specific-line-in-a-text-file) – Thomas Matthews Mar 24 '17 at 18:07
  • Possible duplicate: [Getting the nth line of a text file in C++](http://stackoverflow.com/questions/7273326/getting-the-nth-line-of-a-text-file-in-c) – Thomas Matthews Mar 24 '17 at 18:08
  • Is there a reason that the input data must be stored as plain text? Why not use a more searchable storage method? Is this file changing constantly or is it always the same? – Alex Zywicki Mar 24 '17 at 18:13
  • @AlexZywicki The file does not change in size. There is no specific reason to store the input data as plain text I guess. I am just unaware of alternative solutions. – Remi.b Mar 24 '17 at 18:15
  • @anubhava Because the possible solution I suggested implies using `bash`, but I can remove the tag if you think that does not justify its usage. – Remi.b Mar 24 '17 at 18:16

3 Answers

As suggested in other answers, it could be a good idea to build a map of the file. The way I would do this (in pseudocode) is:

let offset be an unsigned 64 bit int = 0;

for each line in the file 
    read the line
    write offset to a binary file (as 8 raw bytes rather than as chars)
    offset += length of line in bytes (including its newline)

Now you have a "Map" file that is a list of 64 bit ints (one for each line in the file). To read the map you just compute where in the map the entry for the line you desire is located:

offset = desired_line_number * 8 // where line number starts at 0
offset2 = (desired_line_number+1) * 8

data_position1 = load bytes [offset through offset + 7] as a 64bit int from map
data_position2 = load bytes [offset2 through offset2 + 7] as a 64bit int from map

data = load bytes[data_position1 through data_position2-1] as a string from data.

The idea is that you read through the data file once, record the byte offset at which each line starts, and store the offsets sequentially in a binary file using a fixed-size integer type. The map file then has a size of number_of_lines * sizeof(integer_type_used). To fetch a line, seek into the map file to the position where that line's offset is stored, then read that offset and the next line's offset. Together they give the byte range in the data file that holds your line. For the last line there is no next entry in the map, so read to the end of the data file instead.
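
The lookup described above could be sketched in C++ roughly as follows. This is an illustration under stated assumptions: readLineViaMap is a hypothetical name, line numbers start at 0 as in the pseudocode, and the map is assumed to hold raw 8-byte offsets written on the same machine (so the byte order matches).

```cpp
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>

// Hypothetical helper: fetch line `lineNo` (0-based) from the data file
// using a map file that stores one raw 8-byte start offset per line.
std::string readLineViaMap(const std::string& dataPath,
                           const std::string& mapPath,
                           std::uint64_t lineNo) {
    std::ifstream map(mapPath, std::ios::binary);
    std::ifstream data(dataPath, std::ios::binary);
    if (!map || !data) throw std::runtime_error("cannot open data or map file");

    // Entry i of the map lives at byte i * 8.
    std::uint64_t begin = 0, end = 0;
    map.seekg(static_cast<std::streamoff>(lineNo * sizeof(begin)));
    if (!map.read(reinterpret_cast<char*>(&begin), sizeof(begin)))
        throw std::out_of_range("line number not in map");

    data.seekg(static_cast<std::streamoff>(begin));
    std::string line;
    if (map.read(reinterpret_cast<char*>(&end), sizeof(end))) {
        // A next entry exists: read exactly the bytes in [begin, end).
        line.resize(static_cast<std::size_t>(end - begin));
        data.read(&line[0], static_cast<std::streamsize>(line.size()));
        line.resize(static_cast<std::size_t>(data.gcount()));
        // Drop the trailing newline so the result matches std::getline.
        if (!line.empty() && line.back() == '\n') line.pop_back();
    } else {
        // Last entry: no next offset, so read up to the newline or end of file.
        std::getline(data, line);
    }
    return line;
}
```

Reading the exact byte range avoids scanning for the newline, at the cost of one extra 8-byte read from the map.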

Example:

Data:

hello\n 
world\n
(\n newline at end of file)

Create map.

Map: each grouping [number] will represent an 8 byte length in the file

[0][6][12]
//or in binary (shown most significant byte first for readability)
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000110
00000000 00000000 00000000 00000000 00000000 00000000 00000000 00001100

Now say I want line 2:

line offset = (2-1) * 8 // offset is 8 

Since offsets are counted from 0, that is the 9th byte in the map file. So our number is made up of bytes 8 through 15, which are:

00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000110
//or as decimal
6

So now we know that our line starts at offset 6 in the data file (counting from 0, which is what seekg expects; hello\n occupies bytes 0 through 5).

We then do the same to get the start offset of the next line, which is 12.

Finally we read the byte range 6 through 11, store it as a string, and get world\n.

C++ implementation:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <stdexcept>
#include <string>

int main(int argc, const char * argv[]) {
    std::string filename = "path/to/input.txt";
    std::string mapFilename = "path/to/map/file.bin";

    std::ifstream inputFile(filename.c_str(),std::ios::binary);
    std::ofstream outfile(mapFilename.c_str(),std::ios::binary);

    if (!inputFile.is_open() || !outfile.is_open()) {
        //use better error handling than this
        throw std::runtime_error("Error opening files");
    }


    std::string inputData;
    std::uint64_t offset = 0; //64 bits wide on every platform
    while(std::getline(inputFile, inputData)){
        //write the offset as binary
        outfile.write((const char*)&offset, sizeof(offset));
        //advance to the start of the next line:
        //add one because getline strips the \n
        offset+=inputData.length()+1;
    }
    outfile.close();

    offset=0;

    //from here on we are reading the map (the same file we just wrote)
    std::ifstream inmap(mapFilename.c_str(),std::ios::binary);
    std::size_t line = 2;//your chosen line number (starting at 1)
    std::size_t idx = (line-1) * sizeof(offset); //the calculated offset
    //seek into the map
    inmap.seekg(idx);
    //read the binary at that location
    inmap.read((char*)&offset, sizeof(offset));
    std::cout<<offset<<std::endl;

    //from here you just need to look up the line in the data file in the same manner


    return 0;
}
Alex Zywicki
  • I will note that it is crucial that you use a 64 bit integer. Any smaller integer type won't be able to represent the indices for a data file as large as the one you describe. – Alex Zywicki Mar 24 '17 at 19:16
  • I will also note that I am assuming a \n newline rather than a \r\n newline, which would change the computations slightly. – Alex Zywicki Mar 24 '17 at 19:18
  • That's awesome! I did not know the `seekg` function that makes everything much easier. I also appreciate the code to create the map. + 1 – Remi.b Mar 24 '17 at 21:16

There is no "fast" method to read the Nth text line of a file.

Text files contain variable length records. Each record is terminated by a newline. The text must be read, character by character, until the newline is found. This could be 1 character or could be 245 characters. There is no standard size.

The common practice is to read each line and ignore the line until you get to the line you need.

If you frequently need to go to a specific line in a file, you can maintain a map of line numbers and their file positions.

Otherwise, you can try reading chunks or blocks into a buffer and scan the buffer. This will speed up your program, but you have to account for the text line possibly crossing a buffer boundary. Remember, input is most efficient when it is kept streaming (think of a river of data).
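
One way to sketch the chunked scan, as an illustration only: the helper name findLineOffset is hypothetical, line numbers start at 1 as in the question, and the 1 MiB default chunk size is an arbitrary choice. Because it only counts '\n' bytes and tracks the file position, a line that crosses a chunk boundary needs no special handling here; that would matter only when extracting the line's text from the buffer itself.

```cpp
#include <cstdint>
#include <cstring>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: return the byte offset at which line `lineNo`
// (1-based) starts, scanning the file in fixed-size chunks.
std::uint64_t findLineOffset(const std::string& path, std::uint64_t lineNo,
                             std::size_t chunkSize = 1 << 20) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    if (lineNo == 0) throw std::out_of_range("line numbers start at 1");

    std::vector<char> buf(chunkSize);
    std::uint64_t newlinesToSkip = lineNo - 1;  // line N starts after N-1 newlines
    std::uint64_t chunkStart = 0;               // file offset of buf[0]

    while (newlinesToSkip > 0) {
        in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
        const std::size_t got = static_cast<std::size_t>(in.gcount());
        if (got == 0) throw std::out_of_range("file has fewer lines than requested");

        const char* p = buf.data();
        const char* const last = p + got;
        while (p < last) {
            // memchr scans the remaining bytes of this chunk for the next newline.
            const void* hit = std::memchr(p, '\n', static_cast<std::size_t>(last - p));
            if (!hit) break;
            p = static_cast<const char*>(hit) + 1;
            if (--newlinesToSkip == 0)
                return chunkStart + static_cast<std::uint64_t>(p - buf.data());
        }
        chunkStart += got;
    }
    return 0;  // lineNo == 1: the first line starts at offset 0
}
```

The returned offset can then be passed to seekg, followed by a single std::getline to read the line.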

Thomas Matthews
  • Thanks for your answer. I am not sure I correctly understand your sentence `map of line numbers and their file positions`, but it feels very much like what I suggest as a **possible solution** in my post. I could create the map once, for every line, in bash and then give this map to `./MyProcess`. I would not know how to implement that though. – Remi.b Mar 24 '17 at 18:25
  • Maintain a database of files, their modification dates, and an array of offsets for each newline. When parsing a file, if you have a record for the file and the date matches the file's modification date, jump to the correct line. If not, read the file line by line and record the file offset of each newline. Add/update your record for that file, and the file's latest modification date, in your database. The database would likely have a hash lookup for filenames for performance. – blackghost Mar 24 '17 at 19:04
  • Since you have your post tagged as C++, you can have a `std::map` contain line numbers and file positions (respectively). Before you read in a line, add the line number and file position pair into the `map`. – Thomas Matthews Mar 24 '17 at 19:49

Since this is tagged with bash, here is a simple function using sed.

define

getline() { sed "${2}q;d" "$1"; }

usage

getline InputData.txt 142
karakfa
  • Thank you. I know how to get a line in Bash. My issue is that each time I call `MyProcess`, `MyProcess` needs to find the correct line and I therefore thought that in Bash we could create a map of `InputData.txt` to give to `MyProcess` to make the search of the data at the specific line faster. Please let me know if the question is unclear to you. – Remi.b Mar 24 '17 at 18:42
  • The best solution depends on the context. Do you know the usage pattern? How many of these lines will be accessed and in what order? You can reduce the linear scan time by splitting the file into N segments and implementing a two-tiered access. – karakfa Mar 24 '17 at 18:48
  • All lines will be accessed but at very different times (like weeks apart). I will eventually make every `MyProcess ${dataFile} ${LineNumber}` call. As `MyProcess` is too slow (each `MyProcess` call currently searches through the file for the correct line independently), I was thinking of computing a map once (which would require scanning through the whole file only once), storing the map on the hard disk, and feeding the map to `MyProcess` when calling it (`MyProcess ${dataFile} ${LineNumber} ${map}`). Does the question make sense to you? – Remi.b Mar 24 '17 at 19:04