I am trying to write a parser that reads a large text file in C++. Similar Python code using the readtable method is approximately 7 to 8 times faster.

I wonder why it runs so slowly in C++. Most of the time is spent in istringstream, splitting each line into the table's separate fields. It would be great if someone could point out the issue with the code or suggest an alternative to istringstream. The code is below:

'''

   #include <fstream>
   #include <iostream>
   #include <string>
   #include <sstream>
   #include <vector>
   #include <algorithm>
   #include <chrono>
   
   using namespace std::chrono;
   int main()
   {
       auto start = high_resolution_clock::now();
       std::ifstream inf{ "/Users/***/some.bed" };
       std::istringstream iss;
       int aprox_nlines = 7000000;
    
    
       std::vector<int>* ptr_st = new std::vector<int>();
       std::vector<int>& start_v = *ptr_st;
       start_v.reserve(aprox_nlines);
    
       std::vector<int>* ptr_en = new std::vector<int>();
       std::vector<int>& end_v = *ptr_en;
       end_v.reserve(aprox_nlines);
    
       // If we couldn't open the input file stream for reading
       if (!inf)
       {
           // Print an error and exit
           std::cerr << "Uh oh, File could not be opened for reading!" << std::endl;
           return 1;
       }
    
       int count=0;
       std::string line;
       int sstart;
       int end_val;
       std::string val;
 
       if (inf.is_open())
       {
          while (getline(inf, line))
          {
            count += 1;
            
             iss.clear();    // str() does not reset eofbit/failbit left over from the previous line
             iss.str(line);
            iss >> val;
            iss >> sstart;
            start_v.push_back(sstart);
            iss >> end_val;
            end_v.push_back(end_val);

          }
          std::cout << count<<"\n";
        
          inf.close();
      }
      auto stop = high_resolution_clock::now();
      auto duration = duration_cast<microseconds>(stop - start);
    
      std::cout << "Time taken by function: " << duration.count() << " microseconds" <<"\n";
    
    
    
      return 0;
    
   }

'''

  • Why are you using `getline` to read a `string`, and then immediately constructing another stream from that string, only to read from it immediately? You appear to be doubling the work you need to do. – cigien Jun 23 '20 at 03:28
  • Also, why are you allocating a `vector` with `new`? – cigien Jun 23 '20 at 03:29
  • I am new to C++, so forgive my ignorance. My assumption about allocating the vector with new was that it would need a lot of memory, since I am expecting vectors of 10 to 20 million elements, so I tried to allocate the memory through a pointer. – Nachiket Patil Jun 23 '20 at 03:32
  • `reserve` will do that. Get rid of the `new` for the vector. – cigien Jun 23 '20 at 03:33
  • The `getline` and `stringstream` trick is used because it is extremely simple and nearly foolproof, not because it's fast. A lot of the time it's fast enough. Maybe it isn't in your case, but I recommend running your code through a profiler to make sure that anything you replace it with really is worth the effort. – user4581301 Jun 23 '20 at 03:40
  • I managed to get it as fast as Python by using FILE * with fopen(), getline and sscanf to parse the table. It runs at almost the same speed as the Python built-in function. Somehow it runs at twice the speed when recompiled, though; I am not sure why. – Nachiket Patil Jun 23 '20 at 10:01
  • Have you compiled the program with the optimizer turned on? In an IDE this is often called a Release build. On the command line you probably need to add a -O2 or -O3 to the build command. – user4581301 Jun 23 '20 at 18:33
  • I will try today and get back. – Nachiket Patil Jun 23 '20 at 18:45
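
The suggestions in the comments (read the fields straight from the file stream instead of going through a second stream, and let `reserve` handle the capacity instead of `new`) can be sketched roughly as follows. This is a minimal sketch, not a measured fix: `Intervals`, `read_intervals` and `read_intervals_from_text` are hypothetical names that do not appear in the question, and it assumes every row has at least three whitespace-separated columns (name, start, end). Parsing stops at the first row that does not match.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <limits>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical container for the two columns the question collects.
struct Intervals {
    std::vector<int> starts;
    std::vector<int> ends;
};

// Reads "name start end ..." rows directly from any input stream,
// without building an istringstream for every line.
Intervals read_intervals(std::istream& in, std::size_t expected_rows) {
    Intervals out;
    // Plain vectors already keep their elements on the heap;
    // reserve() sizes that storage up front, so no `new` is needed.
    out.starts.reserve(expected_rows);
    out.ends.reserve(expected_rows);

    std::string name;
    int start = 0;
    int end = 0;
    while (in >> name >> start >> end) {
        out.starts.push_back(start);
        out.ends.push_back(end);
        // Discard any remaining columns on the current line.
        in.ignore(std::numeric_limits<std::streamsize>::max(), '\n');
    }
    assert(out.starts.size() == out.ends.size());
    return out;
}

// Convenience wrapper for in-memory text, handy for quick checks.
Intervals read_intervals_from_text(const std::string& text) {
    std::istringstream in(text);
    return read_intervals(in, 0);
}
```

With the question's code this would be called as `read_intervals(inf, aprox_nlines)` after the `if (!inf)` check. Whether it closes the gap to Python has to be measured, ideally on an optimized build as suggested above.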

1 Answer

It seems that reading through a FILE * obtained from fopen() runs much better. It is around 10 times faster than istringstream, and 33% faster than Python's built-in (readtable) function.

'''

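    // Needs <cstdio> (fopen, getline, sscanf, fclose) and <cstdlib> (free).
    FILE * ifile = fopen("*/N.bed", "r");
    // POSIX getline() may realloc() the buffer, so it must come from
    // malloc() or start out as a null pointer; new[] is not allowed here.
    char * nline = nullptr;
    size_t linesz = 0;
    char T[50];
    int sn, en;
    unsigned int i = 0;
    while (ifile && getline(&nline, &linesz, ifile) > 0) {
        i++;
        // %49s keeps the name from overflowing T; skip rows that do not
        // yield all three fields instead of pushing stale values.
        if (sscanf(nline, "%49s %d %d", T, &sn, &en) != 3) continue;
        start_v.push_back(sn);
        end_v.push_back(en);
    }
    free(nline);
    if (ifile) fclose(ifile);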

'''
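
Since the question also asks for an alternative to istringstream, here is a minimal sketch of parsing one line with `std::from_chars` from `<charconv>`, assuming a C++17 compiler. `parse_bed_line` and `next_int` are hypothetical names, not part of the code above, and the layout (a name column followed by two integers, separated by spaces or tabs) is assumed from the question's code. It has not been benchmarked against the sscanf version.

```cpp
#include <cassert>
#include <charconv>
#include <cstddef>
#include <string_view>
#include <system_error>

// Column separators assumed for this sketch.
constexpr std::string_view kBlanks = " \t";

// Parses one integer field at or after `pos` and advances `pos` past it.
// Returns false if no integer is found there.
inline bool next_int(std::string_view line, std::size_t& pos, int& value) {
    pos = line.find_first_not_of(kBlanks, pos);
    if (pos == std::string_view::npos) return false;

    const char* first = line.data() + pos;
    const char* last = line.data() + line.size();
    auto [ptr, ec] = std::from_chars(first, last, value);
    if (ec != std::errc{}) return false;

    assert(ptr > first);  // a successful parse consumes at least one character
    pos = static_cast<std::size_t>(ptr - line.data());
    return true;
}

// Extracts the second and third columns of "name start end ...".
// Returns false for lines that lack two integers after the name.
inline bool parse_bed_line(std::string_view line, int& start, int& end) {
    // Skip leading blanks, then skip over the name column.
    std::size_t pos = line.find_first_not_of(kBlanks);
    if (pos == std::string_view::npos) return false;
    pos = line.find_first_of(kBlanks, pos);
    if (pos == std::string_view::npos) return false;

    return next_int(line, pos, start) && next_int(line, pos, end);
}
```

Each line read with `std::getline` (or the POSIX `getline` above) can be passed to `parse_bed_line` directly, since both `std::string` and `const char*` convert to `std::string_view` without copying.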