
I have a custom type. Let's say it's a 3D point type with intensity, so it'll look like this:

struct Point {
public:
    double x, y, z;
    double intensity;
    // Imagine a constructor or everything we need for such a class here
};

and another class that uses a vector of those points.

class Whatever {
//...
public:
    std::vector<Point> myPts;
//...
};

I would like to be able to create this vector from a file. That is, I have a file like this:

X1 Y1 Z1 I1 
X2 Y2 Z2 I2
X3 Y3 Z3 I3 
....
Xn Yn Zn In

And I would like to find a fast technique to split the file into lines, build a point per line, and fill my vector. Since it's an operation I will have to do a lot, I am looking for the fastest way.

The basic solution would be to read the file line by line, convert each line into a stringstream, and extract the point from it:

while (std::getline(file, line)) {
    std::istringstream iss(line);
    Point pt;
    iss >> pt.x >> pt.y >> pt.z >> pt.intensity;
    myPts.push_back(pt);   // std::vector has no add(); push_back is the right call
}

But this is too time-consuming. So the second method would be to read the file (entirely, or a part of it) into a buffer and parse the buffer to create the vector from it. Using a memory-mapped file, I can read quickly into a buffer, but how do I parse the buffer to create my points without using a stringstream, which I believe is slow?
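
Something along these lines is the shape I have in mind, a rough sketch using std::strtod over a raw buffer (parseBuffer is a placeholder name, and it assumes the buffer is null-terminated, which a memory-mapped file is not guaranteed to be):

#include <cstdlib>   // std::strtod
#include <vector>

// One-pass parse of a whitespace-separated "x y z intensity" text buffer.
// std::strtod skips leading whitespace (newlines included) on its own,
// so no explicit splitting into lines is needed.
std::vector<Point> parseBuffer(const char* buf)
{
    std::vector<Point> pts;
    char* end = nullptr;
    for (;;) {
        Point pt;
        pt.x = std::strtod(buf, &end);
        if (end == buf)                 // nothing converted: end of data
            break;
        buf = end;
        pt.y = std::strtod(buf, &end); buf = end;
        pt.z = std::strtod(buf, &end); buf = end;
        pt.intensity = std::strtod(buf, &end); buf = end;
        pts.push_back(pt);
    }
    return pts;
}

On a C++17 compiler, std::from_chars would avoid the null-terminator requirement since it takes an explicit end pointer, but I haven't measured either variant yet.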

EDIT: I used this code (note that my real Point also has intens, r, g and b members):

std::clock_t begin = clock();
std::clock_t loadIntoFile, loadInVec;
std::ifstream file(filename);
if (file) {
    std::stringstream buffer;
    buffer << file.rdbuf();
    file.close();
    loadIntoFile = clock();
    while (!buffer.eof()) {   // note: eof() as a loop condition is a known pitfall, see the comments below
        Point pt = Point();
        buffer >> pt.x >> pt.y >> pt.z >> pt.intens >> pt.r >> pt.g >> pt.b;
        pts.push_back(pt);
    }
    loadInVec = clock();
}
std::cout << "file size : " << pts.size() << std::endl;
std::cout << "load into file : " << (loadIntoFile - begin) / (double)CLOCKS_PER_SEC << std::endl;
std::cout << "load into vec : " << (loadInVec - loadIntoFile) / (double)CLOCKS_PER_SEC << std::endl;

And the result is:

file size : 7756849
load into file : 2.619
load into vec : 31.532

EDIT2: After removing the stringstream buffer, I got a time of 34.604 s, and after changing push_back to emplace_back, 34.023 s.

Raph Schim
  • *But this is too time-consuming.* How do you know it's too time-consuming? Because if it is, your problem is likely bound by how fast you read the file, not how fast you process the data in it. Changing how you process the data won't help at all if you need to buy a faster disk - or disks and disk system. – Andrew Henle Nov 20 '18 at 13:21
  • Have you *measured* that it's too "time consuming"? Is this a true bottleneck in your program? It's not a possible case of premature optimization? And don't "believe" something, *measure*, *benchmark*, *profile*. – Some programmer dude Nov 20 '18 at 13:21
  • *Using a memory-mapped file, I can read quickly into a buffer* And using a memory-mapped file is usually no faster than just reading the data. In fact, it can actually be slower. There's no magic way to get bytes from storage into the CPU faster. – Andrew Henle Nov 20 '18 at 13:23
  • See my EDIT for more information. – Raph Schim Nov 20 '18 at 13:36
  • why are you using the string buffer? Just read directly from the file stream – 463035818_is_not_an_ai Nov 20 '18 at 13:38
  • next, consider using `emplace_back` to construct the elements in the vector instead of pushing copies of temporaries – 463035818_is_not_an_ai Nov 20 '18 at 13:42
  • and for readability you should provide an overload for `std::istream& operator>>` (which btw lets you stream a `Point` from a string stream as well as from a file stream) – 463035818_is_not_an_ai Nov 20 '18 at 13:44
  • I tried with emplace_back and removed the string buffer, but I still have 34 s of loading. This is too much for what I want to achieve, because I will have a lot of files like this... That's why I think this is a bottleneck for me /: – Raph Schim Nov 20 '18 at 13:47
  • If you're very concerned about performance you shouldn't be storing your data as text (see the binary-file sketch after these comments). – molbdnilo Nov 20 '18 at 13:47
  • did you consider that reading huge files unavoidably takes much time? 7 million entries is a lot, and roughly 5 microseconds per entry is actually not that much – 463035818_is_not_an_ai Nov 20 '18 at 13:48
  • I would like to apply filters on point clouds, but point clouds are really huge (100, maybe 500 GB?), and I can't keep all the points in RAM. So I'm trying to create a file-based octree that stores the points by location in space, so I can run filters on each file... – Raph Schim Nov 20 '18 at 13:51
  • imho, no matter how you put it, the correct answer to "how do I parse the buffer to create my points without using a stringstream, which I believe is slow?" has to convince you that "which I believe is slow" is going in the wrong direction. Reading data from files is slow, that's not C++ streams' fault. Try to reduce the amount of data, use binary files, but blaming stringstream out of a belief doesn't help much ;) – 463035818_is_not_an_ai Nov 20 '18 at 13:53
  • You're right :( But since I have seen some software that was quite quick using a file buffer, I thought that my bottleneck was really formatting the stream into my points. :( – Raph Schim Nov 20 '18 at 14:00
  • You want to create N points. You have a text file with 4*N numbers represented as text. You need 4*N numbers represented as machine doubles. You can minimise all other overhead, but these 4*N text-to-double conversions, and copying the file to memory, will be there no matter what you do. They will likely remain your limiting factor. – n. m. could be an AI Nov 20 '18 at 14:38
  • Since nobody asked yet: you compile with optimisations turned on, right? And maybe it helps to turn off iostream synchronisation with stdio: `std::ios_base::sync_with_stdio(false);`. – Rene Nov 20 '18 at 14:41
  • I saw no difference using `std::ios_base::sync_with_stdio(false);`, and I'm compiling with VS2017 at O2 optimisation :) – Raph Schim Nov 20 '18 at 14:49
  • Unrelated, but please read [Why is iostream::eof inside a loop condition considered wrong?](https://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong). – Some programmer dude Nov 20 '18 at 14:51
  • Yes! I already read that. I tried to do something fast in order to give an answer and didn't take the time to think about it! But thanks! – Raph Schim Nov 20 '18 at 14:57
  • More related to your problem: How often will you need to do this reading? Only once? Multiple times a day? Once every hour? Once every minute? And how long will your program run? What will it do with the data once it's running? Is the loading of the data / processing of the data ratio big or small? If it takes a little over 30 seconds to load the data, but then you spend a couple of hours processing the data, does it really matter? Large amounts of data are always going to take a lot of time *somewhere*. – Some programmer dude Nov 20 '18 at 14:57
  • In fact, I have a huge file (~100 GB, as I said, so more or less 4 billion points, maybe more). What I wanted to do is take the points one by one and add them to my octree, with for example 15 million points per file maximum. So when I add 15 million points, I create 8 files, read from the first file and put the points into my 8 sub-files in the right place. Then I continue adding points, and when a file has more than 15M points, I read it and split it into 8 other files, and so on until every file has at most 15M points. Then I read them and apply a filter on each of them. – Raph Schim Nov 20 '18 at 15:03
  • I know it is going to take a lot of time, but here, 30 seconds for a 7M-point file is really too much; I think I should be able to make it faster. But in fact this is just my idea that was bad from the start. I'll have to think about doing something else, because with that, the ratio would be awful... too much loading time. – Raph Schim Nov 20 '18 at 15:04
  • I did not find any other way to split my point cloud while preserving locality. – Raph Schim Nov 20 '18 at 15:24
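
For reference, a minimal sketch of the binary-file idea suggested in the comments above. savePoints and loadPoints are hypothetical names; the sketch assumes Point is trivially copyable and that the files are written and read on the same platform, since raw doubles are not portable across architectures:

#include <cstddef>
#include <cstdio>
#include <vector>

// Write the vector as raw Point records in one fwrite call.
bool savePoints(const char* filename, const std::vector<Point>& pts)
{
    std::FILE* f = std::fopen(filename, "wb");
    if (!f) return false;
    std::size_t written = std::fwrite(pts.data(), sizeof(Point), pts.size(), f);
    std::fclose(f);
    return written == pts.size();
}

// Read the whole file back in one fread; no text-to-double conversion at all.
bool loadPoints(const char* filename, std::vector<Point>& pts)
{
    std::FILE* f = std::fopen(filename, "rb");
    if (!f) return false;
    std::fseek(f, 0, SEEK_END);
    long bytes = std::ftell(f);
    std::fseek(f, 0, SEEK_SET);
    std::size_t count = static_cast<std::size_t>(bytes) / sizeof(Point);
    pts.resize(count);
    std::size_t got = count ? std::fread(pts.data(), sizeof(Point), count, f) : 0;
    std::fclose(f);
    return got == count;
}

This removes the 4*N text-to-double conversions that n. m.'s comment identifies as the real cost, at the price of files that are no longer human-readable.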

0 Answers