
I have a text file of points (several GB, ~12 million lines), where each line holds x, y, z plus accessory information. I want to read the file chunk by chunk, process each point, and split the result into several text files in a temporary folder, following a spatial index based on the position of each point with respect to a square grid of 0.25 m.

449319.34;6242700.23;0.38;1;1;1;0;0;42;25;3;17;482375.326087;20224;23808;23808
449310.72;6242700.22;0.35;3;1;1;0;0;42;23;3;17;482375.334291;20480;24576;24576
449313.81;6242700.66;0.39;1;1;1;0;0;42;24;3;17;482375.342666;20224;24576;24576
449298.37;6242700.27;0.39;1;1;1;0;0;42;21;3;17;482375.350762;18176;22784;23552
449287.47;6242700.06;0.39;11;1;1;0;0;42;20;3;17;482375.358921;20736;24832;24832
449290.11;6242700.21;0.35;1;1;1;0;0;42;20;3;17;482375.358962;19968;24064;23808
449280.48;6242700.08;0.33;1;1;1;0;0;42;18;3;17;482375.367142;22528;25856;26624
449286.97;6242700.44;0.36;3;1;1;0;0;42;19;3;17;482375.367246;19712;23552;23296
449293.03;6242700.78;0.37;1;1;1;0;0;42;21;3;17;482375.367342;19456;23296;23808
449313.36;6242701.92;0.38;6;1;1;0;0;42;24;3;17;482375.367654;19968;24576;24576
449277.48;6242700.17;0.34;8;1;1;0;0;42;18;3;17;482375.375420;20224;23808;25088
449289.46;6242700.85;0.31;3;1;1;0;0;42;20;3;17;482375.375611;18944;23040;23040

where ";" is the delimiter and the first two columns the x and y any useful for give the ID position

The intermediate output is another set of text files in which each point is tagged with its ID, and for each ID only one point will then be randomly extracted.

ex:

    20;10;449319.34;6242700.23;0.38;1;1;1;0;0;42;25;3;17;482375.326087;20224;23808;23808
    20;10;449310.72;6242700.22;0.35;3;1;1;0;0;42;23;3;17;482375.334291;20480;24576;24576
    20;10;449313.81;6242700.66;0.39;1;1;1;0;0;42;24;3;17;482375.342666;20224;24576;24576
    20;10;449298.37;6242700.27;0.39;1;1;1;0;0;42;21;3;17;482375.350762;18176;22784;23552
    20;11;449287.47;6242700.06;0.39;11;1;1;0;0;42;20;3;17;482375.358921;20736;24832;24832
    20;11;449290.11;6242700.21;0.35;1;1;1;0;0;42;20;3;17;482375.358962;19968;24064;23808

where the first two columns are the ID

The final output will then keep only one randomly selected point per ID (the ID columns can be dropped afterwards), for example:

         20;10;449313.81;6242700.66;0.39;1;1;1;0;0;42;24;3;17;482375.342666;20224;24576;24576
         20;11;449287.47;6242700.06;0.39;11;1;1;0;0;42;20;3;17;482375.358921;20736;24832;24832

I am using a chunked-reading solution from this blog (effbot.org):

# File: readline-example-3.py

file = open("sample.txt")

while 1:
    lines = file.readlines(100000)
    if not lines:
        break
    for line in lines:
        pass # do something
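
For reference, Python file objects are already buffered internally, so the same loop can also be written as a plain line-by-line iteration with a context manager; this is just the generic idiom, not part of the effbot example:

with open("sample.txt") as infile:
    for line in infile:
        pass  # do something with each line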

my code is the following:

from __future__ import division
import os
import glob
import tempfile
import sys

def print_flush(n, maxvalue=None):
    """Show a progress counter on a single, continuously rewritten line."""
    sys.stdout.write("\r")
    if maxvalue is None:
        sys.stdout.write("Laser points processed: %d" % n)
    else:
        sys.stdout.write("%d of %d laser points processed" % (n, maxvalue))
    sys.stdout.flush()


def point_grid_id(x, y, minx, maxy, size):
    """Return the (row, col) grid cell ID of point (x, y)."""
    col = int((x - minx) / size)
    row = int((maxy - y) / size)
    return row, col


def tempfile_tile_name(line, temp_dir, minx, maxy, size, parse):
    """Return the temp file path for the grid cell this line's point falls in."""
    x, y = line.split(parse)[:2]
    row, col = point_grid_id(float(x), float(y), minx, maxy, size)
    return os.path.normpath(os.path.join(temp_dir, "tempfile_%s_%s.tmp" % (row, col)))

# split the text file in small text files following the ID value given by tempfile_tile_name
# where:
# filename : name+path of text file
# temp_dir: temporary folder
# minx, maxy: origin of the grid (left-up corner)
# size: size of the grid
# parse: delimiter of the text file
# num: number of lines (~ 12 millions)

def tempfile_split(filename, temp_dir, minx, maxy, size, parse, num):
    index = 1
    with open(filename) as infile:
        while True:
            lines = infile.readlines(100000)
            if not lines:
                break
            for line in lines:
                print_flush(index, num)
                index += 1
                name = tempfile_tile_name(line, temp_dir, minx, maxy, size, parse)
                # the tile file is opened and closed again for every single line
                with open(name, 'a') as outfile:
                    outfile.write(line)

The main problem with my code is that speed drops sharply once ~2 million split text files have been saved in the temporary folder. With respect to the effbot.org solution, I would like to know whether there is an optimized way to create a buffer.
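
A minimal sketch of the kind of per-tile buffering I have in mind, reusing print_flush and tempfile_tile_name from above (flush_every is an arbitrary threshold; this is only an idea, not tested code):

from collections import defaultdict

def tempfile_split_buffered(filename, temp_dir, minx, maxy, size, parse, num, flush_every=100000):
    buffers = defaultdict(list)  # tile file name -> lines waiting to be written
    pending = 0
    with open(filename) as infile:
        for index, line in enumerate(infile, 1):
            print_flush(index, num)
            name = tempfile_tile_name(line, temp_dir, minx, maxy, size, parse)
            buffers[name].append(line)
            pending += 1
            if pending >= flush_every:
                # append each tile's pending lines in one go, then start over
                for fname, chunk in buffers.items():
                    with open(fname, 'a') as outfile:
                        outfile.writelines(chunk)
                buffers.clear()
                pending = 0
    # write out whatever is still buffered at the end of the file
    for fname, chunk in buffers.items():
        with open(fname, 'a') as outfile:
            outfile.writelines(chunk)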

Gianni Spear
  • What's the reason for saving two million individual files? I know I know - premature optimization is the root of all evil and you should always (when possible) use pure text files - but I would look to find another way of storing this data. – Anders Mar 24 '13 at 19:48
  • For every line you read, you open and close an output file. That's the bottleneck. Consider writing to a database instead. – Janne Karila Mar 24 '13 at 19:49
  • @Anders, "premature optimization is the root of all evil" so true!!! After the split I need to open each file again and select only one random line. – Gianni Spear Mar 24 '13 at 19:53
  • Why use `file.readlines()`? File objects should already use buffered IO, and you're iterating over the lines one-by-one anyway so it's not like you're getting rid of loop overhead, and allocating and releasing a lot of memory over and over again. Reading data in chunks makes more sense when you don't really care about what's inside those chunks, like when you're trying to download a binary file to disk or so. Seems like just doing `for line in file:` would be fine. (Where the memory overhead should be minimal.) – millimoose Mar 24 '13 at 19:54
  • @millimoose guessing because it reads that many bytes, and then up to the end of the next line.... So, I would imagine the 100000 should be the equiv of 512mb or so... but yeah - looking at the code, nothing's really happening on a chunk basis anyway so good point – Jon Clements Mar 24 '13 at 19:56
  • @millimoose, the first version of my code was reading line-by-line, but after 1 million lines the process became extremely slow. With file.readlines() I can see an increase in speed. By the way, I am really open to new solutions. – Gianni Spear Mar 24 '13 at 19:57
  • @Anders "Not doing something *utterly* pointless" isn't premature optimisation. That is, doing the thing with the 2 million files only makes sense if it makes some other part of the code much more straightforward to implement. Avoiding premature optimisation doesn't mean "go with the very first approach you can think of." – millimoose Mar 24 '13 at 19:59
  • @Gianni Fair enough I suppose, though I wonder why that would happen at all. Maybe if you resolve the bottleneck with too many garbage files being opened, getting rid of the chunking will be a possibility or even help. – millimoose Mar 24 '13 at 20:00
  • Shove it into a sqlite3 database using `executemany` with the points pre-calculated, then retrieve that with an order by query on the key, then work from there... (I think this could be what you're after) – Jon Clements Mar 24 '13 at 20:00
  • @JonClements, executemany is new for me. Is it complex to implement in Python? – Gianni Spear Mar 24 '13 at 20:03
  • @Gianni no, just set up an insert query and use the `executemany` method on the db connection, giving it a generator over your file that returns the necessary data, then query it (a rough sketch of this idea follows these comments)... – Jon Clements Mar 24 '13 at 20:04
  • @JonClements doesn't look so easy :D – Gianni Spear Mar 24 '13 at 20:05
  • @JonClements but I am here to learn!!!! – Gianni Spear Mar 24 '13 at 20:09
  • @Gianni I'll see if I can find an example... or write you a quick one - but I can see this taking a while ;) – Jon Clements Mar 24 '13 at 20:10
  • @Gianni can you post some sample data and expected output please? – Jon Clements Mar 24 '13 at 20:27
  • It sounds like the domain of IDs is the smallest part of the problem. How big is it in relation to the whole thing? That will give you a handle on how big your answer set will really be, and you can pick a back end appropriate to that scale. – theodox Mar 24 '13 at 20:49
  • Hey @theodox, the bottleneck is opening and closing the split files. I cannot load all the data into memory (~3-8 GB is the standard size of my files). For this reason my strategy was to split the files, where each file holds all the points with the same spatial ID (= they drop in the same grid cell) – Gianni Spear Mar 24 '13 at 20:59
  • I'm just curious how big your total # of IDs is. That times the size of your data will scale the final output, so if you can create a container (a database sounds like a good idea to me) that handles as many records as you expect IDs, you'll be good. You can handle the random selection by taking the first entry for each ID and then randomly overwriting it in memory... you only have to worry about (count of IDs × size of records) amount of memory – theodox Mar 24 '13 at 21:09
  • ...unless you're worried about the ordering of the records somehow biasing your samples. Doing a random overwrite rather than collecting all examples for id X and picking from there will bias towards later samples. – theodox Mar 24 '13 at 21:13
  • @theodox the smaller the square grid (e.g. 0.25 m), the more IDs you have. In the worst case the number of IDs equals the number of lines (e.g. with 8 million points you can have 8 million different IDs), but it is never more than the total number of lines – Gianni Spear Mar 24 '13 at 21:13
  • one solution is to always save a single temp file for each point, named temp_row_col_number.tmp, where row_col is the spatial ID and number is unique from 0 to length(all points) – Gianni Spear Mar 24 '13 at 21:14
  • (The comment police are telling us to move to chat). I think you don't want any solution that requires millions of files: it's the IO overhead that's murderous here. – theodox Mar 24 '13 at 21:17
  • Hey @RomanC, I don't wish to decrease the memory usage – Gianni Spear Mar 25 '13 at 11:10
  • @Gianni If you don't want to decrease the memory usage, why do you ask for the "most memory efficient way" ? – Frank Schmitt Mar 25 '13 at 11:20
  • Yes, it's true. Sorry, it's Monday morning; I tried to fix this strategy all night. Consider the strategy above wrong, but my suspicion is that when more than 1 million files are created in the folder (on Windows) the performance becomes extremely slow. I am thinking of a new strategy: read the first x lines, create a dictionary with these lines (consider they are points), read the rest, and if a point drops in a grid cell (i.e. it has a valid ID) it is then inserted in the dictionary. Randomly select from the dictionary and save. The next step is to read the next x lines and repeat – Gianni Spear Mar 25 '13 at 11:27
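
A rough sketch of the sqlite3 / executemany idea suggested by Jon Clements in the comments above (table and column names are invented, and point_grid_id is the function from the question):

import sqlite3

def load_points_to_db(filename, db_path, minx, maxy, size, parse):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS points (grid_row INTEGER, grid_col INTEGER, line TEXT)")

    def rows():
        # stream the file once, yielding (row, col, original line) tuples
        with open(filename) as infile:
            for line in infile:
                x, y = line.split(parse)[:2]
                r, c = point_grid_id(float(x), float(y), minx, maxy, size)
                yield r, c, line

    conn.executemany("INSERT INTO points VALUES (?, ?, ?)", rows())
    conn.commit()
    conn.close()

All points sharing an ID can then be pulled back together with a single ORDER BY grid_row, grid_col query instead of millions of temporary files.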

1 Answer


The bottleneck in your code is not in reading, but in opening and closing an output file for every line read. In comments you mention your final objective: "After the split I need to open each file again and select only one random line."

theodox mentions a possible approach, taking the first entry for each ID and then randomly overwriting it in memory. Note that the overwriting must take place with probability 1/n, where n is the number of lines so far seen with the same ID, to avoid a bias towards later samples (this is reservoir sampling with a reservoir of size one per ID).
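
A minimal sketch of that single-pass, in-memory variant, assuming point_grid_id and the parameters from the question:

from random import random

def random_selection_in_memory(filename, minx, maxy, size, parse):
    counts = {}    # ID -> number of lines seen so far
    selected = {}  # ID -> line currently kept for that ID
    with open(filename) as infile:
        for line in infile:
            x, y, _ = line.split(parse, 2)
            key = point_grid_id(float(x), float(y), minx, maxy, size)
            n = counts.get(key, 0) + 1
            counts[key] = n
            # keep the new line with probability 1/n (the first line is always kept)
            if random() < 1.0 / n:
                selected[key] = line
    return selected

This keeps one full line per ID in memory, which is what motivates the two-pass version below.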

EDIT. You can save memory by doing two passes over the file. The first pass builds a set of line numbers excluded by the random selection, the second pass processes the lines that are not excluded.

from random import random

def random_selection(filename, temp_dir, minx, maxy, size, parse, num):
    selection = {}    # ID -> (lines seen so far, line number currently selected)
    excluded = set()  # line numbers rejected by the random selection
    with open(filename) as infile:
        for i, line in enumerate(infile):
            x, y, _ = line.split(parse, 2)
            row_col = point_grid_id(float(x), float(y), minx, maxy, size)
            try:
                n, selected_i = selection[row_col]
            except KeyError:
                selection[row_col] = 1, i
            else:
                n += 1
                if random() < 1.0 / n:
                    # the new line replaces the previous selection
                    excluded.add(selected_i)
                    selected_i = i
                else:
                    # the new line is rejected
                    excluded.add(i)
                selection[row_col] = n, selected_i

    with open(filename) as infile:
        for i, line in enumerate(infile):
            if i not in excluded:
                pass  # process the line
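
For illustration only, the processing step in the second pass could write the surviving lines straight to a single output file (the name "selected_points.txt" is made up):

with open(filename) as infile, open("selected_points.txt", "w") as outfile:
    for i, line in enumerate(infile):
        if i not in excluded:
            outfile.write(line)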
Janne Karila
  • Thanks Janne, I got the point. There is a memory problem storing 12 GB of data: when you select inside a small square grid (e.g. 0.25 m), often only one or two points drop inside each cell, so with ~12 GB of input the final output is around ~10 GB – Gianni Spear Mar 25 '13 at 11:06
  • @Gianni OK, I edited my code to keep line numbers instead of full lines. Maybe it fits your memory now? – Janne Karila Mar 25 '13 at 11:28