
I have a very large text file (~80 GB) with about a billion lines. A sample of the file's contents (the first column represents the line number and is not part of the file) would be:

Note: the file's contents are first ordered on the first column and then on the second column.

1 60400 60420 12 14 123 144
2 60400 60520 11 14 123 144
...
i 60420 60400 10 11 233 341
i+1 60420 60410 14 20 244 268
...

Filtering criteria: I want to split the file based on unique (id1,id2) [or (id2,id1)] pairs. In the sample shown above, if I consider (60400,60420) as an id pair, then the i-th line also belongs to that pair. Each split file would contain all the lines belonging to its id pairs, so the split files are mutually exclusive with respect to the id pairs. The method I've applied so far is as follows:

1) Partitioned all the unique id pairs into three files, where the first two files have 200 million unique id pairs each and the third has about 157 million. These id pairs were created such that id1 < id2.

2) For each of these partitions of id pairs, I scan the original file and write out the matching lines, like so:

partition_ids = []
# read the partition id and populate the partition_ids

# read the original file(87G file)
for line in original_file:
    # parse the line
    toks = line.split()
    id1 = int(toks[0])
    id2 = int(toks[1])

    # create the unique id pair key
    if id1 < id2:
        key = str(id1)+','+str(id2)
    else:
        key = str(id2)+','+str(id1)

    if key in partition_ids[:40mil]: # (shorthand for the first 40 million unique ids, just for the purpose of explaining)
        # write line to the file

This process is still taking a long time (> 20 hours) and I really want to speed it up. This was the solution I could come up with for processing the large file. If there are any other (faster) approaches or suggestions, they would be much appreciated.

pramesh shakya
  • I'm pretty sure that Python wastes too much time just reading that big file; try opening it several times with less content, e.g. lines 0-10000, then 10001-20000, etc. You can also use threads/multiprocessing to speed things up – Wonka Dec 23 '19 at 15:52
  • @Wonka, just to read the file and iterate through each line without any parsing, it takes about 4 minutes. I'd have to read through the big file anyway; how do you suggest multithreading/multiprocessing would help here? Do I process, say, lines 0-10000 with one thread while another thread processes 10000 up to some other line? – pramesh shakya Dec 23 '19 at 15:55
  • Additionally, I'd use a set rather than a list. Sets are significantly faster when checking if an element is present. – Axe319 Dec 23 '19 at 15:56
  • @prameshshakya yes, but I meant something else: if key in BIG_LIST is the bad part; use a dictionary structure instead, so the lookup costs O(1) (make partition_ids a dictionary) – Wonka Dec 23 '19 at 15:58
  • @Axe319, I thought about it, but since sets don't preserve order I can't use them, because the first partition alone has 200 million unique id pairs and I'm trying to partition those again into smaller, mutually exclusive subsets. – pramesh shakya Dec 23 '19 at 15:59
  • @Wonka I'm sorry if I explained the problem poorly, but I don't see how dictionaries would be of use here. – pramesh shakya Dec 23 '19 at 16:03
  • @prameshshakya try this example: fill a list with 1,000,000 elements (say the numbers from 0 to 1,000,000). Checking whether 555,555 is in the list costs O(n), with n = 1,000,000 in the worst case. If instead you define a dictionary with keys 0-1,000,000, checking whether 555,555 is in the dict costs O(1); assign True as the value to reduce memory: d[1] = True, d[555555] = True, etc. You can also use tuples as keys, d[(v1, v2)] = True, instead of concatenating v1+","+v2 – Wonka Dec 23 '19 at 16:10
  • Maybe have a look at the pyarrow library. It offers a lot of very efficient read and write options and takes full advantage of your available hardware. – Rob Dec 23 '19 at 16:29
  • Hey @prameshshakya, did you test my answer? Does it reduce your execution time? Ask if you need more help adapting it to your code – Wonka Dec 23 '19 at 16:53
  • @prameshshakya: Try this [repl.it](https://repl.it/repls/GrubbyStainedGravity) to see if it improves things. – stovfl Dec 23 '19 at 20:45

2 Answers

1

Try changing your partition_ids list to a dict, to reduce the cost of checking whether an element is present:

partition_ids = {}
# read the partition id and populate the partition_ids

# read the original file(87G file)
for line in original_file:
    # parse the line
    toks = line.split()
    id1 = int(toks[0])
    id2 = int(toks[1])

    # create the unique id pair key
    if id1 < id2:
        key = str(id1)+','+str(id2)
    else:
        key = str(id2)+','+str(id1)

    # YOUR OLD CODE
    """
    if key in partition_ids[:40mil]: # (shorthand for the first 40 million unique ids, just for the purpose of explaining)
    # write line to the file
    """

    # MY proposal
    if key in partition_ids:
        # do your stuff here if the key exists
        pass


# To assign keys when populating partition_ids (the part missing from your code), use:
partition_ids[key] = True
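
For completeness, populating partition_ids before the loop might look like this (a minimal sketch; the partition file name and the one-pair-per-line format are assumptions, not part of this answer):

partition_ids = {}
# Hypothetical: each line of the partition file holds one "id1,id2" pair with id1 < id2.
with open("partition1.txt") as partition_file:
    for pair in partition_file:
        partition_ids[pair.strip()] = True  # the value is irrelevant; only the O(1) key lookup matters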
Wonka
0

Benchmarks:

Fast python (sets, rb,wb) with partitions:  3.75 s
Fast python (sets... with an internal loop: 1.39 s
Original python with partitions:           19.4 s
Original python ... with an internal loop: 23.4 s

Cython:                                       512 ms
Python with sets and binary read and write:   820 ms
Python with dicts (Wonka's variant):        1.31 s
Original Python                            12.1 s

Using sets for the partitioned id lists is also beneficial for speed.

Fast python (sets, rb,wb) with partitions:

for i,partition_ids in enumerate(l_partition_ids):
    partition_ids_s = set(partition_ids)
    with open("in.txt", "rb") as in_file:
        with open(f"out{i}.txt", "wb") as out_file:
            for line in in_file:
                # parse the line
                toks = line.split()
                id1 = int(toks[1])
                id2 = int(toks[2])

                # create the unique id pair key
                if id1 < id2:
                    key = b"%d,%d" % (id1,id2)
                else:
                    key = b"%d,%d" % (id2,id1)

                if key in partition_ids_s:
                    out_file.write(line)

Fast python (sets... with an internal loop:

out_files = []
l_partition_ids_sets = [set(x) for x in l_partition_ids]
with open("in.txt", "rb") as in_file:
    for i in range(len(l_partition_ids)):
        out_files.append(open(f"out{i}.txt", "wb"))
    for line in in_file:
        # parse the line
        toks = line.split()
        id1 = int(toks[1])
        id2 = int(toks[2])

        # create the unique id pair key
        if id1 < id2:
            key = b"%d,%d" % (id1,id2)
        else:
            key = b"%d,%d" % (id2,id1)

        for i,partition_ids in enumerate(l_partition_ids_sets):
            if key in partition_ids:
                out_files[i].write(line)
for out_file in out_files:
    out_file.close()
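
Since the question states the id-pair partitions are mutually exclusive, each line can match at most one set, so the per-line loop over all sets could also be replaced by a single dict lookup. A sketch of that variant (my own addition, not benchmarked in this answer; it assumes the binary-string partition_ids from the data initialization below):

# Map each key to the index of the partition it belongs to (one-time cost).
key_to_partition = {}
for i, partition_ids in enumerate(l_partition_ids):
    for key in partition_ids:
        key_to_partition[key] = i

out_files = [open(f"out{i}.txt", "wb") for i in range(len(l_partition_ids))]
with open("in.txt", "rb") as in_file:
    for line in in_file:
        # parse the line
        toks = line.split()
        id1 = int(toks[1])
        id2 = int(toks[2])

        # create the unique id pair key
        if id1 < id2:
            key = b"%d,%d" % (id1, id2)
        else:
            key = b"%d,%d" % (id2, id1)

        i = key_to_partition.get(key)  # single O(1) lookup instead of looping over all sets
        if i is not None:
            out_files[i].write(line)
for out_file in out_files:
    out_file.close()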

Original python with partitions:

for i,partition_ids in enumerate(l_partition_ids):
    with open("in.txt", "r") as in_file:
        with open("out.txt", "w") as out_file:
            for line in in_file:
                # parse the line
                toks = line.split()
                id1 = int(toks[1])
                id2 = int(toks[2])

                # create the unique id pair key
                if id1 < id2:
                    key = str(id1)+','+str(id2)
                else:
                    key = str(id2)+','+str(id1)

                if key in partition_ids:
                    out_file.write(line)

In the line_profiler output below we can see that splitting the line and converting to integers take almost 45% of the time. Reading only takes 11% of the time. There are much faster integer converters implemented in Cython (fast_atoi here), but I have not used one in this answer. I tried to improve the speed of line.split() in Cython but could not.
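
For reference, such a converter could look roughly like this in Cython (my own sketch of the fast_atoi idea, not the implementation linked above; it assumes each token contains only ASCII digits, with no sign or whitespace):

%%cython

cpdef long fast_atoi(bytes buf):
    # Convert an ASCII-digit byte string to an integer without int()'s parsing overhead.
    cdef long val = 0
    cdef Py_ssize_t i
    for i in range(len(buf)):
        val = val * 10 + (buf[i] - 48)  # 48 == ord(b"0")
    return val

# Inside the parsing loop it would then be used as:
#     id1 = fast_atoi(toks[1])
#     id2 = fast_atoi(toks[2])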

Cython (the fastest variant):

%%cython

from libc.stdint cimport (uint8_t, uint16_t, uint32_t, uint64_t,
                          int8_t, int16_t, int32_t, int64_t)
import numpy as np

def f_set_cy(partition_ids):
    cdef int64_t id1, id2
    partition_ids_s = set(x.encode() for x in partition_ids)
    with open("in.txt", "rb") as in_file:
        with open("out.txt", "wb") as out_file:
            for line in in_file:
                # parse the line
                toks = line.split()
                id1 = int(toks[1])
                id2 = int(toks[2])

                # create the unique id pair key
                if id1 < id2:
                    key = b"%d,%d" % (id1,id2)
                else:
                    key = b"%d,%d" % (id2,id1)


                if key in partition_ids_s:
                    out_file.write(line)

Python with sets and binary read and write:

partition_ids_s = set(x.encode() for x in partition_ids)
with open("in.txt", "rb") as in_file:
    with open("out.txt", "wb") as out_file:
        for line in in_file:
            # parse the line
            toks = line.split()
            id1 = int(toks[1])
            id2 = int(toks[2])

            # create the unique id pair key
            if id1 < id2:
                key = b"%d,%d" % (id1,id2)
            else:
                key = b"%d,%d" % (id2,id1)


            if key in partition_ids_s:
                out_file.write(line)

Line profiler:

Timer unit: 1e-07 s

Total time: 2.67841 s
File: <ipython-input-157-900077df3ca6>
Function: f_py at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def f_py(partition_ids):
     2         1      10037.0  10037.0      0.0      partition_ids_s = set(x.encode() for x in partition_ids)
     3         1       2877.0   2877.0      0.0      with open("in.txt", "rb") as in_file:
     4         1       9213.0   9213.0      0.0          with open("out.txt", "wb") as out_file:
     5    500001    2914824.0      5.8     10.9              for line in in_file:
     6                                                           # parse the line
     7    500000    4207575.0      8.4     15.7                  toks = line.split()
     8    500000    3891864.0      7.8     14.5                  id1 = int(toks[1])
     9    500000    3768049.0      7.5     14.1                  id2 = int(toks[2])
    10                                           
    11                                                           # create the unique id pair key
    12    500000    2798327.0      5.6     10.4                  if id1 < id2:
    13    300000    2768751.0      9.2     10.3                      key = b"%d,%d" % (id1,id2)
    14                                                           else:
    15    200000    1844449.0      9.2      6.9                      key = b"%d,%d" % (id2,id1)
    16                                           
    17                                           
    18    500000    3008688.0      6.0     11.2                  if key in partition_ids_s:
    19    200000    1559435.0      7.8      5.8                      out_file.write(line)

Data initialization:

import pandas as pd
import io
from random import shuffle

s= """60300 60420 12 14 123 144
60400 60420 12 14 123 144
60400 60520 11 14 123 144
60420 60400 10 11 233 341
60420 60410 14 20 244 268
"""
s = s * 100000  
df = pd.read_csv(io.StringIO(s), sep=" ", names=["id1", "id2", "a1", "a2", "a3", "a4"])
df = df.reset_index()[["index"] + list(df.columns[:-1])] 
df.to_csv("in.txt", sep=" ", index=False, header=False) #500000 lines 14MB
partition_ids = [str(x)+","+str(x+20) for x in range(0, 500000,200)] #2500 elements

For multiple partitions:

partition_ids = [str(x)+","+str(x+20) for x in range(0, 500000,200)] #2500 elements
shuffle(partition_ids)
l_partition_ids = l_split(partition_ids, 5)

Using binary strings:

partition_ids = [b"%d,%d" % (x,x+20) for x in range(0, 500000,200)] #2500 elements
shuffle(partition_ids)
l_partition_ids = l_split(partition_ids, 5)
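
The helper l_split is not shown above; a minimal sketch of a helper that splits a list into n roughly equal chunks (an assumption about its behavior) could be:

def l_split(lst, n):
    # Split lst into n roughly equal, mutually exclusive chunks.
    size = (len(lst) + n - 1) // n
    return [lst[i:i + size] for i in range(0, len(lst), size)]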
keiv.fly
  • so the entire partition file (say, partition1) that I read is still large (200 million lines), each line being a key pair. My thought was to read it into a data structure and split the 200 million into 5 sets (40 million each, say partition1.1, partition1.2, etc.). To be able to do this, I thought a list would be a good choice since I can check whether the id pairs I'll be reading are in the first subset, the second, the third, etc. and output to different files. So when you're using set() in line 2, does that still let me do what I want? – pramesh shakya Dec 24 '19 at 03:29
  • 1
    Check my update. It is still faster to use sets even if you have multiple partitions. – keiv.fly Dec 24 '19 at 13:03