Benchmarks:

Fast python (sets, rb, wb) with partitions:        3.75 s
Fast python (sets, rb, wb) with an internal loop:  1.39 s
Original python with partitions:                   19.4 s
Original python with an internal loop:             23.4 s
Cython:                                            512 ms
Python with sets and binary read and write:        820 ms
Python with dicts (Wonka's variant):               1.31 s
Original Python:                                   12.1 s
Converting the partition lists to sets is also beneficial for speed, since membership tests drop from a linear scan to a hash lookup.
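To see the difference directly, here is a quick illustrative check of list vs. set membership on the same 2500 keys used below (exact timings will vary):

import timeit

ids = [str(x) + "," + str(x + 20) for x in range(0, 500000, 200)]  # 2500 elements
ids_s = set(ids)
# a key near the end of the list makes the linear scan obvious
print(timeit.timeit('"499800,499820" in ids', globals=globals(), number=1000))    # list: linear scan
print(timeit.timeit('"499800,499820" in ids_s', globals=globals(), number=1000))  # set: hash lookup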
Fast python (sets, rb, wb) with partitions:

for i, partition_ids in enumerate(l_partition_ids):
    # O(1) membership tests instead of scanning the list
    partition_ids_s = set(partition_ids)
    with open("in.txt", "rb") as in_file:
        with open(f"out{i}.txt", "wb") as out_file:
            for line in in_file:
                # parse the line
                toks = line.split()
                id1 = int(toks[1])
                id2 = int(toks[2])
                # create the unique id pair key
                if id1 < id2:
                    key = b"%d,%d" % (id1, id2)
                else:
                    key = b"%d,%d" % (id2, id1)
                if key in partition_ids_s:
                    out_file.write(line)
Fast python (sets, rb, wb) with an internal loop:

out_files = []
l_partition_ids_sets = [set(x) for x in l_partition_ids]
with open("in.txt", "rb") as in_file:
    # open all output files up front; the input is read only once
    for i in range(len(l_partition_ids)):
        out_files.append(open(f"out{i}.txt", "wb"))
    for line in in_file:
        # parse the line
        toks = line.split()
        id1 = int(toks[1])
        id2 = int(toks[2])
        # create the unique id pair key
        if id1 < id2:
            key = b"%d,%d" % (id1, id2)
        else:
            key = b"%d,%d" % (id2, id1)
        # check the key against every partition
        for i, partition_ids in enumerate(l_partition_ids_sets):
            if key in partition_ids:
                out_files[i].write(line)
for out_file in out_files:
    out_file.close()
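Since this variant keeps many files open at once, contextlib.ExitStack is a convenient alternative that guarantees every file is closed even if an error occurs mid-loop. A sketch with the same loop body as above:

from contextlib import ExitStack

l_partition_ids_sets = [set(x) for x in l_partition_ids]
with ExitStack() as stack:
    in_file = stack.enter_context(open("in.txt", "rb"))
    out_files = [stack.enter_context(open(f"out{i}.txt", "wb"))
                 for i in range(len(l_partition_ids_sets))]
    for line in in_file:
        # same parsing and key construction as above
        toks = line.split()
        id1, id2 = int(toks[1]), int(toks[2])
        key = b"%d,%d" % ((id1, id2) if id1 < id2 else (id2, id1))
        for i, s in enumerate(l_partition_ids_sets):
            if key in s:
                out_files[i].write(line)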
Original python with partitions:

for i, partition_ids in enumerate(l_partition_ids):
    with open("in.txt", "r") as in_file:
        with open(f"out{i}.txt", "w") as out_file:
            for line in in_file:
                # parse the line
                toks = line.split()
                id1 = int(toks[1])
                id2 = int(toks[2])
                # create the unique id pair key
                if id1 < id2:
                    key = str(id1) + ',' + str(id2)
                else:
                    key = str(id2) + ',' + str(id1)
                # list membership: a linear scan over the partition
                if key in partition_ids:  # (shorthand for the first 40 million unique ids, just for the purpose of explaining)
                    out_file.write(line)
In the line_profiler output below we can see that splitting the line and converting the tokens to integers together take almost 45% of the time, while reading takes only 11%. There are much faster integer converters implemented in Cython (e.g. fast_atoi), but I have not used one here. I also tried to speed up line.split() in Cython but could not.
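For illustration, a fast_atoi-style converter in Cython might look like the sketch below. This is a minimal version of the standard digit-accumulation approach, not the exact implementation referenced above, and it assumes a well-formed non-negative integer:

%%cython
from libc.stdint cimport int64_t

def fast_atoi(bytes buf):
    # accumulate decimal digits left to right; no sign or error handling
    cdef int64_t x = 0
    cdef char c
    for c in buf:
        x = x * 10 + (c - 48)  # 48 == ord('0')
    return x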
Cython (the fastest variant):

%%cython
from libc.stdint cimport int64_t

def f_set_cy(partition_ids):
    cdef int64_t id1, id2
    partition_ids_s = set(x.encode() for x in partition_ids)
    with open("in.txt", "rb") as in_file:
        with open("out.txt", "wb") as out_file:
            for line in in_file:
                # parse the line
                toks = line.split()
                id1 = int(toks[1])
                id2 = int(toks[2])
                # create the unique id pair key
                if id1 < id2:
                    key = b"%d,%d" % (id1, id2)
                else:
                    key = b"%d,%d" % (id2, id1)
                if key in partition_ids_s:
                    out_file.write(line)
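In Jupyter, the %%cython cell magic requires the Cython extension to be loaded first; after the cell has been compiled, the function is called like any other:

%load_ext Cython
# ... execute the %%cython cell above, then:
f_set_cy(partition_ids)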
Python with sets and binary read and write:

partition_ids_s = set(x.encode() for x in partition_ids)
with open("in.txt", "rb") as in_file:
    with open("out.txt", "wb") as out_file:
        for line in in_file:
            # parse the line
            toks = line.split()
            id1 = int(toks[1])
            id2 = int(toks[2])
            # create the unique id pair key
            if id1 < id2:
                key = b"%d,%d" % (id1, id2)
            else:
                key = b"%d,%d" % (id2, id1)
            if key in partition_ids_s:
                out_file.write(line)
Line profiler:

Timer unit: 1e-07 s

Total time: 2.67841 s
File: <ipython-input-157-900077df3ca6>
Function: f_py at line 1

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     1                                           def f_py(partition_ids):
     2         1      10037.0  10037.0      0.0      partition_ids_s = set(x.encode() for x in partition_ids)
     3         1       2877.0   2877.0      0.0      with open("in.txt", "rb") as in_file:
     4         1       9213.0   9213.0      0.0          with open("out.txt", "wb") as out_file:
     5    500001    2914824.0      5.8     10.9              for line in in_file:
     6                                                           # parse the line
     7    500000    4207575.0      8.4     15.7                  toks = line.split()
     8    500000    3891864.0      7.8     14.5                  id1 = int(toks[1])
     9    500000    3768049.0      7.5     14.1                  id2 = int(toks[2])
    10
    11                                                           # create the unique id pair key
    12    500000    2798327.0      5.6     10.4                  if id1 < id2:
    13    300000    2768751.0      9.2     10.3                      key = b"%d,%d" % (id1,id2)
    14                                                           else:
    15    200000    1844449.0      9.2      6.9                      key = b"%d,%d" % (id2,id1)
    16
    17
    18    500000    3008688.0      6.0     11.2                  if key in partition_ids_s:
    19    200000    1559435.0      7.8      5.8                      out_file.write(line)
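For reference, a profile like the one above can be reproduced with the line_profiler extension in IPython (assuming the function is defined as f_py):

%load_ext line_profiler
%lprun -f f_py f_py(partition_ids)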
Data initialization:

import pandas as pd
import io
from random import shuffle

s = """60300 60420 12 14 123 144
60400 60420 12 14 123 144
60400 60520 11 14 123 144
60420 60400 10 11 233 341
60420 60410 14 20 244 268
"""
s = s * 100000
df = pd.read_csv(io.StringIO(s), sep=" ", names=["id1", "id2", "a1", "a2", "a3", "a4"])
# prepend a running index column and drop the last column, keeping 6 fields per line
df = df.reset_index()[["index"] + list(df.columns[:-1])]
df.to_csv("in.txt", sep=" ", index=False, header=False)  # 500000 lines, 14 MB
partition_ids = [str(x) + "," + str(x + 20) for x in range(0, 500000, 200)]  # 2500 elements
For multiple partitions:

partition_ids = [str(x) + "," + str(x + 20) for x in range(0, 500000, 200)]  # 2500 elements
shuffle(partition_ids)
l_partition_ids = l_split(partition_ids, 5)
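l_split is not defined in the snippets above; a minimal sketch, assuming it splits a list into n contiguous chunks of (nearly) equal size:

def l_split(lst, n):
    # distribute len(lst) items over n chunks, the first m chunks one longer
    k, m = divmod(len(lst), n)
    return [lst[i * k + min(i, m):(i + 1) * k + min(i + 1, m)] for i in range(n)]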
Using binary strings:

partition_ids = [b"%d,%d" % (x, x + 20) for x in range(0, 500000, 200)]  # 2500 elements
shuffle(partition_ids)
l_partition_ids = l_split(partition_ids, 5)
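As a quick sanity check after running one of the single-output variants, the matched lines can be counted; the profiler output above shows 200000 writes for this data set:

with open("out.txt", "rb") as f:
    print(sum(1 for _ in f))  # expect 200000 with the partition_ids above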