Something like this should do what you want.
```python
import itertools as it

with open('test.txt') as in_file:
    splitted_lines = (line.split(None, 1) for line in in_file)
    for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
        with open(num + '.txt', 'w') as out_file:
            out_file.writelines(line for _, line in group)
```
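For instance (a minimal, self-contained sketch; the sample lines are made up), the script turns a file whose first column is a group key into one output file per key:

```python
import itertools as it

# Build a tiny test.txt with made-up data: two lines for key '1',
# one line for key '2'.
with open('test.txt', 'w') as f:
    f.write('1\ta b\n'
            '1\tc d\n'
            '2\te f\n')

# Same logic as above.
with open('test.txt') as in_file:
    splitted_lines = (line.split(None, 1) for line in in_file)
    for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
        with open(num + '.txt', 'w') as out_file:
            out_file.writelines(line for _, line in group)

print(open('1.txt').read())  # 'a b\nc d\n'
print(open('2.txt').read())  # 'e f\n'
```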
- The `with` statements make sure resources are used safely; here they guarantee that the files are closed, even if an error occurs.
- The `splitted_lines = (...)` line creates a generator over the lines of the file: it splits each line once and yields the pair (first field, rest of the line).
- The `itertools.groupby` function does most of the work: it iterates over the split lines and groups them by their first element (with one caveat, shown in the sketch after this list).
- The `(line for _, line in group)` generator iterates over the split lines of one group; it drops the first element and keeps only the rest of each line. (`_` is an identifier like any other. I could have used `x` or `first`, but `_` is conventionally used for a value that must be assigned but is never used.)
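One detail the list above glosses over: `itertools.groupby` only groups *consecutive* items with equal keys, so the code relies on lines with the same first field being adjacent in the input (as they are when the file is sorted on that field). A quick sketch of the behaviour:

```python
import itertools as it

# Made-up lines; note the non-adjacent '1' at the end.
data = ['1 a', '1 b', '2 c', '1 d']
pairs = (line.split(None, 1) for line in data)
for key, group in it.groupby(pairs, key=lambda x: x[0]):
    print(key, [rest for _, rest in group])
# 1 ['a', 'b']
# 2 ['c']
# 1 ['d']   <- a second group for key '1': groupby does not sort
```

With a trailing run like that, the script above would reopen `1.txt` in `'w'` mode and overwrite what was written earlier.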
We could probably simplify the code. For example, the outermost `with` is unlikely to be necessary, since we only open the file in reading mode and never modify it. Removing it takes off one level of indentation:
```python
import itertools as it

splitted_lines = (line.split(None, 1) for line in open('test.txt'))
for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
    with open(num + '.txt', 'w') as out_file:
        out_file.writelines(line for _, line in group)
```
I have done a very simple benchmark of the Python solution versus the awk solution. The performance is about the same, with Python slightly faster, on a file with ten numeric fields per line and 100 "line groups", each of a random size between 2 and 30 lines.

Timing of the Python code:
```python
In [22]: from random import randint
    ...:
    ...: with open('test.txt', 'w') as f:
    ...:     for count in range(1, 101):
    ...:         num_nums = randint(2, 30)
    ...:         for time in range(num_nums):
    ...:             numbers = (str(randint(-1000, 1000)) for _ in range(10))
    ...:             f.write('{}\t{}\n'.format(count, '\t'.join(numbers)))
    ...:

In [23]: %%timeit
    ...: splitted_lines = (line.split(None, 1) for line in open('test.txt'))
    ...: for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
    ...:     with open(num + '.txt', 'w') as out_file:
    ...:         out_file.writelines(line for _, line in group)
    ...:
10 loops, best of 3: 11.3 ms per loop
```
Awk timings:
```
$ time awk '{print $2,$3,$4 > ("test"$1)}' OFS='\t' test.txt

real    0m0.014s
user    0m0.004s
sys     0m0.008s
```
Note that `0.014s` is about `14ms`.
Anyway, the timings can vary depending on OS load, and the two are effectively equally fast. In practice almost all of the time is spent reading from and writing to files, and both Python and awk do that efficiently. I don't believe you would see huge speed gains by rewriting this in C.
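If you want to verify that the run time is dominated by I/O (a rough sketch, not part of the benchmark above), one way is to time the parse-and-group pass alone, draining the groups instead of writing files, and compare it against the full run:

```python
import itertools as it
import timeit

def group_only():
    # Same parsing/grouping pipeline as above, but the groups are
    # drained instead of written out, which separates the CPU-side
    # work from the output I/O.
    splitted_lines = (line.split(None, 1) for line in open('test.txt'))
    for num, group in it.groupby(splitted_lines, key=lambda x: x[0]):
        for _ in group:
            pass

# Assumes the test.txt generated by the benchmark above still exists.
print(timeit.timeit(group_only, number=100))
```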