
I have a very large text file (20 GB) which has rows like this:

1 Some text
1 More text
2 Text 
2 Follow up text
..
..
n

I want to convert the file to look like this:

1, sometext, more text
2, text , followup text

How can I do this in Python? I cannot keep the entire file in memory.

  • Are these already sorted by id? – user2390182 Jan 10 '18 at 15:30
  • yes sorted by ID –  Jan 10 '18 at 15:31
  • do you really mean to convert `Some text` to `sometext` , etc? You'll need to define rules on why `follow up text` got converted to `followup text`. but `more text` remains `more text` (or cleanup your example). ALSO, what have you tried? Good luck. – shellter Jan 10 '18 at 15:38
  • I need to clean up the example. I am clueless as to how to implement logic that keeps track of the current ID and the last ID, and outputs once the last line of an ID is reached. –  Jan 10 '18 at 15:43

1 Answer


You can use itertools.groupby to do something along the following lines:

from itertools import groupby
# from itertools import groupby, imap  # Python 2: map returns a list, so use imap

def tokens(line):
  # Split "id rest of line" into [id, text]; text is '' if the line holds only an id
  parts = line.strip().split(' ', 1)
  return [parts[0], parts[1].strip() if len(parts) > 1 else '']

with open('infile.txt', 'r') as fin, open('outfile.txt', 'w') as fout:
  # groupby only merges consecutive rows, so this relies on the file being sorted by id
  for k, g in groupby(map(tokens, fin), key=lambda t: t[0]):
  # for k, g in groupby(imap(tokens, fin), key=lambda t: t[0]):  # Py2
    fout.write(', '.join([k] + [x[1] for x in g]) + '\n')
    # not to be too silent
    print('Processing id: ' + k)
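
For comparison, here is a rough sketch of the same idea written as a manual loop that keeps track of the current id and emits a row whenever the id changes, which is the logic asked about in the comments. The function name `merge_sorted_rows` and the sample data are made up for illustration, and the sketch assumes the input is sorted by id with the id separated from the text by a space.

```python
def merge_sorted_rows(lines):
    """Yield one 'id, text, text, ...' string per run of consecutive equal ids."""
    current_id = None   # id of the group being collected
    texts = []          # text fragments seen so far for current_id
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        row_id, _, text = line.partition(' ')
        if row_id != current_id:
            # The id changed: the previous group is complete, so emit it.
            if current_id is not None:
                yield ', '.join([current_id] + texts)
            current_id, texts = row_id, []
        if text:
            texts.append(text.strip())
    # Emit the final group, which no later id change will flush.
    if current_id is not None:
        yield ', '.join([current_id] + texts)


# Small in-memory demo; an open file object can be passed in the same way.
sample = ["1 Some text\n", "1 More text\n", "2 Text \n", "2 Follow up text\n"]
for merged in merge_sorted_rows(sample):
    print(merged)
```

Since the function is a generator and only holds the text for one id at a time, passing it an open file and writing each yielded string plus '\n' to the output file should keep memory use small, provided no single id has an enormous number of rows.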
user2390182