
I have tens of millions of lines in a text file, like these:

aa kk
bb mm
cc tt
ee ff
aa xx
bb ss
cc gg
ee rr

And I want to make them look like:

aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr

I have tried to sort and rearrange them with grep, sed, and other tools, but it looks like a very slow approach on really huge files, even with

LC_ALL=C grep something

sflk
  • If you have something that works, you should include that. – Jay Kominek Jun 07 '15 at 19:42
  • You may want to add some information about the rule that determines how data is sorted/appended to go from "aa kk" to "aa kk,xx", so that readers can assist you. – Martin Noreke Jun 07 '15 at 19:43

4 Answers


I'm not clear whether you specifically want to do this with just standard shell tools or not, but Python is nearly universal on Linux these days. It can be done with a fairly simple program:

#!/usr/bin/python3

import sys

# Collect every value seen for each key, preserving the input order of values.
data = {}
for line in sys.stdin:
    a, b = line.split()
    data.setdefault(a, []).append(b)

# Emit the keys in sorted order, with each key's values joined by commas.
for k in sorted(data):
    print(k, ",".join(data[k]))
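
For example (the script and file names below are placeholders, not from the original post), save it as collate.py and run:

python3 collate.py < input.txt > output.txt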

I ran it on 50,000,000 lines of data generated by the following C program, and it finishes in about 60 seconds on my years-old laptop:

#include <stdio.h>
#include <stdlib.h>

/* Return a random lowercase letter ('a' through 'z'). */
char letter(void) { return (rand() % 26) + 'a'; }

int main(void)
{
  int i;
  for (i = 0; i < 50000000; i++)
    printf("%c%c%c %c%c%c\n",
           letter(), letter(), letter(),
           letter(), letter(), letter());
  return 0;
}
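
To reproduce the benchmark (again, file names are placeholders), compile the generator, run it, and feed the result to the script:

cc -O2 gen.c -o gen
./gen > input.txt
python3 collate.py < input.txt > output.txt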
Jay Kominek
awk '{ if ($1 in b) b[$1] = b[$1] ","; b[$1] = b[$1] $2 } END { for (i in b) print i, b[i] }' file

Output:

aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr

Source: https://stackoverflow.com/a/26450166/3776858
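
One caveat: awk's for (i in b) visits keys in an unspecified order, so the sorted output shown above is not guaranteed. If you need the keys ordered, the simplest fix is to sort the result:

awk '{ if ($1 in b) b[$1] = b[$1] ","; b[$1] = b[$1] $2 } END { for (i in b) print i, b[i] }' file | sort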

Cyrus

For the performance- and memory-conscious:

sort -u YourFile | awk '{if (Last == $1) {Linked=Linked","$2} else { if (Last != "") print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'

The initial sort -u deduplicates and orders the lines, which lets awk process the file line by line instead of loading a huge array (important given the millions of lines you mention; note that sort -u also collapses repeated identical pairs, so use plain sort if duplicates must be kept). The awk concatenates values while the key matches the previous line and prints the group when the key changes. The END block prints the last group, and the inner if handles the first line.

Maybe a bit faster:

sort -u YourFile | awk 'FNR==1{Last=$1;Linked=$2} FNR>1{if (Last == $1) {Linked=Linked","$2} else { print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'
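
To check which variant is actually faster on your data, time each pipeline with its output discarded, e.g. with bash's time keyword:

time (sort -u YourFile | awk 'FNR==1{Last=$1;Linked=$2} FNR>1{if (Last == $1) {Linked=Linked","$2} else { print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}' > /dev/null)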
NeronLeVelu

If you have to deal with very large data sets, I suggest you use the MapReduce pattern, for example with the Hadoop framework or Spark. Take a look at https://hadoop.apache.org
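
A minimal sketch of how that could look with Hadoop Streaming (the jar name, HDFS paths, and reduce.sh are my assumptions, not part of this answer). With an identity mapper, streaming sorts whole lines, which groups equal first fields together just as sort(1) does, so reduce.sh can simply wrap an awk script like the one in the previous answer:

hadoop jar hadoop-streaming.jar \
    -input /data/pairs.txt \
    -output /data/grouped \
    -mapper cat \
    -reducer reduce.sh \
    -file reduce.sh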

Ashouri