
I have tens of millions of lines in a text file, like these:

aa kk
bb mm
cc tt
ee ff
aa xx
bb ss
cc gg
ee rr

And I want to make them look like:

aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr

I have tried to sort and rearrange them with grep, sed, and other tools, but it looks like a very slow approach on really huge files, even with

LC_ALL=C grep something

sflk
  • If you have something that works, you should include that. – Jay Kominek Jun 07 '15 at 19:42
  • You may want to add some information about the rule that determines how data is sorted/appended to go from "aa kk" to "aa kk,xx", so that readers can assist you. – Martin Noreke Jun 07 '15 at 19:43

4 Answers


I'm not clear whether you specifically want to do this with just standard shell tools or not, but Python is nearly universal on Linux these days. It can be done with a fairly simple program:

#!/usr/bin/python3

import sys

# Collect every value seen for each key, preserving the input order of values.
data = {}
for line in sys.stdin:
    a, b = line.split()
    data.setdefault(a, []).append(b)

# Emit the keys in sorted order, with each key's values joined by commas.
for k in sorted(data):
    print(k, ",".join(data[k]))
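
For example (the script and file names below are placeholders, not from the original post), save it as collate.py and run:

python3 collate.py < input.txt > output.txt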

I ran it on 50,000,000 lines of data generated by the following C program, and it finishes in about 60 seconds on my years-old laptop:

#include <stdio.h>
#include <stdlib.h>

/* Return a random lowercase letter ('a' through 'z'). */
char letter(void) { return (rand() % 26) + 'a'; }

int main(void)
{
  int i;
  for (i = 0; i < 50000000; i++)
    printf("%c%c%c %c%c%c\n",
           letter(), letter(), letter(),
           letter(), letter(), letter());
  return 0;
}
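
To reproduce the benchmark (again, file names are placeholders), compile the generator, run it, and feed the result to the script:

cc -O2 gen.c -o gen
./gen > input.txt
python3 collate.py < input.txt > output.txt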
Jay Kominek
awk '{ if ($1 in b) b[$1] = b[$1] ","; b[$1] = b[$1] $2 } END { for (i in b) print i, b[i] }' file

Output:

aa kk,xx
bb mm,ss
cc tt,gg
ee ff,rr

Source: https://stackoverflow.com/a/26450166/3776858
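
One caveat: awk's for (i in b) visits keys in an unspecified order, so the sorted output shown above is not guaranteed. If you need the keys ordered, the simplest fix is to sort the result:

awk '{ if ($1 in b) b[$1] = b[$1] ","; b[$1] = b[$1] $2 } END { for (i in b) print i, b[i] }' file | sort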

Cyrus

For the performance- and memory-conscious:

sort -u YourFile | awk '{if (Last == $1) {Linked=Linked","$2} else { if (Last != "") print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'

The initial sort -u deduplicates and orders the lines, which lets awk process the file line by line instead of loading a huge array (important given the millions of lines you mention; note that sort -u also collapses repeated identical pairs, so use plain sort if duplicates must be kept). The awk concatenates values while the key matches the previous line and prints the group when the key changes. The END block prints the last group, and the inner if handles the first line.

Maybe a bit faster:

sort -u YourFile | awk 'FNR==1{Last=$1;Linked=$2} FNR>1{if (Last == $1) {Linked=Linked","$2} else { print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}'
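
To check which variant is actually faster on your data, time each pipeline with its output discarded, e.g. with bash's time keyword:

time (sort -u YourFile | awk 'FNR==1{Last=$1;Linked=$2} FNR>1{if (Last == $1) {Linked=Linked","$2} else { print Last " " Linked; Last=$1;Linked=$2}} END{print Last " " Linked}' > /dev/null)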
NeronLeVelu

If you have to deal with very large data sets, I suggest you use the MapReduce pattern, for example with the Hadoop framework or Spark. Take a look at https://hadoop.apache.org
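
A minimal sketch of how that could look with Hadoop Streaming (the jar name, HDFS paths, and reduce.sh are my assumptions, not part of this answer). With an identity mapper, streaming sorts whole lines, which groups equal first fields together just as sort(1) does, so reduce.sh can simply wrap an awk script like the one in the previous answer:

hadoop jar hadoop-streaming.jar \
    -input /data/pairs.txt \
    -output /data/grouped \
    -mapper cat \
    -reducer reduce.sh \
    -file reduce.sh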

Ashouri