
A while back, I asked a question about merging lines which have a common first field. Here's the original: Command line to match lines with matching first field (sed, awk, etc.)

Sample input:

a|lorem
b|ipsum
b|dolor
c|sit
d|amet
d|consectetur
e|adipisicing
e|elit

Desired output:

b|ipsum|dolor
d|amet|consectetur
e|adipisicing|elit

The idea is that if the first field matches, then the lines are merged. The input is sorted. The actual content is more complex, but uses the pipe as the sole delimiter.

The methods provided in the prior question worked well on my 0.5GB file, processing in ~16 seconds. However, my new file is approx 100x larger, and I prefer a method that streams. In theory, this will be able to run in ~30 minutes. The prior method failed to complete after running 24 hours.

Running on MacOS (i.e., BSD-type unix).

Ideas? [Note, the prior answer to the prior question was NOT a one-liner.]

some ideas
  • The accepted answer in the previous question streams. I'm not sure I could improve it much. – Barmar Jul 30 '15 at 16:31
  • In your last question you reported that you had an inexplicably undesirable result when you ran the script I posted (http://stackoverflow.com/a/18494009/1745001) and chalked it up to the data being more complex than what you had posted. Did you figure out what exactly was causing the problem? If not, not much point trying to tackle this one when we don't know what caused the last one to fail. – Ed Morton Jul 30 '15 at 16:31
  • You could try to join the file with itself and then filter out rows with duplicated values: join -1 1 -2 1 -t '|' test.txt test.txt (works only if there are exactly 2 rows with a common first field) – Vincenzo Petrucci Jul 30 '15 at 16:45
  • BTW I'd suggest using a scripting language: read the input one line at a time and build the output one line at a time. – Vincenzo Petrucci Jul 30 '15 at 16:54
  • In order to better optimize, answer some questions of the data such as: Is there a maximum number of lines between any two duplicates? Can there be more than two duplicates? Also, `mawk` is a version of `awk` optimized for speed, so using that should improve performance on a big file. – John B Jul 30 '15 at 17:42
  • max lines between two duplicates - unknown, but probably around 1k; more than 2 duplicates - absolutely, often dozens. – some ideas Jul 30 '15 at 17:50
  • Hi Barmar -- Ha!... you are right. The multiline script (which I had accepted as the answer) does stream. I was using this one, since it was concise, which does not stream: awk -F'|' '{a[$1]++;b[$1]=b[$1]FS$2}END{for(k in a)if(a[k]>1)print k b[k]}' ... I don't really understand why the latter does not. – some ideas Jul 30 '15 at 17:53
  • Based on that last comment from @MichaelDouma, this question is a duplicate of [Command line to match lines with matching first field (sed, awk, etc.)](http://stackoverflow.com/questions/18493326/command-line-to-match-lines-with-matching-first-field-sed-awk-etc) – Adam Katz Jul 30 '15 at 18:07
  • @MichaelDouma hey Michael, I might be missing something, but I think my answer will be a lot simpler and it bypasses checking results in the `END` statement. The only issue would be if you want the new concatenated results in one big file rather than separate files, although `cat *little_files > one_big_file` would solve that. – isosceleswheel Jul 30 '15 at 18:15
  • Hi Adam, No it technically isn't, since the 'solution' in the old question was not a one-liner. – some ideas Jul 30 '15 at 18:16
  • @MichaelDouma any `awk` program is a one liner if you want it to be ;) – isosceleswheel Jul 30 '15 at 18:19
  • @isosceleswheel, I don't (currently) need to process multiple files, and `cat` would solve that anyway as you note. – some ideas Jul 30 '15 at 18:21
  • @MichaelDouma no but if you use my command you will create a file for each of the indices so if you want to re-create a 50GB result file you would need to use this extra step. – isosceleswheel Jul 30 '15 at 18:24
  • @Barmar what does "stream" mean in this context? Does that just refer to not storing lots of data during the processing? – isosceleswheel Jul 30 '15 at 18:37
  • @isosceleswheel That's how I interpreted it. In particular, the memory usage is not a function of the total length of the input. – Barmar Jul 30 '15 at 18:41
  • @Barmar - By "stream" I meant that it immediately starts to return output, either to stdout or to the > output file, which could be tailed. Also, as you note, memory does not depend on the file size. – some ideas Jul 30 '15 at 18:43
  • @MichaelDouma That's how I would have normally interpreted it, but I thought you might have meant something else because you accepted the streaming answer. That was before you added the comment here admitting that you were using a different answer. – Barmar Jul 30 '15 at 19:06

2 Answers


You can append your results to a file on the fly so that you don't need to build a 50GB array (which I assume you don't have the memory for!). This command concatenates the joined fields for each distinct index into a string, which is written to a file named after the respective index, with some suffix.

EDIT: based on the OP's comment that the content may contain spaces, I would suggest using -F"|" instead of sub; the following answer is also designed to write to standard output.

(New) Code:

# split the file on the pipe using -F
# if index "i" is still $1 (and i exists) concatenate the string
# if index "i" is not $1 or doesn't exist yet, print current a
# (for the very first input line this prints a single blank line)
# afterwards, this prints the concatenated data for the previous index
# reset a for the new index and take the first data set
# set i to $1 each time
# END statement to print the single last string "a"
awk -F"|" '$1==i{a=a"|"$2}$1!=i{print a; a=$2}{i=$1}END{print a}' 

This builds up a string of data while within a given index, prints it out when the index changes, and then starts building the next string for the new index until that one ends, and so on.
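
As noted in the comments below, this prints every group (including single-line groups) and drops the first field. A minimal sketch of one possible adaptation, not part of the original answer, assuming the input is sorted and only groups with two or more lines should be printed, keeping the first field (YourFile is a placeholder for the input file):

# track how many lines share the current key; print only groups merged at least once
awk -F'|' '
  $1 != key { if (n > 1) print line; key = $1; line = $0; n = 1; next }
            { line = line "|" $2; n++ }
  END       { if (n > 1) print line }
' YourFile

On the sample input from the question this yields b|ipsum|dolor, d|amet|consectetur, and e|adipisicing|elit.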

isosceleswheel
  • Thanks. My content is more complex, and could have spaces before or after the pipe delimiter. I would process only one file. Any adjustments you might have are welcome. – some ideas Jul 30 '15 at 18:19
  • @MichaelDouma I see, well in that case the -F"|" should work and you can just drop the `sub` command I think... assuming there aren't OTHER pipes in the content later? – isosceleswheel Jul 30 '15 at 18:23
  • Can you write that in code so I am sure I get it right? I'll test it against my file, and report the speed vs. the larger non-one-liner. – some ideas Jul 30 '15 at 18:28
  • @MichaelDouma I updated the post. Basically remove `sub` completely and use `-F` as suggested by others. – isosceleswheel Jul 30 '15 at 18:29
  • thx! Will try it in a few minutes after the non-one-liner is done so I can semi-accurately compare speeds. – some ideas Jul 30 '15 at 18:31
  • Sure! I'm curious to know how the two methods stack up. – isosceleswheel Jul 30 '15 at 18:33
  • @MichaelDouma also one consideration is how you want to format the "results" that are concatenated, i.e. `a|X\nb|Y\na|Z` gives `a|XZ` and `b|Y` in my command, so you might want to add a delimiter e.g. `printf("%s", $2)` – isosceleswheel Jul 30 '15 at 18:40
  • What am I doing wrong with `cat tmp1 | awk -F"|" '{s=sprintf("%s", $1"SUFFIX"); printf("%s | ", $2) >> s; close(s)}' | head` which results with: *awk: can't open file* – some ideas Jul 30 '15 at 18:47
  • The file is closed. The command I posted writes to files; it doesn't produce any standard output. – isosceleswheel Jul 30 '15 at 18:51
  • Try this mini-example to see expected output: `echo -e "abc|tt hh 56\ndef|rtr 6\nabc|OK" | awk -F"|" '{ s=sprintf("%s", $1".txt"); printf("%s ", $2) >> s; close(s)}'` it should produce two files called "abc.txt" and "def.txt" – isosceleswheel Jul 30 '15 at 18:53
  • If you are interested, can you adapt this for stdin/out? I use this as part of a much longer series of pipes with sed, other awks, to extract data from a JSON file, and I process the output much more. – some ideas Jul 30 '15 at 18:53
  • I see your mini example. That does not help, unfortunately. The nature of my 50GB file is that it would make WAY too many subfiles. – some ideas Jul 30 '15 at 18:54
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/84732/discussion-between-isosceleswheel-and-michael-douma). – isosceleswheel Jul 30 '15 at 18:55
  • The output does not correspond to the OP's: "joined" lines with only 1 element should be filtered out but are printed here (and the first field is missing). Not too hard to adapt, but it needs a few more lines of code. – NeronLeVelu Jul 31 '15 at 08:05
0
sed '# label anchor for a jump
   :loop
# load a new line into the working buffer (so there are always 2 lines loaded after this)
   N
# check whether the 2 lines share the same starting pattern and join them if so
   /^\(\([^|]\)*\(|.*\)\)\n\2/ s//\1/
# if end of file, quit (and print the result)
   $ b
# if the lines were joined, cycle and redo with the next line (jump to :loop)
   t loop
# (no lines were joined here)
# if there are more than 2 elements on the first line, print the first line
   /.*|.*|.*\n/ P
# remove the first line (using the last search pattern)
   s///
# (if any modification) cycle (jump to :loop)
   t loop
# exit and print the working buffer
   ' YourFile
  • POSIX version (maybe --posix on Mac)
  • self-commented
  • assumes sorted input, no empty lines, and no pipes in the data (nor escaped ones)
  • use the unbuffered option -u for stream processing, if available
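
For reference, a minimal way to use this inside a longer pipeline (as the OP mentions doing), assuming the commands between the quotes above are saved in a file named merge.sed (a name chosen here for illustration):

# read from standard input as part of a pipeline; -f loads the sed commands from a file
printf 'a|lorem\nb|ipsum\nb|dolor\nc|sit\n' | sed -f merge.sed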
NeronLeVelu