What is a fast and succinct way to remove duplicates from within a line?

I have a file in the following format:

alpha • a | b | c | a | b | c | d
beta • h | i | i | h | i | j | k
gamma •  m | n | o
delta • p | p | q | r | s | q

So there's a headword in column 1, and then various words delimited by pipes, with an unpredictable amount of duplication. The desired output has the dupes removed, as:

alpha • a | b | c | d
beta • h | i | j | k
gamma •  m | n | o
delta • p | q | r | s 

My input file is a few thousand lines. The Greek names above correspond to category names (e.g., "baseball"), and the letters correspond to English dictionary words (which might contain spaces or accents), e.g. "ball game | batter | catcher | catcher | designated hitter".

This could be programmed many ways, but I suspect there's a smart way to do it. I encounter variations of this scenario a lot, and wonder if there's a concise and elegant solution. I am using macOS, so a few fancy Unix options are not available.

Bonus complexity: I often have a comment at the end that should be retained, e.g.,

zeta • x | y | x | z | z ; comment here

P.S. this input is actually the output of a prior StackOverflow question: Command line to match lines with matching first field (sed, awk, etc.)

some ideas
  • So you have three delimiters, the middle dot or bullet, the pipes, and (sometimes) a semicolon. Do these symbols ever appear except as delimiters? Is it crucial that the names are in alphabetic order after being uniquified? – Jonathan Leffler Jul 02 '15 at 19:14
  • My example happens to be sorted, but the real input is not sorted. Those three delimiters (•,|,;) ONLY appear in the field delimiters. The order of the output is flexible (could be same as input or sorted). – some ideas Jul 02 '15 at 19:16

3 Answers

BSD awk does not have the built-in sort functions that GNU awk has, but I'm not sure they're necessary. The bullet, • (U+2022), is a multi-byte character in UTF-8, which causes some grief with awk.

I suggest pre-processing the bullet to a single-byte character. I chose @, but you could use Control-A or something else if you prefer. I assume your data is in a file called data. I note that there is a double space before m in the gamma line; I'm assuming that isn't significant.

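sed 's/•/@/' data |
awk -F ' *[@|] *' '
{
    # Reset per-line state: seen names, comment parts, and unique fields.
    delete names;
    delete comments;
    delete fields;
    # A trailing "; comment" ends up in the last field; split it off.
    if ($NF ~ / *;/) { split($NF, comments, / *; */); $NF = comments[1]; }
    j = 1;
    # Field 1 is the headword; keep each later field only the first time it is seen.
    for (i = 2; i <= NF; i++)
    {
        if (names[$i]++ == 0)
            fields[j++] = $i;
    }
    printf("%s", $1);
    delim = "•";
    for (k = 1; k < j; k++)
    {
        printf(" %s %s", delim, fields[k]);
        delim = "|";
    }
    # Re-attach the comment, if there was one.
    if (comments[2] != "")
        printf(" ; %s", comments[2]);
    printf("\n");
}'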

Running this yields:

alpha • a | b | c | d
beta • h | i | j | k
gamma • m | n | o
delta • p | q | r | s
zeta • x | y | z ; comment here
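
A variation that skips the sed step can be sketched in awk alone. Instead of using the bullet as a field separator, it locates " • " with index() and cuts the line with substr(), so it does not matter whether the awk in use counts bytes or characters. This is only a sketch under the assumptions stated in the question (the bullet, pipe, and semicolon appear only as delimiters); the function name dedupe_fields is made up for illustration.

```shell
# dedupe_fields: hypothetical helper; reads lines on stdin, writes deduplicated lines.
dedupe_fields() {
    awk '
    {
        line = $0
        comment = ""
        # Split off an optional trailing "; comment".
        p = index(line, ";")
        if (p > 0) {
            comment = substr(line, p + 1)
            line = substr(line, 1, p - 1)
            sub(/^ +/, "", comment)
        }
        # Separate the headword from the word list at the first bullet.
        sep = " • "
        p = index(line, sep)
        if (p == 0) { print; next }     # no bullet: pass the line through unchanged
        head = substr(line, 1, p - 1)
        n = split(substr(line, p + length(sep)), words, "|")
        split("", seen)                 # portable way to clear an array
        out = ""
        for (i = 1; i <= n; i++) {
            w = words[i]
            gsub(/^ +| +$/, "", w)      # trim surrounding spaces
            if (w != "" && !(w in seen)) {
                seen[w] = 1
                if (out != "") out = out " | "
                out = out w
            }
        }
        printf "%s%s%s", head, sep, out
        if (comment != "") printf " ; %s", comment
        printf "\n"
    }'
}
```

It would be run as dedupe_fields < data. Words keep their order of first appearance, and the double space before m in the gamma line is normalized to a single space.
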
Jonathan Leffler
  • Appears to work great, thanks! I changed my first delimiter to a "@", omitting the need for the sed. Maybe it's not a one-liner, but it's clean and better than a more complex little program I had in mind. – some ideas Jul 02 '15 at 20:36

With bash (version 4 or later, which readarray requires), sort, xargs, sed:

while IFS='•;' read -r a b c; do
  IFS="|" read -ra array <<< "$b"
  array=( "${array[@]# }" )
  array=( "${array[@]% }" )
  readarray -t array < <(printf '%s\0' "${array[@]}" | sort -zu | xargs -0n1)
  SAVE_IFS="$IFS"; IFS="|"
  s="$a• ${array[*]}"
  [[ $c != "" ]] && s="$s ;$c"
  sed 's/|/ | /g' <<< "$s"
  IFS="$SAVE_IFS"
done < file

Output:

alpha • a | b | c | d
beta • h | i | j | k
gamma •  m | n | o
delta • p | q | r | s
zeta • x | y | z ; comment here

I suppose the two spaces before "m" are a typo.

Cyrus
  • yeah, the 2 spaces before the "m" are a typo, and I could clean my content to exactly match my demo. – some ideas Jul 02 '15 at 20:15
  • Thanks for writing this. I'm always a little afraid of straight bash scripts in case there's something wrong with my input and I accidentally rm -r /. I didn't test it, since the answer from Jonathan works, but thanks. – some ideas Jul 02 '15 at 20:37

This might work for you (GNU sed):

sed  'h;s/.*• \([^;]*\).*/cat <<\\! | sort -u |\1|!/;s/\s*|\s*/\n/2ge;s/\n/ | /g;G;s/^\(.*\)\n\(.*• \)[^;]*/\2\1/;s/;/ &/' file

The idea is to remove the head and tail of each line, turn the remaining data into a mini file, use standard utilities to sort it and remove duplicates, and then put the line back together again.

A copy of the line is first saved in the hold space. The id and comment are then removed from the pattern space, and the data is munged into a shell command that uses cat with the bash here-document syntax, piped through sort (and uniq if your sort does not come equipped with the -u option). The pattern space is evaluated as that command (the e flag of the s command, a GNU extension), and the line is reassembled by appending the original line to the pattern space and using regexp pattern matching.
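
Where GNU sed is not available (for example with the stock macOS tools), the same head/data/tail idea can be sketched as a plain shell loop: peel off the headword and comment with parameter expansion, put one word per line, let sort -u remove the duplicates, and glue the result back together. This is a rough sketch, not a translation of the one-liner above; the function name dedupe_sorted is made up for illustration, and it assumes the three delimiters appear only as delimiters.

```shell
# dedupe_sorted: hypothetical helper; reads lines on stdin, writes deduplicated lines.
dedupe_sorted() {
    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            *" • "*) ;;                                  # has a word list
            *) printf '%s\n' "$line"; continue ;;         # pass anything else through
        esac
        head=${line%%" • "*}        # text before the first bullet
        rest=${line#*" • "}         # text after it
        comment=
        case $rest in
            *";"*) comment=${rest#*";"}; rest=${rest%%";"*} ;;
        esac
        # One word per line, trimmed; sort -u drops duplicates; rejoin with pipes.
        # LC_ALL=C makes sort compare bytes, so only exact duplicates are merged.
        words=$(printf '%s\n' "$rest" | tr '|' '\n' |
                sed -e 's/^ *//' -e 's/ *$//' -e '/^$/d' |
                LC_ALL=C sort -u | paste -s -d '|' - | sed 's/|/ | /g')
        if [ -n "$comment" ]; then
            printf '%s • %s ;%s\n' "$head" "$words" "$comment"
        else
            printf '%s • %s\n' "$head" "$words"
        fi
    done
}
```

It would be run as dedupe_sorted < file. The words come out in sorted (byte) order rather than input order, and several processes are started per line, so for large inputs an awk approach is likely to be faster.
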

potong