Replace repeated elements in a list with unique identifiers

Question

I have a list like the below:

1 . Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 . Sam 3 4 56 6 89
3 . Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 . Pig 2 5 67 2 21

(except the real list is 40 million lines long).

There are repeated elements in the second column (i.e. the ".").

I want to replace these with unique identifiers (e.g. ".1", ".2", ".3", ..., ".n").

I tried to do this with a bash loop / sed combination, but it didn't work...

Failed attempt:

for i in 1..4
  do
    sed -i "s_//._//."$i"_"$i""
  done

(Essentially, I was trying to get sed to replace each nth "." with ".n", but this didn't work.)

score 5 · Accepted Answer · answered Jan 24 '14 at 17:21

Here's a way to do it with awk (assuming your file is called input):

$ awk '$2=="."{$2="."++counter}{print}' input 
1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .2 Sam 3 4 56 6 89
3 .3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .4 Pig 2 5 67 2 21

If the second column ($2) is exactly ., the awk program replaces it with the string formed by concatenating . and a pre-incremented counter (++counter). It then prints every line, whether or not $2 was modified ({print}).
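
To try the one-liner without a separate file, it can be wrapped in a small function that reads standard input. The sketch below does that and adds a rough check for duplicated dot-identifiers in the result. The function names number_dots and dup_dot_ids are made up for illustration and are not part of the original answer.

```shell
# Hypothetical wrapper around the awk one-liner: reads rows on stdin,
# writes them to stdout with each lone "." in column 2 numbered.
number_dots() {
  awk '$2 == "." { $2 = "." (++counter) } { print }'
}

# Hypothetical check: print every column-2 value that starts with "."
# and occurs more than once. No output means no duplicates were found.
dup_dot_ids() {
  awk '$2 ~ /^\./ { print $2 }' | sort | uniq -d
}
```

For example, `number_dots < input | dup_dot_ids` should print nothing if every identifier is unique.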

Plain bash alternative:

c=1
while read -r a b line ; do
  if [ "$b" == "." ] ; then
    echo "$a ."$((c++))" $line"
  else
    echo "$a $b $line"
  fi
done < input

I have no idea how to do this with `sed` . Some of the answers [here](http://stackoverflow.com/questions/12496717/sed-replace-pattern-with-line-number) might help though. — Mat, Jan 24 '14 at 17:32

grebneke · Answer 2 · 2014-01-25T09:11:41.710

Since your question is tagged sed and bash, here are a few examples for completeness.

Bash only

Use parameter expansion. The second column will be unique, but not sequential, because the counter is incremented on every line, whether or not the line contains a .:

i=1; while read line; do echo ${line/\./.$((i++))}; done < input

1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .3 Sam 3 4 56 6 89
3 .4 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .6 Pig 2 5 67 2 21
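
The numbering above skips values because the counter advances on every line. A minimal sketch of a variant that only advances it on lines containing a dot, still using nothing but shell built-ins, could look like the following. The function name number_dots_sh is hypothetical, and the sketch assumes the first . on a line is the one in the second column, as in the sample data.

```shell
# Hypothetical helper: reads rows on stdin, writes them to stdout.
# The body is a subshell ( ... ) so that i and line do not leak
# into the calling shell.
number_dots_sh() (
  i=0
  while IFS= read -r line; do
    case $line in
      *.*)
        # Advance the counter only when the line contains a dot.
        i=$((i + 1))
        # Text before the first dot, then ".$i", then the text after it.
        line="${line%%.*}.$i${line#*.}"
        ;;
    esac
    printf '%s\n' "$line"
  done
)
```

It can be run as `number_dots_sh < input`. Unlike the unquoted echo in the one-liner above, printf with a quoted variable leaves the spacing of each line unchanged.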

Bash + sed

sed cannot increment variables, so the counting has to be done externally.

For each line, increment $i if the line contains a ., then let sed append $i after the first . on that line.

i=0
while read line; do
    [[ $line == *.* ]] && i=$((i+1))
    sed "s#\.#.$i#" <<<"$line"
done < input

Output:

1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .2 Sam 3 4 56 6 89
3 .3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .4 Pig 2 5 67 2 21

krishna murti · Answer 3 · 2014-01-25T14:13:30.627


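You can use this command, which replaces each . with the value of a counter that increments on every line (the dot itself is dropped):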

awk '{gsub(/\./,c++);print}' filename

Output:

1 0 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 2 Sam 3 4 56 6 89
3 3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 5 Pig 2 5 67 2 21

edited Jan 25 '14 at 14:13

answered Jan 25 '14 at 06:57

