2

I have a list like the below:

1 . Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 . Sam 3 4 56 6 89
3 . Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 . Pig 2 5 67 2 21

(except the real list is 40 million lines long).

There are repeated elements in the second column (i.e. the ".")

I want to replace these with unique identifiers (e.g. ".1", ".2", ".3"...".n")

I tried to do this with a bash loop / sed combination, but it didn't work...

Failed attempt:

for i in 1..4
  do
    sed -i "s_//._//."$i"_"$i""
  done 

(Essentially, I was trying to get sed to replace the nth "." with ".n", but this didn't work.)

Joni

3 Answers

5

Here's a way to do it with awk (assuming your file is called input):

$ awk '$2=="."{$2="."++counter}{print}' input 
1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .2 Sam 3 4 56 6 89
3 .3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .4 Pig 2 5 67 2 21

The awk program checks whether the second column ($2) is exactly "."; if so, it replaces it with a string formed by concatenating "." and a pre-incremented counter (++counter). It then prints every line, whether or not $2 was modified ({print}).
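For readability, the same one-liner can be spelled out with comments. This is only a sketch: the printf line just recreates the sample data from the question in a file called input, and the file name output is a hypothetical choice for illustration. One side effect worth knowing: assigning to $2 makes awk rebuild the line with single spaces (the default output field separator), so original tabs or runs of spaces are not preserved on the modified lines.

```shell
# Recreate the six sample lines from the question in a file called "input".
printf '%s\n' \
  '1 . Fred 1 6 78 8 09' \
  '1 1 Geni 1 4 68 9 34' \
  '2 . Sam 3 4 56 6 89' \
  '3 . Flit 2 4 56 8 34' \
  '3 4 Dog 2 5 67 8 78' \
  '3 . Pig 2 5 67 2 21' > input

# "output" is a hypothetical file name used only for this example.
awk '
  # Only lines whose second field is exactly "." get renumbered.
  $2 == "." {
      counter++            # unset counter counts as 0, so the first id is 1
      $2 = "." counter     # string concatenation: ".1", ".2", ...
  }
  { print }                # print every line, modified or not
' input > output

cat output
```

Because awk streams the file in a single process, this form should scale to the 40-million-line file far better than a shell loop.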

Plain bash alternative:

c=1
while read -r a b line ; do
  if [ "$b" == "." ] ; then
    echo "$a ."$((c++))" $line"
  else
    echo "$a $b $line"
  fi
done < input
Mat
  • I have no idea how to do this with `sed`. Some of the answers [here](http://stackoverflow.com/questions/12496717/sed-replace-pattern-with-line-number) might help though. – Mat Jan 24 '14 at 17:32
1

Since your question is tagged sed and bash, here are a few examples for completeness.

Bash only

Use parameter expansion. The identifiers in the second column will be unique but not sequential, because i is incremented on every line, whether or not the line contains a ".":

i=1; while read -r line; do echo "${line/\./.$((i++))}"; done < input

1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .3 Sam 3 4 56 6 89
3 .4 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .6 Pig 2 5 67 2 21
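If sequential identifiers are wanted with parameter expansion alone, one option is to test for a match before incrementing. The following is a minimal sketch, not part of the original answer: number_dots is a hypothetical helper name, and it assumes the "." always appears as a standalone, space-delimited field (only the first such occurrence on a line is replaced).

```shell
# Hypothetical helper: reads lines on stdin and numbers the first
# standalone " . " on each matching line as .1, .2, .3, ...
number_dots() {
  i=0
  while IFS= read -r line; do
    # Increment only when the line really contains a standalone ".".
    if [[ $line == *" . "* ]]; then
      line="${line/ . / .$((++i)) }"
    fi
    printf '%s\n' "$line"
  done
}

number_dots <<'EOF'
1 . Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 . Sam 3 4 56 6 89
3 . Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 . Pig 2 5 67 2 21
EOF
```

With the sample data this should give .1, .2, .3 and .4 on the four matching lines. It is still a line-by-line shell loop, so expect it to be slow on a very large file.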

Bash + sed

sed cannot increment variables, so the counting has to be done externally.

For each line, increment $i if the line contains a ".", then let sed append $i after the "."

i=0
while read -r line; do
    [[ $line == *.* ]] && i=$((i+1))
    sed "s#\.#.$i#" <<<"$line"
done < input

Output:

1 .1 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 .2 Sam 3 4 56 6 89
3 .3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 .4 Pig 2 5 67 2 21
grebneke
0

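You can use this command. Note that it replaces each "." with a bare counter value (no leading "."), and that the counter advances on every line, whether or not the line contains a ".":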

awk '{gsub(/\./,c++);print}' filename

Output:

1 0 Fred 1 6 78 8 09
1 1 Geni 1 4 68 9 34
2 2 Sam 3 4 56 6 89
3 3 Flit 2 4 56 8 34
3 4 Dog 2 5 67 8 78
3 5 Pig 2 5 67 2 21
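To get the ".n" form asked for in the question with a substitution-style awk command, one possible variation relies on the fact that sub() returns the number of replacements it made. This is a sketch under assumptions, not the answerer's command: dot_ids is a hypothetical wrapper name, and it replaces the first "." anywhere on a line, so it assumes no other field contains a ".".

```shell
# Hypothetical wrapper around awk; reads stdin or the files given as arguments.
dot_ids() {
  # sub() returns 1 if it replaced something, so the counter c only
  # advances on lines that actually contained a ".".
  awk 'sub(/\./, "." (c + 1)) { c++ } { print }' "$@"
}

printf '%s\n' \
  '1 . Fred 1 6 78 8 09' \
  '1 1 Geni 1 4 68 9 34' \
  '2 . Sam 3 4 56 6 89' | dot_ids
# prints:
# 1 .1 Fred 1 6 78 8 09
# 1 1 Geni 1 4 68 9 34
# 2 .2 Sam 3 4 56 6 89
```

Unlike assigning to $2, sub() edits the line in place, so the original spacing between fields is left untouched.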
krishna murti