What's the most efficient way to convert a factor vector (in which the same level can occur more than once) into a numeric vector in bash? The values in the numeric vector do not matter as long as each one represents a unique level of the factor.

To illustrate, this would be the R equivalent to what I want to do in bash:

numeric<-seq_along(levels(factor))[factor]

I.e.:

factor

AV1019A
ABG1787
AV1019A
B77hhA
B77hhA

numeric

1
2
1
3
3

Many thanks.

1 Answer

This is most probably not the most efficient way, but it may be something to start with.

#!/bin/bash

input_data=$( mktemp )
map_file=$( mktemp )

# your example written to a file
printf '%s\n' AV1019A ABG1787 AV1019A B77hhA B77hhA > "$input_data"

# create a map <numeric, factor> and write it to a file
# (relies on word splitting, so factor values must not contain whitespace)
idx=0
for factor in $( sort -u "$input_data" )
do
    echo "$idx $factor"
    idx=$(( idx + 1 ))
done > "$map_file"

# go through your file again and replace values with keys
while IFS= read -r line
do
    # exact string comparison on the second column, so regex
    # characters in a value cannot cause false matches
    awk -v factor="$line" '($2 "") == factor { print $1 }' "$map_file"
done < "$input_data"

# cleanup
rm -f "$input_data" "$map_file"
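
The loop above starts a new process for every input line, which gets slow on large inputs. As a rough sketch of a single-pass alternative, the same mapping can be built inside one awk process. The function name factor_to_codes_awk is made up for this illustration, and the codes are numbered by first appearance rather than by sorted order, which is fine as long as the actual values do not matter.

```shell
#!/bin/bash

# Hypothetical helper: reads one factor value per line on stdin and
# prints one integer code per line, numbered by first appearance.
factor_to_codes_awk() {
    awk '
        !($0 in code) { code[$0] = ++n }   # first time this level is seen
        { print code[$0] }                 # emit the code for every line
    '
}

# the example from the question
printf '%s\n' AV1019A ABG1787 AV1019A B77hhA B77hhA | factor_to_codes_awk
```

For the example input this prints 1, 2, 1, 3, 3, one number per line. No temporary files are needed because awk keeps the map in memory.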

I initially wanted to use associative arrays, but they are a bash 4+ feature and are not available everywhere. The version below uses an ordinary indexed array instead and needs one file less, which is obviously more efficient.

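#!/bin/bash

# your example written to a file
input_data=$( mktemp )
printf '%s\n' AV1019A ABG1787 AV1019A B77hhA B77hhA > "$input_data"

# declare an indexed array holding the unique factor values
# (relies on word splitting, so values must not contain whitespace)
declare -a factor_map=( $( sort -u "$input_data" ) )

# go through your file and replace values with keys:
# the value is replaced by "/" in the joined array, and the number of
# words before that "/" is the index of the value
# (assumes no value is a substring of another value and none contains "/")
while IFS= read -r line
do
    echo "${factor_map[@]/$line//}" | cut -d/ -f1 | wc -w | tr -d ' '
done < "$input_data"

# cleanup
rm -f "$input_data"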
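
For completeness, here is a minimal sketch of what the associative-array idea could look like when bash 4 or newer is available. The function name factor_to_codes is made up for this illustration, codes are numbered by first appearance, and the sketch assumes the input has no empty lines (an empty string is not a valid associative-array key).

```shell
#!/bin/bash
# Requires bash 4 or newer for associative arrays.

# Hypothetical helper: reads one factor value per line on stdin and
# prints its integer code, numbered by first appearance.
factor_to_codes() {
    local -A code=()        # level -> integer code
    local next=0 line
    while IFS= read -r line; do
        # assign the next free code the first time a level shows up
        if [[ -z ${code[$line]+set} ]]; then
            code[$line]=$(( ++next ))
        fi
        printf '%s\n' "${code[$line]}"
    done
}

# the example from the question
printf '%s\n' AV1019A ABG1787 AV1019A B77hhA B77hhA | factor_to_codes
```

For the example input this prints 1, 2, 1, 3, 3, one number per line. The lookup happens inside the shell itself, so no temporary file and no extra process per line are needed.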