I have a NumPy array (a 1826 × 5000 matrix) in which the rows are my samples and the columns are the features. In other words, each row is one genotype, with the individual nucleotides stored as strings, like this:
[['G' 'G' 'G' ... 'T' 'T' 'A']
['G' 'G' 'G' ... 'A' 'T' 'A']
['A' 'G' 'A' ... 'A' 'T' 'A']
...
['G' 'A' 'G' ... 'T' 'T' 'A']
['G' 'G' 'A' ... 'A' 'T' 'A']
['G' 'G' 'G' ... 'A' 'T' 'C']]
Only two different nucleotides appear in each column.
Now I would like to replace the strings with the numbers 0 and 2 so that, in each column, the more frequent nucleotide becomes 0 and the less frequent nucleotide becomes 2.
For example, in column one the "G" should be replaced by 0 and the "A" by 2, since "G" is the more frequent of the two.
The result should look like this:
[['0' '0' '0' ... '2' '0' '0']
['0' '0' '0' ... '0' '0' '0']
['2' '0' '2' ... '0' '0' '0']
...
['0' '2' '0' ... '2' '0' '0']
['0' '0' '2' ... '0' '0' '0']
['0' '0' '0' ... '0' '0' '2']]
Can someone tell me how to do this with the help of scikit-learn and NumPy functions?
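
To make the intended mapping concrete, here is a minimal sketch of one possible approach using only NumPy. The function name encode_minor_allele and the small example array are made up for illustration, and the sketch assumes that ties are acceptable to break alphabetically and that a column containing a single nucleotide should become all 0. It returns integers; if strings like in the output above are wanted, calling .astype(str) on the result should do it.

```python
import numpy as np

def encode_minor_allele(genotypes):
    """Hypothetical helper: per column, map the most frequent value to 0 and everything else to 2."""
    genotypes = np.asarray(genotypes)
    encoded = np.empty(genotypes.shape, dtype=np.int8)
    for j in range(genotypes.shape[1]):
        column = genotypes[:, j]
        # np.unique returns the sorted distinct values and how often each occurs
        values, counts = np.unique(column, return_counts=True)
        # argmax picks the first maximum, so ties go to the alphabetically first value
        major = values[np.argmax(counts)]
        encoded[:, j] = np.where(column == major, 0, 2)
    return encoded

# Tiny made-up example (4 samples, 3 features), not the real data
example = np.array([
    ["G", "G", "T"],
    ["G", "A", "T"],
    ["A", "G", "T"],
    ["G", "G", "C"],
])
result = encode_minor_allele(example)
print(result)
```

The loop runs once per column (5000 iterations for the matrix described above), which keeps the logic easy to follow; a fully vectorized version is possible but harder to read.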