Can anyone tell me how to replace strings with floats in an np.array (of several genotypes) by frequency per column?

Question

I have a NumPy array (1826 × 5000) where the rows are my samples and the columns are the features. Each row is a genotype, with the individual nucleotides stored as strings, like this:

[['G' 'G' 'G' ... 'T' 'T' 'A']
 ['G' 'G' 'G' ... 'A' 'T' 'A']
 ['A' 'G' 'A' ... 'A' 'T' 'A']
 ...
 ['G' 'A' 'G' ... 'T' 'T' 'A']
 ['G' 'G' 'A' ... 'A' 'T' 'A']
 ['G' 'G' 'G' ... 'A' 'T' 'C']]

And only two different nucleotides appear in each column.

Now I would like to replace the individual strings with the numbers 0 and 2 in such a way that in each column the nucleotide that occurs more frequently gets the number 0 and the nucleotide that occurs less frequently gets the number 2.

This means that in column one the "G" should be replaced by 0 and the "A" by 2 since the "G" is more frequent.

The result should look like this:

[['0' '0' '0' ... '2' '0' '0']
 ['0' '0' '0' ... '0' '0' '0']
 ['2' '0' '2' ... '0' '0' '0']
 ...
 ['0' '2' '0' ... '2' '0' '0']
 ['0' '0' '2' ... '0' '0' '0']
 ['0' '0' '0' ... '0' '0' '2']]

Can someone tell me how to do this (with the help of Sklearn and Numpy functions)?

Alaa M. · Answer 1 · 2022-07-09T10:43:42.087

You can do this using the pandas rank function:

import numpy as np
import pandas as pd
from collections import Counter


a = np.array([['G', 'G', 'G', 'T', 'T', 'A'],
              ['G', 'G', 'G', 'A', 'T', 'A'],
              ['A', 'G', 'A', 'A', 'T', 'A'],
              ['G', 'A', 'G', 'T', 'T', 'A'],
              ['G', 'G', 'A', 'A', 'T', 'A'],
              ['G', 'G', 'G', 'A', 'T', 'C']])



df = pd.DataFrame(a)
print('Before:')
print(df)
print('After:')
# get frequency of the values in each column
cnt = df.apply(Counter)
# perform `rank` on the frequencies by column
print(df.replace(cnt).rank(axis=0, method='dense', ascending=False).replace(1, 0).astype(int))

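The replace(1, 0) is needed because the rank function assigns consecutive values (1, 2, 3, ...). With method='dense' and ascending=False, the more frequent nucleotide gets rank 1 and the less frequent one gets rank 2, so for your case the number 1 needs to be replaced with 0.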

Output:

Before:
   0  1  2  3  4  5
0  G  G  G  T  T  A
1  G  G  G  A  T  A
2  A  G  A  A  T  A
3  G  A  G  T  T  A
4  G  G  A  A  T  A
5  G  G  G  A  T  C

After:
   0  1  2  3  4  5
0  0  0  0  2  0  0
1  0  0  0  0  0  0
2  2  0  2  0  0  0
3  0  2  0  2  0  0
4  0  0  2  0  0  0
5  0  0  0  0  0  2
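
To make the intermediate steps visible, here is a minimal sketch of the same idea on a single column. The variable names (col, freq, ranks, encoded) are only illustrative, and the counts are passed to map as a plain dict:

```python
from collections import Counter

import pandas as pd

# One column of the example array (column 0).
col = pd.Series(['G', 'G', 'A', 'G', 'G', 'G'])

# Step 1: count how often each nucleotide occurs in the column.
counts = Counter(col)

# Step 2: replace every nucleotide by its own frequency.
freq = col.map(dict(counts))

# Step 3: dense rank in descending order, so the highest frequency
# gets rank 1 and the next distinct frequency gets rank 2.
ranks = freq.rank(method='dense', ascending=False)

# Step 4: turn rank 1 into 0 and keep rank 2 as 2.
encoded = ranks.replace(1, 0).astype(int)
print(encoded.tolist())
```

Note that if both nucleotides occur equally often in a column, they share the same frequency and therefore the same rank, so the whole column becomes 0.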

SupremeSavageMao · Accepted Answer · 2022-07-08T16:46:40.990

Given an array arr, the easiest way of solving it is:

import numpy as np
import pandas as pd

df = pd.DataFrame(arr)
for column in df:
    # mode()[0] is the most frequent value in the column:
    # it becomes "0", every other value becomes "2"
    df[column] = np.where(df[column] == df[column].mode()[0], "0", "2")
arr1 = df.to_numpy()

Explanation: First, you turn the array into a pandas DataFrame. Then, for each column, you replace the mode (the most frequent value) with "0" and the other values with "2", as the question asks. Finally, you convert the DataFrame back into an array named arr1.
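
As a quick check, here is a self-contained sketch of the mode-based approach, run on the small 6 × 6 example array used elsewhere in this thread. It maps the most frequent nucleotide of each column to "0" and the other one to "2", as the question asks. The names encoded and most_common are illustrative only:

```python
import numpy as np
import pandas as pd

arr = np.array([['G', 'G', 'G', 'T', 'T', 'A'],
                ['G', 'G', 'G', 'A', 'T', 'A'],
                ['A', 'G', 'A', 'A', 'T', 'A'],
                ['G', 'A', 'G', 'T', 'T', 'A'],
                ['G', 'G', 'A', 'A', 'T', 'A'],
                ['G', 'G', 'G', 'A', 'T', 'C']])

df = pd.DataFrame(arr)
encoded = df.copy()
for column in df:
    # Most frequent value of this column.
    most_common = df[column].mode()[0]
    # "0" where the value is the most frequent one, "2" otherwise.
    encoded[column] = np.where(df[column] == most_common, "0", "2")

arr1 = encoded.to_numpy()
print(arr1)
```

One thing to keep in mind: mode() returns the tied values in sorted order, so when both nucleotides are equally frequent, the alphabetically first one is treated as the most frequent.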

Mechanic Pig · Answer 3 · 2022-07-08T14:55:57.193

Since each column of your array contains only two distinct elements, a simple approach is to compare every row with the first row, count per column how often the first row's element occurs, and use that count to decide whether the first row's element is the more frequent one. Where it is not, invert the comparison result. Finally, use np.where to generate the result:

>>> rand
array([['G', 'C'],
       ['G', 'T'],
       ['G', 'C'],
       ['G', 'T'],
       ['A', 'T'],
       ['G', 'C'],
       ['G', 'C'],
       ['A', 'C'],
       ['G', 'T'],
       ['G', 'T']], dtype='<U1')
>>> comp = rand == rand[np.newaxis, 0]
>>> count = comp.sum(0)
>>> not_mode = count < rand.shape[0] / 2
>>> comp[:, not_mode] = ~comp[:, not_mode]
>>> np.where(comp, '0', '2')
array([['0', '0'],
       ['0', '2'],
       ['0', '0'],
       ['0', '2'],
       ['2', '2'],
       ['0', '0'],
       ['0', '0'],
       ['2', '0'],
       ['0', '2'],
       ['0', '2']], dtype='<U1')

If you think the cost of the in-place inversion is too high, you can instead use np.where three times:

>>> comp = rand == rand[np.newaxis, 0]
>>> count = comp.sum(0)
>>> not_mode = count < rand.shape[0] / 2
>>> np.where(comp, np.where(not_mode, '2', '0'), np.where(not_mode, '0', '2'))
array([['0', '0'],
       ['0', '2'],
       ['0', '0'],
       ['0', '2'],
       ['2', '2'],
       ['0', '0'],
       ['0', '0'],
       ['2', '0'],
       ['0', '2'],
       ['0', '2']], dtype='<U1')
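
Since the title asks for floats, the same comparison trick can be wrapped in a small function that returns numeric values directly. This is only a sketch: the function name encode_by_frequency and its major/minor parameters are made up for illustration, and it assumes each column holds at most two distinct values:

```python
import numpy as np

def encode_by_frequency(genotypes, major=0.0, minor=2.0):
    """Encode each column: more frequent value -> major, other -> minor.

    Assumes at most two distinct values per column (hypothetical helper).
    """
    genotypes = np.asarray(genotypes)
    # True wherever an entry equals the first row's entry in its column.
    comp = genotypes == genotypes[0]
    # How often the first row's entry occurs in each column.
    count = comp.sum(axis=0)
    # Per column: is the first row's entry the more frequent one?
    # A tie counts in favour of the first row's entry.
    first_is_major = count >= genotypes.shape[0] / 2
    # An entry is the major one if it matches the first row exactly
    # when the first row is major, and differs from it otherwise.
    is_major = comp == first_is_major
    return np.where(is_major, major, minor)

rand = np.array([['G', 'C'], ['G', 'T'], ['G', 'C'], ['G', 'T'], ['A', 'T'],
                 ['G', 'C'], ['G', 'C'], ['A', 'C'], ['G', 'T'], ['G', 'T']])
result = encode_by_frequency(rand)
print(result)
```

Comparing comp with first_is_major avoids the explicit inversion step, and the result is a float array that can be passed on to further numeric code without converting strings.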
