
I have a cheminformatics dataset (input data) as a pandas dataframe with the row id as index. The encoder function converts each SMILES string into a binary fingerprint, returned as a NumPy ndarray. I want to add the fingerprint to my input dataframe as another column, but I get an error when I convert it into a pandas Series. Can anyone tell me how to do this?

for index, row in input_table.iterrows():
        fp_a=(mhfp_encoder.secfp_from_smiles(row['usmiles_c']))   #creates a binary num
        column_series = pd.Series(fp_a)
        input_table['new_col']=pd.Series(fp_a)

error: Length of values does not match length of index

nurlubanu

1 Answer


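You get the error because pd.Series(fp_a) gives you a Series (not a dataframe) with 2048 rows, one per bit of the fingerprint, but your dataframe has a different number of rows. Your loop also tries to assign the fingerprint of a single molecule to the whole column on every iteration, instead of one fingerprint per row.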
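
The length mismatch can be reproduced without mhfp. The following is only a minimal sketch: it assumes a 4-row dataframe and uses a zero-filled NumPy array of length 2048 as a stand-in for one fingerprint (fake_fp is a made-up name, not something the library provides). Assigning the raw array as a column fails because 2048 values cannot be matched to 4 rows:

```python
import numpy as np
import pandas as pd

# Toy dataframe with 4 rows.
input_table = pd.DataFrame(
    {'usmiles_c': ['CCC(C)(C)N', 'NCC(O)CO', 'NCCN1CCNCC1', 'NCCN']}
)

# Stand-in for ONE fingerprint: 2048 bits, all zero (hypothetical data).
fake_fp = np.zeros(2048, dtype=np.uint8)

try:
    # 2048 values cannot be spread over 4 rows, so pandas raises ValueError.
    input_table['new_col'] = fake_fp
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(error_message)
```

The exact wording of the message depends on the pandas version, but it should mention that the length of the values does not match the length of the index.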

There is another way to add the fingerprints to your dataframe.

If you have a dataframe like this

import pandas as pd

smiles = ['CCC(C)(C)N', 'NCC(O)CO', 'NCCN1CCNCC1','NCCN']
input_table = pd.DataFrame(smiles, columns=['usmiles_c'])

print(input_table)

     usmiles_c
0   CCC(C)(C)N
1     NCC(O)CO
2  NCCN1CCNCC1
3         NCCN

and made the fingerprints like this

from mhfp.encoder import MHFPEncoder
mhfp_encoder = MHFPEncoder()

fps = []
for smiles in input_table['usmiles_c']:
    fp = mhfp_encoder.secfp_from_smiles(smiles)
    fps.append(fp)

you can store each whole fingerprint in a single column

input_table['new_col'] = fps
print(input_table)

     usmiles_c                                            new_col
0   CCC(C)(C)N  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0
1     NCC(O)CO  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0
2  NCCN1CCNCC1  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0
3         NCCN  [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ..., 0
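
The same one-column layout can also be built with a list comprehension, and the stored arrays can later be stacked back into a 2D matrix. This is just a sketch under assumptions: toy_fingerprint is a hypothetical stand-in for mhfp_encoder.secfp_from_smiles so that the snippet runs without mhfp installed, and the bits it produces are not chemically meaningful.

```python
import numpy as np
import pandas as pd


def toy_fingerprint(smiles, length=16):
    """Hypothetical stand-in for the real encoder; NOT a real fingerprint."""
    bits = np.zeros(length, dtype=np.uint8)
    for position, char in enumerate(smiles):
        bits[(ord(char) + position) % length] = 1
    return bits


smiles = ['CCC(C)(C)N', 'NCC(O)CO', 'NCCN1CCNCC1', 'NCCN']
input_table = pd.DataFrame(smiles, columns=['usmiles_c'])

# One array per row: the list has exactly as many entries as the dataframe has rows.
input_table['new_col'] = [toy_fingerprint(s) for s in input_table['usmiles_c']]

# Stack the per-row arrays into a (n_molecules, n_bits) matrix when needed.
fp_matrix = np.vstack(input_table['new_col'].tolist())
print(fp_matrix.shape)  # (4, 16)
```

With the real encoder, the list comprehension would call mhfp_encoder.secfp_from_smiles(s) instead of toy_fingerprint(s).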

Alternatively, make a separate column for each bit

col_name = range(len(fps[0]))

for n in col_name:
    input_table[n] = [m[n] for m in fps]

print(input_table)

     usmiles_c  0  1  2  3  4  5  ...  2041  2042  2043  2044  2045  2046  2047
0   CCC(C)(C)N  0  0  0  0  0  0  ...     0     0     0     0     0     0     0
1     NCC(O)CO  0  0  0  0  0  0  ...     0     0     0     0     0     0     0
2  NCCN1CCNCC1  0  0  0  0  0  0  ...     0     0     0     0     0     0     0
3         NCCN  0  0  0  0  0  0  ...     0     0     0     0     0     0     0
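
Adding 2048 columns one at a time can be slow, and recent pandas versions may emit a PerformanceWarning about a fragmented dataframe. A possible alternative, shown here only as a sketch, is to stack the fingerprints into one matrix and join it in a single step. The fingerprints below are random stand-ins (an assumption so that the snippet runs without mhfp), not real encoder output.

```python
import numpy as np
import pandas as pd

smiles = ['CCC(C)(C)N', 'NCC(O)CO', 'NCCN1CCNCC1', 'NCCN']
input_table = pd.DataFrame(smiles, columns=['usmiles_c'])

# Hypothetical stand-in fingerprints: one random 2048-bit vector per molecule.
rng = np.random.default_rng(0)
fps = [rng.integers(0, 2, size=2048, dtype=np.uint8) for _ in range(len(input_table))]

# Stack into a (n_molecules, n_bits) matrix; reuse the index so the rows line up.
bit_columns = pd.DataFrame(np.vstack(fps), index=input_table.index)

# Join all bit columns to the original table in a single step.
wide_table = pd.concat([input_table, bit_columns], axis=1)
print(wide_table.shape)  # (4, 2049)
```

The resulting columns are named 0 to 2047, the same as in the loop version.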
rapelpy