0

To summarize as concisely as I can, I have data file containing a list of chemical compounds along with their ID numbers ("CID" numbers). My goal is to use pubchempy's pubchempy.get_properties function along with pandas' df.map function to essentially obtain the properties of each compound (there is one compound per row) using the "CID" number as an identifier. The parameters of pubchempy.get_properties is an identifier ("CID" number in this case) along with the property of the chemical that you want to obtain from the pubchem website (Molecular weight in this case).


This is the code that I have written currently:

import pandas as pd

import pubchempy

import numpy as np

df = pd.read_csv("Data.tsv.txt", sep="\t")

from pubchempy import get_properties

df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('.0',''))

df['CID'] = df['CID'].astype(str).apply(lambda x: x.replace('0',''))

df = df.drop(df[df.CID=='nan'].index)

df = df.drop( df.index.to_list()[5:] ,axis = 0 )

df['CID']= df['CID'].map(lambda x: get_properties(identifier=x, properties='MolecularWeight') if float(x) > 0 else pd.NA)

df = df.rename(columns={'CID.': 'MolecularWeight'})

print(df)

This is the output that I was initially getting for that column (only including a few rows, in reality, dataset is very big):

MolecularWeight

[{'CID': 5339, 'MolecularWeight': '398.4'}]

[{'CID': 3889, 'MolecularWeight': '520.5'}]

[{'CID': 2788, 'MolecularWeight': '305.50'}]

[{'CID': 1422517, 'MolecularWeight': '440.5'}]

.

.

.


Now, the code was somewhat working in that it is providing me with the molecular weight of the compound (398.4) but I didn't want all that extra bit of writing nor did I want the quote marks around the molecular weight number (both of these get in the way of the next bit of code that I plan to write).

So I then added this bit of code:

df['MolecularWeight'] = df.MolecularWeight[0][0].get('MolecularWeight')

This is the output that I am now getting:

MolecularWeight

398.4

398.4

398.4

398.4

.

.

.


What I want to do is pretty much exactly the same it's just that instead of getting the molecular weight of the first row in the MolecularWeight column and copying it onto all the other rows, I want to have the molecular weight value of each individual row in that column as the output.

What I was hoping to get is something like this:

MolecularWeight

398.4

520.5

305.50

440.5

.

.

.

Does anyone know how I can solve this issue? I've spent many hours trying to figure it out myself with no luck. I'd appreciate any help!


Few lines of text file:

NO. compound_name   IUPAC_name  SMILES  CID     Inchi   threshold   reference   group   comments
1   sulphasalazine  2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]diazenyl]benzoic acid O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O    5339    InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18(24)25)21-20-12-4-7-14(8-5-12)28(26,27)22-17-3-1-2-10-19-17/h1-11,23H,(H,19,22)(H,24,25)      R2|R2|R25|R46|  A   
2   moxalactam  7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]-7-methoxy-3-[(1-methyltetrazol-5-yl)sulfanylmethyl]-8-oxo-5-oxa-1-azabicyclo[4.2.0]oct-2-ene-2-carboxylic acid   COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)O)=C(CSc3nnnn3C)COC21 3889    InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8-10-7-35-18-20(34-2,17(33)26(18)13(10)16(31)32)21-14(28)12(15(29)30)9-3-5-11(27)6-4-9/h3-6,12,18,27H,7-8H2,1-2H3,(H,21,28)(H,29,30)(H,31,32)      R25|    A   
3   clioquinol  5-chloro-7-iodoquinolin-8-ol    Oc1c(I)cc(Cl)c2cccnc12  2788    InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1-3-12-8/h1-4,13H      R18|R26|R27|    A   
  • Could you share a couple of lines from your text file in your question? – BeRT2me May 29 '22 at 21:04
  • `get_properties` accepts a list of identifiers and a list of desired properties, so there's a far better way to do this that I wrote up below :) – BeRT2me May 29 '22 at 22:11
  • @BeRT2me Thank you so much!! This is exactly what I needed! For some reason tho, when I do ```df = df.merge(df2) print(df) ``` , I get an error that says, "ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat" so I'm unable to actually combine them – New_to_coding May 29 '22 at 22:33
  • That means you already have columns labeled `'HBondDonorCount', 'RotatableBondCount',` and/or `'MolecularWeight'` in `df`. You can specify the merge with `df.merge(df2, on='CID')` to address the error and get duplicate columns with suffixes, or drop them from your original `df` first. – BeRT2me May 29 '22 at 22:35

2 Answers2

0

If you cast the column to float, that should help you: df['MolecularWeight'] = df['MolecularWeight'].astype(float).

0

It appears that you may want multiple properties from each CID:

props = ['HBondDonorCount', 'RotatableBondCount', 'MolecularWeight']
df2 = pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props))
print(df2)

Output:

    CID HBondDonorCount  RotatableBondCount  MolecularWeight
0  5339           398.4                   3                6
1  3889           520.5                   4                9
2  2788          305.50                   1                0

You can then merge this information onto the original dataframe:

df = df.merge(df2)  # df = df.merge(pd.DataFrame(get_properties(identifier=df.CID.to_list(), properties=props)))
print(df)

...

   NO.   compound_name                                         IUPAC_name                                             SMILES   CID                                              Inchi  threshold       reference group  comments MolecularWeight  HBondDonorCount  RotatableBondCount
0    1  sulphasalazine  2-hydroxy-5-[[4-(pyridin-2-ylsulfamoyl)phenyl]...   O=C(O)c1cc(N=Nc2ccc(S(=O)(=O)Nc3ccccn3)cc2)ccc1O  5339  InChI=1S/C18H14N4O5S/c23-16-9-6-13(11-15(16)18...        NaN  R2|R2|R25|R46|     A       NaN           398.4                3                   6
1    2      moxalactam  7-[[2-carboxy-2-(4-hydroxyphenyl)acetyl]amino]...  COC1(NC(=O)C(C(=O)O)c2ccc(O)cc2)C(=O)N2C(C(=O)...  3889  InChI=1S/C20H20N6O9S/c1-25-19(22-23-24-25)36-8...        NaN            R25|     A       NaN           520.5                4                   9
2    3      clioquinol                       5-chloro-7-iodoquinolin-8-ol                             Oc1c(I)cc(Cl)c2cccnc12  2788  InChI=1S/C9H5ClINO/c10-6-4-7(11)9(13)8-5(6)2-1...        NaN    R18|R26|R27|     A       NaN          305.50                1                   0
BeRT2me
  • 12,699
  • 2
  • 13
  • 31
  • Hey! I added a picture of a few lines of the text file. – New_to_coding May 29 '22 at 21:16
  • @New_to_coding That looks an awful lot like a dataframe output, rather than a text file. Also, could you copy/paste it as text rather than an image? You can wrap it in ``` above and below to format it nicely. – BeRT2me May 29 '22 at 21:18
  • Aka, open `Data.tsv.txt` in something like the notepad and copy/paste the first couple lines~ – BeRT2me May 29 '22 at 21:19
  • Done! My bad, you were right, I had taken a pic of the output accidentally. There's thousands of rows of data, I just copy pasted three of them. – New_to_coding May 29 '22 at 21:27