
I'm reading a csv file containing a list of chemical compounds, and I'm trying to apply the `pubchempy.get_properties` function to the `CID` column, which holds the CID (identifier) number of each compound.

However, I can't get it to work. The weird thing is that I'm not getting any errors (I did get some initially and tinkered with the code to try to fix them); the values in the `CID` column simply stay the same after the function runs.

This is the code I've written for now:

import pandas as pd
import pubchempy

df = pd.read_csv("Chemical datavase.tsv.txt", sep="\t")
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 12)
pd.set_option('display.max_rows', 12)

from pubchempy import get_compounds, Compound, get_properties

if (df['CID']>0).all():
    df['CID'] = pubchempy.get_properties(df['CID'], 'MolecularWeight')

print(df['CID'])

I'd appreciate any help on this!

NotAName

1 Answer


IIUC, you can try:

from pubchempy import get_properties

if (df['CID']>0).all():
    df['CID'] = df['CID'].map(lambda x: get_properties(x, 'MolecularWeight'))

Or without the test, passing the parameters by keyword (`get_properties` expects `properties` first, then `identifier`):

df['CID'] = df['CID'].map(lambda x: get_properties(identifier=x, properties='MolecularWeight') if x>0 else pd.NA)
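A plausible cause of the `TypeError: 'float' object is not iterable` reported in the comments is that a `CID` column with empty cells is read as `float64`, so each identifier reaches `get_properties` as a float such as `5339.0` rather than an int. The sketch below casts each value back to `int` and skips missing ones. The helper name `lookup_weight` and its `fetch` parameter are hypothetical, introduced only for illustration, and the assumed return shape (a list of dicts keyed by property name) should be checked against a real call.

```python
import pandas as pd


def lookup_weight(cid, fetch):
    """Return the molecular weight for one CID, or pd.NA if it is missing.

    `fetch` is assumed to be a callable that accepts the same keywords as
    pubchempy.get_properties(properties=..., identifier=...).
    """
    # Empty cells arrive as NaN; skip them (and any non-positive value).
    if pd.isna(cid) or cid <= 0:
        return pd.NA
    # Cast 5339.0 back to 5339 before handing it to the lookup.
    records = fetch(properties="MolecularWeight", identifier=int(cid))
    if not records:
        return pd.NA
    # Assumed shape: [{"CID": 5339, "MolecularWeight": ...}]
    # float() accepts the weight whether it comes back as text or a number.
    return float(records[0]["MolecularWeight"])
```

With `from pubchempy import get_properties` in scope, this could be applied as `df['MolecularWeight'] = df['CID'].map(lambda x: lookup_weight(x, get_properties))`. Writing to a new column instead of overwriting `CID` keeps the identifiers available for later lookups.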
mozway
  • Thank you for the reply! I tried it out and it still didn't work although it makes sense... – New_to_coding May 16 '22 at 02:13
  • How did it fail? Are you sure the `(df['CID']>0).all()` condition is True? – mozway May 16 '22 at 02:15
  • There was no error or anything, it's just that the CID values (column values) remain the same. They don't change after the function is applied to them. – New_to_coding May 16 '22 at 02:17
  • What is the output of `(df['CID']>0).all()`? – mozway May 16 '22 at 02:17
  • It just says, "Process finished with exit code 0" after I run that input. I checked my database again and I believe the condition is true (considering it means all the CID data values are greater than 0). – New_to_coding May 16 '22 at 02:19
  • There must be a True/False output to `(df['CID']>0).all()` (or an error if the comparison fails), what about `print((df['CID']>0).all())`? – mozway May 16 '22 at 02:22
  • Ohhh my bad, I just did an if else statement and apparently it's not true... Shouldn't be possible since a compound's CID value can't be zero or below so I guess I'll take a look through my database again. My apologies, completely new to this. Thank you!! – New_to_coding May 16 '22 at 02:23
  • Check if there are NaNs maybe? – mozway May 16 '22 at 02:24
  • Yup looking through it right now, I'm hoping to find an issue so that I can at least know the code is fine. – New_to_coding May 16 '22 at 02:27
  • You can also skip the test, see update – mozway May 16 '22 at 02:29
  • You're right, some of the CID values are empty. When I skip the if statement part, I keep getting errors like "TypeError: 'float' object is not iterable". Previously I was getting an error that said "Truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all()" so I kinda forced that if statement in there. Just saw the update, I'll give that a shot! – New_to_coding May 16 '22 at 02:32
  • Have you tested my update? – mozway May 16 '22 at 02:33
  • It's now saying "TypeError: 'float' object is not iterable". I had this error come up before but wasn't really sure what the issue was. This is with the updated code. – New_to_coding May 16 '22 at 02:34
  • What does the `pubchempy.get_properties` expect as input? And what should it return? How would you use it on a single item? – mozway May 16 '22 at 02:34
  • The parameters are: `pubchempy.get_properties(properties, identifier, namespace=u'cid', searchtype=None, as_dataframe=False, **kwargs)`. However, only `properties` and `identifier` are mandatory; the rest are optional. It's supposed to give you the property you ask for by linking to the pubchem website (so in this case, the numerical molecular weight). I'm considering the CID number column to be the identifier since that is quite literally what the CID of a chemical compound is for. – New_to_coding May 16 '22 at 02:37
  • I think the issue is there, it doesn't take your input. Try to get it to run successfully on one item first. You can try to swap the order `get_properties('MolecularWeight', x)`. Also try to convert `x` to string – mozway May 16 '22 at 02:39
  • Yup, works when given a single CID number (5339) from the column. `print(get_properties(identifier=5339, properties='MolecularWeight'))` produced a correct value. I'll keep playing around with it though, thank you very much for your help. – New_to_coding May 16 '22 at 02:43
  • Yes but here you are using named parameters! Not in the positional order. Check the update – mozway May 16 '22 at 02:58
  • Thank you so much! Keeps giving me an error that says "TypeError: 'float' object is not iterable". I think it's because the molecular weight values have decimals... I'll play around with it to hopefully fix the issue, I really appreciate all your help with this though! – New_to_coding May 17 '22 at 03:12
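
The checks discussed in the comments above can be reproduced on a toy frame. This is a minimal sketch with made-up data, not the asker's file: it shows that a single empty cell turns the column into `float64` and makes the `(df['CID']>0).all()` test False, which is why the original `if` block never ran and the column was left unchanged.

```python
import pandas as pd

# Toy data standing in for the real file: one CID is missing.
df = pd.DataFrame({"CID": [5339, None, 2244]})

# The gap forces the column to float64, so 5339 is stored as 5339.0.
cid_dtype = df["CID"].dtype

# NaN > 0 evaluates to False, so .all() is False for the whole column.
all_positive = (df["CID"] > 0).all()

# Count the empty cells directly.
missing = df["CID"].isna().sum()

print(cid_dtype, all_positive, missing)
```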