Get molecules from PubChem which have an Exact Mass e.g. 1176.784 +/- 0.01 Dalton by using Python

Question

I wrote the following code to find all molecules in PubChem with an exact mass of, in this case, 1176.784 +/- 0.01 Da. The request fails with status code 400. I checked the PubChem documentation and the URL looks correct to me, but I can't find the problem.

import requests

exact_mass = 1176.784  # set the exact mass value
tolerance = 0.01  # set the tolerance value

# set the API endpoint URL
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/list/exactmass/%f+-%0.3f/property/IUPACName/JSON" % (exact_mass, tolerance / 2)

# make the API request and retrieve the response
response = requests.get(url)

# check if the request was successful
if response.ok:
    # extract the JSON data from the response
    json_data = response.json()

    # extract the list of compounds from the JSON data
    compound_list = json_data['IdentifierList']['CID']

    # print the IUPAC names of the compounds in the list
    for cid in compound_list:
        # set the API endpoint URL to retrieve IUPAC name for a specific CID
        url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/%d/property/IUPACName/JSON' % cid
        response = requests.get(url)
        json_data = response.json()
        iupac_name = json_data['PropertyTable']['Properties'][0]['IUPACName']
        print(iupac_name)

else:
    # print an error message if the request failed
    print('Error: Request failed with status code %d' % response.status_code)

I expect to get a list of the names of all molecules whose ExactMass is within 1176.784 +/- 0.01 Da.

So if you take the exact URL that your script generates and paste it into a browser, does that work? If not, what seems to be the difference between URLs that work and ones that don't? — Random Davis, Feb 24 '23 at 20:47
The URL gives indeed a 'bad request' error. I need to check the PubChem documentation again... — John Mommers, Feb 24 '23 at 20:54
What could be wrong with this URL: url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/list/exactmass/%f+-%0.3f/property/IUPACName/JSON" % (exact_mass, tolerance / 2) — John Mommers, Feb 24 '23 at 21:00
That's not a URL. That's Python code which formats the URL string. What is the _resulting URL_ of that Python code? You already shared that Python code. Obviously I can see how you are building the URL. I said to extract the URL that your script _generates_ - as in, show us the **result** of that code. And I also said to compare it to a working URL, but you didn't do that either. I thought my suggestion was really basic and simple and hard to mess up, but maybe you misread it? — Random Davis, Feb 24 '23 at 21:14
URL:https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/list/exactmass/1176.784000+-0.005/property/IUPACName/JSON — John Mommers, Feb 24 '23 at 21:19
Just put the URL in your post, the full one doesn't show up in comments. Also, I can see that it's passing `1176.784000+-0.005` in the URL. That seems like an invalid URL. Don't you have to use special URL encoding to send special characters in URLs? And you still didn't share a legitimate, working URL that you got from using the site in the browser versus Python. So, I still have nothing to compare it to. It seems invalid, but you still just haven't shared enough info. — Random Davis, Feb 24 '23 at 21:23
Hi Random. Thanks. I tried the real URL, of course. I did not find a 'working' URL where the input is a mass and a tolerance and the output is compound names; otherwise I would not have posted this question :) — John Mommers, Feb 24 '23 at 21:26
So even with the info from the documentation, you still are completely unable to figure out how to write a valid URL, even manually? I think we need to see the documentation you're using. Because clearly that's not even an issue with Python, but with your understanding (or lack thereof) of the documentation. If you can't do it without Python, then obviously you can't do it _with_ Python. This, therefore isn't even a code question, IMO. it feels manipulative that you'd only just now say that you actually couldn't figure this out even outside of Python, and you're acting like it's a code issue. — Random Davis, Feb 24 '23 at 21:38
On https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest I'm not seeing anything about a /compound/list/[...]. You get the same error if you try to hit the API for an endpoint at /compound/thing-that-doesnt-exist — Kaia, Feb 24 '23 at 22:06
@RandomDavis. I'm voting to close as unreproducible or typo, because that's what the programming issue boils down to at this point — Mad Physicist, Feb 25 '23 at 00:18
Hi Random. I just would like to get help to get it to work. You can find the specific PubChem documentation (about URL) here: https://pubchem.ncbi.nlm.nih.gov/docs/pug-rest. Thanks for your advice. — John Mommers, Feb 25 '23 at 08:17

score 0 · Answer 1 · answered Feb 25 '23 at 09:26

I found another way: using the E-utilities "esearch" endpoint to retrieve the CIDs (PubChem compound identifiers) of molecules whose exact mass lies between two values. I wrote the following function for this:

import requests

def search_cids_exactmass(min_mass, max_mass):
    base_url = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    db = "pccompound"
    term = f"{min_mass}:{max_mass}[exactmass]"
    retmode = "json"
    url = f"{base_url}?db={db}&term={term}&retmode={retmode}"

    response = requests.get(url)
    data = response.json()
    cids = data['esearchresult']['idlist']
    
    return cids
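
A sketch of how the function above could be applied to the case in the question (1176.784 +/- 0.01 Da). The helper names `mass_window`, `esearch_url`, `iupac_name_url` and `fetch_iupac_names` are made up for illustration and are not part of any PubChem library. The sketch assumes that PUG REST accepts a comma-separated CID list in the `compound/cid/...` path and that each entry in `Properties` carries a `CID` key. It uses the full tolerance on each side, whereas the question's code passed `tolerance / 2`, which gives +/- 0.005.

```python
from urllib.parse import urlencode

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
PUG_REST_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def mass_window(exact_mass, tolerance):
    """Return (min_mass, max_mass) for exact_mass +/- tolerance, rounded to 6 decimals."""
    return round(exact_mass - tolerance, 6), round(exact_mass + tolerance, 6)


def esearch_url(min_mass, max_mass):
    """Build the esearch URL, percent-encoding the ':' and '[...]' in the search term."""
    params = {
        "db": "pccompound",
        "term": f"{min_mass}:{max_mass}[exactmass]",
        "retmode": "json",
    }
    return f"{ESEARCH_URL}?{urlencode(params)}"


def iupac_name_url(cids):
    """Build one PUG REST URL that requests IUPACName for several CIDs at once."""
    cid_list = ",".join(str(cid) for cid in cids)
    return f"{PUG_REST_URL}/compound/cid/{cid_list}/property/IUPACName/JSON"


def fetch_iupac_names(exact_mass, tolerance):
    """Look up CIDs by exact mass, then fetch their IUPAC names in a single request."""
    import requests  # imported here so the URL helpers work without requests installed

    low, high = mass_window(exact_mass, tolerance)
    response = requests.get(esearch_url(low, high), timeout=30)
    response.raise_for_status()
    cids = response.json()["esearchresult"]["idlist"]
    if not cids:
        return {}

    response = requests.get(iupac_name_url(cids), timeout=30)
    response.raise_for_status()
    properties = response.json()["PropertyTable"]["Properties"]
    # Some entries may lack an IUPAC name, so fall back to None.
    return {entry["CID"]: entry.get("IUPACName") for entry in properties}
```

Calling `fetch_iupac_names(1176.784, 0.01)` would then return a dict mapping each CID to its name. Note that esearch returns only the first 20 IDs unless a `retmax` parameter is added, so a wide mass window may need that parameter as well.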

D.L · Answer 2 · 2023-02-25T00:22:23.013

As the comments point out, you need to go no further than the first few lines to identify the error, but for clarity I show the complete answer here.

Essentially, you can do this:

import requests

exact_mass = 1176.784  # set the exact mass value
tolerance = 0.01  # set the tolerance value

# set the API endpoint URL
url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/list/exactmass/%f+-%0.3f/property/IUPACName/JSON" % (exact_mass, tolerance / 2)

print(url)

The above prints this:

https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/list/exactmass/1176.784000+-0.005/property/IUPACName/JSON

You can then take the printed URL and paste it into a browser, which returns this:

{
  "Fault": {
    "Code": "PUGREST.BadRequest",
    "Message": "Unrecognized identifier namespace"
  }
}

So the URL itself is confirmed to be bad. The response carries HTTP status code 400 (Bad Request), which is described here: https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400
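
The fault message is more informative than the bare status code, so it can help to print it from the script as well. A minimal sketch, assuming the error body is JSON with the `Fault` structure shown above; `describe_fault` is a hypothetical helper name:

```python
def describe_fault(payload):
    """Return 'Code: Message' for a PUG REST fault payload, or None if there is no fault."""
    fault = payload.get("Fault")
    if fault is None:
        return None
    return f"{fault.get('Code', 'unknown')}: {fault.get('Message', 'no message')}"


# The fault body shown above, as a Python dict.
sample = {
    "Fault": {
        "Code": "PUGREST.BadRequest",
        "Message": "Unrecognized identifier namespace",
    }
}
print(describe_fault(sample))
```

In the question's `else:` branch, `print(describe_fault(response.json()))` would then show why the request was rejected, provided the error response is JSON as it is here.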

I went to the website and picked a similar but working URL for testing. I used this:

url = 'https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/data/compound/5754/JSON/?response_type=display'

With this URL, the if response.ok: block is entered successfully.

While this is true, I don't think this answers the question, it just points out that the example code is a bunch of nonsense. the question then becomes 'what is the correct way to make this query, if possible.' — Kaia, Feb 25 '23 at 00:14
@Kaia, i would actually say that the code looks clean and well written actually. it was easy to debug and identify the error. The error is the URL... knowing the correct URL would then progress the code into the `if response.ok:` block as expected. — D.L, Feb 25 '23 at 00:17
@Kaia, i also demonstrate that the code `if response.ok:` does in fact work when a working URL is given. — D.L, Feb 25 '23 at 00:24
