
I'm very new to Python so not sure if this can be done but I hope it can!

I have accessed the Scopus API and managed to run a search query which gives me the following results in a pandas dataframe:

                                                            search-results
entry                    [{'@_fa': 'true', 'affiliation': [{'@_fa': 'tr...
link                     [{'@_fa': 'true', '@ref': 'self', '@type': 'ap...
opensearch:Query         {'@role': 'request', '@searchTerms': 'AFFIL(un...
opensearch:itemsPerPage                                                200
opensearch:startIndex                                                    0
opensearch:totalResults                                             106652

If possible, I'd like to export the 106652 results into a csv file so that they can be analysed. Is this possible at all?

R Thompson
  • The standard Python library contains a [csv](https://docs.python.org/3.3/library/csv.html) module with some [useful functions](https://www.getdatajoy.com/examples/python-data-analysis/read-and-write-a-csv-file-with-the-csv-module). – Stanislav Ivanov Sep 26 '16 at 14:01
  • Are you sure you have all 106652 results inside your "entry" list? The API only downloads 200 items per page, and you got a start index of 0. Check that first. I am also implementing a Python API for Scopus Search; I may release it as soon as it is ready. – valleymanbs Oct 04 '16 at 13:54
  • Yes sorry, I should've readdressed this question once I realised what was going wrong. It is a pain that only a max of 200 search results can be downloaded at one time! – R Thompson Oct 04 '16 at 13:56
  • Yeah, I know. I iterate over the number of totalResults, subtracting the count, and then I combine all the "entry" fields in a list (which is actually the same as a JSON object...). I then use a home-made script to convert the data to a filetype which is very similar to .csv but not quite the same. I'll post an answer with a snippet from my class implementation of the Scopus Search API so you can take inspiration from it... – valleymanbs Oct 04 '16 at 16:18

1 Answer


First you need to get all the results (see the comments under the question). The data you need (the search results) is inside the "entry" list. You can extract that list and append it to a support list, iterating until you have all the results. Here I cycle, and at every round I subtract the downloaded items (count) from the total number of results.

        # imports needed to run this snippet stand-alone;
        # MY_API_KEY, query and view are assumed to be defined elsewhere
        import json
        import requests

        # Scopus Search API endpoint (replaces the self._url attribute from my class)
        SEARCH_URL = 'https://api.elsevier.com/content/search/scopus'

        found_items_num = 1
        start_item = 0
        items_per_query = 25
        max_items = 2000
        JSON = []

        print('GET data from Search API...')

        while found_items_num > 0:

            resp = requests.get(SEARCH_URL,
                                headers={'Accept': 'application/json', 'X-ELS-APIKey': MY_API_KEY},
                                params={'query': query, 'view': view, 'count': items_per_query,
                                        'start': start_item})

            print('Current query url:\n\t{}\n'.format(resp.url))

            if resp.status_code != 200:
                # error
                raise Exception('ScopusSearchApi status {0}, JSON dump:\n{1}\n'.format(resp.status_code, resp.json()))

            # found_items_num is initialised to 1; on the first call set it to the actual total
            if found_items_num == 1:
                found_items_num = int(resp.json().get('search-results').get('opensearch:totalResults'))
                print('GET returned {} articles.'.format(found_items_num))

            if found_items_num > 0:
                # write the fetched JSON data to a file
                out_file = str(start_item) + '.json'

                with open(out_file, 'w') as f:
                    json.dump(resp.json(), f, indent=4)

                # check if the number of results exceeds the given limit
                if found_items_num > max_items:
                    print('WARNING: too many results, truncating to {}'.format(max_items))
                    found_items_num = max_items

                # check if the response returned any entries
                if 'entry' in resp.json().get('search-results', {}):
                    # combine entries to make a single JSON list
                    JSON += resp.json()['search-results']['entry']

            # set counters for the next cycle
            found_items_num -= items_per_query
            start_item += items_per_query
            print('Still {} results to be downloaded'.format(found_items_num if found_items_num > 0 else 0))

        # end while - finished downloading JSON data

Then, outside the while loop, you can save the complete file like this...

        out_file = 'articles.json'
        with open(out_file, 'w') as f:
            json.dump(JSON, f, indent=4)

Or you can follow one of the many guides you can find online (not tested; search for 'json to csv python') to convert the JSON data to a CSV.
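
A minimal sketch of that conversion, assuming the combined entries were saved to articles.json as in the snippet above. Nested fields (such as the affiliation list) are just dumped as JSON strings, so treat this as a starting point rather than a polished exporter:

        import csv
        import json

        # load the combined list of 'entry' dicts saved by the snippet above
        with open('articles.json') as f:
            entries = json.load(f)

        # collect the union of all keys so every column appears in the header
        fieldnames = sorted({key for entry in entries for key in entry})

        with open('articles.csv', 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            for entry in entries:
                # stringify nested lists/dicts so the csv module can write them
                row = {k: json.dumps(v) if isinstance(v, (list, dict)) else v
                       for k, v in entry.items()}
                writer.writerow(row)

Taking the union of keys means entries with different fields still line up under a common header; missing values simply come out as empty cells.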

valleymanbs
  • Thanks, I was running a for loop with different start points in the results, but that takes ages to run and keeps giving random errors, so I will try your method! – R Thompson Oct 05 '16 at 10:27
  • Quick question, how does the `self._JSON` line work? – R Thompson Oct 05 '16 at 11:37
  • Sorry, I lost an = [] somewhere, it should be fine now... I also removed the self. coming from the class – valleymanbs Oct 05 '16 at 12:16
  • Thanks, this should really help me a lot! – R Thompson Oct 05 '16 at 13:18
  • As a rough estimate, running it as above (max 2000 results), how long should the script take to run? (Mine is currently taking a while.) – R Thompson Oct 05 '16 at 13:24
  • I think it's because in the code above I can't see the step for subtracting the count from the search results, so I'm stuck in an infinite loop! Or is that because this is only a snippet?! – R Thompson Oct 05 '16 at 14:00
  • You are absolutely right, sorry, I missed the end of the while cycle; I was in a hurry last night and made a mess. Added just now; I'm really sorry for making you waste so many CPU cycles :P Downloading 2000 articles, 25 per query, actually takes a few minutes (let's say 3 minutes?) with my 10 Mbps connection. If you can get a subscriber account/API key you can get up to 100 or 200 for a single query, depending on whether you ask for a complete or a standard view. – valleymanbs Oct 05 '16 at 16:05
  • That's great, at least I know it wasn't me being stupid for some reason! I think I can already get 200 records for a single query, so I will give that a go, thanks! – R Thompson Oct 05 '16 at 16:08
  • Try 100 if 200 doesn't work; you can refer to these limits: https://dev.elsevier.com/api_key_settings.html – valleymanbs Oct 05 '16 at 16:12
  • Is there a max limit on how many records we can pull here (less than the quota, of course), i.e. 10,000? – R Thompson Oct 05 '16 at 16:17
  • I don't understand what you mean. Those limits reset every 7 days anyway, so if you run out of queries you don't have to wait long... – valleymanbs Oct 06 '16 at 10:58
  • For what I'm doing I need the data relatively fast! As a last resort, is it only one API key per person? – R Thompson Oct 06 '16 at 11:48
  • Just as a heads up, this script will only allow me to retrieve a maximum of 5000 search records before an error occurs. I'm not sure why this is, though... – R Thompson Oct 11 '16 at 15:13
  • I'll try a big query tomorrow and let you know. Thanks for feedback – valleymanbs Oct 12 '16 at 20:50
  • No problem. If it helps, I think the script always assumes that the total number of search results is 5000 when it is not. Rather strange. – R Thompson Oct 13 '16 at 08:36
  • After 5000 results the line raise Exception('ScopusSearchApi status {0}, JSON dump:\n{1}\n'.format(resp.status_code, resp.json())) correctly throws an exception, because (as I discovered just now) the max limit per query is 5000 results. This is a server-side limit imposed by Elsevier. You get an HTTP 404 if you try any query on the Scopus Search API with the parameter start=5000 or more. – valleymanbs Oct 13 '16 at 11:27
  • Do you have any idea what's going on? – R Thompson Oct 13 '16 at 11:32
  • Ignore my comment above, sorry – R Thompson Oct 13 '16 at 11:32
  • Are there ways to run the script with a different start point (i.e. multiples of 5000)? I tried to edit it to account for a different start point but kept getting an error. – R Thompson Oct 13 '16 at 11:33
  • No way, the limit is 5000 results per query. You could find ways to join the results of different queries, though... I think I'll be going that way. – valleymanbs Oct 14 '16 at 17:17
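
Following up on the last comment: a rough, untested sketch of joining the results of several smaller queries, splitting one big query into per-year sub-queries so that each stays under the 5000-result cap. The PUBYEAR restriction syntax should be double-checked against the Scopus search documentation, fetch_all_entries is my own illustrative wrapper around the paging loop from the answer (not part of the Scopus API), and base_query and the year range are placeholders.

        import requests

        SEARCH_URL = 'https://api.elsevier.com/content/search/scopus'

        def fetch_all_entries(query, api_key, view='STANDARD', per_page=25, max_start=4999):
            """Download every 'entry' for one query, paging until the results run out
            or the 5000-result server-side cap would be exceeded."""
            entries = []
            start = 0
            total = 1
            while start < total and start <= max_start:
                resp = requests.get(SEARCH_URL,
                                    headers={'Accept': 'application/json', 'X-ELS-APIKey': api_key},
                                    params={'query': query, 'view': view,
                                            'count': per_page, 'start': start})
                resp.raise_for_status()
                results = resp.json()['search-results']
                total = int(results['opensearch:totalResults'])
                entries += results.get('entry', [])
                start += per_page
            return entries

        # split one big query into per-year sub-queries, then join the results;
        # MY_API_KEY is assumed to be defined as in the answer above
        base_query = 'AFFIL(university)'   # placeholder for your real query
        all_entries = []
        for year in range(2010, 2017):
            sub_query = '{} AND PUBYEAR IS {}'.format(base_query, year)
            all_entries += fetch_all_entries(sub_query, MY_API_KEY)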