
I used the following code to find an article using the scholarly.search_pubs() function:

from scholarly import scholarly

search_query = scholarly.search_pubs('A Bayesian Analysis of the Style Goods Inventory Problem')
scholarly.pprint(next(search_query))

Output:

{'author_id': ['', ''],
 'bib': {'abstract': 'A style goods item has a finite selling period during '
                     'which the sales rate varies in a seasonal and, to some '
                     'extent, predictable fashion. There are only a limited '
                     'number of opportunities to purchase or manufacture the '
                     'style goods item, and the cost, in general, will depend '
                     'on the time at which the item is obtained. The unit '
                     'revenue achieved from sales of the item also varies '
                     'during the selling season, and, in particular, reaches '
                     'an appreciably lower terminal salvage value. Previous '
                     'work on this class of problem has assumed one of the '
                     'following:(a)',
         'author': ['GR Murray Jr', 'EA Silver'],
         'pub_year': '1966',
         'title': 'A Bayesian analysis of the style goods inventory problem',
         'venue': 'Management Science'},
 'citedby_url': '/scholar?cites=9014559854426428787&as_sdt=5,33&sciodt=0,33&hl=en',
 'filled': False,
 'gsrank': 1,
 'num_citations': 208,
 'pub_url': 'https://pubsonline.informs.org/doi/abs/10.1287/mnsc.12.11.785',
 'source': 'PUBLICATION_SEARCH_SNIPPET',
 'url_add_sclib': '/citations?hl=en&xsrf=&continue=/scholar%3Fq%3DA%2BBayesian%2BAnalysis%2Bof%2Bthe%2BStyle%2BGoods%2BInventory%2BProblem%26hl%3Den%26as_sdt%3D0,33&citilm=1&update_op=library_add&info=c5WVKW0mGn0J&ei=4DdoYri8IoySyASZk6HgCA&json=',
 'url_related_articles': '/scholar?q=related:c5WVKW0mGn0J:scholar.google.com/&scioq=A+Bayesian+Analysis+of+the+Style+Goods+Inventory+Problem&hl=en&as_sdt=0,33',
 'url_scholarbib': '/scholar?q=info:c5WVKW0mGn0J:scholar.google.com/&output=cite&scirp=0&hl=en'}

I want to save this output as a pandas dataframe. Can someone please help me with it?

Edit(1): Thank you for answering my question.

When I run this code:

data = next(search_query)
df = pd.json_normalize(data)

... it gives the following error message:

StopIteration                             Traceback (most recent call last)
<ipython-input-78-ef73437b55a5> in <module>
----> 1 data = next(search_query)
      2 df = pd.json_normalize(data)

~\Anaconda3\lib\site-packages\scholarly\publication_parser.py in __next__(self)
     91             return self.__next__()
     92         else:
---> 93             raise StopIteration
     94 
     95     # Pickle protocol
StopIteration:
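For what it's worth, any exhausted Python iterator behaves this way, not just scholarly's. A minimal stand-in with a plain list iterator (no scholarly call at all) reproduces the same error:

```python
# Minimal stand-in: any exhausted Python iterator raises StopIteration,
# regardless of whether it comes from scholarly or a plain list.
results = iter([{"title": "some article"}])

data = next(results)   # consumes the only item
try:
    next(results)      # the iterator is now empty
except StopIteration:
    print("iterator exhausted - re-create the search query to iterate again")
```

So once `next()` has consumed everything, the fix is to re-run `search_pubs()` (or to collect all results up front with `list()`).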


Edit(2): Follow-up question

I have an Excel file that contains the titles of multiple articles. Instead of searching for each article separately, I imported the Excel file as a dataframe and used the following code to look up the articles:

for i in df['Title']:
    search_query_1 = scholarly.search_pubs(i)

Now the search_query_1 iterator contains multiple articles. How can I save them as a dataframe?

1 Answer

Try using pd.json_normalize:

# python 3.8.9
# scholarly==1.6.0
import pandas as pd
from scholarly import scholarly

search_query = scholarly.search_pubs('A Bayesian Analysis of the Style Goods Inventory Problem')
data = next(search_query)
# you can use data = list(search_query) to get the entire search back
# (note: this exhausts the iterator)
df = pd.json_normalize(data)

# output
>>> df.T                                                                      
                                                                     0
container_type                                              Publication
source                     PublicationSource.PUBLICATION_SEARCH_SNIPPET
filled                                                            False
gsrank                                                                1
pub_url               https://pubsonline.informs.org/doi/abs/10.1287...
author_id                                                          [, ]
url_scholarbib        /scholar?q=info:c5WVKW0mGn0J:scholar.google.co...
url_add_sclib         /citations?hl=en&xsrf=&continue=/scholar%3Fq%3...
num_citations                                                       209
citedby_url           /scholar?cites=9014559854426428787&as_sdt=5,33...
url_related_articles  /scholar?q=related:c5WVKW0mGn0J:scholar.google...
bib.title             A Bayesian analysis of the style goods invento...
bib.author                                    [GR Murray Jr, EA Silver]
bib.pub_year                                                       1966
bib.venue                                            Management Science
bib.abstract          A style goods item has a finite selling period...
>>> df.columns
Index(['container_type', 'source', 'filled', 'gsrank', 'pub_url', 'author_id',
       'url_scholarbib', 'url_add_sclib', 'num_citations', 'citedby_url',
       'url_related_articles', 'bib.title', 'bib.author', 'bib.pub_year',
       'bib.venue', 'bib.abstract'],
      dtype='object')

Collect the full iterated search into a list, then apply json_normalize.

To handle iterating over multiple titles:

titles_to_search = list(df['Title'].unique())

dfs = []
for title_to_search in titles_to_search:
    search_query = scholarly.search_pubs(title_to_search)
    search_results = list(search_query)  # exhausts the iterator for this title

    temp_df = pd.json_normalize(data=search_results)
    if not temp_df.empty:
        dfs.append(temp_df)

total_search_df = pd.concat(dfs, ignore_index=True)
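Since a live Google Scholar query is not reproducible, here is a minimal mock (assuming result dicts shaped like the one in the question, not a live scholarly call) showing how pd.json_normalize flattens the nested bib keys into dot-separated columns:

```python
import pandas as pd

# Mock search results shaped like scholarly's output (not a live query)
mock_results = [
    {"gsrank": 1,
     "num_citations": 208,
     "bib": {"title": "A Bayesian analysis of the style goods inventory problem",
             "author": ["GR Murray Jr", "EA Silver"],
             "pub_year": "1966"}},
]

df = pd.json_normalize(mock_results)
print(sorted(df.columns))
# nested keys come out as 'bib.author', 'bib.pub_year', 'bib.title'
```

This is why the columns in the output above are named bib.title, bib.author, and so on.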
  • Thank you for answering my question. When I run your code, it gives me an error: StopIteration Traceback. Please see my main post to view the complete message. – Addy Apr 28 '22 at 02:33
  • @user18968748 This is because it is an iterator: once all the results have been iterated over, you cannot call next on it - like a cup of water, once drunk it will be empty until you fill it again :). I have not tried it myself, but you can wrap the query object in a list() instead of next(). – Chinny84 Apr 28 '22 at 07:40
  • I see. I thought since 'search_query' is an iterator object, next() should work. I tried list() instead of next(). The code runs, but the dataframe turns out to be empty. Nothing seems to work :( – Addy Apr 28 '22 at 19:01
  • Let me run it again. It is an iterator, and wrapping it in a list usually generates the whole iteration - there was only one result in the object for that search the last time I ran it. – Chinny84 Apr 28 '22 at 20:32
  • Thank you. I appreciate your help. If next() is working for you, may I ask which Python version you are using? – Addy Apr 29 '22 at 01:32
  • @user18968748 I updated it. How did you get the original json out before? – Chinny84 Apr 29 '22 at 08:35
  • @user18968748 can you confirm if you are rerunning the `search_query` before you call next again? – Chinny84 Apr 29 '22 at 10:22
  • Ooh that's what I was doing wrong. I was not rerunning the search_query before calling next again. It works now. Thank you so much. I appreciate your help! :D – Addy May 01 '22 at 03:19
  • Thank you again for answering my question. I have one follow-up question (Please see my main post). Can you please help? – Addy May 02 '22 at 02:42
  • @Addy I have updated the answer - but any more updates please add it in another question :). Feel free to tag me in the comments. – Chinny84 May 02 '22 at 16:29
  • @Chinny84, the list on search_query results with an empty data frame, any thoughts on that? – sampak Jan 18 '23 at 18:30
  • @sampak I have not really run this since I wrote the answer - 1) is this the case for other queries? 2) was it after you had run it once before? As this is a generator, once it has run it will be empty. – Chinny84 Jan 19 '23 at 14:46
  • It was for all queries; not sure what was happening, but I resolved my issue differently. Thanks! – sampak Feb 03 '23 at 16:05