
I'm using the code below to retrieve papers from arXiv. I want to retrieve papers that have the words "machine" and "learning" in the title. The number of papers is large, so I want to slice the results by year (published).

How can I request only the records from 2020 and 2019 in search_query? Please note that I'm not interested in post-filtering.

import urllib.parse
import urllib.request

import time
import feedparser

# Base api query url
base_url = 'http://export.arxiv.org/api/query?'

# Search parameters
search_query = urllib.parse.quote("ti:machine learning")
start = 0
total_results = 5000
results_per_iteration = 1000
wait_time = 3

papers = []

print('Searching arXiv for %s' % search_query)

for i in range(start,total_results,results_per_iteration):
    
    print("Results %i - %i" % (i,i+results_per_iteration))
    
    query = 'search_query=%s&start=%i&max_results=%i' % (search_query,
                                                         i,
                                                         results_per_iteration)

    # perform a GET request using the base_url and query
    response = urllib.request.urlopen(base_url+query).read()

    # parse the response using feedparser
    feed = feedparser.parse(response)

    # Run through each entry, and print out information
    for entry in feed.entries:
        #print('arxiv-id: %s' % entry.id.split('/abs/')[-1])
        #print('Title:  %s' % entry.title)
        #feedparser v4.1 only grabs the first author
        #print('First Author:  %s' % entry.author)
        paper = {}
        paper["date"] = entry.published
        paper["title"] = entry.title
        paper["first_author"] = entry.author
        paper["summary"] = entry.summary
        papers.append(paper)
    
    # Sleep a bit before calling the API again
    print('Bulk: %i' % 1)
    time.sleep(wait_time)
Fluxy
  • According to the arXiv API documentation (https://arxiv.org/help/api/user-manual#search_query_and_id_list and https://arxiv.org/help/api/user-manual#query_details), search_query does not offer that filter. – horro Sep 24 '20 at 13:27

1 Answer

According to the arXiv documentation, there is no published or date field available.

What you can do is sort the results by date (by adding &sortBy=submittedDate&sortOrder=descending to your query parameters) and stop making requests once the results reach 2018.

Basically your code should be modified like this:

import time
import urllib.parse
import urllib.request

import feedparser

# Base api query url
base_url = 'http://export.arxiv.org/api/query?'

# Search parameters
search_query = urllib.parse.quote("ti:machine learning")
i = 0
results_per_iteration = 1000
wait_time = 3
stop_year = "2018"  # stop requesting once the papers reach this year
papers = []
done = False
print('Searching arXiv for %s' % search_query)

while not done:
    print("Results %i - %i" % (i, i + results_per_iteration))

    query = 'search_query=%s&start=%i&max_results=%i&sortBy=submittedDate&sortOrder=descending' % (search_query,
                                                         i,
                                                         results_per_iteration)

    # perform a GET request using the base_url and query
    response = urllib.request.urlopen(base_url + query).read()

    # parse the response using feedparser
    feed = feedparser.parse(response)

    # An empty page means there is nothing left to fetch; stop here
    # instead of looping forever
    if not feed.entries:
        break

    # Run through each entry and collect its information
    for entry in feed.entries:
        #feedparser v4.1 only grabs the first author
        year = entry.published[0:4]
        # Results are sorted newest first, so once stop_year is reached
        # every remaining entry is from stop_year or earlier
        if year <= stop_year:
            done = True
            break
        paper = {}
        paper["date"] = entry.published
        paper["title"] = entry.title
        paper["first_author"] = entry.author
        paper["summary"] = entry.summary
        papers.append(paper)
    # Sleep a bit before calling the API again
    print('Bulk: %i' % 1)
    i += results_per_iteration
    time.sleep(wait_time)
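
A side note on the query string itself, as a hedged sketch rather than a definitive fix. As far as I can tell from the user manual, a field prefix such as ti: applies only to the term it is attached to, so "ti:machine learning" may not restrict the second word to the title. The hypothetical helper build_search_query below joins one ti: clause per word with AND. It can also append a submittedDate range for one year: the documentation cited above did not list a date field, but later revisions of the user manual appear to describe this range syntax, so treat it as an assumption and check the current manual before relying on it.

```python
import urllib.parse


def build_search_query(words, year=None):
    """Return a URL-encoded value for the search_query parameter.

    words: title words that must all appear (one ti: clause each, joined by AND).
    year:  optional int; adds a submittedDate range covering that calendar year.
           ASSUMPTION: the submittedDate:[from TO to] syntax is not in the
           documentation quoted in this answer; verify it against the current
           arXiv API user manual.
    """
    clauses = ["ti:%s" % word for word in words]
    if year is not None:
        clauses.append("submittedDate:[%d01010000 TO %d12312359]" % (year, year))
    return urllib.parse.quote(" AND ".join(clauses))


# Title-only query, then the same query limited to 2019 (assumed syntax)
print(build_search_query(["machine", "learning"]))
print(build_search_query(["machine", "learning"], year=2019))
```

The returned string can be used in place of search_query in the loop above. If the range filter works for you, the early-stop check on the year is no longer needed, but the empty-page check still is.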

For the "post-filtering" approach, once enough results are collected, I'd do something like this:

papers2019 = [item for item in papers if item["date"][0:4] == "2019"]
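
If you want every year at once rather than one list per year, the same idea can be written as a grouping step. This is only a minimal sketch: group_by_year is a hypothetical helper name, and the sample records are made up to match the shape of the dicts built in the loop above.

```python
from collections import defaultdict


def group_by_year(papers):
    """Group paper dicts by the first four characters of their "date" field."""
    by_year = defaultdict(list)
    for paper in papers:
        by_year[paper["date"][0:4]].append(paper)
    return dict(by_year)


# Hypothetical sample records, shaped like the dicts collected from the feed
sample = [
    {"date": "2020-03-01T12:00:00Z", "title": "A"},
    {"date": "2019-07-15T08:30:00Z", "title": "B"},
    {"date": "2019-01-02T10:00:00Z", "title": "C"},
]

by_year = group_by_year(sample)
print({year: len(items) for year, items in by_year.items()})
```

With the real data, by_year["2019"] holds the same records as papers2019 above.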
corbin-c
  • That could be a feasible solution, but he specifies he is not interested in post-filtering, so I guess that option is not valid for him. – horro Sep 24 '20 at 13:30
  • Thanks, my final goal is to retrieve all possible papers that include "machine learning" in the title. I struggled with this task, because the query returns different results each time I run it... Therefore I decided to slice by year. I'm not sure that I correctly understood how sortBy can be applied to slicing by date. Can you please give an example how to get the records for 2020 and then only for 2019 and then only for 2018? – Fluxy Sep 24 '20 at 13:34
  • You won't be able to get the records "only for 2019" without post-filtering. The best you can do is to stop the requests when reaching a given date... – corbin-c Sep 24 '20 at 13:39
  • Can you please show how to do so? Also, could you please explain how your post-filtering approach lets me retrieve all possible records from arXiv that have "machine learning" in the title? In the end, the approach does not matter to me. I'm interested in the final result. – Fluxy Sep 24 '20 at 13:43
  • Thanks. When I run this code and then do `len(papers)`, I see 532 papers, which seems to be unrealistic. This is the issue. How many records did you get? – Fluxy Sep 24 '20 at 15:28