
I used the Guardian news API to fetch data. The documentation says results are returned as a paginated list containing, by default, 10 entries per page, and I get JSON output like this. The Guardian documentation can be found here.

{
    "response": {
        "status": "ok",
        "userTier": "developer",
        "total": 8174,
        "startIndex": 1,
        "pageSize": 10,
        "currentPage": 1,
        "pages": 818,
        "orderBy": "relevance",
        "results": []
    }
}

I want to collect all the data (a total of 8174 in this example) instead of just 10 entries. Is there any way to fetch all the data?
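(For reference, the pages field in the response is just total divided by pageSize, rounded up; checking with the example numbers above:)

```python
import math

# Values from the example response above
total = 8174
page_size = 10
print(math.ceil(total / page_size))  # 818, matching "pages" in the response
```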

Chamith

2 Answers


I found the answer. By default, the Guardian API fetches 10 entries per page. We can override the default using the page-size parameter in the API and providing the needed data count:

https://content.guardianapis.com/search?q={query}&page-size={data count}
Chamith

Your solution will not work in all cases, since there is usually a limit to the page-size parameter. For the Guardian API this is 200 at the moment.

If you need more items than you can get in a single call to the API, simply iterate over pages with a definite loop (if you know how many pages you need) or with an open-ended while loop if you want to grab everything, e.g.

import requests

url = "https://content.guardianapis.com/search"
params = {"api-key": "YOUR_KEY", "q": "your query", "page-size": 200}

current_page = 1
total_pages = 1
while current_page <= total_pages:
    params["page"] = current_page
    try:
        r = requests.get(url, params=params)
        r.raise_for_status()
    except requests.exceptions.RequestException as err:
        raise SystemExit(err)
    total_pages = r.json()['response']['pages']
    current_page += 1

P.S. It's always good to add a way out of your while loops in case something fails; you don't want to flood the API with requests forever!
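One simple escape hatch is a hard cap on the number of iterations; a minimal sketch (the MAX_PAGES value is an arbitrary example, and the append stands in for the real API call):

```python
MAX_PAGES = 1000           # arbitrary safety cap: never loop past this
current_page = 1
total_pages = 5            # would normally come from the API response
fetched_pages = []
while current_page <= total_pages:
    if current_page > MAX_PAGES:
        raise RuntimeError("safety cap reached; aborting pagination")
    fetched_pages.append(current_page)  # stand-in for the real API call
    current_page += 1
print(fetched_pages)  # [1, 2, 3, 4, 5]
```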

invariant
  • Would you also happen to know how to use the Guardian API to get the content of a puzzle, such as a crossword puzzle? It is there in the HTML, but the API does not seem to return it by default. – Obie 2.0 Sep 26 '21 at 04:15
  • What do you mean by "the content of a crossword puzzle"? Do you mean the clues? Have you tried using the "Explore" tool of the API's website? I managed to pull crossword puzzles with the "content" API and q="crossword". Does this help? – invariant Sep 28 '21 at 07:12
  • Precisely. The clues and answers. Were you able to get those? Even when I use a single-item URL with the API, it does not return the clues and answers, only the surrounding context. I wrote a small web scraper to gather this data, but my IP addresses keep getting blocked. – Obie 2.0 Sep 28 '21 at 07:13
  • Yes. I just got the clues using `soup.find_all('div', class_='crossword__clues__text')`. – invariant Sep 28 '21 at 07:27
  • Yes, it is easy to find the clues if they are getting returned. But for me, they are not getting returned. Did you use something like this to get the page: `payload = { 'api-key': 'KEY', 'page-size': 10, 'show-editors-picks': 'true', 'show-elements': 'image', 'show-fields': 'all' }; response = requests.get('https://content.guardianapis.com/crosswords/cryptic/28551', params=payload)`? – Obie 2.0 Sep 28 '21 at 07:34
  • I see, sorry, I thought I'd answered your question with my first reply. As I was saying, I used the content API endpoint, which looks like: `query = 'crossword AND cryptic'`; `query_url = f"http://content.guardianapis.com/search?&api-key={apikey}&q={query}"`; `r = requests.get(query_url)`. Notice that I added an extra keyword (here "cryptic") to the query to weed out content in the "crosswords blog" section, which is just text. – invariant Sep 28 '21 at 07:52
  • I see. For me, that returns a number of elements of the following form, when I look at r.text: `{"id":"crosswords/cryptic/28541","type":"crossword","sectionId":"crosswords","sectionName":"Crosswords","webPublicationDate":"2021-09-02T23:00:12Z","webTitle":"Cryptic crossword No 28,541","webUrl":"https://www.theguardian.com/crosswords/cryptic/28541","apiUrl":"https://content.guardianapis.com/crosswords/cryptic/28541","isHosted":false,"pillarId":"pillar/lifestyle","pillarName":"Lifestyle"}`, just as I was getting before. Still without clue or solution elements. It gives you a different response? – Obie 2.0 Sep 28 '21 at 07:55
  • I see now what your confusion is. By passing `'show-fields': 'all'` you expected to see the text returned as one of the fields (such as how the text of an article is in `'body'`). It's true that here the textual contents are not returned in any of the fields. But I went another way: you can extract the url from the `'webUrl'` field and send another request with `r = requests.get(url)`. – invariant Sep 28 '21 at 08:03
  • So you only used the API to get the URLs of the content, and then just regular webscraping with requests? I tried the regular webscraping earlier, and I got several IP addresses permanently blocked after a small number of requests. Anyway, when I put the URL in directly, I still don't see any elements that contain clues. – Obie 2.0 Sep 28 '21 at 08:07
  • That's right, that's what I did. Regarding getting blocked, I don't know why it happens to you, you may want to check the limits. Also, I would say to end here, as it is becoming too long for a comment section. If still unsure, maybe move this to a new question. – invariant Sep 28 '21 at 08:10
  • Never mind, I figured out the issue. Some of the cryptic crosswords are actually in the Prize crossword section instead. I was iterating from the earliest to the latest and still had hundreds left, but I had exhausted all the ones that were labeled as Cryptic. – Obie 2.0 Sep 28 '21 at 08:42