After searching through hundreds of answers, I'm here again, asking a new question that might help someone in the future.

I'm scraping this website: https://inview.doe.in.gov/state/1088000000/school-list.
The school list is rendered inside a flexbox, and I believe I could fetch the data with Selenium. But I want to get this job done using only BeautifulSoup.

By inspecting the network connections, I found 2 API calls, and I'm not sure which one gives me the school list. I have their IPv4 addresses as well.

api = 'https://inview.doe.in.gov/api/entities?lang=en&merges=[{"route": "entities", "name": "district", "local_field": "district_id", "foreign_field": "id", "fields": "id,name"}]&filter=state_id==1088000000'
ipv4 = '104.18.21.238:443'
api2 = 'https://inview.doe.in.gov/api/entities?filter=type==district,type==network,type==school,type==state&fields=name,type,id,district_id'
ipv4 = '104.18.21.238:443'

Trying to access the content directly gives None, as it is dynamically loaded (at least that's what I believe).

import json
import requests
from bs4 import BeautifulSoup


def url_parser(url):
  html_doc = requests.get(url, headers={"Accept":"*/*"}).text
  soup = BeautifulSoup(html_doc,'html.parser')
  return html_doc, soup


def data_fetch(url):
  html_doc, soup = url_parser(url)
  api_link = 'https://inview.doe.in.gov/api/entities?lang=en&merges=[{"route": "entities", "name": "district", "local_field": "district_id", "foreign_field": "id", "fields": "id,name"}]&filter=state_id==1088000000'
  html_doc2, soup2 = url_parser(api_link)
  #school_id = soup2.find_all('div', {'class':'result-table table--results mt-3'})
  print(soup2)


def main():
  url = "https://inview.doe.in.gov/state/1088000000/school-list"
  data_fetch(url)

main()

Opening the API link directly in the browser gives me the same error message that I get from the code:

{"message":"The resource identified by the request is only capable of generating response entities which have content characteristics not acceptable according to the accept headers sent in the request. Supported entities are: application/json, application/vnd.tembo.api+json, application/vnd.tembo.api+json;version=1","status":406}

Is there any way I can fix that?

chitown88
theycallmepix
2 Answers

3

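The server wants an explicit `Accept` header with one of the media types listed in the error message. Try this: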

import requests

api_url = "https://inview.doe.in.gov/api/entities?filter=type==district,type==network,type==school,type==state&fields=name,type,id,district_id"

headers = {
    "Accept": "application/vnd.tembo.api+json",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
    "Referer": "https://inview.doe.in.gov/state/1088000000/school-list",
    "X-Requested-With": "XMLHttpRequest",
}

data = requests.get(api_url, headers=headers).json()
for item in data["entities"]:
    print(f'{item["name"]} - https://inview.doe.in.gov/schools/{item["id"]}/profile')

Output:

Country View School (A002) - https://inview.doe.in.gov/schools/1000001002/profile
Salem School (A003) - https://inview.doe.in.gov/schools/1000001003/profile
Hagia Sophia Classical Academy (A004) - https://inview.doe.in.gov/schools/1000001004/profile
CRC Academy (A005) - https://inview.doe.in.gov/schools/1000001005/profile
Morning Star School (A006) - https://inview.doe.in.gov/schools/1000001006/profile
Maple Lane School (A008) - https://inview.doe.in.gov/schools/1000001008/profile
Bloomington Islamic School (A009) - https://inview.doe.in.gov/schools/1000001009/profile
Winchester Amish School (A010) - https://inview.doe.in.gov/schools/1000001010/profile
Rainbow Valley (A011) - https://inview.doe.in.gov/schools/1000001011/profile
Hidden Valley School (A012) - https://inview.doe.in.gov/schools/1000001012/profile
Children of the Earth Montessori (A013) - https://inview.doe.in.gov/schools/1000001013/profile
Sunrise Ridge School (A014) - https://inview.doe.in.gov/schools/1000001014/profile
Pleasant Valley Amish School (A015) - https://inview.doe.in.gov/schools/1000001015/profile

and a lot more ...
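Since that URL filters on district, network, school and state at once, the response will likely contain more than schools. A minimal sketch of keeping only the schools, assuming each entity's `type` field holds the string "school" (inferred from the filter in the URL, not confirmed against the live response). The `school_links` helper and the second sample entry are hypothetical, made up for illustration:

```python
# Profile URL pattern taken from the loop above.
PROFILE_URL = "https://inview.doe.in.gov/schools/{id}/profile"


def school_links(entities):
    """Return (name, profile URL) pairs for entities whose type is "school"."""
    return [
        (item["name"], PROFILE_URL.format(id=item["id"]))
        for item in entities
        if item.get("type") == "school"  # assumption: type value is "school"
    ]


# Hypothetical sample in the shape requested via fields=name,type,id,district_id.
# The first entry reuses a name and id from the output above; the second is invented.
sample = [
    {"name": "Country View School (A002)", "type": "school", "id": 1000001002},
    {"name": "Example District", "type": "district", "id": 1},
]

for name, link in school_links(sample):
    print(f"{name} - {link}")
```

With the real response you would call `school_links(data["entities"])` instead of passing the sample list.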
baduker
  • So I'm supposed to fix the headers with a proper user agent and so on, right? I'll try this and get back to you. Thanks – theycallmepix Sep 27 '22 at 09:53
  • That's correct. – baduker Sep 27 '22 at 09:57
  • What I'm wondering is: how would you know in advance that you need to add all 4 of these lines to get it working? Trial and error? Based on what @SergeyK has answered, it is not required to send `User-Agent` and `Referer`. In that case, that would definitely solve the issue. Getting back to my PC. – theycallmepix Sep 27 '22 at 10:01
  • No need for trial and error. All you had to do was to inspect the header for a given URL in the Network tab of your browser's Developer Tools. Sure, you can omit all but the `Accept` header but I've added these to make the request look more like it's coming from a web browser. – baduker Sep 27 '22 at 10:05
  • Ah, makes sense. I still feel like I dismissed `application/vnd.tembo.api+json` for no reason, even though I suspected it might work. Anyway, thank you for the clarification. – theycallmepix Sep 27 '22 at 10:10
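To illustrate the point from the comments that only the `Accept` header matters, here is a minimal standard-library sketch (swapping `urllib` in for `requests`, so no third-party packages are needed). The helper names `build_request` and `fetch_entities` are hypothetical. Note that `urllib` still sends its own default `User-Agent` when the request is opened; whether the site accepts that one is an assumption I have not verified.

```python
import json
import urllib.request

API_URL = (
    "https://inview.doe.in.gov/api/entities"
    "?filter=type==district,type==network,type==school,type==state"
    "&fields=name,type,id,district_id"
)


def build_request(url, accept="application/vnd.tembo.api+json"):
    """Build a GET request whose only explicitly set header is Accept."""
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch_entities(url=API_URL):
    """Send the request and return the list stored under the "entities" key."""
    with urllib.request.urlopen(build_request(url), timeout=30) as response:
        return json.load(response)["entities"]
```

Calling `fetch_entities()` performs the actual network request. If the server rejects it, adding the browser-like headers from the answer above would be the next thing to try.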
3

For example, the first API works with only the `Accept` header, and the result can be loaded into pandas:

import requests
import pandas as pd

url = "https://inview.doe.in.gov/api/entities?lang=en&merges=[{%22route%22:%20%22entities%22,%20%22name%22:%20%22district%22,%20%22local_field%22:%20%22district_id%22,%20%22foreign_field%22:%20%22id%22,%20%22fields%22:%20%22id,name%22}]&filter=state_id==1088000000"
headers = {
  'accept': 'application/vnd.tembo.api+json',
}
schools = []
response = requests.get(url, headers=headers)
for school in response.json()['entities']:
    grades = school.get('grades')
    schools.append({
        'ID': school['id'],
        'Name': school['name'],
        'Type': school['type'],
        # Lowest and highest grade offered, or 'NA' when no grades are listed
        'Grades': f"{grades[0]['name']} - {grades[-1]['name']}" if grades else 'NA',
        # Some entries have no phone number at all
        'Phone': school.get('phone_number', 'NA'),
    })
df = pd.DataFrame(schools)
print(df.to_string(index=False))

OUTPUT:

        ID                                                                Name     Type                      Grades          Phone
1053105210                                 Edgewood Intermediate School (5210)   school           Grade 4 - Grade 6 (317) 803-5024
1053105317                              Wanamaker Early Learning Center (5317)   school               Pre-K - Pre-K (317) 860-4500
1045353742                              Wolcott Mills Elementary School (3742)   school               Pre-K - Pre-K (260) 499-2450
1045353746                                     Lima-Brighton Elementary (3746)   school               Pre-K - Pre-K (260) 499-2440
1033352672                                      Little Cadets Preschool (2672)   school               Pre-K - Pre-K (000) 000-0000
1014051133                                           Washington Primary (1133)   school             Pre-K - Grade 1 (812) 254-8360
1018751365                                   Royerton Elementary School (1365)   school      Kindergarten - Grade 5 (765) 282-2044
1018751367                                          Delta Middle School (1367)   school           Grade 6 - Grade 8 (765) 747-0869
1018751369                                            Delta High School (1369)   school          Grade 9 - Grade 12 (765) 288-5597
1018751409                                      Eaton Elementary School (1409)   school      Kindergarten - Grade 5 (765) 396-3301
1018751520                                     Albany Elementary School (1520)   school      Kindergarten - Grade 5 (765) 789-6102
1019101387                                       Yorktown Middle School (1387)   school           Grade 6 - Grade 8 (765) 759-2660
1019101389                                         Yorktown High School (1389)   school          Grade 9 - Grade 12 (765) 759-2550
1019101393                                   Yorktown Elementary School (1393)   school           Grade 3 - Grade 5 (765) 759-2770
1019101395                              Pleasant View Elementary School (1395)   school      Kindergarten - Grade 2 (765) 759-2800
1018951375                                         Wapahani High School (1375)   school          Grade 9 - Grade 12 (765) 289-7323
1018951377                                          Selma Middle School (1377)   school           Grade 6 - Grade 8 (765) 288-7242
1018951381                                      Selma Elementary School (1381)   school      Kindergarten - Grade 5 (765) 282-2455
1019701500                                       Muncie Virtual Academy (1500)   school     Kindergarten - Grade 12             NA
1019701513                                      East Washington Academy (1513)   school             Pre-K - Grade 5 (765) 747-5434
...
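The hand-encoded `%22` and `%20` sequences in the URL are easy to get wrong. As a sketch, the same query string can be built from plain Python data and encoded in one step. The `build_entities_url` helper is hypothetical, and this assumes the server also accepts the fully percent-encoded form (for example `==` sent as `%3D%3D`), which I have not verified:

```python
import json
from urllib.parse import urlencode

BASE_URL = "https://inview.doe.in.gov/api/entities"


def build_entities_url(state_id):
    """Build the entities URL with the district merge for one state id."""
    merges = [{
        "route": "entities",
        "name": "district",
        "local_field": "district_id",
        "foreign_field": "id",
        "fields": "id,name",
    }]
    params = {
        "lang": "en",
        "merges": json.dumps(merges),  # the API takes the merge spec as JSON text
        "filter": f"state_id=={state_id}",
    }
    # urlencode percent-encodes quotes, brackets and "=" in the values
    return f"{BASE_URL}?{urlencode(params)}"


print(build_entities_url(1088000000))
```

With `requests` you could instead pass the same `params` dictionary through the `params=` argument and let the library do the encoding.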
Sergey K
  • I believe I was almost there. If I had tried putting `application/vnd...` in the headers instead of `*/*`, that would've given me an idea. I'll try this and get back to you. – theycallmepix Sep 27 '22 at 09:56
  • @theycallmepix In general, the answer to your question was in the server's response, which lists the supported media types: application/json, application/vnd.tembo.api+json, application/vnd.tembo.api+json;version=1. Usually, though, an API does not require any particular headers, only a payload with a token. In this case, this is a site vulnerability because the request has a very large size. – Sergey K Sep 27 '22 at 10:24
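As the last comment notes, the 406 body itself names the acceptable media types. A minimal sketch of pulling them out programmatically, assuming the error body keeps the exact "Supported entities are: ..." wording quoted in the question. The `supported_media_types` helper is hypothetical:

```python
import json
import re

# Error body copied from the question.
ERROR_BODY = (
    '{"message":"The resource identified by the request is only capable of '
    'generating response entities which have content characteristics not '
    'acceptable according to the accept headers sent in the request. '
    'Supported entities are: application/json, application/vnd.tembo.api+json, '
    'application/vnd.tembo.api+json;version=1","status":406}'
)


def supported_media_types(error_body):
    """Extract the media types listed after "Supported entities are:"."""
    message = json.loads(error_body)["message"]
    match = re.search(r"Supported entities are:\s*(.+)$", message)
    if not match:
        return []
    return [part.strip() for part in match.group(1).split(",")]


print(supported_media_types(ERROR_BODY))
```

Any of the returned values could then be tried as the `Accept` header; the answers above confirm that `application/vnd.tembo.api+json` works.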