We are using Google's Custom Search JSON API for higher-education research: we parse a large number of URLs to find information on how various organizations responded to COVID-19, and we use the API to retrieve the top search results. However, we get inconsistent results when we vary the search operators within the API query. This is a problem because we are trying to tune our query to a target error rate (which we measure by how many of the returned URLs provide useful research information). We are looking for someone to help explain how Google's API interprets these queries, because the documentation is extremely minimal. An example of our base query:
'https://www.googleapis.com/customsearch/v1?key=KEY&cx=SEARCHENGINE&q="School Name" intext:(term1 | term2 | term3) -inurl:(unwanted1 | unwanted2 | unwanted3) inurl:(wanted1 | wanted2 | wanted3)&start=1'
Here, "School Name" is the name of a higher-ed institution. Term1, term2, etc. are specific terms that we want to find in the body text of the search results. The intext operator helps us avoid matches on invisible text in certain documents. For example, insidehighered.com includes many higher-ed institutions in invisible text even when the article itself is not applicable. Unwanted1, etc. are words or phrases that we don't want to appear in the URL. For example, we want to avoid PDF documents, so one of them could be ".pdf". Wanted1, etc. are words that we do want in the URL, like "news". We use "|" to signify "or", which lets us cover multiple types of searches with one query and so helps minimize the cost of our API usage.
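To make the structure of the q parameter concrete, here is a minimal sketch of how a query like the one above could be assembled and URL-encoded. The helper names (or_group, build_q, build_url) are hypothetical and not part of Google's API, and the sketch only builds the request URL; it says nothing about how Google interprets the operators.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def or_group(operator, terms):
    """Render an operator with OR-ed terms, e.g. intext:(a | b | c)."""
    return f"{operator}:({' | '.join(terms)})"


def build_q(school, body_terms, unwanted_url_terms, wanted_url_terms):
    """Assemble the q string in the same order as the base query above."""
    parts = [
        f'"{school}"',                                # exact-phrase school name
        or_group("intext", body_terms),               # terms wanted in the body text
        "-" + or_group("inurl", unwanted_url_terms),  # terms unwanted in the URL
        or_group("inurl", wanted_url_terms),          # terms wanted in the URL
    ]
    return " ".join(parts)


def build_url(key, cx, q, start=1):
    """Percent-encode every parameter so quotes, pipes and spaces survive intact."""
    return API_ENDPOINT + "?" + urlencode({"key": key, "cx": cx, "q": q, "start": start})


q = build_q("School Name", ["term1", "term2"], [".pdf"], ["news"])
url = build_url("KEY", "SEARCHENGINE", q, start=11)
```

Building the URL through urlencode means the quotes, pipes and spaces are always encoded the same way, which removes client-side encoding as one possible source of differences between runs.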
So far, we've found the following issues/inconsistencies:
- Negating terms with "-" and with "NOT" returns different results.
- The order of the operators matters. For example, "inurl:(some wanted search terms) -inurl:(some unwanted search terms)" returns different results than "-inurl:(some unwanted search terms) inurl:(some wanted search terms)".
- Nesting of terms is also inconsistent. For example, "inurl:( (wanted terms | wanted terms) NOT(unwanted terms | unwanted terms))" returns different results than "inurl:(wanted term | wanted term| NOT unwanted term NOT unwanted term)".
- Furthermore, running the exact same query twice occasionally returns different results. It seems that the query returns 10 results, but sporadically mixes in the last 1 or 2 from either the next page or somewhere else. For example, this query: "https://www.googleapis.com/customsearch/v1?key=KEY&cx=SEARCHENGINE&q="Miami University-Hamilton" intext:(reduce tuition | freeze tuition | decrease tuition | lower tuition) inurl:(news | announcement | article | story) -inurl:(registrar | admissions | tuition-and-fees | tuition-schedule | schedule | state | office | employment-opportunities | about-us | about | linkedin | events | .uk | irs | .gov | information-technology | wikipedia | wiki | employee-handbook | student-handbook | shop | annual | youtube | pinterest | store | openings | indeed | amazon | contact | job-board | jobboard | policies | frequently-asked-questions | faq | forms | hours | academic-calendar | calendar | directory | glassdoor | facebook | encyclopedia)&start=31" (and then start=41 for the next page) returns "http://www.harbison.one/archive/z_1985_national_cc_directory.pdf" as both the last item on the 4th page and the 1st item on the 5th page. When we run our GET request, it sometimes returns a different result for the last item on the 4th page, but then returns that same duplicate URL for both pages.
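As a way to pin down the duplicate-across-pages behaviour, here is a minimal sketch that merges the link lists from successive pages and records which links appear more than once and on which pages. The function name merge_pages is hypothetical, and the first link in the sample data is a made-up placeholder; only the harbison.one URL comes from the results described above.

```python
def merge_pages(pages):
    """Merge per-page link lists, keeping first occurrences and logging repeats.

    pages: list of lists of links, in page order (page 1 first).
    Returns (unique, duplicates), where duplicates holds
    (link, first_page, repeat_page) tuples.
    """
    seen = {}        # lowercased link -> page number where it first appeared
    unique = []      # links in first-seen order
    duplicates = []  # one entry per repeated occurrence
    for page_no, links in enumerate(pages, start=1):
        for link in links:
            key = link.lower()
            if key in seen:
                duplicates.append((key, seen[key], page_no))
            else:
                seen[key] = page_no
                unique.append(key)
    return unique, duplicates


dup_url = "http://www.harbison.one/archive/z_1985_national_cc_directory.pdf"
sample_pages = [
    ["http://example.edu/news/placeholder", dup_url],  # stands in for the 4th page
    [dup_url],                                         # stands in for the 5th page
]
unique, duplicates = merge_pages(sample_pages)
```

Logging the duplicates list for each run, rather than silently dropping repeats, makes it easier to see whether the overlap always falls on the page boundary.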
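The code we use to pull the items off each page is:

    import requests

    # query (the request URL) and num (results to keep per page) are set earlier
    response = requests.get(query)
    content = response.json()
    hrefs = []
    try:
        for i in content['items'][0:num]:
            hrefs.append(i['link'].lower())
    except Exception as e:
        # e.g. a KeyError when the response contains no 'items'
        print(str(e))
        hrefs.append('a')  # placeholder entry
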
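For comparison, here is a minimal sketch of the same extraction step that reports why a page yielded no links instead of appending a placeholder. The function name extract_links is hypothetical. It assumes that an empty result page simply omits the 'items' key and that a failed request carries a top-level 'error' object with a 'message' field; both assumptions should be checked against real responses.

```python
def extract_links(content, num=10):
    """Return (links, problem) for one parsed JSON response.

    links: lowercased result links, at most num of them.
    problem: None on success, otherwise a short note on why links is empty.
    """
    if "error" in content:
        err = content["error"]
        # Assumed shape: {"error": {"message": "..."}}; fall back to the raw value.
        message = err.get("message", err) if isinstance(err, dict) else err
        return [], "api error: " + str(message)
    items = content.get("items")
    if not items:
        return [], "no items in response"
    return [item["link"].lower() for item in items[:num]], None


links, problem = extract_links({"items": [{"link": "HTTP://Example.edu/News"}]})
```

Keeping the problem note separate from the link list means a quota error or an empty page cannot be mistaken for a genuine change in the search results.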
Thank you!