
I am trying to return a list of URLs from a Google News search. I am using the GoogleNews module and a pandas DataFrame to organize the results. I am then taking those URLs and downloading the web pages using pywebcopy.

Right now, my for loop returns results in groups of 9 instead of 1 at a time, which I believe is the issue when downloading the web page with the save_webpage function, since save_webpage can only handle 1 URL at a time. I have no clue how to shorten the range of results returned.

I've tried adjusting the range but (1,1) seems to be the lowest it can go, and that always returns 9 URLs instead of 1.

Here is my code:

from GoogleNews import GoogleNews
from newspaper import Article
import pandas as pd
from pywebcopy import save_webpage

googlenews=GoogleNews(start = '12/01/2021',end= '12/31/2021')
googlenews.search('test search')
result=googlenews.result()
df=pd.DataFrame(result)

for i in range(1,1):
    googlenews.getpage(i)
    result=googlenews.result()
    df=pd.DataFrame(result)

list = []

for ind in df.index:
    try:
        dict={}
        article = Article(df['link'][ind])
        article.download()
        article.parse()
        dict['Article Title'] = article.title
        dict['Article Text'] = article.text

        url = str(df['link'])
        print(str(url))

        download_folder = 'C:\Test_Data'

        kwargs = {'bypass_robots': True, 'project_name': 'PROJECT'}

        save_webpage(url, download_folder, **kwargs)
        list.append(dict)
    except:
        pass
  • `range(start, stop, step)` -> What do you try to achieve with `range(1, 1)`? – Mr. T Jan 08 '22 at 17:01
  • `for loop increments in groups of 9 instead of 1 at a time` Which one? There are two for loops. – Nick ODell Jan 08 '22 at 17:02
  • What happens if you don't swallow all exceptions with `try: ... except: pass`? – Jasmijn Jan 08 '22 at 17:04
  • @Mr.T - I am trying to grab 1 URL from the result variable. I see what you're saying; this is having me think of things differently. – jrb0831 Jan 08 '22 at 17:15
  • @NickODell It happens in the first for loop. There seems to be an issue with my logic of range(1,1). Sorry for not clarifying that. – jrb0831 Jan 08 '22 at 17:19
  • @Jasmijn - I tried this and it fails at save_webpage with the following error: Invalid URL '/robots.txt': No scheme supplied. Perhaps you meant http:///robots.txt? – jrb0831 Jan 08 '22 at 17:31

1 Answer


I've tried adjusting the range but (1,1) seems to be the lowest it can go, and that always returns 9 URLs instead of 1.

If you write the following loop, you'll actually get a loop which executes 0 times:

for i in range(1,1):
    print("Looping at index " + str(i))

If you run this, it will not print anything, because it loops 0 times. A shortcut for figuring out how many times a range(start, stop) loop runs (with the default step of 1) is to subtract the start from the stop. So, e.g., this loops 1 time, because 2 - 1 = 1:

for i in range(1,2):
    print("Looping at index " + str(i))
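You can check that rule quickly, since `len()` on a `range` tells you exactly how many iterations a loop over it performs:

```python
# len() of a range is the number of times a for loop over it runs
assert len(range(1, 1)) == 0  # start == stop: the loop body never runs
assert len(range(1, 2)) == 1  # one iteration, with i = 1
assert len(range(0, 9)) == 9  # nine iterations, i = 0 through 8
```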

So, why are you getting nine results back? The GoogleNews library is designed to fetch one "page" of results at a time, and a page contains several articles. These lines fetch one page:

result=googlenews.result()
df=pd.DataFrame(result)

Since those lines sit outside the loop, they still execute even though the loop body never runs, so you get a full page of results rather than a single URL.

To fix this, I recommend looping over the page of results, and calling save_webpage() once per article.
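Here is a minimal sketch of that fix, using a stand-in DataFrame (the URLs below are hypothetical, and the save_webpage call is left as a comment since it needs pywebcopy and network access):

```python
import pandas as pd

# Stand-in for the search results (hypothetical URLs, not real data)
df = pd.DataFrame({'link': ['https://example.com/a', 'https://example.com/b']})

# str(df['link']) stringifies the WHOLE column at once -- pywebcopy then sees
# an invalid URL, which explains the "No scheme supplied" error in the comments.
# Index the column instead to get one URL string per iteration:
for ind in df.index:
    url = df['link'][ind]  # a single URL string
    # save_webpage(url, download_folder, bypass_robots=True,
    #              project_name='PROJECT')  # one call per article
    print(url)
```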

Nick ODell