1

I've written a script in python which is able to take search input from an excel file, put it into the input_val variable within my scraper and fetch the image link matching with the search parameter. When I print it, I get the accurate results. I'm using openpyxl to read and write the values.

However, when I try to write the populated results next to each search keyword in a new excel file I get stuck.

There are three search keyword (three movies) in my current excel file named item.xlsx. Which are:

Shutter Island 
Black Swan
True Grit

If I run my existing scraper without any modification then the results in a new file looks like:

Shutter Island    
Black Swan        
True Grit         
image_link 1
image_link 2
image_link 3

I expect my scraper will fetch the image links and write those just next to each movies in a new excel file. As a result the output should look like below:

Column A          Column B
Shutter Island    image_link 1
Black Swan        image_link 2
True Grit         image_link 3

This Is what I've tried so far:

import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook

wb = load_workbook('item.xlsx')
ws = wb['Sheet1']

for row in range(1, ws.max_row + 1):
    input_val = ws["A" + str(row)].value  #the search keyword holds here

    response = requests.get("http://www.boxofficemojo.com/search/?",params = {'q':input_val})
    soup = BeautifulSoup(response.text,"lxml")
    table = soup.select("table")[1]
    for items in table.select('tr')[4:5]:
        [elem.extract() for elem in soup.select("script")] #kicking out script from result
        data = [img['src'] for img in items.select('td img')]
        ws.append(data)
        wb.save("new_one.xlsx")
SIM
  • 21,997
  • 5
  • 37
  • 109

1 Answers1

1

Below I've provided an example of how you can do it. In Column B, there's a string of the url. I've also added a third column that utilizes the HYPERLINK function in excel to display the text of column A and link to the desired url provide in Column B.

import requests
from bs4 import BeautifulSoup
from openpyxl import load_workbook

wb = load_workbook('item.xlsx')
ws = wb['Sheet1']

for row in range(1, ws.max_row + 1):
    input_val = ws["A" + str(row)].value  #the search keyword holds here
    response = requests.get("http://www.boxofficemojo.com/search/?",params = {'q':input_val})
    soup = BeautifulSoup(response.text,"lxml")
    table = soup.select("table")[1]
    for items in table.select('tr')[4:5]:
        [elem.extract() for elem in soup.select("script")] #kicking out script from result
        data = [img['src'] for img in items.select('td img')]
        if row != 1:
            cell_str = data[0]
            ws.cell(row=row, column=2).value = '%s' % (cell_str)
            ws.cell(row=row, column=3).value = '=HYPERLINK(B%d, A%d)' % (row, row)
        else:
            ws.cell(row=row, column=2).value = 'Column B'
            ws.cell(row=row, column=3).value = 'Hyperlink'
        wb.save("new_one.xlsx")

Output: Excel Output

patrickjlong1
  • 3,683
  • 1
  • 18
  • 32
  • 1
    I've never expected such an ideal answer. You are the legend @patrickjlong1. Just a little oneliner explanation for the `else` block and if any link you know of where I can learn the way you did here. Thanks again. – SIM Jan 19 '18 at 17:47
  • 1
    Perhaps, I've understood that you used the `else` block for writing headers. Now, my only expectation is you may provide me with any link tracking which I can learn such ideal usage. – SIM Jan 19 '18 at 18:04
  • 1
    Thanks @novice-coder. And now you mention it, the `else` statement is a bit unnecessary, you could skip the `else` and hardcode `row = 1`. I found this [answer](https://stackoverflow.com/questions/39077661/adding-hyperlinks-in-some-cells-openpyxl) that give you some other options for hyperlinks in openpyxl. – patrickjlong1 Jan 19 '18 at 18:16