
I'm new to Python and BeautifulSoup. Any help is highly appreciated.

I have an idea of how to build a list of one company's info, but only after clicking through to that company's individual link.

import requests
from bs4 import BeautifulSoup

url = "http://data-interview.enigmalabs.org/companies/"
r = requests.get(url)

soup = BeautifulSoup(r.content, "html.parser")

links = soup.find_all("a")

link_list = []

# Print each link's target and text.
for link in links:
    print link.get("href"), link.text

# The company data sits inside a responsive table wrapper.
g_data = soup.find_all("div", {"class": "table-responsive"})

# Collect the link targets (list.append returns None, so don't print it).
for link in links:
    link_list.append(link.get("href"))

Can anyone give me an idea of how to first scrape the links and then build a JSON of all the company listing data for the site?

I attached sample images for better visualization as well.

How would I scrape the site and build a JSON like my example below without having to click on each individual link?

Example Expected Output:

all_listing = [{'Dickens-Tillman': {'Company Detail':
    {'Company Name': 'Dickens-Tillman',
     'Address Line 1   ': '7147 Guilford Turnpike Suit816',
     'Address Line 2   ': 'Suite 708',
     'City': 'Connfurt',
     'State': 'Iowa',
     'Zipcode  ': '22598',
     'Phone': '00866539483',
     'Company Website  ': 'lockman.com',
     'Company Description': 'enable robust paradigms'}}},
 {'Klein-Powlowski': {'Company Detail':
    {'Company Name': 'Klein-Powlowski',
     'Address Line 1   ': '32746 Gaylord Harbors',
     'Address Line 2   ': 'Suite 866',
     'City': 'Lake Mario',
     'State': 'Kentucky',
     'Zipcode  ': '45517',
     'Phone': '1-299-479-5649',
     'Company Website  ': 'marquardt.biz',
     'Company Description': 'monetize scalable paradigms'}}}]

print all_listing


  • Hmm... would you provide us with the actual url? – cs95 Jul 07 '17 at 04:43
  • @cᴏʟᴅsᴘᴇᴇᴅ yeah no problem the actual url is [link](http://data-interview.enigmalabs.org/companies/) – Vash Jul 07 '17 at 04:53
  • Argh, this looks like a job for selenium + bs4. – cs95 Jul 07 '17 at 04:55
  • Is the company info displayed on the listing pages, or only on separate pages (one per company)? – jlaur Jul 07 '17 at 10:52
  • @jlaur You are correct. That's why I am so confused. If it were all on one page it would be easier, but I have no idea how to get all the info in its current state. – Vash Jul 07 '17 at 12:59
  • So you need to make two scrapers. 1) gets the links and puts them in a list. 2) takes a link as input and scrapes the content (for the company). Once both these work you tie them together. What happens if you run your code? Do you get links? – jlaur Jul 07 '17 at 13:50
  • @jlaur After I run the code I get the links but not the content. – Vash Jul 08 '17 at 03:31
  • So your scraper for task 1 works? Instead of printing the links, put them in an existing list using append (link_list.append(link)). You should now build a new scraper (task 2) that sends off a request to one of the content pages. Do exactly what you did with the scraper for task 1, but instead of getting links you get company info (a sketch of the two scrapers tied together follows these comments). Edit your question with this new code once you're done. Then we can proceed by joining the two scrapers into one. – jlaur Jul 08 '17 at 07:48
  • The reason you're not getting content is that you have only built the scraper for task 1. – jlaur Jul 08 '17 at 07:52
  • @jlaur I am not sure what the second scraper would look like. I updated my code to show how I put the first task's results in a list. – Vash Jul 10 '17 at 18:50
  • @cᴏʟᴅsᴘᴇᴇᴅ Below you can view the final solution. Just needed to step back and talk it out. Much easier than I thought. I had to change a few things around – Vash Jul 13 '17 at 22:21
  • @jlaur Below you can view the final solution. Just needed to step back and talk it out. I had to change a few things around – Vash Jul 13 '17 at 22:22
  • @mcd5185 Glad to hear you sorted it out. – cs95 Jul 13 '17 at 22:26
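
A minimal sketch of that two-scraper idea, tied together at the end (assuming, as the answer below does, that the listing is paginated as companies?page=N across ten pages and that each page keeps its data in a single HTML table; collect_links and scrape_company are illustrative names, not from the original code):

import requests
from bs4 import BeautifulSoup

base = 'http://data-interview.enigmalabs.org/'

# Scraper 1: walk the paginated listing and collect the company links.
def collect_links(pages=10):
    links = []
    for page in range(1, pages + 1):
        soup = BeautifulSoup(requests.get(base + 'companies?page=%d' % page).content, 'html.parser')
        links += [a['href'] for a in soup.find('table').find_all('a')]
    return links

# Scraper 2: take one link as input and scrape that company's detail table.
def scrape_company(link):
    soup = BeautifulSoup(requests.get(base + link.lstrip('/')).content, 'html.parser')
    detail = {}
    for row in soup.find('table').find_all('tr'):
        label, value = row.find_all('td')  # assumes two cells per row, as on this site
        detail[label.text] = value.text
    return detail

# Tie the two together into one {company: details} dict.
companies = dict((link.rstrip('/').split('/')[-1], scrape_company(link)) for link in collect_links())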

1 Answer


Here is my final solution to the question I asked.

import bs4, urlparse, json, requests
from os.path import basename as bn

links = []
data = {}
base = 'http://data-interview.enigmalabs.org/'

# Approach:
# 1. Walk each listing page and collect the company links.
# 2. Iterate over the collected links.
# 3. Scrape the detail table on each company's page.
# 4. Push the collected data to a JSON file.


def bs(r):
    # Fetch a path relative to the base URL and return the page's lone <table>.
    return bs4.BeautifulSoup(requests.get(urlparse.urljoin(base, r)).content, 'html.parser').find('table')

# Step 1: the listing is paginated as companies?page=N, pages 1 through 10.
for i in range(1, 11):
    print 'Collecting page %d' % i
    links += [a['href'] for a in bs('companies?page=%d' % i).findAll('a')]

# Steps 2-3: all the info on a company page sits in one HTML table,
# so walk its rows and store each label/value pair.
for link in links:
    print 'Processing %s' % link
    name = bn(link)
    data[name] = {}
    for row in bs(link).findAll('tr'):
        desc, cont = row.findAll('td')
        data[name][desc.text.encode()] = cont.text.encode()

print json.dumps(data)

# Step 4: write the data out as pretty-printed JSON.
json_data = json.dumps(data, indent=4)
with open("solution.json", "w") as f:
    f.write(json_data)
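
For reference, a rough Python 3 port of the same solution (urlparse.urljoin moved to urllib.parse.urljoin, print became a function, and the .encode() calls are no longer needed); the helper name page_table is illustrative:

import json
import requests
from bs4 import BeautifulSoup
from os.path import basename
from urllib.parse import urljoin

base = 'http://data-interview.enigmalabs.org/'

def page_table(path):
    # Fetch a path relative to the base URL and return the page's lone <table>.
    html = requests.get(urljoin(base, path)).content
    return BeautifulSoup(html, 'html.parser').find('table')

links = []
for i in range(1, 11):
    print('Collecting page %d' % i)
    links += [a['href'] for a in page_table('companies?page=%d' % i).find_all('a')]

data = {}
for link in links:
    print('Processing %s' % link)
    name = basename(link)
    data[name] = {}
    for row in page_table(link).find_all('tr'):
        desc, cont = row.find_all('td')
        data[name][desc.text] = cont.text

with open('solution.json', 'w') as f:
    json.dump(data, f, indent=4)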