
Why doesn't BeautifulSoup manage to download information from Wix?

I'm trying to use BeautifulSoup to download images from my website. The code works on other sites (example of the code actually working), but it does not work on my Wix site. Is there anything I can change in my site's settings to make it work?

EDIT: CODE

from bs4 import BeautifulSoup
import shutil
import time
import urllib.request
from urllib.parse import urljoin

import requests


def make_soup(url):
    # Fetch the raw HTML as served; no JavaScript runs at this point.
    req = urllib.request.Request(url, headers={'User-Agent': "Magic Browser"})
    html = urllib.request.urlopen(req)
    return BeautifulSoup(html, 'html.parser')


def get_images(url):
    soup = make_soup(url)
    images = soup.find_all('img')
    print(str(len(images)) + " images found.")
    print('Downloading images to current working directory.')
    # Skip <img> tags without a src attribute; they have nothing to download.
    image_links = [each.get('src') for each in images if each.get('src')]
    for each in image_links:
        try:
            filename = each.strip().split('/')[-1].strip()
            src = urljoin(url, each)
            print('Getting: ' + filename)
            response = requests.get(src, stream=True)
            # delay to avoid corrupted previews
            time.sleep(1)
            with open(filename, 'wb') as out_file:
                shutil.copyfileobj(response.raw, out_file)
        except Exception as error:
            # Report what went wrong instead of silently swallowing it.
            print('  An error occurred (' + str(error) + '). Continuing.')
    print('Done.')


def main():
    url = 'HIDDEN ADDRESS'  # site address withheld by the asker
    get_images(url)

if __name__ == '__main__':
    main()
Stein Åsmul
Lior shem
  • BeautifulSoup doesn't download anything. So, whatever your problem is with the downloading, it's in the way you're using some other library like `urllib` or `requests`. – abarnert Mar 29 '18 at 21:38
  • Also, we can't possibly debug your code without seeing it. We can make some wild guesses (maybe it _is_ downloading just fine, but the website is almost entirely generated by JavaScript running in the browser; maybe you've put an `except: pass` in your code; maybe you're violating the ToS for the site and they've blocked you; …), but that's about it. Please post a [mcve]. – abarnert Mar 29 '18 at 21:39
  • Sorry, yes, you're right, it's probably a urllib kind of problem. I'll add the code and edit my question. – Lior shem Mar 29 '18 at 21:40

2 Answers


BeautifulSoup can only parse HTML. Wix sites are generated by JavaScript that runs when the page loads in a browser. When you request the page's HTML via urllib, you don't get the rendered HTML; you only get the base HTML with the scripts that build it. To get the rendered HTML, you'd need something like Selenium or a headless Chrome browser to render the site via its JavaScript, and then feed the rendered HTML to BeautifulSoup.

Here's an example of the body of a wix site, which you can see has no content other than a single div that gets populated via javascript.

...
    <body>
        <div id="SITE_CONTAINER"></div>
    </body>
...
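
To make the approach above concrete, here is a minimal sketch of rendering the page first and only then handing the HTML to BeautifulSoup. It assumes Selenium and a matching Chrome driver are installed; the helper names `extract_image_urls` and `render_page`, the 15-second timeout, and the choice to wait for the first `<img>` element are all illustrative assumptions, not anything specific to Wix.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_image_urls(html, base_url):
    """Return absolute URLs for every <img> tag in html that has a src."""
    soup = BeautifulSoup(html, 'html.parser')
    return [urljoin(base_url, img['src'])
            for img in soup.find_all('img') if img.get('src')]


def render_page(url, timeout=15):
    """Load url in headless Chrome and return the HTML after scripts ran.

    Hypothetical helper: raises Selenium's TimeoutException if no <img>
    element shows up within `timeout` seconds.
    """
    # Imported here so extract_image_urls stays usable without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = Options()
    options.add_argument('--headless')
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Wait until the page's JavaScript has inserted at least one image.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.TAG_NAME, 'img')))
        return driver.page_source
    finally:
        driver.quit()
```

Calling `extract_image_urls(render_page(url), url)` should then give a list of image URLs that can be downloaded with `requests`, as in the question's loop. Passing the unrendered body shown above to `extract_image_urls` returns an empty list, which is why the original script reports 0 images.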
Ngenator

For anyone trying to download images from a Wix site, I found a simple workaround. Add an HTML Code frame to your page and, in its code, add img tags whose src attributes link to the pictures on your site. When you run the script on the HTML Code frame's URL, BeautifulSoup finds all of the images linked in that code, and they get downloaded!

Lior shem