
I'm using dryscrape/webkit_server to scrape JavaScript-enabled websites.

The memory usage of the webkit_server process seems to increase with each call to session.visit(). This happens to me with the following script:

import dryscrape

for url in urls:  # urls is a list of approx. 300 URLs
    session = dryscrape.Session()
    session.set_timeout(10)
    session.set_attribute('auto_load_images', False)
    session.visit(url)
    response = session.body()

I'm iterating over approx. 300 URLs, and after 70-80 of them webkit_server takes up about 3 GB of memory. The memory itself is not really the problem for me, though; the bigger issue is that dryscrape/webkit_server seems to get slower with each iteration. After those 70-80 iterations dryscrape is so slow that it raises a timeout error (timeout set to 10 s) and I have to abort the Python script. Restarting webkit_server (e.g. after every 30 iterations) might help and would free the memory, but I'm not sure whether the "memory leaks" are really what makes dryscrape slower and slower.

Does anyone know how to restart the webkit_server so I could test that?
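
What I have in mind is roughly this (just a sketch of the test I would like to run; the commented lines mark exactly the part I don't know how to do):

import dryscrape

# urls as above (approx. 300 URLs)
session = dryscrape.Session()
session.set_timeout(10)
session.set_attribute('auto_load_images', False)

for i, url in enumerate(urls):
    if i and i % 30 == 0:
        # <-- here I would like to kill and restart the webkit_server
        #     process, then continue with a fresh session
        session = dryscrape.Session()
        session.set_timeout(10)
        session.set_attribute('auto_load_images', False)
    session.visit(url)
    response = session.body()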

I have not found an acceptable workaround for this issue, but I also don't want to switch to another solution (Selenium/PhantomJS, Ghost.py), as I simply love dryscrape for its simplicity. By the way, dryscrape works great as long as one is not iterating over too many URLs in one session.

This issue is also discussed in https://github.com/niklasb/dryscrape/issues/41 and in the question "Webkit_server (called from python's dryscrape) uses more and more memory with each page visited. How do I reduce the memory used?".


– Baili
  • as long as dryscrape has some python code, you could throw `@profile` decorators and run [mprof](https://pypi.python.org/pypi/memory_profiler/0.33) or [kernprof](https://github.com/rkern/line_profiler). You could run it on your own code, but that probably won't be nearly as helpful. – Wayne Werner Mar 31 '16 at 20:22
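
For reference, the profiling setup that comment describes looks roughly like this (a sketch only; since the growth happens inside the external webkit_server process, this will mostly confirm that your own Python code is not the culprit):

# pip install memory_profiler
from memory_profiler import profile

import dryscrape

@profile  # memory_profiler prints a per-line memory report when this function returns
def scrape(urls):
    session = dryscrape.Session()
    session.set_attribute('auto_load_images', False)
    for url in urls:
        session.visit(url)
        body = session.body()

if __name__ == '__main__':
    scrape(['http://example.com'])
    # or record and plot over time with:  mprof run script.py  then  mprof plot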

5 Answers


The memory leak you're seeing may also be related to the fact that the webkit_server process never actually gets killed (and that you're spawning a new dryscrape.Session every iteration, each of which spawns a webkit_server process in the background that is never terminated). So it will just keep spawning a new process on every iteration. @Kenneth's answer may work, but any solution that requires calling the command line is sketchy. A better solution is to declare the session once at the beginning and kill the webkit_server process from Python at the end:

import webkit_server
import dryscrape

server = webkit_server.Server()
server_conn = webkit_server.ServerConnection(server=server)
driver = dryscrape.driver.webkit.Driver(connection=server_conn)
sess = dryscrape.Session(driver=driver)
# set session settings as needed here

for url in urls:
    sess.visit(url)
    response = sess.body()
    sess.reset()

server.kill() # the crucial line!

Frankly, this is a shortcoming of the dryscrape library; the kill command should really be accessible from the dryscrape Session.
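
Since dryscrape does not expose kill() on the Session itself, one way to package the pattern above is a small context manager, so the server is killed even if an exception is raised. This is just a sketch built from the objects used in the snippet above; the helper name webkit_session is only an example:

import contextlib

import dryscrape
import webkit_server


@contextlib.contextmanager
def webkit_session():
    # spawn a dedicated webkit_server process and wire a Session to it;
    # the process is killed when the with-block exits
    server = webkit_server.Server()
    conn = webkit_server.ServerConnection(server=server)
    driver = dryscrape.driver.webkit.Driver(connection=conn)
    try:
        yield dryscrape.Session(driver=driver)
    finally:
        server.kill()


with webkit_session() as sess:
    sess.set_attribute('auto_load_images', False)
    for url in urls:  # urls as in the question
        sess.visit(url)
        response = sess.body()
        sess.reset()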

– nico

Sorry for digging up this old post, but what I did to solve the issue (after googling and only finding this post) was to run dryscrape in a separate process and then kill Xvfb after each run.

So my dryscrape script is:

import sys
import dryscrape

dryscrape.start_xvfb()
session = dryscrape.Session()
session.set_attribute('auto_load_images', False)
session.visit(sys.argv[1])
print session.body().encode('utf-8')

And to run it:

import os
import subprocess

p = subprocess.Popen(["python", "dryscrape.py", url],
                     stdout=subprocess.PIPE)
result = p.stdout.read()
print "Killing all Xvfb"
os.system("sudo killall Xvfb")

I know it's not the best way, and the memory leak should be fixed, but this works.
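
If sudo killall Xvfb is too blunt (it kills every Xvfb on the machine, not just yours), a more targeted variant is to start the child script in its own process group and terminate that group afterwards. A sketch, assuming a POSIX system and that the Xvfb started inside the child stays in the child's process group:

import os
import signal
import subprocess

p = subprocess.Popen(["python", "dryscrape.py", url],
                     stdout=subprocess.PIPE,
                     preexec_fn=os.setsid)  # child becomes its own process-group leader
result = p.stdout.read()
p.wait()
try:
    # pgid == p.pid because of setsid; this catches any leftover Xvfb/webkit_server
    os.killpg(p.pid, signal.SIGTERM)
except OSError:
    pass  # nothing left to kill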

– Kenneth
  • Ideally, dryscrape should've provided something like dryscrape.stop_xvfb(), but even without it your solution seems to be okay. I am going to try it and see if it works for my use case! – Gaurav Ojha Oct 01 '16 at 03:43

I had the same problem with memory leaking. I solved it by resetting the session after every page view!

A simplified workflow looks like this.

Setting up the server:

dryscrape.start_xvfb()
sess = dryscrape.Session()

Then iterate through the URLs and reset the session after every URL:

for url in urls:
    sess.set_header('user-agent', 'Mozilla/5.0 (Windows NT 6.4; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2225.0 Safari/537.36')
    sess.set_attribute('auto_load_images', False)
    sess.set_timeout(30)
    sess.visit(url)
    response = sess.body()
    sess.reset()
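
If some URLs still time out (as in the question), the same reset-per-URL loop can be wrapped with error handling so that one slow page does not abort the whole run. A sketch; the exception handling is kept broad because the exact error dryscrape raises on a timeout varies by version:

for url in urls:
    # headers, attributes and timeout re-applied as in the loop above
    sess.set_attribute('auto_load_images', False)
    sess.set_timeout(30)
    try:
        sess.visit(url)
        response = sess.body()
    except Exception as exc:  # e.g. a timeout from webkit_server
        print('skipping %s: %s' % (url, exc))
        response = None
    finally:
        sess.reset()  # reset after every URL, as above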

Update

I still encountered the memory leak problem; the better answer is the one provided by @nico.

I ended up abandoning dryscrape altogether and am now using Selenium and PhantomJS. There are still memory leaks, but they are manageable.

– Ernest

Make two scripts like this:

call.py

import os

# read urls.txt and build a list of URLs
urls = open('urls.txt').read().split('\n')
for url in urls:
    if not url:  # skip blank lines
        continue
    print(url)
    os.system("./recive_details.py %s" % url)

recive_details.py

#!/usr/bin/env python
import sys
import dryscrape as d

url = sys.argv[1]
d.start_xvfb()
br = d.Session()
br.visit(url)
# do something with the page here, e.g. print the title
print br.xpath("//title")[0].text()

Always run call.py like this: "python call.py". It will automatically execute the second script for each URL, and the session is killed as soon as that process exits. I tried many other methods, but this one works for me like magic; give it a try.
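
If you also need the child's output back in the parent (as in Kenneth's answer), the same two-script pattern works with subprocess instead of os.system. A sketch, assuming the child is invoked through the Python interpreter:

import subprocess

urls = open('urls.txt').read().split('\n')
for url in urls:
    if not url:
        continue
    # each URL gets a fresh Python process, so webkit_server memory is
    # released between URLs; check_output also returns what the child printed
    output = subprocess.check_output(["python", "recive_details.py", url])
    print(output)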

– NobodyNada

Omitting session.set_attribute('auto_load_images', False) resolved the issue for me as described here. It seems there is a memory leak when images are not loaded.

– Jul3k