
I'm using the Ghost package in my script to scrape a website. Since I have many pages to scrape, Ghost is used many times, about 30 times per page, and I might have hundreds of pages to scrape. I noticed, when running my script, that after about 25 pages I start getting Ghost::Qt::QThread errors, and even before that Ghost seems inconsistent. Basically, Ghost is used to extract a phone number from a simple page that looks like this:

[screenshot of the webpage: the phone number I'm extracting]

I suspect it's about overloading memory, or something like that, but I must admit that I'm new to Python and not skilled enough in programming (I come from the hardware world).

Has anyone encountered this type of problem? I know Ghost has a method called remove_page that should remove the "page" created, but I have tried using it and I think it's not working (or I'm missing something). Here is code where I try using remove_page; after removing, I can still use the object:

from ghost import Ghost

gh = Ghost()
page, page_name = gh.create_page()
gh.remove_page(page)   # I expected this to get rid of the page

After running this and typing page, I would expect there to be no page defined. How do I release resources, delete the page, or even delete the gh object I created?


1 Answer


The current version of Ghost.py (0.2.3) is supposed to have fixed this. However, versions after 0.1.2 have some errors loading certain websites. Running the Ghost.py code in its own Process will fix these memory issues on older versions:

from multiprocessing import Process
from ghost import Ghost

def load_page(url):
    # All the Ghost/Qt objects live only inside this worker process,
    # so their memory is released when the process exits.
    gh = Ghost()
    page, page_name = gh.open(url)

url = 'http://example.com'  # whichever page you want to scrape
p = Process(target=load_page, args=(url,))
p.start()
p.join()

If you need to get data back from the Process, you'll have to look into using a multiprocessing Queue.
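Roughly, that would look like the sketch below. The URL is just a placeholder, and the .content attribute on the returned page is my assumption about what Ghost.py's resource object exposes; put whatever you actually extract on the queue instead:

from multiprocessing import Process, Queue
from ghost import Ghost

def load_page(url, result_queue):
    # Do all the Ghost work inside the worker process, then put the
    # extracted data on the queue so the parent process can read it back.
    gh = Ghost()
    page, page_name = gh.open(url)
    result_queue.put(page.content if page else None)

if __name__ == '__main__':
    q = Queue()
    p = Process(target=load_page, args=('http://example.com', q))
    p.start()
    result = q.get()   # blocks until the worker puts something on the queue
    p.join()
    print(result)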
