I am a beginner in Python. I just wrote a very simple web crawler, and it caused high memory usage every time I ran it. I'm not sure what's wrong with my code; I've spent quite some time on it but can't resolve it.
I intend to use it to capture some job info from the following link: http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=070200%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=06%2C07%2C08%2C09%2C10&keywordtype=2&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9
The crawler extracts the link of each job and generates a job ID from the link. Then it reads the job title from each page via XPath and prints all the info out at the end. Even though there are only 50 links, it makes my computer nearly unresponsive every time before it finishes printing. Below is my code. I just added the headers; they are needed to fetch the page of each job. My environment is Ubuntu 16.04, Python 3.5, PyCharm.
import requests
from lxml import etree
import re
headers = {"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
"Accept-Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.5",
"Connection": "keep-alive",
"Host": "jobs.51job.com",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0"}
def generate_info(url):
    html = requests.get(url, headers=headers)
    html.encoding = 'GBK'
    select = etree.HTML(html.text.encode('utf-8'))
    job_id = re.sub('[^0-9]', '', url)
    job_title = select.xpath('/html/body//h1/text()')
    print(job_id, job_title)

sum_page = 'http://search.51job.com/jobsearch/search_result.php?fromJs=1&jobarea=070200%2C00&district=000000&funtype=0000&industrytype=00&issuedate=9&providesalary=06%2C07%2C08%2C09%2C10&keywordtype=2&curr_page=1&lang=c&stype=1&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&lonlat=0%2C0&radius=-1&ord_field=0&list_type=0&dibiaoid=0&confirmdate=9'
sum_html = requests.get(sum_page)
sum_select = etree.HTML(sum_html.text.encode('utf-8'))
urls = sum_select.xpath('//*[@id="resultList"]/div/p/span/a/@href')
for url in urls:
    generate_info(url)
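
To help narrow down where the memory goes, here is a minimal sketch of how the final loop could be instrumented with the standard-library tracemalloc module (available since Python 3.4; the top-10 cutoff below is arbitrary):

import tracemalloc

tracemalloc.start()

for url in urls:
    generate_info(url)

# Print the ten source lines that allocated the most memory so far.
snapshot = tracemalloc.take_snapshot()
for stat in snapshot.statistics('lineno')[:10]:
    print(stat)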