
I am new to Scrapy and am trying to extract content from a web page, but I am getting lots of extra characters in the output. See the image attached.

How can I update my code to get rid of these characters? I need to extract only the href values from the web page.

[Image: extracted output file]

My code:

class AttractionSpider(CrawlSpider):
    name = "get-webcontent"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    rules = ()

    def create_dirs(self, dir):
        if not os.path.exists(dir):
            os.makedirs(dir)
        else:
            shutil.rmtree(dir)           # removes all the subdirectories!
            os.makedirs(dir)

    def __init__(self, name=None, **kwargs):
        super(AttractionSpider, self).__init__(name, **kwargs)
        self.items_buffer = {}
        self.base_url = "http://quotes.toscrape.com/page/1/"
        from scrapy.conf import settings
        settings.overrides['DOWNLOAD_TIMEOUT'] = 360

    def write_to_file(self, file_name, content_list):
        with open(file_name, 'wb') as fp:
            pickle.dump(content_list, fp)

    def parse(self, response):
        print("Start scraping web content....")
        try:
            hxs = Selector(response)
            links = hxs.xpath('//li//@href').extract()
            with open('test1_href', 'wb') as fp:
                pickle.dump(links, fp)
            if not links:
                log.msg("No data to scrape")
                return
            for link in links:
                v_url = ''.join(link)
                if not v_url:
                    continue
                else:
                    _url = self.base_url + v_url
        except Exception as e:
            log.msg("Parsing failed for URL {%s}" % format(response.request.url))
            raise

    def parse_details(self, response):
        print("Start scraping detailed info....")
        try:
            hxs = Selector(response)
            yield l_venue
        except Exception as e:
            log.msg("Parsing failed for URL {%s}" % format(response.request.url))
            raise
  • I have never seen anything like that LOL. I'm going to try to replicate your code, even though I'm not exactly sure what modules you imported, but I imagine os and of course the relevant modules for the crawl spider, so... here I go – scriptso Oct 05 '17 at 00:41
  • from scrapy.spiders import CrawlSpider, Rule; from scrapy.selector import HtmlXPathSelector; from scrapy.selector import Selector; from scrapy.http import Request; from scrapy.linkextractors import LinkExtractor; from scrapy import log; from scrapy.http import HtmlResponse; import pickle; from scrapy.pipelines.images import ImagesPipeline; from scrapy.exceptions import DropItem; import datetime; import scrapy; from scrapy.contrib.pipeline.images import ImagesPipeline; from scrapy.exceptions import DropItem; from scrapy import Selector – ScrapyDev Oct 05 '17 at 01:08
  • dear God man LOL... Those are all the modules you imported? If you're just trying to get... Just wait, I have a nice big ol' explanation for you... Hopefully I can break it down for you; it's obvious that you know some Python, not questioning that, I just think you definitely need some guidance on how Scrapy works. Since you are doing the tutorial from their official docs, I'm a bit alarmed at how convoluted your code is... I mean, again, I can see that you've done some Python before, but... just give me 15-20 minutes and I'll give you a nice little answer; hopefully I can help you and other people as well – scriptso Oct 05 '17 at 01:36
  • some more code for data extraction: – ScrapyDev Oct 05 '17 at 01:59
  • links = response.css('img').xpath('@src').extract()  # image extract
    with open('imgs', 'wb') as fp: pickle.dump(links, fp)
    links = response.xpath('//a[contains(@href, "image")]/img/@src').extract()
    with open('image', 'wb') as fp: pickle.dump(links, fp)
    # video extract
    with open('video', 'wb') as fp: pickle.dump(links, fp)
    links = hxs.xpath('//a[contains(@href,"pdf")]/text()')  # pdf extract
    with open('a_pdf', 'wb') as fp: pickle.dump(links, fp)
    – ScrapyDev Oct 05 '17 at 02:06
  • oh yeah... just answered, but I'll drop code on how I would do it by replicating your project flow – scriptso Oct 05 '17 at 02:20
  • Please read [Under what circumstances may I add “urgent” or other similar phrases to my question, in order to obtain faster answers?](//meta.stackoverflow.com/q/326569) - the summary is that this is not an ideal way to address volunteers, and is probably counterproductive to obtaining answers. Please refrain from adding this to your questions. – halfer Oct 05 '17 at 10:07
  • Sorry, this is my first time using this platform. Will be careful next time. – ScrapyDev Oct 05 '17 at 13:06

1 Answer


Now I must say... you obviously have some experience with Python programming, congrats, and you're clearly doing the official Scrapy docs tutorial, which is great, but for the life of me I cannot tell, given the code snippet you have provided, exactly what you're trying to accomplish. But that's OK, here are a couple of things:

You are using a Scrapy CrawlSpider. When using a CrawlSpider, the rules set the following behaviour (pagination, if you will) as well as pointing the callback to the function that runs when the appropriate regular expression matches a page, which then kicks off the extraction or itemization. It is absolutely crucial to understand that you cannot use a CrawlSpider without setting the rules, and, equally important, when using a CrawlSpider you cannot override the parse function, because parse is already a native, built-in method the CrawlSpider relies on. Do go ahead and read the docs, or just create a CrawlSpider and see how it breaks when you define parse.

Your code

class AttractionSpider(CrawlSpider):
    name = "get-webcontent"
    start_urls = [
        'http://quotes.toscrape.com/page/1/'
    ]
    rules = ()  # big no no! set rules

How it should look

class AttractionSpider(CrawlSpider):
    name = "get-webcontent"
    start_urls = ['http://quotes.toscrape.com']  # this would be considered a base url

    # regex is our best friend, know it well: basically all pages that
    # follow this pattern page/.* (meaning all following pages, no exception)
    rules = (
        Rule(LinkExtractor(allow=r'/page/.*'), callback='parse_item', follow=True),
    )

Number two: going back to the thing I mentioned about using the parse function with a Scrapy CrawlSpider, you should use "parse_item" instead. I assume that you at least looked over the official docs, but to sum it up, the reason parse cannot be used is that the CrawlSpider already uses parse within its own logic, so by defining parse in a CrawlSpider you are overriding a native method it relies on, which can cause all sorts of bugs and issues.

That's pretty straightforward; I don't think I have to show you a snippet, but feel free to go to the official docs: on the right side where it says "Spiders", scroll down until you hit "CrawlSpider", and it gives some notes with a caution...

To my next point: your initial parse does not have a callback that hands off from parse to parse_details, which leads me to believe that when you perform the crawl you don't go past the first page. Aside from that, you're trying to create a text file (or you're using the os module to write something out, but you're not actually writing anything useful), so I'm quite confused as to why you are using the write function instead of read. A minimal sketch of what that hand-off could look like follows.
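To illustrate the callback point only, here is a minimal sketch, not your actual spider: it uses a plain scrapy.Spider (where overriding parse is fine), the spider name is made up, and I'm reusing the quotes.toscrape.com URL from your code. urljoin and Request with callback are standard Scrapy.

import scrapy

class LinkPassSpider(scrapy.Spider):  # hypothetical name, purely for illustration
    name = "link-pass-example"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for href in response.xpath('//li//@href').extract():
            # urljoin() builds an absolute URL; the callback is what carries the
            # crawl on to the detail page instead of stopping after the first response
            yield scrapy.Request(response.urljoin(href), callback=self.parse_details)

    def parse_details(self, response):
        # only the response URL is yielded here; real extraction would go in this method
        yield {'url': response.url}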

I mean, I myself have on many occasions used an external text or CSV file that contains multiple URLs so I don't have to hard-code them, but you're clearly writing out (or trying to write) to a file, which you said was a pipeline? Now I'm even more confused! The point is, I hope you're well aware that if you are trying to create a file or export of your extracted items, there are already pre-built export formats (CSV, JSON, XML). And as you said in your response to my comment, if indeed you're using a pipeline and an item exporter, in turn you can create your own export format as you wish, but if it's only the response URL that you need, why go through all that hassle?
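For what it's worth, a minimal sketch of leaning on the built-in feed exports instead (the spider name and output file are placeholders; FEED_FORMAT and FEED_URI are Scrapy's long-standing feed settings, with newer versions preferring a FEEDS dict):

import scrapy

class HrefExportSpider(scrapy.Spider):  # hypothetical spider, for illustration only
    name = "href-export-example"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    # standard feed-export settings: every yielded item lands in hrefs.csv
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'hrefs.csv',
    }

    def parse(self, response):
        for href in response.xpath('//li//@href').extract():
            yield {'href': href}  # plain dicts export fine; no pickle involved

The same thing can be had with no settings at all by running scrapy crawl href-export-example -o hrefs.json from the command line.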

My parting words would be: it would serve you well to go over Scrapy's official docs tutorial again, ad nauseam, and to take seriously the importance of settings.py as well as items.py.
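Since the example spider below does from quotes.items import QuotesItem, a plausible items.py to go with it would be something along these lines (the field names are simply the ones the spider assigns; adjust as needed):

# items.py -- item definition matching the fields the spider below assigns
import scrapy

class QuotesItem(scrapy.Item):
    quote = scrapy.Field()
    author = scrapy.Field()
    rUrl = scrapy.Field()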

# -*- coding: utf-8 -*-
import scrapy
import os
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from quotes.items import QuotesItem

class QcrawlSpider(CrawlSpider):
    name = 'qCrawl'
    allowed_domains = ['quotes.toscrape.com']
    start_urls = ['http://quotes.toscrape.com/']

    rules = (
        Rule(LinkExtractor(allow=r'page/.*'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        rurl = response.url
        item = QuotesItem()
        item['quote'] = response.css('span.text::text').extract()
        item['author'] = response.css('small.author::text').extract()
        item['rUrl'] = rurl
        yield item

        # append the response URL to a plain text file alongside the itemized data
        with open(os.path.abspath('') + '_' + "urllisr_" + '.txt', 'a') as a:
            a.write(''.join([rurl, '\n']))

Of course, items.py would be filled out appropriately with the fields you see in the spider, but by including the response URL both as an item field and in the text file I can write it out either with the default Scrapy export methods (CSV etc.) or with one of my own.

In this case it's a simple text file, but one can get pretty crafty; for example, writing things out the same way with the os module I have created m3u playlists from video hosting sites, and you can get fancy with a custom CSV item exporter. Beyond that, using a custom pipeline you can write out a custom format for your CSVs or whatever it is that you wish, as in the sketch below.
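A bare-bones sketch of such a pipeline (the class name and output file are invented for illustration, and it would still need to be enabled under ITEM_PIPELINES in settings.py):

# pipelines.py -- minimal sketch: append each item's URL to a plain text file
class UrlListPipeline(object):  # hypothetical name
    def open_spider(self, spider):
        # plain text this time, not pickle, so the output stays human-readable
        self.outfile = open('url_list.txt', 'a')

    def close_spider(self, spider):
        self.outfile.close()

    def process_item(self, item, spider):
        self.outfile.write(item.get('rUrl', '') + '\n')
        return item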

scriptso
  • Thank you so so much for the help. – ScrapyDev Oct 05 '17 at 12:10
  • First: I am not a coder, at all. Sorry for all the jargon in my code. I am not trying to crawl the website at this point, that's my next step. As a first step I am trying to extract H1, H2, H3, href, img, paragraph text etc. from the webpage and store/display that for one webpage in a text file/Excel spreadsheet, whatever form. And you saw the text file I was able to store LOL. Basically I am doing the Document Object Model and extracting everything – ScrapyDev Oct 05 '17 at 12:17
  • Second: I need to extract the exact TEXT of H1/H2/H3, href, img link, not the extra characters/html tags... but I am receiving all sorts of extra characters and html tags in my text file and I am unable to remove them. Do you think you can help me please? My email is scrapydev1@gmail.com if you can shoot me an email. Thank you for all the help. – ScrapyDev Oct 05 '17 at 12:24
  • line 27 Rule(LinkExtractor(allow=r'/page/.*'), follow=True),callback='parse_item'), ^ SyntaxError: invalid syntax – ScrapyDev Oct 05 '17 at 12:37