
I'm using Scrapy to crawl a site with some odd formatting conventions. The basic idea is that I want all the text and sub-elements of a certain div, EXCEPT a few divs in the middle. Here is the HTML in question:

<div align="center" class="article"><!--wanted-->
    <img src="http://i.imgur.com/12345.jpg" width="500" alt="abcde" title="abcde"><br><br>     
    <div style="text-align:justify"><!--wanted-->
        Sample Text<br><br>Demo: <a href="http://www.example.com/?http://example.com/item/asash/asdas-asfasf-afaf.html" target="_blank">http://example.com/dfa/asfa/aasfa</a><br><br>
        <div class="quote"><!--wanted-->
            http://www.coolfiles.ro/download/kleo13.rar/1098750<br>http://www.ainecreator.com/files/0MKOGM6D/kleo13.rar_links<br>
        </div>
        <br>
        <div align="left"><!--not wanted-->
            <div id="ratig-layer-2249"><!--not wanted-->
                <div class="rating"><!--not wanted-->
                    <ul class="unit-rating">
                        <li class="current-rating" style="width:80%;">80</li>
                        <li><a href="#" title="Bad" class="r1-unit" onclick="doRate('1', '2249'); return false;">1</a></li>
                        <li><a href="#" title="Poor" class="r2-unit" onclick="doRate('2', '2249'); return false;">2</a></li>
                        <li><a href="#" title="Fair" class="r3-unit" onclick="doRate('3', '2249'); return false;">3</a></li>
                        <li><a href="#" title="Good" class="r4-unit" onclick="doRate('4', '2249'); return false;">4</a></li>
                        <li><a href="#" title="Excellent" class="r5-unit" onclick="doRate('5', '2249'); return false;">5</a></li>
                    </ul>
                </div>
                (votes: <span id="vote-num-id-2249">3</span>)
            </div>
        </div>
        <div class="reln"><!--not wanted-->
            <strong>
                <h4>Related News:</h4>
            </strong>
            <li><a href="http://www.example.com/themes/tf/a-b-c-d.html">1</a></li>
            <li><a href="http://www.example.com/plugins/codecanyon/a-b-c-d">2</a></li>
            <li><a href="http://www.example.com/themes/tf/a-b-c-d.html">3</a></li>
            <li><a href="http://www.example.com/plugins/codecanyon/a-b-c-d.html">4</a></li>
            <li><a href="http://www.example.com/plugins/codecanyon/a-b-c-d.html">5</a></li>
        </div>
    </div>
</div>

The final output should look like this:

<div align="center" class="article"><!--wanted-->
    <img src="http://i.imgur.com/12345.jpg" width="500" alt="abcde" title="abcde"><br><br>     
    <div style="text-align:justify"><!--wanted-->
        Sample Text<br><br>Demo: <a href="http://www.example.com/?http://example.com/item/asash/asdas-asfasf-afaf.html" target="_blank">http://example.com/dfa/asfa/aasfa</a><br><br>
        <div class="quote"><!--wanted-->
            http://www.coolfiles.ro/download/kleo13.rar/1098750<br>http://www.ainecreator.com/files/0MKOGM6D/kleo13.rar_links<br>
        </div>
        <br>
    </div>
</div>

Here is my Scrapy code; please suggest what to add to this script:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from isbullshit.items import IsBullshitItem


class IsBullshitSpider(CrawlSpider):
    """ General configuration of the Crawl Spider """
    name = 'isbullshitwp'
    start_urls = ['http://example.com/themes'] # urls from which the spider will start crawling
    rules = [Rule(SgmlLinkExtractor(allow=[r'page/\d+']), follow=True), 
        # r'page/\d+' : regular expression for http://example.com/page/X URLs
        Rule(SgmlLinkExtractor(allow=[r'\w+']), callback='parse_blogpost')]
        # r'\w+' : regular expression matching the individual post URLs

    def parse_blogpost(self, response):
        hxs = HtmlXPathSelector(response)
        item = IsBullshitItem()
        item['title'] = hxs.select('//span[@class="storytitle"]/text()').extract()[0]
        item['article_html'] = hxs.select("//div[@class='article']").extract()[0]

        return item

Here are the XPath expressions I experimented with that did not produce the desired results:

item['article_html'] = hxs.select("//div[@class='article']").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/following::node() [not(preceding::div[@class='reln']) and not(@class='reln')]").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/div[@class='reln']/preceding-sibling::node()[preceding-sibling::div[@class='quote']]").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/following::node() [not(preceding::div[@class='reln'])]").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/div[@class='quote']/*[not(self::div[@class='reln'])]").extract()[0]
item['article_html'] = hxs.select("//div[@class='article']/*[(self::name()='reln'])]").extract()[0]

Thanks in advance...

vrtech77
  • XPath doesn't work that way. Either use XSLT templates or just select the paths you need inside the `div.article > div`, concatenate them, and wrap the whole string with `div.article > div` (a sketch of this idea follows these comments). – Artjom B. May 10 '14 at 10:46
  • I think your idea of concatenating and wrapping the whole string is useful. It would be great if you could provide an edit of my above Scrapy code with your idea; I can't do what you said since I am new to Scrapy. Thank you. – vrtech77 May 11 '14 at 00:47
  • There is a [solution](http://stackoverflow.com/questions/12179821/scrapy-remove-elements-from-an-xpath-selector) but I don't know how to implement it in my scenario. – vrtech77 May 11 '14 at 01:18
  • Have you tried it? What is the error you are experiencing? SO is not there to offload your work; we want to help with interesting questions. It seems that your question is actually answered in the link. I suggest you try to implement it yourself, and if you are unsuccessful, post what you tried. – Artjom B. May 11 '14 at 08:33
  • I have added some XPath expressions that I experimented with and was unable to get the desired results. It is not that I don't want to learn; I could not understand the solution mentioned in my comment as it is not indented properly. Thank you for the comment. @artjom-b – vrtech77 May 13 '14 at 07:06
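
For reference, here is a minimal, untested sketch of the concatenate-and-rewrap approach Artjom B. describes: select only the wanted child nodes of `div.article` and of its inner div, then rebuild the containers around them. It uses the same old `HtmlXPathSelector` API as the question, and the wrapper markup is hard-coded from the HTML above, so treat it as an illustration rather than a drop-in fix:

def parse_blogpost(self, response):
    hxs = HtmlXPathSelector(response)
    item = IsBullshitItem()
    item['title'] = hxs.select('//span[@class="storytitle"]/text()').extract()[0]

    # Top-level content of div.article, minus its child div.
    outer = hxs.select("//div[@class='article']/node()[not(self::div)]").extract()
    # Content of the inner div, minus the two unwanted blocks.
    inner = hxs.select(
        "//div[@class='article']/div/node()"
        "[not(self::div[@align='left' or @class='reln'])]"
    ).extract()

    # Re-wrap the fragments in containers copied from the original markup.
    item['article_html'] = (
        '<div align="center" class="article">' + ''.join(outer) +
        '<div style="text-align:justify">' + ''.join(inner) +
        '</div></div>'
    )
    return item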

1 Answer


It seems you can't do this with Scrapy's selectors alone. I have my own function to remove specific nodes (with their children) from the parsed tree:

def removeNode(context, nodeToRemove):
    # context is a Scrapy Selector; nodeToRemove is a list of CSS selector
    # strings identifying the elements to drop.
    for element in nodeToRemove:
        contentToRemove = context.css(element)

        if contentToRemove:
            # .root is the underlying lxml element; detaching it from its
            # parent removes it (and all its children) from the tree.
            contentToRemove = contentToRemove[0].root
            contentToRemove.getparent().remove(contentToRemove)

    # Extract the remaining HTML of the selection.
    return context.extract()
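
For example (my own sketch, not part of the original answer), the function could be wired into the question's spider like this, using the newer `response.css` API that the function's `.root` attribute implies; the CSS selector strings are assumptions based on the question's HTML:

def parse_blogpost(self, response):
    item = IsBullshitItem()
    item['title'] = response.css('span.storytitle::text').extract_first()
    article = response.css('div.article')[0]
    # Drop the rating wrapper and the related-news block before extracting.
    item['article_html'] = removeNode(article, ['div[align="left"]', 'div.reln'])
    return item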

Hope it helps.

Jeromearsene
  • Thank you. The core of the function works, but the name `nodeToRemove` is confusing: is it a list? If not, there is no way to iterate over it; and if it is a selector, how does it work in `.css()`? Also, rather than taking only index [0], looping over all matches would be better. – bl79 Jul 11 '18 at 04:24
  • But it seems it also removes the text that follows the removed node. And it has no equivalent of the lxml parser's `strip_elements(with_tail=False)` to solve this problem: https://stackoverflow.com/a/41359368/6357045 – bl79 Jul 11 '18 at 05:56
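
A possible fix for the lost trailing text bl79 describes (again my own sketch, not from the answer): in lxml, text that follows an element is stored in that element's `.tail`, so splice the tail onto the preceding node before detaching the element:

def removeNodeKeepTail(element):
    # element is the underlying lxml element, e.g. selector.css(css)[0].root
    parent = element.getparent()
    if element.tail:
        prev = element.getprevious()
        if prev is not None:
            # Move the tail text onto the previous sibling...
            prev.tail = (prev.tail or '') + element.tail
        else:
            # ...or onto the parent's text if there is no previous sibling.
            parent.text = (parent.text or '') + element.tail
    parent.remove(element)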