Strip \n \t \r in scrapy

Question

I'm trying to strip \r, \n and \t characters in a Scrapy spider and then write the results to a JSON file.

I have a "description" object which is full of new lines, and it doesn't do what I want: matching each description to a title.

I tried map(unicode.strip), but it doesn't really work. Being new to Scrapy, I don't know whether there's a simpler way, or how map with unicode.strip really works.

This is my code:

def parse(self, response):
    for sel in response.xpath('//div[@class="d-grid-main"]'):
        item = xItem()
        item['TITLE'] = sel.xpath('xpath').extract()
        item['DESCRIPTION'] = map(unicode.strip, sel.xpath('//p[@class="class-name"]/text()').extract())

I tried also with:

item['DESCRIPTION'] = str(sel.xpath('//p[@class="class-name"]/text()').extract()).strip()

But it raised an error. What's the best way?

Hello, what do you mean by "it doesn't really work"? `strip()` only considers leading and trailing characters, so if you want to strip anything that's inside the string you need some other way. `import re` and `re.sub('[\r\n\t]', '', 'Hel\nlo\r!')` could help if that's your issue. — Quentin Pradet, Feb 09 '16 at 09:48
I would suggest to checkout `ItemLoader`s http://doc.scrapy.org/en/latest/topics/loaders.html which allow you to manage input and output of your `Item`s — Granitosaurus, Feb 09 '16 at 15:23
QuentinPradet thanks, in fact paul's answer was good, I didn't know that. And Granitosaurus I'll study that thanks — Lara M., Feb 10 '16 at 11:15

paul trmbrth · Accepted Answer · 2016-02-09T10:18:34.803

unicode.strip only deals with whitespace characters at the beginning and end of strings

Return a copy of the string with the leading and trailing characters removed.

not with \n, \r, or \t in the middle.

You can either use a custom method to remove those characters inside the string (using the regular expression module), or even use XPath's normalize-space()

returns the argument string with whitespace normalized by stripping leading and trailing whitespace and replacing sequences of whitespace characters by a single space.

Example python shell session:

>>> text='''<html>
... <body>
... <div class="d-grid-main">
... <p class="class-name">
... 
...  This is some text,
...  with some newlines \r
...  and some \t tabs \t too;
... 
... <a href="http://example.com"> and a link too
...  </a>
... 
... I think we're done here
... 
... </p>
... </div>
... </body>
... </html>'''
>>> import scrapy
>>> response = scrapy.Selector(text=text)
>>> response.xpath('//div[@class="d-grid-main"]')
[<Selector xpath='//div[@class="d-grid-main"]' data=u'<div class="d-grid-main">\n<p class="clas'>]
>>> div = response.xpath('//div[@class="d-grid-main"]')[0]
>>> 
>>> # you'll want to use relative XPath expressions, starting with "./"
>>> div.xpath('.//p[@class="class-name"]/text()').extract()
[u'\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n',
 u"\n\nI think we're done here\n\n"]
>>> 
>>> # only leading and trailing whitespace is removed by strip()
>>> map(unicode.strip, div.xpath('.//p[@class="class-name"]/text()').extract())
[u'This is some text,\n with some newlines \r\n and some \t tabs \t too;', u"I think we're done here"]
>>> 
>>> # normalize-space() will get you a single string on the whole element
>>> div.xpath('normalize-space(.//p[@class="class-name"])').extract()
[u"This is some text, with some newlines and some tabs too; and a link too I think we're done here"]
>>>
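The session above covers strip() and normalize-space() but not the "custom method" based on the regular expression module. A minimal sketch of what such a helper might look like follows; the name `remove_control_whitespace` and its `replacement` parameter are made up for illustration and are not part of Scrapy.

```python
import re

# Matches one or more carriage returns, newlines, or tabs anywhere in a string.
_CONTROL_WS = re.compile(r"[\r\n\t]+")


def remove_control_whitespace(text, replacement=""):
    """Hypothetical helper: replace runs of \\r, \\n, \\t, then trim both ends."""
    return _CONTROL_WS.sub(replacement, text).strip()


raw = "\n\n This is some text,\n with some newlines \r\n and some \t tabs \t too;\n\n"
cleaned = remove_control_whitespace(raw)
print(repr(cleaned))
```

Unlike normalize-space(), this leaves runs of ordinary spaces untouched, so double spaces can remain where a newline or tab sat between two spaces. Passing `replacement=" "` would keep words apart when a newline is the only separator.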

I want to normalize-space the whole body. `response.xpath('.').extract()` works, but with `response.xpath('normalize-space(.)').extract()` the HTML tags are removed. Why? — , Jun 22 '17 at 10:31
@Baks, [`normalize-space(.)`](https://www.w3.org/TR/xpath/#function-normalize-space) returns the space-normalized [string value](https://www.w3.org/TR/xpath/#element-nodes) of the context node, which is a concatenation of descendant text nodes: _"The string-value of an element node is the concatenation of the string-values of all text node descendants of the element node in document order."_ — paul trmbrth, Aug 18 '17 at 09:57
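To make the "string value" idea concrete, here is a rough standard-library illustration. It uses `xml.etree.ElementTree` rather than Scrapy, and `" ".join(s.split())` only approximates normalize-space(), so treat it as a sketch of the concept and not as what Scrapy runs internally.

```python
import xml.etree.ElementTree as ET

snippet = """<p class="class-name">
 This is some text,
 with some newlines
<a href="http://example.com"> and a link too
 </a>
I think we're done here
</p>"""

p = ET.fromstring(snippet)

# String value: every descendant text node concatenated in document order.
# The <a> tag itself contributes nothing, only the text inside it.
string_value = "".join(p.itertext())

# Rough equivalent of normalize-space(): trim the ends and collapse
# each run of whitespace into a single space.
normalized = " ".join(string_value.split())
print(normalized)
```

The tags disappear because only text nodes take part in the string value; the markup was never part of it.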

score 7 · Answer 2 · answered Sep 23 '17 at 20:30

I'm new to Python and Scrapy. I had a similar issue today and solved it with w3lib.html.replace_escape_chars. I set it as the default input processor for my item loader and it works without any issues. You can also bind it to a specific scrapy.Field(), and the good thing is that it works with CSS selectors and CSV feed exports:

from itemloaders.processors import MapCompose  # scrapy.loader.processors in older Scrapy releases
from w3lib.html import replace_escape_chars
yourloader.default_input_processor = MapCompose(replace_escape_chars)
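For readers without w3lib at hand, here is a rough standard-library stand-in for what the processor does to each extracted string. As far as I know, `replace_escape_chars` removes \n, \t and \r by default. The function `drop_escape_chars` below is hypothetical and is not part of w3lib or Scrapy.

```python
def drop_escape_chars(text, which_ones=("\n", "\t", "\r"), replace_by=""):
    """Hypothetical stand-in: replace each listed character with replace_by."""
    for char in which_ones:
        text = text.replace(char, replace_by)
    return text


# A loader's input processor is applied to each extracted value in turn;
# a list comprehension imitates that here.
extracted = ["\n\t Do you like shopping?\r\n", "Yes,\tI'm a shopaholic."]
cleaned = [drop_escape_chars(value) for value in extracted]
print(cleaned)
```

Note that a tab between two words is simply deleted, which glues the words together; replacing with a space instead of an empty string avoids that.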

score 3 · Answer 3 · edited May 23 '17 at 12:02

As paul trmbrth suggests in his answer,

div.xpath('normalize-space(.//p[@class="class-name"])').extract()

is likely to be what you want. However, normalize-space also condenses whitespace contained within the string into a single space. If you want only to remove \r, \n, and \t without disturbing the other whitespace you can use translate() to remove characters.

trans_table = {ord(c): None for c in u'\r\n\t'}
item['DESCRIPTION'] = ' '.join(s.translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())

This will still leave leading and trailing whitespace that is not in the set \r, \n, or \t. If you also want to be rid of that just insert a call to strip():

item['DESCRIPTION'] = ' '.join(s.strip().translate(trans_table) for s in sel.xpath('//p[@class="class-name"]/text()').extract())
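The snippets above date from the Python 2 era. A Python 3 sketch of the same translate() idea is below; the `fragments` list is made-up sample data standing in for the result of `.extract()`.

```python
# Map each unwanted character to None so translate() deletes it.
trans_table = str.maketrans("", "", "\r\n\t")

# Made-up sample data standing in for extracted text nodes.
fragments = [
    "\n\n This is  some text,\n with tabs \t too;\n\n",
    "\n\nI think we're done here\n\n",
]

# translate() deletes only \r, \n and \t; runs of ordinary spaces are kept.
kept_spaces = " ".join(s.translate(trans_table) for s in fragments)

# Adding strip() also trims the leading and trailing spaces of each fragment.
trimmed = " ".join(s.strip().translate(trans_table) for s in fragments)

print(repr(kept_spaces))
print(repr(trimmed))
```

The double space in "This is  some" survives in both results, which is the difference from normalize-space().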

Perfect. I never knew about this and it solved all my whitespace issues without regexes. — Echelon, Aug 08 '17 at 11:03
div.xpath('normalize-space(.//p[@class="class-name"])').extract() worked for me, thanks. — Janib Soomro, Feb 13 '18 at 09:46

score 1 · Answer 4 · answered Feb 09 '20 at 17:39

The simplest example to extract price from alibris.com is

response.xpath('normalize-space(//td[@class="price"]//p)').get()

answered by user1994

Ryan · Answer 5 · 2020-03-10T07:05:49.617

When I used Scrapy to crawl a web page, I ran into the same problem. I have two ways to solve it. The first is to replace the characters (the code below uses re.sub()). The result of response.xpath is a list, but replacement works on strings, so I loop over the list, remove '\n' and '\t' from each item, and append the result to a new list.

import re
test_string =["\n\t\t", "\n\t\t\n\t\t\n\t\t\t\t\t", "\n", "\n", "\n", "\n", "Do you like shopping?", "\n", "Yes, I\u2019m a shopaholic.", "\n", "What do you usually shop for?", "\n", "I usually shop for clothes. I\u2019m a big fashion fan.", "\n", "Where do you go shopping?", "\n", "At some fashion boutiques in my neighborhood.", "\n", "Are there many shops in your neighborhood?", "\n", "Yes. My area is the city center, so I have many choices of where to shop.", "\n", "Do you spend much money on shopping?", "\n", "Yes and I\u2019m usually broke at the end of the month.", "\n", "\n\n\n", "\n", "\t\t\t\t", "\n\t\t\t\n\t\t\t", "\n\n\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t"]
print(test_string)
# remove \t and \n
a = re.compile(r'(\t)+')
b = re.compile(r'(\n)+')
text = []
for n in test_string:
    n = a.sub('',n)
    n = b.sub('',n)
    text.append(n)
print(text)
# remove all empty strings
while '' in text:
    text.remove('')
print(text)

The second method uses map() with strip(). map() processes the list directly and the result keeps the list format. The type is named unicode in Python 2 and str in Python 3, as follows:

text = list(map(str.strip, test_string))
print(text)

strip() only removes \n, \t and \r from the beginning and end of a string, not from the middle, unlike the replacement approach above.
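A compact way to see the difference between the two methods, using a shortened, made-up version of the list above with one tab placed inside a sentence:

```python
# Shortened sample list; the tab inside the last sentence is deliberate.
test_string = ["\n\t\t", "Do you like shopping?", "\n", "Yes, I\u2019m a\tshopaholic.\n", "\t\t\t\t"]

# replace() works on one string at a time, so apply it to each list item.
replaced = [s.replace("\n", "").replace("\t", "") for s in test_string]

# strip() only trims the ends; the tab inside the sentence survives.
stripped = [s.strip() for s in test_string]

# Drop the items that ended up empty.
replaced = [s for s in replaced if s]
stripped = [s for s in stripped if s]

print(replaced)
print(stripped)
```

Deleting the inner tab joins "a" and "shopaholic" into one word, so replacing with a space may be the safer choice for text meant to be read.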

score 0 · Answer 6 · answered Nov 25 '20 at 05:00

If you want to keep a list instead of a single joined string, there is no need for extra steps; simply call getall() instead of get():

response.xpath('normalize-space(.//td[@class="price"]/text())').getall()

Also, you should add the text() at the end.
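One caveat worth checking: in XPath 1.0, a string function given a node-set uses only the first node in document order, so normalize-space() over several matches yields a single string even with getall(). To get one cleaned entry per node, each node has to be normalized separately. In Scrapy that would presumably be a loop such as `[td.xpath('normalize-space(.)').get() for td in response.xpath('//td[@class="price"]')]`. The sketch below shows the same per-node idea with the standard library only, on made-up markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with two price cells.
row = ET.fromstring(
    "<tr>"
    "<td class='price'>\n\t $12.50 \n</td>"
    "<td class='price'>\n\t $8.00 \n</td>"
    "</tr>"
)


def normalize_space(text):
    """Rough equivalent of XPath normalize-space() for one string."""
    return " ".join(text.split())


# Normalize each matching node on its own to keep one entry per node.
prices = [
    normalize_space("".join(td.itertext()))
    for td in row.findall(".//td[@class='price']")
]
print(prices)
```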

Hope it helps anybody!

score 0 · Answer 7 · answered Oct 06 '21 at 11:34

You can try using a CSS selector combined with get().strip(); it works for me.

answered by Hao


score 0 · Answer 8 · edited Oct 29 '22 at 20:47

str(i.css("p::text")[1].extract()).strip()

edited by blackgreen

answered Oct 29 '22 at 19:26 by Suny Rajput
