When I used Scrapy to crawl a web page, I ran into the same problem. I have two ways to solve it. The first is to replace the unwanted characters. The data extracted from response.xpath comes back as a list, but replacement only works on one string at a time, so I fetch each item of the list with a for loop, remove '\n' and '\t' from it (here with re.sub), and then append it to a new list.
import re
test_string =["\n\t\t", "\n\t\t\n\t\t\n\t\t\t\t\t", "\n", "\n", "\n", "\n", "Do you like shopping?", "\n", "Yes, I\u2019m a shopaholic.", "\n", "What do you usually shop for?", "\n", "I usually shop for clothes. I\u2019m a big fashion fan.", "\n", "Where do you go shopping?", "\n", "At some fashion boutiques in my neighborhood.", "\n", "Are there many shops in your neighborhood?", "\n", "Yes. My area is the city center, so I have many choices of where to shop.", "\n", "Do you spend much money on shopping?", "\n", "Yes and I\u2019m usually broke at the end of the month.", "\n", "\n\n\n", "\n", "\t\t\t\t", "\n\t\t\t\n\t\t\t", "\n\n\t\t\t\n\t\t\t\t\t\t\t\t\t\t\t\t"]
print(test_string)
# remove \t \n
a = re.compile(r'(\t)+')
b = re.compile(r'(\n)+')
text = []
for n in test_string:
    n = a.sub('', n)  # drop every run of tabs
    n = b.sub('', n)  # drop every run of newlines
    text.append(n)
print(text)
# remove all ''
while '' in text:
    text.remove('')
print(text)
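For comparison, here is a minimal sketch of the same cleanup written with str.replace() itself instead of the re module. The list raw_items is a shortened, hypothetical stand-in for the extracted data, and the name cleaned is only for illustration.

```python
# Hypothetical sample standing in for the list extracted from the page
raw_items = ["\n\t\t", "Do you like shopping?", "\n", "Yes, I\u2019m a shopaholic.\n\t"]

cleaned = []
for item in raw_items:
    # str.replace works on a single string, so each item is handled in turn
    item = item.replace("\n", "").replace("\t", "")
    if item:  # skip items that contained nothing but \n and \t
        cleaned.append(item)

print(cleaned)
```

Checking `if item` inside the loop avoids the separate pass that removes the empty strings afterwards.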
The second method uses map() and strip(). map() applies the function to every item of the list, and list() turns the result back into a list. In Python 2 the items are of type 'unicode', so you would use unicode.strip there; in Python 3 the type is 'str', as follows:
text = list(map(str.strip, test_string))
print(text)
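The strip function only removes whitespace such as \n, \t and \r from the beginning and end of the string, not from the middle of the string. That is different from the replacement in the first method, which removes these characters wherever they appear.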
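A small sketch of that difference, using a made-up list called samples (not the data from the page). It also shows that strip() leaves empty strings behind for items that were whitespace only, so they still need to be filtered out if you do not want them.

```python
# Hypothetical input: two whitespace-only items and one with \n\t in the middle
samples = ["\n\t\t", "\tline one\n\tline two\n", "\n"]

# strip() trims only the ends; the "\n\t" between the two lines survives
stripped = [s.strip() for s in samples]

# whitespace-only items become '' after strip(), so drop them explicitly
non_empty = [s for s in stripped if s]

# replace() removes the characters everywhere, including the middle
replaced = [s.replace("\n", "").replace("\t", "") for s in samples]

print(stripped)
print(non_empty)
print(replaced)
```

Which one to use depends on whether the line breaks inside an item carry meaning for you.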