
I am scraping fields from http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=15844. All fields are returned correctly on the console, with the usual HTML tags. I need to write these fields to a CSV file (CsvItemExporter). If I store the HTML response in a temporary variable and apply the conversion in a second step, when assigning to the item field, I get a separate set of error messages.

I tried the BeautifulSoup get_text and html2text solutions from "Is it possible that Scrapy to get plain text from raw html data directly instead of using xPath selectors?" and "How can I get all the plain text from a website with Scrapy?". Those solutions print correctly but fail when the result is assigned to the respective item fields.

Any converter operation on the response (converter(response + extract)) either raises errors such as "str object has no attribute 'get_text'" (html2text) or returns text with seemingly random \r\n sequences inserted (BeautifulSoup). I suspect the \r\n sequences come from hard carriage returns in the original text, which the original author may have inserted to keep the text aligned. How do I get around this problem? I am using Python 2.7 on 32-bit Windows.
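To make the goal concrete, here is a minimal sketch of the kind of cleanup I am after. It assumes the value has already been pulled out with extract(), which returns a list of strings rather than a parsed object (a plain string has no get_text method). The helper name clean_field and the regex-based tag stripping are my own illustration, not part of Scrapy, BeautifulSoup, or html2text, and the regex is only adequate for simple fragments.

```python
import re

# Crude tag stripper; an assumption for illustration, fine only for simple fragments.
TAG_RE = re.compile(r"<[^>]+>")


def clean_field(value):
    """Return one whitespace-normalized string from an extract() result.

    `value` may be a list of strings (as returned by extract()) or a single string.
    """
    if isinstance(value, (list, tuple)):
        value = " ".join(value)
    # Replace any remaining tags with a space so adjacent words do not fuse.
    text = TAG_RE.sub(" ", value)
    # split() with no argument splits on any whitespace run, including \r\n,
    # so joining with a single space removes the hard carriage returns.
    return " ".join(text.split())


raw = [u"(a) whether the Government has notified a compre-\r\n"
       u"hensive regulatory framework for Television Rating\r\nAgency;"]
print(clean_field(raw))
```

In the spider, the idea would be to assign clean_field(...) of the extracted value to the item field before the exporter sees it. Note that this only collapses whitespace; a word hyphenated across a line break in the source (such as "compre- hensive") is left as it is.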

Pradeep
  • Please post the HTML of the specific fields with which you're getting errors. – Vikas Ojha Jun 15 '15 at 13:56
  • @Vikas: here is the relevant log: DEBUG: Scraped from <200 http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=15944> {'q_date': [u'30.04.2015'], 'q_house': u'LOK SABHA', 'q_main': u'(a) whether a huge amount is outstanding from various State Electricity Boards and power companies against \r\nthe coal supplied to them by Coal India Ltd. and its subsidiary companies;\r\n\r\n', 'q_ministry': [u'COAL'], 'q_name': [u'Poddar Smt. Aparupa'], 'q_no': [u'6107'], 'q_subject': [u'OUTSTANDING DUES '], 'q_type': [u'UNSTARRED ']} NOTE the \r\n in field q_main. – Pradeep Jun 15 '15 at 14:36
  • Here is the log for another page: Scraped from <200 http://164.100.47.132/LssNew/psearch/QResult16.aspx?qref=15844> {'q_date': [u'08.05.2015'], 'q_house': u'LOK SABHA', 'q_main': u'(a) whether the Government has notified a compre-\r\nhensive regulatory framework for Television Rating\r\nAgency;',} The log ends there, and here is the source HTML (text): (a) whether the Government has notified a compre- hensive regulatory framework for Television Rating Agency; This confirms my suspicion that there are hard carriage returns in the source text (badly formed HTML). How should I handle this? – Pradeep Jun 15 '15 at 14:54
  • (a) whether the Government has notified a compre- hensive regulatory framework for Television Rating Agency; – Pradeep Jun 15 '15 at 15:00
