
I made a pipeline. In `settings.py` I have:

    PARSE = 'api.parse.com'
    PORT = 443

However, I can't find the right way to POST the data to Parse: every time it runs, it creates undefined objects in my Parse DB.

    import httplib
    import json

    from scrapy import log
    from scrapy.exceptions import DropItem
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()

    class Newscrawlbotv01Pipeline(object):
        def process_item(self, item, spider):
            for data in item:
                if not data:
                    raise DropItem("Missing data!")
            connection = httplib.HTTPSConnection(
                settings['PARSE'],
                settings['PORT']
            )
            connection.connect()
            connection.request('POST', '/1/classes/articlulos', json.dumps({item}), {
                "X-Parse-Application-Id": "XXXXXXXXXXXXXXXX",
                "X-Parse-REST-API-Key": "XXXXXXXXXXXXXXXXXXX",
                "Content-Type": "application/json"
            })
            log.msg("Question added to PARSE !", level=log.DEBUG, spider=spider)
            return item
            #self.collection.update({'url': item['url']}, dict(item), upsert=True)

Example of the error:

    2016-03-16 20:13:19 [scrapy] ERROR: Error processing {'image': 'http://eedl.eodi.org/wp-content/uploads/sites/3/2016/01/Figaro.png',
 'language': 'FR',
 'publishedDate': u'2016-03-16T18:52:24+01:00',
 'publisher': 'Le Figaro',
 'theme': 'Actualites',
 'title': u'Interpellations Paris: \xable niveau de menace reste tr\xe8s \xe9lev\xe9\xbb selon Hollande',
 'url': u'http://www.lefigaro.fr/flash-actu/2016/03/16/97001-20160316FILWWW00315-interpellations-paris-la-menace-reste-tres-elevee-selon-hollande.php'}
Traceback (most recent call last):
  File "c:\python27\lib\site-packages\twisted\internet\defer.py", line 588, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "C:\Users\simon\Documents\NewsSwipe\PROTOTYPE\v0.1\NewsCrawlBotV0_1\NewsCrawlBotV0_1\pipelines.py", line 49, in process_item
    connection.request('POST', '/1/classes/articlulos', json.dumps({data}), {
  File "c:\python27\lib\json\__init__.py", line 243, in dumps
    return _default_encoder.encode(obj)
  File "c:\python27\lib\json\encoder.py", line 207, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "c:\python27\lib\json\encoder.py", line 270, in iterencode
    return _iterencode(o, 0)
  File "c:\python27\lib\json\encoder.py", line 184, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: set(['theme']) is not JSON serializable
  • now it is a duplicate with http://stackoverflow.com/questions/36045159/scrapy-pipeline-to-parse, maybe you should accept the answer of creating a pipeline, and then we can continue your pipeline on your other question – eLRuLL Mar 16 '16 at 21:29
  • I'm sorry but how you accept the answer ? I'm new in StackOverFlow. Oh wait I get it – Thomas Simonini Mar 16 '16 at 21:32
  • I think you did, please remember to leave the question as it was before. – eLRuLL Mar 16 '16 at 21:34

2 Answers


You need to use an Item Pipeline, which processes every scraped item in its `process_item` method; there you can do whatever you want with the item.
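Note that the traceback in the question points at the concrete bug: `json.dumps({item})` wraps the item in a *set literal*, and sets are not JSON serializable. A minimal sketch of the serialization fix (plain Python, with hypothetical item data standing in for a populated Scrapy item):

```python
import json

# Hypothetical scraped item; a real Scrapy Item behaves like a dict here.
item = {"title": "Example article", "publisher": "Le Figaro"}

# json.dumps({item}) would raise TypeError: {item} builds a set, not a dict.
# Convert the item to a plain dict so it serializes as a JSON object:
payload = json.dumps(dict(item), sort_keys=True)
print(payload)  # {"publisher": "Le Figaro", "title": "Example article"}
```

That JSON string is what should be passed as the request body to the Parse REST endpoint.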

– eLRuLL

Scrapy has a built-in feed exporter for JSON files; all you need to do is add

-o example.json

to your scrapy command line. See the docs here.
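For example, assuming a spider named `myspider` (the name is illustrative):

```shell
# Run the spider and export all scraped items to a JSON file
scrapy crawl myspider -o example.json
```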

– Steve