I have a script that reads the entries of an RSS feed and stores each entry as a JSON document in a CouchDB database.

The interesting part of my code looks something like this:

Feed = namedtuple('Feed', ['name', 'url'])

couch = couchdb.Server(COUCH_HOST)
couch.resource.credentials = (COUCH_USER, COUCH_PASS)

db = couch['raw_entries']

for feed in map(Feed._make, csv.reader(open("feeds.csv", "rb"))):
    d = feedparser.parse(feed.url)
    for item in d.entries:
        db.save(item)

When I run that code, I get the following error from the db.save(item) call:

AttributeError: object has no attribute 'read'

OK, so I then did a little debugging...

for feed in map(Feed._make, csv.reader(open("feeds.csv", "rb"))):
    d = feedparser.parse(feed.url)
    for item in d.entries:
        print(type(item))

This prints <class 'feedparser.FeedParserDict'> -- ahh, so feedparser is using its own dict type... well, what if I explicitly convert it to a dict?

for feed in map(Feed._make, csv.reader(open("feeds.csv", "rb"))):
    d = feedparser.parse(feed.url)
    for item in d.entries:
        db.save(dict(item))

Traceback (most recent call last):
  File "./feedchomper.py", line 32, in <module>
    db.save(dict(item))
  File "/home/dealpref/lib/python2.7/couchdb/client.py", line 407, in save
    _, _, data = func(body=doc, **options)
  File "/home/dealpref/lib/python2.7/couchdb/http.py", line 399, in post_json
    status, headers, data = self.post(*a, **k)
  File "/home/dealpref/lib/python2.7/couchdb/http.py", line 381, in post
    **params)
  File "/home/dealpref/lib/python2.7/couchdb/http.py", line 419, in _request
    credentials=self.credentials)
  File "/home/dealpref/lib/python2.7/couchdb/http.py", line 239, in request
    resp = _try_request_with_retries(iter(self.retry_delays))
  File "/home/dealpref/lib/python2.7/couchdb/http.py", line 196, in _try_request_with_retries
    return _try_request()
  File "/home/dealpref/lib/python2.7/couchdb/http.py", line 222, in _try_request
    chunk = body.read(CHUNK_SIZE)
AttributeError: 'dict' object has no attribute 'read'

w-what? That doesn't make sense, because the following works just fine and the type is still dict:

some_dict = dict({'foo': 'bar'})
print(type(some_dict))
db.save(some_dict)

What am I missing here?

ashgromnies
  • Can you post the stack trace for these errors? It's possible that the error is somewhere deeper in the CouchDB module. It's true that `dict` objects don't have a `read()` method, but that could be a red herring. – kindall Mar 31 '11 at 20:16
  • @kindall -- I posted the whole stack trace... it looks like CouchDB is trying to do a chunked upload for some reason (maybe because the dict is large?). However, I can't reproduce the behavior with a dict I construct by hand (that is, it saves fine if I write it out by hand...). – ashgromnies Mar 31 '11 at 20:21
  • Yeah, it seems to think your dict is a file for some reason. Very odd. – kindall Mar 31 '11 at 20:57
  • Very curious, @kindall -- stepping through I do see that for a save like `db.save({'a': 'b'})` it does the simple non-chunked conn.sock.sendall(body) – ashgromnies Mar 31 '11 at 21:09

3 Answers

I found a workaround: serialize the structure to JSON, then load it back into a Python dict and pass that to CouchDB, which then serializes it to JSON again to save it (yeah, weird and not ideal, but it works).

I had to write a custom serializer function for dumps because the repr of a time.struct_time can't be eval'd.

Source: http://diveintopython3.org/serializing.html

Code:

#!/usr/bin/env python2.7

from collections import namedtuple
import csv
import json
import time

import feedparser
import couchdb

def to_json(python_object):
    if isinstance(python_object, time.struct_time):
        return {'__class__': 'time.asctime',
                '__value__': time.asctime(python_object)}

    raise TypeError(repr(python_object) + ' is not JSON serializable')

Feed = namedtuple('Feed', ['name', 'url'])

COUCH_HOST = 'http://mycouch.com'
COUCH_USER = 'user'
COUCH_PASS = 'pass'

couch = couchdb.Server(COUCH_HOST)
couch.resource.credentials = (COUCH_USER, COUCH_PASS)

db = couch['raw_entries']

for feed in map(Feed._make, csv.reader(open("feeds.csv", "rb"))):
    d = feedparser.parse(feed.url)
    for item in d.entries:
        j = json.dumps(item, default=to_json)
        db.save(json.loads(j))
ashgromnies

This was answered on the mailing list, but basically it happens because a feedparser entry contains data that cannot be losslessly serialised to JSON, e.g. time.struct_time instances. Unfortunately, couchdb-python then goes on to assume the body is a file, masking the actual error.
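
To illustrate how a swallowed encoding error can resurface as a confusing `read` error, here is a minimal sketch of that pattern. It is not couchdb-python's actual code: `send_body` is a hypothetical name, and a `set` is used only as a stand-in for any value the JSON encoder rejects.

```python
import io
import json


def send_body(body):
    """Hypothetical stand-in for an HTTP client's request-body handling."""
    if not isinstance(body, (str, bytes)):
        try:
            body = json.dumps(body)
        except TypeError:
            # The real cause (an unserializable value) is swallowed here.
            pass
    if isinstance(body, (str, bytes)):
        return "sent %d characters" % len(body)
    # Anything that is still not a string is assumed to be file-like.
    return "streamed %d characters" % len(body.read())


print(send_body({"foo": "bar"}))        # a serializable dict is sent as text
print(send_body(io.StringIO("abc")))    # a real file-like object is streamed

try:
    send_body({"tags": {"a", "b"}})     # a set is not JSON serializable
except AttributeError as exc:
    print(exc)                          # complains about 'read', not about JSON
```

Under these assumptions, the small hand-written dict works while the richer one fails with an error that never mentions serialization, which matches the symptoms in the question.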

Matt Goodall

Maybe there is a bug in Python CouchDB. You could say it is not sufficiently liberal in what it accepts.

But, basically, CouchDB stores JSON. You should work with whatever "JSON" is in your language. Obviously with Python that means dict objects.

You might get the best bang-for-the-buck figuring out how to convert all your types to a plain Python dict before calling into CouchDB. Maybe that's not the most "right" solution, but I suspect it is the quickest.
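
One way to do that conversion is a small recursive helper. This is only a sketch: `to_plain` is a hypothetical name, the ISO 8601 string format is my own choice, and the trailing `Z` assumes the parsed dates are in UTC.

```python
import time

try:
    from collections.abc import Mapping  # Python 3
except ImportError:
    from collections import Mapping      # Python 2.7


def to_plain(value):
    """Recursively copy value into plain dicts, lists and strings."""
    # Check struct_time before tuple, since it behaves like a tuple.
    if isinstance(value, time.struct_time):
        return time.strftime('%Y-%m-%dT%H:%M:%SZ', value)
    if isinstance(value, Mapping):
        return dict((key, to_plain(item)) for key, item in value.items())
    if isinstance(value, (list, tuple)):
        return [to_plain(item) for item in value]
    return value


# A hand-made entry standing in for one feedparser item.
entry = {'title': 'Hello',
         'updated_parsed': time.gmtime(0),
         'links': [{'href': 'http://example.com/'}]}
print(to_plain(entry))
```

With something like this, the save call in the question would become `db.save(to_plain(item))`, without the extra trip through a JSON string.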

My Python is rusty. Is it possible that dict(foo) could ever return a non-dict? Maybe FeedParserDict subclasses dict and then uses metaprogramming to return itself when dict() is called? Can you confirm that type(dict(item)) is definitely a plain Python dict?
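
For what it's worth, a quick check along these lines shows what `dict()` does with a dict subclass. `FancyDict` is a hypothetical stand-in for `FeedParserDict`, not the real class.

```python
class FancyDict(dict):
    """Hypothetical stand-in for a dict subclass such as FeedParserDict."""


inner = FancyDict(term='python')
item = FancyDict(title='Hello', tags=[inner])

plain = dict(item)

print(type(plain) is dict)              # True: the outer object is a plain dict
print(type(plain['tags'][0]) is dict)   # False: nested values keep their type
print(plain['tags'][0] is inner)        # True: dict() makes a shallow copy
```

So `dict(item)` does give a plain dict at the top level, but any special values nested inside it are passed through untouched.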

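A common trick in JavaScript land is to round-trip through a serializer such as JSON. In Python that would be something like json.loads(json.dumps(item)), which pretty much guarantees you have a plain copy of the core data. (A pickle round trip would not help here, because pickle restores the original types.)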
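
Here is a small sketch comparing two round trips, with `OrderedDict` standing in for any dict subclass: going through JSON yields a plain dict, while going through pickle restores the original class.

```python
import json
import pickle
from collections import OrderedDict

original = OrderedDict([('title', 'Hello'), ('tags', ['a', 'b'])])

via_json = json.loads(json.dumps(original))
via_pickle = pickle.loads(pickle.dumps(original))

print(type(via_json) is dict)     # True: JSON only knows plain objects
print(type(via_pickle) is dict)   # False: pickle restores OrderedDict
print(via_json == {'title': 'Hello', 'tags': ['a', 'b']})  # True
```

Note that `json.dumps` raises `TypeError` for values it cannot encode unless a `default` function is supplied, so the JSON round trip also surfaces the real problem instead of hiding it.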

JasonSmith
  • Let me know about the `type(dict(item))` result. If my hypothesis is wrong, maybe something else will ring a bell. Excellent question by the way! – JasonSmith Apr 01 '11 at 04:52