An approach to address HTML contents in a Python class

Question

I'm working on some HTML parsing, and I'm having a hard time deciding how to store and access the information being extracted.

For example, consider a page like this: http://www.the-numbers.com/movies/1999/FIGHT.php. I want to address every piece of content, such as The Numbers Rating, Rotten Tomatoes, Production Budget, Theatrical Release, and others, so that I can store the value each "key" may take.

The extraction itself is solved; what I'm not sure about is a proper way to store these contents. As I said, they work like "keys", so a dictionary is the obvious answer. Still, I'm tempted to add a member for each of these "keys" to the class I'm building.

The question is which approach works out better, both when writing the code and when accessing these contents, and whether these are the best approaches to this issue.

I would have, for the first case, something like:

class Data:

    def __init__(self):
        self.data = dict()

    def adding_data(self):
        self.data["key1"] = (val1, val2)
        self.data["key2"] = val3
        self.data["key3"] = [val4, val5, val6, ...]

And for the second one:

class Data:

    def adding_data(self):
        self.key1 = (val1, val2)
        self.key2 = val3
        self.key3 = [val4, val5, val6, ...]

The reason I'm considering this is that I'm using the BeautifulSoup API, and I really like the way it lets you address each tag on the resulting "soup".

soup = BeautifulSoup(data)
soup.div
soup.h2
soup.b

Which way do you think is more user-friendly? Is there any better way to do this?

score 0 · Answer 1 · answered Dec 26 '12 at 14:00

If you have a fixed number of attributes, i.e. you know the keys beforehand, then I think the better way is to make each of these keys an instance variable, as in your second example.
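A minimal sketch of the fixed-attribute approach, assuming the set of fields is known in advance. The class name `MovieData` and the attribute names are hypothetical, loosely modelled on the labels mentioned in the question; the value is a placeholder, not data from the page.

```python
class MovieData(object):
    """Holds a fixed, known set of fields scraped from one page."""

    def __init__(self):
        # Declare every attribute up front so that readers (and static
        # checkers) can see the full set of fields in one place.
        self.production_budget = None
        self.theatrical_release = None
        self.rotten_tomatoes = None


movie = MovieData()
movie.production_budget = "placeholder value"  # hypothetical value
```

Declaring the attributes in `__init__` also avoids the "defined outside __init__" style of warning that static checkers tend to report for attributes created later.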

If, on the other hand, you do not know beforehand which "keys" you will have, or if there are too many of them, then you can use a container type like a dictionary. You can add data to the dictionary dynamically, so it is also less of a burden when there are many keys. For example, you can use a "for ... in ..." loop to add the data.
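The loop-based approach could look roughly like this. The list `extracted_pairs` is a hypothetical stand-in for whatever the parsing step produces, and its values are placeholders.

```python
# Hypothetical (label, value) pairs standing in for the parser's output.
extracted_pairs = [
    ("Production Budget", "placeholder A"),
    ("Theatrical Release", "placeholder B"),
]

data = {}
for label, value in extracted_pairs:
    # Keys are created on the fly; no label has to be known in advance.
    data[label] = value

# Reading everything back also needs no knowledge of the labels.
for label in sorted(data):
    print("%s: %s" % (label, data[label]))
```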

score 0 · Accepted Answer · answered Dec 26 '12 at 14:02

If you use instance attributes (self.key1 ...), a tool that checks your code statically (like pylint) will show you unused and undefined attributes, and therefore typos.

class toy(object):
    pass

a = toy()
a.key1 = "hello world"
print(a.key10)  # deliberate typo: key10 instead of key1

Pylint run:

> pylint toto.py
************* Module toto
C:  1,0: Black listed name "toto"
C:  1,0: Missing docstring
C:  1,0:toy: Invalid name "toy" (should match [A-Z_][a-zA-Z0-9]+$)
C:  1,0:toy: Missing docstring
W:  5,0: Attribute 'key1' defined outside __init__
R:  1,0:toy: Too few public methods (0/2)
C:  4,0: Invalid name "a" (should match (([A-Z_][A-Z0-9_]*)|(__.*__))$)
E:  6,6: Instance of 'toy' has no 'key10' member

That won't be the case with keys in a dictionary. A mistyped key is invisible to static checking: it only surfaces at runtime, as a KeyError on lookup or as a silently created new entry on assignment. That is why I would prefer instance attributes. However, if you have a dictionary you can easily iterate through its keys. While you can also get the list of attributes of a class instance with dir(), you will get some noise in it (see key1 lost among the attributes defined by default):

>>> class toy(object):
...     pass
... 
>>> a = toy()
>>> a.key1 = "hello world"
>>> dir(a)
['__class__', '__delattr__', '__dict__', '__doc__', '__format__', '__getattribute__', '__hash__', '__init__', '__module__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', 'key1']
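As a side note, the noise in the `dir()` listing can be avoided when only the attributes set on the instance itself are needed. A small sketch using the built-in `vars()`, which returns the instance's `__dict__`; the class is renamed `Toy` here purely for naming style.

```python
class Toy(object):
    pass


a = Toy()
a.key1 = "hello world"

# vars(a) is the instance's __dict__: it holds only the attributes set on
# this instance, not the defaults inherited from object.
instance_attrs = vars(a)
print(sorted(instance_attrs))  # prints ['key1']
```

This does not work for classes that define `__slots__`, since their instances have no `__dict__`.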

So, if you don't need to iterate over the "keys" you have created, I'd go with instance attributes.

I'll take your suggestion! Thanks for the post! – Rubens, Dec 26 '12 at 14:09
