39

I am doing some exercises with datasets like so:

List with many dictionaries

users = [
    {"id": 0, "name": "Ashley"},
    {"id": 1, "name": "Ben"},
    {"id": 2, "name": "Conrad"},
    {"id": 3, "name": "Doug"},
    {"id": 4, "name": "Evin"},
    {"id": 5, "name": "Florian"},
    {"id": 6, "name": "Gerald"}
]

Dictionary with few lists

users2 = {
    "id": [0, 1, 2, 3, 4, 5, 6],
    "name": ["Ashley", "Ben", "Conrad", "Doug", "Evin", "Florian", "Gerald"]
}

Pandas dataframes

import pandas as pd
pd_users = pd.DataFrame(users)
pd_users2 = pd.DataFrame(users2)
print(pd_users == pd_users2)

Questions:

  1. Should I structure the datasets like users or like users2?
  2. Are there performance differences?
  3. Is one more readable than the other?
  4. Is there a standard I should follow?
  5. I usually convert these to pandas DataFrames. When I do that, both versions are identical... right?
  6. The output is True for each element, so it doesn't matter which form I start from if I work with pandas DataFrames, right?
Brian Tompsett - 汤莱恩
megashigger

6 Answers

29

This relates to column-oriented versus row-oriented databases. Your first example is a row-oriented data structure, and the second is column-oriented. In the particular case of Python, the first could be made notably more memory-efficient by using `__slots__`, so that the dictionary of column names doesn't need to be duplicated for every row.
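A minimal sketch of the `__slots__` idea, assuming the rows are turned into instances of a small class. The class names `UserWithDict` and `UserWithSlots` are made up for illustration and are not part of the question:

```python
class UserWithDict:
    """Ordinary class: every instance carries its own __dict__."""

    def __init__(self, id, name):
        self.id = id
        self.name = name


class UserWithSlots:
    """Slotted class: the attribute names are declared once, on the class."""

    __slots__ = ("id", "name")

    def __init__(self, id, name):
        self.id = id
        self.name = name


names = ["Ashley", "Ben", "Conrad", "Doug", "Evin", "Florian", "Gerald"]
slotted_users = [UserWithSlots(i, n) for i, n in enumerate(names)]

plain = UserWithDict(0, "Ashley")
slotted = slotted_users[0]

# The plain instance has a per-instance dictionary; the slotted one does not.
has_dict_plain = hasattr(plain, "__dict__")
has_dict_slotted = hasattr(slotted, "__dict__")
```

The trade-off is that a slotted instance cannot grow new attributes at runtime, so this only fits when every row has the same fixed set of fields.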

Which form works better depends a lot on what you do with the data; for instance, row-oriented is natural if you always access whole rows at a time. Column-oriented, meanwhile, makes much better use of caches and such when you're searching by a particular field (in Python this benefit is reduced because lists hold references to objects rather than the values themselves; typed containers like array can recover it). Traditional row-oriented databases frequently use column-oriented sorted indices to speed up lookups, and knowing these techniques you can implement any combination using a key-value store.
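As a rough illustration of the column-oriented access pattern, here is a sketch that stores the `id` column in a typed `array.array` (an assumption for the example; the question uses plain lists) and searches a single column before gathering a full row:

```python
from array import array

users2 = {
    # 'q' is a signed 64-bit integer typecode; values are stored contiguously.
    "id": array("q", [0, 1, 2, 3, 4, 5, 6]),
    "name": ["Ashley", "Ben", "Conrad", "Doug", "Evin", "Florian", "Gerald"],
}

# Searching by one field only touches that column.
pos = users2["name"].index("Doug")
doug_id = users2["id"][pos]

# Reading a whole row means gathering one value from every column.
row = {column: values[pos] for column, values in users2.items()}
```

Searching is cheap to express here because only one list is scanned, while rebuilding a row costs one lookup per column. In the row-oriented layout the balance is the other way around.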

Pandas does convert both your examples to the same format, but the conversion itself is more expensive for the row oriented structure, simply because every individual dictionary must be read. All of these costs may be marginal.
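To make the conversion cost concrete, here is a plain-Python sketch of converting between the two layouts. It is only an illustration of the work involved, not how pandas does it internally, and the helper names `rows_to_columns` and `columns_to_rows` are hypothetical:

```python
users = [
    {"id": 0, "name": "Ashley"},
    {"id": 1, "name": "Ben"},
    {"id": 2, "name": "Conrad"},
    {"id": 3, "name": "Doug"},
    {"id": 4, "name": "Evin"},
    {"id": 5, "name": "Florian"},
    {"id": 6, "name": "Gerald"},
]

users2 = {
    "id": [0, 1, 2, 3, 4, 5, 6],
    "name": ["Ashley", "Ben", "Conrad", "Doug", "Evin", "Florian", "Gerald"],
}


def rows_to_columns(rows):
    """Visit every dict and every key; assumes all rows share the same keys."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def columns_to_rows(columns):
    """Zip the column lists together and rebuild one dict per position."""
    keys = list(columns)
    return [dict(zip(keys, values)) for values in zip(*columns.values())]


converted_to_columns = rows_to_columns(users)
converted_to_rows = columns_to_rows(users2)
```

Starting from rows, every individual dictionary has to be opened and read key by key, which is the extra work mentioned above.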

There's a third option not evident in your example: In this case, you only have two columns, one of which is an integer ID in a contiguous range from 0. This can be stored in the order of the entries itself, meaning the entire structure would be found in the list you've called users2['name']; but notably, the entries are incomplete without their position. The list translates into rows using enumerate(). It is common for databases to have this special case also (for instance, sqlite rowid).
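A short sketch of that third option, using the names from the question: only the list of names is stored, and the `id` is recovered from each entry's position with `enumerate()`.

```python
names = ["Ashley", "Ben", "Conrad", "Doug", "Evin", "Florian", "Gerald"]

# The position in the list is the id, so direct access by id is just indexing.
name_for_id_3 = names[3]

# Rebuild the row-oriented form whenever full records are needed.
users = [{"id": i, "name": name} for i, name in enumerate(names)]
```

This only holds while the ids stay a contiguous range starting at 0; deleting an entry from the middle would silently renumber everything after it.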

In general, start with a data structure that keeps your code sensible, and optimize only when you know your use cases and have a measurable performance issue. Tools like Pandas probably mean most projects will function just fine without fine-tuning.

Yann Vernier
  • 1
    An example of using `slots` to save memory: http://tech.oyster.com/save-ram-with-python-slots/ – 0 _ Sep 20 '17 at 03:42
7

Time complexity of lookups:

  • List (searching for a value) - O(n)
  • Dict (lookup by key) - O(1)

But that won't hurt much if your data isn't that big, and modern processors are quite efficient.
You should go with the one in which the lookup is syntactically cleaner and more readable (readability matters).
The first option is quite appropriate, as the variable is a collection of users (each of which has been assigned an id), while the second would be just a collection of usernames and ids.
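To separate the different kinds of "lookup" involved, here is a small sketch using the `users` list from the question. The `by_name` index is an extra structure introduced for the example:

```python
users = [
    {"id": 0, "name": "Ashley"},
    {"id": 1, "name": "Ben"},
    {"id": 2, "name": "Conrad"},
    {"id": 3, "name": "Doug"},
    {"id": 4, "name": "Evin"},
    {"id": 5, "name": "Florian"},
    {"id": 6, "name": "Gerald"},
]

# Access by position in a list is O(1).
user_at_3 = users[3]

# Searching a list for a value is O(n): every element may be inspected.
doug = next(user for user in users if user["name"] == "Doug")

# Looking up a key in a dict is O(1) on average, so build an index once
# if the same field is searched repeatedly.
by_name = {user["name"]: user for user in users}
florian = by_name["Florian"]
```

Building `by_name` is itself O(n), so it pays off only when several lookups by name follow.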

Anurag-Sharma
  • 1
    "You should go with the one in which the lookup is syntactically cleaner and readable" +1. But I don't think time complexity matters as we don't know how he is accessing that data. – Artur Gaspar May 29 '15 at 15:24
  • Actually, Python's lists are reference arrays, and have O(1) lookup. You might have been expecting linked lists. – Yann Vernier Nov 07 '19 at 08:37
  • @YannVernier He meant a lookup of a specific value in the list, not just access by index. – feature_engineer Apr 20 '20 at 17:24
  • `dict` is no better at finding values than `list`; it only matters when looking up keys, compared to [association lists](https://en.wikipedia.org/wiki/Association_list). Since `dict` is always available in Python, we rarely use that form, but it's accepted as input when [creating a `dict`](https://docs.python.org/3/library/stdtypes.html#dict). Do go with `dict` if you always look up by the same key and that key is not a contiguous series of integers from 0, which `id` happens to be in this example. – Yann Vernier May 19 '20 at 15:39
7

Users

  1. When you need to add a new user, just build a new dict with all the user's details and append it

  2. Easily sortable as @StevenRumbalski suggested

  3. Searching will be easy

  4. This is more compact and easily manageable as the number of records grows (for a very high number of records I think we will need something better than users too)
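A sketch of the append and sort points above, with the equivalent steps for the `users2` layout shown for comparison. The new user "Hank" is invented for the example:

```python
from operator import itemgetter

names = ["Ashley", "Ben", "Conrad", "Doug", "Evin", "Florian", "Gerald"]
users = [{"id": i, "name": n} for i, n in enumerate(names)]
users2 = {"id": list(range(7)), "name": list(names)}

# Row-oriented: one append adds a complete record.
users.append({"id": 7, "name": "Hank"})  # hypothetical new user

# Column-oriented: every column must be appended to, and kept in step.
users2["id"].append(7)
users2["name"].append("Hank")

# Row-oriented: sorting by a field is a single call.
users_sorted = sorted(users, key=itemgetter("name"), reverse=True)

# Column-oriented: compute the new order once, then reorder every column.
order = sorted(
    range(len(users2["name"])),
    key=users2["name"].__getitem__,
    reverse=True,
)
users2_sorted = {col: [vals[i] for i in order] for col, vals in users2.items()}
```

Both layouts end up in the same order; the row-oriented version simply needs less bookkeeping for these record-at-a-time operations.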

Users2

  1. Personally, I am seeing this for the first time, and I wouldn't take this approach if I had a high number of records.

PS: I would like to learn the advantages of users2 over users, though. Again, a nice question.

0 _
therealprashant
5

users, in the general sense, is a collection of user elements, so it's better to define the user element as a standalone entity. Your first option is therefore the right one.

dlask
4

Some answers regarding the pandas aspect:

  1. Both dataframes are indeed the same and are column oriented, which is good, because pandas works best when data in each column is homogeneous (i.e. numbers can be stored as ints and floats). A key reason for using pandas in the first place is that you can do vectorized numerical operations that are orders of magnitude faster than pure python -- but this relies on columnar organization when data is of heterogeneous type.
  2. You could do pd_users.T to transpose, if you wanted to, and would then see (via info() or dtypes) that everything is then stored as a general purpose object because the column contains both strings and numbers.
  3. Once converted, you can do pd_users.set_index('id') so that your dataframe is essentially a dictionary with id as the keys. Or vice versa with name.
  4. It's pretty common (and generally pretty fast) to change indexes, then change them back, transpose, subset, etc. when working with pandas so it's usually not necessary to think too much about the structure at the beginning. Just change it as you need to on the fly.
  5. This may be getting off on a tangent, but a simpler pandas analog of what you have above may be a Series rather than a DataFrame. A Series is essentially a column of a dataframe, though it really is just a one-dimensional data array with an index ("keys").

Quick demo (using df as the dataframe name, the common convention):

>>> df.set_index('name')

         id
name       
Ashley    0
Ben       1
Conrad    2
Doug      3
Evin      4
Florian   5
Gerald    6

>>> df.set_index('name').T

name  Ashley  Ben  Conrad  Doug  Evin  Florian  Gerald
id         0    1       2     3     4        5       6

>>> df.set_index('name').loc['Doug']

id    3
Name: Doug, dtype: int64
JohnE
  • Hey! You mentioned that both data frames are column oriented. The most upvoted answer right now suggest one is column and the other is row. Can you confirm? – megashigger May 29 '15 at 13:18
  • 1
    I believe @YannVernier is only referring to the case *before* converting to pandas. You already proved they are the same yourself with `pd_users == pd_users2`. But you could do `pd_users == pd_users2.T` (put a transpose on either one) to further verify. It will raise an exception because the two dataframes no longer conform. Aside from checking for equality, just printing the dataframe shows how pandas is structuring data in terms of rows and columns. – JohnE May 29 '15 at 13:26
  • Ah ok that makes sense. Thanks for clarifying. – megashigger May 30 '15 at 02:38
1

The first option, a list of dictionaries, will be much better for quite a few reasons. A list provides methods such as extend and append, which are not readily available with dictionaries.

Pralhad Narsinh Sonar