5

I am searching for a persistent data storage solution that can handle heterogenous data stored on disk. PyTables seems like an obvious choice, but the only information I can find on how to append new columns is a tutorial example. The tutorial has the user create a new table with added column, copy the old table into the new table, and finally delete the old table. This seems like a huge pain. Is this how it has to be done?

If so, what are better alternatives for storing mixed data on disk that can accommodate new columns with relative ease? I have looked at sqlite3 as well and the column options seem rather limited there, too.

Zelazny7
  • 39,946
  • 18
  • 70
  • 84
  • 1
    Have you tried mongo? – user17375 Apr 10 '13 at 19:08
  • I've toyed around with it, but many of the numpy dtypes that pandas use aren't natively compatible. And I couldn't find a way to register new data types with mongo the way you can with sqlite3. Have you had any luck? The current path I'm taking is to use HDFStore and put every dataframe column in its own node. See this issue: https://github.com/pydata/pandas/issues/3307 – Zelazny7 Apr 10 '13 at 19:20
  • 1
    Looks like this might be worth pursuing: http://stackoverflow.com/questions/15818712/encoding-custom-python-objects-as-bson-with-pymongo – Zelazny7 Apr 10 '13 at 20:32
  • Thanks for sharing. Actually I realize that my poor background in the database thing prevents me from consistently moving forward in this problem. – user17375 Apr 11 '13 at 06:41

2 Answers2

5

Yes, you must create a new table and copy the original data. This is because Tables are a dense format. This gives it a huge performance benefits but one of the costs is that adding new columns is somewhat expensive.

Anthony Scopatz
  • 3,265
  • 2
  • 15
  • 14
2

thanks for Anthony Scopatz's answer.

I search website and in github, I found someone has shown how to add columns in PyTables. Example showing how to add a column in PyTables

orginal version ,Example showing how to add a column in PyTables, but have some difficulty to migrate.

revised version, Isolated the copying logic, while some terms is deprecated, and it has some minor error in adding new columns.

based on their's contribution, I updated the code for adding new column in PyTables. (Python 3.6, windows)

# -*- coding: utf-8 -*-
"""
PyTables, append a column
    """
import tables as tb
pth='d:/download/'

# Describe a water class
class Water(tb.IsDescription):
    waterbody_name   = tb.StringCol(16, pos=1)   # 16-character String
    lati             = tb.Int32Col(pos=2)        # integer
    longi            = tb.Int32Col(pos=3)        # integer
    airpressure      = tb.Float32Col(pos=4)      # float  (single-precision)
    temperature      = tb.Float64Col(pos=5)      # double (double-precision)

# Open a file in "w"rite mode
# if don't include pth, then it will be in the same path as the code.
fileh = tb.open_file(pth+"myadd-column.h5", mode = "w")

# Create a table in the root directory and append data...
tableroot = fileh.create_table(fileh.root, 'root_table', Water,
                               "A table at root", tb.Filters(1))
tableroot.append([("Mediterranean", 10, 0, 10*10, 10**2),
              ("Mediterranean", 11, -1, 11*11, 11**2),
              ("Adriatic", 12, -2, 12*12, 12**2)])
print ("\nContents of the table in root:\n",
       fileh.root.root_table[:])

# Create a new table in newgroup group and append several rows
group = fileh.create_group(fileh.root, "newgroup")
table = fileh.create_table(group, 'orginal_table', Water, "A table", tb.Filters(1))
table.append([("Atlantic", 10, 0, 10*10, 10**2),
              ("Pacific", 11, -1, 11*11, 11**2),
              ("Atlantic", 12, -2, 12*12, 12**2)])
print ("\nContents of the original table in newgroup:\n",
       fileh.root.newgroup.orginal_table[:])
# close the file
fileh.close()

#%% Open it again in append mode
fileh = tb.open_file(pth+"myadd-column.h5", "a")
group = fileh.root.newgroup
table = group.orginal_table

# Isolated the copying logic
def append_column(table, group, name, column):
    """Returns a copy of `table` with an empty `column` appended named `name`."""
    description = table.description._v_colObjects.copy()
    description[name] = column
    copy = tb.Table(group, table.name+"_copy", description)

    # Copy the user attributes
    table.attrs._f_copy(copy)

    # Fill the rows of new table with default values
    for i in range(table.nrows):
        copy.row.append()
    # Flush the rows to disk
    copy.flush()

    # Copy the columns of source table to destination
    for col in descr:
        getattr(copy.cols, col)[:] = getattr(table.cols, col)[:]

    # choose wether remove the original table
#    table.remove()

    return copy

# Get a description of table in dictionary format
descr = table.description._v_colObjects
descr2 = descr.copy()

# Add a column to description
descr2["hot"] = tb.BoolCol(dflt=False)

# append orginal and added data to table2 
table2 = append_column(table, group, "hot", tb.BoolCol(dflt=False))
# Fill the new column
table2.cols.hot[:] = [row["temperature"] > 11**2 for row in table ]
# Move table2 to table, you can use the same name as original one.
table2.move('/newgroup','new_table')

# Print the new table
print ("\nContents of the table with column added:\n",
       fileh.root.newgroup.new_table[:])
# Finally, close the file
fileh.close()
Renke
  • 452
  • 6
  • 22