I know that Tableau is fairly popular so I cannot be the only person who is using the Python TDE API to get data into the compressed Tableau format. Documentation
1) This API is only Python 2.7 and it is literally the only reason why I need 2.7 installed
2) I use pandas to process and clean raw data and prepare it for Tableau, but then I need to read an intermediate .csv in a 2.7 environment with csv reader and do row based operations (so pandas creates a .csv, then I open that with Python 2.7 and do my TDE stuff)
Here is an example code that I have. The first portion is just creating an empty .tde file and defining the data type per column which can be done separately. The last chunk is what I wish I could do in pandas with a to_tde method. It is inserting each row of data into the tde file along with the data type.
#this doesnt need to be included, it is creating the empty tde table
ExtractAPI.initialize()
start_time = time.clock()
if os.path.isfile(extract_file):
print('Deleting file', extract_file)
os.remove(extract_file)
try:
tde_file = Extract(extract_file)
except:
os.remove(extract_file)
tde_file = Extract(extract_file)
csv = csv.reader(open(data_location + csv_filename, 'rb'), delimiter=',', quotechar='"')
print('Done reading .csv')
#this stuff doesn't really matter, this is just creating the empty tde file
dates = [list of column names]
strings = [list of column names]
doubles = [list of column names]
ints = [list of column names]
def create_column(column, dtype):
tabledef.addColumn(column, dtype)
if tde_file.hasTable('Extract'):
print('Error')
table = tde_file.openTable('Extract')
tabledef = table.getTableDefinition()
else:
tabledef = TableDefinition()
for column in dates:
create_column(column, Type.DATE)
for column in strings:
create_column(column, Type.CHAR_STRING)
for column in doubles:
create_column(column, Type.DOUBLE)
for column in ints:
create_column(column, Type.INTEGER)
table = tde_file.addTable('Extract', tabledef)
print('Done')
#this is the part that I wish I could do in pandas with a to_tde function. it is inserting each row one at a time based on the datatype which was defined
newrow = Row(tabledef)
next(csv)
for line in csv:
date = dt.datetime.strptime(line[0], "%Y-%m-%d")
newrow.setDate(0, date.year, date.month, date.day)
date = dt.datetime.strptime(line[1], "%Y-%m-%d")
newrow.setDate(1, date.year, date.month, date.day)
date = dt.datetime.strptime(line[2], "%Y-%m-%d")
newrow.setDate(2, date.year, date.month, date.day)
newrow.setCharString(3, str(line[3]))
newrow.setCharString(4, str(line[4]))
newrow.setCharString(5, str(line[5]))
newrow.setCharString(6, str(line[6]))
newrow.setCharString(7, str(line[7]))
newrow.setCharString(8, str(line[8]))
newrow.setCharString(9, str(line[9]))
newrow.setCharString(10, str(line[10]))
newrow.setDouble(11, float(line[11]))
newrow.setDouble(12, float(line[12]))
newrow.setDouble(13, float(line[13]))
newrow.setDouble(14, float(line[14]))
newrow.setDouble(15, float(line[15]))
newrow.setInteger(16, int(line[16]))
newrow.setInteger(17, int(line[17]))
newrow.setInteger(18, int(line[18]))
newrow.setInteger(19, int(line[19]))
newrow.setInteger(20, int(line[20]))
table.insert(newrow)
tde_file.close()
ExtractAPI.cleanup()
print('Extract created in {} seconds'.format(time.clock() - start_time))
Is this something which is being worked on by anyone? Does it seem like it would be easy to do (even if not directly added to the main pandas library)?