How to create a tensorflow dataset from a DataFrame with vector columns?

Question

So I have some train data in a csv file train.csv with the following format:

x;y;type
[1,2,3];[2,3,4];A
[2,7,9];[0,1,2];B

This file is parsed as a pd.DataFrame with the following:

from ast import literal_eval
import pandas as pd
import tensorflow as tf

CSV_COLUMN_NAMES = ['x', 'y', 'type']
train = pd.read_csv("train.csv", names=CSV_COLUMN_NAMES, header=0, delimiter=";")
train['x'] = train['x'].apply(literal_eval)
train['y'] = train['y'].apply(literal_eval)

So far so good. The literal_eval function is applied so that x and y are treated as arrays rather than strings. The next step is to create a Dataset with the following:

features, labels = train, train.pop('type')
dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))

And here is where it breaks :( It raises the following error:

TypeError: Expected binary or unicode string, got [1, 2, 3]

Why is a binary or unicode string expected? Are vector feature columns not allowed? Or am I doing something wrong? Please shed some light.

score 4 · Accepted Answer · answered Jun 01 '18 at 16:16

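TF can automatically create a tensor from a DataFrame column as long as the column holds a single plain data type. In this case, after literal_eval, the x and y columns hold Python lists (object dtype), which TF does not seem to convert automatically.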
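To see the mismatch concretely, here is a minimal sketch that inspects the column after literal_eval. The inline CSV string and the variable names are only for illustration (they stand in for train.csv from the question):

```python
from ast import literal_eval
from io import StringIO

import pandas as pd

# Inline copy of the two sample rows from the question (stands in for train.csv).
csv_text = "x;y;type\n[1,2,3];[2,3,4];A\n[2,7,9];[0,1,2];B\n"
train = pd.read_csv(StringIO(csv_text), names=["x", "y", "type"],
                    header=0, delimiter=";")
train["x"] = train["x"].apply(literal_eval)

# Each cell is now a Python list, so pandas stores the column as dtype object.
print(train["x"].dtype)          # object
print(type(train["x"].iloc[0]))  # <class 'list'>

# Converting the column to NumPy gives a 1-D object array, not a 2-D numeric one.
as_array = train["x"].to_numpy()
print(as_array.shape, as_array.dtype)  # (2,) object
```

An object array like this has no numeric dtype that could be mapped to a tensor dtype, which is consistent with the error in the question.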

Without literal_eval the code seems to work, since each feature is a plain string rather than a mixed type:

import pandas as pd
import tensorflow as tf  # TF 1.x API (tf.Session, make_initializable_iterator)

CSV_COLUMN_NAMES = ['x', 'y', 'type']
# train.csv from the question is semicolon-delimited
train = pd.read_csv("train.csv", names=CSV_COLUMN_NAMES, header=0, delimiter=";")

features, labels = train, train.pop('type')

dataset = tf.data.Dataset.from_tensor_slices((dict(features), labels))
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    print(sess.run(next_element))
    print(sess.run(next_element))

Output:

({'y': b'[2, 3, 4]', 'x': b'[1, 2, 3]'}, b'A')
({'y': b'[0, 1, 2]', 'x': b'[2, 7, 9]'}, b'B')

Based on this solution (How to convert a Numpy 2D array with object dtype to a regular 2D array of floats), converting each object-typed column to a regular array with np.vstack makes it work:

from ast import literal_eval
import numpy as np

train['x'] = train['x'].apply(literal_eval)
train['y'] = train['y'].apply(literal_eval)

features, labels = train, train.pop('type')
# np.vstack turns each column of lists into one regular 2-D array
dataset = tf.data.Dataset.from_tensor_slices(
    ((np.vstack(features['x']), np.vstack(features['y'])), labels))

iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()

with tf.Session() as sess:
    sess.run(iterator.initializer)
    print(sess.run(next_element))
    print(sess.run(next_element))

Output:

((array([1, 2, 3]), array([2, 3, 4])), b'A')
((array([2, 7, 9]), array([0, 1, 2])), b'B')
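As a rough illustration of what np.vstack is doing here, the sketch below stacks the two x vectors from the question into one 2-D array. The names rows, stacked, and ragged_error are made up for this example. It also shows a limitation worth keeping in mind: the approach assumes every row has the same length.

```python
import numpy as np

# Values of column x after literal_eval, taken from the sample data.
rows = [[1, 2, 3], [2, 7, 9]]

# Each list becomes one row of a regular 2-D integer array.
stacked = np.vstack(rows)
print(stacked.shape)       # (2, 3)
print(stacked.dtype.kind)  # i

# Vectors of different lengths cannot be stacked; NumPy raises ValueError.
ragged_error = None
try:
    np.vstack([[1, 2, 3], [4, 5]])
except ValueError as err:
    ragged_error = err
print(type(ragged_error).__name__)  # ValueError
```

If the vectors can differ in length, they would need to be padded or handled some other way before stacking.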

Thanks! This works. It turns out in my case a dict is required so there were a few more steps. Will add an answer to mark this. — jack3694078, Jun 04 '18 at 00:49

score 1 · Answer 2 · answered Jun 04 '18 at 00:52

See the other answer for making a dataset. If you encounter the error "features should be a dictionary of `Tensor`s", use the following to build the feature dictionary:

def dfToFeature(df):
    result = {}
    for key in df.keys():
        result[key] = np.vstack(df[key])
    return result
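For completeness, here is a sketch of how such a dictionary could be fed into a dataset. It assumes TensorFlow 2.x with eager execution (not the TF 1.x session API used in the accepted answer), and the helper name df_to_feature and the inline DataFrame are illustrative stand-ins for dfToFeature and the parsed train.csv:

```python
import numpy as np
import pandas as pd
import tensorflow as tf  # assumption: TensorFlow 2.x, eager execution enabled


def df_to_feature(df):
    """Stack each column of equal-length lists into one 2-D array."""
    return {key: np.vstack(df[key].tolist()) for key in df.keys()}


# Stand-in for the DataFrame produced by read_csv + literal_eval.
train = pd.DataFrame({
    "x": [[1, 2, 3], [2, 7, 9]],
    "y": [[2, 3, 4], [0, 1, 2]],
    "type": ["A", "B"],
})
labels = train.pop("type")

# Features are a dict of arrays keyed by column name; labels are plain strings.
dataset = tf.data.Dataset.from_tensor_slices(
    (df_to_feature(train), labels.tolist()))

# In TF 2.x a dataset can be iterated directly, without a session.
for features, label in dataset:
    print({k: v.numpy() for k, v in features.items()}, label.numpy())
```

Each element is then a (dict, label) pair, with one vector per feature key.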
