I have a Pandas DataFrame in which one column contains a single int per row, and the other column contains a list of anywhere from 2 to 50 ints.

Here is an example:

      EmbedID  MappedC
1911  3096611  [610580, 1396024, 1383000, 2480745, 751823, 97...
1912  3096612  [365607, 917990]
1913  3096613  [1067171, 638200, 2192752, 1609109, 1984544, 3...
1914  3096614  [521163, 217279, 347655]
1915  3096615  [1139429, 1254616, 3034840, 2312074, 68243]

The number in EmbedID serves as the label, and two random numbers chosen from the MappedC column serve as the corresponding inputs.

What's the best way to convert this into a tf.record file?

I have seen guides for converting a single numpy column to a tf.record file, such as these:

https://gist.github.com/swyoon/8185b3dcf08ec728fb22b99016dd533f

Numpy to TFrecords: Is there a more simple way to handle batch inputs from tfrecords?

http://www.machinelearninguru.com/deep_learning/tensorflow/basics/tfrecord/tfrecord.html

However, they all have trouble when the column / array has a varying number of ints.

Edit:

If this changes anything, here are more details about what exactly I am doing with the data.

For training in TensorFlow, the single-int column contains an index for a vector in an embedding matrix. That vector will be used as the label.

The column with multiple ints has the 'input data'. For each label from the column containing a single int, 2 numbers will be chosen at random from the column containing multiple ints.

I am basically doing a word2vec CBOW type of training.

SantoshGupta7
  • Are you happy to just take the first two values from MappedC as your two "random" values, or do you actually need random selection? If you do need randomization, should it be possible to select the same MappedC value twice, or not? – John Zwinck Oct 28 '18 at 04:35
  • I'll need to take a random selection from MappedC. As for your second question, I hadn't thought about the possibility of taking the same MappedC value in different epochs. If possible, I would prefer to take different values for each epoch until all the values have been used, and then start over again. – SantoshGupta7 Oct 28 '18 at 04:43
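
The preference described in the last comment (use different values each epoch until all have been used, then start over) could be sketched in plain Python as below. The generator name epoch_pairs is hypothetical and not part of any library, and the sketch assumes each list has at least two values, as stated in the question.

```python
import random

def epoch_pairs(values, rng):
    """Yield one pair per epoch without reusing a value within a pass.

    Hypothetical helper for illustration. When fewer than two unused
    values remain (e.g. one leftover in an odd-length list), the leftover
    is dropped and a fresh shuffled pass over all values begins.
    """
    pool = []
    while True:
        if len(pool) < 2:
            # Start a new pass: a shuffled copy, so `values` is not mutated.
            pool = rng.sample(values, len(values))
        yield pool.pop(), pool.pop()

# Row 1915 from the question's example.
row = [1139429, 1254616, 3034840, 2312074, 68243]
pairs = epoch_pairs(row, random.Random(0))  # seeded only for repeatability
epoch_1 = next(pairs)
epoch_2 = next(pairs)
```

Here epoch_1 and epoch_2 together contain four distinct values from the row; the third call reshuffles because only one unused value is left.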

1 Answer


First, shuffle your MappedC values:

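import random
# random.shuffle reorders each list in place and returns None,
# so the Series returned by apply() can be ignored.
df.MappedC.apply(random.shuffle)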

Then take the first and second values:

df.MappedC.str[0]
df.MappedC.str[1]

df.MappedC.str looks like something about strings, which may be confusing, but Series.str works for lists as well as for strings, so this lets us choose the first and second element of each list, and efficiently construct new Series from those.

You can now use the usual methods to put the data into TensorFlow, as you now have two plain Series of integers.

Alternatively, this will give you a Series of randomly chosen pairs:

df.MappedC.map(lambda row: random.sample(row, 2))
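
As a minimal sketch of the two approaches above, here they are applied to plain Python lists instead of a DataFrame, so the behaviour can be checked without pandas. The rows are the three complete ones from the question's example; the seeded random.Random instance is an assumption added only to make the sketch repeatable.

```python
import random

# Complete rows copied from the question's example (rows 1912, 1914, 1915).
mapped_c = [
    [365607, 917990],
    [521163, 217279, 347655],
    [1139429, 1254616, 3034840, 2312074, 68243],
]

rng = random.Random(0)  # seeded only so the sketch is repeatable

# Approach 1: shuffle each list (here a shuffled copy, leaving the
# original untouched), then take the first and second values.
shuffled = [rng.sample(row, len(row)) for row in mapped_c]
first = [row[0] for row in shuffled]
second = [row[1] for row in shuffled]

# Approach 2: draw two values from distinct positions of each list directly.
pairs = [rng.sample(row, 2) for row in mapped_c]
```

Both approaches sample without replacement, so the same position in a list is never picked twice for one pair.
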
John Zwinck
  • This solution works, but I would either have to keep the dataframe in memory and create new tf.records arrays every time an epoch is finished, or pre-create datasets for each epoch I want to run, which would defeat the purpose of using tf.records in the first place (save RAM, save space, convenience). – SantoshGupta7 Oct 28 '18 at 05:35
  • @SantoshGupta7: I'm not sure what you're after. Do you want a solution that works on quantum computers? I think then we could do all the computations without loading the data. – John Zwinck Oct 29 '18 at 12:02
  • I am looking for a way for a tf.record file to hold an array where the number of integers in each slot varies. – SantoshGupta7 Oct 29 '18 at 23:32
  • Then you will have to pad the shorter lists to make the data rectangular. See https://datascience.stackexchange.com/questions/15056/how-to-use-lists-in-tensorflow – John Zwinck Oct 30 '18 at 01:00
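
The padding suggested in the last comment could look roughly like this, as a minimal sketch in plain Python. The helper name pad_rows and the sentinel value -1 are assumptions for illustration, not part of any library; -1 is only safe if it never occurs as a real ID.

```python
def pad_rows(rows, pad_value=-1):
    """Pad every list to the length of the longest one.

    Hypothetical helper for illustration. Returns the rectangular data
    plus the true length of each row, so the padding can be ignored
    when sampling later.
    """
    width = max(len(row) for row in rows)
    lengths = [len(row) for row in rows]
    padded = [row + [pad_value] * (width - len(row)) for row in rows]
    return padded, lengths

# Rows 1912 and 1914 from the question's example.
padded, lengths = pad_rows([[365607, 917990], [521163, 217279, 347655]])
# padded  is [[365607, 917990, -1], [521163, 217279, 347655]]
# lengths is [2, 3]
```

Keeping the true lengths alongside the padded rows means a sampler can restrict itself to the first lengths[i] entries of row i and never pick a padding value.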