
2 years ago I wrote code in TensorFlow, and as part of the data loading I used the function 'load_csv_without_header'. Now, when I'm running the code, I get the message:

WARNING:tensorflow:From C:\Users\Roi\Desktop\Code_Win_Ver\code_files\Tensor_Flow\version1\build_database_tuple.py:124: load_csv_without_header (from tensorflow.contrib.learn.python.learn.datasets.base) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.data instead.

How do I use 'tf.data' instead of the current function? How can I get the same dtype, in the same format, without the CSV header, using tf.data? I'm using TF version 1.8.0 with Python 3.5.

Appreciate your help!

roishik

2 Answers


Using tf.data to work with a CSV file:

From TensorFlow's official documentation:

The tf.data module contains a collection of classes that allows you to easily load data, manipulate it, and pipe it into your model.

Using the API, tf.data.Dataset is intended as the new standard for interfacing with data in TensorFlow. It represents "a sequence of elements, in which each element contains one or more Tensor objects". For a CSV, an element is just a single row (one training example), represented as a pair of tensors that correspond to the data (our x) and the label (the "target") respectively.

The primary way to extract each row (or, more accurately, each element) from a TensorFlow dataset (tf.data.Dataset) is to consume it through an iterator, and TensorFlow provides the tf.data.Iterator API for that. To return the next row, for example, we can call get_next() on the Iterator.

Now onto the code that takes a CSV and transforms it into our TensorFlow dataset.

Method 1: tf.data.TextLineDataset() and tf.decode_csv()

With more recent versions of TensorFlow's Estimator API, instead of load_csv_without_header, you'd read your CSV with the more generic tf.data.TextLineDataset(your_train_path). You can chain that with skip() to skip the first row if there is a header row, but in your case that isn't necessary.

You can then use tf.decode_csv() to decode each line of your CSV into its respective fields.
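Conceptually, tf.decode_csv() splits each line into fields and uses the record defaults you pass (FIELD_DEFAULTS below) both to fill in missing values and to fix each column's dtype. A rough plain-Python sketch of that behavior (the helper name and logic are illustrative, not TensorFlow's actual implementation):

```python
import csv
from io import StringIO

def decode_csv_line(line, field_defaults):
    """Illustrative stand-in for tf.decode_csv on a single line:
    split the line, fill empty fields from the defaults, and cast
    each field to the default's type (float or int)."""
    values = next(csv.reader(StringIO(line)))
    fields = []
    for value, (default,) in zip(values, field_defaults):
        if value == "":
            fields.append(default)               # missing field -> default
        else:
            fields.append(type(default)(value))  # cast to the default's dtype
    return fields

FIELD_DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0]]
print(decode_csv_line("6.4,2.8,5.6,2.2,2", FIELD_DEFAULTS))
# [6.4, 2.8, 5.6, 2.2, 2]
print(decode_csv_line("6.4,,5.6,2.2,1", FIELD_DEFAULTS))
# [6.4, 0.0, 5.6, 2.2, 1]
```

Note how the last column comes out as an int because its default is [0], while the others come out as floats; tf.decode_csv infers the output dtypes from the defaults in the same spirit.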

The code solution:

import tensorflow as tf
train_path = 'data_input/iris_training.csv'
# if no header, remove .skip()
trainset = tf.data.TextLineDataset(train_path).skip(1)

# Metadata describing the text columns
COLUMNS = ['SepalLength', 'SepalWidth',
           'PetalLength', 'PetalWidth',
           'label']
FIELD_DEFAULTS = [[0.0], [0.0], [0.0], [0.0], [0]]
def _parse_line(line):
    # Decode the line into its fields
    fields = tf.decode_csv(line, FIELD_DEFAULTS)

    # Pack the result into a dictionary
    features = dict(zip(COLUMNS,fields))

    # Separate the label from the features
    label = features.pop('label')

    return features, label

trainset = trainset.map(_parse_line)
print(trainset)

You would get:

<MapDataset shapes: ({
    SepalLength: (), 
    SepalWidth: (), 
    PetalLength: (), 
    PetalWidth: ()}, ()), 
types: ({
    SepalLength: tf.float32, 
    SepalWidth: tf.float32, 
    PetalLength: tf.float32, 
    PetalWidth: tf.float32}, tf.int32)>

You can verify the output classes:

({'PetalLength': tensorflow.python.framework.ops.Tensor,
  'PetalWidth': tensorflow.python.framework.ops.Tensor,
  'SepalLength': tensorflow.python.framework.ops.Tensor,
  'SepalWidth': tensorflow.python.framework.ops.Tensor},
 tensorflow.python.framework.ops.Tensor)

You can also use get_next() (or, with eager execution enabled, next()) to step through the iterator:

x = trainset.make_one_shot_iterator()
x.next()
# Output:
({'PetalLength': <tf.Tensor: id=165, shape=(), dtype=float32, numpy=1.3>,
  'PetalWidth': <tf.Tensor: id=166, shape=(), dtype=float32, numpy=0.2>,
  'SepalLength': <tf.Tensor: id=167, shape=(), dtype=float32, numpy=4.4>,
  'SepalWidth': <tf.Tensor: id=168, shape=(), dtype=float32, numpy=3.2>},
 <tf.Tensor: id=169, shape=(), dtype=int32, numpy=0>)
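From here, the usual next step before feeding an Estimator (not shown in the question) is to shuffle, repeat, and batch the parsed dataset. A minimal sketch with a toy stand-in for the parsed CSV dataset (the feature name, values, and sizes are illustrative):

```python
import numpy as np
import tensorflow as tf

# Toy (features, label) elements standing in for the parsed CSV dataset above
features = {'SepalLength': np.array([5.1, 4.9, 6.4], dtype=np.float32)}
labels = np.array([0, 0, 2], dtype=np.int32)
ds = tf.data.Dataset.from_tensor_slices((features, labels))

# Shuffle, repeat indefinitely, and batch; the sizes here are illustrative
ds = ds.shuffle(buffer_size=100).repeat().batch(2)
print(ds)
```

Each element of the batched dataset is then a (features, label) pair where every tensor has a leading batch dimension, which is the shape an Estimator's input_fn is expected to produce.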

Method 2: from_tensor_slices() to construct a dataset object from numpy or pandas

train, test = tf.keras.datasets.mnist.load_data()
mnist_x, mnist_y = train

mnist_ds = tf.data.Dataset.from_tensor_slices(mnist_x)
print(mnist_ds)
# returns: <TensorSliceDataset shapes: (28,28), types: tf.uint8>
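from_tensor_slices() also accepts a tuple, so features and labels can be sliced together into (x, y) pairs; a small sketch with toy NumPy arrays standing in for mnist_x and mnist_y:

```python
import numpy as np
import tensorflow as tf

# Toy arrays standing in for (mnist_x, mnist_y)
x = np.zeros((4, 28, 28), dtype=np.uint8)
y = np.array([7, 2, 1, 0], dtype=np.uint8)

# Each element of the dataset is now an (image, label) pair
ds = tf.data.Dataset.from_tensor_slices((x, y))
print(ds)
```

The first dimension (here 4) is treated as the slicing axis, so both arrays must agree on it.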

Another (more elaborate) example:

import numpy as np
import pandas as pd
import tensorflow as tf

california_housing_dataframe = pd.read_csv("https://download.mlcc.google.com/mledu-datasets/california_housing_train.csv", sep=",")
# Define the input feature: total_rooms
my_feature = california_housing_dataframe[["total_rooms"]]

# Configure a numeric feature column for total_rooms
feature_columns = [tf.feature_column.numeric_column("total_rooms")]

# Define the label
targets = california_housing_dataframe["median_house_value"]

# Convert the pandas data into a dict of np arrays
features = {key: np.array(value) for key, value in dict(my_feature).items()}

# Construct a dataset, and configure batching/repeating
ds = tf.data.Dataset.from_tensor_slices((features, targets))
ds = ds.batch(1).repeat()

I also strongly suggest this article and this one, both from the official documentation; it's safe to say they should cover most if not all of your use cases and will help you migrate from the deprecated load_csv_without_header() function.

onlyphantom

You can use tf.TextLineReader, which has an option to skip headers:

reader = tf.TextLineReader(skip_header_lines=1)
Mufeed