Questions tagged [petastorm]

27 questions
4
votes
0 answers

ValueError: Items of feature_columns must be a _FeatureColumn. (Tensorflow 1.13)

I'm running into a ValueError when running Tensorflow-1.13 + Horovod-0.16 + Spark-0.24 + Petastorm-0.17. It's a straightforward implementation of a model_fn and some indicator_columns, but is throwing an error similar to Items of feature_columns…
3
votes
0 answers

What is the best way to feed training data from a Parquet file to a TensorFlow/Keras model?

I have a training dataset stored on S3 in Parquet format. I wish to load this data into a notebook (on a Databricks cluster) and train a Keras model on it. There are a few ways that I can think of to train a Keras model on this dataset: read parquet file…
exAres
  • 4,806
  • 16
  • 53
  • 95
3
votes
0 answers

Should I create a PyTorch Dataset to train a model off a pyspark dataframe?

I want to train a PyTorch NLP model over training data in columnar format, and I thought of constructing a PyTorch Dataset using a PySpark dataframe as the raw data (not sure it's the right approach...). To preprocess text I'm using a tokenizer provided by…
Davide Fiocco
  • 5,350
  • 5
  • 35
  • 72
2
votes
2 answers

How to print out data that goes to Keras model.fit, specifically when using a Petastorm dataset

Update: While I appreciated AloneTogether's answer, I didn't like that I was using take() and it was separate from model.fit. I put another answer here if you want to look at it. It involves subclassing Model. It's not too bad. End of update. I have…
Craig
  • 177
  • 2
  • 12
2
votes
1 answer

Storing ndarrays into Parquet via uber/petastorm?

Is it possible to store N-dimensional arrays in Parquet via uber/petastorm?
Leo Gallucci
  • 16,355
  • 12
  • 77
  • 110
1
vote
1 answer

Create train and valid dataset in petastorm

Versions: Python 3.7.13, TensorFlow 2.9.1, Petastorm 0.12.1. In Petastorm, it seems the only way to train a model on a dataset created from Petastorm is to fit the model within the Reader context manager, like below, as done in…
haneulkim
  • 4,406
  • 9
  • 38
  • 80
1
vote
0 answers

spark: exec: "executor": executable file not found in $PATH: unknown

I am trying to run some computations using petastorm v0.11.4 in a Docker container and minikube v1.25.2. As long as I run the process locally, everything works as expected. As soon as I try to spread the work across the minikube cluster, I receive the…
skynet1010
  • 143
  • 4
  • 11
1
vote
0 answers

TensorFlow Petastorm, training stuck

I have two very large datasets (terabytes in size) and am using Petastorm to train a TensorFlow model. I load the datasets using Petastorm and then create a single features-and-labels dataset, as I can't pass two separate datasets: train_X_mlp =…
prajwal rao
  • 87
  • 1
  • 9
1
vote
0 answers

Petastorm with Databricks Connect failing

Using Azure Databricks. I have petastorm==0.11.2 and databricks-connect==9.1.0. My databricks-connect session seems to be working: I'm able to read data into my remote workspace. But when I use petastorm to create a spark converter object it says…
Jamalan
  • 482
  • 4
  • 15
1
vote
1 answer

What is the best way to convert time series data (Parquet format) into sequences using Petastorm?

Pardon me if I use the terms in the wrong sense. I am still grappling with many Spark and distributed-computing concepts. Here is my use case; I am not able to get a complete picture of the implementation. I have time-series data of 40 columns and 100…
Ashok Krishna
  • 143
  • 1
  • 5
1
vote
1 answer

How to replace tf.train.batch, as it is deprecated

This is the code for training on MNIST data using Petastorm: def train_and_test(dataset_url, training_iterations, batch_size, evaluation_interval): with make_reader(os.path.join(dataset_url, 'train'), num_epochs=None) as train_reader: with…
Asha
  • 67
  • 5
1
vote
0 answers

Trying to create parquet Petastorm dataset

I'm currently trying to create a parquet petastorm dataset to store a video dataset. My code is: MotionSchema = Unischema('TeaserSchema', [ UnischemaField( 'video', np.uint8, (None, None, None, 3), NdarrayCodec(),…
1
vote
0 answers

InvalidArgumentError when reading parquet files into Keras via Petastorm

I'm trying to read in data from Parquet for a language model. The Parquet file contains two columns: target (int) and feature_vec (int array). I'm adapting the code from this post (which works for me). When I try the code below I get an InvalidArgumentError…
dspringate
  • 1,805
  • 2
  • 13
  • 20
1
vote
2 answers

Creating parquet Petastorm dataset through Spark fails with Overflow error (larger than 4GB)

I'm trying to implement Uber's Petastorm dataset creation, which uses Spark to create a Parquet file, following the tutorial on their GitHub page. The code: spark = SparkSession.builder.config('spark.driver.memory',…
bluesummers
  • 11,365
  • 8
  • 72
  • 108
0
votes
0 answers

How to integrate tf.data.Dataset with Ray Tune for distributed training

Using tensorflow-cpu==2.9.3 and petastorm==0.12.1 on Python 3.7, I've created tf.data.Dataset objects using Petastorm for the train and validation datasets: ds_train (DatasetV1Adapter; I think this is an old version of tf.data.Dataset) and ds_valid (DatasetV1Adapter). First…
haneulkim
  • 4,406
  • 9
  • 38
  • 80