Questions tagged [petastorm]
27 questions
4
votes
0 answers
ValueError: Items of feature_columns must be a _FeatureColumn. (Tensorflow 1.13)
I'm running into a ValueError when running Tensorflow-1.13 + Horovod-0.16 + Spark-0.24 + Petastorm-0.17. It's a straightforward implementation of a model_fn and some indicator_columns, but it throws an error similar to Items of feature_columns…

Gan
- 41
- 2
3
votes
0 answers
What is the best way to feed training data from parquet file to a Tensorflow/Keras model?
I have a training dataset stored on S3 in Parquet format. I wish to load this data into a notebook (on a Databricks cluster) and train a Keras model on it. There are a few ways that I can think of to train a Keras model on this dataset:
read parquet file…

exAres
- 4,806
- 16
- 53
- 95
3
votes
0 answers
Should I create a PyTorch Dataset to train a model off a pyspark dataframe?
I want to train a PyTorch NLP model over training data in columnar format, and I was thinking of constructing a PyTorch Dataset that uses a PySpark DataFrame as its raw data (I'm not sure it's the right approach...).
To preprocess text I'm using a tokenizer provided by…

Davide Fiocco
- 5,350
- 5
- 35
- 72
2
votes
2 answers
How to print out the data that goes into Keras model.fit, specifically when using a petastorm dataset
Update
While I appreciated AloneTogether's answer, I didn't like that I was using take() and it was separate from model.fit.
I put another answer here if you want to look at it. It involves subclassing Model. It's not too bad.
End of Update
I have…

Craig
- 177
- 2
- 12
2
votes
1 answer
Storing ndarrays into Parquet via uber/petastorm?
Is it possible to store N-dimensional arrays in Parquet via uber/petastorm?

Leo Gallucci
- 16,355
- 12
- 77
- 110
1
vote
1 answer
Create train and valid dataset in petastorm
Versions: Python 3.7.13, Tensorflow-2.9.1, Petastorm-0.12.1
In petastorm, it seems the only way to train a model on a dataset created from petastorm is to fit the model within the Reader context manager, like below, as done in…

haneulkim
- 4,406
- 9
- 38
- 80
1
vote
0 answers
spark: exec: "executor": executable file not found in $PATH: unknown
I am trying to run some computations using petastorm v0.11.4 in a Docker container with minikube v1.25.2.
As long as I run the process locally, everything works as expected. As soon as I try to distribute the work across the minikube cluster, I receive the…

skynet1010
- 143
- 4
- 11
1
vote
0 answers
TensorFlow petastorm, training stuck
I have 2 very large (in TB) datasets (using petastorm to train a TF model).
What I am doing is loading the datasets using petastorm and then creating a single features and labels dataset, as I can't pass two separate datasets.
train_X_mlp =…

prajwal rao
- 87
- 1
- 9
1
vote
0 answers
Petastorm with Databricks Connect failing
Using Azure Databricks.
I have petastorm==0.11.2 and databricks-connect==9.1.0
My databricks-connect session seems to be working: I'm able to read data into my remote workspace. But when I use petastorm to create a Spark converter object, it says…

Jamalan
- 482
- 4
- 15
1
vote
1 answer
What is the best way to convert time series data (Parquet format) into sequences using petastorm?
Pardon me if I use the terms in the wrong sense. I am still grappling with many Spark and distributed-computing concepts.
Here is my use case; I am not able to get a complete picture of the implementation.
I have time-series data of 40 columns and 100…

Ashok Krishna
- 143
- 1
- 5
1
vote
1 answer
How to replace tf.train.batch, as it is deprecated
This is the code for training on MNIST data using Petastorm.
def train_and_test(dataset_url, training_iterations, batch_size, evaluation_interval):
    with make_reader(os.path.join(dataset_url, 'train'), num_epochs=None) as train_reader:
        with…

Asha
- 67
- 5
1
vote
0 answers
Trying to create parquet Petastorm dataset
I'm currently trying to create a parquet petastorm dataset to store a video dataset. My code is:
MotionSchema = Unischema('TeaserSchema', [
    UnischemaField(
        'video', np.uint8, (None, None, None, 3), NdarrayCodec(),…

Guilherme Marques
- 263
- 1
- 7
1
vote
0 answers
InvalidArgumentError when reading parquet files into Keras via Petastorm
I'm trying to read in data from parquet for a language model.
The parquet contains two columns:
target (int)
feature_vec (int array)
I'm adapting the code from this post (which works for me). When I try the code below, I get an InvalidArgumentError…

dspringate
- 1,805
- 2
- 13
- 20
1
vote
2 answers
Creating parquet Petastorm dataset through Spark fails with Overflow error (larger than 4GB)
I'm trying to implement Uber's Petastorm dataset creation, which uses Spark to create a Parquet file, following the tutorial on their GitHub page.
The code:
spark = SparkSession.builder.config('spark.driver.memory',…

bluesummers
- 11,365
- 8
- 72
- 108
0
votes
0 answers
How to integrate tf.data.dataset with rayTune for distributed training
Using tensorflow-cpu==2.9.3, petastorm==0.12.1 on Python 3.7
I've created tf.data.Dataset objects using petastorm for the training and validation datasets.
ds_train (DatasetV1Adapter; I think this is an old version of tf.data.Dataset)
ds_valid (DatasetV1Adapter)
First…

haneulkim
- 4,406
- 9
- 38
- 80