Column separator mismatch when reading Parquet dataset into H2OFrame after conversion from Delta to Parquet

Question

I am attempting to read a multi-file Parquet dataset into an H2OFrame, but the import fails with a column separator mismatch error:

H2OResponseError: Server error water.exceptions.H2OIllegalArgumentException:
  Error: Column separator mismatch. One file seems to use "" and the other uses " ".

The dataset is initially converted from Delta to Parquet since H2O doesn't support Delta tables as data sources:

import h2o  # `spark` is the SparkSession provided by the Databricks notebook

# convert from Delta to Parquet
delta_uri = 's3://my_bucket/path/to/delta/folder/'
df = spark.read.format('delta').load(delta_uri)
parquet_uri = 's3://my_bucket/path/to/parquet/folder/'
df.write.parquet(parquet_uri)

# extract Parquet into H2OFrame (this line is where the error happens)
data = h2o.import_file(path=parquet_uri)

Is there a way to enforce a single column separator across all Parquet files when making the conversion from Delta to Parquet?

The H2O cluster is running H2O version 3.34.0.3. The code above is run within a Databricks notebook.

score 0 · Answer 1 · answered Oct 25 '21 at 14:09


Make sure the folder contains only Parquet files (no spline files, etc.).
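One way to act on this is to restrict the import to files whose names end in `.parquet`. The sketch below is hedged, not a verified fix: it assumes that Spark left non-Parquet marker files (for example `_SUCCESS`) next to the part files, and that H2O tried to parse them as delimited text. `h2o.import_file` accepts a `pattern` regular expression for selecting files in a directory. Whether that filtering behaves the same for S3 paths on 3.34.0.3 has not been checked here. The helper names and the sample listing are hypothetical.

```python
import re

# Assumption: Spark names its Parquet part files with a ".parquet" suffix.
PARQUET_PATTERN = r".*\.parquet$"


def non_parquet_files(names):
    """Return the names that do not look like Parquet part files."""
    return [n for n in names if not re.fullmatch(PARQUET_PATTERN, n)]


def import_parquet_only(parquet_uri):
    """Import only the files under parquet_uri whose names match PARQUET_PATTERN."""
    import h2o  # imported lazily so the helper above works without H2O installed
    return h2o.import_file(path=parquet_uri, pattern=PARQUET_PATTERN)


# Hypothetical listing of a Spark output folder, for illustration only.
listing = ["_SUCCESS", "_committed_123", "part-00000-abc-c000.snappy.parquet"]
print(non_parquet_files(listing))
```

If `non_parquet_files` reports anything for the real folder listing, either delete those files or call `import_parquet_only(parquet_uri)` in place of the plain `h2o.import_file(path=parquet_uri)`.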


donSjon
