
I am using PySpark DataFrames. I want to read a Parquet file and write it back with a different schema from the original file.

The original schema is (it has 9,000 variables; I am only showing the first 5 for the example):

[('id', 'string'),
 ('date', 'string'),
 ('option', 'string'),
 ('cel1', 'string'),
 ('cel2', 'string')]

And I want to write:

[('id', 'integer'),
 ('date', 'integer'),
 ('option', 'integer'),
 ('cel1', 'integer'),
 ('cel2', 'integer')]

My code is:

df = sqlContext.read.parquet("PATH")

### SOME OPERATIONS ###

from pyspark.sql.types import StructType, StructField, IntegerType

write_schema = StructType([StructField('id', IntegerType(), True),
                           StructField('date', IntegerType(), True),
                           StructField('option', IntegerType(), True),
                           StructField('cel1', IntegerType(), True),
                           StructField('cel2', IntegerType(), True)])


df.write.option("schema", write_schema).parquet("PATH")

After I run it I still have the same schema as the original data: everything is string; the schema did not change.


I also tried using

df = sqlContext.read.option("schema", write_schema).parquet("PATH")

This option does not change the schema when I read it; it still shows the original one, so I used (as suggested here):

df = sqlContext.read.schema(write_schema).parquet("PATH")

This one works for the reading part; if I check the types I get:

df.dtypes

#>>[('id', 'int'),
#   ('date', 'int'),
#   ('option', 'int'),
#   ('cel1', 'int'),
#   ('cel2', 'int')]

But when I try to write the parquet file I get an error:

Parquet column cannot be converted. Column: [id], Expected: IntegerType, Found: BINARY

Regards

Joe

2 Answers


Cast your columns to int and then try writing to another parquet file. No schema specification needed.

df = spark.read.parquet("filepath")
# cast every column to int; cast() preserves the column name
df2 = df.select(*map(lambda col: df[col].cast('int'), df.columns))
# write to a different path: overwriting the file still being read would fail
df2.write.parquet("filepath2")
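
If only some columns should change type, the casts can also be driven by the target schema rather than casting everything to int. A minimal sketch, reusing the write_schema StructType from the question (placeholder paths; note it keeps only the columns named in the schema):

df = spark.read.parquet("filepath")
# StructField exposes .name and .dataType, and Column.cast() accepts a
# DataType, so the target schema itself can drive the casts; columns not
# listed in write_schema are dropped by the select
df2 = df.select([df[f.name].cast(f.dataType) for f in write_schema.fields])
df2.write.parquet("filepath2")
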
mck
  • Hi, thanks, with 10MM rows and 9,000 columns, is this optimal? Does it parallelize? – Joe Dec 03 '20 at 20:12
  • It certainly does – mck Dec 03 '20 at 20:12
  • Thanks, I will try it. Just curious now: do you know why write.parquet does not change the schema? – Joe Dec 03 '20 at 20:40
  • @Joe that's because write.parquet does not accept schema as an argument - see https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.DataFrameWriter.parquet – mck Dec 04 '20 at 07:30

For this you can actually enforce the schema right when reading the data. You can modify the code as follows:

df = sqlContext.read.option("schema",write_schema).parquet("PATH")
df.write.parquet("NEW_PATH")
Anand Vidvat
  • Hi, thanks, I tried it but it did not work, as you can see I updated the post – Joe Dec 03 '20 at 19:10
  • This isn't correct: a parquet file contains the schema definition, so you can't just read columns a, b, c of type str from a file whose schema is columns x, y, z of type int. – Kashyap Jun 14 '22 at 17:10
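
Since a Parquet file carries its own schema, a plain read followed by printSchema() shows the types actually stored, regardless of any schema the caller tries to pass at read time. A minimal sketch against the question's file (output abridged):

spark.read.parquet("PATH").printSchema()
#>> root
#>>  |-- id: string (nullable = true)
#>>  |-- date: string (nullable = true)
#>>  ...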