
I'm trying to import data in parquet format with a custom schema, but it returns: TypeError: option() missing 1 required positional argument: 'value'

    from pyspark.sql.types import (StructType, StructField, IntegerType,
                                   StringType, FloatType)

    ProductCustomSchema = StructType([
        StructField("id_sku", IntegerType(), True),
        StructField("flag_piece", StringType(), True),
        StructField("flag_weight", StringType(), True),
        StructField("ds_sku", StringType(), True),
        StructField("qty_pack", FloatType(), True)])

    def read_parquet_(path, schema):
        return spark.read.format("parquet")\
                         .option(schema)\
                         .option("timestampFormat", "yyyy/MM/dd HH:mm:ss")\
                         .load(path)

product_nomenclature = 'C:/Users/alexa/Downloads/product_nomenc'
product_nom = read_parquet_(product_nomenclature, ProductCustomSchema)
user9176398
    Not sure if it would work, but instead of `.option(schema)`, try `.schema(schema)`. – martinarroyo Sep 18 '18 at 12:55
  • return Py4JJavaError: An error occurred while calling o1259.showString. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 50.0 failed 1 times, most recent failure: Lost task 0.0 in stage 50.0 (TID 660, localhost, executor driver): org.apache.spark.sql.execution.QueryExecutionException: Parquet column cannot be converted in file file:///C:/Users/alexa/Downloads/product_nomenc/lu_product_nomenclature%252Fpart-r-00002-8c115acd-057f-43a5-b7dd-0e7d0ef1eb9e.gz.parquet. Column: [id_sku], Expected: IntegerType, Found: BINARY – user9176398 Sep 18 '18 at 13:05
    See valid signatures at https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameReader (and out of interest why not spark.read.parquet() and infer the schema?) – 9bO3av5fw5 Sep 18 '18 at 13:22
  • for me adding `("recursive",true)` as mentioned in this answer - https://stackoverflow.com/a/74188877/6490744 worked – Sowjanya R Bhat Mar 19 '23 at 18:38

2 Answers


As mentioned in the comments, you should change `.option(schema)` to `.schema(schema)`. `option()` requires both a key (the name of the option you're setting) and a value (the value you want to assign to that option). You are getting the TypeError because you passed only the `schema` variable to `option()` without naming which option you were actually trying to set.
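The TypeError itself is ordinary Python arity checking. A minimal stand-in (a hypothetical function with the same two-argument signature, not Spark's actual implementation) reproduces it:

```python
def option(key, value):
    """Hypothetical stand-in for DataFrameReader.option(key, value)."""
    return {key: value}

try:
    option("some_schema")  # one positional argument, like .option(schema)
except TypeError as e:
    print(e)  # option() missing 1 required positional argument: 'value'
```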

The QueryExecutionException you posted in the comments is raised because the schema defined in your `schema` variable does not match the data in the parquet file. If you specify a custom schema, it must match the data you are reading. In your example the column `id_sku` is physically stored as BINARY, but your schema declares it as `IntegerType`. pyspark will not try to reconcile differences between the schema you provide and the actual types in the data; it throws an exception instead.

To fix the error, make sure the schema you define correctly represents your data as it is stored in the parquet file (i.e. change the datatype of `id_sku` in your schema to `BinaryType`). The benefit of providing a correct schema is a slight performance gain: the file schema does not have to be inferred each time the parquet file is read.

vielkind

Use .option(schema=ProductCustomSchema)

  • As it’s currently written, your answer is unclear. Please [edit] to add additional details that will help others understand how this addresses the question asked. You can find more information on how to write good answers [in the help center](/help/how-to-answer). – Community Mar 02 '23 at 00:52