
I am working with PySpark and want to insert an array of strings into a database that has a JDBC driver, but I am getting the following error:

IllegalArgumentException: Can't get JDBC type for array<string>

This error happens when the column has an ArrayType(StringType()) type produced by a UDF. When I try to override the column type:

.option("createTableColumnTypes", "col1 ARRAY, col2 ARRAY, col3 ARRAY, col4 ARRAY")

I get:

DataType array is not supported.(line 1, pos 18)

This makes me wonder: is the problem that Spark 3.1.2 has no JDBC mapping for arrays, so I have to convert them into strings, or is it coming from the driver that I am using?
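If converting to a string is the way to go, I assume something like this could work (a plain-Python sketch of the serialization; the commented-out UDF wiring is an untested assumption on my side):

```python
import json

# Workaround sketch: serialize the array column to a JSON string before the
# JDBC write, since plain strings do map to a JDBC type.
def array_to_json(arr):
    """Serialize a list of strings to a JSON string; None stays None."""
    return json.dumps(arr) if arr is not None else None

# In PySpark this could be wrapped as a UDF (untested sketch):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# df = df.withColumn("col1", udf(array_to_json, StringType())("col1"))
```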

For reference, I am using CrateDB as the database. Here is its driver: crate.io/docs/jdbc/en/latest

Cande

2 Answers


Switching from crate-jdbc to the PostgreSQL JDBC driver with CrateDB will probably solve your issue.

Sample PySpark program tested with CrateDB 4.6.1 and the PostgreSQL JDBC driver 42.2.23:

from pyspark.sql import Row

# Build a small DataFrame with an array column
df = spark.createDataFrame([
    Row(a = [1, 2]),
    Row(a = [3, 4])
])
df.show()

df.write \
  .format("jdbc") \
  .option("url", "jdbc:postgresql://<url-to-server>:5432/?sslmode=require") \
  .option("driver", "org.postgresql.Driver") \
  .option("dbtable", "<tableName>") \
  .option("user", "<username>") \
  .option("password", "<password>") \
  .save()
jayeff
  • That worked. Now I am getting some other errors: While inserting the arrays I get only the first 1,000 rows from 1,000,000 and it throws: `Py4JJavaError: An error occurred while calling o536.save. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3) (844a50665e58 executor driver): org.postgresql.util.PSQLException: ERROR: line 1:1: mismatched input 'ROLLBACK' expecting {'SELECT', 'DEALLOCATE', 'CREATE', 'ALTER', 'KILL', 'BEGIN'...` – Cande Aug 03 '21 at 17:48
  • And when I try to insert float numbers defined in spark udf: `Py4JJavaError: An error occurred while calling o1014.save. : org.postgresql.util.PSQLException: ERROR: Cannot find data type: float4` – Cande Aug 03 '21 at 17:52
  • For some reason the first problem was resolved after defining a batchsize option that matches the dataframe's row count. Now all the rows are inserted – Cande Aug 03 '21 at 18:22
  • The Spark `PostgresDialect` maps float and double to `FLOAT4` and `FLOAT8` respectively ([src](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/jdbc/PostgresDialect.scala#L88-L89)) which isn't supported by CrateDB right now. Don't know if there is a workaround for this unfortunately – jayeff Aug 04 '21 at 11:06
  • Just tried the 'createTableColumnTypes' option and it worked by defining it as float in it. Thanks for your help! – Cande Aug 04 '21 at 17:25
  • Cool that it works. FYI: next version of CrateDB will include FLOAT4 / FLOAT8 and should work out of the box. [PR](https://github.com/crate/crate/pull/11634) for this was just merged – jayeff Aug 05 '21 at 13:14
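Putting together the fixes from the comment thread above (a batchsize matching the row count, and overriding the unsupported FLOAT4 mapping via createTableColumnTypes), the write options might look like the following sketch; the URL, table, and column names are placeholders, not values from the thread:

```python
# Sketch of the combined JDBC write options discussed in the comments.
# URL, table, and column names are placeholders.
jdbc_options = {
    "url": "jdbc:postgresql://<url-to-server>:5432/?sslmode=require",
    "driver": "org.postgresql.Driver",
    "dbtable": "<tableName>",
    "batchsize": "1000000",                  # match the number of rows inserted
    "createTableColumnTypes": "col1 FLOAT",  # avoid Spark's FLOAT4 mapping
}

# df.write.format("jdbc").options(**jdbc_options).save()
```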

Could you try adding the element datatype for the array, i.e. ARRAY(TEXT)?

.option("createTableColumnTypes", "col1 ARRAY(TEXT), col2 ARRAY(TEXT), col3 ARRAY(TEXT), col4 ARRAY(TEXT)")

SELECT ['Hello']::ARRAY;
--> SQLParseException[line 1:25: no viable alternative at input 'SELECT ['Hello']::ARRAY limit']
SELECT ['Hello']::ARRAY(TEXT);
--> SELECT OK, 1 record returned (0.002 seconds)
proddata
  • With that I get: `ParseException: mismatched input 'TEXT' expecting INTEGER_VALUE(line 1, pos 24)` – Cande Jul 30 '21 at 18:38