I'm building a recommendation engine based on Apache Spark. I can load data from PostgreSQL, but when I try to map that data I get a ValueError.
Loading the DataFrame itself works:
df = sql_context.read.format('jdbc').options(
    url=db_url,
    dbtable=db_table,
    driver="org.postgresql.Driver"
).load()
This line prints the schema to the console.
df.printSchema()
Every type name in the output reads "ınteger" (with a dotless ı) instead of "integer". I think that's the issue.
Here is the console output of the schema:
root
|-- id: ınteger (nullable = false)
|-- user_id: ınteger (nullable = false)
|-- star: ınteger (nullable = false)
|-- product_id: ınteger (nullable = false)
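If I inspect the first character of that type name, it isn't an ASCII "i" at all but U+0131 (LATIN SMALL LETTER DOTLESS I), which is what lowercasing "I" produces under a Turkish locale:

# The type name coming back from the JVM starts with U+0131 (dotless ı),
# not U+0069 (ASCII 'i'), so it can never match the expected "integer".
s = "ınteger"
print(hex(ord(s[0])))   # 0x131
print(s == "integer")   # False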
I'm trying to get specific columns, but the map raises a ValueError:
validation_for_predict_rdd = validation_rdd.map(
    lambda x: (x.user_id, x.product_id)
)
Error output:
raise ValueError("Could not parse datatype: %s" % json_value)
ValueError: Could not parse datatype: ınteger
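One workaround I'm considering, as an untested sketch (it assumes validation_rdd comes from this DataFrame via .rdd, and that the failure is only the lowercased type name): cast the integer columns to bigint on the JVM side before converting to an RDD, since "long" contains no letter I to mangle:

from pyspark.sql.functions import col

# Untested sketch: cast away from IntegerType before touching the RDD, so
# the Python-side schema parser never sees the dotless-ı type name.
casted_df = df.select(
    col("user_id").cast("long"),
    col("product_id").cast("long"),
)
validation_for_predict_rdd = casted_df.rdd.map(
    lambda x: (x.user_id, x.product_id)
)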
I tried to define a custom schema to work around this, but the JDBC source doesn't allow a user-specified schema:
from pyspark.sql.types import StructType, StructField, LongType

custom_schema = StructType([
    StructField("id", LongType(), False),
    StructField("user_id", LongType(), False),
    StructField("star", LongType(), False),
    StructField("product_id", LongType(), False),
])

df = sql_context.read.format('jdbc').options(
    url=db_url,
    dbtable=db_table,
    driver="org.postgresql.Driver"
).load(schema=custom_schema)
Error output:
raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'jdbc does not allow user-specified schemas.;'
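One thing I haven't tried yet: if the Spark version is 2.3 or later, the JDBC source accepts a customSchema option, a DDL string that overrides column types (not names) without going through .load(schema=...):

# Untested sketch; assumes Spark 2.3+, where JDBC supports customSchema.
df = sql_context.read.format('jdbc').options(
    url=db_url,
    dbtable=db_table,
    driver="org.postgresql.Driver",
    customSchema="id BIGINT, user_id BIGINT, star BIGINT, product_id BIGINT",
).load()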
What is the proper fix for the "ınteger" ValueError? I could change the column types in the database, but that feels like a workaround rather than a real solution.
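My suspicion about the root cause: Spark builds the type name by lowercasing "Integer", and under a Turkish default locale the JVM lowercases "I" to "ı", which would produce exactly "ınteger". If that's right, forcing an English locale on the JVM should avoid the problem at the source. An untested sketch (in client mode the driver option likely has to be passed on the spark-submit command line instead, since the driver JVM is already running by the time a SparkConf set in Python is applied):

from pyspark import SparkConf, SparkContext

# Untested sketch: force an English locale on the driver and executor JVMs
# so "Integer".toLowerCase() yields "integer" rather than "ınteger".
conf = (
    SparkConf()
    .set("spark.driver.extraJavaOptions", "-Duser.language=en -Duser.country=US")
    .set("spark.executor.extraJavaOptions", "-Duser.language=en -Duser.country=US")
)
sc = SparkContext(conf=conf)

Is forcing the locale the right approach here, or is there a cleaner fix?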