
I'm building a recommendation engine based on Apache Spark. I can load data from PostgreSQL, but when I try to map that data I get a ValueError.

This part works successfully:

df = sql_context.read.format('jdbc').options(
    url=db_url,
    dbtable=db_table, driver="org.postgresql.Driver"
).load()

This line prints the schema to the console:

df.printSchema()

It outputs "ınteger" (with a dotless ı) instead of "integer". I think that's the issue.

Here is the console output of the schema:

root
 |-- id: ınteger (nullable = false)
 |-- user_id: ınteger (nullable = false)
 |-- star: ınteger (nullable = false)
 |-- product_id: ınteger (nullable = false)

I'm trying to get specific columns, but it raises a ValueError:

validation_for_predict_rdd = validation_rdd.map(
    lambda x: (x.user_id, x.product_id)
)

Error output:

raise ValueError("Could not parse datatype: %s" % json_value)
ValueError: Could not parse datatype: ınteger

I tried to define a custom schema to work around this, but the JDBC source doesn't allow a user-specified schema:

custom_schema = StructType([
    StructField("id", LongType(), False),
    StructField("user_id", LongType(), False),
    StructField("star", LongType(), False),
    StructField("product_id", LongType(), False)])

df = sql_context.read.format('jdbc').options(
    url=db_url,
    dbtable=db_table, driver="org.postgresql.Driver"
).load(schema=custom_schema)

Error output:

raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'jdbc does not allow user-specified schemas.;'

What is the solution to the "ınteger" ValueError? I could change the database field types, but that wouldn't be a proper solution.

  • Can you connect to the database using `pgsql` and do `DESCRIBE TABLE`? I suspect it may be something with the encoding of the database or table. – Jacek Laskowski Apr 17 '17 at 16:50
  • CREATE DATABASE buyexpress_service_layer WITH OWNER = postgres ENCODING = 'UTF8' TABLESPACE = pg_default LC_COLLATE = 'Turkish_Turkey.1254' LC_CTYPE = 'Turkish_Turkey.1254' CONNECTION LIMIT = -1; – Burak Özdemir Apr 17 '17 at 16:52
  • Ah! `LC_` are messing with your Spark application (!) Can you change it using `LC_ALL=en_US.UTF-8` and `LANG=en_US.UTF-8` before starting your `pyspark` or `spark-submit`? – Jacek Laskowski Apr 17 '17 at 16:55
  • I really hoped that it'd work, but it failed. – Burak Özdemir Apr 17 '17 at 17:35

2 Answers


The source of the problem is the JVM locale. In local mode, with the Python interpreter locale set to ('en_US', 'UTF-8'), you can reproduce the general issue as follows:

  • Create a session:

    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType
    
    sc = SparkContext(master="local[*]")
    spark = SparkSession(sc)
    
  • Set locale to tr_TR:

    locale = sc._jvm.java.util.Locale
    locale.setDefault(locale.forLanguageTag("tr-TR"))
    
  • Try to create a DataFrame:

    spark.createDataFrame([1, 2, 3], IntegerType())
    
    Py4JJavaError: An error occurred while calling o25.applySchemaToPythonRDD.
    : java.util.NoSuchElementException: key not found: integer
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
    

    Furthermore:

    spark.read.json(
        sc.parallelize(["""{"x": 1}"""])
    ).selectExpr("CAST(x AS integer)")
    
    ValueError                                Traceback (most recent call last)
    ...
    ValueError: Could not parse datatype: ınteger
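
    Setting the default locale back to en-US makes the same calls succeed again, which confirms the diagnosis (continuing the session above):

    locale.setDefault(locale.forLanguageTag("en-US"))
    spark.createDataFrame([1, 2, 3], IntegerType()).show()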
    

The simplest fix is to use spark.driver.extraJavaOptions to set the JVM locale to en_US:

spark.driver.extraJavaOptions -Duser.country=US -Duser.language=en

in SPARK_HOME/conf/spark-defaults.conf. It is probably a good idea to use the same setting for spark.executor.extraJavaOptions as well.
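
If you prefer not to edit spark-defaults.conf, the same properties can be passed when submitting the application (a sketch; your_app.py is a placeholder for your script):

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Duser.country=US -Duser.language=en" \
  --conf "spark.executor.extraJavaOptions=-Duser.country=US -Duser.language=en" \
  your_app.py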

You could also try the java.util.Locale.setDefault trick shown above, before executing any other code:

locale.setDefault(locale.forLanguageTag("en-US"))

but I don't think it is a reliable solution.
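
For completeness, here is that trick applied to the JDBC read from the question, as a self-contained sketch (db_url and db_table are placeholders, and the setDefault call must run before any schema is parsed):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master="local[*]")
sql_context = SQLContext(sc)

# Force an en-US default locale in the JVM before touching any data types.
jvm_locale = sc._jvm.java.util.Locale
jvm_locale.setDefault(jvm_locale.forLanguageTag("en-US"))

db_url = "jdbc:postgresql://localhost:5432/mydb"  # placeholder
db_table = "ratings"                              # placeholder

df = sql_context.read.format("jdbc").options(
    url=db_url,
    dbtable=db_table,
    driver="org.postgresql.Driver"
).load()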

You can also update to the recent master where this issue has been partially resolved:

  • SPARK-20156 - Java String toLowerCase "Turkish locale bug" causes Spark problems

More about the source of the problem: The infamous Turkish locale bug.

zero323

I changed the system language, but it didn't work. So I backed up my database and created a new one with a different encoding and collation, following Jacek's suggestion in the comments. The new CREATE statement looks like this:

CREATE DATABASE buyexpress_service_layer
  WITH OWNER = postgres
       TEMPLATE = template0
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'English_United States.1252'
       LC_CTYPE = 'English_United States.1252'
       CONNECTION LIMIT = -1;
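
To confirm the new database picked up the intended encoding and collation, you can query the pg_database catalog (a quick sanity check; the database name matches the statement above):

SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = 'buyexpress_service_layer';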

It's working now. Thank you for all the answers.

Burak Özdemir