
I'm building a recommendation engine based on Apache Spark. I can load data from PostgreSQL, but when I try to map that data I get a ValueError.

This part works successfully:

df = sql_context.read.format('jdbc').options(
    url=db_url,
    dbtable=db_table, driver="org.postgresql.Driver"
).load()

This line prints the schema to the console:

df.printSchema()

It outputs "ınteger" (with a dotless ı) instead of "integer". I think that's the issue.

Here is the console output of the schema:

root
 |-- id: ınteger (nullable = false)
 |-- user_id: ınteger (nullable = false)
 |-- star: ınteger (nullable = false)
 |-- product_id: ınteger (nullable = false)

I'm trying to get specific columns, but it raises a ValueError:

validation_for_predict_rdd = validation_rdd.map(
    lambda x: (x.user_id, x.product_id)
)

Error output:

raise ValueError("Could not parse datatype: %s" % json_value)
ValueError: Could not parse datatype: ınteger

I tried to define a custom schema to work around this, but the JDBC source doesn't allow a user-specified schema:

custom_schema = StructType([
    StructField("id", LongType(), False),
    StructField("user_id", LongType(), False),
    StructField("star", LongType(), False),
    StructField("product_id", LongType(), False)])

df = sql_context.read.format('jdbc').options(
    url=db_url,
    dbtable=db_table, driver="org.postgresql.Driver"
).load(schema=custom_schema)

Error output:

raise AnalysisException(s.split(': ', 1)[1], stackTrace)
pyspark.sql.utils.AnalysisException: 'jdbc does not allow user-specified schemas.;'

What is the solution to the "ınteger" ValueError? I could change the database field types, but that wouldn't be a proper solution.

  • Can you connect to the database using `pgsql` and do `DESCRIBE TABLE`? I suspect it may be something with the encoding of the database or table. – Jacek Laskowski Apr 17 '17 at 16:50
  • CREATE DATABASE buyexpress_service_layer WITH OWNER = postgres ENCODING = 'UTF8' TABLESPACE = pg_default LC_COLLATE = 'Turkish_Turkey.1254' LC_CTYPE = 'Turkish_Turkey.1254' CONNECTION LIMIT = -1; – Burak Özdemir Apr 17 '17 at 16:52
  • Ah! `LC_` are messing with your Spark application (!) Can you change it using `LC_ALL=en_US.UTF-8` and `LANG=en_US.UTF-8` before starting your `pyspark` or `spark-submit`? – Jacek Laskowski Apr 17 '17 at 16:55
  • I really hoped that it'd work, but it failed. – Burak Özdemir Apr 17 '17 at 17:35

2 Answers


The source of the problem is the JVM locale. In local mode, with the Python interpreter locale set to ('en_US', 'UTF-8'), you can reproduce the general issue as follows:

  • Create a session:

    from pyspark import SparkContext
    from pyspark.sql import SparkSession
    from pyspark.sql.types import IntegerType
    
    sc = SparkContext(master="local[*]")
    spark = SparkSession(sc)
    
  • Set locale to tr_TR:

    locale = sc._jvm.java.util.Locale
    locale.setDefault(locale.forLanguageTag("tr-TR"))
    
  • Try to create a DataFrame:

    spark.createDataFrame([1, 2, 3], IntegerType())
    
    Py4JJavaError: An error occurred while calling o25.applySchemaToPythonRDD.
    : java.util.NoSuchElementException: key not found: integer
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
    

    Furthermore:

    spark.read.json(
        sc.parallelize(["""{"x": 1}"""])
    ).selectExpr("CAST(x AS integer)")
    
    ValueError                                Traceback (most recent call last)
    ...
    ValueError: Could not parse datatype: ınteger
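
    Setting the default locale back to en-US makes the same calls succeed again, which confirms the diagnosis (continuing the session above):

    locale.setDefault(locale.forLanguageTag("en-US"))
    spark.createDataFrame([1, 2, 3], IntegerType()).show()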
    

The simplest fix is to use spark.driver.extraJavaOptions to set the JVM locale to en_US:

spark.driver.extraJavaOptions -Duser.country=US -Duser.language=en

in SPARK_HOME/conf/spark-defaults.conf. It is probably a good idea to use the same setting for spark.executor.extraJavaOptions as well.
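
If you prefer not to edit spark-defaults.conf, the same properties can be passed when submitting the application (a sketch; your_app.py is a placeholder for your script):

spark-submit \
  --conf "spark.driver.extraJavaOptions=-Duser.country=US -Duser.language=en" \
  --conf "spark.executor.extraJavaOptions=-Duser.country=US -Duser.language=en" \
  your_app.py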

You could also try the java.util.Locale.setDefault trick shown above, before executing any other code:

locale.setDefault(locale.forLanguageTag("en-US"))

but I don't think it is a reliable solution.
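
For completeness, here is that trick applied to the JDBC read from the question, as a self-contained sketch (db_url and db_table are placeholders, and the setDefault call must run before any schema is parsed):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(master="local[*]")
sql_context = SQLContext(sc)

# Force an en-US default locale in the JVM before touching any data types.
jvm_locale = sc._jvm.java.util.Locale
jvm_locale.setDefault(jvm_locale.forLanguageTag("en-US"))

db_url = "jdbc:postgresql://localhost:5432/mydb"  # placeholder
db_table = "ratings"                              # placeholder

df = sql_context.read.format("jdbc").options(
    url=db_url,
    dbtable=db_table,
    driver="org.postgresql.Driver"
).load()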

You can also update to the recent master where this issue has been partially resolved:

  • SPARK-20156 - Java String toLowerCase "Turkish locale bug" causes Spark problems

More about the source of the problem: The infamous Turkish locale bug.

zero323

I changed the system language, but it didn't work. So I backed up my database and created a new one with a different encoding and collation, following Jacek's suggestion in the comments. The new CREATE statement looks like this:

CREATE DATABASE buyexpress_service_layer
  WITH OWNER = postgres
       TEMPLATE = template0
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'English_United States.1252'
       LC_CTYPE = 'English_United States.1252'
       CONNECTION LIMIT = -1;
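
To confirm the new database picked up the intended encoding and collation, you can query the pg_database catalog (a quick sanity check; the database name matches the statement above):

SELECT datname,
       pg_encoding_to_char(encoding) AS encoding,
       datcollate,
       datctype
FROM pg_database
WHERE datname = 'buyexpress_service_layer';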

It's working now. Thank you for all the answers.

Burak Özdemir