I'm facing a very strange issue with pyspark on macOS Sierra. My goal is to parse dates in ddMMMyyyy format (e.g. 31Dec1989), but I get errors. I run Spark 2.0.1, Python 2.7.10 and Java 1.8.0_101. I also tried Anaconda 4.2.0 (which ships with Python 2.7.12), but I get errors there too.
The same code works without any error on Ubuntu Server 15.04 with the same Java version and Python 2.7.9.
The official documentation about spark.read.load() states:

dateFormat – sets the string that indicates a date format. Custom date formats follow the formats at java.text.SimpleDateFormat. This applies to date type. If None is set, it uses the default value, yyyy-MM-dd.
The official Java documentation describes MMM as the right pattern for parsing month names like Jan, Dec, etc., but using it in pyspark throws a lot of errors starting with java.lang.IllegalArgumentException.
The documentation states that LLL can be used too, but pyspark doesn't recognize it and throws pyspark.sql.utils.IllegalArgumentException: u'Illegal pattern component: LLL'.
I know of a workaround that avoids dateFormat, but dateFormat is the fastest way to parse the data and the simplest to code. What am I missing here?
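(For context, a workaround of this kind could be to read the column as a plain string and parse it with a Python UDF. The sketch below is hypothetical and reuses the spark session from the examples that follow; the column name and format string are placeholders.)

from datetime import datetime
from pyspark.sql.functions import udf
from pyspark.sql.types import DateType

# Parse "31Dec1989"-style strings in Python; %b matches abbreviated month
# names according to Python's locale, not the JVM's.
parse_date = udf(lambda s: datetime.strptime(s, "%d%b%Y").date() if s else None,
                 DateType())

df = spark.read.load("test.csv", format="csv", sep=",", header="true", mode="FAILFAST")
df = df.withColumn("col1", parse_date(df["col1"]))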
In order to run the following examples you simply have to place test.csv and test.py in the same directory, then run <spark-bin-directory>/spark-submit <working-directory>/test.py.
My test case using the ddMMMyyyy format
I have a plain-text file named test.csv containing the following two lines:
col1
31Dec1989
and the code is the following:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
struct = StructType([StructField("column", DateType())])
df = spark.read.load("test.csv",
                     schema=struct,
                     format="csv",
                     sep=",",
                     header="true",
                     dateFormat="ddMMMyyyy",
                     mode="FAILFAST")
df.show()
I get errors. I also tried moving the month name before or after the day and year (e.g. 1989Dec31 with yyyyMMMdd) without success.
A working example using the ddMMyyyy format
This example is identical to the previous one except for the date format. test.csv now contains:
col1
31121989
The following code prints the content of test.csv:
from pyspark.sql import SparkSession
from pyspark.sql.types import *
spark = SparkSession \
    .builder \
    .appName("My app") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
struct = StructType([StructField("column", DateType())])
df = spark.read.load("test.csv",
                     schema=struct,
                     format="csv",
                     sep=",",
                     header="true",
                     dateFormat="ddMMyyyy",
                     mode="FAILFAST")
df.show()
The output is the following (I omit the various verbose lines):
+----------+
| column|
+----------+
|1989-12-31|
+----------+
UPDATE1
I made a simple Java class that uses java.text.SimpleDateFormat:
import java.text.*;
import java.util.Date;
class testSimpleDateFormat
{
    public static void main(String[] args)
    {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMMdd");
        String dateString = "1989Dec31";
        try {
            Date parsed = format.parse(dateString);
            System.out.println(parsed.toString());
        }
        catch (ParseException pe) {
            System.out.println("ERROR: Cannot parse \"" + dateString + "\"");
        }
    }
}
This code doesn't work in my environment and throws this error:
java.text.ParseException: Unparseable date: "1989Dec31"
but it works perfectly on another system (Ubuntu 15.04). This seems to be a Java issue, but I don't know how to solve it. I installed the latest available version of Java and all of my software is up to date.
Any ideas?
UPDATE2
I've found a way to make it work in pure Java by specifying Locale.US:
import java.text.*;
import java.util.Date;
import java.util.*;
class HelloWorldApp
{
    public static void main(String[] args)
    {
        SimpleDateFormat format = new SimpleDateFormat("yyyyMMMdd", Locale.US);
        String dateString = "1989Dec31";
        try {
            Date parsed = format.parse(dateString);
            System.out.println(parsed.toString());
        }
        catch (ParseException pe) {
            System.out.println(pe);
            System.out.println("ERROR: Cannot parse \"" + dateString + "\"");
        }
    }
}
Now the question becomes: how do I specify Java's Locale in pyspark?
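My current idea, which I haven't tested and which relies on pyspark internals, is to set the default locale of the JVM through the py4j gateway before reading the file. spark.sparkContext._jvm is an internal attribute, so treat this as an assumption rather than a documented API:

# Untested sketch: force the default locale of the driver JVM via py4j before
# the CSV is read. In local mode the parsing happens in this same JVM; on a
# real cluster the executor JVMs would probably need the locale set as well.
jvm = spark.sparkContext._jvm
print(jvm.java.util.Locale.getDefault().toString())  # locale currently in use
jvm.java.util.Locale.setDefault(jvm.java.util.Locale("en", "US"))

Alternatively, I suppose the standard JVM system properties could be passed at submission time, e.g. --driver-java-options "-Duser.language=en -Duser.country=US" plus the spark.executor.extraJavaOptions setting, but I haven't verified that either.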