I'm looking to read some compressed CSV files into DataFrames so that I can eventually query them with Spark SQL. Normally I would just use sc.textFile() to read the file and apply various map() transformations to parse and transform the data, but the files in question have some values that are difficult to parse. In particular, some values are quote-encapsulated and contain embedded commas, which rules out splitting each line with split() inside a map() transformation.
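For example, given a hypothetical row with an embedded comma, a plain split(',') produces the wrong fields, while a real CSV parser (here Python's built-in csv module, just to illustrate the problem) handles it:

import csv

line = '1,"Smith, John",NY'
print(line.split(','))           # ['1', '"Smith', ' John"', 'NY'] -- wrong, 4 fields
print(next(csv.reader([line])))  # ['1', 'Smith, John', 'NY'] -- 3 fields, as intended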
Here's what I'm doing:
I launch Spark with the spark-csv and commons-csv jars:
PYSPARK_PYTHON=python2.7 sudo pyspark --jars "spark-csv_2.10-1.0.0.jar,commons-csv-1.1.jar"
I create a schema variable, since my CSV doesn't have a header. It's built with StructType, roughly like this (the column names here are hypothetical stand-ins for my real ones):
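from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical column names -- the real schema matches the actual file
customSchema = StructType([
    StructField("col_a", StringType(), True),
    StructField("col_b", StringType(), True),
    StructField("col_c", StringType(), True)])

I then make the call below: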
sqlc = SQLContext(sc)
apps_df = sqlc.read.format("com.databricks.spark.csv").options(header="false",codec="org.apache.hadoop.io.compress.GzipCodec").load("s3://path_to_file.csv.gz", schema = customSchema)
This does return a DataFrame object with the correct schema according to apps_df.printSchema(), but apps_df.count() returns 0 and apps_df.first() returns nothing.
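As a sanity check (a sketch against the same path), Spark's plain-text reader can be pointed at the gzip directly; Hadoop infers the decompression codec from the .gz extension on its own, so this should print raw CSV lines if decompression itself works:

raw = sc.textFile("s3://path_to_file.csv.gz")
print(raw.take(5))  # should show plain, decompressed csv lines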
Edit:
Here's my (hopefully) reproducible example:
Replace full_filepath with a .csv file path in your directory
Replace full_gzip_filepath with the path of the .gz version of that file (note that gzip replaces the original with full_filepath.gz)
from pyspark.sql import SQLContext
from pyspark.sql.types import *
import pandas as pd
import numpy as np
from subprocess import check_call

sqlc = SQLContext(sc)

# Build a small test frame: three identical columns of 0-9
columns = ['A', 'B', 'C']
data = np.array([np.arange(10)] * 3).T
df = pd.DataFrame(data, columns=columns)

# to_csv also writes the index as an unnamed first column;
# gzip then replaces the file with its .gz version
df.to_csv('full_filepath')
check_call(['gzip', 'full_filepath'])

test_scsv_df = sqlc.read.format("com.databricks.spark.csv").options(header="true", inferSchema="true", codec="org.apache.hadoop.io.compress.GzipCodec").load("full_gzip_filepath")
test_scsv_df.show()
This returns:
+---+---+---+---+
| | A| B| C|
+---+---+---+---+
+---+---+---+---+
Note that the header was evidently parsed (the column names come through, including the blank one for the index column), but no data rows did. If you then run the next few commands, you will see that the same file is perfectly readable via pandas:
test_pd = pd.read_csv('full_gzip_filepath', sep=',', compression='gzip', quotechar='"', header=0)
test_pd_df = sqlc.createDataFrame(test_pd)
test_pd_df.show()
This returns:
+----------+---+---+---+
|Unnamed: 0| A| B| C|
+----------+---+---+---+
| 0| 0| 0| 0|
| 1| 1| 1| 1|
| 2| 2| 2| 2|
| 3| 3| 3| 3|
| 4| 4| 4| 4|
| 5| 5| 5| 5|
| 6| 6| 6| 6|
| 7| 7| 7| 7|
| 8| 8| 8| 8|
| 9| 9| 9| 9|
+----------+---+---+---+