
I'm looking to read some compressed CSV files into DataFrames so that I can eventually query them with SparkSQL. I would normally just use sc.textFile() to consume the file and apply various map() transformations to parse and transform the data, but the files in question have some values that are difficult to parse. In particular, there are quote-encapsulated values containing commas, which rules out splitting on commas with split() inside a map() transformation.
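To illustrate (with a made-up row, not one from my actual data), a plain split tears the quoted field apart:

line = '123,"Smith, John",456'
print(line.split(','))
# ['123', '"Smith', ' John"', '456']  -- the quoted value is broken across elements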

Here's what I'm doing:

I launch Spark with the spark-csv and commons-csv jars:

PYSPARK_PYTHON=python2.7 sudo pyspark --jars "spark-csv_2.10-1.0.0.jar,commons-csv-1.1.jar"
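(An equivalent launch, assuming the machine can reach Maven Central and that these coordinates match the jar versions above, would use --packages instead of --jars:)

PYSPARK_PYTHON=python2.7 sudo pyspark --packages com.databricks:spark-csv_2.10:1.0.0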

I create a schema variable since my CSV doesn't have a header, and then make the load call shown below.
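The exact schema isn't important here; a hypothetical customSchema (column names and types made up for illustration) looks something like this:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# hypothetical columns -- my real schema just lists the actual fields the same way
customSchema = StructType([
    StructField("app_id", IntegerType(), True),
    StructField("app_name", StringType(), True),
    StructField("category", StringType(), True)
])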

from pyspark.sql import SQLContext

sqlc = SQLContext(sc)
apps_df = sqlc.read.format("com.databricks.spark.csv") \
    .options(header="false", codec="org.apache.hadoop.io.compress.GzipCodec") \
    .load("s3://path_to_file.csv.gz", schema=customSchema)

This does return a DataFrame with the correct schema according to apps_df.printSchema(), but apps_df.count() returns 0 and apps_df.first() returns nothing.
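Concretely, these are the calls I'm making (results as described above):

apps_df.printSchema()   # prints the expected custom schema
apps_df.count()         # returns 0
apps_df.first()         # returns nothing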

Edit:

Here's my, hopefully, reproducible example.

Replace full_filepath with a path for a .csv file in your directory.

Replace full_gzip_filepath with the path of the .gz version of that file (the gzip call below produces full_filepath.gz).

from pyspark.sql import SQLContext
from pyspark.sql.types import *
import pandas as pd
import numpy as np
from subprocess import check_call

sqlc = SQLContext(sc)

# Build a small DataFrame in pandas and write it out as a CSV
# (note that to_csv also writes the index as an unnamed first column)
columns = ['A', 'B', 'C']
data = np.array([np.arange(10)]*3).T

df = pd.DataFrame(data, columns=columns)

df.to_csv('full_filepath')

# gzip the file in place, producing full_filepath.gz
check_call(['gzip', 'full_filepath'])

# Read the gzipped CSV back through spark-csv
test_scsv_df = sqlc.read.format("com.databricks.spark.csv") \
    .options(header="true", inferSchema="true", codec="org.apache.hadoop.io.compress.GzipCodec") \
    .load("full_gzip_filepath")

test_scsv_df.show()

This returns:

+---+---+---+---+
|   |  A|  B|  C|
+---+---+---+---+
+---+---+---+---+

If you also run the next few commands, you will see that the same file is perfectly readable via pandas:

test_pd = pd.read_csv('full_gzip_filepath', sep=',', compression='gzip', quotechar='"', header=0)

test_pd_df = sqlc.createDataFrame(test_pd)

test_pd_df.show()

This returns:

+----------+---+---+---+
|Unnamed: 0|  A|  B|  C|
+----------+---+---+---+
|         0|  0|  0|  0|
|         1|  1|  1|  1|
|         2|  2|  2|  2|
|         3|  3|  3|  3|
|         4|  4|  4|  4|
|         5|  5|  5|  5|
|         6|  6|  6|  6|
|         7|  7|  7|  7|
|         8|  8|  8|  8|
|         9|  9|  9|  9|
+----------+---+---+---+
justafisch
    Do you experience the same problem when you read from a local file system / HDFS? – zero323 Nov 25 '15 at 01:26
  • I brought the same file down to my local machine and ran the above locally, pointing to the local version of the file, with the same results. – justafisch Nov 25 '15 at 17:27
  • I cannot reproduce the problem. Could you provide more details like Spark version? – zero323 Nov 25 '15 at 17:29
  • Absolutely, our cluster is running 1.5 and I have 1.5.2 running locally. – justafisch Nov 25 '15 at 17:44
  • Works for me... I guess [a Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve) (write file, gzip, read from the same session) could be useful here. – zero323 Nov 26 '15 at 23:51
  • I have added an example that displays the issue I am experiencing. – justafisch Dec 01 '15 at 21:36

0 Answers