
I'm pretty new to Spark and I've been trying to convert a DataFrame to a Parquet file in Spark, but I haven't had success yet. The documentation says that I can use the write.parquet function to create the file. However, when I run the script it shows me: AttributeError: 'RDD' object has no attribute 'write'

from pyspark import SparkContext
sc = SparkContext("local", "Protob Conversion to Parquet ")

# spark is an existing SparkSession
df = sc.textFile("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.write.parquet("/output/proto.parquet")

Do you know how to make this work?

The Spark version that I'm using is Spark 2.0.1, built for Hadoop 2.7.3.


2 Answers


The error was due to the fact that the textFile method from SparkContext returns an RDD, and what I needed was a DataFrame.

SparkSession has a SQLContext under the hood, so I needed to use the DataFrameReader to read the CSV file correctly before converting it to a Parquet file.

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("Protob Conversion to Parquet") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()

# read csv
df = spark.read.csv("/temp/proto_temp.csv")

# Displays the content of the DataFrame to stdout
df.show()

df.write.parquet("output/proto.parquet")
  • Even if your code is correct, your explanation isn't. SparkContext doesn't convert the CSV file to an RDD. The `textFile` method from SparkContext returns an RDD and what you need is a `DataFrame` thus a SQLContext or a HiveContext which is also encapsulated in a SparkSession in **spark 2+** Would you care correcting that information and accept the answer to close the question? – eliasah Feb 08 '17 at 08:55
  • Thanks @eliasah for your feedback! – ultraInstinct Feb 08 '17 at 10:08
  • The answer is for dataframe. how can i write an rdd in parquet format? – mnis.p Jul 13 '18 at 15:05
  • df.write.parquet takes the file folder as an argument and not its absolute path. – Haha Nov 09 '20 at 13:36
  • @eliasah Does your comment mean that for **spark 2+** we need only the following two lines to convert csv to Parquet: `df = spark.read.parquet("/path/to/infile.csv") df.write.csv("/path/to/outfile.parquet"` Did I get it right? – nam Nov 13 '21 at 15:59
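
(Regarding the last comment: that two-liner has the reader and the writer swapped. With the same hypothetical paths, converting CSV to Parquet in Spark 2+ would look like this:)

# assuming the `spark` session created in the answer above
df = spark.read.csv("/path/to/infile.csv")
df.write.parquet("/path/to/outfile.parquet")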

You can also write out Parquet files from Spark with Koalas. This library is great for folks who prefer pandas syntax. Koalas is PySpark under the hood.

Here's the Koalas code:

import databricks.koalas as ks

df = ks.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')
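
As a side note, Koalas has since been merged into PySpark itself (Spark 3.2+) as the pandas API on Spark, so the same thing can be written without the extra dependency. A minimal sketch using the pyspark.pandas module, with the same paths as above:

# pandas API on Spark, bundled with PySpark since Spark 3.2
import pyspark.pandas as ps

df = ps.read_csv('/temp/proto_temp.csv')
df.to_parquet('output/proto.parquet')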
  • Hi @Powers, I tried installing it `sc.install_pypi_package("koalas") #Install latest koalas version` while I was working on AWS EMR. However, when I tried importing it it said `No module named 'koalas'` – Sowmya Jun 09 '20 at 12:39
  • 1
    @Sowmya - This link explains how to install pypi packages in an EMR environment: https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-jupyterhub-install-kernels-libs.html. Hope that helps! – Powers Jun 09 '20 at 12:42
  • Thanks. That's indeed nice of you to reply to my comment. I knew that link, just instead of doing a system installation, was thinking of more of a local or notebook-specific installation. Well, if the local one doesn't work out then would go for system installation. – Sowmya Jun 09 '20 at 12:59