
I follow the script below to convert a JSON file to Parquet format. I am using the pandas library to perform the conversion. However, the following error is occurring: AttributeError: 'DataFrame' object has no attribute 'schema'. I am still new to Python.

Here's the original JSON file I'm using: [ { "a": "01", "b": "teste01" }, { "a": "02", "b": "teste02" } ]

What am I doing wrong?

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_json('C:/python/json_teste')

pq = pa.parquet.write_table(df, 'C:/python/parquet_teste')

Error:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-23-1b4ced833098> in <module>
----> 1 pq = pa.parquet.write_table(df, 'C:/python/parquet_teste')

C:\Anaconda\lib\site-packages\pyarrow\parquet.py in write_table(table, where, row_group_size, version, use_dictionary, compression, write_statistics, use_deprecated_int96_timestamps, coerce_timestamps, allow_truncated_timestamps, data_page_size, flavor, filesystem, **kwargs)
   1256     try:
   1257         with ParquetWriter(
-> 1258                 where, table.schema,
   1259                 filesystem=filesystem,
   1260                 version=version,

C:\Anaconda\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5065             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5066                 return self[name]
-> 5067             return object.__getattribute__(self, name)
   5068 
   5069     def __setattr__(self, name, value):

AttributeError: 'DataFrame' object has no attribute 'schema'

Printing the DataFrame:

# print the DataFrame
print(df)
   a        b
0  1  teste01
1  2  teste02

# columns
df.columns
Index(['a', 'b'], dtype='object')

# dtypes
df.dtypes
a     int64
b    object
dtype: object
Mateus Silvestre

5 Answers


You can also read JSON files directly with pyarrow, as in the following example:

from pyarrow import json
import pyarrow.parquet as pq

table = json.read_json('C:/python/json_teste') 
pq.write_table(table, 'C:/python/result.parquet')  # save json/table as parquet

Reference: reading and writing with pyarrow.parquet
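
If you would rather keep the pandas route from the question, the underlying issue is that write_table expects a pyarrow Table, not a pandas DataFrame, so the DataFrame has to be converted first. A minimal sketch using the question's own paths:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.read_json('C:/python/json_teste')
table = pa.Table.from_pandas(df)  # convert the DataFrame to a pyarrow Table before writing
pq.write_table(table, 'C:/python/parquet_teste')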

Morgana

You can achieve what you are looking for with pyspark as follows:

from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName("JsonToParquetPysparkExample") \
    .getOrCreate()

json_df = spark.read.json("C://python/test.json", multiLine=True)
json_df.printSchema()
json_df.write.parquet("C://python/output.parquet")
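
To sanity-check the result, you can read the output back with the same session (a quick check; note that Spark writes output.parquet as a directory of part files):

parquet_df = spark.read.parquet("C://python/output.parquet")
parquet_df.show()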
Felix K Jose

If your goal is just to convert JSON to Parquet, you can use the pyspark API:

>>> data = [ { "a": "01", "b": "teste01" }, { "a": "02", "b": "teste02" } ]
>>> df = spark.createDataFrame(data)
>>> df.write.parquet("data.parquet")

This df is a Spark DataFrame, which can be saved directly to Parquet.
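
Note that spark here is a SparkSession; the pyspark shell creates one for you automatically, but in a plain script or a Jupyter notebook you have to build it yourself first. A minimal sketch (the app name is just an illustrative label):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json-to-parquet").getOrCreate()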

Hussain Bohra
  • I imported pyspark in the script, and when executing it the following error occurred: AttributeError: module 'pyspark' has no attribute 'createDataFrame'. – Mateus Silvestre Dec 02 '19 at 18:27
  • You need to start a pyspark shell to test this. createDataFrame is a method on spark, which comes by default in the pyspark shell. – Hussain Bohra Dec 02 '19 at 19:05
  • Thanks for the feedback. But I started a pyspark shell to test and got this error again. I am using a Jupyter notebook. Here is the code: import pyspark data = [ { "a": "01", "b": "teste01" }, { "a": "02", "b": "teste02" } ] df = spark.createDataFrame(data) df.write.parquet("data.parquet") – Mateus Silvestre Dec 02 '19 at 19:22
  • What version of pyspark are you using? I believe spark.createDataFrame is available from 2.3.0 onward; have a look at this doc https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#programmatically-specifying-the-schema – Hussain Bohra Dec 02 '19 at 20:12
  • @MateusSilvestre Take a look at this in order to instantiate `spark`: https://spark.apache.org/docs/2.3.0/sql-programming-guide.html#hive-tables – Zack Burt May 22 '20 at 19:10
  • @MateusSilvestre try this https://gist.github.com/zackster/183b2729abbe0553abfec82a566c0f92 – Zack Burt May 22 '20 at 20:00
  • @MateusSilvestre I have provided a complete example for your ask. please take a look and see if it works – Felix K Jose Jul 02 '21 at 19:18

Here's how to convert a JSON file to Apache Parquet format, using Pandas in Python. This is an easy method with a well-known library you may already be familiar with.

Firstly, make sure to install pandas and pyarrow. If you're using Python with Anaconda:

conda install pandas
conda install pyarrow

Then, here is the code:

import pandas as pd
data = pd.read_json(FILEPATH_TO_JSON_FILE)
data.to_parquet(PATH_WHERE_TO_SAVE_PARQUET_FILE)
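
To verify the round trip, you can read the file back with pandas (a quick check, reusing the same placeholder path):

check = pd.read_parquet(PATH_WHERE_TO_SAVE_PARQUET_FILE)
print(check)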

I hope this helps; please let me know if I can clarify anything.

Shane Halloran

Welcome to Stack Overflow. The library you are using shows in its example that you need to pass the column names of your data frame in a schema. Try using the column names of your data frame and it will work.

# Given PyArrow schema
import pyarrow as pa
from json2parquet import convert_json

schema = pa.schema([
    pa.field('my_column', pa.string()),  # pa.string and pa.int64 are factory functions and must be called
    pa.field('my_int', pa.int64()),
])
convert_json(input_filename, output_filename, schema)

Reference: json2parquet

DeshDeep Singh