Is it possible to read an Excel file from Apache Zeppelin into PySpark or a Pandas DataFrame?

Question

I have got a file in HDFS (/user/username/Project/data/file.xlsx) that I want to read into a DataFrame. (I do not care if it is a PySpark DataFrame or Pandas, but Pandas is preferred.)

I am using a Zeppelin notebook to write my code.

Is it possible to get data from this file?

I have already tried the following commands, but none of them worked:

import pandas as pd

df = pd.read_excel("/user/username/Project/data/file.xlsx")
df = pd.read_excel("hdfs:///user/username/Project/data/file.xlsx")
df = pd.read_excel("hdfs://user/username/Project/data/file.xlsx")

score 1 · Answer 1 · answered Jul 19 '19 at 11:34

I don't think you can read files stored in HDFS directly with pandas.

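You probably have to do one of the following:

1. Load the file into Spark, then call toPandas(). Spark has no built-in Excel reader, so this assumes a third-party Excel data source (for example the spark-excel package) is installed on the cluster:

df = spark.read.format("excel").load("hdfs:xxx").toPandas()

2. Use some alternative that lets pandas read the file directly, as described here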
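
One possible way to let pandas read the file is to stream the workbook out of HDFS and hand the bytes to pd.read_excel. The following is only a minimal sketch, not a tested recipe for any particular cluster. It assumes the hdfs command-line client is on the PATH of the machine running the Zeppelin Python interpreter, and that an Excel engine such as openpyxl is installed for pandas. The helper names hdfs_cat_command and read_excel_from_hdfs are made up for illustration.

```python
import io
import subprocess


def hdfs_cat_command(hdfs_path):
    """Build the `hdfs dfs -cat` command that streams a file to stdout."""
    return ["hdfs", "dfs", "-cat", hdfs_path]


def read_excel_from_hdfs(hdfs_path, **read_excel_kwargs):
    """Fetch the workbook bytes from HDFS and parse them with pandas.

    Assumption: the `hdfs` CLI is installed and configured on this host.
    """
    # Imported here so the command helper above stays usable without pandas.
    import pandas as pd

    # Run the CLI and capture the raw file content; raises on a non-zero exit.
    result = subprocess.run(
        hdfs_cat_command(hdfs_path), check=True, capture_output=True
    )
    # pd.read_excel accepts a file-like object, so wrap the bytes in BytesIO.
    return pd.read_excel(io.BytesIO(result.stdout), **read_excel_kwargs)


# Example call (needs a reachable HDFS, so it is left commented out):
# df = read_excel_from_hdfs("/user/username/Project/data/file.xlsx")
```

Because the whole file is pulled into the interpreter's memory, this only suits workbooks that fit comfortably on a single machine. For larger data, the Spark route is probably the better fit.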

score 0 · Answer 2 · edited Feb 21 '20 at 17:01

It seems that in the Python interpreter of Apache Zeppelin, importing and exporting data can only be done through the "pd.read_csv" and "to_csv" functions.

answered Feb 21 '20 at 16:57 by Ivan7 · edited Feb 21 '20 at 17:01 by Yunnosch