0

I have a file in HDFS (/user/username/Project/data/file.xlsx) that I want to read into a DataFrame. (Either a PySpark or a Pandas DataFrame would do, but Pandas is preferred.)

I am using a Zeppelin notebook to write my code.

Is it possible to get data from this file?

I have already tried the following commands, but none of them worked:

  • df = pd.read_excel("/user/username/Project/data/file.xlsx")
  • df = pd.read_excel("hdfs:///user/username/Project/data/file.xlsx")
  • df = pd.read_excel("hdfs://user/username/Project/data/file.xlsx")
Secespitus

2 Answers

1

I don't think you can read files stored in HDFS directly with pandas.

You probably have to either:

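  • load the file into Spark, then use toPandas() (as far as I know Spark has no built-in Excel reader, so the "excel" format needs a third-party package such as spark-excel on the classpath)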

    df = spark.read.format("excel").load("hdfs:xxx").toPandas()

  • use some alternative to enable pandas to read directly, as described here
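As a rough sketch of the second option: one way to get the file into pandas is to pull the raw bytes out of HDFS with the hdfs command-line client and hand them to pd.read_excel as an in-memory buffer. This assumes the hdfs CLI is on the PATH of the machine running the Zeppelin Python interpreter and that an Excel engine such as openpyxl is installed. The helper names read_hdfs_bytes and read_excel_from_hdfs are made up for this example, and I have not tested it against a real cluster.

```python
import io
import subprocess


def read_hdfs_bytes(path, cat_cmd=("hdfs", "dfs", "-cat")):
    """Return the raw bytes of a file by running `<cat_cmd> <path>`.

    Assumption: the `hdfs` CLI is installed and configured on this host.
    `cat_cmd` is a parameter only so the command can be swapped out.
    """
    result = subprocess.run(
        [*cat_cmd, path],
        check=True,              # raise CalledProcessError if the command fails
        stdout=subprocess.PIPE,  # capture the file content as bytes
    )
    return result.stdout


def read_excel_from_hdfs(path, **kwargs):
    """Load an Excel file stored in HDFS into a pandas DataFrame."""
    # Imported here so read_hdfs_bytes stays usable without pandas installed.
    import pandas as pd

    # pd.read_excel accepts a file-like object, so wrap the bytes in BytesIO.
    return pd.read_excel(io.BytesIO(read_hdfs_bytes(path)), **kwargs)
```

In the notebook this would be called as df = read_excel_from_hdfs("/user/username/Project/data/file.xlsx"). The whole file is held in memory, so this only makes sense for files that fit comfortably on the driver.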

0

It seems that, in the Python interpreter in Apache Zeppelin, importing and exporting data can only be done through "pd.read_csv" and "to_csv".

Yunnosch
Ivan7