
So I have been having some issues reading large Excel files into Databricks using PySpark and pandas. Spark seems to be really fast at CSV and txt, but not at Excel. For example:

df2 = pd.read_excel(excel_file, sheet_name=sheets, skiprows=skip_rows).astype(str)

df = (spark.read.format("com.crealytics.spark.excel")
      .option("dataAddress", "'" + sheet + "'!A1")
      .option("useHeader", "false")
      .option("maxRowsInMemory", 1000)
      .option("inferSchema", "false")
      .load(filePath))

The fastest way we have found to read in an Excel file is this one, written by a contractor:

from openpyxl import load_workbook
import csv
import sys
import time

excel_file = "/dbfs/{}".format(path)

# read_only streams rows instead of loading the whole workbook;
# data_only returns cached cell values rather than formulas
workbook = load_workbook(excel_file, read_only=True, data_only=True)

for worksheet_name in workbook.sheetnames:
    print("Export " + worksheet_name + " ...")
    try:
        worksheet = workbook[worksheet_name]
    except KeyError:
        print("Could not find " + worksheet_name)
        sys.exit(1)

    with open("/dbfs/{}/{}.csv".format(tempDir, worksheet_name), 'w') as your_csv_file:
        wr = csv.writer(your_csv_file, quoting=csv.QUOTE_ALL)
        headerDone = False
        for row in worksheet.iter_rows():
            lrow = []

            # prepend a column holding the sheet name; the very
            # first row gets the literal header "worksheet_name"
            if headerDone:
                lrow.append(worksheet_name)
            else:
                lrow.append("worksheet_name")
                headerDone = True

            for cell in row:
                lrow.append(cell.value)
            wr.writerow(lrow)

# Sometimes Python gets a bit ahead of itself and tries to
# read the CSVs before they have finished writing, and fails,
# so retry until the load succeeds

retryCount = 0
retryMax = 20

while retryCount < retryMax:
    try:
        df2 = spark.read.format("csv").option("header", "true").load(tempDir)
        if df2.count() == 0:
            print("Retrying load from CSV")
            retryCount += 1
            time.sleep(10)
        else:
            break
    except Exception:
        print("Threw an error trying to read the file")
        retryCount += 1
        time.sleep(10)

The reason it is fast is that it only holds one row of the Excel sheet in memory as it loops. I tried appending the rows into a single list instead, but that made it very slow.

The issue with the above method is that writing to CSV and re-reading it doesn't seem very robust. It's possible for a CSV to be read part-way through while it is still being written; the read would still succeed and data would silently be lost.
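
One idea for making this more robust (an untested sketch, not something we actually run; the write_sheet_atomically name is mine, tempDir and worksheet_name are as above, and I am assuming os.replace behaves atomically on DBFS, which I have not verified): write each sheet to a temporary file and only rename it into place once the writer is closed, so a reader can never see a half-written CSV.

import csv
import os

def write_sheet_atomically(worksheet, worksheet_name, tempDir):
    # Write to <name>.csv.tmp first; the file only appears under
    # its final .csv name after it is fully written and closed
    final_path = "/dbfs/{}/{}.csv".format(tempDir, worksheet_name)
    tmp_path = final_path + ".tmp"

    with open(tmp_path, "w", newline="") as f:
        wr = csv.writer(f, quoting=csv.QUOTE_ALL)
        headerDone = False
        for row in worksheet.iter_rows():
            # same sheet-name column trick as the loop above
            lrow = [worksheet_name if headerDone else "worksheet_name"]
            headerDone = True
            lrow.extend(cell.value for cell in row)
            wr.writerow(lrow)

    os.replace(tmp_path, final_path)  # atomic on a local filesystem

If the Spark load then targeted tempDir + "/*.csv" rather than the whole directory, in-progress .tmp files would be ignored entirely and the retry loop might become unnecessary.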

Is there any other way of making this fast, such as using Cython, so that the rows can be accumulated without the memory penalty and handed to Spark directly via createDataFrame? Something like the sketch below is what I have in mind (untested, plain Python rather than Cython; the excel_to_spark_df name and the batch_size parameter are my own, and it assumes every sheet shares the same header row):
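
from functools import reduce
from openpyxl import load_workbook

def excel_to_spark_df(spark, excel_path, batch_size=10000):
    # Stream rows out of the workbook and into Spark in batches,
    # so at most batch_size rows sit in Python memory at once
    workbook = load_workbook(excel_path, read_only=True, data_only=True)
    parts = []

    for worksheet_name in workbook.sheetnames:
        rows_iter = workbook[worksheet_name].iter_rows()

        first = next(rows_iter, None)
        if first is None:
            continue  # skip empty sheets

        # first row is the header; prepend the sheet-name column
        header = ["worksheet_name"] + [str(c.value) for c in first]

        batch = []
        for row in rows_iter:
            # everything as strings, like the CSV route with
            # inferSchema=false (None becomes the string "None")
            batch.append([worksheet_name] + [str(c.value) for c in row])
            if len(batch) >= batch_size:
                parts.append(spark.createDataFrame(batch, schema=header))
                batch = []
        if batch:
            parts.append(spark.createDataFrame(batch, schema=header))

    return reduce(lambda a, b: a.unionByName(b), parts)

Each createDataFrame call still has to serialise its batch over to the JVM, though, so I don't know whether this would actually beat the CSV route in practice.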

Tiger_Stripes
  • Excel is not a big data format. While Spark has a connector to read directly from it (the crealytics one, which you reference), the Excel file format is not designed for distributed reads. What your contractor does is convert the Excel spreadsheet to a series of CSV files, which can be processed more efficiently than Excel, that's all. Your contractor had the right mindset. – Oliver W. Nov 28 '19 at 10:45
  • The biggest problem is likely to be going between row and column-based data structures. Excel and CSV are row-based, data frames are columnar. If it's possible to write to spark row-wise then you can probably skip the CSV step. – Charlie Clark Nov 28 '19 at 12:33
  • What I was thinking is that if all the rows could be appended together in Cython somehow, then I could stick them directly into Spark without writing to CSV. However, I don't know much about Cython, so I'm not sure if it's possible. – Tiger_Stripes Nov 28 '19 at 22:30
  • I just wrapped up work on a similar issue - Excel is a terrible format to work with, as you can see. The best way I've found is very similar: I converted the sheets to individual CSV files, but as a dataframe, which means the CSV is split across the number of workers you have. We used Aspose for Java to handle our conversion and it was very fast. – David Nguyen Nov 28 '19 at 23:03
  • You can read an Excel/xls file using Spark with Scala. Below is sample code: val df = spark.read.format("com.crealytics.spark.excel").option("location", "/FileStore/tables/Airline.xlsx").option("useHeader", "true").option("treatEmptyValuesAsNulls", "false").option("inferSchema", "false").option("addColorColumns", "false").load("/FileStore/tables/Sample.xlsx") – Praveen Dec 12 '19 at 09:32
  • @Praveen, Spark turns out to be really bad at reading Excel files. – Tiger_Stripes Dec 12 '19 at 12:54
