
I have an Excel file with 4 worksheets. In each worksheet the first 3 rows are blank, i.e. the data starts from row 4 and continues for thousands of rows. Note: as per the requirement, I am not supposed to delete the blank rows.

My goals are below

1) Read the Excel file in Spark 2.1.
2) Ignore the first 3 rows and read the data from row 4 to row 50 (the file has more than 2000 rows).
3) Convert all the worksheets of the Excel file to separate CSVs and load them into existing Hive tables.

Note: I have the flexibility of writing separate code for each worksheet.

How can I achieve this?

I can create a DataFrame to read a single file and load it into Hive, but I guess my requirement needs more than that.
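For reference, here is roughly the single-file load I have working (a minimal sketch; the CSV path and Hive table name are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("SingleFileToHive")
  .enableHiveSupport()
  .getOrCreate()

// Read one plain file into a DataFrame (placeholder path)
val df = spark.read
  .option("header", "false")
  .option("inferSchema", "true")
  .csv("/user/data/single_file.csv")

// Append into an existing Hive table (placeholder name;
// the DataFrame columns must match the table's column order)
df.write.mode("append").insertInto("my_hive_table")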

learner

1 Answer


You could for instance use the HadoopOffice library (https://github.com/ZuInnoTe/hadoopoffice/wiki).

There you have the following options:

1) Use Hive directly to read the Excel files and CTAS into a table in CSV format. You would need to deploy the HadoopOffice Excel SerDe (https://github.com/ZuInnoTe/hadoopoffice/wiki/Hive-Serde). Then you need to create the table (see the documentation for all the options; the example below reads from Sheet1 and skips the first 3 lines):

create external table ExcelTable(<INSERTHEREYOURCOLUMNSPECIFICATION>)
ROW FORMAT SERDE 'org.zuinnote.hadoop.excel.hive.serde.ExcelSerde'
STORED AS
  INPUTFORMAT 'org.zuinnote.hadoop.office.format.mapred.ExcelFileInputFormat'
  OUTPUTFORMAT 'org.zuinnote.hadoop.excel.hive.outputformat.HiveExcelRowFileOutputFormat'
LOCATION '/user/office/files'
TBLPROPERTIES(
  "hadoopoffice.read.simple.decimalFormat"="US",
  "hadoopoffice.read.sheet.skiplines.num"="3",
  "hadoopoffice.read.sheet.skiplines.allsheets"="true",
  "hadoopoffice.read.sheets"="Sheet1",
  "hadoopoffice.read.locale.bcp47"="US",
  "hadoopoffice.write.locale.bcp47"="US"
);

Then do CTAS into a CSV format table:

create table CSVTable
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
AS select * from ExcelTable;

2) Use Spark. Depending on the Spark version you have different options: for Spark 1.x you can use the HadoopOffice FileFormat, and for Spark 2.x the Spark2 DataSource (the latter also includes support for Python). See the how-tos in the HadoopOffice wiki linked above. A rough sketch for Spark 2.x follows.
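A minimal Spark 2.x sketch, assuming the HadoopOffice Spark2 DataSource is on the classpath. The option names are assumed to mirror the TBLPROPERTIES above without the "hadoopoffice." prefix, and the paths, sheet name, and Hive table name are placeholders — verify everything against the wiki for your version:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("ExcelSheetsToHive")
  .enableHiveSupport()
  .getOrCreate()

// Read one worksheet, skipping the 3 blank leading lines
// (option names assumed to mirror the Hive TBLPROPERTIES above)
val df = spark.read
  .format("org.zuinnote.spark.office.excel")
  .option("read.locale.bcp47", "US")
  .option("read.sheets", "Sheet1")
  .option("read.sheet.skiplines.num", "3")
  .option("read.sheet.skiplines.allsheets", "true")
  .option("read.spark.simpleMode", "true") // assumed option for a flat schema of primitive types
  .load("/user/office/files/myfile.xlsx")

// Rows 4 to 50 of the sheet = the first 47 rows after the skipped lines
val first47 = df.limit(47)

// Write a CSV copy and append into an existing Hive table (placeholder names)
first47.write.csv("/user/office/csv/sheet1")
first47.write.mode("append").insertInto("existing_hive_table")

Since you have the flexibility of writing separate code per worksheet, repeating this with a different value for "read.sheets" (and a different target table) covers all 4 sheets.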

Jörn Franke