1

I'm trying to create a data ingestion routine to load data from multiple excel files with multiple tabs and columns in a data structure using python. The structuring of the tabs in each of the excel files is the same. Can someone please help me with my code? Please let me know what can be changed here.

folder = r'specified_path'

#Changing the directory to the database directory
os.chdir(folder) 

#Getting the list of files from the assigned path
files = os.listdir(folder) 

#Joining the list of files to the assigned path
for archivedlist in files:
    local_path = os.path.join(folder, archivedlist)
    print("Joined Path: ", local_path)

#Reading the data from the files in the dictionary data structure
main_dict = {}
def readdataframe(files):
    df_dict = {}
    for element in files:
        df_dict[element] = pd.read_excel(element, sheet_name = None)
    print(df_dict[element].keys)
return df_dict

print(readdataframe(files))
Harsh780
  • 13
  • 5

1 Answers1

0

Let's assume you have two files in a directory called excel_test:

1.xlsx 
Sheet1      Sheet2
col1 col2   col1 col2
1     2     3     4

2.xlsx 
Sheet1      Sheet2
col1 col2   col1 col2
5     6     7     8

You can store your extracted data in multiple ways, let's see some methods:


1) A single dictionary

A dictionary where all the keys are strings composed by the "name of the file, underscore, sheet name" and the values are pd.DataFrames

import pandas as pd
import os

files_dir = "excel_test/"
files = os.listdir(files_dir) 

# here will be stored all the content from your files and sheets
sheets_content = {}

# reading all files inside the folder
for file in files:
  
  # reading the content of a xlsx file
  data = pd.ExcelFile(files_dir+file)
  
  # iterating through all sheets
  for sheet in data.sheet_names:
    # saving the content of the sheet for that file (-5 deletes the .xlsx part from the name of the file and makes everything more readable)
    sheets_content[file[:-5]+"_"+sheet] = data.parse(sheet)

print(sheets_content)

Output:

{'1_Sheet1':    
        col1  col2
    0     1     2, 
'1_Sheet2':    
        col1  col2
    0     3     4, 
'2_Sheet1':    
        col1  col2
    0     5     6, 
'2_Sheet2':   
        col1  col2
    0     7     8
}

2) A dictionary of dictionaries

Store all xlsx files in a dictionary with as keys the file names and as value another dictionary. The inner dictionary has keys as sheets names and values are pd.DataFrames:

import pandas as pd
import os

files_dir = "excel_test/"
files = os.listdir(files_dir) 

sheets_content = {}

for file in files:

  data = pd.ExcelFile(files_dir+file)
  
  file_data = {}
  for sheet in data.sheet_names:
    file_data[sheet] = data.parse(sheet)

  sheets_content[file[:-5]] = file_data

Output:

{'1': 
    {'Sheet1':    
           col1  col2
       0     1     2, 
     'Sheet2':    
           col1  col2
       0     3     4}, 
 '2': 
    {'Sheet1':    
           col1  col2
        0     5     6, 
     'Sheet2':    
           col1  col2
        0     7     8}
}

3) A list of dictionaries

Store all xlsx files in an array where each element is a dictionary with as keys the sheets names and values pd.DataFrames:

import pandas as pd
import os

files_dir = "excel_test/"
files = os.listdir(files_dir) 

sheets_content = []

for file in files:

  data = pd.ExcelFile(files_dir+file)
  
  file_data = {}
  for sheet in data.sheet_names:
    file_data[sheet] = data.parse(sheet)
  sheets_content.append(file_data)

Output:

[
 {'Sheet1':    
      col1  col2
   0     1     2, 
  'Sheet2':    
      col1  col2
   0     3     4}, 
 {'Sheet1':    
      col1  col2
   0     5     6, 
  'Sheet2':    
      col1  col2
   0     7     8}
]
ClaudiaR
  • 3,108
  • 2
  • 13
  • 27
  • So, I want my excel sheet names to be the keys and the data with the sheets inside as the value for the dictionary. Will this code work in that case? Here, I can see that various tabs of each excel sheet are as the keys. Correct me if I’m wrong. – Harsh780 Sep 02 '22 at 00:24
  • Okay, but you said you want to iterate through multiple excel files right ? How do you want to store the data? Do you want an array with as elements a dictionary for each excel? – ClaudiaR Sep 02 '22 at 04:47
  • I was thinking to create a list of dictionaries. Or just a dictionary with keys as file name and value as the data inside each file. – Harsh780 Sep 02 '22 at 05:25
  • Great, I've updated the answer giving multiple solutions. Pick the one you like more. – ClaudiaR Sep 02 '22 at 05:57