
I have lots of Excel files like the following (here are just two examples):

data1.xlsx:

[screenshot of the data1.xlsx sheet (df1)]

data2.xlsx:

[screenshot of the data2.xlsx sheet]

Is it possible, in Python, to take only the part with the columns id, a, b, c from each file, ignore the rest, and concatenate all those files together into one new Excel file? Thanks.


Here is what I have tried:

import os

import pandas as pd

columns_selected = ['id', 'a', 'b', 'c']
frames = []

for root, dirs, files in os.walk(src, topdown=False):  # src is the top-level folder
    for file in files:
        if file.endswith('.xlsx') or file.endswith('.xls'):
            #print(os.path.join(root, file))
            try:
                df0 = pd.read_excel(os.path.join(root, file))
                #print(df0)
            except Exception:
                continue
            # keep only the wanted columns (missing ones become NaN)
            frames.append(df0.reindex(columns=columns_selected))

# concatenate everything once and write a single output file
df1 = pd.concat(frames, ignore_index=True)
print(df1)
df1.to_excel('test.xlsx', index=False)
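
What I still haven't figured out is how to ignore the junk rows before and after the id/a/b/c block in each sheet. Conceptually I'm imagining something like the following (just a rough sketch, assuming every sheet has a header row whose first four cells are literally id, a, b, c):

import pandas as pd

wanted = ['id', 'a', 'b', 'c']

def extract_block(path):
    # read without a header so the junk rows show up as ordinary data
    raw = pd.read_excel(path, header=None)
    # locate the row whose first four cells are exactly the wanted header
    header_idx = None
    for i, row in raw.iterrows():
        if list(row.iloc[:4]) == wanted:
            header_idx = i
            break
    if header_idx is None:
        return None  # no id/a/b/c block in this sheet
    block = raw.iloc[header_idx + 1:, :4].copy()
    # stop at the first completely empty row after the header, if any
    empty = block.isna().all(axis=1)
    if empty.any():
        block = block.loc[:empty.idxmax() - 1]
    block.columns = wanted
    return block.reset_index(drop=True)

Each DataFrame returned this way could then be collected by the os.walk loop above in place of the plain read_excel call.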
ah bon
  • Use `iloc` to get only the data after the specific columns from the `.xlsx` files, then concatenate both sheets to make a new one, and puff! Done. – DirtyBit Jan 24 '19 at 12:58
  • Thanks. In fact my real data is literally a mess, quite difficult to deal with. :( – ah bon Jan 24 '19 at 14:04
  • ahbon: You can probably adapt the code in [my answer](https://stackoverflow.com/a/54361624/355230) to your other question to do this (as I already told you I thought was likely). Instead of the destination being a single directory somewhere, for this it's single dataframe—and instead of copying files to the destination directory, you'll want to extract and concatenate data from all the files which have one of the `wanted` file extensions. – martineau Jan 25 '19 at 11:09
  • Yeah, I agree. I mean: is it possible to concatenate data1.xlsx and data2.xlsx, ignoring the contents before and after the empty rows, to get a final Excel file with the columns id, a, b, c? – ah bon Jan 25 '19 at 11:36

2 Answers


Use `skiprows` and `nrows`: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html

import pandas as pd

df1 = pd.read_excel('data1.xlsx', skiprows=3, nrows=5)
df2 = pd.read_excel('data2.xlsx', skiprows=4, nrows=5)

# DataFrame.append is deprecated in newer pandas; pd.concat does the same job
dfFinal = pd.concat([df1, df2])
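
If the goal is one output workbook containing only those columns, the combined frame can then be trimmed and written out (continuing the snippet above; the column names and the output file name are assumptions based on the screenshots):

# keep only the columns of interest and write a single output workbook
dfFinal = dfFinal[['id', 'a', 'b', 'c']]
dfFinal.to_excel('combined.xlsx', index=False)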
Charles R
    My mistake... I delete it – Charles R Jan 24 '19 at 13:01
  • Thanks for your help. In fact I have lots of these kinds of Excel files under a folder and I don't want to read them one by one; are there other solutions? – ah bon Jan 24 '19 at 13:02
  • If all the files have the same pattern, you can have one file with many sheets in it. Then you can set `sheet_name=None` to read and parse all the data (see the sketch after these comments). – Charles R Jan 24 '19 at 13:04
  • @ahbon are those files, or sheets within a file? – DirtyBit Jan 24 '19 at 13:06
  • Thanks. The problem is they are located in different folders and subfolders, and those files have multiple sheets as well. Maybe the first thing I need to do is copy all the Excel files into one folder via shutil.copytree(), then combine them into one file with multiple sheets, and as a last step apply your solution. – ah bon Jan 24 '19 at 13:14
  • @ahbon you don't really have to copy them to a single sheet, I have suggested a way to just read all the excel files in a folder – DirtyBit Jan 24 '19 at 13:15
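
A minimal sketch of the `sheet_name=None` idea from the comment above, assuming a single workbook (here called all_data.xlsx, a made-up name) whose sheets all share the same layout:

import pandas as pd

# sheet_name=None makes read_excel return a dict of {sheet name: DataFrame}
sheets = pd.read_excel('all_data.xlsx', sheet_name=None, skiprows=3)  # skiprows value is just an example
combined = pd.concat(sheets.values(), ignore_index=True)
combined.to_excel('combined.xlsx', index=False)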

Extending @Charles R's answer to cover your requirement of multiple Excel files:

import glob
import os

# get all the .xlsx files in the folder
os.chdir(r'C:\ExcelWorkbooksFolder')
FileList = glob.glob('*.xlsx')
print(FileList)

and then:

for File in FileList:
    # the rest of the code for reading each workbook goes here
    # (see the fuller sketch below), e.g. pd.read_excel(File, ...)
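
Putting this together with @Charles R's skiprows idea, a rough sketch of the whole pipeline might look like the following. The skiprows value, the column names, and the output file name are assumptions; adjust them to match your sheets.

import glob
import os

import pandas as pd

os.chdir(r'C:\ExcelWorkbooksFolder')  # example folder from above

frames = []
for File in glob.glob('*.xlsx'):
    # skip the junk rows above the id/a/b/c header (assumed to be 3 here)
    frames.append(pd.read_excel(File, skiprows=3))

combined = pd.concat(frames, ignore_index=True)
combined = combined[['id', 'a', 'b', 'c']]  # keep only the wanted columns
combined.to_excel('combined.xlsx', index=False)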
DirtyBit
  • Thanks. Please check the post I put here. The structure of folders and files is similar to this one: https://stackoverflow.com/questions/54346748/iterate-excel-files-and-output-in-one-folder-in-python?noredirect=1#comment95510177_54346748 – ah bon Jan 24 '19 at 13:17
  • I think we need to use os.walk to iterate over all the Excel files ending with xlsx or xls, no? – ah bon Jan 24 '19 at 14:08