How to split a large excel file using Pandas?

Question

I've tried the following(pd is pandas):

 for i, chunk in pd.read_excel(os.path.join(INGEST_PATH,file), chunksize=5):

but I am getting this error:

NotImplementedError: chunksize keyword of read_excel is not implemented

I've tried searching for other methods but most of them are for CSV files, not xlsx, I also have pandas version 0.20.1
Any help is appreciated.

Did you try these solutions? https://stackoverflow.com/questions/38623368/reading-a-portion-of-a-large-xlsx-file-with-python/38623545 — Riley Hun, May 25 '17 at 17:54
I am not familiar with `chunksize`. One possibility, you can read the excel first into a dataframe, then split the index of the dataframe with `numpy.array_split` or something like that — zyxue, May 25 '17 at 17:54
@RileyHun i've tried both, getting the same chunksize error. — Pear, May 25 '17 at 18:12

score 0 · Accepted Answer · answered May 25 '17 at 18:14

0

df = pd.read_excel(os.path.join(INGEST_PATH,file))

# split indexes
idxes = np.array_split(df.index.values, 5)

chunks = [df.ix[idx] for idx in idxes]

answered May 25 '17 at 18:14

zyxue

7,904
5
48
74

venkat.gmrit05 · Answer 2 · 2018-04-12T09:33:12.073

the above solutions werent working for me because the file wasnt being split properly and resulted into omitting the last few rows.. actually it gave me an error saying unequal divisions or something to that effect.

so i wrote the following. this will work for any file size.

enter code here
url_1=r'C:/Users/t3734uk/Downloads/ML-GooGLECRASH/amp_ub/df2.csv'
target_folder=r'C:\Users\t3734uk\Downloads\ML-GooGLECRASH\amp_ub'
df = pd.read_csv(url_1)

rows,columns=df.shape




def calcRowRanges(_no_of_files):
    row_ranges=[]
    interval_size=math.ceil(rows/_no_of_files)
    print('intrval size is ----> '+ '{}'.format(interval_size))
    for n in range(_no_of_files):
        row_range=(n*interval_size,(n+1)*interval_size)
    #    print(row_range)
        if row_range[1] > rows:
            row_range=(n*interval_size,rows)
    #    print(row_range)

        row_ranges.append(row_range)
    return row_ranges


def splitFile(_df_,_row_ranges):
    for row_range in _row_ranges:
        _df=_df_[row_range[0]:row_range[1]]
        writer = pd.ExcelWriter('FILE_'+str(_row_ranges.index(row_range))+'_'+'.xlsx')

        _df.to_excel(writer)



def dosplit(num_files):
    row_ranges=calcRowRanges(num_files)
    print(row_ranges)
    print(len(row_ranges))
    splitFile(df,row_ranges)

dosplit(enter_no_files_to_be_split_in)

on second thoughts the following fucntion is more intuitive:

def splitFile2(_df_,no_of_splits):
_row_ranges=calcRowRanges(no_of_splits)
for row_range in _row_ranges:
    _df=_df_[row_range[0]:row_range[1]]
    writer = pd.ExcelWriter('FILE_'+str(_row_ranges.index(row_range))+'_'+'.xlsx')

    _df.to_excel(writer)enter code here

How to split a large excel file using Pandas?

2 Answers2