0

I've tried the following(pd is pandas):

 for i, chunk in pd.read_excel(os.path.join(INGEST_PATH,file), chunksize=5):

but I am getting this error:

NotImplementedError: chunksize keyword of read_excel is not implemented

I've tried searching for other methods but most of them are for CSV files, not xlsx, I also have pandas version 0.20.1
Any help is appreciated.

Pear
  • 785
  • 2
  • 10
  • 19
  • Did you try these solutions? https://stackoverflow.com/questions/38623368/reading-a-portion-of-a-large-xlsx-file-with-python/38623545 – Riley Hun May 25 '17 at 17:54
  • I am not familiar with `chunksize`. One possibility, you can read the excel first into a dataframe, then split the index of the dataframe with `numpy.array_split` or something like that – zyxue May 25 '17 at 17:54
  • @RileyHun i've tried both, getting the same chunksize error. – Pear May 25 '17 at 18:12
  • @zyxue can you give an example please? – Pear May 25 '17 at 18:12

2 Answers2

0
df = pd.read_excel(os.path.join(INGEST_PATH,file))

# split indexes
idxes = np.array_split(df.index.values, 5)

chunks = [df.ix[idx] for idx in idxes]
zyxue
  • 7,904
  • 5
  • 48
  • 74
0

the above solutions werent working for me because the file wasnt being split properly and resulted into omitting the last few rows.. actually it gave me an error saying unequal divisions or something to that effect.

so i wrote the following. this will work for any file size.

enter code here
url_1=r'C:/Users/t3734uk/Downloads/ML-GooGLECRASH/amp_ub/df2.csv'
target_folder=r'C:\Users\t3734uk\Downloads\ML-GooGLECRASH\amp_ub'
df = pd.read_csv(url_1)

rows,columns=df.shape




def calcRowRanges(_no_of_files):
    row_ranges=[]
    interval_size=math.ceil(rows/_no_of_files)
    print('intrval size is ----> '+ '{}'.format(interval_size))
    for n in range(_no_of_files):
        row_range=(n*interval_size,(n+1)*interval_size)
    #    print(row_range)
        if row_range[1] > rows:
            row_range=(n*interval_size,rows)
    #    print(row_range)

        row_ranges.append(row_range)
    return row_ranges


def splitFile(_df_,_row_ranges):
    for row_range in _row_ranges:
        _df=_df_[row_range[0]:row_range[1]]
        writer = pd.ExcelWriter('FILE_'+str(_row_ranges.index(row_range))+'_'+'.xlsx')

        _df.to_excel(writer)



def dosplit(num_files):
    row_ranges=calcRowRanges(num_files)
    print(row_ranges)
    print(len(row_ranges))
    splitFile(df,row_ranges)

dosplit(enter_no_files_to_be_split_in)

on second thoughts the following fucntion is more intuitive:

def splitFile2(_df_,no_of_splits):
_row_ranges=calcRowRanges(no_of_splits)
for row_range in _row_ranges:
    _df=_df_[row_range[0]:row_range[1]]
    writer = pd.ExcelWriter('FILE_'+str(_row_ranges.index(row_range))+'_'+'.xlsx')

    _df.to_excel(writer)enter code here