1

I have multiple excel sheets that have identical column names. When I was saving the files from previous computations I forgot to set ‘Date’ as index and now all of them (40) have index columns with numbers from 1-200. If I load these into python they get an additional index column again resulting in 2 unnamed columns. I know I can use the glob function to access all my files. But is there a way I can access all the files, drop/delete the unnamed index column and set the new index to the date column

Here is an example of 1 excel sheet right now

df = pd.DataFrame({
'': [0, 1,2,3,4],
'Date': [1930, 1931, 1932, 1933,1934],
'value': [11558522, 12323552, 13770958, 18412280, 13770958],
}) 
Tamarie
  • 125
  • 2
  • 6
  • 18

3 Answers3

0

dfs = [pd.read_csv(file).set_index('Date')[['value']] for file in glob.glob("/your/path/to/folder/*.csv")]

0

A quick way to do that with pandas is:

>>> df = df.drop('', axis=1)
>>> df = df.set_index('Date')
>>> df
         value
Date          
1930  11558522
1931  12323552
1932  13770958
1933  18412280
1934  13770958

(I did the above from memory, but a general tip for these kinds of things is to look in the documentation for appropriate functions. https://pandas.pydata.org/pandas-docs/stable/reference/frame.html )

You can also specify header column when loading the files:

https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

index_colint, list of int, default None

Column (0-indexed) to use as the row labels of the DataFrame. Pass None if there is no such column. If a list is passed, those columns will be combined into a MultiIndex. If a subset of data is selected with usecols, index_col is based on the subset.

Community
  • 1
  • 1
razofz
  • 26
  • 3
0

I think simpliest is set wrong first column to index and then use DataFrame.set_index for rewrite them by Date column:

import glob, os

for file in glob.glob('subset/*.xlsx'):

    df = pd.read_excel(file, index_col=[0]).set_index('Date')
    print (df)

    #new excel files

    h,t = os.path.split(file)
    df.to_excel(os.path.join(h, 'new_' + t))

    #overwrite excel files (first backup data if something failed for avoid lost data)
    #df.to_excel(file)
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    this worked perfect. but how do i know save the dataframe (df) as individual excel files – Tamarie Mar 03 '20 at 09:49
  • 1
    `df.to_excel('new_' + file)` throws a file not found error. if i try the overwrite it does not overwrite the previous files – Tamarie Mar 03 '20 at 10:08
  • @Tamarie - To same folder like original data, only changed filenames – jezrael Mar 03 '20 at 11:27
  • oh yes i see it. put it only saved 1 file (the last one) not all the files – Tamarie Mar 03 '20 at 11:27
  • so i have new_file1 in my folder but it neglected saving files 2 - 20 – Tamarie Mar 03 '20 at 11:30
  • 1
    it finally worked. turns out i wasnt doing this in my for loop. i had split the code thats why it was doing it for 1 file only. Thank you so much for the help. The solution works perfect now – Tamarie Mar 03 '20 at 11:34
  • @Tamarie - Yop, I test it and working for me perfectly. – jezrael Mar 03 '20 at 11:35
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/208920/discussion-between-tamarie-and-jezrael). – Tamarie Mar 03 '20 at 12:44