0

I've tried using numpy.concatonate as this post suggests

The files I'm working on originally look like this.

>>> df = pd.read_html(file)
>>> df.head()
            0            1                        2
0          Id         Date                   Reason
1         161  04-Mar-2019                Cancelled
2         843  04-Mar-2019                Cancelled
3         042  04-Mar-2019                Cancelled
4         247  04-Mar-2019                Cancelled 

This is my code, the csv file it spits out still has the column names where the files have been joined.

files = sorted(glob.glob(path + "Export*"+ extension))
all_data = []
if len(files) == 1: # check if there's only 1 file
    df = pd.read_html(files[0])[0]
    df.to_csv(path + filename + ".csv", index=False, header=False)

first = files.pop(-1) # order the files
files.insert(0, first)
for file in files:
    df = pd.read_html(file)[0]
    all_data.append(df.values)       


df = pd.DataFrame(np.concatenate(all_data), columns=pd.read_html(first)[0].columns)
df.to_csv(path + filename + ".csv", index=False, header=False)

1 Answers1

0

The problem here is that you're not reading your files in as you think you are; first[0].columns is actually ['0', '1', '2'] not ['Id', 'Date', 'Reason'] (as you'd like it to be).

Without the columns identified correctly, your call to

pd.DataFrame(np.concatenate(all_data), columns=pd.read_html(first[0].columns))

isn't going to behave as you'd like it.

When you read in the data with read_html you'll want to ensure you specify where the column names are located in the file. You can do this with the header argument.

So, you're probably going to want to read in your data as something like df = pd.read_html(file, header=1).

After being read in your data should look like this.

df.head()
    Id         Date     Reason
1  161  04-Mar-2019  Cancelled
2  843  04-Mar-2019  Cancelled
3   42  04-Mar-2019  Cancelled
4  247  04-Mar-2019  Cancelled

With this, the result of your concatenation should be as you desire.

jwalton
  • 5,286
  • 1
  • 18
  • 36