Pandas: Concatenate files but skip the headers except the first file

Question

I have 3 files representing the same dataset split in 3 and I need to concatenate:

import pandas

df1 = pandas.read_csv('path1')
df2 = pandas.read_csv('path2')
df3 = pandas.read_csv('path3')

df = pandas.concat([df1,df2,df3])

But this will keep the headers in the middle of the dataset, I need to remove the headers (column names) from the 2nd and 3rd file. How do I do that?

I dont understand - how is possible `skiprows=1` solution can works? because concat align data by first dataframe and if remove columns names from second and third dataframe it cannot align. Or something missing? — jezrael, Aug 13 '17 at 08:45
You're right, I check the skipping of the line but not the concatenation. Definitely the skiprows code is not the right one, the dataset should have 23 columns it has almost 3 times that. — MCG Code, Aug 13 '17 at 08:59

jezrael · Accepted Answer · 2017-08-13T08:36:19.053

11

I think you need numpy.concatenate with DataFrame constructor:

df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)

Another solution is replace columns names in df2 and df3:

df2.columns = df1.columns
df3.columns = df1.columns
df = pd.concat([df1,df2,df3], ignore_index=True)

Samples:

np.random.seed(100)
df1 = pd.DataFrame(np.random.randint(10, size=(2,3)), columns=list('ABF'))
print (df1)
   A  B  F
0  8  8  3
1  7  7  0

df2 = pd.DataFrame(np.random.randint(10, size=(1,3)), columns=list('ERT'))
print (df2)
   E  R  T
0  4  2  5

df3 = pd.DataFrame(np.random.randint(10, size=(3,3)), columns=list('HTR'))
print (df3)
   H  T  R
0  2  2  2
1  1  0  8
2  4  0  9

print (np.concatenate([df1.values, df2.values, df3.values]))
[[8 8 3]
 [7 7 0]
 [4 2 5]
 [2 2 2]
 [1 0 8]
 [4 0 9]]

df = pd.DataFrame(np.concatenate([df1.values, df2.values, df3.values]), columns=df1.columns)
print (df)
   A  B  F
0  8  8  3
1  7  7  0
2  4  2  5
3  2  2  2
4  1  0  8
5  4  0  9

df = pd.concat([df1,df2,df3], ignore_index=True)
print (df)
   A  B  F
0  8  8  3
1  7  7  0
2  4  2  5
3  2  2  2
4  1  0  8
5  4  0  9

edited Aug 13 '17 at 08:36

answered Aug 13 '17 at 08:31

jezrael

822,522
95
1,334
1,252

Do you have any idea why is it required to add df2.columns = df1.columns if the files already have identical headers? – MCG Code Aug 13 '17 at 09:16
1

If columns are identical, then your solution should works perfectly - concat align data by columns. – jezrael Aug 13 '17 at 09:17
Your code works perfectly. I'm just wandering why pandas insists on me making the df2.columns = df1.columns before using ignore_index=True – MCG Code Aug 13 '17 at 09:18
I think your columns names are different, so need my solution. But if columns names are same, then need only `df = pd.concat([df1,df2,df3], ignore_index=True)`. – jezrael Aug 13 '17 at 09:19
The columns are identical I check it with all(df2.columns == df1.columns) and is returns True. But when I run the line df = pd.concat([df1,df2,df3], ignore_index=True) it just duplicates the columns, only when I use your full code (incl the replacement of columns) that it works – MCG Code Aug 13 '17 at 09:20
I think your solumns has to be different, maybe something like `0` as number and `0` like string. I have same issue with `all` some times before and very long time looking for problem. Because if duplicate columns it seems cannot allign so all duplicates column names are different some way. – jezrael Aug 13 '17 at 09:24
maybe help check `print (df.columns.tolist())` – jezrael Aug 13 '17 at 09:27

score 4 · Answer 2 · answered Aug 13 '17 at 08:30

4

You have to use argument skip_rows of read_csv for second and third lines like here:

import pandas

df1 = pandas.read_csv('path1')
df2 = pandas.read_csv('path2', skiprows=1)
df3 = pandas.read_csv('path3', skiprows=1)

df = pandas.concat([df1,df2,df3])

answered Aug 13 '17 at 08:30

Serenity

35,289
20
120
115

I agree with Jezrael, the concatenation duplicates the columns, as many times as the files. I was a bit too fast, I was happy to see the first line disappear, but didn't check on the right that column numbers became huge – MCG Code Aug 13 '17 at 09:00

Ashlae Blume · Answer 3 · 2021-07-04T20:53:12.020

2

Been working on this recently myself, here's the most compact/elegant thing I came up with:

import pandas as pd

frame_list=[df1, df2, df3]
frame_mod=[frame_list[i].iloc[0:] for i in range(0,len(frame_list))]
frame_frame=pd.concat(frame_mod)

edited Jul 04 '21 at 20:53

answered Jul 04 '21 at 20:34

Ashlae Blume

21
2

score 0 · Answer 4 · answered Dec 03 '19 at 19:01

0

Use:

df = pd.merge(df1, df2, how='outer')

Merge rows that appear in either or both df1 and df2 (union).

answered Dec 03 '19 at 19:01

Gustavo Bertoli

48
8

Pandas: Concatenate files but skip the headers except the first file

4 Answers4

Linked