185

The data I have to work with is a bit messy.. It has header names inside of its data. How can I choose a row from an existing pandas dataframe and make it (rename it to) a column header?

I want to do something like:

header = df[df['old_header_name1'] == 'new_header_name1']

df.columns = header
Veedrac
  • 58,273
  • 15
  • 112
  • 169
E.K.
  • 4,179
  • 8
  • 30
  • 50

6 Answers6

299
In [21]: df = pd.DataFrame([(1,2,3), ('foo','bar','baz'), (4,5,6)])

In [22]: df
Out[22]: 
     0    1    2
0    1    2    3
1  foo  bar  baz
2    4    5    6

Set the column labels to equal the values in the 2nd row (index location 1):

In [23]: df.columns = df.iloc[1]

If the index has unique labels, you can drop the 2nd row using:

In [24]: df.drop(df.index[1])
Out[24]: 
1 foo bar baz
0   1   2   3
2   4   5   6

If the index is not unique, you could use:

In [133]: df.iloc[pd.RangeIndex(len(df)).drop(1)]
Out[133]: 
1 foo bar baz
0   1   2   3
2   4   5   6

Using df.drop(df.index[1]) removes all rows with the same label as the second row. Because non-unique indexes can lead to stumbling blocks (or potential bugs) like this, it's often better to take care that the index is unique (even though Pandas does not require it).

unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Thank you so much for your quick response! How can I choose a row by value in stead of index location to make it header? So for your example something like.. df.columns = df[df[0] == 'foo'] – E.K. Oct 01 '14 at 17:57
  • The problem with that is there could be more than one row which has the value `"foo"`. One way around that problem is to explicitly choose the first such row: `df.columns = df.iloc[np.where(df[0] == 'foo')[0][0]]`. – unutbu Oct 01 '14 at 18:02
  • Ah I see why you did that way. For my case, I know there is only one row that has the value "foo". So it is ok. I just did this way I guess it is the same as the one you gave me above. idx_loc = df[df[0] == 'foo'].index.tolist()[0] df.columns = df.iloc[idx_loc] – E.K. Oct 01 '14 at 18:08
112

This works (pandas v'0.19.2'):

df.rename(columns=df.iloc[0])
Rob
  • 26,989
  • 16
  • 82
  • 98
Zachary Wilson
  • 1,244
  • 1
  • 8
  • 4
35

It would be easier to recreate the data frame. This would also interpret the columns types from scratch.

headers = df.iloc[0]
new_df  = pd.DataFrame(df.values[1:], columns=headers)
shahar_m
  • 3,461
  • 5
  • 41
  • 61
16

To rename the header without reassign df:

df.rename(columns=df.iloc[0], inplace = True)

To drop the row without reassign df:

df.drop(df.index[0], inplace = True)
Govinda
  • 789
  • 7
  • 6
5

You can specify the row index in the read_csv or read_html constructors via the header parameter which represents Row number(s) to use as the column names, and the start of the data. This has the advantage of automatically dropping all the preceding rows which supposedly are junk.

import pandas as pd
from io import StringIO

In[1]
    csv = '''junk1, junk2, junk3, junk4, junk5
    junk1, junk2, junk3, junk4, junk5
    pears, apples, lemons, plums, other
    40, 50, 61, 72, 85
    '''

    df = pd.read_csv(StringIO(csv), header=2)
    print(df)

Out[1]
       pears   apples   lemons   plums   other
    0     40       50       61      72      85
ccpizza
  • 28,968
  • 18
  • 162
  • 169
  • This does not address the question itself, which is asking about an already existing `DataFrame`. – pablete Jul 19 '22 at 23:30
  • some of the users who found this question (possibly the majority) would have a more generic use case than the OP; this answer is for that group – ccpizza Jul 21 '22 at 17:17
0

Keeping it Python simple

Padas DataFrames have columns attribute why not use it with standard Python, it is much clearer what you are doing:

table = [['name', 'Rf', 'Rg', 'Rf,skin', 'CRI'],
 ['testsala.cxf', '86', '95', '92', '87'],
 ['testsala.cxf: 727037 lm', '86', '95', '92', '87'],
 ['630.cxf', '18', '8', '11', '18'],
 ['Huawei stk-lx1.cxf', '86', '96', '88', '83'],
 ['dedo uv no filtro.cxf', '52', '93', '48', '58']]

import pandas as pd
data = pd.DataFrame(table[1:],columns=table[0])

or in the case is not the first row, but the 10th for instance:

columns = table.pop(10)
data = pd.DataFrame(table,columns=columns)
G M
  • 20,759
  • 10
  • 81
  • 84
  • 1
    Tested for performance, although we know that the creation of a new DataFrame is "time-consuming" Anyhow this approach took 40X more time – gbox Jan 04 '23 at 21:11
  • @gbox thanks for you comment! If you want edit the answer – G M Jan 05 '23 at 09:51