20

I have a huge DataFrame in which some columns share the same name. When I try to select a column whose name occurs twice (e.g. del df['col name'] or df2 = df['col name']), I get an error. What can I do?

smci
user3107640
  • Using the toy example ``df = DataFrame(np.random.randn(3,3), columns=list('aba'))`` these operations work fine for me. Try to make a smaller example that reproduces your problem. – Dan Allan Dec 16 '13 at 14:41
  • 1
    It could be the versioning. In 0.8, for example, I believe even trying to access a duplicate column name creates an IndexError, though it still allows you to create the data with duplicated names. – ely Dec 16 '13 at 14:43
  • Duplicate column names are a pain in the ass in pandas, and every other package and language I know of. Can't you create unique column string names? Like, append an integer to make them unique, where needed. You're making life difficult for yourself. In general, when col names are not unique in the first n characters (n being some suitably small integer, like 2..10), I take my sledgehammer out. – smci Dec 19 '17 at 02:37

4 Answers

16

You can address columns by position:

>>> df = pd.DataFrame([[1,2],[3,4],[5,6]], columns=['a','a'])
>>> df
   a  a
0  1  2
1  3  4
2  5  6
>>> df.iloc[:,0]
0    1
1    3
2    5

Or you can rename the columns, for example:

>>> df.columns = ['a','b']
>>> df
   a  b
0  1  2
1  3  4
2  5  6
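
If the duplicate names carry no meaning, the renaming can be automated. The following is only a sketch, assuming a numeric suffix is acceptable. make_unique is a hypothetical helper (not a pandas function), and it does not guard against a suffixed name colliding with a column that already exists:

```python
import pandas as pd

def make_unique(names):
    """Append _1, _2, ... to repeated names (hypothetical helper)."""
    seen = {}   # name -> number of times seen so far
    out = []
    for name in names:
        count = seen.get(name, 0)
        out.append(name if count == 0 else f"{name}_{count}")
        seen[name] = count + 1
    return out

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "a"])
df.columns = make_unique(df.columns)   # columns become 'a' and 'a_1'
```

After this, df['a'] and df['a_1'] each select exactly one column.
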
Roman Pekar
6

This is not a good situation to be in. The best fix is to create a hierarchical column labeling scheme (pandas allows multi-level labels for both columns and the row index). Work out what actually distinguishes the two columns that share a name, and use that to build a hierarchical column index.

In the meantime, if you know the position of each column in the ordered list of columns (e.g. from dataframe.columns), you can use the explicit positional indexer .iloc[] to retrieve values from a column. (Older pandas versions also offered .ix[], but it was deprecated in 0.20 and removed in 1.0.)

You can also create copies of the columns with new names, such as:

dataframe["new_name"] = dataframe.iloc[:, column_position].values

where column_position references the positional location of the column you're trying to get (not the name).
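
As a rough illustration of the positional approach, the positions of a duplicated name can be looked up instead of hard-coded. The frame, the column names and the new name a_second below are made up for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 7], [3, 4, 8], [5, 6, 9]], columns=["a", "a", "b"])

# Comparing the column index with a name gives a boolean array;
# flatnonzero turns it into the integer positions of the matches.
positions = np.flatnonzero(df.columns == "a")

# Pick the second column called 'a' by position, not by name.
second_a = df.iloc[:, positions[1]]

# Store a copy under a new, unique name.
df["a_second"] = second_a.values
```
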

These approaches may not work for you if the data is too large, however, so it is best to find a way to modify the construction process so that you get the hierarchical column index.
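
A hierarchical column index could look roughly like this. This is a sketch that assumes the two columns come from two different sources; the labels "left" and "right" and the level names are invented for the example:

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4], [5, 6]], columns=["a", "a"])

# Add an outer level that says where each column came from
# (assumed labels; use whatever actually distinguishes your columns).
df.columns = pd.MultiIndex.from_arrays(
    [["left", "right"], list(df.columns)],
    names=["source", "name"],
)

# Each column now has a unique (source, name) key.
left_a = df[("left", "a")]

# All columns called 'a', regardless of source.
all_a = df.xs("a", axis=1, level="name")
```
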

ely
6

Another solution:

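def remove_dup_columns(frame):
    """Keep only the first column for each distinct column name."""
    keep_names = set()   # names seen so far
    keep_icols = []      # positions of the columns to keep
    for icol, name in enumerate(frame.columns):
        if name not in keep_names:
            keep_names.add(name)
            keep_icols.append(icol)
    return frame.iloc[:, keep_icols]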

import numpy as np
import pandas as pd

frame = pd.DataFrame(np.random.randint(0, 50, (5, 4)), columns=['A', 'A', 'B', 'B'])

print(frame)
print(remove_dup_columns(frame))

The output is (the values are random, so yours will differ):

    A   A   B   B
0  18  44  13  47
1  41  19  35  28
2  49   0  30  16
3  39  29  43  41
4  26  19  48  13
    A   B
0  18  13
1  41  35
2  49  30
3  39  43
4  26  48
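
For what it's worth, the same first-occurrence filtering can probably be written more compactly with Index.duplicated(), which marks every repeat of a name after its first appearance. A sketch using the first two rows of the output above as fixed data:

```python
import pandas as pd

frame = pd.DataFrame(
    [[18, 44, 13, 47], [41, 19, 35, 28]],
    columns=["A", "A", "B", "B"],
)

# duplicated() is True for the second and later occurrences of a name;
# negating it keeps only the first column for each name.
deduped = frame.loc[:, ~frame.columns.duplicated()]
```
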
leitungswasser
1

The following function removes columns with duplicate names and keeps only the first one. It is not exactly what you asked for, but you can use snippets of it to solve your problem. The idea is to find the positions of the columns, which you can then address directly: positions are unique even when column names aren't.

def remove_multiples(df, varname):
    """
    Keep only the first of all columns named varname.

    Copies the first such column, deletes every column with that name,
    then re-inserts the copy (so it ends up as the last column).
    """
    from copy import deepcopy
    dfout = deepcopy(df)  # work on a copy so the caller's frame is untouched
    if varname in dfout.columns:
        # positions of all columns named varname; min() picks the first one
        tmp = dfout.iloc[:, min([i for i,x in enumerate(dfout.columns == varname) if x])]
        del dfout[varname]
        dfout[varname] = tmp
    return dfout

where

[i for i,x in enumerate(dfout.columns == varname) if x]

is the part you need: it lists the positions of every column named varname.
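
Once the positions are known, one specific duplicate can be dropped without touching the others. A minimal sketch, assuming it is the second column called 'a' that should go; the frame is made up for the example:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 7], [3, 4, 8], [5, 6, 9]], columns=["a", "a", "b"])

# Positions of every column named 'a'.
positions = [i for i, x in enumerate(df.columns == "a") if x]

# Boolean mask over all columns: keep everything except the second 'a'.
keep = np.ones(df.shape[1], dtype=bool)
keep[positions[1]] = False

result = df.iloc[:, keep]
```

This plays the role of del df['col name'] for a single column, which a name-based delete cannot express when the name is ambiguous.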

horseshoe
  • I had the same problem and tried your function, however it seems that deepcopy is a function from a specific library - NameError: name 'deepcopy' is not defined. Which lib is it from? – aabujamra Jun 27 '17 at 15:37
  • 1
    @abutremutante: it's from copy import deepcopy. I added it above. – horseshoe Jun 28 '17 at 07:36