2

given df

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('abcd'))

enter image description here

Suppose I need column 'b' to be at the end. I could do:

df[['a', 'c', 'd', 'b']]

enter image description here

But what is the most efficient way to ensure that a given column is at the end?

This is what I've been going with. What would others do?

def put_me_last(df, column):
    return pd.concat([df.drop(column, axis=1), df[column]], axis=1)

put_me_last(df, 'b')

enter image description here


Timing Results

conclusion mfripp is the winner. Seems as if reindex_axis is a large efficiency gain over []. That is really good info.

enter image description here

code

from string import lowercase

df_small = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('abcd'))
df_large = pd.DataFrame(np.arange(1000000).reshape(10000, 100),
                        columns=pd.MultiIndex.from_product([list(lowercase[:-1]), ['One', 'Two', 'Three', 'Four']]))


def pir1(df, column):
    return pd.concat([df.drop(column, axis=1), df[column]], axis=1)

def pir2(df, column):
    if df.columns[-1] == column:
        return df
    else:
        pos = df.columns.values.__eq__('b').argmax()
        return df[np.roll(df.columns, len(df.columns) - 1 - pos)]

def pir3(df, column):
    if df.columns[-1] == column:
        return df
    else:
        pos = df.columns.values.__eq__('b').argmax()
        cols = df.columns.values
        np.concatenate([cols[:pos], cols[1+pos:], cols[[pos]]])
        return df[np.concatenate([cols[:pos], cols[1+pos:], cols[[pos]]])]

def pir4(df, column):
    if df.columns[-1] == column:
        return df
    else:
        return df[np.roll(df.columns.drop(column).insert(0, column), -1)]

def carsten1(df, column):
    cols = list(df)
    if cols[-1] == column:
        return df
    else:
        return pd.concat([df.drop(column, axis=1), df[column]], axis=1)

def carsten2(df, column):
    cols = list(df)
    if cols[-1] == column:
        return df
    else:
        idx = cols.index(column)
        new_cols = cols[:idx] + cols[idx + 1:] + [column]
        return df[new_cols]

def mfripp1(df, column):
    new_cols = [c for c in df.columns if c != column] + [column]
    return df[new_cols]

def mfripp2(df, column):
    new_cols = [c for c in df.columns if c != column] + [column]
    return df.reindex_axis(new_cols, axis='columns', copy=False)

def ptrj1(df, column):
    return df.reindex(columns=df.columns.drop(column).append(pd.Index([column])))

def shivsn1(df, column):
    column_list=list(df)
    column_list.remove(column)
    column_list.append(column)
    return df[column_list]

def merlin1(df, column):
    return df[df.columns.drop(["b"]).insert(99999, 'b')]


list_of_funcs = [pir1, pir2, pir3, pir4, carsten1, carsten2, mfripp1, mfripp2, ptrj1, shivsn1]

def test_pml(df, pml):
    for c in df.columns:
        pml(df, c)

summary = pd.DataFrame([], [f.__name__ for f in list_of_funcs], ['Small', 'Large'])

for f in list_of_funcs:
    summary.at[f.__name__, 'Small'] = timeit(lambda: test_pml(df_small, f), number=100)
    summary.at[f.__name__, 'Large'] = timeit(lambda: test_pml(df_large, f), number=10)
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Try comparing your alternatives with the [`timeit`](https://docs.python.org/3.4/library/timeit.html#module-timeit) module. – mhawke Jul 27 '16 at 00:41
  • 1
    @mhawke I plan on it. That's how I'll decide the answer. – piRSquared Jul 27 '16 at 00:42
  • So which of your 2 alternatives is faster, so far? – mhawke Jul 27 '16 at 00:42
  • I'm working on the test. I have to randomize the column order and test many trials as well over different sized datasets. I'll have one tonight. – piRSquared Jul 27 '16 at 00:45
  • @mhawke the one answer so far only benefits if the column is last already. In a randomized setting, it will get that benefit 1 / len(columns) of the time. depending on the number of columns, the benefit might not out weigh the cost of checking. Anyway, I'll build a proper test. – piRSquared Jul 27 '16 at 00:47
  • @No, something broke in your function and I fell asleep trying to fix it ;-) . Oh yeah, it failed on `insert(9999` – piRSquared Jul 27 '16 at 16:58

5 Answers5

3

Well, the first (and, depending on your use case, most efficient) optimization is to ensure at first that you don't have to rearrange the DataSet. If the column you want to be the last one is already in its place, then you can just return the df unchanged. Try this one:

def put_me_last2(df, column):
    if list(df)[-1] == column:
        return df
    else: return pd.concat([df.drop(column, axis=1), df[column]], axis=1)

I've tried it with 8 million entries instead of the 8 from your example, and the speed was about the same when I demanded column b as the last one, and 300 times faster (500us vs 150ms) when I wanted the last column to be d (i.e. the case without reordering).

This won't help you if you have lots of columns or usually want to rearrange columns, but it doesn't hurt, either.

Update:

I've found a faster method: Dont drop and re-add a column, but use df[cols] with the wanted list of columns. Gives me about 40% speedup (90ms vs 150ms with 8 million entries).

def put_me_last3(df, column):
    cols = list(df)
    if cols[-1] == column:
        return df
    else:
        idx = cols.index(column)
        new_cols = cols[:idx] + cols[idx + 1:] + [column]
        return df[new_cols]
Carsten
  • 17,991
  • 4
  • 48
  • 53
  • I'd edit that to be `if df.columns[-1] == column:`. But, yeah that's a good tip. – piRSquared Jul 27 '16 at 00:40
  • Both of them work the same way. One of them might be a few microseconds faster. If you test it, please add the results to my answer or your question, I'd be interested. I have added an additional method for a 40% speedup. – Carsten Jul 27 '16 at 00:53
3

I would rearrange the list of columns, rather than removing and appending one of them:

import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(8).reshape(2, 4), columns=list('abcd'))

def put_me_last(df, column):
    return pd.concat([df.drop(column, axis=1), df[column]], axis=1)

def put_me_last_fast(df, column):
    new_cols = [c for c in df.columns if c != column] + [column]
    return df[new_cols]

def put_me_last_faster(df, column):
    new_cols = [c for c in df.columns if c != column] + [column]
    return df.reindex_axis(new_cols, axis='columns', copy=False)

Timings (in iPython):

%timeit put_me_last(df, 'b')
# 1000 loops, best of 3: 741 µs per loop

%timeit put_me_last_fast(df, 'b')
# 1000 loops, best of 3: 295 µs per loop

%timeit put_me_last_faster(df, 'b')
# 1000 loops, best of 3: 239 µs per loop

%timeit put_me_last_faster(df, 'd')  # not changing order
# 1000 loops, best of 3: 125 µs per loop

Note: you could use the line below to define new_cols, but it's about 80x slower than the one used above (2 µs vs 160 µs)

new_cols = df.columns.drop(column).insert(-1, column)

Also note: if you often try to move a column to the end that is already there, you can cut the time for those cases to below 1 µs by adding this, as noted by @Carsten:

if df.columns[-1] == column:
    return df
Merlin
  • 24,552
  • 41
  • 131
  • 206
Matthias Fripp
  • 17,670
  • 5
  • 28
  • 45
2

How about this one:

df.reindex(columns=df.columns.drop(col).append(pd.Index([col])))

(.append([col]) doesn't work - might be a bug. Edit: use of .append(pd.Index([col]) is probably the safest option in append.)

Comment on testing: If you plan a test with timeit, try running it on a large df (like 1e4 rows or more) and probably with -n1 -r1 to prevent caching.

ptrj
  • 5,152
  • 18
  • 31
1

It's not the fastest though:

def put_me_last(df,column):
    column_list=list(df)
    column_list.remove(column)
    column_list.append(column)
    return df[column_list]  



%timeit put_me_last(df,'b')
1000 loops, best of 3: 391 µs per loop
shivsn
  • 7,680
  • 1
  • 26
  • 33
0

Starting with this:

 df.columns
 Index([u'a', u'b', u'c', u'd'], dtype='object')

Dont do this, looks like a bug.

 df.columns.drop(["b"]).insert(-1, 'b')
 Index([u'a', u'c', u'b', u'd'], dtype='object')

 df.columns.drop(["b"]).insert(-1, 'x')
 Index([u'a', u'c', u'x', u'd'], dtype='object')

WORK Around:

 df.columns.drop(["b"]).insert(99999, 'b')
 Index([u'a', u'c', u'd', u'b'], dtype='object')
Merlin
  • 24,552
  • 41
  • 131
  • 206
  • IMO, it's not a bug it's a standard behavior of Python's `list.insert()` method. Try this instead: `df.columns.drop('b').insert(len(df.columns)-1, 'b')` – MaxU - stand with Ukraine Jul 27 '16 at 03:12
  • @Merlin: good point. It behaves this way by design, but it's surprising if you're hoping to insert an item at the end of an index. It's annoying that df.Index.insert() works like list.insert(), but df.Index.append() doesn't work like list.append(). This means there's no totally natural way to add an item to the end of an index (have to use a big number or retrieve the length of the index). – Matthias Fripp Jul 28 '16 at 23:46