0

I'm loading data in pandas, whereas the column date contains the datetime values, e.g.:

   date    ; .....more stuff ......
2000-01-03 ; 
2000-01-04 ;
2000-01-06 ;
...
2000-01-31 ;
2000-02-01 ;
2000-02-02 ;
2000-02-04 ;

I have a function to add a column containing the weekday-indices (0-6):

def genWeekdays(df,src='date',target='weekday'):
    """
    bla bla bla
    """
    df[target] = df[src].apply(lambda x: x.weekday())
    return df

calling it via

df = genWeekdays(df)

df has about a million rows and this takes about 1.3secs. Any way to speed this up? Im little surprised on how long this takes on my i7-4770k :( Thanks in advance

Veedrac
  • 58,273
  • 15
  • 112
  • 169
tim
  • 9,896
  • 20
  • 81
  • 137
  • Because you're passing a Python function, there's no surprise that it's going to take a long time. – Veedrac Jun 10 '14 at 13:45
  • what are you trying to do, best to show the expected output. – Jeff Jun 10 '14 at 13:46
  • Is it faster if you just add the column: `df[target] = df[src].apply(lambda x: x.weekday())` without creating a new df and returning it? – EdChum Jun 10 '14 at 13:47
  • 1
    @EdChum: as far as I understand, this won't modify the original `df`. @Veedrac: Thanks, yeah I see, but that's the only way I found to add a column containing the weekdays, dunno how to convert this differently. @Jeff: What's unclear about it? In my opinion I stated it clearly: I want to add a column. This column should contain the weekday-index (0=Monday, 6=Sunday) for the corresponding `date` of this row. – tim Jun 10 '14 at 13:49
  • 4
    You should always post a simple exmaple that can be copy-pasted directly for the setup code (e.g. your frame is just a description, not actual code, its VERY clear this way). Further you should show a sample code of your output frame (or if you can't what it should look like). See below for trivial soln. – Jeff Jun 10 '14 at 13:53

1 Answers1

4
In [30]: df = DataFrame(dict(date = pd.date_range('20000101',periods=100000,freq='s'), value = np.random.randn(100000)))

In [31]: df['weekday'] = pd.DatetimeIndex(df['date']).weekday

In [32]: %timeit pd.DatetimeIndex(df['date']).weekday
10 loops, best of 3: 34.9 ms per loop

In [33]: df
Out[33]: 
                     date     value

In [33]: df
Out[33]: 
                     date     value  weekday
0     2000-01-01 00:00:00 -0.046604        5
1     2000-01-01 00:00:01 -1.691611        5
2     2000-01-01 00:00:02  0.416015        5
3     2000-01-01 00:00:03  0.054822        5
4     2000-01-01 00:00:04 -0.661163        5
5     2000-01-01 00:00:05  0.274402        5
6     2000-01-01 00:00:06 -0.426533        5
7     2000-01-01 00:00:07  0.028769        5
8     2000-01-01 00:00:08  0.248581        5
9     2000-01-01 00:00:09  1.302145        5
10    2000-01-01 00:00:10 -1.886830        5
11    2000-01-01 00:00:11  2.276506        5
12    2000-01-01 00:00:12  0.054104        5
13    2000-01-01 00:00:13  0.378990        5
14    2000-01-01 00:00:14  0.868879        5
15    2000-01-01 00:00:15 -0.046810        5
16    2000-01-01 00:00:16 -0.499447        5
17    2000-01-01 00:00:17  1.067412        5
18    2000-01-01 00:00:18 -1.625986        5
19    2000-01-01 00:00:19  0.515884        5
20    2000-01-01 00:00:20 -1.884882        5
21    2000-01-01 00:00:21  0.943775        5
22    2000-01-01 00:00:22  0.034501        5
23    2000-01-01 00:00:23  0.438170        5
24    2000-01-01 00:00:24 -1.211937        5
25    2000-01-01 00:00:25 -0.229930        5
26    2000-01-01 00:00:26  0.938805        5
27    2000-01-01 00:00:27  0.026815        5
28    2000-01-01 00:00:28  2.166740        5
29    2000-01-01 00:00:29 -0.096927        5
...                   ...       ...      ...
99970 2000-01-02 03:46:10 -0.310023        6
99971 2000-01-02 03:46:11  0.561321        6
99972 2000-01-02 03:46:12  2.207426        6
99973 2000-01-02 03:46:13 -0.253933        6
99974 2000-01-02 03:46:14 -0.711145        6
99975 2000-01-02 03:46:15 -0.477377        6
99976 2000-01-02 03:46:16  1.492970        6
99977 2000-01-02 03:46:17  0.308510        6
99978 2000-01-02 03:46:18  0.126579        6
99979 2000-01-02 03:46:19 -1.704093        6
99980 2000-01-02 03:46:20 -0.328285        6
99981 2000-01-02 03:46:21  1.685411        6
99982 2000-01-02 03:46:22 -0.368899        6
99983 2000-01-02 03:46:23  0.915786        6
99984 2000-01-02 03:46:24 -1.694855        6
99985 2000-01-02 03:46:25 -1.488130        6
99986 2000-01-02 03:46:26 -1.274004        6
99987 2000-01-02 03:46:27 -1.508376        6
99988 2000-01-02 03:46:28  0.551695        6
99989 2000-01-02 03:46:29  0.007957        6
99990 2000-01-02 03:46:30 -0.214852        6
99991 2000-01-02 03:46:31 -1.390088        6
99992 2000-01-02 03:46:32 -0.472137        6
99993 2000-01-02 03:46:33 -0.969515        6
99994 2000-01-02 03:46:34  1.129802        6
99995 2000-01-02 03:46:35 -0.291428        6
99996 2000-01-02 03:46:36  0.337134        6
99997 2000-01-02 03:46:37  0.989259        6
99998 2000-01-02 03:46:38  0.705592        6
99999 2000-01-02 03:46:39 -0.311884        6

[100000 rows x 3 columns]
Jeff
  • 125,376
  • 21
  • 220
  • 187
  • thanks, working. Decreased runtime from 1.3secs to 0.3secs for my 800k rows which is quite nice. Thanks – tim Jun 10 '14 at 13:55
  • 1
    use ``apply`` ONLY as a last resort (e.g. you can't do vectorized things). even if you have a very complicated function to do, you can often do vectorized calculations on most of it, saving the last for ``apply``, which is essentially a loop. – Jeff Jun 10 '14 at 13:56
  • 1
    Using apply took 172ms versus 39ms using Jeff's method, I can also confirm that it made negligle difference whether the apply was called inside or outside the function but it does modify the df so you didn't need to return the df as it was being modified inside the function – EdChum Jun 10 '14 at 14:00
  • 1
    @Jeff: thanks for your clarifications once again, very glad you helped me with it. As a new user like me to python/pandas, the problem mostly is, that I can only search/google for solutions as the libraries contain so many classes and functions that I don't know what to look for (in this case the `DatetimeIndex` class). And then sometimes different solutions (in this case using `apply`) come up on google/stackoverflow and yet again I can NOT verify that there is no better solution as I dont have the insight into the library. But I keep learning heavily each day, thanks :) – tim Jun 10 '14 at 14:02
  • 1
    np. docs are pretty comprehensive on using the date-time functionaility, as well as lots of methods/properties in the various objects, so use ipython tab-completion. – Jeff Jun 10 '14 at 14:03
  • Yeah if you know what to look for, true. But if you have no clue, then it gets a little difficult and mostly time consuming until you find **the real** (i.e. fast/good) solution. But yes, using ipython is a nice idea I didn't come up with yet. Will try it. – tim Jun 10 '14 at 14:04
  • yep...docs are big! unfortunately you try to document everything, but then organizing becomes harder. – Jeff Jun 10 '14 at 14:07