
I'm looking to find the maximum run of consecutive zeros in a DataFrame, with the result grouped by user. I'm interested in running RLE on the usage column.

sample input:

user  day  usage
A     1    0
A     2    0
A     3    1
B     1    0
B     2    1
B     3    0

Desired output

user  longest_run
A     2
B     1

mydata <- mydata[order(mydata$user, mydata$day),]
user <- unique(mydata$user)
d2 <- data.frame(matrix(NA, ncol = 2, nrow = length(user)))
names(d2) <- c("user", "longest_no_usage")
d2$user <- user
for (i in user) {
  if (0 %in% mydata$usage[mydata$user == i]) {
    run <- rle(mydata$usage[mydata$user == i]) #Run Length Encoding
    d2$longest_no_usage[d2$user == i] <- max(run$length[run$values == 0])
  } else {
    d2$longest_no_usage[d2$user == i] <- 0 #some users did not have no-usage days
  }
}
d2 <- d2[order(-d2$longest_no_usage),]

This works in R, but I want to do the same thing in Python and I'm totally stumped.
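For reference, Python has no built-in equivalent of R's `rle`, but a minimal sketch using `itertools.groupby` produces the same encoding (the function name `rle` here is just illustrative):

```python
from itertools import groupby

def rle(values):
    """Run-length encode a sequence, like R's rle():
    returns parallel lists of run lengths and run values."""
    lengths, vals = [], []
    for val, run in groupby(values):
        vals.append(val)
        lengths.append(sum(1 for _ in run))
    return lengths, vals

lengths, vals = rle([0, 0, 1, 0, 0, 0])
# lengths == [2, 1, 3], vals == [0, 1, 0]
# longest run of zeros = max length where the value is 0
longest = max(l for l, v in zip(lengths, vals) if v == 0)
# longest == 3
```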


4 Answers


First, use `groupby` with `size` on the `user` and `usage` columns, plus a helper Series that labels runs of consecutive values:

print (df)
  user  day  usage
0    A    1      0
1    A    2      0
2    A    3      1
3    B    1      0
4    B    2      1
5    B    3      0
6    C    1      1


df1 = (df.groupby([df['user'], 
                   df['usage'].rename('val'), 
                   df['usage'].ne(df['usage'].shift()).cumsum()])
        .size()
        .to_frame(name='longest_run'))

print (df1)
                longest_run
user val usage             
A    0   1                2
     1   2                1
B    0   3                1
         5                1
     1   4                1
C    1   6                1

Then filter only the zero rows, take the max per user, and `reindex` to append users with no zero runs:

df2 = (df1.query('val == 0')
          .groupby(level=0).max()
          .reindex(df['user'].unique(), fill_value=0)
          .reset_index())
print (df2)
  user  longest_run
0    A            2
1    B            1
2    C            0

Detail:

print (df['usage'].ne(df['usage'].shift()).cumsum())
0    1
1    1
2    2
3    3
4    4
5    5
6    6
Name: usage, dtype: int32
jezrael
  • Excellent, I've learnt a lot from your numerous `pandas` answers in SO and it must be rubbing off because I was damn close to the same answer. Only difference I used `count().rename(...)` instead of `size().to_frame(...)` the latter being better (and I didn't think to `reindex()`). Is there any advantage to using `.ne()` over `!=` for the `cumsum()`? – AChampion Jul 31 '18 at 06:17
  • @AChampion - Thank you. Some time ago I timed `ne` vs `!=` and `.ne` was the winner, so I use it (when I don't forget); it also lets some `()` be dropped, so the code is nicer. `.count` is another solution here, I agree, but because it counts non-`NaN` values I prefer `.size`; in this case `.size` and `.count` happen to give the same result – jezrael Jul 31 '18 at 06:21
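As an aside on the comment thread: the `.size` vs `.count` distinction only matters once `NaN`s appear in the aggregated column. A minimal sketch (the data here is invented for illustration):

```python
import numpy as np
import pandas as pd

df_nan = pd.DataFrame({"user": ["A", "A", "B"],
                       "usage": [1.0, np.nan, 0.0]})

# size counts all rows per group; count skips NaN
sizes = df_nan.groupby("user")["usage"].size()    # A -> 2, B -> 1
counts = df_nan.groupby("user")["usage"].count()  # A -> 1, B -> 1
```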

Get the maximum number of consecutive zeros in a Series:

def max0(sr):
    # label runs with a cumulative id, keep the ids where sr == 0,
    # then count how long each zero run is
    zero_runs = (sr != 0).cumsum()[sr == 0].value_counts()
    return zero_runs.max() if len(zero_runs) else 0


max0(pd.Series([1,0,0,0,0,2,3]))

4

RELW

I think the following does what you are looking for, where the consecutive_zero function is an adaptation of the top answer here.

Hope this helps!

import pandas as pd
from itertools import groupby

df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0],['B',1],['C',2]], 
                  columns=["user", "usage"])

def len_iter(items):
    return sum(1 for _ in items)

def consecutive_zero(data):
    x = list((len_iter(run) for val, run in groupby(data) if val==0))
    if len(x)==0: return 0 
    else: return max(x)

df.groupby('user').apply(lambda x: consecutive_zero(x['usage']))

Output:

user
A    2
B    1
C    0
dtype: int64
Florian
  • Thanks! If I wanted to make the output of the groupby method into a data frame with columns user and the output how would I make that change? – Logan Brooks Aug 01 '18 at 19:04
  • You could use `aggregate` instead of `apply`, i.e. `df = df.groupby('user').agg(consecutive_zero).reset_index()` – Florian Aug 02 '18 at 06:20
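Following up on the comment thread, one way to get a proper DataFrame with the user column and a named result column (the column name `longest_run` matches the asker's desired output) is `reset_index(name=...)` on the `apply` result:

```python
import pandas as pd
from itertools import groupby

def consecutive_zero(data):
    runs = [sum(1 for _ in run) for val, run in groupby(data) if val == 0]
    return max(runs) if runs else 0

df = pd.DataFrame([['A', 1], ['A', 0], ['A', 0], ['B', 0], ['B', 1], ['C', 2]],
                  columns=["user", "usage"])

out = (df.groupby('user')['usage']
         .apply(consecutive_zero)
         .reset_index(name='longest_run'))
#   user  longest_run
# 0    A            2
# 1    B            1
# 2    C            0
```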

If you have a large dataset and speed is essential, you might want to try the high-performance pyrle library.

Setup:

# pip install pyrle
# or 
# conda install -c bioconda pyrle
import numpy as np
np.random.seed(0)
import pandas as pd
from pyrle import Rle
size = int(1e7)
number = np.random.randint(2, size=size)
user = np.random.randint(5, size=size)
df = pd.DataFrame({"User": np.sort(user), "Number": number})
df
#          User  Number
# 0           0       0
# 1           0       1
# 2           0       1
# 3           0       0
# 4           0       1
# ...       ...     ...
# 9999995     4       1
# 9999996     4       1
# 9999997     4       0
# 9999998     4       0
# 9999999     4       1
# 
# [10000000 rows x 2 columns]

Execution:

for u, udf in df.groupby("User"):
    r = Rle(udf.Number)
    is_0 = r.values == 0
    print("User", u, "Max", np.max(r.runs[is_0]))
# (Wall time: 1.41 s)


# User 0 Max 20
# User 1 Max 23
# User 2 Max 20
# User 3 Max 22
# User 4 Max 23
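For comparison, here is a plain NumPy sketch (no pyrle) of the same per-user maximum, which finds zero runs by locating their boundaries with `diff`; the helper name `max_zero_run` and the small demo frame are illustrative:

```python
import numpy as np
import pandas as pd

def max_zero_run(arr):
    # pad with nonzeros so every zero run has a boundary on both sides
    a = np.concatenate(([1], arr, [1]))
    is_zero = (a == 0).astype(int)
    d = np.diff(is_zero)           # 1 at run starts, -1 just past run ends
    starts = np.flatnonzero(d == 1)
    ends = np.flatnonzero(d == -1)
    return int((ends - starts).max()) if len(starts) else 0

demo = pd.DataFrame({"User": ["a", "a", "a", "b", "b"],
                     "Number": [0, 0, 1, 0, 1]})
result = demo.groupby("User")["Number"].apply(lambda s: max_zero_run(s.to_numpy()))
# result["a"] == 2, result["b"] == 1
```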
The Unfun Cat