
This question is similar to Split (explode) pandas dataframe string entry to separate rows, but it also asks how to expand ranges.

I have a DataFrame:

+------+---------+----------------+
| Name | Options | Email          |
+------+---------+----------------+
| Bob  | 1,2,4-6 | bob@email.com  |
+------+---------+----------------+
| John |   NaN   | john@email.com |
+------+---------+----------------+
| Mary |   1,2   | mary@email.com |
+------+---------+----------------+
| Jane | 1,3-5   | jane@email.com |
+------+---------+----------------+

I'd like the Options column to be split on the commas, with each range expanded into one row per value:

+------+---------+----------------+
| Name | Options | Email          |
+------+---------+----------------+
| Bob  | 1       | bob@email.com  |
+------+---------+----------------+
| Bob  | 2       | bob@email.com  |
+------+---------+----------------+
| Bob  | 4       | bob@email.com  |
+------+---------+----------------+
| Bob  | 5       | bob@email.com  |
+------+---------+----------------+
| Bob  | 6       | bob@email.com  |
+------+---------+----------------+
| John | NaN     | john@email.com |
+------+---------+----------------+
| Mary | 1       | mary@email.com |
+------+---------+----------------+
| Mary | 2       | mary@email.com |
+------+---------+----------------+
| Jane | 1       | jane@email.com |
+------+---------+----------------+
| Jane | 3       | jane@email.com |
+------+---------+----------------+
| Jane | 4       | jane@email.com |
+------+---------+----------------+
| Jane | 5       | jane@email.com |
+------+---------+----------------+

How can I go beyond the concat-and-split approach from the referenced answer to accomplish this? I need a way to expand a range.
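
For clarity, here's a throwaway sketch of the kind of expansion I'm after (the name expand is just for illustration):

def expand(options):
    # "1,2,4-6" -> [1, 2, 4, 5, 6]
    out = []
    for token in options.split(','):
        lo, _, hi = token.partition('-')
        # a lone number yields itself; "4-6" yields 4, 5, 6
        out.extend(range(int(lo), int(hi or lo) + 1))
    return out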

That article uses the following code to split comma-delimited values (1,2,3):

In [7]: a
Out[7]: 
    var1  var2
0  a,b,c     1
1  d,e,f     2

In [55]: pd.concat([pd.Series(row['var2'], row['var1'].split(','))
                    for _, row in a.iterrows()]).reset_index()
Out[55]:
  index  0
0     a  1
1     b  1
2     c  1
3     d  2
4     e  2
5     f  2

Thanks in advance for your suggestions!

Update 2/14: Sample data was updated to match my current case.

kabaname

4 Answers


If I understand what you need:

def yourfunc(s):
    # split on commas, then split each token on "-"
    ranges = (x.split("-") for x in s.split(","))
    # a bare number n becomes range(n, n + 1); "4-6" becomes range(4, 7)
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]

df.Options = df.Options.apply(yourfunc)

df
Out[114]: 
   Name          Options           Email
0   Bob  [1, 2, 4, 5, 6]   bob@email.com
1  Jane     [1, 3, 4, 5]  jane@email.com


df.set_index(['Name', 'Email']).Options.apply(pd.Series) \
  .stack().reset_index().drop('level_2', 1)
Out[116]: 
   Name           Email    0
0   Bob   bob@email.com  1.0
1   Bob   bob@email.com  2.0
2   Bob   bob@email.com  4.0
3   Bob   bob@email.com  5.0
4   Bob   bob@email.com  6.0
5  Jane  jane@email.com  1.0
6  Jane  jane@email.com  3.0
7  Jane  jane@email.com  4.0
8  Jane  jane@email.com  5.0
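
Per the comments below, the updated sample also has a NaN row, and yourfunc will raise on it (a float has no .split). A minimal sketch of a NaN-tolerant variant (yourfunc_safe is a name made up here, not part of the original answer):

import numpy as np

def yourfunc_safe(s):
    # pass NaN (or anything non-string) through as a single-element list
    if not isinstance(s, str):
        return [np.nan]
    ranges = (x.split("-") for x in s.split(","))
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]

Note that stack() drops NaN by default, so keeping the NaN rows through the reshape would need stack(dropna=False).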
BENY
  • Just a question, why does the Options column turn to float? – jpp Feb 13 '18 at 22:48
  • @jp_data_analysis when applying pd.Series there are np.nan values; after stack (even though np.nan is removed by default), it turns into float :-) – BENY Feb 13 '18 at 22:52

I like using np.r_ and slice.
I know it looks like a mess, but beauty is in the eye of the beholder.

def parse(o):
    # inclusive slice spanning min..max of the parsed endpoints
    mm = lambda i: slice(min(i), max(i) + 1)
    # one slice per comma-separated token, handed as a tuple
    # to np.r_'s __getitem__
    return np.r_.__getitem__(tuple(
        mm(list(map(int, s.strip().split('-')))) for s in o.split(',')
    ))

r = df.Options.apply(parse)      # expanded options per row
new = np.concatenate(r.values)   # one flat array of all options
lens = r.str.len()               # how many output rows each input row needs

df.loc[df.index.repeat(lens)].assign(Options=new)

   Name  Options           Email
0   Bob        1   bob@email.com
0   Bob        2   bob@email.com
0   Bob        4   bob@email.com
0   Bob        5   bob@email.com
0   Bob        6   bob@email.com
2  Mary        1  mary@email.com
2  Mary        2  mary@email.com
3  Jane        1  jane@email.com
3  Jane        3  jane@email.com
3  Jane        4  jane@email.com
3  Jane        5  jane@email.com

Explanation

  • np.r_ takes different slicers and indexers and returns an array of the combination.

    np.r_[1, 4:7]
    array([1, 4, 5, 6])
    

    or

    np.r_[slice(1, 2), slice(4, 7)]
    array([1, 4, 5, 6])
    

    But if I need to pass an arbitrary bunch of them, I need to pass a tuple to np.r_'s __getitem__ method.

    np.r_.__getitem__((slice(1, 2), slice(4, 7), slice(10, 14)))
    array([ 1,  4,  5,  6, 10, 11, 12, 13])
    

    So I iterate, parse, make slices, and pass the tuple to np.r_.__getitem__.

  • Use a combo of loc, pd.Index.repeat, and pd.Series.str.len after applying my cool parser (see the short sketch after this list)

  • Use pd.DataFrame.assign to overwrite the existing column
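
To make the repeat step concrete, a toy sketch (the frame and lengths are invented for illustration):

import pandas as pd

toy = pd.DataFrame({'a': ['x', 'y']})
lens = [2, 3]                     # pretend row 0 expands to 2 values, row 1 to 3
toy.loc[toy.index.repeat(lens)]   # index becomes 0, 0, 1, 1, 1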

NOTE
If you have bad characters in your Options column, I'd try to filter like this:

df = df.dropna(subset=['Options']).astype(dict(Options=str)) \
       .replace(dict(Options={r'[^0-9,\-]': ''}), regex=True) \
       .query('Options != ""')
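
Putting the filtering and the parser together, a sketch of the full pipeline under the assumption that the NaN rows can be dropped:

clean = df.dropna(subset=['Options']).astype(dict(Options=str)) \
          .replace(dict(Options={r'[^0-9,\-]': ''}), regex=True) \
          .query('Options != ""')

r = clean.Options.apply(parse)
clean.loc[clean.index.repeat(r.str.len())].assign(Options=np.concatenate(r.values))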
piRSquared
  • Interesting! Thanks for explaining the solution. When mm() is called, I'm getting an AttributeError where "'float' object has no attribute 'split'". Any idea why that could be? – kabaname Feb 14 '18 at 12:54
  • I'm guessing it could be due to null values in my column. – kabaname Feb 14 '18 at 12:57
  • that is definitely why. dropna first – piRSquared Feb 14 '18 at 17:14
  • so I did that (df_2 = df[df['Options'].notnull()]) and I find myself with a ValueError (ValueError: invalid literal for int() with base 10: '') for r = df_2.Options... - I think that might be because you can't cast a string as a float? Or am I mistaken? – kabaname Feb 14 '18 at 17:17
  • That means there are other characters in the options strings that you didn't represent in the sample data. Try `int('^')` to reproduce the error. If you pasted the entire error then you must have non-printing characters in there. You can try stripping them out first. – piRSquared Feb 14 '18 at 17:22
  • thanks - that worked. Now on df.loc I have a "KeyError : 0". [Full Traceback](https://pastebin.com/u4k4mDWw) I looked up key errors, but it's unclear to me how to fix that line of code. – kabaname Feb 14 '18 at 17:51
  • Key error means that you are trying to access a key (in this case, an index value) that isn't there. The way I set it up, that should be impossible (famous last words). However, if you tried my NOTE, make sure to do that filtering prior to anything else. Then proceed with the steps laid out in my solution. – piRSquared Feb 14 '18 at 18:18
  • Ahh I see, that's a helpful explanation, thanks. So, in my data, the `r` variable has no index of 0 because 0-4 were NaN and dropped. It seems to me that's why it's failing with a KeyError, according to your last explanation? I've updated my sample data to reflect other cases that I neglected to show – kabaname Feb 14 '18 at 18:41
  • @kabaname adjusted answer accordingly – piRSquared Feb 14 '18 at 18:49

Start with a custom replacement function:

def replace(x):
    # x is a regex match like "4-6"; expand it to "4,5,6"
    i, j = map(int, x.groups())
    return ','.join(map(str, range(i, j + 1)))
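
As a quick standalone sanity check of what replace does (using the stdlib re module, outside the answer's pipeline):

import re

re.sub(r'(\d+)-(\d+)', replace, '1,2,4-6')   # -> '1,2,4,5,6'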

Store the column names somewhere, we'll use them later:

c = df.columns

Next, replace items in df.Options, then split on a comma:

v = df.Options.str.replace(r'(\d+)-(\d+)', replace).str.split(',')

Next, reshape your data and finally load into a new dataframe:

df = pd.DataFrame(
       # repeat each row's remaining values once per expanded option
       df.drop('Options', 1).values.repeat(v.str.len(), axis=0)
)
# reinsert the expanded options at the original column position
df.insert(c.get_loc('Options'), len(c) - 1, np.concatenate(v))
df.columns = c

df

   Name Options           Email
0   Bob       1   bob@email.com
1   Bob       2   bob@email.com
2   Bob       4   bob@email.com
3   Bob       5   bob@email.com
4   Bob       6   bob@email.com
5  Jane       1  jane@email.com
6  Jane       3  jane@email.com
7  Jane       4  jane@email.com
8  Jane       5  jane@email.com
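
Per the comment thread below, NaN values in Options break the string operations here. A sketch of one workaround (assumption: the NaN rows are set aside and re-appended; the reset_index guards against the KeyError mentioned in the comments):

mask = df.Options.notnull()
good = df[mask].reset_index(drop=True)   # rows to expand with the steps above
nans = df[~mask]                         # rows to keep untouched

# ... run the steps above on `good`, producing `expanded` ...
# result = pd.concat([expanded, nans], ignore_index=True)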
cs95
  • First of all, thanks for a super creative solution! I'm getting a TypeError on the df.drop line. Full error is: "Cannot cast array data from dtype('float64') to dtype('int64') according to the rule 'safe'" – kabaname Feb 14 '18 at 12:48
  • @kabaname it appears df.Options has NaNs. I'm not currently at a keyboard but when I am, I can try to provide you with a solution that handles this. Can you just drop the rows? That will simplify the solution. – cs95 Feb 14 '18 at 12:55
  • ah unfortunately I don't think I can drop the rows, as the empty cells still have information in the other columns that I'd like to keep. Sorry, I should've indicated that in the OP. Do you think I could create a new notnull() DF, perform the operations, and add it back in? – kabaname Feb 14 '18 at 13:23
  • Okay, using the NOTE from @piRSquared I was able to create the df. However, when I get to df.insert I get a KeyError: 0. Is that because all NaN rows were dropped? For example, if I have a DF with index [0,1,2,3] and rows 0,1 were dropped because their `Options` column was `NaN`, does that mean that when I try to do a df.insert (to put the df back to replace the original values) it will fail since there technically is no 0? – kabaname Feb 14 '18 at 18:29

Here is one solution. While it's not pretty (minimal use of pandas), it is fairly efficient.

import itertools, pandas as pd, numpy as np; concat = itertools.chain.from_iterable

def ranger(mystr):
    # expand each comma-separated token: "1" -> [1]; "4-6" -> [4, 5, 6]
    return list(concat([int(i)] if '-' not in i else
                list(range(int(i.split('-')[0]), int(i.split('-')[-1])+1))
                for i in mystr.split(',')))

df = pd.DataFrame([['Bob', '1,2,4-6', 'bob@email.com'],
                   ['Jane', '1,3-5', 'jane@email.com']],
                  columns=['Name', 'Options', 'Email'])

df['Options'] = df['Options'].map(ranger)

lens = list(map(len, df['Options']))

df_out = pd.DataFrame({'Name': np.repeat(df['Name'].values, lens),
                       'Email': np.repeat(df['Email'].values, lens),
                       'Option': np.hstack(df['Options'].values)})

#             Email  Name  Option
# 0   bob@email.com   Bob       1
# 1   bob@email.com   Bob       2
# 2   bob@email.com   Bob       4
# 3   bob@email.com   Bob       5
# 4   bob@email.com   Bob       6
# 5  jane@email.com  Jane       1
# 6  jane@email.com  Jane       3
# 7  jane@email.com  Jane       4
# 8  jane@email.com  Jane       5

Benchmarking of 4 solutions below (for interest only).

As a general rule, the repeat-based solutions outperform. In addition, solutions that build new dataframes from scratch (as opposed to using apply) do better. Dropping down to numpy gives the best results.

import itertools, pandas as pd, numpy as np; concat = itertools.chain.from_iterable

def ranger(mystr):
    return list(concat([int(i)] if '-' not in i else \
                list(range(int(i.split('-')[0]), int(i.split('-')[-1])+1)) \
                for i in mystr.split(',')))

def replace(x):
    i, j = map(int, x.groups())
    return ','.join(map(str, range(i, j + 1)))

def yourfunc(s):
    ranges = (x.split("-") for x in s.split(","))
    return [i for r in ranges for i in range(int(r[0]), int(r[-1]) + 1)]

def parse(o):
    mm = lambda i: slice(min(i), max(i) + 1)
    return np.r_.__getitem__(tuple(mm(list(map(int, s.strip().split('-')))) for s in o.split(',')))

df = pd.DataFrame([['Bob', '1,2,4-6', 'bob@email.com'],
                   ['Jane', '1,3-5', 'jane@email.com']],
                  columns=['Name', 'Options', 'Email'])

df = pd.concat([df]*1000, ignore_index=True)

def explode_jp(df):
    df['Options'] = df['Options'].map(ranger)
    lens = list(map(len, df['Options']))
    df_out = pd.DataFrame({'Name': np.repeat(df['Name'].values, lens),
                           'Email': np.repeat(df['Email'].values, lens),
                           'Option': np.hstack(df['Options'].values)})
    return df_out

def explode_cs(df):
    c = df.columns
    v = df.Options.str.replace(r'(\d+)-(\d+)', replace).str.split(',')
    df_out = pd.DataFrame(df.drop('Options', 1).values.repeat(v.str.len(), axis=0))
    df_out.insert(c.get_loc('Options'), len(c) - 1, np.concatenate(v))
    df_out.columns = c
    return df_out

def explode_wen(df):
    df.Options=df.Options.apply(yourfunc)
    df_out = df.set_index(['Name','Email']).Options.apply(pd.Series).stack().reset_index().drop('level_2',1)
    return df_out

def explode_pir(df):
    r = df.Options.apply(parse)
    df_out = df.loc[df.index.repeat(r.str.len())].assign(Options=np.concatenate(r))
    return df_out

%timeit explode_jp(df.copy())   # 32.7 ms ± 1.54 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit explode_cs(df.copy())   # 90.6 ms ± 2.07 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit explode_wen(df.copy())  # 675 ms ± 12.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit explode_pir(df.copy())  # 163 ms ± 1.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
jpp
  • Your timings are unfair because your answer caters to this specific case, but will fail for any other number of columns. Generalising as I've done is not easy and certainly hurts performance. :) – cs95 Feb 14 '18 at 03:01
  • @COLDSPEED, you are right, and that's why I upvoted your answer. But I did these timings for my own education; you can still take away a lot from the performance of 4 very different answers. – jpp Feb 14 '18 at 03:05