
I am trying to process a dataframe. This includes creating new columns and updating their values based on the values in other columns. More concretely, I have a predefined "source" that I want to classify. This source can fall under three different categories: 'source_dtp', 'source_dtot', and 'source_cash'. I want to add three new columns to the dataframe, consisting of either 1s or 0s, based on the original "source" column.

I am currently able to do this; it's just really slow...

Original column sample:

source
_id                     
AV4MdG6Ihowv-SKBN_nB    DTP
AV4Mc2vNhowv-SKBN_Rn    Cash 1
AV4MeisikOpWpLdepWy6    DTP
AV4MeRh6howv-SKBOBOn    Cash 1
AV4Mezwchowv-SKBOB_S    DTOT
AV4MeB7yhowv-SKBOA5b    DTP

Desired output:

source_dtp  source_dtot source_cash
_id         
AV4MdG6Ihowv-SKBN_nB    1.0 0.0 0.0
AV4Mc2vNhowv-SKBN_Rn    0.0 0.0 1.0
AV4MeisikOpWpLdepWy6    1.0 0.0 0.0
AV4MeRh6howv-SKBOBOn    0.0 0.0 1.0
AV4Mezwchowv-SKBOB_S    0.0 1.0 0.0
AV4MeB7yhowv-SKBOA5b    1.0 0.0 0.0

This is my current approach, but it's very slow. I would much prefer a vectorized way of doing this, but I don't know how, as the condition is quite elaborate.

import numpy as np

# For 'source' we will use the following classes:
source_cats = ['source_dtp', 'source_dtot', 'source_cash']
# [0, 0, 0] would imply 'other', hence no need for a fourth category

# add new features to dataframe, initializing to nan
for cat in source_cats:
    data[cat] = np.nan

for row in data.itertuples():
    # list to hold the result for this row, e.g. [0, 0, 1] for 'cash'
    cat = [0, 0, 0]
    index = row[0]
    # to string as some entries are numerical
    source_type = str(data.loc[index, 'source']).lower()
    if 'dtp' in source_type:
        cat[0] = 1
    if 'dtot' in source_type:
        cat[1] = 1
    if 'cash' in source_type:
        cat[2] = 1
    data.loc[index, source_cats] = cat

I am using itertuples() as it proved faster than iterrows().

Is there a faster way of achieving the same functionality as above?

EDIT: This is not just about creating a one-hot encoding. It boils down to updating column values depending on the value of another column. E.g. if I have a certain location_id, I want to update its respective longitude and latitude columns based on that original id (without iterating the way I do above, because that is really slow for large datasets).
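For that general pattern, a lookup table plus Series.map (or a merge) replaces the row loop with a single vectorized pass. Below is a minimal sketch; the `locations` table, its values, and the column names are hypothetical, not from the question:

import pandas as pd

# Hypothetical lookup table: one row per location_id (illustrative values).
locations = pd.DataFrame({
    'location_id': [1, 2, 3],
    'longitude': [4.89, 2.35, -0.13],
    'latitude': [52.37, 48.86, 51.51],
}).set_index('location_id')

data = pd.DataFrame({'location_id': [2, 1, 2, 3]})

# Vectorized update: map each id to its coordinates in one pass
# instead of assigning row by row with .loc inside a Python loop.
data['longitude'] = data['location_id'].map(locations['longitude'])
data['latitude'] = data['location_id'].map(locations['latitude'])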

vcovo
  • `df.source.str.get_dummies()` will get you your 0s and 1s. Then, use `pd.concat` to join your dataframe, or just call `df.assign`. – cs95 Oct 11 '17 at 01:43
  • Yes, you can do this with np.where or np.select (see the sketch after these comments). If you propose a better question with good sample data and expected results, we, the Stack Overflow community, will show you how. – Scott Boston Oct 11 '17 at 02:58
  • As @ScottBoston said, it would've been helpful to mention this from the start. The fact that you wrote all this code to demonstrate an MCVE made us believe that it reflected your actual use case. Now, you'll need to provide some more data and expected output so we can understand your actual use case and how it differs from this one. – cs95 Oct 11 '17 at 03:30
  • @vconvo If your question was answered [here](https://stackoverflow.com/questions/46679401/how-to-do-pd-get-dummies-or-other-ways), please close this one too, and accept one of our answers. – cs95 Oct 11 '17 at 04:09
  • @coldspeed @Scott Boston I'll open another question; I didn't expect there to be such a specific solution to the above (`get_dummies()`). – vcovo Oct 11 '17 at 14:33
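A minimal sketch of the vectorized approach Scott Boston's comment points at, applied to the question's sample data; each str.contains mask mirrors one of the `'dtp' in source_type` checks from the loop, and np.where turns the mask into 1s and 0s:

import numpy as np
import pandas as pd

data = pd.DataFrame({'source': ['DTP', 'Cash 1', 'DTP', 'Cash 1', 'DTOT', 'DTP']})

# Lowercased string view of the column, as in the question's loop.
s = data['source'].astype(str).str.lower()

for col, key in [('source_dtp', 'dtp'),
                 ('source_dtot', 'dtot'),
                 ('source_cash', 'cash')]:
    data[col] = np.where(s.str.contains(key), 1, 0)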

2 Answers


Another way to do this is to use `pd.get_dummies` on the dataframe. First, set `_id` as the index:

import pandas as pd

source = source.set_index('_id')
df_out = pd.get_dummies(source).reset_index()

print(df_out)

Output:

                    _id  source_Cash 1  source_DTOT  source_DTP
0  AV4MdG6Ihowv-SKBN_nB              0            0           1
1  AV4Mc2vNhowv-SKBN_Rn              1            0           0
2  AV4MeisikOpWpLdepWy6              0            0           1
3  AV4MeRh6howv-SKBOBOn              1            0           0
4  AV4Mezwchowv-SKBOB_S              0            1           0
5  AV4MeB7yhowv-SKBOA5b              0            0           1
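If the column names need to match the question's desired output exactly (source_dtp, source_dtot, source_cash), the dummy columns of `df_out` above can be normalized afterwards; a small follow-up sketch:

# Normalize 'source_Cash 1' -> 'source_cash', 'source_DTP' -> 'source_dtp';
# '_id' contains no space, so it passes through unchanged.
df_out.columns = [c.split()[0].lower() for c in df_out.columns]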
Scott Boston

You can use `str.get_dummies` to get your one-hot encodings.

# One dummy column per unique source value; reverse the (alphabetical)
# column order, then normalize names: 'source_Cash 1' -> 'source_cash', etc.
c = df.source.str.get_dummies().add_prefix('source_').iloc[:, ::-1]
c.columns = c.columns.str.lower().str.split().str[0]
print(c)
   source_dtp  source_dtot  source_cash
0           1            0            0
1           0            0            1
2           1            0            0
3           0            0            1
4           0            1            0
5           1            0            0

Next, concatenate c with _id using pd.concat.

df = pd.concat([df._id, c], axis=1)
print(df)
                    _id  source_dtp  source_dtot  source_cash
0  AV4MdG6Ihowv-SKBN_nB           1            0            0
1  AV4Mc2vNhowv-SKBN_Rn           0            0            1
2  AV4MeisikOpWpLdepWy6           1            0            0
3  AV4MeRh6howv-SKBOBOn           0            0            1
4  AV4Mezwchowv-SKBOB_S           0            1            0
5  AV4MeB7yhowv-SKBOA5b           1            0            0

Improvement! Now slightly smoother, thanks to Scott Boston's set_index/reset_index paradigm:

df = df.set_index('_id') \
       .source.str.get_dummies().iloc[:, ::-1]
df.columns = df.columns.str.lower().str.split().str[0]
df = df.add_prefix('source_').reset_index()

print(df)
                    _id  source_dtp  source_dtot  source_cash
0  AV4MdG6Ihowv-SKBN_nB           1            0            0
1  AV4Mc2vNhowv-SKBN_Rn           0            0            1
2  AV4MeisikOpWpLdepWy6           1            0            0
3  AV4MeRh6howv-SKBOBOn           0            0            1
4  AV4Mezwchowv-SKBOB_S           0            1            0
5  AV4MeB7yhowv-SKBOA5b           1            0            0
cs95
  • Interesting... I can't exactly follow how the condition is being evaluated: how are the ones and zeroes assigned? How would this be done with more complex conditions? – vcovo Oct 11 '17 at 01:58
  • @vcovo I'm not really sure what you're asking, but `pd.get_dummies` just assigns one hot encodings, which is what you seem to want. For more "complicated conditions", you'll need to specify what those conditions are, so they can be addressed appropriately. – cs95 Oct 11 '17 at 02:30
  • Let me rephrase: how does data.source.str.get_dummies() know to assign `[0, 0, 1]` for `[source_dtp, source_dtot, source_cash]` if `source="Cash 123"`? Where does the `if 'cash' in source_type` condition come into play? With regards to "More complicated conditions": for example `if any(st in source_type for st in cash_types)` where `cash_types` is an array of e.g. `['cash', 'money', 'contant']` – vcovo Oct 11 '17 at 02:36
  • @vcovo Simply put, what happens is `get_dummies` finds all unique items, and assigns positions to them in that order. Then, for each entry, the corresponding bit is set, depending on what the value is for that row. For more complicated conditions, `pd.get_dummies` will not work here. This is a special function that is suitable for this use case only. For more complicated conditions, more involved methods are needed (a sketch of one such vectorized check follows below). Assuming that is your actual use case, you might want to open a new question with an MCVE that reflects that! – cs95 Oct 11 '17 at 03:03
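For the multi-keyword condition in the comments above (`cash_types = ['cash', 'money', 'contant']`), one vectorized option is to join the keywords into a single alternation pattern for str.contains; a minimal sketch, assuming that keyword list:

import pandas as pd

data = pd.DataFrame({'source': ['Cash 1', 'money transfer', 'DTP']})

cash_types = ['cash', 'money', 'contant']

# 'cash|money|contant' matches any keyword in one vectorized pass.
pattern = '|'.join(cash_types)
data['source_cash'] = (data['source'].astype(str).str.lower()
                                     .str.contains(pattern)
                                     .astype(int))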