4

I have a dataframe which can be generated from the code as given below

df = pd.DataFrame({'person_id' :[1,2,3],'date1': 
['12/31/2007','11/25/2009','10/06/2005'],'val1': 
[2,4,6],'date2': ['12/31/2017','11/25/2019','10/06/2015'],'val2':[1,3,5],'date3': 
['12/31/2027','11/25/2029','10/06/2025'],'val3':[7,9,11]})

I followed the below solution to convert it from wide to long

pd.wide_to_long(df, stubnames=['date', 'val'], i='person_id', 
j='grp').sort_index(level=0)

Though this works with sample data as shown below, it doesn't work with my real data which has more than 200 columns. Instead of person_id, my real data has subject_ID which is values like DC0001,DC0002 etc. Does "I" always have to be numeric? Instead it adds the stub values as new columns in my dataset and has zero rows

This is how my real columns looks like

enter image description here

My real data might contains NA's as well. So do I have to fill them with default values for wide_to_long to work?

enter image description here

Can you please help as to what can be the issue? Or any other approach to achieve the same result is also helpful.

The Great
  • 7,215
  • 7
  • 40
  • 128
  • Copy and pasting your code here gave me your desired output, which you mentioned, so it would be better for us to be able to reproduce your problem (and therefore better diagnose it) if your sample data better matched your real data. – m13op22 Jun 28 '19 at 13:22
  • updated the screenshot of my actual columns if that can help anyway – The Great Jun 28 '19 at 13:26
  • 1
    That updated screenshot helps a lot. If I rename your columns to match the update, then I get an empty dataframe. I'd suggest renaming your columns so that the time points are at the end. For example, `'H1Date'` becomes `HDate1` or `Date1`, something like that. – m13op22 Jun 28 '19 at 13:32
  • So, I have more than 200 columns. Can you let me know how can I change this?I am starting off and your input would really be helpful – The Great Jun 28 '19 at 13:33
  • Sure, is it correct to assume that all columns can be sorted into 3 groups, HDate, HDerived, and H? – m13op22 Jun 28 '19 at 13:35
  • Yes. These 3 columns keep repeating with different numbers. They are in a sequence – The Great Jun 28 '19 at 13:37
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/195697/discussion-between-avles-and-hs-nebula). – The Great Jun 28 '19 at 13:38
  • a somewhat similar (but not exact) query was asked here>https://stackoverflow.com/questions/56657365/pivot-dataframe-with-duplicate-values. I trust, the solution is not working, maybe due to duplicate columns in your real dataset – Siraj S. Jun 28 '19 at 14:37
  • @SirajS. Any other approach to do the same? I mean instead of wide_to_long – The Great Jun 28 '19 at 14:39

3 Answers3

3

Try adding additional argument in the function which allows the strings suffix.

pd.long_to_wide(.......................,suffix='\w+')
Matt Ke
  • 3,599
  • 12
  • 30
  • 49
vipulish
  • 31
  • 1
2

The issue is with your column names, the numbers used to convert from wide to long need to be at the end of your column names or you need to specify a suffix to groupby. I think the easiest solution is to create a function that accepts regex and the dataframe.

import pandas as pd
import re

def change_names(df, regex):
    # Select one of three column groups
    old_cols = df.filter(regex = regex).columns
    # Create list of new column names
    new_cols = []
    for col in old_cols:
        # Get the stubname of the original column
        stub = ''.join(re.split(r'\d', col))
        # Get the time point
        num = re.findall(r'\d+', col) # returns a list like ['1']
        # Make new column name
        new_col = stub + num[0]
        new_cols.append(new_col)

    # Create dictionary mapping old column names to new column names
    dd = {oc: nc for oc, nc in zip(old_cols, new_cols)}
    # Rename columns
    df.rename(columns = dd, inplace = True)

    return df


tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})

# Change date columns
tdf = change_names(tdf, 'date$')
tdf = change_names(tdf, 'val$')

print(tdf)
    person_id      hdate1  tval1      hdate2  tval2      hdate3  tval3
0          1  12/31/2007      2  12/31/2017      1  12/31/2027      7
1          2  11/25/2009      4  11/25/2019      3  11/25/2029      9
2          3  10/06/2005      6  10/06/2015      5  10/06/2025     11
m13op22
  • 2,168
  • 2
  • 16
  • 35
  • 1
    You mean does wide_to_long expect columns to have numbers at the end of their name? Does this give the output? When I modified the column names to have numbers at the end, it didn't help? – The Great Jun 28 '19 at 13:52
  • I think for your case, that'd be the easiest fix, but generally not necessarily, looking at the last example [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.wide_to_long.html). – m13op22 Jun 28 '19 at 13:55
  • But still this returns 0 records. – The Great Jun 28 '19 at 13:56
  • I don't know why it returns zero records still. I don't really know what's the issue? Can you help me implement this with melt approach or any other approach? – The Great Jun 28 '19 at 13:58
  • I'd suggest looking at this [link](https://stackoverflow.com/questions/45123924/convert-pandas-dataframe-from-wide-to-long) then and go from there. – m13op22 Jun 28 '19 at 14:01
2

This is quite late to answer this question. But putting the solution here in case someone else find it useful

    tdf = pd.DataFrame({'person_id' :[1,2,3],'h1date': ['12/31/2007','11/25/2009','10/06/2005'],'t1val': [2,4,6],'h2date': ['12/31/2017','11/25/2019','10/06/2015'],'t2val':[1,3,5],'h3date': ['12/31/2027','11/25/2029','10/06/2025'],'t3val':[7,9,11]})
    
    ## You can use m13op22 solution to rename your columns with numeric part at the 
    ## end of the column name. This is important.
    
    tdf = tdf.rename(columns={'h1date': 'hdate1', 't1val': 'tval1',
                              'h2date': 'hdate2', 't2val': 'tval2',
                              'h3date': 'hdate3', 't3val': 'tval3'})

   ## Then use the non-numeric portion,  (in this example 'hdate', 'tval') as 
   ## stubnames. The mistake you were doing was using ['date', 'val'] as stubnames.

    df = pd.wide_to_long(tdf, stubnames=['hdate', 'tval'], i='person_id', j='grp').sort_index(level=0)
    
    print(df)
Beta
  • 1,638
  • 5
  • 33
  • 67