2

Follow-up from my last question: pyjanitor pivot_longer multiple sets of columns with common grouping variable and id column

In my last question, the dataset I gave was oversimplified for the problem I was having. I have changed the column names to represent the ones in my dataset, as I couldn't figure out how to fix them myself using regex in pivot_longer. In the model dataset I gave, columns were written with the following pattern: number_word, but in my dataset the columns are in any order and never separated by underscores (e.g., wordnumber).

Note that the number needs to be the same grouping variable for each column set. So there should be a rating, estimate, and type for each number.

The dataset

df = pd.DataFrame({

    'id': [1, 1, 1],
    'ratingfirst': [1, 2, 3],
    'ratingsecond': [2.8, 2.9, 2.2],
    'ratingthird': [3.4, 3.8, 2.9],
    'firstestimate': [1.2, 2.4, 2.8],
    'secondestimate': [2.4, 3, 2.4],
    'thirdestimate':[3.4, 3.8, 2.9],
    'firsttype': ['red', 'green', 'blue'],
    'secondtype': ['red', 'green', 'yellow'],
    'thirdtype': ['red', 'red', 'blue'],
})

Desired output

The header of my desired output is the following:

id category rating estimate type
1 first 1.0 1.2 'red'
prayner
  • 393
  • 1
  • 10

2 Answers2

1

I think the easiest way would be to align the columns you have with what was used in the previous question, something like:

def fix_col_header(s, d):
    for word, word_replace in d.items():
        s = s.replace(word, word_replace)
        if s.startswith("_"):
            s = s[len(word_replace):] + s[:len(word_replace)]
    return s

d = {"first":"_first", "second":"_second", "third": "_third"}
df.columns = [fix_col_header(col, d) for col in df.columns]

This will give the columns:

id, rating_first, rating_second, rating_third, estimate_first, estimate_second, estimate_third, type_first, type_second, type_third

Now you can apply the solution from the previous question (note that category and value are switched). For completeness I have added it here:

import janitor

(df
.pivot_longer(
    column_names="*_*", 
    names_to = (".value", "category"), 
    names_sep="_")
)
Shaido
  • 27,497
  • 23
  • 70
  • 73
  • This solution would not work because it is not always written with the number first (e.g., ratingfirst instead of firstrating). So you cannot add an underscore at the end of each number string. – prayner Jan 09 '23 at 09:43
  • @prayner: it should work with both cases (before or after), the if statement in the function takes care of it. I tried with the dataframe in the question and it looks correct. – Shaido Jan 09 '23 at 10:14
1

pivot_longer supports multiple .value - you can take advantage of that to reshape your dataframe, using the names_sep parameter:

# pip install pyjanitor
import pandas as pd
import janitor
(df
.pivot_longer(
    index='id',
    names_to = (".value", "category", ".value"),
    names_sep = "(first|second|third)")
)
   id category  rating  estimate    type
0   1    first     1.0       1.2     red
1   1    first     2.0       2.4   green
2   1    first     3.0       2.8    blue
3   1   second     2.8       2.4     red
4   1   second     2.9       3.0   green
5   1   second     2.2       2.4  yellow
6   1    third     3.4       3.4     red
7   1    third     3.8       3.8     red
8   1    third     2.9       2.9    blue

If you look at the breakdown you can see what's going on here :

df.columns[1:].str.split("(first|second|third)")
Index([   ['rating', 'first', ''],   ['rating', 'second', ''],
          ['rating', 'third', ''],  ['', 'first', 'estimate'],
       ['', 'second', 'estimate'],  ['', 'third', 'estimate'],
            ['', 'first', 'type'],     ['', 'second', 'type'],
            ['', 'third', 'type']],
      dtype='object')

Note how we have three entries, and one of them is an empty string. This matches our names_to argument -> (".value", "category", ".value"); once pivot_longer matches, it then in the final output combine the .values into one, so for ['rating', 'first', ''], it pulls out ('rating', ''), and finally lumps them into one -> rating, and on and on for the rest of the columns.

Another option is with pd.stack:

temp = df.set_index('id')
temp.columns = temp.columns.str.split("(first|second|third)", expand = True)
temp.columns.names = [None, 'category', None]
temp = temp.stack('category')
temp.columns = temp.columns.map("".join)
temp.reset_index()

   id category  estimate    type  rating
0   1    first       1.2     red     1.0
1   1   second       2.4     red     2.8
2   1    third       3.4     red     3.4
3   1    first       2.4   green     2.0
4   1   second       3.0   green     2.9
5   1    third       3.8     red     3.8
6   1    first       2.8    blue     3.0
7   1   second       2.4  yellow     2.2
8   1    third       2.9    blue     2.9
sammywemmy
  • 27,093
  • 4
  • 17
  • 31