
I have a dataset containing diagnosis columns (DIAGX1-DIAGX42) for patients, and I need to create a variable that sums, per patient, the weights assigned to these codes by an external index.

import pandas as pd

patients = [('pat1', 'Z509', 'M33', 'M32', 'M315'),
            ('pat2', 'I099', 'I278', 'M05', 'F01'),
            ('pat3', 'N057', 'N057', 'N058', 'N057')]
labels = ['patient_num', 'DIAGX1', 'DIAGX2', 'DIAGX3', 'DIAGX4']
df_patients = pd.DataFrame.from_records(patients, columns=labels)
df_patients

Input
patient_num DIAGX1  DIAGX2  DIAGX3  DIAGX4
pat1        Z509    M33     M32     M315
pat2        I099    I278    M05     F01
pat3        N057    N057    N058    N057

Output
patient_num DIAGX1  DIAGX2  DIAGX3  DIAGX4 Score
pat1        Z509    M33     M32     M315   1
pat2        I099    I278    M05     F01    6
pat3        N057    N057    N058    N057   0

external_index: if any diagnosis column for a patient contains a code from one of the groups below, that group's weight is added to the patient's score. Each group contributes at most once, e.g. a patient with both F01 and F02 (both dementia) still gets only 2 allocated for that record. Weights are only summed across different groups, e.g. F01 = 2 (dementia) and I278 = 1 (chronic_pulmonary_disease) sum to 3. (See the worked check after the code lists below.)

  1. congestive_heart_failure = 2
  2. dementia = 2
  3. chronic_pulmonary_disease = 1
  4. rheumatologic_disease = 1
congestive_heart_failure = [
    "I099",
    "I255",
    "I420",
    "I425",
    "I426",
    "I427",
    "I428",
    "I429",
    "I43",
    "I50",
    "P290",
]
dementia = ["F01", "F02", "F03", "F051", "G30", "G311"]
chronic_pulmonary_disease = [
    "I278",
    "I279",
    "J40",
    "J41",
    "J42",
    "J43",
    "J44",
    "J45",
    "J47",
    "J60",
    "J61",
    "J62",
    "J63",
    "J64",
    "J65",
    "J66",
    "J67",
    "J684",
    "J701",
    "J703",
]
rheumatologic_disease = [
    "M05",
    "M06",
    "M315",
    "M32",
    "M33",
    "M34",
    "M351",
    "M353",
    "M360",
]
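
To make the scoring rule concrete, here is a small plain-Python check against the three sample patients (a sketch reusing the lists and weights above; score_patient is just an illustrative helper, not part of the dataset):

groups = {
    'congestive_heart_failure': (congestive_heart_failure, 2),
    'dementia': (dementia, 2),
    'chronic_pulmonary_disease': (chronic_pulmonary_disease, 1),
    'rheumatologic_disease': (rheumatologic_disease, 1),
}

def score_patient(codes):
    # each group contributes its weight at most once, no matter how many of its codes appear
    return sum(weight for members, weight in groups.values()
               if any(code in members for code in codes))

print(score_patient(['Z509', 'M33', 'M32', 'M315']))    # 1  (rheumatologic only, counted once)
print(score_patient(['I099', 'I278', 'M05', 'F01']))    # 6  = 2 + 1 + 1 + 2 across four groups
print(score_patient(['N057', 'N057', 'N058', 'N057']))  # 0  (no code matches any group)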
Eoin Vaughan

2 Answers


Annotated code

idx = {
    'dementia': dementia,
    'rheumatologic_disease': rheumatologic_disease,
    'congestive_heart_failure': congestive_heart_failure,
    'chronic_pulmonary_disease': chronic_pulmonary_disease,
}
mapping = {v: k for k, vals in idx.items() for v in vals}

weights = {
    'dementia': 2,
    'rheumatologic_disease': 1,
    'congestive_heart_failure': 2,
    'chronic_pulmonary_disease': 1,
}

# Convert the dataframe into long format
df = df_patients.melt('patient_num')

# Substitute disease name in place of codes
df['value'] = df['value'].map(mapping)

# Drop dupes per patient and disease
df = df.drop_duplicates(['patient_num', 'value'])

# Map the weights assigned to diseases
df['value'] = df['value'].map(weights)

# Sum the weights per patient and map it back to the original dataframe
df_patients['Score'] = df_patients['patient_num'].map(df.groupby('patient_num')['value'].sum())

Result

  patient_num DIAGX1 DIAGX2 DIAGX3 DIAGX4  Score
0        pat1   Z509    M33    M32   M315    1.0
1        pat2   I099   I278    M05    F01    6.0
2        pat3   N057   N057   N058   N057    0.0
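
If you want the integer scores shown in the desired output (the mapped weights column is float because unmatched codes become NaN, so the per-patient sums come back as floats), a final cast should do it; every patient appears in the melted frame, so there are no missing sums to fill:

df_patients['Score'] = df_patients['Score'].astype(int)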
Shubham Sharma
  • @ScottBoston Hi both, thank you for your help. Can I ask you both: do you find that you transform (melt/pivot) your dfs a lot to make changes like this? I ask because I would have been stuck trying to iterate over the rows before thinking of this. – Eoin Vaughan May 25 '22 at 17:50
  • @EoinVaughan A rule of thumb would be **"When using pandas always avoid iterating wherever possible"** – Shubham Sharma May 25 '22 at 17:52
  • @ShubhamSharma and this is avoided mostly via transforming/reshaping our data? – Eoin Vaughan May 25 '22 at 18:01
  • @EoinVaughan Reshaping is a powerful concept, but whether to reshape or not depends on your use case and what you are trying to achieve with the data. And most of the time there is a way in pandas by which you can avoid iteration. – Shubham Sharma May 25 '22 at 18:08
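
For contrast, the row-iteration style the comments advise against would look roughly like this (a sketch reusing mapping and weights from the answer above; row_score is a hypothetical helper). It gives the same scores, but calls a Python function per row, which scales poorly compared to the reshaped, vectorised version:

def row_score(row):
    # distinct disease groups present among this row's diagnosis codes
    diseases = {mapping.get(code) for code in row} - {None}
    return sum(weights[d] for d in diseases)

df_patients['Score'] = df_patients.filter(like='DIAGX').apply(row_score, axis=1)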

You can do it this way, and probably in a few other, more efficient ways:

chf_dict = dict(zip(congestive_heart_failure, ['chf'] * len(congestive_heart_failure)))
dementia_dict = dict(zip(dementia, ['dem'] * len(dementia)))
cpd_dict = dict(zip(chronic_pulmonary_disease, ['cpd'] * len(chronic_pulmonary_disease)))
rd_dict = dict(zip(rheumatologic_disease, ['rd'] * len(rheumatologic_disease)))

disease_map = chf_dict
disease_map.update(dementia_dict)
disease_map.update(cpd_dict)
disease_map.update(rd_dict)
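
As an aside, the same code-to-abbreviation mapping can be built a bit more compactly, for example (a sketch producing the same disease_map as the four dicts above):

disease_map = {
    **dict.fromkeys(congestive_heart_failure, 'chf'),
    **dict.fromkeys(dementia, 'dem'),
    **dict.fromkeys(chronic_pulmonary_disease, 'cpd'),
    **dict.fromkeys(rheumatologic_disease, 'rd'),
}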

score_dict = {'cpd':1, 
              'chf':2, 
              'rd':1, 
              'dem':2}

score_df = (df_patients.set_index('patient_num').stack()   # long Series of codes
            .map(disease_map)                               # code -> disease abbreviation
            .droplevel(1).reset_index(name='disease')
            .drop_duplicates()                              # each disease counts once per patient
            .set_index('patient_num')['disease']
            .map(score_dict)                                # disease abbreviation -> weight
            .groupby(level=0).sum().rename('Score'))

df_patients.merge(score_df, left_on='patient_num', right_index=True)

Output:

  patient_num DIAGX1 DIAGX2 DIAGX3 DIAGX4  Score
0        pat1   Z509    M33    M32   M315    1.0
1        pat2   I099   I278    M05    F01    6.0
2        pat3   N057   N057   N058   N057    0.0
Scott Boston