
My goal is to collapse the table below into a single row, and this question deals specifically with the blue row shown below. The table has three categorical variables and six quantitative (analysis) variables. Columns C1 and C2 are the only variables that need to match for a successful join. All blank cells are NaN, and Python code for copying the data is below. Some rows are exported independently because they carry information found in other related tables that are not included in the export.

Question (blue row). Some of the quantitative information in the blue row is also in the grey row and some is not. Is there a way to copy the new information (-8 in Q6) into the grey row and then delete (or highlight) the blue row? The grey row's categorical information should be kept, consistent with the keep='first' default of drop_duplicates.

Related question (yellow row). How can I delete rows that are not exact duplicates but contain no new information (only more NaNs)?

Data table (colored image; the same rows appear in the example code below)

Table for data analysis (image)

Expected Output

The expected output would have the grey row updated with Q6 from the blue row and the blue row removed.

[['C1 (PK)', 'C2 (FK)', 'C3', 'C4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6'],
 ['S1', 'P3', 'H1', Timestamp('2004-12-04 00:00:00'), -15.0, -27.4, nan, -10.0, -15.0, -8]]
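
For these two specific rows this can be done by hand, hard-coding their positions (index 0 for the grey row and index 3 for the row carrying the new Q6 value in the example data below), although I would prefer a solution that does not hard-code positions. A rough sketch with combine_first:

# Fill NaNs in the grey row (index 0) from the row that has the new Q6 value (index 3),
# then drop that source row. Row positions are hard-coded from the example data below.
df.loc[0] = df.loc[0].combine_first(df.loc[3])
df = df.drop(index=3)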

Current Progress

My current code includes this line to drop all rows where all quantitative variables are NaN.
df.dropna(subset=df.columns[4:],how='all', inplace=True)

Also, this line for deleting all rows where all quantitative variables are the same.
df.drop_duplicates(subset=df.columns[4:], inplace=True)

Example code that can be copied into an IDE.

import pandas as pd
from numpy import nan
from pandas import Timestamp

data = [['S1', 'P3', 'H1', Timestamp('2004-12-04 00:00:00'), -15.0, -27.4, nan, -10.0, -15.0, nan],
        ['S1', 'P3', 'H1', Timestamp('2004-12-20 00:00:00'), nan, nan, nan, nan, nan, nan],
        ['S1', 'P3', 'H2', Timestamp('2004-12-20 00:00:00'), -15.0, nan, nan, -10.0, nan, nan],
        ['S1', 'P3', 'H3', Timestamp('2004-12-07 00:00:00'), nan, nan, nan, nan, -15.0, -8.0],
        ['S1', 'P3', 'H1', Timestamp('2004-12-04 00:00:00'), -15.0, -27.4, nan, -10.0, -15.0, nan]]
cols = ['C1 (PK)', 'C2 (FK)', 'C3', 'C4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']
df = pd.DataFrame(data=data, columns=cols)

# Drop exact duplicate rows, rows whose quantitative columns are all NaN,
# and rows whose quantitative columns duplicate an earlier row.
df.drop_duplicates(inplace=True)
df.dropna(subset=df.columns[4:], how='all', inplace=True)
df.drop_duplicates(subset=df.columns[4:], inplace=True)
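
One idea I am considering, but have not verified in general, is to group on C1 and C2 and take the first non-NaN value in every column; this would pull the -8 into the kept row and also remove rows that add no new information:

# Sketch: collapse to one row per (C1, C2). first() returns the first non-NaN value
# in each column, so Q6 becomes -8 and rows that only repeat existing values disappear.
df_collapsed = df.groupby(['C1 (PK)', 'C2 (FK)'], as_index=False).first()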
    If only C1 and C2 are required for a match, it's unclear to me why this input doesn't collapse to a single row, or why only the 4th row is the one that gets removed. – ALollz Jan 16 '20 at 16:56
  • can you post your expected output ? – Umar.H Jan 16 '20 at 17:08
  • @ALollz, I have updated the question. This SO post is aimed specifically at the issue presented by the blue column. Some of the rows are exported independently due to information in other linked tables that are not part of the export. – mcbridecaleb Jan 16 '20 at 17:08
  • I understand that, it's just that without any information about how that blue and grey rows are chosen, the only option is to hard code that you're updating the 0th row with the 3rd where it's missing, then dropping the third. – ALollz Jan 16 '20 at 17:28
  • @ALollz the question has been updated. The gray row categorical information is kept because I am assuming the keep='first' default of pd.drop_duplicates. – mcbridecaleb Jan 16 '20 at 18:38

1 Answer


Split off the categorical columns:

df_categorical = df[['C1 (PK)', 'C2 (FK)',"C3", "C4"]]

Perform a groupby on the first two columns and keep the first element:

df_categorical = df_categorical.groupby(["C1 (PK)", "C2 (FK)"]).first()

For the quantitative columns, use groupby again and aggregate with mean this time (selecting just the quantitative columns so the non-numeric ones are left out of the aggregation):

df_quantitative = df.groupby(['C1 (PK)', 'C2 (FK)'])[['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']].mean()

Merge the two DataFrames to get the result:

df_final = pd.concat([df_quantitative, df_categorical], axis=1)

Reset the index:

df_final.reset_index(inplace=True)
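
Putting the steps together on the sample data from the question (a sketch; I am assuming the quantitative columns are Q1 through Q6):

import pandas as pd
from numpy import nan
from pandas import Timestamp

cols = ['C1 (PK)', 'C2 (FK)', 'C3', 'C4', 'Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']
data = [['S1', 'P3', 'H1', Timestamp('2004-12-04'), -15.0, -27.4, nan, -10.0, -15.0, nan],
        ['S1', 'P3', 'H1', Timestamp('2004-12-20'), nan, nan, nan, nan, nan, nan],
        ['S1', 'P3', 'H2', Timestamp('2004-12-20'), -15.0, nan, nan, -10.0, nan, nan],
        ['S1', 'P3', 'H3', Timestamp('2004-12-07'), nan, nan, nan, nan, -15.0, -8.0],
        ['S1', 'P3', 'H1', Timestamp('2004-12-04'), -15.0, -27.4, nan, -10.0, -15.0, nan]]
df = pd.DataFrame(data=data, columns=cols)

# First non-null categorical values per (C1, C2) group
df_categorical = df[['C1 (PK)', 'C2 (FK)', 'C3', 'C4']].groupby(['C1 (PK)', 'C2 (FK)']).first()
# Mean of the quantitative columns per (C1, C2) group (NaNs are skipped)
df_quantitative = df.groupby(['C1 (PK)', 'C2 (FK)'])[['Q1', 'Q2', 'Q3', 'Q4', 'Q5', 'Q6']].mean()
# Combine the two and restore C1/C2 as regular columns
df_final = pd.concat([df_quantitative, df_categorical], axis=1).reset_index()
# df_final should be a single row with Q6 == -8, matching the expected output in the question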