
I have a dataframe like shown below

df2 = pd.DataFrame({'person_id':[1],'H1_date' : ['2006-10-30 00:00:00'], 'H1':[2.3],'H2_date' : ['2016-10-30 00:00:00'], 'H2':[12.3],'H3_date' : ['2026-11-30 00:00:00'], 'H3':[22.3],'H4_date' : ['2106-10-30 00:00:00'], 'H4':[42.3],'H5_date' : [np.nan], 'H5':[np.nan],'H6_date' : ['2006-10-30 00:00:00'], 'H6':[2.3],'H7_date' : [np.nan], 'H7':[2.3],'H8_date' : ['2006-10-30 00:00:00'], 'H8':[np.nan]})

(screenshot of df2)

As shown in my screenshot above, my source dataframe (df2) contains a few NAs.

When I do df2.stack(), I lose all the NA's from the data.

However, I would like to retain the NA for H7_date and H8 because each has its corresponding value/date pair: for H7_date, I have a valid value in H7, and for H8, I have its corresponding H8_date.

I would like to drop records only when both values of a pair (here H5_date and H5) are NA.

Please note I have only a few columns here; my real data has more than 150 columns, and the column names aren't known in advance.

I expect my output to look like the one shown below, which no longer contains H5_date and H5 since both are NA:

(screenshot of expected output)

The Great
  • `df2.stack(dropna=False)`? – anky Jul 31 '19 at 07:54
  • Sorry, this doesn't work. I mean it still retains the NAs for H5_date and H5 as well, and they are displayed. I want to drop those records – The Great Jul 31 '19 at 07:56
  • Both answers using the `melt` approach work, but for the answer by @Jon Clements I am a bit unclear about the `.all` and `.any` approach, so I am marking @tawab_shakeel's answer as the solution. To be clear, it's due to my limited proficiency with Python that I couldn't understand the difference between `.any` and `.all` when applied to @Jon Clements' answer; nonetheless both answers give the expected output – The Great Jul 31 '19 at 13:24

3 Answers


You can use:

col = [x for x in df.columns if "date" in x]
for column in col:
    # column[:-5] strips the trailing '_date' to get the paired value column
    df.dropna(subset=[column, column[:-5]], how='all', inplace=True)

subset selects the pair of columns checked for NA, how='all' drops a row only when both columns of the pair are NA, and inplace=True modifies the current DataFrame.

Mayeul sgc
  • In this case, you have given the column names, but I may not know them. How do I drop them in that case? There are more than 150 columns and data keeps coming in, so I don't really know the column names – The Great Jul 31 '19 at 07:59
  • Do you know their index ? – Mayeul sgc Jul 31 '19 at 08:00
  • No, I do not. The sample DF in the question has 8 to 15 columns, but in real data there might be more than 150 columns. It's like time series data: each day has a measurement. Some dates may not have a measurement and some measurements may not have a corresponding date. – The Great Jul 31 '19 at 08:02
  • @AVLES do the columns follow a naming convention, eg: are there always H? and H?_date pairs? – Jon Clements Jul 31 '19 at 08:03
  • @ALVES how do you choose which column you want to drop the NA from then ? – Mayeul sgc Jul 31 '19 at 08:06
  • @Jon Clements hmmm, yes and no I would say. Currently it's of this format, H? and H?_date (this is for heart rate measurement), but when the data is for a cholesterol measurement it's (Lchol? and L?Chol_date). What I am trying to say is that we can find the keyword `Date` in the column name and a day number like 1,2,3,4 (H1,H2,H3,H4), but the letters might change from H to LChol or BP etc – The Great Jul 31 '19 at 08:09
  • @Mayeulsgc - When both columns are NA, I mean in this case `H5Date` AND `H5` are both NA, we can drop them. Whereas `H7` is not NA, so we retain `H7Date`. Similarly for `H8` and `H8Date` – The Great Jul 31 '19 at 08:11
  • Alright, so you want to drop every row for which there is any couple (HX, HXDate) that is fully NA? – Mayeul sgc Jul 31 '19 at 08:15
  • Yes. But please note that the column name shown here is a sample. As provided in the comment above, it can have any column name, but it definitely contains the keyword `Date` and a day number like `1`, `2` etc. It can be like H1Date,H1 or L1CholDate,L1Chol or BP1Date,BP or UACR1Date,UACR1 – The Great Jul 31 '19 at 08:19
  • I edited my solution; however, I don't think it will be the most efficient – Mayeul sgc Jul 31 '19 at 08:31
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/197251/discussion-between-avles-and-mayeul-sgc). – The Great Jul 31 '19 at 08:31
  • Sorry can't use it, company firewall – Mayeul sgc Jul 31 '19 at 08:32
  • What is the `column[:-4]`? The solution might work only for this dataframe. I mean, here "Date" is 4 characters, so I guess that's why you used it – The Great Jul 31 '19 at 08:33
  • Taking the "Date" out of the column name, assuming the date is in the last position. If it is not, I'll let you modify the string to take the "Date" keyword out and get the second column name – Mayeul sgc Jul 31 '19 at 08:35
  • Shouldn't the position be `[:2]`? Your code results in an error, but when I use `for column in col: df2.dropna(subset=[column, column[:2]], how='all', inplace=True)` it returns an empty dataframe – The Great Jul 31 '19 at 09:02
  • Did you apply that to the dataframe I shared above? It results in an empty dataframe; not sure whether I missed anything. In any case, I guess it should be `[:-5]` – The Great Jul 31 '19 at 09:08
  • I based it on the example you gave in a comment using H5Date, not H5_date; if there is an underscore, then yes, take it out with `[:-5]` – Mayeul sgc Jul 31 '19 at 09:09
  • I guess there has been some confusion with the way I laid things out. Just to be clear, I don't wish to drop the full record, I only wish to drop these two columns. `df2.dropna(subset=['H5_date','H5'], how='all', inplace=True)`, this approach may not work – The Great Jul 31 '19 at 09:17
  • A misunderstanding indeed. So if on any row you have a couple (X, X_date) where both values are NA, you drop that couple of columns for the entire dataframe, am I right? – Mayeul sgc Jul 31 '19 at 09:26
  • I mean not for the entire dataframe but for that person_id – The Great Jul 31 '19 at 09:40
  • If it is working for you, yes, it is probably the most elegant solution – Mayeul sgc Jul 31 '19 at 09:47
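For what it's worth, the column-dropping variant this comment thread converges on might be sketched like this (my own sketch, assuming the `_date` suffix convention from the question; the trimmed frame below is only illustrative):

```python
import numpy as np
import pandas as pd

# trimmed version of the sample frame from the question
df2 = pd.DataFrame({'person_id': [1],
                    'H5_date': [np.nan], 'H5': [np.nan],    # both NA -> drop the pair
                    'H7_date': [np.nan], 'H7': [2.3],       # value present -> keep
                    'H8_date': ['2006-10-30 00:00:00'], 'H8': [np.nan]})  # date present -> keep

to_drop = []
for date_col in [c for c in df2.columns if c.endswith('_date')]:
    value_col = date_col[:-5]  # strip the trailing '_date' to get the paired value column
    if df2[[date_col, value_col]].isna().all(axis=None):  # both columns entirely NA
        to_drop += [date_col, value_col]

df2 = df2.drop(columns=to_drop)
```

This drops only the fully-NA column pairs (H5_date/H5 here) and leaves the half-filled pairs intact.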

Try `pd.DataFrame.melt`:

df = pd.melt(df2, id_vars='person_id', var_name='col', value_name='dates')
df['col2'] = df['col'].str.split("_").str[0]  # pair key, e.g. 'H5_date' -> 'H5'
df['count'] = df.groupby(['col2'])['dates'].transform(pd.Series.count)  # non-NA values per pair
df = df[df['count'] != 0]  # keep pairs with at least one non-NA value
df.drop(['col2', 'count'], axis=1, inplace=True)
print(df)
    person_id      col                dates
0           1  H1_date  2006-10-30 00:00:00
1           1       H1                  2.3
2           1  H2_date  2016-10-30 00:00:00
3           1       H2                 12.3
4           1  H3_date  2026-11-30 00:00:00
5           1       H3                 22.3
6           1  H4_date  2106-10-30 00:00:00
7           1       H4                 42.3
10          1  H6_date  2006-10-30 00:00:00
11          1       H6                  2.3
12          1  H7_date                  NaN
13          1       H7                  2.3
14          1  H8_date  2006-10-30 00:00:00
15          1       H8                  NaN

tawab_shakeel

One approach is to melt the DF, apply a key that identifies columns in the same "group" (in this case H<some digits>, but you can amend that as required), then group by person and that key and filter those groups to the ones containing at least one non-NA value, eg:

Starting with:

df = pd.DataFrame({'person_id':[1],'H1_date' : ['2006-10-30 00:00:00'], 'H1':[2.3],'H2_date' : ['2016-10-30 00:00:00'], 'H2':[12.3],'H3_date' : ['2026-11-30 00:00:00'], 'H3':[22.3],'H4_date' : ['2106-10-30 00:00:00'], 'H4':[42.3],'H5_date' : [np.nan], 'H5':[np.nan],'H6_date' : ['2006-10-30 00:00:00'], 'H6':[2.3],'H7_date' : [np.nan], 'H7':[2.3],'H8_date' : ['2006-10-30 00:00:00'], 'H8':[np.nan]})

Use:

df2 = (
    df.melt(id_vars='person_id')
    .assign(_gid=lambda v: v.variable.str.extract(r'H(\d+)'))
    .groupby(['person_id', '_gid'])
    .filter(lambda g: bool(g.value.any()))
    .drop(columns='_gid')
)

Which gives you:

    person_id variable                value
0           1  H1_date  2006-10-30 00:00:00
1           1       H1                  2.3
2           1  H2_date  2016-10-30 00:00:00
3           1       H2                 12.3
4           1  H3_date  2026-11-30 00:00:00
5           1       H3                 22.3
6           1  H4_date  2106-10-30 00:00:00
7           1       H4                 42.3
10          1  H6_date  2006-10-30 00:00:00
11          1       H6                  2.3
12          1  H7_date                  NaN
13          1       H7                  2.3
14          1  H8_date  2006-10-30 00:00:00
15          1       H8                  NaN

You can then use that as a starting point to tweak if necessary.
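For example, if the wide layout is needed again afterwards, the filtered long frame can be pivoted back. A small sketch, using a hypothetical subset of the melted output rather than the full result:

```python
import numpy as np
import pandas as pd

# hypothetical subset of the melted/filtered output shown above
long_df = pd.DataFrame({'person_id': [1, 1, 1, 1],
                        'variable': ['H1_date', 'H1', 'H7_date', 'H7'],
                        'value': ['2006-10-30 00:00:00', 2.3, np.nan, 2.3]})

# one row per person, one column per original variable
wide = long_df.pivot(index='person_id', columns='variable', values='value')
```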

Jon Clements
  • I understand that any() returns True when at least one of the elements is true, and I consider valid values (anything other than NA) as `True` here. Am I right to understand that when I replace `.any()` with `.all()` in your code, I shouldn't see any NAs, so it should exclude record numbers 12, 13, 14 and 15 as well? But I see them. Have I got this wrong? – The Great Jul 31 '19 at 09:54
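For later readers puzzling over the same comment: `Series.any()` and `Series.all()` both default to `skipna=True`, so NA values are ignored entirely, which is why swapping in `.all()` does not filter out the partially-NA groups. A small standalone illustration (my own, separate from the answer's code):

```python
import numpy as np
import pandas as pd

mixed = pd.Series([np.nan, 2.3])       # like the H7 group: one NA, one value
all_na = pd.Series([np.nan, np.nan])   # like the H5 group

# notna() makes the test explicit: True where a real value exists
assert mixed.notna().any()        # at least one value -> the group is kept
assert not mixed.notna().all()    # not every slot is filled
assert not all_na.notna().any()   # nothing present -> the group is dropped

# With skipna=True (the default) the NA is simply ignored, so .all()
# is still True for the H7 group; that is why .all() keeps rows 12-15
assert bool(mixed.all())
```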