-1

I'm excercising on this open dataset. The basic description of used dataframes:

There's one that contains info on data availability: dataavail

Timestamp    Availability
201605252300 True
201605252315 False
201605252015 True

There's one that contains the actual data weatherdata

SquareID Timestamp    Precipitation
25       201605252300 1

The thing is, that if there is no rain, that is not marked by weatherdata.Precipitation being 0, but by simply missing from weatherdata. However, not all data that is missing means no precipitation, it can also mean technical failure. That's where dataavail comes in. If for a Timestamp in data the Availability is False, that means there was a technical problem and there's no data. If for a Timestamp in data the Availability is True, that means the precipitation was actually 0.

I want to combine these two dataframes so that I have a dataframe describing the full spectrum of options, namely one that looks like this:

Timestamp    Availability Precipitation
201605252300 True         1            #if the availability is True, and there's rain, data can be pulled from weatherdata
201605252315 False        NaN          #if there's a technical issue, data can stay NaN for further imputation
201605252015 True         0            #if the data's availability is True, but there's no entry fot it in weatherdata, it should be 0

Now to achieve this, I'm trying to merge them as

precip_alldata = pd.merge(weatherdata, dataavail, on=['Timestamp'], how='right')

In my understanding, according to the documentation this should

right: use only keys from right frame (SQL: right outer join)

But that is what I want, since in dataavail there's availability data for ALL the possible timestamps. I also understand, that this does not yet substitute NaN's with 0 in case of availability being True. However, the output precip_alldata doesn't have ANY rows, where Availability is True, but Precipitation is NaN, and that leads me to believe that something's off in my thinking. Looking at the raw dataset, I can clearly see occasions where there was no precipitation and the Avaiability is True. So my output looks something like

Timestamp    Availability Precipitation
201605252300 True         1
201605252315 False        Nan

where in my understanding/what I want is

Timestamp    Availability Precipitation
201605252300 True         1
201605252315 False        NaN
201605252015 True         NaN

in this first step. Could you please point out the culprit in my thinking? Thank you!

lte__
  • 7,175
  • 25
  • 74
  • 131

1 Answers1

-1

Try the following:

precip_alldata = pd.merge(dataavail, weatherdata, how='left')
precip_alldata[['TimeStamp', 'Availability', 'Precipitation']]

Using a left join will give you all rows from the left dataframe (dataavail).

Output:

          Timestamp Availability  Precipitation
0  201605252300         True            1.0
1  201605252315        False            NaN
2  201605252015         True            NaN
Andrew L
  • 6,618
  • 3
  • 26
  • 30
  • `precip_alldata = pd.merge(dataavail, weatherdata, how='left')` is literally the same as `precip_alldata = pd.merge(weatherdata,dataavail, how='right')`. Nevertheless, I tried and it yields the same output. – lte__ Mar 16 '17 at 12:17
  • 2
    I replicated your exact datasets above which yielded the copy/pasted output you're looking at. Something else is going on if this doesn't work. – Andrew L Mar 16 '17 at 12:27