I'm excercising on this open dataset. The basic description of used dataframes:
There's one that contains info on data availability: dataavail
Timestamp Availability
201605252300 True
201605252315 False
201605252015 True
There's one that contains the actual data weatherdata
SquareID Timestamp Precipitation
25 201605252300 1
The thing is, that if there is no rain, that is not marked by weatherdata.Precipitation
being 0
, but by simply missing from weatherdata
. However, not all data that is missing means no precipitation, it can also mean technical failure. That's where dataavail
comes in. If for a Timestamp
in data
the Availability
is False
, that means there was a technical problem and there's no data. If for a Timestamp
in data
the Availability
is True
, that means the precipitation was actually 0.
I want to combine these two dataframes so that I have a dataframe describing the full spectrum of options, namely one that looks like this:
Timestamp Availability Precipitation
201605252300 True 1 #if the availability is True, and there's rain, data can be pulled from weatherdata
201605252315 False NaN #if there's a technical issue, data can stay NaN for further imputation
201605252015 True 0 #if the data's availability is True, but there's no entry fot it in weatherdata, it should be 0
Now to achieve this, I'm trying to merge them as
precip_alldata = pd.merge(weatherdata, dataavail, on=['Timestamp'], how='right')
In my understanding, according to the documentation this should
right: use only keys from right frame (SQL: right outer join)
But that is what I want, since in dataavail
there's availability data for ALL the possible timestamps. I also understand, that this does not yet substitute NaN's with 0 in case of availability being True. However, the output precip_alldata
doesn't have ANY rows, where Availability
is True
, but Precipitation
is NaN
, and that leads me to believe that something's off in my thinking. Looking at the raw dataset, I can clearly see occasions where there was no precipitation and the Avaiability
is True
. So my output looks something like
Timestamp Availability Precipitation
201605252300 True 1
201605252315 False Nan
where in my understanding/what I want is
Timestamp Availability Precipitation
201605252300 True 1
201605252315 False NaN
201605252015 True NaN
in this first step. Could you please point out the culprit in my thinking? Thank you!