1

I am trying to collect temperature for each hour of the year. After some scrapping, I have almost all temp value by hours but unfortunately, some data was missing and a total number of temp is only 8600 rather than 8783 that year 2016 have as hours. And unfortunately, I don't know from which specific hours the temp data is missing. So I have to fulfill temp data till I have all corresponding values to hours in the year 2016. Is there a way to fulfill temp series on random basis inserting values that are near to original ones that I have.

When I am using pd.concat(), all rows till 8783 indexes (in temp column), are filled with NaN due to fewer rows in temp(8600).

df = pd.concat([date_df, temp_df], axis =1)
Output:

datetime                  |     temp
0 2016-01-01 01:00:00           -6°C
1 2016-01-01 02:00:00           -6°C
2 2016-01-01 03:00:00           -6°C
3 2016-01-01 04:00:00           -7°C

.....

8780 2016-12-31 22:00:00        NaN
8781 2016-12-31 23:00:00        NaN
8782 2017-01-01 00:00:00        NaN

So can I add missing temp values on random indexes and their values to be as nearest one's from the existing ones. Thanks in advance and may the force be with you!

Hristo Stoychev
  • 405
  • 1
  • 5
  • 13

1 Answers1

1

You can use:

print (df)
            datetime    temp
2016-01-01  01:00:00   8.0°C
2016-01-01  02:00:00   7.0°C
2016-01-01  03:00:00     NaN
2016-01-01  04:00:00     NaN
2016-01-01  05:00:00   4.0°C
2016-01-01  06:00:00  10.0°C

df['temp'] = df['temp'].str.replace('°C', '').astype(float)

df['temp'] = df['temp'].interpolate().astype(str) + '°C'
print (df)
            datetime    temp
2016-01-01  01:00:00   8.0°C
2016-01-01  02:00:00   7.0°C
2016-01-01  03:00:00   6.0°C
2016-01-01  04:00:00   5.0°C
2016-01-01  05:00:00   4.0°C
2016-01-01  06:00:00  10.0°C

EDIT:

#sample data
np.random.seed(54)
rng = pd.date_range('2016-01-01', freq='H', periods=100)
df = pd.DataFrame({'temp':np.random.randint(10, size=len(rng))}, index=rng)

#add 10 random NaNs
df.loc[np.random.choice(df.index, 10), 'temp'] = np.nan
print (df)
                     temp
2016-01-01 00:00:00   5.0
2016-01-01 01:00:00   2.0
2016-01-01 02:00:00   7.0
2016-01-01 03:00:00   1.0
2016-01-01 04:00:00   0.0
2016-01-01 05:00:00   NaN
2016-01-01 06:00:00   7.0
2016-01-01 07:00:00   1.0
2016-01-01 08:00:00   9.0
...
...
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Hey jezreal, My original dataframe is with datateime column and 8783 rows and temp series that I am trying to add to that DF with index 8783 is less, around 8600 rows for temp. Some temp values for unknown datetime hours of the year are missing. When I am trying to conc or expand temp series all NaN values are added at the end and filling only last ~100 indexes. datetime | temp 2016-01-01 01:00:00 8°C 2016-01-01 02:00:00 7°C 2016-01-01 03:00:00 6°C ---- 2016-12-31 23:00:00 NaN – Hristo Stoychev Jun 21 '17 at 11:32
  • mate I'm newby in SO, how I can PM you for some basic tips for writing and posting in this great site? :) – Hristo Stoychev Jun 21 '17 at 11:35
  • No problem, I want only know how do you get `NaN`s in bottom. – jezrael Jun 21 '17 at 11:36
  • Give me some time. – jezrael Jun 21 '17 at 11:36
  • But i dont understand random mean, can you explain more? – jezrael Jun 21 '17 at 11:38
  • NaN are filling after I am using pd.expand(). Because when I'm trying to add a column with different length (as I have with temp) there is an error ofc. So I do pd.expand(), temp series to have the same match of 8783 rows length and simply added to existing df with datetime. And the idea is to have same match temp but missing values to be spread around randomly on some step between all existing values in temp, not only to expand them and add them in the end. – Hristo Stoychev Jun 21 '17 at 11:40
  • Ok, so first need add missing datetimes and then replace `NaN` by something? some random mean? Or some random number? – jezrael Jun 21 '17 at 11:42
  • all datetime values are ok. They are 8783. But temp values are only 8600. So I need to fulfill that series/column till match with all day-hour rows. – Hristo Stoychev Jun 21 '17 at 11:44
  • Sorry and thank you for patience. I still dont undesrans logic of replacement. I think your last sentence `(spread randomly in a small group, best by 1 with step 500 and using nearest value for reference) in entire series?` – jezrael Jun 21 '17 at 11:46
  • In my opinion the best is edit question and add some data sample 5-10 rows with desired output, because formatting of comments is problematic. – jezrael Jun 21 '17 at 11:54
  • Actually, I am the one that should say thank you for your patience :) Well, all my explanation is some kind of misleading. Let me start from fresh. I am trying to collect temperature for each hour of the year. After some scrapping, I have almost all temp value by hours but unfortunately, some data was missing during web scraping and a total number of temp is only 8600 rather than 8783 that year 2016 have as hours. – Hristo Stoychev Jun 21 '17 at 12:02
  • What about [`interpolate`](http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.interpolate.html) ? Is possible use it? Check edited answer. – jezrael Jun 21 '17 at 12:15
  • OK, thank you for sample data. But you forget for desired output. – jezrael Jun 21 '17 at 12:28
  • Also NaNs data are always in last rows? For better sample is possible add some NaNs between existing values? – jezrael Jun 21 '17 at 12:31
  • Thanks mate! Unfortunately, all NaN's are in the end cuz I don't know which values are not included. I can simply add them manually but the effort ... :) There are nearly 100 missing points. Just wondering if there is a clever pythonic/pandas trick. Thanks a lot for your time and effort!!! – Hristo Stoychev Jun 21 '17 at 12:41
  • Ok, if need replace all NaNs by some scalar like `10` use `df['temp'] = df['temp'].fillna(10)`. Isit what you want? – jezrael Jun 21 '17 at 12:43
  • Not exactly. The idea is that all values from last days will be generated and not real. All missing points are at all days and hours during the year. For example missing temp point from (example): 04-15 03:00, other from 06-20 at 16:00. So the last real value that I have in temp could be from 2016-12-31 at 23:00 but due to missing ones, it goes to 12-28 05:00. Get the point? Due to missing values, real ones are going forwards in index and corresponding day and time. So if I could find a way to insert in random indexes some random temp values, data will be more consistent and close to reality. – Hristo Stoychev Jun 21 '17 at 12:56
  • Ok, if need replace NaN by some random values `df['temp'] = df['temp'].combine_first(pd.Series(np.random.randint(10, size=len(df.index)), index=df.index))` anf if need replace NaNs by forward filling use `df['temp'] = df['temp'].ffill()` – jezrael Jun 21 '17 at 13:00
  • One more idea, we have 8600 rows in temp(as a series, not in DF with concat() and filled with NaN in the end), can we add 100 NaN's on random index places in temp? And then to say in pandas column temp to convert all NaN with ffill(). – Hristo Stoychev Jun 21 '17 at 13:21