
The graph shows water temperature against time. When there is an activation, the temperature increases; when the activation ends, the temperature starts decreasing (although sometimes there is a time lag).

I would like to count the number of events (each blue circle in the plot marks one activation). There are also periods of random noise (red circles), indicating random temperature change: you can see there is only an increase or a decrease, but not both, implying it is not a proper event.

The temperature record updates on every 0.5°C change in temperature, regardless of time.

I have tried using 1) the temperature difference and 2) the temperature gradient between adjacent data points to identify event start and end timestamps, counting each start/end pair as one event, but this is not very accurate.

I am told that I should use only the temperature difference and identify the pattern (increase → max temperature → decrease) as one event. Any ideas on an appropriate way to calculate the total number of activations?
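For reference, here is a minimal sketch of that (increase → max → decrease) idea in plain NumPy, on made-up readings quantized to 0.5°C (not the real data; whether a bare local maximum should count as an event is an assumption):

```python
import numpy as np

# Made-up toy readings, quantized to 0.5 degC like the real sensor
temps = np.array([19.5, 20.0, 21.0, 22.0, 21.5, 21.0, 21.5, 22.5, 22.0, 21.5])

# Keep only the points where the value actually changes
keep = np.concatenate(([True], np.diff(temps) != 0))
t = temps[keep]

# Sign of the forward difference: +1 while rising, -1 while falling
sig = np.sign(np.diff(t))

# An event is a rise immediately followed by a decline (a local maximum)
events = np.count_nonzero((sig[:-1] > 0) & (sig[1:] < 0))
print(events)  # -> 2 for this toy series
```

Isolated noise that only rises (or only falls) produces no sign flip from +1 to -1, so it is not counted.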


Update1:

Sample Data:

        id      timestamp               temperature 
27581   27822   2020-01-02 07:53:05.173 19.5    
27582   27823   2020-01-02 07:53:05.273 20.0    
27647   27888   2020-01-02 10:01:46.380 20.5    
27648   27889   2020-01-02 10:01:46.480 21.0    
27649   27890   2020-01-02 10:01:48.463 21.5    
27650   27891   2020-01-02 10:01:48.563 22.0    
27711   27952   2020-01-02 10:32:19.897 21.5    
27712   27953   2020-01-02 10:32:19.997 21.0
27861   28102   2020-01-02 11:34:41.940 21.5    
...

Update2:

Tried:

df['timestamp'] = pd.to_datetime(df['timestamp'])
df['Date'] = [datetime.datetime.date(d) for d in df['timestamp']] 
df['Date'] = pd.to_datetime(df['Date'])   
df = df[df['Date'] == '2020-01-02']

# one does not need duplicate temperature values, 
# because the task is to find changing values
df2 = df.loc[df['temperature'].shift() != df['temperature']]

# ye good olde forward difference
der = np.diff(df2['temperature'])
# to have the same length as index
der = np.insert(der,len(der),np.NaN)
# make it column
df2['sig'] = np.sign(der)

# temporary array
evts = np.zeros(len(der))
# find the points where the signum changes from 1 to -1, i.e. crosses zero
evts[(df2['sig'].shift() != df2['sig'])&(0 > df2['sig'])] = 1.0
# make it column for plotting
df2['events'] = evts

# preparing plot
fig,ax = plt.subplots(figsize=(20,20))
ax.xaxis_date()
ax.xaxis.set_major_locator(plticker.MaxNLocator(20))

# temperature itself
ax.plot(df2['temperature'],'-xk')
ax2=ax.twinx()

# 'events'
ax2.plot(df2['events'],'-xg')

## uncomment next two lines for plotting of signum
# ax3=ax.twinx()
# ax3.plot(df2['sig'],'-m')

# x-axis tweaking
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
minLim = '2020-01-02 00:07:00'
maxLim = '2020-01-02 23:59:00'
plt.xlim(mdates.date2num(pd.Timestamp(minLim)),
          mdates.date2num(pd.Timestamp(maxLim)))
plt.show()

which produced a blank graph with these warnings:

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:31: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

/usr/local/lib/python3.6/dist-packages/ipykernel_launcher.py:38: SettingWithCopyWarning:
(same message as above)

Update3:

Writing a for-loop to generate a graph for each day:

df['timestamp'] = pd.to_datetime(df['timestamp'])   
df['Date'] = df['timestamp'].dt.date     
df.set_index(df['timestamp'], inplace=True)

start_date = pd.to_datetime('2020-01-01 00:00:00')
end_date = pd.to_datetime('2020-02-01 00:00:00')
df = df.loc[(df.index >= start_date) & (df.index <= end_date)]

for date in df['Date'].unique():
  df_date = df[df['Date'] == date]

  # one does not need duplicate temperature values,
  # because the task is to find changing values
  df2 = pd.DataFrame.copy(df_date.loc[df_date['temperature'].shift() != df_date['temperature']])

  # ye good olde forward difference
  der = np.sign(np.diff(df2['temperature']))
  # to have the same length as index
  der = np.insert(der, len(der), np.NaN)
  # make it a column
  df2['sig'] = der

  # temporary array
  evts = np.zeros(len(der))
  # find the points where the signum changes from 1 to -1, i.e. crosses zero
  evts[(df2['sig'].shift() != df2['sig']) & (0 > df2['sig'])] = 1.0
  # make it a column for plotting
  df2['events'] = evts

  # preparing plot
  fig, ax = plt.subplots(figsize=(30, 10))
  ax.xaxis_date()
  ax.xaxis.set_major_locator(plticker.MaxNLocator(20))

  # temperature itself
  ax.plot(df2['temperature'], '-xk')
  ax2 = ax.twinx()

  # 'events'
  ax2.plot(df2['events'], '-xg')

  # x-axis tweaking
  ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
  minLim = '2020-01-02 00:07:00'
  maxLim = '2020-01-02 23:59:00'
  plt.xlim(mdates.date2num(pd.Timestamp(minLim)),
           mdates.date2num(pd.Timestamp(maxLim)))

  ax.autoscale()
  plt.title(date)
  print(np.count_nonzero(df2['events'][minLim:maxLim]))
  plt.show()

The graphs plot correctly, but the event counts are wrong.


Update4:


It looks like some graphs (e.g. 2020-01-01, 2020-01-04, 2020-01-05) cover only a random fragment of the day (probably because they are weekends). Is there a way to exclude these days?

nilsinelabore
  • Just to make sure: does 2 look like a random temperature change, or should it finish when 3 starts in the picture? Also, could you provide any sample data? – FBruzzesi Mar 28 '20 at 10:32
  • @FBruzzesi Yes, you're right. The activation or event is usually pretty short, lasting a few seconds, but the temperature decreases slowly, represented by the downward-sloping curve between 2 and 3. Please see the edited question for sample data – nilsinelabore Mar 28 '20 at 12:42
  • The first 'red' circle looks identical to 'blue' #2, 6 and, to some extent, 3. Do I understand right that the only difference is the slope sign right after the peak? – Suthiro Mar 28 '20 at 13:14
  • @Suthiro Yeah, I think you can say that. Another reason is that, looking at the graph, we notice red 1 is isolated on its own (the gradient is small), which implies it could be due to natural random temperature fluctuation rather than an activation. On the other hand, blue 2, 3 and 6 are in the middle of a cluster of events (the temperature is generally higher), which is unlikely to show random temperature fluctuations in a very short time period, so we believe they are caused by actual activations. Sorry it's a bit confusing.. – nilsinelabore Mar 28 '20 at 13:25
  • Could you please provide a link to series used to produce the image above? Then I could give a try to analyze it. Otherwise it is hard to suggest something useful. – Suthiro Mar 28 '20 at 13:34
  • @Suthiro Sure, may I have your email please? – nilsinelabore Mar 28 '20 at 13:45
  • @nilsinelabore I'm unsure if I want to share my email in the comments. Why not upload it to virtually any file host and share a link? – Suthiro Mar 28 '20 at 13:51

1 Answer


First of all, I'd advise you to increase the number of data points, I mean in the experimental setup itself.
Nevertheless, it looks like one can extract 'events' from the data provided. The idea is simple: we need to find 'peaks', characterized by a rise-then-decline pattern. To find rises and declines, it is natural to use the first-order derivative, and since we are interested only in its sign (plus for an increasing function, minus for a decreasing one), I simply take the signum of the first-order forward difference. Since we assume that no peaks occur spontaneously, we need to find the points where the sign of the forward difference changes. This is, in fact, a surrogate second-order derivative; I achieved almost the same result using a plain second-order forward difference, but it is not as handy.
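The relation between the sign-change mask and a plain second-order forward difference can be checked on a toy series (made-up values, not the asker's data; in general the two only agree approximately, as noted above):

```python
import numpy as np

# Toy series with duplicate readings already removed (assumed values)
t = np.array([20.0, 20.5, 21.0, 20.5, 20.0, 20.5, 21.5, 21.0])

sig = np.sign(np.diff(t))                      # +1 rising, -1 falling
mask_sign = (sig[:-1] == 1) & (sig[1:] == -1)  # signum flips + -> - at a peak

d2 = np.diff(t, n=2)                           # plain 2nd-order forward difference
mask_d2 = (d2 < 0) & (sig[:-1] == 1)           # negative curvature while rising

print(np.count_nonzero(mask_sign))             # -> 2 peaks
print(np.array_equal(mask_sign, mask_d2))      # -> True on this toy series
```

The signum version is handier because `d2 < 0` can also fire where a rise merely slows down without turning into a decline.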


I used the following routine:

# imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import matplotlib.ticker as plticker
# endimports

# path to csv
path = r'JanuaryData.csv'
# reading the csv
df = pd.read_csv(path,usecols=['timestamp','temperature'],parse_dates=True, index_col='timestamp')

# selecting the part for the analysis
startDate = '2020-01-01 00:00:00'
endDate = '2020-01-03 23:59:00'
df = df.loc[startDate:endDate]

# one does not need duplicate temperature values, 
# because the task is to find changing values
df2 = df.loc[df['temperature'].shift() != df['temperature']]

# ye good olde forward difference
der = np.diff(df2['temperature'])
# to have the same length as index
der = np.insert(der,len(der),np.NaN)
# make it column
df2['sig'] = np.sign(der)

# temporary array
evts = np.zeros(len(der))
# find the points where the signum changes from 1 to -1, i.e. crosses zero
evts[(df2['sig'].shift() != df2['sig']) & (0 > df2['sig'])] = 1.0
# make it column for plotting
df2['events'] = evts

# preparing plot
fig,ax = plt.subplots(figsize=(20,20))
ax.xaxis_date()
ax.xaxis.set_major_locator(plticker.MaxNLocator(20))

# temperature itself
ax.plot(df2['temperature'],'-xk')
ax2=ax.twinx()

# 'events'
ax2.plot(df2['events'],'-xg')

## uncomment next two lines for plotting of signum
# ax3=ax.twinx()
# ax3.plot(df2['sig'],'-m')

# x-axis tweaking
ax.xaxis.set_major_formatter(mdates.DateFormatter('%H:%M'))
minLim = '2020-01-02 00:07:00'
maxLim = '2020-01-02 23:59:00'
plt.xlim(mdates.date2num(pd.Timestamp(minLim)),
          mdates.date2num(pd.Timestamp(maxLim)))
plt.show()

The image produced by the code: [plot of temperature (black) with event markers (green)]. The peaks of the green curve mark the beginning of the corresponding temperature peaks, and I'm sorry for the not-so-visual representation. I tried the algorithm on the other data in the .csv, and it looks like it works well.


EDIT #1: replace the line

df2 = df.loc[df['temperature'].shift() != df['temperature']]

with

df2 = pd.DataFrame.copy(df.loc[df['temperature'].shift() != df['temperature']])

to get rid of SettingWithCopyWarning.

and also rewrite the forward-difference lines from

# ye good olde forward difference
der = np.diff(df2['temperature'])
# to have the same length as index
der = np.insert(der,len(der),np.NaN)
# make it column
df2['sig'] = np.sign(der)

to

# ye good olde forward difference
der = np.sign(np.diff(df2['temperature']))
# to have the same length as index
der = np.insert(der,len(der),np.NaN)
# make it column
df2['sig'] = der

to prevent the np.sign() warning about the NaN value.


EDIT #2: to print the number of events in a range, use

print(np.count_nonzero(df2['events'][minLim:maxLim]))

for the limits used above it prints 6; for the entire dataset it gives 174.
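As a self-contained illustration of that counting line (with made-up timestamps and events standing in for `df2`, not the real CSV):

```python
import numpy as np
import pandas as pd

# Made-up events column on a DatetimeIndex, standing in for df2
idx = pd.date_range('2020-01-02 00:00', periods=6, freq='4h')
df2 = pd.DataFrame({'events': [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]}, index=idx)

minLim = '2020-01-02 00:07:00'
maxLim = '2020-01-02 23:59:00'

# String slicing works because the index is a DatetimeIndex
n = np.count_nonzero(df2['events'][minLim:maxLim])
print(n)  # -> 3 events fall inside the window
```

This is also why the index matters: without a DatetimeIndex, the `[minLim:maxLim]` slice has nothing to match the timestamp strings against.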

Suthiro
  • Thank you. But I received some errors. Please see the edited question – nilsinelabore Mar 28 '20 at 22:43
  • These are not errors, but warnings. See edit #1 to get rid of them; however, I had them too and they did not interfere with the graph. Just in case, I use python 3.7.6rc1, pandas 1.0.2, numpy 1.18.0, matplotlib 3.1.2. – Suthiro Mar 28 '20 at 23:56
  • Hi Suthiro, I commented out `minLim = '2020-01-02 00:07:00' maxLim = '2020-01-02 23:59:00' plt.xlim(mdates.date2num(pd.Timestamp(minLim)), mdates.date2num(pd.Timestamp(maxLim)))` and am able to plot the graph now, however, there are no `xticks`. – nilsinelabore Mar 29 '20 at 00:26
  • Then after this, how can I calculate the number of peaks? Thanks a million... – nilsinelabore Mar 29 '20 at 00:27
  • Update: I manually set the column `timestamp` as index using `df2.set_index(df2['timestamp'], inplace=True)` and the xticks are showing now:) – nilsinelabore Mar 29 '20 at 06:13
  • @nilsinelabore I'm not sure why you have issues with index, but it's good you solved it. See edit #2 for simple events counting. – Suthiro Mar 29 '20 at 12:34
  • Hi Suthiro, the graph looks awesome! I also tried to apply a for-loop which seems to work too, but not the counting section.. I have edited the question. – nilsinelabore Mar 29 '20 at 13:29
  • @nilsinelabore You are not updating minLim and maxLim in your for-loop, always counting the peaks between `2020-01-02 00:07:00` and `2020-01-02 23:59:00`. – Suthiro Mar 29 '20 at 14:49
  • I see. Could you please advise how I can format the `minLim` and `maxLim` to tailor each `date` as of "Update3"? – nilsinelabore Mar 29 '20 at 22:06
  • 1
    replace corresponding lines with `minLim = date`, `maxLim = date + pd.tseries.frequencies.to_offset('1d')` – Suthiro Mar 29 '20 at 22:12
  • Deleting the weekend data using `df = df[df.index.dayofweek < 5]` seems to solve the issue in Update4.. Thank you. – nilsinelabore Mar 30 '20 at 01:18
  • Sorry may I ask what `der = np.insert(der,len(der),np.NaN)` means and why we need to do this? – nilsinelabore Apr 08 '20 at 11:56
  • 1
    @nilsinelabore `np.diff(df2['temperature'])` returns a vector one element shorter than `'temperature'`, because there is no way to calculate difference for the first point (see `numpy.diff` for extended info). To create a new column in existing `pandas.DataFrame` the data should be the same length as `index`, so I insert one "meaningless" element to the end of the vector to make it the same length as `df2.index` (see `numpy.insert` for extended info about parameters). – Suthiro Apr 08 '20 at 15:00