I have an "ideal" formula in which I have a sum of values like
score[i] = SUM(properties[i]) * frequency[i] + recency[i]
with properties
being a vector of values, frequency
and recency
scalar values, taken from a given dataset of N items. While all variables here is numeric and with discrete integer values, the recency
value is a UNIX timestamp in a given time range (like 1 month since now, or 1 week since now, etc. on daily basis).
In the dataset, each item i has a date value expressed as recency[i], a frequency value frequency[i], and the list properties[i]. All properties of item[i] are therefore evaluated on each day, expressed as recency[i], within the proposed time range.
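For concreteness, here is a minimal sketch of how such a score could be computed; the array names and the randomly generated values are illustrative assumptions, not part of the actual dataset.

import numpy as np

# Hypothetical example: N items, each with a small vector of discrete
# property values, a scalar frequency and a recency UNIX timestamp.
N = 5
rng = np.random.default_rng(0)
properties = rng.integers(1, 10, size=(N, 3))                  # properties[i] is a vector
frequency = rng.integers(1, 50, size=N)                        # frequency[i] is a scalar
recency = rng.integers(1_690_000_000, 1_692_600_000, size=N)   # timestamps in the time range

# score[i] = SUM(properties[i]) * frequency[i] + recency[i]
score = properties.sum(axis=1) * frequency + recency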
According to this formula, the recency contribution to the score value of item[i] is a negative contribution: the older the timestamp, the better the score (hence the + sign in that formula).
My idea was to use a re-scaling approach over the given range, like
from sklearn.preprocessing import MinMaxScaler

# values must be a 2-D array of shape (n_samples, n_features), e.g. the recency column reshaped to (N, 1)
scaler = MinMaxScaler(feature_range=(min(recencyVec), max(recencyVec)))
scaler = scaler.fit(values)
normalized = scaler.transform(values)
where recencyVec collects the recency values of all data points, min(recencyVec) is the first day and max(recencyVec) is the last day, using the scikit-learn MinMaxScaler object and hence transforming the recency values by scaling each feature to the given range, as suggested in How to Normalize and Standardize Time Series Data in Python.
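Put together, a runnable sketch of this idea might look as follows; the timestamps are made up for illustration, and note that in scikit-learn feature_range is the target range of the transform, so the sketch below scales to (0, 1) rather than to (min(recencyVec), max(recencyVec)), purely to show the recency values on a scale comparable with the other discrete terms.

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Illustrative recency timestamps collected over the chosen time range (daily basis)
recencyVec = np.array([1_690_000_000, 1_690_864_000, 1_691_728_000, 1_692_592_000])

# MinMaxScaler expects a 2-D array of shape (n_samples, n_features)
values = recencyVec.reshape(-1, 1)

# feature_range is the *target* range of the transform; (0, 1) keeps the
# normalized recency on a scale comparable to the other discrete terms
scaler = MinMaxScaler(feature_range=(0, 1))
normalized = scaler.fit_transform(values)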
Is this the correct approach for this numerical formulation? What alternative approaches could be used to normalize the timestamp values when they are summed with other discrete numeric values?