4

I am dealing with datasets with missing data and need to be able to fill forward, backward, and gaps. So, for example, if I have data from Jan 1, 2000 to Dec 31, 2010, and some days are missing, when a user requests a timespan that begins before, ends after, or encompasses the missing data points, I need to "fill in" these missing values.

Is there a proper term for this concept of filling in data? Imputation is one term, but I don't know if it is "the" term for it.

I presume there are multiple algorithms and methodologies for filling in missing data (use the last measured value, or use the median, average, or moving average between two known values, etc.).

Does anyone know the proper term for this problem, any online resources on this topic, or ideally links to open source implementations of some algorithms (C# preferably, but any language would be useful)?

tbone

3 Answers

2

The term you're looking for is interpolation. (obligatory wiki link)

You're asking for a C# solution with datasets, but you should also consider doing this at the database level.

A simple, brute-force approach in C# could be to build an array of consecutive dates, using the beginning and end of the requested range as the min/max values. Then use that array to merge "interpolated" date values into your data set by inserting a row for every date in the array that has no matching date in the dataset.
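To make the idea concrete, here is a rough sketch in Python rather than C# (it should translate fairly directly). It assumes the data is a dict keyed by date; the function names `reindex_daily` and `fill_forward_then_backward` are made up for illustration. The first step builds the full calendar and marks missing days with `None`; the second fills them with the last known value, and falls back to the next known value for days before the first observation.

```python
from datetime import date, timedelta

def reindex_daily(observations, start, end):
    """Return one (day, value) row per day in [start, end]; value is None where missing."""
    rows = []
    day = start
    while day <= end:
        rows.append((day, observations.get(day)))
        day += timedelta(days=1)
    return rows

def fill_forward_then_backward(rows):
    """Carry the last known value forward, then back-fill any leading gaps."""
    filled = []
    last = None
    for day, value in rows:
        if value is not None:
            last = value
        filled.append([day, last])
    # After the forward pass only rows before the first observation are still None.
    following = None
    for row in reversed(filled):
        if row[1] is None:
            row[1] = following
        else:
            following = row[1]
    return [tuple(row) for row in filled]

# Hypothetical sample data: two known days inside a five-day request.
known = {date(2000, 1, 2): 10.0, date(2000, 1, 4): 14.0}
calendar = reindex_daily(known, date(2000, 1, 1), date(2000, 1, 5))
result = fill_forward_then_backward(calendar)
```

Carrying the last value forward is only one possible policy; the same reindexed rows could just as well be handed to an interpolation routine.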

Here is an SO post that gets close to what you need: interpolating missing dates with C#. There is no accepted solution, but reading the question and the attempted answers may give you an idea of what to do next. For example, treat the DateTime data in terms of Ticks (a long value), use an interpolation scheme on that data, and then convert the interpolated long values back to DateTime values.
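A minimal sketch of that round trip, again in Python. Python has no `Ticks` property, so this assumes whole microseconds since an arbitrary epoch as a stand-in (.NET ticks are 100-nanosecond units, but the idea is the same). The helper names `to_ticks`, `from_ticks`, and `interpolate_datetime` are hypothetical.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)          # arbitrary reference point for the integer scale
ONE_MICROSECOND = timedelta(microseconds=1)

def to_ticks(moment):
    """Convert a naive datetime to an integer count of microseconds since EPOCH."""
    return (moment - EPOCH) // ONE_MICROSECOND

def from_ticks(ticks):
    """Convert an integer microsecond count back to a datetime."""
    return EPOCH + timedelta(microseconds=ticks)

def interpolate_datetime(start, end, fraction):
    """Linearly interpolate between two datetimes; fraction 0 gives start, 1 gives end."""
    a, b = to_ticks(start), to_ticks(end)
    return from_ticks(a + round((b - a) * fraction))

midpoint = interpolate_datetime(datetime(2000, 1, 1), datetime(2000, 1, 3), 0.5)
```

Working on integers keeps the arithmetic simple, and the same numeric scale can serve as the x-axis when interpolating the measured values themselves.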

Paul Sasik
2

The algorithm you use will depend a lot on the data itself, the size of the gaps compared to the available data, and its predictability based on existing data. It could also incorporate other information you might know about what's missing, as is common in statistics, when your actual data may not reflect the same distribution as the universe across certain categories.

Linear and cubic interpolation are typical algorithms that are not difficult to implement; try googling those.
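As a rough illustration of how small these are, here is a Python sketch of both. The cubic shown is one simple four-point form, not the only one, and the function names are just for illustration. In both functions `mu` is the position between the two inner points, from 0 (at `y1`) to 1 (at `y2`); the cubic also needs one neighbour on each side (`y0` and `y3`).

```python
def linear_interpolate(y1, y2, mu):
    """Straight-line blend between y1 (mu = 0) and y2 (mu = 1)."""
    return y1 * (1 - mu) + y2 * mu

def cubic_interpolate(y0, y1, y2, y3, mu):
    """Four-point cubic: interpolates between y1 and y2, shaped by neighbours y0 and y3."""
    mu2 = mu * mu
    a0 = y3 - y2 - y0 + y1
    a1 = y0 - y1 - a0
    a2 = y2 - y0
    a3 = y1
    return a0 * mu * mu2 + a1 * mu2 + a2 * mu + a3

# Hypothetical values: estimate a point a quarter of the way between 10.0 and 20.0.
estimate = linear_interpolate(10.0, 20.0, 0.25)
```

The cubic needs known values on both sides of the gap, so at the very start or end of the data you would have to fall back to linear interpolation or to simply repeating the nearest value.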

Here's a good primer with some code:

http://paulbourke.net/miscellaneous/interpolation/

The context of the discussion in that link is graphics but the concepts are universally applicable.

Jamie Treworgy
0

For the purpose of feeding statistical tests, a good search term is imputation - e.g. http://en.wikipedia.org/wiki/Imputation_%28statistics%29

mcdowella