
I'm looking for an elegant and efficient way to represent and store an arbitrary probability distribution constructed by explicit sampling.

The distribution is expected to have the following properties:

  • Samples are floating-point values, but in principle can be thought of as having a resolution of 0.001
  • Samples are drawn from an interval [-4000; 4000]
  • However, for any two samples a, b, |a - b| < 40
  • 90% of the time, it will have a sharp peak or several sharp peaks close to each other
  • 10% of the time, it will have a peak with an uneven plateau of width 0.5 to 5.

The usual representation -- a histogram array -- is undesirable mainly because of the trade-off between quantization/resolution and space. I imagine there must be a method of representation that adaptively varies the bin size depending on local "complexity".

Space is of concern, because a higher-level grid-like data structure will contain thousands of cells, each containing at least one such probability representation. Easy serialization for disk or network transfer is desirable, but efficiency is not a priority.

Any help would be appreciated.

George Skoptsov
  • Why is space a concern? What are your requirements for precision? What do you want to store it for (storage on disk, manipulation in memory (what kind of manipulation?))? In the case of disk storage I'd consider a simple histogram array + zipping. – Jens Schauder Mar 20 '12 at 15:32
  • I edited the question to address your comment. Precision preference is .001. A histogram array would require 40,000 cells in my case, which is not acceptable. Even 400 cells is pushing it. There will be 1000s of these representations stored in memory, each one being updated or queried on a regular basis, so generic compression would likely incur an unacceptable overhead. – George Skoptsov Mar 20 '12 at 15:40
  • Well, how much precision loss is acceptable? Since a loss of information is (almost?) inevitable if you don’t want to store the original data points. – Konrad Rudolph Mar 20 '12 at 15:48
  • At the risk of sounding inane, I cannot give a direct answer to the question. In the worst case I can deal with quantizing data down to .01, although .001 would be preferred. Ideally, I would like to have an adaptive solution that maximizes precision based on local density. I.e. if 95% of my data is concentrated within 1 unit of the median, I'd like that region to have the highest precision, but the remaining 5% could be a lot less precise. – George Skoptsov Mar 20 '12 at 15:56
  • I'm not a statistician or mathematician by any means, so sorry if I'm talking nonsense, but couldn't that kind of problem be solved by compressing the in-memory data using an algorithm such as those used by FLAC, since audio waveforms are somewhat similar to your data? I'm also wondering if some form of interpolation might be possible... – SirDarius Mar 20 '12 at 16:00
  • Have you considered quantizing the data into something like an Octree (but in 1D), with detail/splits increasing in places where there is more detail/amplitude? http://en.wikipedia.org/wiki/Octree – Akanksh Mar 20 '12 at 16:34
  • @SirDarius A frequency-based approach is not a bad idea, but it lends itself well to situations where you have all the data upfront. One of my requirements is to build it 'online', as the samples keep coming in. – George Skoptsov Mar 20 '12 at 19:49
  • @Akanksh I like the oct-tree approach, but I want to avoid storing explicit values... – George Skoptsov Mar 20 '12 at 19:54
  • @GeorgeSkoptsov If you do not want to store explicit values, but some kind of approximation, you can use the octree cells to define the number of elements in it. (Reverse engineer from the granularity of the octree node). Although, if you could mention how exactly you want to use that PDF, perhaps people here will be able to guide you with more appropriate data structures. Because in worst case, you don't have to store the data at all, but calculate a curve fitting equation for the sample. – Akanksh Mar 21 '12 at 16:22
  • Literally the absolute optimal compression for a probability distribution is known as arithmetic coding (some people say that it is equivalent to range coding). It's patent-encumbered however. A naive 17-bit delta coder will produce a ~53% reduction in storage size from floats. An adaptive delta coder that models your data set will do better, perhaps ~70%. A linear prediction + rice-golomb coding (like FLAC) should do good for you too. – std''OrgnlDave Mar 23 '12 at 17:09

4 Answers


Interesting problem. Here is a suggestion, which may be quite difficult to implement depending on how mathematically inclined you are.

Note that I trade speed for space, since what I suggest is likely to be quite heavy computationally (but this should be tested against real data).

First, use a functional approach. A probability distribution is a probability measure:

#include <functional>

// A probability distribution, abstracted as something you can integrate an
// arbitrary function against, i.e. something that can compute E[f(X)].
struct Distribution
{
    virtual ~Distribution() {}
    virtual double integrate(std::function<double(double)> f) = 0;
};

This way, you abstract away from the samples that you produce, since you do not want to store them. Convince yourself that you can do pretty much anything with an "integrate" method.
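
For illustration, here is a small sketch (mine, not part of any library) of derived quantities that each reduce to a single integrate call; the helper names mean, variance and cdf are only illustrative:

// Illustrative helpers built purely on the Distribution::integrate interface:
// every moment or tail probability is just E[f(X)] for a suitable f.
double mean(Distribution& d)
{
    return d.integrate([](double x) { return x; });
}

double variance(Distribution& d)
{
    const double m = mean(d);
    return d.integrate([m](double x) { return (x - m) * (x - m); });
}

// P(X <= t), i.e. the cumulative distribution function evaluated at t.
double cdf(Distribution& d, double t)
{
    return d.integrate([t](double x) { return x <= t ? 1.0 : 0.0; });
}

Any concrete Distribution, including the sampled one below, plugs straight into these.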

Of course, with the explicit samples, you do something like

#include <deque>

// The empirical distribution given explicit samples: integrating f is just
// averaging f over the samples.
struct SampledDistribution : Distribution
{
    double integrate(std::function<double(double)> f)
    {
        double acc = 0;
        for (double x: samples) acc += f(x);
        return acc / samples.size();
    }

    std::deque<double> samples;
};

Now, the storage part:

The usual representation -- a histogram array -- is undesirable mainly because of the trade-off between quantization/resolution and space. I imagine there must be a method of representation that adaptively varies the bin size depending on local "complexity".

The traditional method is wavelets. You produce coefficients with calls to integrate, which you can serialize. You throw coefficients away if the variance of the integral estimator they produce is high.

Then, to deserialize, you produce a Distribution object whose integrate method performs the integration against the wavelets. Integration can be performed with your favorite quadrature method. I stay deliberately vague here because the actual implementation depends on the wavelet family you choose (smooth, compactly supported, orthogonal or not, etc.). You will need to dive into the literature for details anyway.

The point here is that you usually need only very few wavelets to represent a smooth function with few features (say a few peaks, and regularly shaped otherwise), unlike more "regular" finite elements (a histogram is a particular kind of finite-element representation). The wavelet representation adapts itself to the features of the function being represented, whatever their location or scale. Moreover, you can decide how many coefficients to keep, and hence control the compression ratio.

Also, the 0.001 figure is a fairly coarse precision target: I suspect you'll need only a handful of coefficients.

The tradeoff lies in which wavelet class you use: very smooth distributions will likely be well represented with smooth wavelets, but compactly supported wavelets may be easier to integrate against, etc. Experiment. Note that you don't need "wavelet transform" packages here: only the explicit representation of wavelet functions, and quadrature routines (try Gauss-XXX procedures for reconstruction, or something high order).

I'd favor wavelets which are defined in the Fourier domain (like Lemarie wavelets), since their values and derivatives at zero in Fourier space are known. This allows you to enforce constraints on the distributions: probability measures must integrate to one, and you may happen to know the expected value or variance beforehand.

Also, you may want to change variables so that you only deal with functions on, e.g., [0, 1]. There is a copious literature on wavelets on the interval.
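
To make this concrete, here is a rough sketch using the simplest possible family, Haar wavelets on [0, 1], rather than the Lemarie wavelets favoured above; it assumes the samples have already been rescaled to [0, 1], and the threshold is an arbitrary tuning knob:

#include <cmath>
#include <map>
#include <utility>

// Haar wavelet psi_{j,k}(x) = 2^{j/2} psi(2^j x - k), supported on [k/2^j, (k+1)/2^j).
double haar(int j, int k, double x)
{
    const double y = std::ldexp(x, j) - k;          // 2^j * x - k
    if (y < 0.0 || y >= 1.0) return 0.0;
    return std::pow(2.0, 0.5 * j) * (y < 0.5 ? 1.0 : -1.0);
}

// d_{j,k} = E[psi_{j,k}(X)]: for an orthonormal basis these are exactly the
// wavelet coefficients of the density, estimated straight from the samples
// via integrate. Coefficients below the threshold are discarded -- that is
// where the compression comes from.
std::map<std::pair<int, int>, double>
haar_coefficients(Distribution& d, int max_level, double threshold)
{
    std::map<std::pair<int, int>, double> coeffs;
    for (int j = 0; j < max_level; ++j)
        for (int k = 0; k < (1 << j); ++k)
        {
            const double c = d.integrate([j, k](double x) { return haar(j, k, x); });
            if (std::fabs(c) > threshold)
                coeffs[std::make_pair(j, k)] = c;
        }
    return coeffs;
}

// Reconstructed density on [0, 1]: 1 plus the kept detail terms (the leading
// 1 is the scaling-function term, fixed because the density integrates to one).
double haar_density(const std::map<std::pair<int, int>, double>& coeffs, double x)
{
    double p = 1.0;
    for (const auto& c : coeffs)
        p += c.second * haar(c.first.first, c.first.second, x);
    return p;
}

Serializing the distribution then amounts to writing out the surviving (j, k, coefficient) triples.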

Alexandre C.
  • +1, nice answer. And, because you've (correctly) brought wave theory into the discussion, [I think I'll just leave this here](https://en.wikipedia.org/wiki/Fourier_transform). :) – MrGomez Mar 26 '12 at 23:06
  • @MrGomez: Good point. Wavelets defined in Fourier space have advantages here. Let me expand. – Alexandre C. Mar 27 '12 at 07:40
  • @AlexandreC. I like your answer best. It would be nice to get a bit more detail and pointers to existing implementations if you know of any. In any case, I'm going to award you the bounty as it's expiring soon. – George Skoptsov Mar 28 '12 at 16:17

I would highly recommend looking at all of the distribution models available in SciPy. Its developers deal most heavily with this kind of question, and any speculative algorithm I come up with that attempts to solve the resolution/granularity problem will be something they've at least tangentially considered.

As mentioned in the comments, though, an octree for multi-resolution data seems like the correct shape for the data structure you want. Instead of a flat histogram that constrains the data to fixed bins, your goal is to store perturbations in the data set at varying levels of granularity. This is most similar to the approach used in JPEG compression, with the additional requirement of being able to query the distribution at a chosen level of granularity and complexity.
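
For a flavour of what that structure might look like, here is a rough sketch of a 1-D octree analogue (a binary interval tree); the split threshold and minimum width are made-up parameters, and existing samples are assumed to divide evenly when a node splits, which is a crude approximation:

#include <cstddef>
#include <memory>

// A 1-D analogue of an octree node: each node covers [lo, hi) and stores only
// a sample count. A node splits once it has seen enough samples and is still
// wider than the target resolution, so refinement concentrates where the mass is.
struct IntervalNode
{
    double lo, hi;
    std::size_t count;
    std::unique_ptr<IntervalNode> left, right;

    IntervalNode(double l, double h) : lo(l), hi(h), count(0) {}

    void insert(double x, std::size_t split_threshold = 32, double min_width = 0.001)
    {
        const double mid = 0.5 * (lo + hi);
        if (!left && count >= split_threshold && (hi - lo) > min_width)
        {
            left.reset(new IntervalNode(lo, mid));
            right.reset(new IntervalNode(mid, hi));
            // Crude approximation: assume the samples seen so far split evenly.
            left->count = right->count = count / 2;
        }
        ++count;
        if (left)
            (x < mid ? left : right)->insert(x, split_threshold, min_width);
    }
};

A density estimate at x is then the count of the deepest node containing x, divided by the total count and that node's width.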

For creating a classification algorithm or dichotomizer to populate the data structure, I highly recommend looking at decomposition algorithms like JPEG 2000 for determining the shape and complexity of the perturbations in the data. Other approaches (for example, ID3 and C4.5) may also be appropriate, but I feel a simple, vectorized decomposition may work for your purposes.

And, for what it's worth, I will be very interested if you find a more applicable algorithm within the space. I work directly with researchers interested in multi-scaled relationships in data, so if you stumble upon something more efficient, please let us know!

MrGomez

Store n quantiles. They'll cluster adaptively around the modes.

If n is a power of 2, density (and cumulative-probability) lookup can be limited to O(log n) by storing the quantiles as a binary tree of medians and "sub-medians", recursively.
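
Here is a rough sketch of the flat-array version of the idea; a sorted quantile array queried with binary search gives the same O(log n) lookup as the explicit median tree, and the function names are only illustrative:

#include <algorithm>
#include <cstddef>
#include <vector>

// Summarize the samples by n quantiles (n >= 2). The quantiles automatically
// crowd together wherever the distribution has a sharp peak.
std::vector<double> make_quantiles(std::vector<double> samples, std::size_t n)
{
    std::sort(samples.begin(), samples.end());
    std::vector<double> q(n);
    for (std::size_t i = 0; i < n; ++i)
        q[i] = samples[i * (samples.size() - 1) / (n - 1)];
    return q;
}

// Cumulative probability P(X <= x): each of the n-1 inter-quantile intervals
// carries probability mass 1/(n-1); interpolate linearly inside the interval.
double quantile_cdf(const std::vector<double>& q, double x)
{
    if (x <= q.front()) return 0.0;
    if (x >= q.back())  return 1.0;
    const std::size_t i = std::upper_bound(q.begin(), q.end(), x) - q.begin();
    const double t = (x - q[i - 1]) / (q[i] - q[i - 1]);
    return (i - 1 + t) / (q.size() - 1);
}

// Density estimate: mass of the containing interval divided by its width.
double quantile_density(const std::vector<double>& q, double x)
{
    if (x < q.front() || x >= q.back()) return 0.0;
    const std::size_t i = std::upper_bound(q.begin(), q.end(), x) - q.begin();
    return 1.0 / ((q.size() - 1) * (q[i] - q[i - 1]));
}

With, say, n = 64 this is 512 bytes per cell and serializes as a flat array of doubles.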

Museful

How about this:

I assume we start with data equivalent to a histogram, i.e. a list of intervals, each with the number of samples associated with it.

Now let's build an approximation, starting with a trivial histogram: one bucket containing all the samples.

For each bucket in the approximation, consider splitting it in the middle into two buckets, but only actually split the bucket which will yield the best improvement in the approximation.

Repeat until the desired approximation quality is reached or the maximum acceptable number of buckets is obtained.

So what you need is a way to identify the best bucket to split. I'd say the maximum deviation of the two new buckets' counts from half the original bucket's count might work.

You also need some criterion for when to stop.

I think this is actually quite similar to an Octree in 1D. You might look there for efficient implementations.

For actually storing the approximation, you could just store one array with the bucket boundaries and another with the content of each bucket, or one array of twice the size with alternating bucket-boundary and content values.
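
Here is a rough sketch of that greedy procedure, assuming the raw samples are still available while the approximation is built; the bucket budget and the names are only illustrative:

#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// One bucket of the adaptive histogram: the half-open interval [lo, hi)
// and the number of samples falling in it.
struct Bucket { double lo, hi; std::size_t count; };

// Number of samples in [lo, hi); `sorted` must be sorted ascending.
std::size_t count_in(const std::vector<double>& sorted, double lo, double hi)
{
    return std::lower_bound(sorted.begin(), sorted.end(), hi)
         - std::lower_bound(sorted.begin(), sorted.end(), lo);
}

// Greedy refinement as described above: start with one bucket covering all
// samples, and repeatedly split (at the midpoint) the bucket whose halves
// deviate most from an even split, until the bucket budget is reached.
std::vector<Bucket> adaptive_histogram(std::vector<double> samples, std::size_t max_buckets)
{
    std::sort(samples.begin(), samples.end());
    std::vector<Bucket> buckets;
    // Nudge the upper boundary so the maximum sample lies inside [lo, hi).
    buckets.push_back(Bucket{ samples.front(), samples.back() + 1e-9, samples.size() });

    while (buckets.size() < max_buckets)
    {
        std::size_t best = buckets.size();
        double best_gain = 0.0;
        for (std::size_t i = 0; i < buckets.size(); ++i)
        {
            const double mid  = 0.5 * (buckets[i].lo + buckets[i].hi);
            const double left = count_in(samples, buckets[i].lo, mid);
            const double gain = std::fabs(left - 0.5 * buckets[i].count);
            if (gain > best_gain) { best_gain = gain; best = i; }
        }
        if (best == buckets.size()) break;   // every possible split is already even

        const Bucket b = buckets[best];
        const double mid = 0.5 * (b.lo + b.hi);
        const std::size_t left = count_in(samples, b.lo, mid);
        buckets[best] = Bucket{ b.lo, mid, left };
        buckets.insert(buckets.begin() + best + 1, Bucket{ mid, b.hi, b.count - left });
    }
    return buckets;
}

Storing the result is then exactly the two arrays described above: the bucket boundaries and the counts.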

Jens Schauder
  • I'm trying to avoid storing explicit values. Wouldn't the buckets have to contain explicit values? Otherwise how would I be able to split them? Or do you have a different representation in mind? I don't think a tuple of mean and stddev would cut it here; the distribution of samples in the bucket is not going to be Gaussian. – George Skoptsov Mar 20 '12 at 18:23
  • You'll have to start with the explicit values and then build an approximation from that. That's what I tried to sketch in my answer. – Jens Schauder Mar 21 '12 at 04:32