5

I have a type to describe post-calibration radiocarbon date probability distributions. The details and background don't matter for the question: It boils down to one probability value in _calPDFDens for each year in _calPDFCals:

data CalPDF = CalPDF {
    -- | Sample identifier, e.g. a lab number
      _calPDFid :: String
    -- | Years calBCAD
    , _calPDFCals :: VU.Vector YearBCAD
    -- | Probability densities for each year in '_calPDFCals'
    , _calPDFDens :: VU.Vector Float
    }

(VU is Data.Vector.Unboxed)

Now: It is common practice to sum multiple such distributions to derive a sum probability distribution. That means a full outer join on the years in _calPDFCals and then summing the respective values in _calPDFDens. I implemented this as follows:

sumPDFs :: CalPDF -> CalPDF -> CalPDF
sumPDFs = combinePDFs (+)

combinePDFs :: (Float -> Float -> Float) -> CalPDF -> CalPDF -> CalPDF
combinePDFs f (CalPDF name1 cals1 dens1) (CalPDF name2 cals2 dens2) = 
    let startRange = minimum [VU.head cals1, VU.head cals2]
        stopRange = maximum [VU.last cals1, VU.last cals2]
        emptyBackdrop = zip [startRange..stopRange] (repeat (0.0 :: Float))
        pdf1 = VU.toList $ VU.zip cals1 dens1
        pdf2 = VU.toList $ VU.zip cals2 dens2
        pdfCombined = fullOuter f pdf2 (fullOuter f pdf1 emptyBackdrop)
        pdfNew = CalPDF (name1 ++ "+" ++ name2) (VU.fromList $ map fst pdfCombined) (VU.fromList $ map snd pdfCombined)
    in normalizeCalPDF pdfNew
    where
        -- https://stackoverflow.com/questions/24424403/join-or-merge-function-in-haskell
        fullOuter :: (Float -> Float -> Float) -> [(YearBCAD, Float)] -> [(YearBCAD, Float)] -> [(YearBCAD, Float)]
        fullOuter _ xs [] = xs
        fullOuter _ [] ys = ys
        fullOuter f xss@(x:xs) yss@(y:ys)
            | fst x == fst y = (fst x, f (snd x) (snd y)) : fullOuter f xs ys
            | fst x < fst y  = x                          : fullOuter f xs yss
            | otherwise      = y                          : fullOuter f xss ys

I was wondering if I could rewrite this code, so that CalPDF becomes an instance of Monoid and sumPDFs becomes <>.

The issue I can not overcome and which lead me to post is question, is how mempty should look like. I already have something like this in combinePDFs: emptyBackdrop. This is required in my implementation, to fill or complete years in between both input PDFs, if they do not overlap.

emptyBackdrop fulfills some of the requirements for mempty, but it depends on the input PDFs. Theoretically, the true mempty would be a CalPDF, which starts at the beginning of time, ends at the end of time and attributes each of these infinite years a probability of zero. But this can not be implemented with unboxed vectors.

Is there an elegant way to make CalPDF and instance of Monoid? Would it be useful already to make it an instance of Semigroup with what I have already?


Edit: As suggested by @leftaroundabout here is a reproducible, minimal implementation of the setup described above.

main :: IO ()
main = do
    let myPDF1 = [(1,1), (2,1), (3,1)]
        myPDF2 = [(2,1), (3,1), (4,1)]
    putStrLn $ show $ sumPDFs myPDF1 myPDF2

type CalPDF = [(Int, Float)]

sumPDFs :: CalPDF -> CalPDF -> CalPDF
sumPDFs pdf1 pdf2 = 
    let startRange = minimum [fst $ head pdf1, fst $ head pdf2]
        stopRange = maximum [fst $ last pdf1, fst $ last pdf2]
        emptyBackdrop = zip [startRange..stopRange] (repeat (0.0 :: Float))
        pdfCombined = fullOuter pdf2 (fullOuter pdf1 emptyBackdrop)
    in pdfCombined
    where
        fullOuter :: [(Int, Float)] -> [(Int, Float)] -> [(Int, Float)]
        fullOuter xs [] = xs
        fullOuter [] ys = ys
        fullOuter xss@(x@(year1,dens1):xs) yss@(y@(year2,dens2):ys)
            | year1 == year2 = (year1, dens1 + dens2) : fullOuter xs ys
            | year1 < year2  = x                      : fullOuter xs yss
            | otherwise      = y                      : fullOuter xss ys
nevrome
  • 1,471
  • 1
  • 13
  • 28
  • 2
    If you just want to sum them, an instance of `Semigroup` is enough. `<>` is in fact part of `Semigroup` – marcosh May 30 '22 at 10:57
  • I don't think this can be a law-abiding `Monoid` _or_ `Semigroup` as it stands, not with all that meta-information. But I suspect you _could_ refactor the data type into something with a valid `Monoid` or even `Monad`-ish instance (parametric on what is now the fixed `YearBCAD` type). — Please simplify the code here, replace unboxed vectors with lists etc., to make the question as clear as possible. – leftaroundabout May 30 '22 at 11:19
  • Really, `fullOuter` seems to be the interesting part. I suggest making this standalone, look at the most general type, that already gives more insights. Also: getting rid of the `fst`/`snd` in favour of pattern matching would aid readability. – leftaroundabout May 30 '22 at 11:22
  • I added a simplified, minimal and reproducible example, @leftaroundabout. I fear, though, that this minimal representation of the problem may not necessarily be sufficient to answer my question. I can not simply give up the performance improvements of unboxed vectors in other parts of my code base to get some neat abstraction here. – nevrome May 30 '22 at 12:26
  • Ok. Now... why do you need `emptyBackdrop`? Shouldn't you be able to directly join `pdf1` and `pdf2`, since any entries that are missing in one of them are implicitly zero in the other anyway? – leftaroundabout May 30 '22 at 12:50
  • I think I need it at least to cover cases where `pdf1` and `pdf2` do not overlap at all. The result should always have all years between the two distributions. – nevrome May 30 '22 at 13:00
  • 1
    Shouldn'it be `stopRange = maximum [fst $ last pdf1, fst $ last pdf2]` ? so we include year 4 ? – jpmarinier May 30 '22 at 13:25
  • True! Fixed. Thanks for pointing it out. – nevrome May 30 '22 at 13:34
  • ...result should have all years that are mentioned in _either of the distributions_. –right? But `fullOuter` already guarantees that, also without any `emptyBackdrop`. – leftaroundabout May 30 '22 at 13:43
  • Either of the distributions AND years between the distributions. So if `pdf1` covers the years 1-10 and `pdf2` the years 20-30, then the result should also cover 11-19, so the complete set of 1-30. Sorry that I didn't make this clear right away. – nevrome May 30 '22 at 14:28
  • So, the year ranges are always without gaps? Then it looks like you should never be storing years in a vector/list at all, but instead just the start- and end points. But, _why_ store all of them, even if many will be zero? – leftaroundabout May 30 '22 at 17:24
  • You're asking interesting questions here. This is indeed a good point and I might have gotten a bit lost in my implementation. Maybe I should go back and do some more general refactoring. But this goes beyond the scope of this question. – nevrome May 30 '22 at 17:46
  • Presumably all the `YearBCAD`s before the beginning of your `_calPDFCals` are implicitly associated with weight 0, and all the `YearBCAD`s after similarly. So can't your empty value simply have empty vectors? Wouldn't that represent "starts at the beginning of time, ends at the end of time and attributes each of these infinite years a probability of zero"? – Daniel Wagner May 30 '22 at 17:55
  • Oh - that's finally an idea that is in line with what I initially had in mind when writing this question. I think that would be a possibility. But now that I saw the comments/answers by leftaroundabout and you, Daniel Wagner I'm considering some more general refactoring. – nevrome May 30 '22 at 18:02

2 Answers2

4

Consider reworking your type a bit.

import Data.Map (Map)
import qualified Data.Map as M

data CalPDF = CalPDF
    { _calPDFid :: [String]
    , _calPDFdens :: Map YearBCAD Float
    }

The instances can now be quite short indeed:

instance Semigroup CalPDF where
    CalPDF id dens <> CalPDF id' dens' = CalPDF
        (id <> id')
        (M.unionWith (+) dens dens')

instance Monoid CalPDF where
    mempty = CalPDF mempty mempty

I used [String] in place of a single +-delimited String for two reasons: it allows you to use + in your names without ambiguity, and it makes <> a bit simpler as you don't need to avoid adding a + when one or the other argument is the empty String to remain law-abiding. Pretty-printers can still show this with +s using e.g. intercalate "+".

You could use HashMap or IntMap in place of Map in essentially the same way if one of those fits your needs better.

Daniel Wagner
  • 145,880
  • 9
  • 220
  • 380
  • 1
    Very interesting! A Map might actually be a much better data structure for this. Especially given that `unionWith` is essentially a crazy simple replacement of my outer join solution. – nevrome May 30 '22 at 17:59
2

Any Semigroup can be lifted to a Monoid with Maybe.

Daniel Wagner
  • 145,880
  • 9
  • 220
  • 380