I've figured out a way to implement this function on the InCore representation of the data as follows:

groupByCol :: (Eq a, Ord a, RecVec rs) =>
             (forall (f :: * -> *).
                 Functor f =>
                 (a -> f a) -> Record rs -> f (Record rs))
             -> FrameRec rs -> Map a (FrameRec rs)
groupByCol feature frame = M.map toFrame $ F.foldl' groupBy M.empty frame
  -- insertWith applies its function as (f new old), so (++) prepends the new singleton row
  where groupBy m r = M.insertWith (++) (view feature r) [r] m

The following unit test passes; it shows how the function is used and what result is expected.

  describe "Lib.groupByCol" $ do
    it "Returns the expected map when splitting the Spam dataset on the Images column" $ do
      -- type SpamOrHam = Record '[SpamId :-> Int, SuspiciousWords :-> Bool, UnknownSender :-> Bool, Images :-> Bool, SpamClass :-> Text]
      spamFrame <- loadSpamOrHam
      let row0 :: SpamOrHam
          row0 = 489 &: True &: True &: False &: "spam" &: RNil
          row1 :: SpamOrHam
          row1 = 376 &: True &: False &: True &: "spam" &: RNil
          groupingMap = groupByCol images spamFrame
          falseRows = F.toList $ groupingMap M.! False
          trueRows = F.toList $ groupingMap M.! True
      falseRows `shouldContain` [row0]
      falseRows `shouldNotContain` [row1]
      trueRows `shouldContain` [row1]
      trueRows `shouldNotContain` [row0]
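
The grouping step itself can be tried without Frames. This is a minimal sketch, assuming plain lists and `Data.Map.Strict` from containers; `groupByKey` and `sampleRows` are hypothetical names made up for illustration, and a plain key function stands in for the lens.

```haskell
import           Data.List       (foldl')
import           Data.Map.Strict (Map)
import qualified Data.Map.Strict as M

-- Hypothetical stand-in for groupByCol: a plain key function replaces the
-- lens, and a list of rows replaces the FrameRec.
groupByKey :: Ord k => (r -> k) -> [r] -> Map k [r]
groupByKey key = foldl' step M.empty
  where
    -- insertWith calls its function as (f newValue oldValue), so (++)
    -- prepends the new singleton; each group ends up in reverse input order.
    step m r = M.insertWith (++) (key r) [r] m

-- Made-up sample rows: (id, hasImages).
sampleRows :: [(Int, Bool)]
sampleRows = [(489, False), (376, True), (12, False)]
```

Calling `groupByKey snd sampleRows` partitions the rows by the Boolean field. Because each new row is prepended, a group lists its rows in reverse input order, which does not matter for the containment checks in the unit test.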

Next, I wanted to write an equivalent groupByCol using the functions provided by the Pipes library, since that library is what Frames uses internally to optimize its calculations. Below is my best attempt. However, fold from the Pipes library appears to convert my data to a Map for me: the type checker says I'm getting back a Map a (Map a (Record rs)) rather than a Map a [Record rs], and the latter is the goal. I'm fairly sure this automatic conversion doesn't partition my dataset by the unique values in the given column, though.

groupByCol' :: (Eq a, Ord a, RecVec rs) =>
             (forall (f :: * -> *).
                 Functor f =>
                 (a -> f a) -> Record rs -> f (Record rs))
             -> FrameRec rs -> Map a (FrameRec rs)
groupByCol' feature frame =
  P.fold groupBy M.empty (M.map toFrame) (P.each frame)
    -- insertWith applies its function as (f new old), so (++) prepends the new singleton row
    where groupBy m r = M.insertWith (++) (view feature r) [r] m
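
For reference, `P.fold` from `Pipes.Prelude` returns its result inside the producer's base monad rather than as a bare value. When the signature asks for a plain `Map a (FrameRec rs)`, the checker has to read that outer `Map a` as the monad, which may be why a nested Map shows up in the error. A small sketch on integers, assuming only the pipes package (the names `sumInIdentity`, `sumPure` and `sumInIO` are made up for illustration):

```haskell
import           Data.Functor.Identity (Identity, runIdentity)
import           Pipes                 (each)
import qualified Pipes.Prelude         as P

-- P.fold :: Monad m => (x -> a -> x) -> x -> (x -> b) -> Producer a m () -> m b

-- The fold result is wrapped in whatever base monad the producer runs in.
sumInIdentity :: Identity Int
sumInIdentity = P.fold (+) 0 id (each [1, 2, 3 :: Int])

-- For a pure producer, runIdentity unwraps the result.
sumPure :: Int
sumPure = runIdentity sumInIdentity

-- The same fold over a producer whose base monad is IO.
sumInIO :: IO Int
sumInIO = P.fold (+) 0 id (each [1, 2, 3 :: Int])
```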

Here's the full code if you need further context. I don't mind sharing since this is just personal exploration. https://github.com/josiah14-MachineLearning/ID3-and-Derivatives/blob/master/ID3/haskell/sequential/hid3-and-seq/src/Lib.hs#L116


Update

I just had some success with the following code, which uses the Identity monad. However, I'd really like to stop using lists in the nested groupBy function and use Pipes there as well. I thought Pipes.yield would be the way to go, but I'm unable to satisfy the type checker.

groupByCol' :: (Eq a, Ord a, RecVec rs) =>
             (forall (f :: * -> *).
                 Functor f =>
                 (a -> f a) -> Record rs -> f (Record rs))
             -> FrameRec rs -> Map a (FrameRec rs)
groupByCol' feature frame =
  runIdentity $ P.fold groupBy M.empty (M.map toFrame) (P.each frame)
    -- insertWith applies its function as (f new old), so (++) prepends the new singleton row
    where groupBy m r = M.insertWith (++) (view feature r) [r] m
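
One way the types might line up with `yield` is to store a `Producer` per key and sequence producers with `>>` instead of consing onto a list. This is only a sketch under assumptions: `groupProducers` and `groupedSample` are hypothetical names, a plain key function stands in for the lens, and the sample rows are made up. It has not been tried against Frames records.

```haskell
import           Data.Functor.Identity (Identity, runIdentity)
import           Data.Map.Strict       (Map)
import qualified Data.Map.Strict       as M
import           Pipes                 (Producer, each, yield)
import qualified Pipes.Prelude         as P

-- Hypothetical helper: a plain key function stands in for the lens, and any
-- pure Producer stands in for (P.each frame).
groupProducers :: Ord k
               => (r -> k)
               -> Producer r Identity ()
               -> Map k (Producer r Identity ())
groupProducers key = runIdentity . P.fold step M.empty id
  where
    -- insertWith calls (f new old); flip (>>) runs the old group first,
    -- so each group yields its rows in input order.
    step m r = M.insertWith (flip (>>)) (key r) (yield r) m

-- Made-up sample rows: (id, hasImages). P.toList drains each pure producer
-- so the grouping can be inspected.
groupedSample :: Map Bool [(Int, Bool)]
groupedSample =
  M.map P.toList
        (groupProducers snd (each [(489, False), (376, True), (12, False)]))
```

The explicit type signature matters here: without it, the base monad of each stored producer is ambiguous. Long left-nested chains of `>>` may also be slow for large groups, so this is meant to show the types lining up rather than to be an efficient implementation.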

josiah
  • I don't understand your question yet, but you should check out [`pipes-group`](http://hackage.haskell.org/package/pipes-group) and perhaps [`Streaming.Pipes`](https://hackage.haskell.org/package/streaming-utils-0.2.0.0/docs/Streaming-Pipes.html). – dfeuer Aug 15 '19 at 08:39
  • @dfeuer well, what I mean is, instead of `groupBy` building a `Map a [Record rs]`, I'd like it to build a `Map a (Producer (Record rs) Identity ())` so that I don't have to exit the Pipes library implementation to complete the `groupByCol` operation. Does this help clarify my question any? Thanks for suggesting pipes-group, I'll go through the tutorial there and start tinkering. – josiah Aug 15 '19 at 18:17
  • At this point, since I figured out how to get `P.fold` to work, this question might be more appropriate for StackExchange CodeReview... – josiah Aug 15 '19 at 18:56
