I've figured out a way to implement this function on the InCore
representation of the data as follows:
    groupByCol :: (Eq a, Ord a, RecVec rs)
               => (forall (f :: * -> *). Functor f => (a -> f a) -> Record rs -> f (Record rs))
               -> FrameRec rs
               -> Map a (FrameRec rs)
    groupByCol feature frame = M.map toFrame $ F.foldl' groupBy M.empty frame
      where
        -- insertWith applies its function as (f new old), so (++) prepends the new row
        groupBy m r = M.insertWith (++) (view feature r) [r] m
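For anyone who wants to try the grouping step without the Frames setup, here is a minimal sketch of the same fold on plain lists. The names `groupByKey`, `sampleRows`, and `grouped` are made up for illustration and are not part of my project; the key function stands in for `view feature`.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as M

-- Group values by a key function, collecting each group in a list.
-- insertWith calls its combining function as (f newValue oldValue),
-- so (++) puts the most recently seen element at the front of its group.
groupByKey :: Ord k => (v -> k) -> [v] -> M.Map k [v]
groupByKey key = foldl' step M.empty
  where
    step m v = M.insertWith (++) (key v) [v] m

-- Hypothetical sample rows: (id, images flag)
sampleRows :: [(Int, Bool)]
sampleRows = [(489, False), (376, True), (12, False)]

-- Partition the sample rows on the Bool column.
grouped :: M.Map Bool [(Int, Bool)]
grouped = groupByKey snd sampleRows
```

Because new rows are prepended, each group ends up in reverse input order; that does not matter for the membership checks in my test, but it would matter if row order were significant.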
The following unit test for groupByCol passes; it shows how the function is used and what result is expected.
    describe "Lib.groupByCol" $ do
      it "Returns the expected map when splitting the Spam dataset on the Images column" $ do
        -- type SpamOrHam = Record '[SpamId :-> Int, SuspiciousWords :-> Bool, UnknownSender :-> Bool, Images :-> Bool, SpamClass :-> Text]
        spamFrame <- loadSpamOrHam
        let row0 :: SpamOrHam
            row0 = 489 &: True &: True &: False &: "spam" &: RNil
            row1 :: SpamOrHam
            row1 = 376 &: True &: False &: True &: "spam" &: RNil
            groupingMap = groupByCol images spamFrame
            falseRows = F.toList $ groupingMap M.! False
            trueRows = F.toList $ groupingMap M.! True
        falseRows `shouldContain` [row0]
        falseRows `shouldNotContain` [row1]
        trueRows `shouldContain` [row1]
        trueRows `shouldNotContain` [row0]
So, I wanted to create an equivalent groupByCol function using the functions provided by the Pipes library instead, since that library is what Frames uses internally to optimize its calculations. Below is my best attempt, but it appears that fold from the Pipes library handles converting my data to a Map for me, because the type checker says I'm getting back a Map a (Map a (Record rs)) instead of a Map a [Record rs], and the latter is the goal. I'm pretty sure this auto-conversion doesn't automatically partition my dataset by unique value in the provided column, though.
    -- Does not typecheck as written (see the type error described above)
    groupByCol' :: (Eq a, Ord a, RecVec rs)
                => (forall (f :: * -> *). Functor f => (a -> f a) -> Record rs -> f (Record rs))
                -> FrameRec rs
                -> Map a (FrameRec rs)
    groupByCol' feature frame =
      P.fold groupBy M.empty (M.map toFrame) (P.each frame)
      where
        groupBy m r = M.insertWith (++) (view feature r) [r] m
Here's the full code if you need further context. I don't mind sharing since this is just personal exploration. https://github.com/josiah14-MachineLearning/ID3-and-Derivatives/blob/master/ID3/haskell/sequential/hid3-and-seq/src/Lib.hs#L116
Update
I just had some success with the following code, which leverages the Identity monad, but I'd really like to get away from using lists in the groupBy nested function and use Pipes there as well. I thought using Pipes.yield would be the way to go, but I'm unable to satisfy the typechecker.
    groupByCol' :: (Eq a, Ord a, RecVec rs)
                => (forall (f :: * -> *). Functor f => (a -> f a) -> Record rs -> f (Record rs))
                -> FrameRec rs
                -> Map a (FrameRec rs)
    groupByCol' feature frame =
      -- P.fold returns its result in the producer's monad; Identity lets us unwrap it purely
      runIdentity $ P.fold groupBy M.empty (M.map toFrame) (P.each frame)
      where
        groupBy m r = M.insertWith (++) (view feature r) [r] m
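To make the Identity version reproducible without Frames, here is a self-contained sketch of the same shape on plain lists, assuming only the pipes and containers packages. `Pipes.Prelude.fold` takes a step function, an initial accumulator, and an extraction function, and returns its result in the producer's base monad, which is probably why the first attempt was read as a doubly nested Map and why `runIdentity` is needed here. The names `groupByKeyP`, `sampleRowsP`, and `groupedP` are hypothetical, and `M.map reverse` merely sits in the extraction slot where `M.map toFrame` goes in my real code.

```haskell
import Data.Functor.Identity (runIdentity)
import qualified Data.Map.Strict as M
import Pipes (each)
import qualified Pipes.Prelude as P

-- Fold a producer of values into a Map of groups, running the pipe in Identity.
-- The third argument to P.fold post-processes the final accumulator; here it
-- reverses each group so rows come out in input order.
groupByKeyP :: Ord k => (v -> k) -> [v] -> M.Map k [v]
groupByKeyP key rows =
  runIdentity $ P.fold step M.empty (M.map reverse) (each rows)
  where
    -- insertWith applies (f new old), so (++) prepends the new row
    step m v = M.insertWith (++) (key v) [v] m

-- Hypothetical sample rows: (id, images flag)
sampleRowsP :: [(Int, Bool)]
sampleRowsP = [(489, False), (376, True), (12, False)]

groupedP :: M.Map Bool [(Int, Bool)]
groupedP = groupByKeyP snd sampleRowsP
```

This still builds each group as a list inside the step function, so it does not address the part I actually want to change; it only isolates the fold so the types are easier to experiment with.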