
For context, I work with mixed tabular data. I have complex data pipelines that I’d like to make sure work on any configuration of data.

I see the pandas add-on/extra and have some questions related to that.

  1. How would I generate one-hot columns with this package? Right now I’m just creating a column of integers between (0, nclasses-1) and then one hot encoding after, but it adds up to have to do that every time.

  2. How would I generate longitudinal data with this package? Say I want a multi index and then to generate a bunch of data for that?

  3. Can I control the missingness more precisely? For example, integer strategy doesn’t allow missingness. How would that also factor into multi-categorical data? Or should I just do it myself later.

Edit to add: 4. I would also be interested in generating a mix of columns, i.e. not always having every column present.

For example, this is what I have right now for data that mixes continuous, binary, and multicategorical features and then one-hot encodes the latter.

import unittest

import pandas as pd
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import data_frames, column

class TestTransforms(unittest.TestCase):
    @given(
        data_frames(
            columns=[
                # create continuous var
                column("ctn", dtype=float),
                # create binary var
                column("bin", elements=st.integers(0, 1)),
                # create multicategorical (numerically encoded) var
                column("mult", elements=st.integers(0, 2)),
            ]
        )
    )
    def test_hypothesis(self, df):
        # one-hot encode the multicategorical column
        df = pd.concat(
            [
                df.drop(["mult"], axis=1),
                pd.get_dummies(df["mult"], prefix="mult"),
            ],
            axis=1,
        )

if __name__ == "__main__":
    unittest.main()

Final edit: Here is the final version that works for me as I wanted it to!

import unittest
from typing import Callable

import numpy as np
import pandas as pd
from hypothesis import given, strategies as st
from hypothesis.extra.pandas import data_frames, column


def onehot_multicategorical_column(
    prefix: str,
) -> Callable[[pd.DataFrame], pd.DataFrame]:
    def integrate_onehots(df: pd.DataFrame) -> pd.DataFrame:
        if df[prefix].empty:
            return df
        dummies = pd.get_dummies(df, columns=[prefix], prefix=prefix, dummy_na=True)
        # Retain nans
        dummies.loc[
            dummies[f"{prefix}_nan"].astype(bool),
            dummies.columns.str.startswith(prefix),
        ] = np.nan
        return dummies.drop(f"{prefix}_nan", axis=1)

    return integrate_onehots


def unpack_tuples(nested_tuples):
    """
    We receive a List[Tuple[int, List[int]]].
    The first int is the numerical id, and the second is the "time point".
    We want to flatten this into a List[Tuple[int, int]] with the same
    id for multiple time points.
    E.g. [(0,[0,1,2]), (1,[0,2])] => [(0,0), (0,1), (0,2), (1,0), (1,2)]
    """
    return [
        (pt_id, time_pt) for pt_id, time_pts in nested_tuples for time_pt in time_pts
    ]

class TestTransforms(unittest.TestCase):
    @given(
        data_frames(
            columns=[
                column("ctn", dtype=float),
                column("bin", elements=st.one_of(st.none(), st.integers(0, 1))),
                column(
                    "mult", elements=st.one_of(st.none(), st.sampled_from([0, 1, 2]))
                ),
            ],
            index=st.builds(
                pd.MultiIndex.from_tuples,
                st.lists(
                    st.tuples(
                        st.integers(0), st.lists(st.integers(0), min_size=1, max_size=5)
                    ),
                    min_size=2,
                ).map(unpack_tuples),
            ),
        ).map(onehot_multicategorical_column("mult"))
    )
    def test_hypothesis(self, df):
        ...  # test stuff with df


if __name__ == "__main__":
    unittest.main()
davzaman
  • Confused which "pandas add-on" you're referencing. Could you be more specific? And please supply some test data, what the input looks like, what you want the expected output to look like, and any code you have tried. – Ian Thompson Aug 18 '22 at 18:23
  • Hi, thanks for the clarification questions. There is no test data because hypothesis is used to generate test data for you, I will provide whatever I have now as reference. – davzaman Aug 18 '22 at 18:37
  • Please make this a fully reproducible example. I get what you mean, but someone who hasn't used the hypothesis package... wouldn't. Definitely wouldn't. – Dominik Stańczak Aug 18 '22 at 18:53
  • Fair enough, I was asked to post this here by one of the developers so I inadvertently wrote it as if it was directed to them. Not used to a wider audience for a pointed question about particular features but it's a good reminder. Thanks. – davzaman Aug 18 '22 at 19:15

1 Answer


How would I generate one-hot columns with this package? Right now I’m just creating a column of integers between (0, nclasses-1) and then one hot encoding after, but it adds up to have to do that every time.

That, or something equivalent like sampled_from(column_names), is exactly how I'd do it. A helper function and a .map(categories_to_one_hot_columns) call should make this reasonably easy.
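As a minimal sketch of that helper-plus-.map approach (the column name "mult", the category labels, and the helper name are just placeholders):

```python
import pandas as pd
from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames, column

CATEGORIES = [0, 1, 2]  # hypothetical class labels

def categories_to_one_hot_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Replace the integer-coded 'mult' column with one-hot columns."""
    return pd.concat(
        [df.drop(columns=["mult"]), pd.get_dummies(df["mult"], prefix="mult")],
        axis=1,
    )

# Generate the integer-coded frame, then map it into one-hot form,
# so every generated example arrives already encoded.
one_hot_frames = data_frames(
    columns=[column("mult", elements=st.sampled_from(CATEGORIES))]
).map(categories_to_one_hot_columns)
```

The .map runs once per generated example, so the test body receives the encoded frame directly.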

How would I generate longitudinal data with this package? Say I want a multi index and then to generate a bunch of data for that?

The pdst.series() and pdst.data_frames() strategies both accept an index= argument, which you could define as e.g.

index = st.builds(
    pd.MultiIndex.from_tuples,
    st.lists(st.tuples(...), min_size=1, max_size=10)
)
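Filling in the tuple strategy, one runnable sketch (the two levels and their bounds are illustrative, not part of any API):

```python
import pandas as pd
from hypothesis import strategies as st

# A two-level (id, time) index; element strategies chosen arbitrarily.
# unique=True avoids duplicate (id, time) pairs.
multi_index = st.builds(
    pd.MultiIndex.from_tuples,
    st.lists(
        st.tuples(st.integers(0, 9), st.integers(0, 4)),
        min_size=1,
        max_size=10,
        unique=True,
    ),
)
```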

Can I control the missingness more precisely? For example, integer strategy doesn’t allow missingness. How would that also factor into multi-categorical data? Or should I just do it myself later.

I'd use st.none() | st.integers() for missingness; more generally, st.one_of(...) can be used to mix strategies together.
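As a sketch, mixing None into a column's elements looks like this (the column name is a placeholder; pandas surfaces the Nones as missing values, typically with the column ending up as object or float dtype):

```python
import pandas as pd
from hypothesis import strategies as st
from hypothesis.extra.pandas import data_frames, column

# Optional integers: None is drawn alongside 0/1 and shows up as missing.
maybe_bin = st.none() | st.integers(0, 1)

nullable_frames = data_frames(
    columns=[column("bin", elements=maybe_bin)]
)
```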

Zac Hatfield-Dodds
  • Thanks for the super helpful answer! So for longitudinal data, I'm trying to figure out how to sample the first part of the tuple [0, 5] times for example, and the second part can be unique. The idea is I want something like [(1,1), (1,2), (1,3), (2,1)]. I see the flatmap example and can understand how it might apply but then it looks clunky to loop that in a sense. I'm assuming there's a more elegant solution that I'm not seeing. – davzaman Aug 22 '22 at 19:35
  • Maybe something like `st.tuples(first_part, st.lists(second_part, max_size=5)).map(lambda t: [(t[0], x) for x in t[1]])`? – Zac Hatfield-Dodds Aug 27 '22 at 22:50
  • With a little finagling this line of thought helped a bunch, I'll put the final thing in an edit. Thanks again! – davzaman Aug 31 '22 at 21:18