My dataset contains a column that identifies groups. Using PyCaret, I need to split the data so that rows belonging to the same group are never divided between train and test: each group should be sent as a whole to one of the splits.

A 10-row sample for clarification:

group_id    measure1    measure2    measure3
    1          3455        3425       345
    1          6455         825       945
    1          6444         225       145
    2            23          34       233
    2           623          22       888
    3          3455        3425       345
    3          6155         525       645
    3          6434         325       845
    4            93         345       233
    4           693         222       808

Every unique group_id should be sent in full to one of the splits, like this (using 80/20):

TRAIN SET:
   
 group_id    measure1    measure2    measure3
        1          3455        3425       345
        1          6455         825       945
        1          6444         225       145
        3          3455        3425       345
        3          6155         525       645
        3          6434         325       845
        4            93         345       233
        4           693         222       808

TEST SET:

 group_id    measure1    measure2    measure3
        2            23          34       233
        2           623          22       888
Forge

2 Answers

1

You can try the following option of setup, per the documentation:

https://pycaret.readthedocs.io/en/latest/api/classification.html

fold_strategy = "groupkfold"
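
Note that fold_strategy controls only the cross-validation folds, not the initial train/test split, and the grouping column still has to be supplied separately. A minimal sketch of one way to get a group-aware hold-out split first, assuming scikit-learn is installed and using its GroupShuffleSplit (a scikit-learn class, not a PyCaret one); the sample frame mirrors the question's data:

```python
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Sample data copied from the question.
df = pd.DataFrame({
    "group_id": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "measure1": [3455, 6455, 6444, 23, 623, 3455, 6155, 6434, 93, 693],
    "measure2": [3425, 825, 225, 34, 22, 3425, 525, 325, 345, 222],
    "measure3": [345, 945, 145, 233, 888, 345, 645, 845, 233, 808],
})

# test_size is a fraction of the *groups*, not of the rows,
# so the row-level ratio is only approximately 80/20.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(df, groups=df["group_id"]))

# Every group_id ends up entirely in one of the two frames.
train = df.iloc[train_idx]
test = df.iloc[test_idx]
```

The two frames could then be handed to PyCaret's setup through its data and test_data parameters, together with fold_strategy="groupkfold" and fold_groups="group_id", so that cross-validation on the training part respects the groups as well.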
Nikhil Gupta
    I know this option, but I'm not sure it is the right one: the grouping variable has to be specified somewhere, otherwise you don't know how the groups are split – Forge Jun 15 '22 at 07:15
-1

One solution could look like this:

import numpy as np
import pandas as pd
from itertools import combinations


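def is_possible_sum(numbers, n):
    """Return the first combination of `numbers` that sums to `n`; raise if none exists."""
    numbers = list(numbers)
    # Brute force: try every combination size, smallest first.
    for r in range(1, len(numbers) + 1):
        for combo in combinations(numbers, r):
            if sum(combo) == n:
                return combo
    raise ArithmeticError(f'Desired split not possible: no set of groups sums to {n} rows')


def train_test_split(table: pd.DataFrame, train_fraction: float, col_identifier: str):
    train_ids = []
    # Map each group ID to its number of rows.
    occurrences = table[col_identifier].value_counts().to_dict()
    # Round to a whole row count so floating-point error in the product cannot break the equality test.
    required = round(sum(occurrences.values()) * train_fraction)
    lengths = is_possible_sum(occurrences.values(), required)
    # Translate each chosen group size back into one not-yet-used group ID.
    for i in lengths:
        for key, value in occurrences.items():
            if value == i:
                train_ids.append(key)
                del occurrences[key]    # prevents the same ID from being selected twice
                break                   # exit at once: the dict was just modified mid-iteration
    in_train = table[col_identifier].isin(train_ids)
    return table[in_train], table[~in_train]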


if __name__ == '__main__':
    df = pd.DataFrame()
    df['Group_ID'] = np.array([1, 1, 1, 2, 2, 3, 3, 3, 4, 4])
    df['Measurement'] = np.random.random(10)
    train_part, test_part = train_test_split(df, 0.8, 'Group_ID')

Some remarks:
This is probably the least elegant way to do it. It relies on several nested for loops, and because it tries every combination of group sizes, the search grows exponentially with the number of groups, so it will be slow for data with many groups. It also doesn't randomize the split. Much of the complexity comes from the fact that the dictionary mapping each group_id to its row count can't simply be inverted, since several groups may have the same count. You could probably do this with numpy arrays as well, but I doubt that the overall structure would be much different.
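
For the missing randomization, one option is a shuffled greedy pass over the groups. The sketch below is an alternative to the exhaustive search, not the code from this answer: random_group_split and its seed parameter are hypothetical names, and the train set can come out smaller than the requested fraction when the groups picked do not add up exactly.

```python
import numpy as np
import pandas as pd


def random_group_split(table, train_fraction, col_identifier, seed=None):
    """Hypothetical helper: randomly assign whole groups to train until the row budget is used."""
    rng = np.random.default_rng(seed)
    counts = table[col_identifier].value_counts()      # rows per group ID
    shuffled_ids = rng.permutation(counts.index.to_numpy())
    budget = train_fraction * len(table)                # maximum number of train rows

    train_ids, used = [], 0
    for group_id in shuffled_ids:
        size = counts.loc[group_id]
        if used + size <= budget:                       # take the group only if it still fits
            train_ids.append(group_id)
            used += size

    in_train = table[col_identifier].isin(train_ids)
    return table[in_train], table[~in_train]


df = pd.DataFrame({
    "Group_ID": [1, 1, 1, 2, 2, 3, 3, 3, 4, 4],
    "Measurement": np.arange(10),
})
train_part, test_part = random_group_split(df, 0.8, "Group_ID", seed=0)
```

Changing seed changes which groups land in the train set; a group is never divided between the two parts.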

First function taken from the question "How to check if a sum is possible in array?"

code-lukas