I'm trying to reproduce this GitHub project on Topological Data Analysis (TDA) on my machine.

My steps:

  • get best parameters from a cross-validation output
  • load my dataset (after feature selection)
  • extract topological features from the dataset for prediction
  • create a Random Forest Classifier model built on the best parameters
  • calculate probabilities on test data

Background:

  1. Feature selection

To decide which attributes belong to which group, we created a correlation matrix. It showed two big groups within which player attributes were strongly correlated with each other. We therefore split the attributes into two groups, one summarising a player's attacking characteristics and the other the defensive ones. Finally, since the goalkeeper has completely different statistics from the other players, we decided to take into account only the overall rating. Below are the 24 features used for each player:

Attack: "positioning", "crossing", "finishing", "heading_accuracy", "short_passing", "reactions", "volleys", "dribbling", "curve", "free_kick_accuracy", "acceleration", "sprint_speed", "agility", "penalties", "vision", "shot_power", "long_shots"

Defense: "interceptions", "aggression", "marking", "standing_tackle", "sliding_tackle", "long_passing"

Goalkeeper: "overall_rating"

From this set of features, the next step was to compute, for each non-goalkeeper player, the mean of the attack attributes and the mean of the defensive ones.

Finally, for each team in a given match, we compute the mean and the standard deviation for the attack and the defense from these stats of the team's players, as well as the best attack and best defense.

In this way, a match is described by 14 features, 7 per team (GK overall rating, best attack, std attack, mean attack, best defense, std defense, mean defense), which map the match to a point in feature space according to the characteristics of the two teams.
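
As a rough illustration of the aggregation described above, the sketch below builds the 14 match features from per-player attack and defense means. The function names, the sample numbers, the column order, and the use of the population standard deviation are all assumptions made for the example; the project may compute these differently.

```python
import numpy as np

def team_features(attack, defense, gk_rating):
    """Summarise one team (hypothetical helper).

    attack / defense: one mean score per outfield player.
    gk_rating: the goalkeeper's overall rating.
    """
    attack = np.asarray(attack, dtype=float)
    defense = np.asarray(defense, dtype=float)
    return np.array([
        attack.max(), defense.max(),    # best attack, best defense
        attack.mean(), defense.mean(),  # mean attack, mean defense
        attack.std(), defense.std(),    # std (population, ddof=0: an assumption)
        float(gk_rating),               # goalkeeper overall rating
    ])

def match_features(home, away):
    """Concatenate the 7 home features and the 7 away features."""
    return np.concatenate([team_features(*home), team_features(*away)])

# Made-up numbers, three outfield players per team for brevity
home = ([70, 80, 75], [60, 65, 70], 82)
away = ([68, 72, 90], [55, 75, 80], 79)
features = match_features(home, away)
print(features.shape)
```

Each match ends up as a single vector of length 14, which is the width of the non-topological part of the dataset.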


  2. Feature extraction

The aim of TDA is to catch the structure of the space underlying the data. In our project, we assume that the neighborhood of a data point hides meaningful information that is correlated with the outcome of the match. Thus, we explored the data space looking for this kind of correlation.


Methods:

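# Imports assumed from the calls used below; they are not shown in the original post
import numpy as np
import pandas as pd
from pandas import read_pickle
from openml.datasets import get_dataset
from tqdm import tqdm
from gtda.diagrams import Amplitude
from sklearn.ensemble import RandomForestClassifier

def get_best_params():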
    cv_output = read_pickle('cv_output.pickle')
    best_model_params, top_feat_params, top_model_feat_params, *_ = cv_output

    return top_feat_params, top_model_feat_params

def load_dataset():
    x_y = get_dataset(42188).get_data(dataset_format='array')[0]
    x_train_with_topo = x_y[:, :-1]
    y_train = x_y[:, -1]

    return x_train_with_topo, y_train


def extract_x_test_features(x_train, y_train, players_df, pipeline):
    """Extract the topological features from the test set. This requires also the train set

    Parameters
    ----------
    x_train:
        The x used in the training phase
    y_train:
        The 'y' used in the training phase
    players_df: pd.DataFrame
        The DataFrame containing the matches with all the players, from which to extract the test set
    pipeline: Pipeline
        The Giotto pipeline

    Returns
    -------
    x_test:
        The x_test with the topological features
    """
    x_train_no_topo = x_train[:, :14]
    y_test = np.zeros(len(players_df))  # Artificial y_test for features computation
    print('Y_TEST',y_test.shape)

    x_test_topo = extract_features_for_prediction(x_train_no_topo, y_train, players_df.values, y_test, pipeline)

    return x_test_topo

def extract_topological_features(diagrams):
    metrics = ['bottleneck', 'wasserstein', 'landscape', 'betti', 'heat']
    new_features = []
    for metric in metrics:
        amplitude = Amplitude(metric=metric)
        new_features.append(amplitude.fit_transform(diagrams))
    new_features = np.concatenate(new_features, axis=1)
    return new_features

def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
    shift = 10
    top_features = []
    all_x_train = x_train
    all_y_train = y_train
    # Process the test set in batches of `shift` rows; the last batch may be shorter
    for i in tqdm(range(0, len(x_test), shift)):
        if i+shift > len(x_test):
            shift = len(x_test) - i
        batch = np.concatenate([all_x_train, x_test[i: i + shift]])
        batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])
        diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)
        new_features_batch = extract_topological_features(diagrams_batch[-shift:])
        top_features.append(new_features_batch)
        all_x_train = np.concatenate([all_x_train, batch[-shift:]])
        all_y_train = np.concatenate([all_y_train, batch_y[-shift:]])
    final_x_test = np.concatenate([x_test, np.concatenate(top_features, axis=0)], axis=1)
    return final_x_test

def get_probabilities(model, x_test, team_ids):
    """Get the probabilities on the outcome of the matches contained in the test set

    Parameters
    ----------
    model:
        The model (must have the 'predict_proba' function)
    x_test:
        The test set
    team_ids: pd.DataFrame
        The DataFrame containing, for each match in the test set, the ids of the two teams
    Returns
    -------
    probabilities:
        The probabilities for each match in the test set
    """
    prob_pred = model.predict_proba(x_test)
    prob_match_df = pd.DataFrame(data=prob_pred, columns=['away_team_prob', 'draw_prob', 'home_team_prob'])
    prob_match_df = pd.concat([team_ids.reset_index(drop=True), prob_match_df], axis=1)
    return prob_match_df

Working code:

best_pipeline_params, best_model_feat_params = get_best_params()

# 'best_pipeline_params' -> {'k_min': 50, 'k_max': 175, 'dist_percentage': 0.1}
# best_model_feat_params -> {'n_estimators': 1000, 'max_depth': 10, 'random_state': 52, 'max_features': 0.5}

pipeline = get_pipeline(best_pipeline_params)
# pipeline -> Pipeline(steps=[('extract_point_clouds',
            # SubSpaceExtraction(dist_percentage=0.1, k_max=175, k_min=50)),
            #('create_diagrams', VietorisRipsPersistence(n_jobs=-1))])

x_train, y_train = load_dataset()

# x_train.shape ->  (2565, 19)
# y_train.shape -> (2565,)

x_test = extract_x_test_features(x_train, y_train, new_players_df_stats, pipeline)

# x_test.shape -> (380, 24)

rf_model = RandomForestClassifier(**best_model_feat_params)
rf_model.fit(x_train, y_train)
matches_probabilities = get_probabilities(rf_model, x_test, team_ids)  # <-- breaks here
matches_probabilities.head()
compute_final_standings(matches_probabilities, 'premier league')

But I'm getting the error:

ValueError: X has 24 features, but DecisionTreeClassifier is expecting 19 features as input.

Loaded dataset (X_train):

Data columns (total 20 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   home_best_attack    2565 non-null   float64
 1   home_best_defense   2565 non-null   float64
 2   home_avg_attack     2565 non-null   float64
 3   home_avg_defense    2565 non-null   float64
 4   home_std_attack     2565 non-null   float64
 5   home_std_defense    2565 non-null   float64
 6   gk_home_player_1    2565 non-null   float64
 7   away_avg_attack     2565 non-null   float64
 8   away_avg_defense    2565 non-null   float64
 9   away_std_attack     2565 non-null   float64
 10  away_std_defense    2565 non-null   float64
 11  away_best_attack    2565 non-null   float64
 12  away_best_defense   2565 non-null   float64
 13  gk_away_player_1    2565 non-null   float64
 14  bottleneck_metric   2565 non-null   float64
 15  wasserstein_metric  2565 non-null   float64
 16  landscape_metric    2565 non-null   float64
 17  betti_metric        2565 non-null   float64
 18  heat_metric         2565 non-null   float64
 19  label               2565 non-null   float64

Please note that the first 14 columns are the features that describe the match, and that the 5 remaining features (minus label) are the topological ones, that are already extracted.

The problem seems to occur when the code reaches extract_x_test_features() and extract_features_for_prediction(), which should compute the topological features and stack them onto the test dataset.

Since X_train already has topological features, it adds 5 more and so I end up with 24 features.

I'm not sure, though. I'm just trying to wrap my head around this project and how prediction is being made here.
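
One way to check this hypothesis is to count how many columns of each matrix are match features and how many are topological. The sketch below uses zero-filled arrays with the shapes reported above; `describe_columns` and `N_BASE` are hypothetical names introduced only for this example.

```python
import numpy as np

N_BASE = 14  # match-description columns, per the column listing above

def describe_columns(x, n_base=N_BASE):
    """Split the width of a feature matrix into base and topological counts."""
    total = x.shape[1]
    return {
        "total": total,
        "base": min(n_base, total),
        "topological": max(total - n_base, 0),
    }

# Placeholder arrays with the shapes reported in the question
x_train = np.zeros((2565, 19))
x_test = np.zeros((380, 24))

train_info = describe_columns(x_train)
test_info = describe_columns(x_test)
print(train_info)
print(test_info)
```

Note that in the posted code np.concatenate stacks x_train[:, :14] with slices of x_test along the rows, which only works if both have 14 columns. Under that reading, the appended topological block would be 10 columns wide rather than 5, so it may be worth printing the shape returned by extract_topological_features() to confirm where the extra columns come from.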


How do I fix the mismatch using the code above?


NOTES:

1- x_train and y_test are not dataframes but numpy.ndarray

2 - This question is completely reproducible if one clones or downloads the project from the following link:

Github Link

Zaid Aly
8-Bit Borges

4 Answers

Returning a slice with 19 features here:

def extract_features_for_prediction(x_train, y_train, x_test, y_test, pipeline):
   (...)
   return final_x_test[:, :19]

Got rid of the error and ran the test.


I still don't get the gist of it, though.

I will grant the bounty to anyone who explains to me the idea behind the test set in the context of this project, as shown in the project notebook, which can be found here:

Project Notebook

8-Bit Borges

The answer is actually given in the question already.

You mentioned in your question that x_test.shape -> (380, 24) and x_train.shape -> (2565, 19). The shape of your test data does not match that of your train data: the train data has 19 features, whereas the test data has 24 (they must contain the same number of features). That is why you get the error "X has 24 features, but DecisionTreeClassifier is expecting 19 features as input" when you pass x_test to your model in this line: get_probabilities(rf_model, x_test, team_ids).

So, your test data must have 19 features, just like your train data.

Khalid Saifullah
  • Yes, that is obvious, but how do I fix it? – 8-Bit Borges Jan 16 '21 at 01:39
  • There is definitely a mismatch in the data. How to fix it depends on why it is different. You need to look into your training and test data and see where the discrepancy is coming from. I would guess that the test set has some unneeded columns, and that simply removing them so that the shape is correct will solve your issue. – Jason Chia Jan 22 '21 at 13:41

In your x_train you have 19 features, whereas in x_test you have 24. Why is that?

To solve it, inspect both arrays (x_train and x_test) and try to find out why they have different features. In the end, both must have the same number of columns and the same features in the same order. If not, you will get this error.

It is probably an error in the dataset you imported.

Alex Serra Marrugat

batch = np.concatenate([all_x_train, x_test[i: i + shift]])
batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])
diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)

Here, you are using fit_transform on both all_x_train and x_test. What you should do is fit_transform on all_x_train and only transform on x_test.

The reason is that when you use fit_transform on both training and test data, overfitting occurs. Hence, try:

batch_y = np.concatenate([all_y_train, y_test[i: i + shift].reshape((-1,))])

if is_train:
    diagrams_batch, _ = pipeline.fit_transform_resample(batch, batch_y)
else:
    diagrams_batch, _ = pipeline.transform_resample(batch, batch_y)
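
The general fit/transform split can be sketched with a toy transformer. `MeanCenterer` is a hypothetical class written only for this illustration; it is not part of Giotto or scikit-learn, and whether the project's pipeline can be applied to the test rows without refitting is not established in this thread.

```python
import numpy as np

class MeanCenterer:
    """Toy transformer: learns column means in fit and subtracts them in transform."""

    def fit(self, x):
        self.mean_ = x.mean(axis=0)
        return self

    def transform(self, x):
        return x - self.mean_

    def fit_transform(self, x):
        return self.fit(x).transform(x)

x_train = np.array([[1.0, 10.0], [3.0, 30.0]])
x_test = np.array([[5.0, 50.0]])

centerer = MeanCenterer()
train_out = centerer.fit_transform(x_train)  # statistics learned from train only
test_out = centerer.transform(x_test)        # reuses the train statistics
print(test_out)
```

The test rows are transformed with statistics learned from the training rows alone, so nothing about the test set influences the fitted state.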
Sulgana