
I would like to use multinomial logistic regression to get a win probability for each of the 5 horses in any given race, using each horse's previous average speed.

RACE_ID    H1_SPEED     H2_SPEED    H3_SPEED    H4_SPEED    H5_SPEED    WINNING_HORSE
1          40.482081    44.199627   42.034929   39.004813   43.830139   5
2          39.482081    42.199627   41.034929   41.004813   40.830139   4

I am stuck on how to handle the independent variables, because any of the 5 horses' average speeds can be placed in any of the columns H1_SPEED through H5_SPEED.

For each race I can put any of the 5 horses under H1_SPEED, so there is no real relationship between H1_SPEED in RACE_ID 1 and H1_SPEED in RACE_ID 2 other than the arbitrary position I selected.

Would it make any difference if the dataset looked like this instead?

  • For RACE_ID 1, I swapped H3_SPEED and H5_SPEED and changed WINNING_HORSE from 5 to 3
  • For RACE_ID 2, I swapped H4_SPEED and H1_SPEED and changed WINNING_HORSE from 4 to 1

RACE_ID    H1_SPEED     H2_SPEED    H3_SPEED    H4_SPEED    H5_SPEED    WINNING_HORSE
1          40.482081    44.199627   43.830139   39.004813   42.034929   3
2          41.004813    42.199627   41.034929   39.482081   40.830139   1
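
As a quick check, the two swaps above can be sketched in plain Python. The helper name `swap_slots` and the dictionary layout are assumptions made for illustration only; the numbers are copied from the first table. The sketch confirms that both layouts describe the same races: each race keeps the same set of speeds, and the same speed wins.

```python
# Rows copied from the first table; "speeds" holds H1_SPEED..H5_SPEED in order.
races = [
    {"RACE_ID": 1,
     "speeds": [40.482081, 44.199627, 42.034929, 39.004813, 43.830139],
     "WINNING_HORSE": 5},
    {"RACE_ID": 2,
     "speeds": [39.482081, 42.199627, 41.034929, 41.004813, 40.830139],
     "WINNING_HORSE": 4},
]


def swap_slots(race, a, b):
    """Return a copy of `race` with slots a and b (1-based) exchanged.

    Hypothetical helper, not part of any library.
    """
    speeds = list(race["speeds"])
    speeds[a - 1], speeds[b - 1] = speeds[b - 1], speeds[a - 1]
    # The winner label has to follow the horse to its new slot.
    relabel = {a: b, b: a}
    winner = relabel.get(race["WINNING_HORSE"], race["WINNING_HORSE"])
    return {"RACE_ID": race["RACE_ID"], "speeds": speeds, "WINNING_HORSE": winner}


# The two swaps described in the bullet points.
swapped = [swap_slots(races[0], 3, 5), swap_slots(races[1], 1, 4)]

for before, after in zip(races, swapped):
    # Same set of speeds in each race, only the column order differs.
    assert sorted(before["speeds"]) == sorted(after["speeds"])
    # The speed of the winning horse is unchanged by the relabelling.
    assert (before["speeds"][before["WINNING_HORSE"] - 1]
            == after["speeds"][after["WINNING_HORSE"] - 1])
    print(after)
```

So the information in each row is identical; only the arbitrary slot assignment changes.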

Is this an issue? If so, how should it be handled? And what if I wanted to add more independent features per horse?

radio23

1 Answer


You cannot rearrange your dataset in that way, because each feature (column) has a meaning and probably depends on the values of the other features. You can picture each row as a point in a six-dimensional space: if you change the value of a feature, the point moves; it does not stay where it was.
If you think a feature is useless for solving your problem (i.e. it is independent of the target), you can drop it or simply not use it when training your model.

Edit

To solve your specific problem, you could add a column next to each speed column that identifies the horse running at that speed. It is a kind of feature engineering that adds more problem-related features to your model.

RACE_ID   H1_SPEED  H1_HORSE   H2_SPEED  H2_HORSE  ... WINNING_HORSE
1         40.482081        1   44.199627        2  ...             5
2         39.482081        3   42.199627        5  ...             4

I invented the number associated with each horse, but it seems this information is available in your dataset.
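
A minimal sketch of how such rows could be assembled, in plain Python. The input structure (`entries` as a list of `(horse_id, speed)` pairs) and the helper `to_wide_row` are assumptions for illustration, and only the two slots shown in the table above are filled in; the elided columns are left out rather than guessed.

```python
# Hypothetical per-race input: one (horse_id, speed) pair per slot.
# Horse IDs are the invented ones from the table above; slots 3-5 are omitted
# because the table elides them.
races = {
    1: {"entries": [(1, 40.482081), (2, 44.199627)], "winner": 5},
    2: {"entries": [(3, 39.482081), (5, 42.199627)], "winner": 4},
}


def to_wide_row(race_id, race):
    """Flatten one race into a row with Hk_SPEED / Hk_HORSE column pairs."""
    row = {"RACE_ID": race_id}
    for slot, (horse_id, speed) in enumerate(race["entries"], start=1):
        row[f"H{slot}_SPEED"] = speed
        row[f"H{slot}_HORSE"] = horse_id
    row["WINNING_HORSE"] = race["winner"]
    return row


rows = [to_wide_row(race_id, race) for race_id, race in races.items()]
for row in rows:
    print(row)
```

Note that the horse IDs are labels rather than quantities, so they would probably need a categorical encoding before being passed to a model.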

Andrea
  • The problem is that the order of the features is interchangeable per sample and each feature (column) has no relationship to that same feature (column) from a different row – radio23 May 26 '22 at 00:36
  • There are 5 features per row and I put them each in an arbitrary column – radio23 May 26 '22 at 00:38
  • Maybe I'm misunderstanding, but it seems you want to address the fact that the speed column is not tied to a particular horse. So why not create a new feature for each speed column that identifies the horse? That way, you take into account both the speed and the horse running at that speed. – Andrea Jun 01 '22 at 07:26
  • Can you share an example of doing this or any links or terms I can search to see how to do this? @Andrea – radio23 Jun 02 '22 at 17:07
  • I've just edited the previous answer, check if it convinces you. – Andrea Jun 05 '22 at 16:01