I'm trying different algorithms, including a neural network (NN), on a multiclass sequential classification problem.
The dataset is 66 Bach chorales: 12 binary columns of musical notes plus a target variable "chord", which has 102 unique labels.
> dim(d)
[1] 5664 17
> head(d)
c_id event c c_sh d d_sh e f f_sh g g_sh a a_sh b bass meter chord
1 000106b_ 2 YES NO NO NO YES NO NO YES NO NO NO NO E 5 C_M
2 000106b_ 3 YES NO NO NO YES NO NO YES NO NO NO NO E 2 C_M
3 000106b_ 4 YES NO NO NO NO YES NO NO NO YES NO NO F 3 F_M
4 000106b_ 5 YES NO NO NO NO YES NO NO NO YES NO NO F 2 F_M
5 000106b_ 6 NO NO YES NO NO YES NO NO NO YES NO NO D 4 D_m
6 000106b_ 7 NO NO YES NO NO YES NO NO NO YES NO NO D 2 D_m
> levels(e$chord)
[1] " A#d" " A#d7" " A_d" " A_m" " A_M" " A_m4" " A_M4" " A_m6" " A_M6" " A_m7" " A_M7" " Abd" " Abm"
[14] " AbM" " B_d" " B_d7" " B_m" " B_M" " B_M4" " B_m6" " B_m7" " B_M7" " Bbd" " Bbm" " BbM" " Bbm6"
[27] " BbM7" " C#d" " C#d6" " C#d7" " C#m" " C#M" " C#M4" " C#m7" " C#M7" " C_d6" " C_d7" " C_m" " C_M"
[40] " C_M4" " C_m6" " C_M6" " C_m7" " C_M7" " D#d" " D#d6" " D#d7" " D#m" " D#M" " D_d7" " D_m" " D_M"
[53] " D_M4" " D_m6" " D_M6" " D_m7" " D_M7" " Dbd" " Dbd7" " Dbm" " DbM" " Dbm7" " DbM7" " E_d" " E_m"
[66] " E_M" " E_M4" " E_m6" " E_m7" " E_M7" " Ebd" " EbM" " EbM7" " F#d" " F#d7" " F#m" " F#M" " F#M4"
[79] " F#m6" " F#m7" " F#M7" " F_d" " F_d7" " F_m" " F_M" " F_M4" " F_m6" " F_M6" " F_m7" " F_M7" " G#d"
[92] " G#d7" " G#m" " G#M" " G_d" " G_m" " G_M" " G_M4" " G_m6" " G_M6" " G_m7" " G_M7"
> length(unique(e$chord))
[1] 102
> nrows_split_d # number of observations for each class label
[1] 5 4 5 258 352 2 16 10 2 11 56 1 2 37 17 8 217 143 3 2 19 46 5 26 312 6
[27] 3 10 2 15 24 39 2 9 7 2 2 144 488 16 17 6 20 66 7 1 4 2 2 4 165 503
[53] 16 12 3 33 58 2 2 4 21 3 1 6 241 295 14 14 24 43 1 146 1 14 1 143 90 12
[79] 7 19 34 3 1 42 388 14 3 4 7 38 11 6 6 1 3 179 489 8 3 3 18 52
I'm facing issues because some class labels have very few observations. When randomly partitioning the data into train and test sets, I found a work-around that sacrifices some of the randomness: search for a seed under which the training set contains at least one of each of the 102 class labels. Given there are 5664 observations and the training set comprises 70-80% of the data, this was easily achievable. It also seems to work fine whenever the algorithm I use doesn't require matrix input/output.
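Instead of hunting for a lucky seed, I could presumably do a stratified split, i.e. sample within each class so every label is guaranteed to land in the training set. A minimal base-R sketch (the data frame `d` below is a toy stand-in for the real chorale data; on the real data I'd drop that block and use my existing `d`):

```r
# Toy stand-in for the chorale data: three classes, one of them rare.
d <- data.frame(
  chord = factor(c(rep("C_M", 6), rep("G_M", 4), "A_m")),
  x = 1:11
)

set.seed(1)
train_frac <- 0.75
idx_by_class <- split(seq_len(nrow(d)), d$chord)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  # ceiling() puts at least one observation of every class into train;
  # sample.int() avoids R's sample() pitfall when i has length 1.
  take <- ceiling(train_frac * length(i))
  i[sample.int(length(i), take)]
}))
train <- d[train_idx, ]
test  <- d[-train_idx, ]
```

With this, singleton classes always end up in the training set (and consequently can never appear in the test set), which matches the behaviour I was forcing via the seed.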
When I feed this into a NN, the same strategy seemingly works, but when I try to create a confusion matrix I run into problems: the predicted and actual tables have different dimensions and indices, because some unique labels are missing from the test set.
Even when I reduce the size of the training set and sample with replacement, I'm unable to find a seed where both the train and test sets contain at least one of each of the 102 class labels.
A potential solution would be to introduce duplicate observations for the classes with low frequency, but I'm hesitant to do so, since that feels like cheating.
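If I did go this route, my understanding is that the duplication should happen only inside the training set, after the split, so the test set stays untouched and no copies leak across the partition. A minimal upsampling sketch on a toy training frame (`min_n` is an illustrative per-class minimum, not something from my actual pipeline):

```r
# Toy training frame: one rare class ("A_m" appears once).
train  <- data.frame(chord = factor(c("C_M", "C_M", "C_M", "A_m")), x = 1:4)
min_n  <- 3                       # illustrative target minimum per class
counts <- table(train$chord)
rare   <- names(counts)[counts < min_n]
extra  <- lapply(rare, function(cl) {
  rows <- which(train$chord == cl)
  # sample.int() again sidesteps sample()'s length-1 behaviour
  dup  <- rows[sample.int(length(rows), min_n - counts[[cl]], replace = TRUE)]
  train[dup, , drop = FALSE]
})
train_up <- rbind(train, do.call(rbind, extra))
```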
1) Is it "allowed"/good practice to manipulate the seed in order to ensure that the training set contains at least one of each of the unique class labels?
2) Are there any solutions to this problem that don't involve seed manipulation or introducing duplicates?
If I didn't ensure all 102 unique labels appear in the training set, I wouldn't be able to pass the data to most algorithms at all, as errors are raised due to the discrepancy:
> length(unique(test$chord))
[1] 75
> length(unique(train$chord))
[1] 102
> t <- table(predict = pre, actual = test_t)
> length(unique(pre))
[1] 62
> length(unique(test_t))
[1] 85
> nn_accuracy <- sum(diag(t)) / sum(t)
> nn_accuracy
[1] 0.4077125
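For the confusion-matrix mismatch specifically, one thing I could presumably do is coerce both the predictions and the truth to factors over the full label set before calling `table()`, so the matrix is always square regardless of which labels actually occur. Sketch with a toy three-label set (on the real data `all_levels` would be `levels(d$chord)`):

```r
# Force both vectors onto the full label set so table() is square
# even when some labels are absent from predictions or the test set.
all_levels <- c("A_m", "C_M", "G_M")    # levels(d$chord) on the real data
pre    <- factor(c("C_M", "C_M", "G_M"), levels = all_levels)
test_t <- factor(c("C_M", "A_m", "G_M"), levels = all_levels)

conf <- table(predicted = pre, actual = test_t)  # always 3 x 3 here,
                                                 # with an all-zero A_m row
nn_accuracy <- sum(diag(conf)) / sum(conf)       # 2/3 in this toy case
```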