I'm trying different algorithms, including a neural network (NN), on a multiclass sequential classification problem.
The dataset is 66 Bach chorales: 12 binary columns of musical notes plus a target variable "chord", which has 102 unique labels.
> dim(d)
[1] 5664 17
> head(d)
c_id event c c_sh d d_sh e f f_sh g g_sh a a_sh b bass meter chord
1 000106b_ 2 YES NO NO NO YES NO NO YES NO NO NO NO E 5 C_M
2 000106b_ 3 YES NO NO NO YES NO NO YES NO NO NO NO E 2 C_M
3 000106b_ 4 YES NO NO NO NO YES NO NO NO YES NO NO F 3 F_M
4 000106b_ 5 YES NO NO NO NO YES NO NO NO YES NO NO F 2 F_M
5 000106b_ 6 NO NO YES NO NO YES NO NO NO YES NO NO D 4 D_m
6 000106b_ 7 NO NO YES NO NO YES NO NO NO YES NO NO D 2 D_m
> levels(e$chord)
[1] " A#d" " A#d7" " A_d" " A_m" " A_M" " A_m4" " A_M4" " A_m6" " A_M6" " A_m7" " A_M7" " Abd" " Abm"
[14] " AbM" " B_d" " B_d7" " B_m" " B_M" " B_M4" " B_m6" " B_m7" " B_M7" " Bbd" " Bbm" " BbM" " Bbm6"
[27] " BbM7" " C#d" " C#d6" " C#d7" " C#m" " C#M" " C#M4" " C#m7" " C#M7" " C_d6" " C_d7" " C_m" " C_M"
[40] " C_M4" " C_m6" " C_M6" " C_m7" " C_M7" " D#d" " D#d6" " D#d7" " D#m" " D#M" " D_d7" " D_m" " D_M"
[53] " D_M4" " D_m6" " D_M6" " D_m7" " D_M7" " Dbd" " Dbd7" " Dbm" " DbM" " Dbm7" " DbM7" " E_d" " E_m"
[66] " E_M" " E_M4" " E_m6" " E_m7" " E_M7" " Ebd" " EbM" " EbM7" " F#d" " F#d7" " F#m" " F#M" " F#M4"
[79] " F#m6" " F#m7" " F#M7" " F_d" " F_d7" " F_m" " F_M" " F_M4" " F_m6" " F_M6" " F_m7" " F_M7" " G#d"
[92] " G#d7" " G#m" " G#M" " G_d" " G_m" " G_M" " G_M4" " G_m6" " G_M6" " G_m7" " G_M7"
> length(unique(e$chord))
[1] 102
> nrows_split_d # number of observations for each class label
[1] 5 4 5 258 352 2 16 10 2 11 56 1 2 37 17 8 217 143 3 2 19 46 5 26 312 6
[27] 3 10 2 15 24 39 2 9 7 2 2 144 488 16 17 6 20 66 7 1 4 2 2 4 165 503
[53] 16 12 3 33 58 2 2 4 21 3 1 6 241 295 14 14 24 43 1 146 1 14 1 143 90 12
[79] 7 19 34 3 1 42 388 14 3 4 7 38 11 6 6 1 3 179 489 8 3 3 18 52
I'm facing issues because some class labels have very few observations. When randomly partitioning the data into train and test sets, I found a work-around that sacrifices some of the randomness: search for a seed under which the training set contains at least one of each of the 102 class labels. Given there are 5664 observations and the training set comprises 70-80% of the data, this was easily achievable. It also seems to work fine whenever the algorithm I use doesn't require matrix input/output.
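Instead of hunting for a lucky seed, I could presumably do a stratified split, i.e. sample within each class so every label is guaranteed to land in the training set. A minimal base-R sketch (the data frame `d` below is a toy stand-in for the real chorale data; on the real data I'd drop that block and use my existing `d`):

```r
# Toy stand-in for the chorale data: three classes, one of them rare.
d <- data.frame(
  chord = factor(c(rep("C_M", 6), rep("G_M", 4), "A_m")),
  x = 1:11
)

set.seed(1)
train_frac <- 0.75
idx_by_class <- split(seq_len(nrow(d)), d$chord)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  # ceiling() puts at least one observation of every class into train;
  # sample.int() avoids R's sample() pitfall when i has length 1.
  take <- ceiling(train_frac * length(i))
  i[sample.int(length(i), take)]
}))
train <- d[train_idx, ]
test  <- d[-train_idx, ]
```

With this, singleton classes always end up in the training set (and consequently can never appear in the test set), which matches the behaviour I was forcing via the seed.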
When I feed this into a NN, the same strategy seemingly works, but when I try to create a confusion matrix I run into problems: the predicted and actual tables have different dimensions and indices, because some unique labels are missing from the test set.
Even when I reduce the size of the training set and sample with replacement, I'm unable to find a seed where both the train and test sets contain at least one of each of the 102 class labels.
A potential solution would be to introduce duplicate observations for the classes with low frequency, but I'm hesitant to do so, since that feels like cheating.
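If I did go this route, my understanding is that the duplication should happen only inside the training set, after the split, so the test set stays untouched and no copies leak across the partition. A minimal upsampling sketch on a toy training frame (`min_n` is an illustrative per-class minimum, not something from my actual pipeline):

```r
# Toy training frame: one rare class ("A_m" appears once).
train  <- data.frame(chord = factor(c("C_M", "C_M", "C_M", "A_m")), x = 1:4)
min_n  <- 3                       # illustrative target minimum per class
counts <- table(train$chord)
rare   <- names(counts)[counts < min_n]
extra  <- lapply(rare, function(cl) {
  rows <- which(train$chord == cl)
  # sample.int() again sidesteps sample()'s length-1 behaviour
  dup  <- rows[sample.int(length(rows), min_n - counts[[cl]], replace = TRUE)]
  train[dup, , drop = FALSE]
})
train_up <- rbind(train, do.call(rbind, extra))
```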
1) Is it "allowed"/good practice to manipulate the seed in order to ensure that the training set contains at least one of each of the unique class labels?
2) Are there any solutions to this problem that don't involve seed manipulation or introducing duplicates?
If I didn't ensure all 102 unique labels appear in the training set, I wouldn't be able to pass the data to most algorithms at all, as errors are raised due to the discrepancy:
> length(unique(test$chord))
[1] 75
> length(unique(train$chord))
[1] 102
> t <- table(predict = pre, actual = test_t)
> length(unique(pre))
[1] 62
> length(unique(test_t))
[1] 85
> nn_accuracy <- sum(diag(t)) / sum(t)
> nn_accuracy
[1] 0.4077125
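For the confusion-matrix mismatch specifically, one thing I could presumably do is coerce both the predictions and the truth to factors over the full label set before calling `table()`, so the matrix is always square regardless of which labels actually occur. Sketch with a toy three-label set (on the real data `all_levels` would be `levels(d$chord)`):

```r
# Force both vectors onto the full label set so table() is square
# even when some labels are absent from predictions or the test set.
all_levels <- c("A_m", "C_M", "G_M")    # levels(d$chord) on the real data
pre    <- factor(c("C_M", "C_M", "G_M"), levels = all_levels)
test_t <- factor(c("C_M", "A_m", "G_M"), levels = all_levels)

conf <- table(predicted = pre, actual = test_t)  # always 3 x 3 here,
                                                 # with an all-zero A_m row
nn_accuracy <- sum(diag(conf)) / sum(conf)       # 2/3 in this toy case
```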