
I can't decide how to balance my dataset of "distress situations", since the true rate of distress isn't something I can measure the way one could measure, say, "the percentage of rotten apples in a factory".

For now, I've chosen a 50%-50% split between distress voice snippets and random non-distress snippets.

I'd be glad for some advice from the community: what are the best practices in this situation? I chose the 50-50 approach to avoid statistical biases, and I'm using a Keras Sequential model.

21kc

1 Answer


If you cannot modify the dataset, try modifying the loss function instead. That said, I think the question is not completely formulated.

Rafa Nogales
  • How would modifying the loss function help avoid statistical biases? As far as I understand, the dataset should correspond to real-world statistics: if 1 of every 4 calls to 911 is an actual emergency, then my dataset should hold the 1:4 ratio. But here I don't know the real-world ratio, so what do you do with your dataset in such a scenario? Keep it at a 1:1 ratio? I hope it's formulated better now. – 21kc Sep 28 '19 at 13:18
  • OK, now I understand your question better. In the 911 case, imagine what happens if you tag an actual emergency as "not a real emergency": this is much worse than tagging a false emergency as an "actual emergency". This is why I'm talking about modifying the loss function. If you set a higher penalty for missing an actual emergency, your algorithm will focus on not failing in that situation, even if the dataset is unbalanced at 1:4. – Rafa Nogales Oct 12 '19 at 11:26
  • And how can I modify the loss function (I'm currently using Keras's binary cross-entropy) so that it focuses on not failing to recognise an emergency? I've searched on Stack Overflow but couldn't find an answer. – 21kc Oct 14 '19 at 07:58
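
One way to read the "higher penalty" idea from the comments is a weighted binary cross-entropy, where errors on the emergency class count more than errors on the other class. The NumPy sketch below only illustrates the arithmetic; the function name `weighted_binary_crossentropy` and the weight of 4.0 are assumptions chosen for illustration, not values taken from the thread.

```python
import numpy as np

def weighted_binary_crossentropy(y_true, y_pred, pos_weight=4.0, eps=1e-7):
    """Mean binary cross-entropy with positive (emergency) samples
    weighted pos_weight times more than negative ones.

    pos_weight=4.0 is an illustrative assumption, not a recommended value.
    """
    y_true = np.asarray(y_true, dtype=float)
    # Clip predictions so that log() never sees exactly 0 or 1.
    y_pred = np.clip(np.asarray(y_pred, dtype=float), eps, 1.0 - eps)
    per_sample = -(pos_weight * y_true * np.log(y_pred)
                   + (1.0 - y_true) * np.log(1.0 - y_pred))
    return float(per_sample.mean())

# A missed emergency (label 1, predicted 0.1) versus an equally
# confident false alarm (label 0, predicted 0.9).
missed_emergency = weighted_binary_crossentropy([1], [0.1])
false_alarm = weighted_binary_crossentropy([0], [0.9])
print(missed_emergency > false_alarm)
```

With these numbers the missed emergency is penalised about four times as heavily as the false alarm. In Keras, a comparable effect can usually be had without a custom loss by keeping binary cross-entropy and passing `class_weight={0: 1.0, 1: 4.0}` to `model.fit`; the right weight depends on how costly a missed emergency is relative to a false alarm, so it is worth tuning against recall on a validation set.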