
I have training in pure math but not in statistics, computer science, or information theory, so I am a bit lost here and would really appreciate any guidance.

I am looking for some helpful ways to frame a general search approach which would minimize the time complexity of the search.

For example, let's say I am playing a modified version of 20 questions with a friend. The friend has thought of a human, presently alive in the US, and I can ask up to 20 questions to uncover the truth. I want to ask as few questions as possible on average to win the game. We will play this game repeatedly, and I want to develop a strategy that minimizes my average win time (as measured by the number of questions asked).

Sample Space: 329.5 million humans currently alive in the US

Rule: Ask any question. The question can have yes or no answer or even a descriptive answer. So for instance, it is allowed to ask the first name of the person.

Intuitively, it seems to me that asking "Is it Barack Obama?" as a first question is a terrible choice, because it splits the sample space (or search space) into two sets: one with 1 person, namely the former US President, and the other containing the rest of the US population.

Asking "What is their sex (or, old school, gender)?" may be a better question, as it splits the sample space into two sets of roughly equal size.

Instead of asking a binary question, asking an n-ary question is likely better, because it splits the sample space into n sub-spaces of varying sizes, and if the sizes are similar then that's fantastic. For instance, the question could be "What is the first letter of their last name?" There are 26 possible answers, although we know that people in the US are much more likely to have a last name beginning with "J" than with "X".
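To make the "how evenly does a question split the space" intuition concrete, here is a minimal sketch that scores a question by the Shannon entropy (in bits) of the split it induces. It assumes every person is equally likely to be the friend's choice, and the helper name `split_entropy` is made up for illustration. The 26-way split is an idealized, perfectly even one, not real surname data.

```python
import math

def split_entropy(sizes):
    """Shannon entropy (bits) of the answer to a question that splits the
    population into groups of the given sizes, assuming each person is
    equally likely to be the target (an assumption, not a given)."""
    total = sum(sizes)
    return -sum((s / total) * math.log2(s / total) for s in sizes if s > 0)

N = 329_500_000  # population figure from the question

# "Is it Barack Obama?" -> groups of size 1 and N - 1
obama_bits = split_entropy([1, N - 1])

# A perfectly even yes/no question -> two groups of size N / 2
even_bits = split_entropy([N // 2, N - N // 2])

# An idealized 26-way question with equally sized groups
uniform26_bits = split_entropy([1] * 26)

print(obama_bits, even_bits, uniform26_bits)
```

Under these assumptions the lopsided question yields almost no information per answer, the even binary split yields exactly 1 bit, and an even 26-way split yields log2(26) bits. An uneven 26-way split, such as real first letters of last names, would score somewhat lower.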

Of course, I can conceivably ask a 329.5 million-ary question, whereby I'll have the answer in one shot.
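As a rough illustration of how the number of questions depends on n, the sketch below counts the smallest k with n^k at least the population size. This is the worst-case number of questions if every question could split the remaining candidates into n perfectly equal groups, which is an idealization. The function name `questions_needed` is hypothetical.

```python
def questions_needed(population, n):
    """Smallest k such that n**k >= population, i.e. the worst-case number
    of n-ary questions under the idealized assumption that each question
    splits the remaining candidates into n equal groups."""
    if n < 2:
        raise ValueError("need at least 2 possible answers per question")
    k, reach = 0, 1
    while reach < population:
        reach *= n  # one more question multiplies the distinguishable cases by n
        k += 1
    return k

N = 329_500_000  # population figure from the question

for n in (2, 3, 26, N):
    print(n, questions_needed(N, n))
```

Integer arithmetic is used instead of `math.log` to avoid floating-point rounding at exact powers of n.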

My questions for you guys are as follows:

  1. If we fix "n", so that I ask only binary or ternary or fixed-n-ary questions, it seems to me that the efficient approach would be to ask questions which divide the sample space into "n" roughly equal parts, assuming that I am only minimizing time complexity, i.e. the average number of questions I need to ask to get to the solution. How can I prove this? What is the right approach or mathematical framework to prove this?

  2. If we don't fix "n", then what would be a general way to frame this mathematically? Now I have two variables to operate over in order to minimize the time complexity: "n" and "the relative sizes of the subsets into which the answer to an n-ary question splits the sample space". How can I frame this problem mathematically?

  3. Is my intuition even correct? Or are there faster ways to approach this?

  4. What I am describing sounds an awful lot like a Classification Decision Tree in Machine Learning. Is minimizing Entropy the right way to frame my question?

  5. Who would know or think about this type of stuff? Information theorists? Computer Scientists? Statisticians? Probability Theorists? Machine Learning folks? Someone else?

  6. What's the right forum on the internet to get help on this question? Reddit? Some specific stackexchange? Anything else?
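One possible way to make question 1 concrete for n = 2 is sketched below. If any yes/no question about membership in a subset is allowed, a questioning strategy corresponds to a binary prefix code over the candidates, and Huffman's algorithm builds such a code with minimal expected length. The code computes that expected number of questions and compares it with the Shannon entropy. The probabilities are invented toy values and the function names are hypothetical.

```python
import heapq
import math

def huffman_expected_questions(probs):
    """Expected number of yes/no questions under a Huffman strategy.
    Uses the identity: expected depth = sum of the weights of all merged
    (internal) nodes created while building the tree."""
    heap = list(probs)
    heapq.heapify(heap)
    expected = 0.0
    while len(heap) > 1:
        a = heapq.heappop(heap)  # two least likely groups
        b = heapq.heappop(heap)
        expected += a + b        # every member of the merged group costs one more question
        heapq.heappush(heap, a + b)
    return expected

def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Toy distributions (assumed values, not real data)
dyadic = [0.5, 0.25, 0.125, 0.125]
skewed = [0.9, 0.05, 0.05]

for probs in (dyadic, skewed):
    print(entropy(probs), huffman_expected_questions(probs))
```

For the dyadic toy distribution the expected number of questions equals the entropy. In general the Huffman value lies between the entropy and the entropy plus one, which is the sense in which entropy lower-bounds the average number of binary questions.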

Thx

Amatya
  • Information theorists have thought about "this stuff" for a long time. – Scott Hunter Feb 03 '22 at 20:33
  • @ScottHunter Brilliant! Can you please recommend something I can read? If this is a super basic question then presumably a fundamental text book would contain this? Also, if I have to google this, what should I google? As you can tell, I am not from this field and I don't even have the vocabulary to even know what to search. Thanks very much! – Amatya Feb 03 '22 at 20:35
  • @ScottHunter Thanks!! Despite that overload of attitude. – Amatya Feb 04 '22 at 00:35
