I have some input data like this.

| unique ID | Q1 | Q2 | Q3 |
|-----------|----|----|----|
| 1         | 1  | 1  | 2  |
| 2         | 1  | 1  | 2  |
| 3         | 1  | 0  | 3  |
| 4         | 2  | 0  | 1  |
| 5         | 3  | 1  | 2  |
| 6         | 4  | 1  | 3  |

My goal is to select a subset of rows that satisfies all of the following conditions:

  1. total count: 4
  2. Q1=1 count: 2
  3. Q1=2 count: 1
  4. Q2=1 count: 1 to 3
  5. Q3=1 count: 1

In this case, the data sets with IDs [1, 2, 4, 5] and [2, 3, 4, 5] are both acceptable answers.
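To make the conditions concrete, here is a minimal sketch of a checker for a candidate selection. The names `ROWS`, `CONDITIONS`, `TOTAL`, and `is_valid` are illustrative, not from my actual code; each condition is encoded as (column index, value, minimum count, maximum count).

```python
# Rows from the example table: unique ID -> (Q1, Q2, Q3)
ROWS = {
    1: (1, 1, 2),
    2: (1, 1, 2),
    3: (1, 0, 3),
    4: (2, 0, 1),
    5: (3, 1, 2),
    6: (4, 1, 3),
}

# Each condition: (column index, required value, min count, max count)
CONDITIONS = [
    (0, 1, 2, 2),  # Q1=1 count: 2
    (0, 2, 1, 1),  # Q1=2 count: 1
    (1, 1, 1, 3),  # Q2=1 count: 1 to 3
    (2, 1, 1, 1),  # Q3=1 count: 1
]
TOTAL = 4  # total count: 4


def is_valid(ids):
    """Return True if the selected IDs satisfy every count condition."""
    # The selection must contain exactly TOTAL distinct rows.
    if len(ids) != TOTAL or len(set(ids)) != TOTAL:
        return False
    for col, value, lo, hi in CONDITIONS:
        count = sum(1 for i in ids if ROWS[i][col] == value)
        if not lo <= count <= hi:
            return False
    return True


print(is_valid([1, 2, 4, 5]))  # True
print(is_valid([2, 3, 4, 5]))  # True
print(is_valid([1, 2, 3, 4]))  # False: three rows have Q1=1
```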

In reality, I may have 6000+ rows of data and up to 12 count constraints like the ones above. Each count may vary from 1 to 50. I've written a solution that first groups all IDs by condition, then uses depth-first search to exhaustively try all possible combinations across the groups. (I believe this is a brute-force solution...) However, I always run out of memory and time before I find a valid answer.

My questions are:

  1. What is the lowest possible time complexity for this problem? (I believe it is a kind of subset sum problem, but I am not sure.)
  2. How can I solve this problem without brute force? I'm considering dynamic programming or a decision tree, but I suspect I would run out of memory with either of them. Alternatively, can I solve it using each data row's probabilities/entropy? (I would appreciate more details on this.)

My brute-force code is not worth reading, so I'll skip posting it.

cindy50633
  • You can find one solution by posing a linear programming problem, for example by using `pulp` with Python. You will have `n` binary variables indicating if a row should be included, and the constraints should be possible to define given the values in the dataframe. – hilberts_drinking_problem Apr 18 '22 at 20:16
  • @hilberts_drinking_problem Thanks for your recommendation! I found out that this is indeed an integer programming problem. I've already found an answer to approximate the result. I will try to answer this question once I have time haha – cindy50633 Apr 25 '22 at 09:32
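
The formulation suggested in the comments could look roughly like the sketch below: one binary variable per row, one equality constraint for the total count, and a pair of bound constraints per condition. This assumes `pulp` is installed together with its bundled CBC solver; `ROWS`, `CONDITIONS`, and `select_rows` are hypothetical names for illustration, and the objective is only a placeholder because any feasible selection is acceptable.

```python
import pulp

# Rows from the example table: unique ID -> (Q1, Q2, Q3)
ROWS = {
    1: (1, 1, 2),
    2: (1, 1, 2),
    3: (1, 0, 3),
    4: (2, 0, 1),
    5: (3, 1, 2),
    6: (4, 1, 3),
}

# Each condition: (column index, required value, min count, max count)
CONDITIONS = [
    (0, 1, 2, 2),  # Q1=1 count: 2
    (0, 2, 1, 1),  # Q1=2 count: 1
    (1, 1, 1, 3),  # Q2=1 count: 1 to 3
    (2, 1, 1, 1),  # Q3=1 count: 1
]


def select_rows(rows, conditions, total):
    """Return a sorted list of IDs meeting all conditions, or None if infeasible."""
    prob = pulp.LpProblem("row_selection", pulp.LpMinimize)

    # x[i] = 1 if row i is included in the selection, 0 otherwise.
    x = {i: pulp.LpVariable(f"x_{i}", cat="Binary") for i in rows}

    # Placeholder objective: it is constant under the total-count constraint,
    # so the solver effectively only searches for a feasible point.
    prob += pulp.lpSum(x.values())

    # Total count constraint.
    prob += pulp.lpSum(x.values()) == total

    # One lower and one upper bound per count condition.
    for col, value, lo, hi in conditions:
        matching = pulp.lpSum(x[i] for i in rows if rows[i][col] == value)
        prob += matching >= lo
        prob += matching <= hi

    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    if pulp.LpStatus[prob.status] != "Optimal":
        return None
    return sorted(i for i in rows if x[i].value() > 0.5)


print(select_rows(ROWS, CONDITIONS, total=4))
```

Which valid selection comes back depends on the solver. Solve time on 6000+ rows is not something this small example can demonstrate.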
