
Goal: Given a series of binary digits, find where there is a long, dense stretch of 1's (even if a few 0's are mixed in).

Background: I'm programming a Monte Carlo simulation in SAS and have (say) 100k variables, each holding a 0 or 1. I want to see whether there is a cluster somewhere that is pretty dense. I don't think five 1's in a row would be sufficiently dense (00011111010...), but 100 1's in a row (01111...11111) would be great. So I guess I want a localized cluster.

I declare the variables like this, so that n1, n2, etc. are each either 0 or 1:

array var_of_binary{*} n1 - n100000;

Am I asking for a solution (impossible) to SAT-CNF, which is "a classic problem that is known to be NP-complete"? (PS: I don't understand what that is, but I know it's unsolvable and too complex.)

I think making multiple passes, computing the density for window lengths 21, 22, 23, ..., 1000, would work (this is pseudo-code, which I did not try to run):

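static_max_of_1k = 1000;  /* check window lengths up to 1k, perhaps less */
do i = 100 to static_max_of_1k;               /* i = window length */
    do j = 1 to dim(var_of_binary) - i + 1;   /* j = window start; stay inside the array */
        total = 0;
        do k = j to j + i - 1;                /* sum(of ...) cannot take a computed index range */
            total = total + var_of_binary{k};
        end;
        density1 = total / i;
        /* save value of density1, probably in an array */
    end;
end;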
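The passes above recompute each window's sum from scratch. A minimal sketch of the same density check using a running total follows; it is only an illustration, and the 20-variable array, the window length `win`, the made-up test data, and the names `total`, `best_density`, and `best_start` are all assumptions, not part of the original setup.

```sas
/* Sketch only: 20 variables and a window of 5 stand in for 100000 and 100-1000. */
data window_density;
   array var_of_binary{*} n1 - n20;

   /* Hypothetical test data: a dense stretch of 1's in positions 8 through 15 */
   do k = 1 to dim(var_of_binary);
      var_of_binary{k} = (8 <= k <= 15);
   end;

   win = 5;             /* window length (assumed value) */
   total = 0;           /* number of 1's in the current window */
   best_density = 0;
   best_start = .;

   do k = 1 to dim(var_of_binary);
      total = total + var_of_binary{k};                        /* value entering the window */
      if k > win then total = total - var_of_binary{k - win};  /* value leaving the window */
      if k >= win and total / win > best_density then do;
         best_density = total / win;
         best_start = k - win + 1;   /* first position of the densest window so far */
      end;
   end;

   put best_start= best_density=;
   keep win best_start best_density;
run;
```

Each position is added once and subtracted once, so one pass over the array costs the same for a window of 5 as for a window of 1000. Trying several window lengths would still need an outer loop over `win`.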

Note 1: I don't want a C++ solution (unless it works immediately in SAS as a subroutine without alteration).

Note 2: I don't want recursive code, since if it blows up, I wouldn't have a clue to debug it. (I know my limitations.)

Note 3: I guess I'm doing a 1-dimensional variation of "Detect High density pixel areas in a binary image", which is sort of cool and has a nice photo, but (again) is beyond me. I appreciated from afar the metacode of the "SimpleBlobDetector Class Reference". I think I'm in over my head.

Peter_from_NYC
  • You cannot really make a SAS dataset with 100,000 variables and you probably don't want to. Why not think of the problem as 100,000 observations instead? But either way what are your other variables? – Tom Jun 04 '17 at 05:07
  • Do you just want to find the longest run? Or do you really want a frequency table showing that there are say 100 runs of length 5 and 1 run of length 50 in your 100,000 observations? – Tom Jun 04 '17 at 05:12
  • @Tom Thanks for suggesting an efficiency, which may (or may not) be important. Right now, I want my problem to be set up intuitively. The 100k vars are for consecutive time intervals (e.g. 10 seconds each) for one person (or one run of the monte carlo simulation), and each observation would be for another person. – Peter_from_NYC Jun 05 '17 at 02:21
  • @Tom I don't "want to find the longest run". I want localized clusters (runs) of 0's and 1's that have a high concentration of 1's, if that makes any sense to you. – Peter_from_NYC Jun 05 '17 at 02:23

1 Answer


Maybe this will help. Generate the data, then determine the runs and their sizes. _FREQ_ is the size of the run, J is the ID within REP, and the OBS option records the ID of the first observation in the run.

data simulate;
   do rep=1 to 1e1;
      do j = 1 to 1e1;
         y = rand('BINOMIAL',.5,1);
         output;
         end;
      end;
   run;
proc summary data=simulate;
   by rep y notsorted;
   output out=runs(drop=_type_) idgroup(obs out[1](j)=);
   run;
proc print;
   run;
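As a possible next step (a sketch, not part of the approach above), the RUNS data set could be filtered down to runs of 1's that reach some minimum length. The data set name `long_runs` and the threshold of 3 are hypothetical and chosen only to suit the small simulated example.

```sas
/* Sketch only: keep runs of 1's that are at least 3 observations long. */
data long_runs;
   set runs;
   where y = 1 and _freq_ >= 3;   /* _FREQ_ is the run length from PROC SUMMARY */
run;
proc print data=long_runs;
run;
```

This finds only unbroken runs, so a stretch of 1's interrupted by a single 0 shows up as two shorter runs.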


data _null_
  • I'm reading and thinking about your answer. You're a good programmer (gulp). It's not what I want (yet), but it has good elements such as employing a PROC to do the hard work. – Peter_from_NYC Jun 05 '17 at 18:16