
I'm working with tabular data of roughly one million rows, containing just a single column.

I tried the bootstrap method, i.e., traditional sampling with replacement.

Since the bootstrap simply samples values from the population with replacement, I wrote the code below in a straightforward way.

public static double[] inelegantSampleWithReplacement(double[] someArray, int howMany) {
    double[] result = new double[howMany];
    for (int i = 0; i < howMany; i++) {
        result[i] = someArray[(int) (someArray.length * Math.random())];
    }
    return result;
}

It works, and for the one-million-row data the running time is tolerable: about one minute.
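One variation I could try is reusing a single `java.util.Random` instance instead of calling `Math.random()` in the loop, since `Math.random()` goes through a shared generator and each call also involves a double multiply and cast. A minimal sketch (the class name is mine, and I have not benchmarked it on the full data):

```java
import java.util.Random;

public class FastSampler {
    // Reuse one Random instance and draw an in-bounds index directly
    // with nextInt(n), instead of scaling Math.random().
    public static double[] sampleWithReplacement(double[] someArray, int howMany) {
        Random r = new Random();
        double[] result = new double[howMany];
        for (int i = 0; i < howMany; i++) {
            result[i] = someArray[r.nextInt(someArray.length)];
        }
        return result;
    }
}
```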

I am looking for sampling methods that make the code faster, since I will soon face big data where billions of rows are common.

As you can see, sampling with replacement is a very straightforward method, which is how I wrote the code above. I searched for more sophisticated versions of the bootstrap and found this blog post (http://www.inquidia.com/news-and-info/solution-bootstrapping-big-data-environments-how-sample-replacement-using-sampling). I implemented the approach it describes, but the results were worse than with the code above.

Do you have any good ideas for improving the running time of the bootstrap method above?
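To frame the kind of answer I am hoping for: one direction I considered is a multithreaded variant, where each worker thread draws indices from its own `ThreadLocalRandom` so there is no contention on a shared generator. A sketch I have not benchmarked (the class name is mine):

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

public class ParallelSampler {
    // Each parallel-stream worker uses its thread's own ThreadLocalRandom,
    // so the threads never contend on a shared random-number generator.
    public static double[] sampleWithReplacement(double[] someArray, int howMany) {
        double[] result = new double[howMany];
        IntStream.range(0, howMany).parallel().forEach(i ->
            result[i] = someArray[ThreadLocalRandom.current().nextInt(someArray.length)]
        );
        return result;
    }
}
```

Is this kind of parallel sampling a reasonable way forward, or is there a better-known technique?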

StupidWolf
sclee1
  • Perhaps you should consider using Apache Spark instead of writing your own? – Strelok Mar 23 '17 at 06:52
  • The constraint is that the code must run on a local computer, not a distributed platform. That is why I asked for a better solution. Thanks for your comment. – sclee1 Mar 23 '17 at 06:55
  • What do you want to achieve? Get random data from an array? What for? – Krzysztof Cichocki Mar 23 '17 at 07:13
  • If you don't give additional information about what you need to do with this data, then it is not possible to make it faster. The code you posted doesn't contain anything that could be optimised, so the only solution is to optimise your approach to the problem you're trying to solve. Please give us the big picture. – Krzysztof Cichocki Mar 23 '17 at 07:23
  • @KrzysztofCichocki Sorry for the inaccurate information. What I need to do with this data is to apply the bootstrap method (sampling with replacement). I am wondering whether there is a method more sophisticated than my code in terms of running time, since I am going to handle big data. I edited the wording you pointed out. Thanks. – sclee1 Mar 23 '17 at 07:36
  • If you know what properties you would like to compute from this data, then you can compute them directly in the for loop - eg. average, histogram. There are also multithreaded versions of such algorithms, so you can implement them as well. – Krzysztof Cichocki Mar 23 '17 at 08:03
  • 1
    Do not understand why your code run so long, maybe you measure some additional operations. This code may be faster if you will use `java.util.Random r = new Random();` before for statement, and for get random index in bounds`r.nextInt(someArray.length);` – user1516873 Mar 23 '17 at 09:18
  • Make it multithreaded. eg. If you have 4 processors then have each thread sample a quarter of the target array. – TedTrippin Mar 23 '17 at 12:15
  • You need to first find out what makes it run slow. I suspect it is outside this method. Array access is `O(1)`; running it a million times should still be fast. What is possibly slow is either the memory allocation or the random-number generation, but neither seems an obvious culprit. Prove to us that this logic is really slow. – Adrian Shum Mar 24 '17 at 01:40
  • One possibility is the memory allocation for the input array: for 1 billion doubles, the array costs you at least 8GB of memory. Are you sure you want to store 8GB of data in one consecutive piece of memory? – Adrian Shum Mar 24 '17 at 01:44
  • 1
    I have had a quick try on your code. WIth 1 million input, and getting 1000 samples, it takes around 1ms. Even 1 million input, getting 1 million samples takes only 55ms. It is very possibly that something out of the method is causing the slowness – Adrian Shum Mar 24 '17 at 06:37

0 Answers