-2

Computing 110 choose 108 is fast, but 522 choose 108 takes far too long. Any help?

from itertools import combinations
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#read df
df = pd.read_csv('capacity.csv', header=None)
#plot df
df[1].hist()
plt.ylabel('Quantity')
plt.xlabel('Capacity / Ah')
plt.savefig('capacity_hist.png')
# a numpy.ndarray is quicker than a pandas.DataFrame here
nda = df[1].values
# 110 choose 108: try every combination and keep the one with the smallest std
nda = nda[:110]
combin = combinations(nda, 108)
the_best_list = []
num = 0
tmp = nda.std()
for i in combin:
    num += 1
    std = np.std(i)  # compute once per combination
    if std < tmp:
        tmp = std
        the_best_list = i
#shows the result
print(num)
print(tmp)
print(the_best_list)

The shape of df is (522, 2).

The histogram of df is shown below. Is it a normal distribution?

The file capacity.csv is shown below:

1,38.913 2,38.904 3,38.925 4,38.872 6,38.876 7,38.968 8,38.896 9,38.893 10,38.915 11,38.974 12,38.885 13,38.982 14,38.944 16,38.844 18,38.914 19,38.913 20,38.824 21,38.926 22,38.964 23,37.295 24,38.807 25,38.908 27,38.927 28,38.83 30,38.943 32,39.013 33,38.751 36,38.92 37,38.869 38,38.909 39,38.9 40,38.892 41,38.9 42,38.951 43,38.726 44,38.937 45,38.757 46,38.867 47,38.882 48,38.952 49,38.918 50,38.875 51,38.998 52,38.888 54,37.822 56,38.982 57,38.922 58,38.934 59,38.938 60,39.035 61,38.955 63,38.935 64,38.946 66,38.983 67,38.983 69,38.886 71,38.884 72,38.964 73,39.06 74,38.869 75,38.926 76,38.972 78,38.851 79,38.989 80,38.902 81,38.998 82,38.897 83,38.98 85,37.939 86,38.947 87,38.617 89,38.981 90,38.957 91,38.851 92,38.978 93,38.831 94,39.001 95,38.942 96,39.003 97,38.978 98,38.915 99,38.872 100,38.977 101,38.932 102,38.583 103,38.966 104,38.935 105,38.906 106,39.004 107,38.989 109,38.852 110,38.925 111,38.183 113,38.896 114,38.979 116,38.914 117,38.666 119,38.9 120,38.952 121,38.806 122,37.957 123,38.922 124,38.844 125,38.786 126,38.95 128,38.875 129,38.954 130,38.912 131,38.93 132,37.785 133,38.883 134,38.911 135,38.859 136,38.802 137,38.909 138,38.892 139,37.872 140,38.897 141,38.985 142,39 143,38.916 144,38.902 145,38.906 146,37.863 147,38.905 148,38.733 149,38.9 150,38.851 151,38.9 152,38.855 153,38.931 155,38.924 156,38.864 157,38.869 158,38.943 159,38.978 160,39.018 161,38.992 162,38.654 163,38.95 164,38.887 165,38.966 166,38.98 167,38.862 168,38.96 170,38.893 171,38.931 172,38.894 173,38.985 174,38.941 175,38.92 176,38.911 178,38.952 179,38.711 183,38.945 184,38.893 185,38.882 186,38.807 187,38.968 188,38.958 189,38.88 190,38.937 191,38.899 192,38.922 193,36.259 194,38.901 195,38.946 196,38.971 197,38.916 198,38.968 201,38.888 203,38.872 204,38.815 205,38.861 206,38.909 207,39.023 208,38.832 209,38.959 210,38.964 211,38.91 212,38.952 213,39.033 214,38.987 215,38.942 216,38.956 217,38.916 218,38.842 219,37.471 220,38.931 221,38.833 222,38.952 223,38.903 
224,38.95 225,38.921 226,38.904 227,39.018 228,38.936 230,38.974 231,38.909 232,38.911 233,38.964 235,37.851 236,38.919 237,38.955 238,39.091 239,38.955 241,38.995 242,39.053 243,39.014 244,39.047 246,39.05 247,39.039 248,39.106 249,38.976 250,38.998 251,38.997 252,38.978 253,39.009 254,39.06 256,39.051 257,39.081 258,39.005 259,39.067 260,38.988 261,39.015 262,39.007 264,36.393 266,39.023 269,38.967 270,39.053 271,39.084 272,38.999 273,39.043 274,39.079 275,38.985 276,39.074 278,39.009 279,39.041 280,39.011 281,39.157 282,39.156 283,41.513 284,38.983 285,39.057 286,38.99 287,39.202 289,38.918 290,39.119 291,38.798 292,39.046 293,39.053 294,38.809 295,39.006 296,38.809 297,38.946 298,38.992 299,38.934 300,39.008 301,39.038 302,39.084 303,39.175 304,39.091 305,38.959 306,39.086 307,39.094 308,38.636 310,39.027 311,38.998 313,39.041 314,39.013 315,39.222 316,39.02 317,38.778 318,38.851 319,39.023 320,39.152 321,39.024 322,38.895 323,38.311 324,38.962 325,38.886 326,39.058 327,39.049 328,38.726 329,39.187 330,39.041 332,39.016 333,38.968 334,38.759 335,39.073 336,38.869 337,38.945 338,38.91 339,39.006 340,39.212 341,39.134 343,39.06 344,38.966 345,39.154 346,38.901 347,38.808 348,38.69 349,38.904 350,39.197 351,39.032 352,38.927 353,39.04 355,39.001 356,38.988 357,38.874 358,38.824 359,37.72 360,38.87 361,37.871 362,38.676 363,39.026 364,37.98 365,37.84 366,38.88 367,39.113 368,39.124 369,39.139 370,39.127 371,38.723 372,38.985 373,39.082 374,38.616 375,39.139 377,38.916 378,38.967 379,38.907 380,39.057 381,39.037 382,38.995 383,38.754 384,38.701 385,38.687 387,39.008 389,39.221 390,38.949 391,38.017 392,38.97 393,38.892 394,38.538 396,38.449 397,39.013 400,38.784 401,39.032 402,38.889 403,38.813 404,38.928 405,38.965 406,39.122 407,38.999 408,38.92 409,38.973 410,38.991 411,39.002 412,38.861 413,38.934 414,38.93 415,38.856 416,39.03 417,38.929 418,38.628 419,38.807 420,38.956 421,39.065 422,39.008 423,38.914 424,38.951 425,38.898 426,38.891 427,39.356 428,38.968 
429,39.026 430,38.925 431,39.212 432,39.183 433,39.049 434,39.079 435,39.091 436,39.071 437,38.724 438,38.879 439,38.987 440,39.019 441,38.945 442,39.182 443,39.125 444,39.138 445,39.078 446,38.825 447,39.001 448,39.011 449,39.084 450,39.024 451,39.026 452,39.102 453,39.102 454,39.317 455,38.936 457,38.969 458,38.936 459,38.536 460,38.852 461,39.107 462,38.637 463,38.867 464,37.063 465,38.035 466,39.064 467,37.437 468,38.874 469,38.475 470,38.836 471,38.971 472,38.827 473,38.908 474,38.567 475,38.749 476,37.969 477,38.855 478,38.348 479,38.876 481,38.769 482,38.675 483,38.891 484,38.649 485,38.919 486,38.937 487,38.922 488,38.842 490,38.813 491,38.83 492,38.809 493,38.739 494,38.811 495,39.013 496,39.08 497,38.892 498,38.868 499,38.879 501,38.87 502,38.848 503,38.665 504,39.06 505,38.696 506,38.948 507,38.792 508,38.896 509,38.855 510,38.963 511,38.926 513,38.674 514,38.741 515,38.793 516,38.851 517,38.964 518,38.83 519,38.846 520,39.073 522,38.81 523,37.493 524,38.948 525,38.704 526,37.456 527,38.716 529,38.941 530,38.828 531,38.909 532,38.829 534,38.795 535,38.757 537,38.699 538,38.982 539,38.983 540,38.932 541,38.808 542,38.988 543,38.933 544,39.06 545,39.134 546,38.651 547,38.839 548,39.132 549,38.911 550,38.503 551,38.785 552,38.763 554,38.671 555,38.51 556,38.936 557,38.559 558,38.701 559,38.693 560,38.562 561,38.889 562,38.91 563,38.441 564,38.701 566,38.772 568,38.681 569,38.509 570,38.785 571,38.799 572,38.79 573,38.833 575,38.656 576,38.677 577,38.854 578,38.439 579,38.849 582,38.094 583,38.871 584,38.502 585,38.818 586,38.627 587,38.685 588,38.812 589,38.752 590,38.601

I just want to find a combination which has the minimum value of standard deviation.

Thanks for your help.

  • Listing all the combinations and then finding the minimum is not the best way. Any help? – Jing Zhang Aug 16 '18 at 14:47
  • This code seems to already do what you want. What do you mean by "This code works, but not for the large amount of numbers"? – zvone Aug 16 '18 at 15:51
  • combination(120, 108) takes too much time and throws a memory error. – Jing Zhang Aug 17 '18 at 12:50
  • This problem looks like it would be solved well by a genetic algorithm – Spoonless Aug 17 '18 at 15:01
  • Your distribution (X) is not normal/Gaussian. However, it does look like (40.0 - X) could be well modeled with a log-normal, exponential or chi-squared distribution. Take a look at [the common probability distributions](https://blog.cloudera.com/blog/2015/12/common-probability-distributions-the-data-scientists-crib-sheet/) – Spoonless Aug 20 '18 at 12:35
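
To see why the 110-element case finishes while larger cases run out of time and memory, it helps to count the combinations before looping over them. The snippet below is only an illustration using the standard library's `math.comb` (Python 3.8+); the variable names are made up for this example.

```python
import math

# Number of subsets a brute-force loop would have to visit.
small = math.comb(110, 108)    # same as C(110, 2)
medium = math.comb(120, 108)   # same as C(120, 12)
large = math.comb(522, 108)

print(small)
print(medium)
print(len(str(large)), "decimal digits in C(522, 108)")
```

The first count is a few thousand, the second is already beyond 10^15, and the third has more than 100 digits, so no enumeration-based approach can finish for the full file.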

1 Answer

0

Here is a partial solution that works for normally distributed data. The idea is that the subset of samples closest to the mean should have a small standard deviation, although it is not guaranteed to be the smallest possible one.

If your data is not uniform (for example, if it falls into several clusters), you could do k-means clustering beforehand, then take the desired number of points closest to each cluster mean and see which of those subsets has the smallest standard deviation.

Note that the example data you provided only had 38 values, and they were all on one line, so I changed how nda was created from the dataframe.

import pandas as pd
import numpy as np

# the sample file holds a single row of values, so take the first row
df = pd.read_csv('randint35-39.txt', header=None)
nda = df.values[0, :]

nda = nda[:113]
print(nda.shape)

# sort values by their distance from the global mean
normalised_vals = np.abs(nda - np.mean(nda))
indices = np.argsort(normalised_vals)
print(indices)

num_in_combo = 12

# keep the num_in_combo values closest to the global mean
the_best = nda[indices[:num_in_combo]]

print("Global mean = {:10.3f} Global std_dev = {:10.3f}".format(np.mean(nda), np.std(nda)))
print("Subset mean = {:10.3f} Subset std_dev = {:10.3f}".format(np.mean(the_best), np.std(the_best)))

print(the_best)
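
The clustering variant mentioned above (take the points nearest to each cluster mean and compare the subsets) could be sketched roughly as follows. This is an illustration only: the function names are hypothetical, the toy values are invented, and the `centers` list is assumed to come from a k-means run that is not shown here. It uses `statistics.pstdev`, which matches the default behaviour of `np.std`.

```python
from statistics import pstdev

def k_nearest(values, center, k):
    """Return the k values closest to `center`."""
    return sorted(values, key=lambda v: abs(v - center))[:k]

def best_subset_near_centers(values, centers, k):
    """Build one k-nearest subset per candidate center and keep the
    subset with the smallest population standard deviation."""
    candidates = [k_nearest(values, c, k) for c in centers]
    return min(candidates, key=pstdev)

# Toy data with two clumps (not the real capacity.csv).
values = [38.90, 38.91, 38.92, 38.95, 37.80, 37.85, 37.30, 39.20]
centers = [38.92, 37.80]   # assumed cluster centers, for illustration only
best = best_subset_near_centers(values, centers, k=3)
print(sorted(best), pstdev(best))
```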

Edit: things to do to improve this answer: Use the median instead of the mean if you suspect big outliers. Use the output of this to create a binary membership mask, create a population of randomly, slightly changed masks, and then use a genetic algorithm to get the best mask (using the standard deviation of the included samples as the fitness function). If the range of the samples is small, do a brute-force search along the range: at each point, find the N samples closest to it (just like above) and see which point along the range gives the set with the lowest standard deviation.
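
The range-search idea at the end of the edit can be tightened for one-dimensional data. The N samples closest to any point always form a run of N consecutive values in sorted order, and a minimum-variance subset of size N is always such a run, so it is enough to slide a window over the sorted values. The sketch below is a swapped-in variant of that idea, not the answer author's code; `min_std_subset` is a hypothetical name and the toy values are invented.

```python
from statistics import pstdev

def min_std_subset(values, k):
    """Return k values with the smallest population standard deviation.

    Only the n - k + 1 windows of consecutive sorted values are checked,
    instead of all C(n, k) combinations.
    """
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")
    windows = (xs[i:i + k] for i in range(n - k + 1))
    return min(windows, key=pstdev)

# Toy data (not the real capacity.csv).
values = [38.9, 37.3, 38.91, 39.2, 38.95, 36.2, 38.92]
subset = min_std_subset(values, 3)
print(subset, pstdev(subset))
```

For the 522-value file and a subset size of 108 this would examine 415 windows, which is a small amount of work compared with enumerating combinations.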

Spoonless
  • Thanks a lot. I will need to think about your solution for a while because I am a beginner in Python. I also tried the k-means solution from sklearn.cluster, but I cannot confirm that the result from k-means is the best. – Jing Zhang Aug 17 '18 at 23:25
  • @jingzhang This is a classic dilemma. If you don't have any prior information about the problem to be optimised then the only way to be sure of having the global optimum is to do brute force. If you only need a good solution, but not the best, then statistical methods like this answer or local optimisation techniques can work. – Spoonless Aug 18 '18 at 09:35
  • Also for your problem, you can look at outlier rejection techniques if the set you want to keep is about 95% or more of the total – Spoonless Aug 18 '18 at 09:37
  • Many thanks, Spoonless! What I need is a solution that gives the same kind of result as manually throwing out the outliers with the help of a histogram plot. I just want to save time by using programming techniques. Is there any example you could send me to learn from? – Jing Zhang Aug 18 '18 at 13:53
  • SciPy's differential evolution may be the solution. I will try to use it. – Jing Zhang Aug 20 '18 at 00:18
  • Thanks for the feedback, Jing. If you like my suggestions, please upvote my answer. Or, best of all, edit my answer with your results from SciPy's differential evolution and then mark it as the accepted answer so other people can benefit too – Spoonless Aug 20 '18 at 08:45
  • I just upvoted your answer. I also tried differential evolution, but got no answer. It is difficult to code the function for DE to minimize. Any help? – Jing Zhang Aug 20 '18 at 11:41
  • @JingZhang I haven't used SciPy before for genetic algorithms or other evolutionary code (back in my PhD I wrote it from scratch in C), and realistically I won't have time to figure it out before I go on vacation in a few days. So even though it looks fun, I won't be able to help in the next week or so, but see my comment on your histogram in the original post – Spoonless Aug 20 '18 at 12:09
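
On the question of how to write the function for an evolutionary optimizer to minimize: such optimizers usually work on a vector of real numbers, so one possible encoding is one weight per sample, with the k largest weights deciding which samples are kept. The sketch below shows that objective plus a very small mutate-and-keep-if-not-worse loop as a stand-in for a real optimizer. It is not SciPy's differential evolution and not code from this thread; all names and the toy values are hypothetical.

```python
import random
from statistics import pstdev

def make_objective(values, k):
    """Build f(weights) -> std of the k values whose weights are largest."""
    def objective(weights):
        order = sorted(range(len(values)), key=lambda i: weights[i], reverse=True)
        chosen = [values[i] for i in order[:k]]
        return pstdev(chosen)
    return objective

def random_search(objective, n, steps=2000, seed=0):
    """Mutate one weight at a time and keep the change if it is not worse."""
    rng = random.Random(seed)
    best = [rng.random() for _ in range(n)]
    best_score = objective(best)
    for _ in range(steps):
        trial = list(best)
        trial[rng.randrange(n)] = rng.random()   # mutate a single weight
        score = objective(trial)
        if score <= best_score:
            best, best_score = trial, score
    return best, best_score

# Toy data (not the real capacity.csv).
values = [38.9, 37.3, 38.91, 39.2, 38.95, 36.2, 38.92]
objective = make_objective(values, k=3)
weights, score = random_search(objective, n=len(values))
print(score)
```

A search like this only gives a good subset, not a proven optimum, which matches the trade-off described in the comments above.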