I'm generating a random dataset. The dataset is sequential and has upper and lower limits. At some random points, I want it to contain outliers above the upper limit or below the lower limit. Here's my code:

generated_data = (12) * np.random.rand(100) + 630
outlier_data = (12) * np.random.rand(20) + (*HERE'S THE PROBLEM)
merged_data = np.concatenate((generated_data, outlier_data))

After this, I plan to shuffle merged_data, but I don't know how to generate the outliers properly.

Jacky1205
1a2a3a 4a5a6a
  • So what are your limits, and what is the actual problem? – gmds Mar 26 '19 at 07:31
  • Do you mean to have some values below 630 and above 1830? – Tojra Mar 26 '19 at 07:31
  • I suggest looking into the PyOD package, specifically the `get_outliers_inliers` function in `pyod.utils.data`. – vi_me Mar 26 '19 at 07:32
  • Sorry if my question wasn't clear. My dataset has a median of 630, and the upper and lower limits are 12 away from it. I want outliers above 642 or below 618 at random positions. – 1a2a3a 4a5a6a Mar 26 '19 at 07:32
  • You can generate the data using the uniform or normal distribution functions in numpy with a given mean and median. This will also generate outliers. – Tojra Mar 26 '19 at 07:38
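
One way to read the last comment, as a minimal sketch: draw the values from a normal distribution centred on 630 and choose the standard deviation so that some samples naturally land outside 618 to 642. The standard deviation of 8, the sample size, and the seed are assumptions chosen for illustration; none of them comes from the question.

```python
import numpy as np

# Assumed for illustration: the seed, sample size, and std are not from the question.
rng = np.random.default_rng(0)
median, limit = 630, 12
std = 8  # the limits then sit 1.5 standard deviations from the centre

data = rng.normal(loc=median, scale=std, size=1000)

# Points further than `limit` from the centre count as outliers.
is_outlier = np.abs(data - median) > limit
outlier_fraction = is_outlier.mean()
print(f"{is_outlier.sum()} of {data.size} points fall outside the limits")
```

With these settings about 13% of the samples are expected to fall outside the limits. The trade-off is that neither the exact number of outliers nor their positions can be controlled this way.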

2 Answers

Generate the three parts of the data independently (the non-outliers, the lower outliers, and the upper outliers), then merge them and shuffle the result:

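import numpy as np

def generate(median=630, err=12, outlier_err=100, size=80, outlier_size=10):
    # Non-outliers: median plus a random error within (-err, err)
    errs = err * np.random.rand(size) * np.random.choice((-1, 1), size)
    data = median + errs

    # Lower outliers: up to outlier_err below the lower limit
    lower_errs = outlier_err * np.random.rand(outlier_size)
    lower_outliers = median - err - lower_errs

    # Upper outliers: up to outlier_err above the upper limit
    upper_errs = outlier_err * np.random.rand(outlier_size)
    upper_outliers = median + err + upper_errs

    # Merge the three parts and shuffle in place
    data = np.concatenate((data, lower_outliers, upper_outliers))
    np.random.shuffle(data)

    return data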

You'll get something like this:

>>> data = generate()
>>> data.shape
(100,)
>>> data.min()
518.1635764484727
>>> data.max()
729.9467630423616
>>> np.median(data)
629.9427184256936
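
If the series has to stay in order, shuffling the whole array is not strictly necessary. A possible variant, sketched here under assumptions, picks random positions and overwrites only those with values outside the limits. The function name `inject_outliers`, its parameters, and the seeds are hypothetical and not part of the answer above.

```python
import numpy as np

def inject_outliers(data, median=630, err=12, outlier_err=100, n_outliers=20, seed=None):
    """Return a copy of data with n_outliers random positions pushed outside median +/- err."""
    rng = np.random.default_rng(seed)
    result = np.asarray(data, dtype=float).copy()

    # Pick distinct positions so the rest of the sequence keeps its order.
    idx = rng.choice(result.size, size=n_outliers, replace=False)

    # rng.random() is in [0, 1), so the distance is in (err, err + outlier_err],
    # which places every replaced value strictly outside the limits.
    distance = err + outlier_err * (1.0 - rng.random(n_outliers))
    sign = rng.choice((-1, 1), size=n_outliers)  # below or above the limits
    result[idx] = median + sign * distance
    return result, idx

# Hypothetical usage: 100 in-range values, 20 of them replaced by outliers.
rng = np.random.default_rng(0)
clean = 630 + 12 * (2 * rng.random(100) - 1)  # values in [618, 642)
noisy, positions = inject_outliers(clean, seed=1)
```

Returning the chosen positions alongside the data makes it easy to check later which points were altered.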
constt
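import numpy as np
import pandas as pd

def generate_outlier(data, perc):
    # data: numeric pandas DataFrame; perc: percentage of values to replace with outliers
    perc /= 100
    low, high = int(data.to_numpy().min()), int(data.to_numpy().max())
    n_lower = data.size // 2
    # Candidate outliers lie 100 to 300 units outside the observed range
    lower_outlier = np.random.randint(low - 300, low - 100, size=n_lower)
    upper_outlier = np.random.randint(high + 100, high + 300, size=data.size - n_lower)
    outlier = np.concatenate((lower_outlier, upper_outlier))
    np.random.shuffle(outlier)
    outlier = pd.DataFrame(np.reshape(outlier, data.shape), index=data.index, columns=data.columns)
    # Blank out all but roughly perc% of the candidates, then fill the blanks with the original values
    outlier = outlier.mask(np.random.random(data.shape) > perc)
    result = outlier.fillna(data)
    return result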
Ankit
  • Code-only answers are discouraged on Stack Overflow. Please provide a description of how your answer solves the problem and provide references where appropriate. – DaveL17 Oct 10 '22 at 15:36