I am trying to encircle the data points of a scatterplot (using ggplot2) with two outlines, so that (1) 100% of my data points and (2) 80% of my data points lie inside the respective outline. (See sketch 1; please excuse the lazy execution with the Snipping Tool.)

Here is my dummy-dataset:

 x  y
1   2
1   3
1   4
1   5
2   1
2   2
2   3
2   4
2   5
3   1
3   2
3   3
3   4
4   1
4   2
4   3
5   1
5   2
5   3
5   4
5   9
5   10
6   1
6   2
6   3
6   4
6   5
6   6
6   8
6   9
6   10
7   1
7   2
7   3
7   4
7   5
7   6
7   7
7   8
7   9
7   10
8   2
8   3
8   4
8   5
8   6
8   7
8   8
8   9

I have tried multiple approaches to achieve this, but nothing really satisfies what I want to accomplish.

My first approach was geom_density2d(). However, I have trouble interpreting the results, as I don't really know what the levels mean.

I tried the following:

ggplot(myData, aes(x,y)) + geom_point() + geom_density2d(bins=4, aes(colour=..level..))

Which results in this plot 2:

geom_density2d plot with bins=4 and colours set to ..level..

It is good, as it accomplishes the dent in the contours. However, I don't know how I would get a hull that encircles 100% of my data, and a second hull that encircles 80% of my data.
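The levels drawn by geom_density2d() are heights of a kernel density estimate (computed with MASS::kde2d()), not shares of the data. One possible way to tie a contour to a share of the points is sketched below: evaluate the density estimate at every data point and draw the contour at the matching quantile of those values. This is only a sketch under assumptions: the random `myData` is a placeholder for the real data frame with columns `x` and `y`, and `pad`, `lev80` and `lev100` are names made up for this illustration. The coverage is approximate because the density is read from a grid.

```r
library(MASS)     # kde2d()
library(ggplot2)

# Placeholder data (assumption); replace with the real myData (columns x, y).
set.seed(1)
myData <- data.frame(x = rnorm(200), y = rnorm(200))

# Pad the grid beyond the data range so the outer contours can close.
pad  <- function(v, f = 0.5) range(v) + c(-1, 1) * f * diff(range(v))
dens <- kde2d(myData$x, myData$y, n = 200,
              lims = c(pad(myData$x), pad(myData$y)))

# Approximate density at each data point: value of the grid node just
# below/left of the point.
ix <- findInterval(myData$x, dens$x)
iy <- findInterval(myData$y, dens$y)
dens_at_points <- dens$z[cbind(ix, iy)]

# About 80% of the points have a density above the 20% quantile, so the
# contour at that height encloses roughly 80% of them. The contour at the
# minimum encloses roughly all points.
lev80  <- unname(quantile(dens_at_points, probs = 0.2))
lev100 <- min(dens_at_points)

# Long format for geom_contour(); x varies fastest, matching the
# column-major layout of dens$z.
grid   <- expand.grid(x = dens$x, y = dens$y)
grid$z <- as.vector(dens$z)

p <- ggplot(myData, aes(x, y)) +
  geom_point() +
  geom_contour(data = grid, aes(x, y, z = z),
               breaks = c(lev100, lev80), colour = "red")
```

Printing `p` draws both outlines. Because they follow the density estimate, they can show dents where the data are sparse, but the result depends on the bandwidth that kde2d() chooses.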

My second approach was to use the geom_encircle() function of the ggalt package. This results in the following plot 3:

geom_encircle of myData

This time, all of my data points are encircled, so far so good. But the "dent" seen in the geom_density2d() plot is not present, and I don't know how to add an "encirclement" that covers only 80% of my data points.

My third approach was using the geom_bag() function (described here).

ggplot(myData, aes(x,y)) + geom_point() + geom_bag(prop=0.9) + geom_bag(prop=0.8)

(with geom_bag() I cannot use prop=1.0 to cover all datapoints, however setting it to 0.9 is sufficient)

This yields the following plot 5:

geom_bag() with 90% and 80% of the data

This time, again, the dent is not present. Another problem is that setting prop=0.7 and prop=0.7 yields the exact same outcome. A further problem is that the hull is not smooth like the contours from geom_density2d().

How can I produce a plot (with ggplot2) that looks like my sketch in 1?

Thanks in advance!

____________________________________________

EDIT:

The actual dataset to show the real distribution of my datapoints:

    x   y
1   -19.397412  47.544324
2   -8.213419   69.892953
3   -29.926849  39.743923
4   -75.377447  79.817208
5   -9.215048   40.705533
6   -42.868995  45.721222
7   -85.590572  84.058463
8   -62.544121  69.371364
9   -60.209205  64.546267
10  3.598963    20.109707
11  -4.552074   61.3339
12  -197.619021 52.225312
13  -147.133639 56.96088
14  -59.402414  56.487012
15  -68.361091  46.811878
16  -105.556485 57.603839
17  -94.354948  32.706933
18  -107.26281  28.477637
19  -155.692967 35.106937
20  -80.819257  30.664812
21  -142.055086 33.728788
22  -118.353934 27.362929
23  -114.634413 31.501665
24  -113.470642 29.136781
25  -181.380891 41.046883
26  -171.106218 23.359443
27  -156.720415 35.450407
28  -165.042839 29.349575
29  -92.869955  25.478965
30  -114.78719  23.860353
31  -134.115204 25.491367
32  -109.430656 19.105614
33  -120.451655 25.97992
34  -87.570713  21.111895
35  -91.222139  22.484895
36  -208.979695 38.311266
37  -98.814223  16.121487
38  -201.812263 49.547512
39  -168.948464 39.583593
40  -112.44335  20.979357
41  -174.138029 28.470047
42  -220.936718 33.452972
43  -169.687859 33.173458
44  -157.119306 38.573987
45  -150.682075 41.66627
46  -77.397116  27.220171
47  -177.559527 53.278523
48  -61.212396  6.796908
49  -94.602774  24.669706
50  -204.333869 37.002679
51  -124.442364 31.519392
52  -165.722504 39.464188
53  -57.849212  23.973774
54  -106.643382 38.560785
55  -90.679094  29.863184
56  -132.476054 31.988021
57  -188.33621  29.658416
58  -136.247184 38.870171
59  -59.929772  20.626164
60  -121.020003 33.862312
61  -82.968422  33.033312
62  -79.130004  32.800121
63  -51.463395  23.452366
64  -63.819269  27.257994
65  -64.02259   27.711516
66  -66.876407  18.156063
67  -68.175454  22.996369
68  -108.640035 29.915306
69  -21.512647  16.930815
70  -66.902542  17.177093
71  -160.262625 33.061052
72  -41.672641  30.510433
73  -83.31784   28.965415
74  -132.410284 22.843924
75  -54.724716  10.642682
76  -69.688094  30.798878
77  -120.775133 24.597096
78  -78.655551  30.368373
79  -68.299767  35.937048
80  -45.037891  21.636422
81  -49.679704  19.508719
82  -62.018393  76.199247
83  -113.777141 27.730892
84  -74.630501  49.062317
85  -95.154793  37.279829
86  -65.229569  46.26744
87  -42.139223  16.38709
88  -94.186408  28.708069
89  -100.920471 27.533579
90  -66.332707  22.573064
91  -26.419725  13.948061
92  -152.704377 34.165409
93  -50.309209  22.032052
94  -125.896489 34.411915
95  -119.304969 28.786249
96  -41.689412  37.314049
97  -99.936438  31.363461
98  -74.807901  24.259652

This yields the following plot 6:

Actual datapoints

I would like to show that most of my data points are in the lower part, while still encircling all the data, something like in sketch 7:

____________________________________________

EDIT2:

The "ultimate goal" would be to compare both contours, without the corresponding data points, to those of another dataset, to see whether there are overlaps, without overcrowding the resulting plot with too many data points.
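A rough sketch of how such outlines could be extracted as plain data frames and overlaid without the points, assuming a kernel density contour is an acceptable outline. The helper `contour_df()`, its arguments, and the two random data sets `dataA` and `dataB` are hypothetical names introduced for this illustration only. contourLines() comes from base R's grDevices package.

```r
library(MASS)     # kde2d()
library(ggplot2)

# Hypothetical helper: outline that encloses roughly `prop` of the points
# of `d` (columns x, y), returned as a data frame of contour pieces.
contour_df <- function(d, prop, label, n = 200, f = 0.5) {
  pad  <- function(v) range(v) + c(-1, 1) * f * diff(range(v))
  dens <- kde2d(d$x, d$y, n = n, lims = c(pad(d$x), pad(d$y)))
  # Approximate density at each point, read from the grid.
  at_pts <- dens$z[cbind(findInterval(d$x, dens$x),
                         findInterval(d$y, dens$y))]
  lev    <- unname(quantile(at_pts, probs = 1 - prop))
  pieces <- contourLines(dens$x, dens$y, dens$z, levels = lev)
  # A level can yield several separate pieces; keep them apart via `piece`.
  do.call(rbind, lapply(seq_along(pieces), function(i) {
    data.frame(x = pieces[[i]]$x, y = pieces[[i]]$y,
               piece = paste(label, prop, i), dataset = label)
  }))
}

# Placeholder data sets (assumption); replace with the real ones.
set.seed(2)
dataA <- data.frame(x = rnorm(150), y = rnorm(150))
dataB <- data.frame(x = rnorm(150, mean = 1.5), y = rnorm(150, mean = 1))

outlines <- rbind(contour_df(dataA, 0.8, "A"),
                  contour_df(dataB, 0.8, "B"))

# Only the outlines are drawn, one colour per data set.
p <- ggplot(outlines, aes(x, y, group = piece, colour = dataset)) +
  geom_path()
```

Grouping by `piece` keeps separate contour pieces from being joined by stray line segments. Calling the helper again with a `prop` close to 1 would add the outer outline for each data set.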

Servus
  • how do you select 80% of your points? – pogibas Sep 15 '17 at 13:51
  • you could randomly sample 80% of the data -- there's not a unique set of points that represents 80% of your data. There are infinitely many curves that would cover 80% of your data. What exactly do you have in mind? What are the constraints you want to satisfy. Are you making some sort of distributional assumption about these points? If so, you should make that very clear. – MrFlick Sep 15 '17 at 13:52
  • Hey! Thanks for your quick answers! I did not think of the endless possibility of selecting points while putting up my dummy-data, sorry! I will edit my original post with the actual dataset to show the distribution of points. My goal is to show that the majority of the datapoints (e.g. 80%) is centered in the lower part, while still adding an encircling hull to the whole dataset. – Servus Sep 15 '17 at 14:17
  • Added the actual data to the original post, along with a new sketch. – Servus Sep 15 '17 at 14:26
  • Use of the `bins` parameter in `stat_density2d` has been discussed before [here](https://stackoverflow.com/questions/19329318/how-to-correctly-interpret-ggplots-stat-density2d) and [here](https://stackoverflow.com/questions/34410999/why-is-bins-parameter-unknown-for-the-stat-density2d-function-ggmap). May be worth taking a look. One answer there recommended `emdbook::HPDregionplot`, which allows you to specify `prob = 0.8`. – Z.Lin Sep 15 '17 at 15:17
  • I already tried `emdbook::HPDregionplot`, but I always get an error message when I set `prob = 0.8` or `prob = 0.7`. What does work is setting `prob = 0.78`, using `test <- as.data.frame(HPDregionplot(mcmc(data.matrix(myData)), prob=0.78))` with my actual dataset (see the edit in the OP). Using `ggplot(myData) + geom_point(aes(x,y)) + geom_polygon(data=test, aes(x,y), alpha=0.3)` gives [This Plot](https://i.imgur.com/j05l59Z.png). I would like to define `prob = 0.8` as well as `prob = 1.0` (although setting it to 1.0 is not possible due to another error). – Servus Sep 16 '17 at 10:59

0 Answers