2

I have three datasets that compile into one big dataset.

Data1 has x-values ranging from 0-47 (ordered), with many y-values (a small error) attached to an x-value. In total there are approx 100000 y values.
Data 2 and 3 are similar but with x-values 48-80 and 80-95 respectively.

The end goal is to produce a standard deviation for each x value (therefore 96 in total), based on the numerous y-values. Therefore, I think I should first extract the y-values for each x-value out of these datasets and then determine the standard deviation as per the norm.

In mathematica, I have tried using the select and part functions to no avail.

Akaisteph7
  • 5,034
  • 2
  • 20
  • 43

2 Answers2

1

Statistically it would be better to provide a prediction interval with the predicted value of y.

There is a video about that here:-

Intervals (for the Mean Response and a Single Response) in Simple Linear Regression

Illustrating with some example data, stored here as a QR code.

enter image description here

qrimage = Import["https://i.stack.imgur.com/s7Ul7.png"];

data = Uncompress@BarcodeRecognize@qrimage;

ListPlot[data, Frame -> True, Axes -> None]

enter image description here

Setting 66 & 95% confidence levels

cl = Map[Function[σ, 2 (CDF[NormalDistribution[0, 1], σ] - 0.5)], {1, 2}];

(* trying a quadratic linear fit *)
lm = LinearModelFit[data, {1, a, a^2}, a];
bands = lm["SinglePredictionBands", ConfidenceLevel -> #] & /@ cl;

(* x value for an observation outside of the sample observations *)
x0 = 50;

(* Predicted value of y *)
y0 = lm[x0]

39.8094

(* Least-squares regression of Y on X *)
Normal[lm]

26.4425 - 0.00702613 a + 0.0054873 a^2

(* Confidence interval for y0 given x0 *)
b1 = bands /. a -> x0;

(* R^2 goodness of fit *)
lm["RSquared"]

0.886419

b2 = {bands, {Normal[lm]}};

(* Prediction intervals plotted over the data range *)
Show[
 Plot[b2, {a, 0, 100}, PlotRange -> {{0, 100}, Automatic}, Filling -> {1 -> {2}}],
 ListPlot[data],
 ListPlot[{{x0, lm[x0]}}, PlotStyle -> Red],
 Graphics[{Red, Line[{{x0, Min[b1]}, {x0, Max[b1]}}]}],
 Frame -> True, Axes -> None]

enter image description here

Row[{"For x0 = ", x0, ", y0 = ", y0,
  " with 95% prediction interval ", y0, " ± ", y0 - Min[b1]}]

For x0 = 50, y0 = 39.8094 with 95% prediction interval 39.8094 ± 12.1118

Addressing your requirement:

The end goal is to produce a standard deviation for each x value (therefore 96 in total), based on the numerous y-values.

The best measure for this may be the standard errors, which can be found via

lm["SinglePredictionConfidenceIntervalTable"] and lm["SinglePredictionErrors"]

They will provide "standard errors for the predicted response of single observations". If you have multiple y values for a single x there will still just be one standard error for each x value.

Ref: https://reference.wolfram.com/language/ref/LinearModelFit.html (Details & Options)

Chris Degnen
  • 8,443
  • 2
  • 23
  • 40
0

See if you can adapt this

exampledata={{1,1},{1,2},{1,4},{2,1},{2,2},{2,2},{3,4},{3,5},{3,12}};
(*first a manual calculation to see what the answer should be*)
{StandardDeviation[{1,2,4}],StandardDeviation[{1,2,2}],StandardDeviation[{4,5,12}]}
(*and now automate the calculation*)
(*if your x values are not exact this will need to be changed*)
x=Union[Map[First,exampledata]];
y[x_]:=Map[Last,Cases[exampledata,{x,_}]];
std=Map[StandardDeviation[y[#]]&,x]

(*{Sqrt[7/3], 1/Sqrt[3], Sqrt[19]}*)

(*{Sqrt[7/3], 1/Sqrt[3], Sqrt[19]}*)

Since you have 100000 pairs this might speed it up. You have said that your data is sorted on x so I won't sort it here. If your data isn't sorted this will produce incorrect results.

exampledata={{1,1},{1,2},{1,4},{2,1},{2,2},{2,2},{3,4},{3,5},{3,12}};
y[x_]:=Map[Last,x];
std=Map[StandardDeviation[y[#]]&, SplitBy[exampledata,First]]

That should give exactly the same results, with fewer passes through the data. You might compare the timing of the two methods and verify that they do produce exactly the same results.

Reading this over, I am not absolutely certain that I exactly correctly understood your verbal description the form of your data structure. I thought you had a long list of {x,y} points with lots of repeated x values. If it looks like I misunderstood and you could include a tiny example bit of Mathematica code holding some of your sample data then I would edit my code to match.

Bill
  • 3,664
  • 1
  • 12
  • 9