
I'm a data scientist and still relatively new to Scala. I'm trying to run a t-test from any existing package. I am looking for sample Scala code on a dummy data set that will work, and guidance on how to read the Scala/Java documentation.

I'm working in an EMR Notebook (basically a Jupyter notebook) on an AWS EMR cluster. I tried referring to this documentation but can't make sense of it: https://commons.apache.org/proper/commons-math/javadocs/api-3.6/org/apache/commons/math3/stat/inference/TTest.html#TTest()

Here's what I've tried, using multiple import statements for two different packages that have t-test functions. I have multiple lines for the math3.stat.inference package since I'm not entirely certain of the differences between them and wanted to make sure this part wasn't the problem.

import org.apache.commons.math3.stat.inference
import org.apache.commons.math3.stat.inference._ // not sure if this imports all classes/methods/functions
import org.apache.commons.math3.stat.inference.TTest._
import org.apache.commons.math3.stat.inference.TTest

import org.apache.spark.mllib.stat.test

No errors there.

import org.apache.asdf

Returns an error, as expected.

The documentation for math3.stat.inference says there is a TTest() constructor and then shows a bunch of methods. How does this tell me how to use these functions/methods/classes? I see the following "method" does what I'm looking for:

t(double m, double mu, double v, double n)
Computes t test statistic for 1-sample t-test.

but I don't know how to use it. Here are several things I've tried:

inference.t
inference.StudentTTest
test.student
test.TTest
TTest.t
etc.

But I get errors like the following:

An error was encountered:
<console>:42: error: object t is not a member of package org.apache.spark.mllib.stat.test
       test.t

An error was encountered:
<console>:42: error: object TTest is not a member of package org.apache.spark.mllib.stat.test
       test.TTest

...etc.

So how do I fix these issues/calculate a simple, one-sample t-statistic in Scala with a Spark kernel? Any instructions/guidance on how to understand the documentation will be helpful for the long-term as well.

user2205916

1 Answer


The formula for a one-sample t test is straightforward to implement as a UDF (user-defined function).

UDFs let you apply custom functions to the rows of a DataFrame. I assume you can produce the aggregated inputs (sample mean, standard deviation, size) with the standard groupBy and agg functions.

import org.apache.spark.sql.functions.col
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.expressions.UserDefinedFunction
import spark.implicits._ // for toDF; already in scope in the shell/notebook

val data = Seq((310, 40, 300.0, 18.5), (310, 41, 320.0, 14.5)).toDF("mu", "sample_size", "sample_mean", "sample_sd")

+---+-----------+-----------+---------+
| mu|sample_size|sample_mean|sample_sd|
+---+-----------+-----------+---------+
|310|         40|      300.0|     18.5|
|310|         41|      320.0|     14.5|
+---+-----------+-----------+---------+

val testStatisticUdf: UserDefinedFunction = udf {
  (sample_mean: Double, mu:Double, sample_sd:Double, sample_size: Int) => 
    (sample_mean - mu) / (sample_sd / math.sqrt(sample_size.toDouble))
}

val result = data.withColumn("testStatistic", testStatisticUdf(col("sample_mean"), col("mu"), col("sample_sd"), col("sample_size")))

+---+-----------+-----------+---------+-------------------+
| mu|sample_size|sample_mean|sample_sd|      testStatistic|
+---+-----------+-----------+---------+-------------------+
|310|         40|      300.0|     18.5|-3.4186785515333833|
|310|         41|      320.0|     14.5| 4.4159477499536886|
+---+-----------+-----------+---------+-------------------+
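For reference, the `t(double m, double mu, double v, double n)` method in the commons-math3 Javadoc the question links to computes exactly this statistic: since `t` is an instance method rather than a static one, you first construct a `TTest` object and then call the method on it. A minimal sketch, using a plain-Scala helper to show the formula itself (the commented lines show the equivalent commons-math3 call, assuming the jar is on the notebook's classpath):

```scala
// One-sample t statistic: t = (m - mu) / sqrt(v / n),
// where m = sample mean, mu = hypothesized mean,
// v = sample variance (sd squared), n = sample size.
def oneSampleT(m: Double, mu: Double, v: Double, n: Double): Double =
  (m - mu) / math.sqrt(v / n)

// First row of the DataFrame above: mean 300.0, mu 310, sd 18.5, n 40
val t1 = oneSampleT(300.0, 310.0, 18.5 * 18.5, 40)

// Equivalent call with commons-math3 on the classpath:
//   import org.apache.commons.math3.stat.inference.TTest
//   val t1 = new TTest().t(300.0, 310.0, 18.5 * 18.5, 40)

println(t1)
```

Note that commons-math3 takes the variance `v`, not the standard deviation, as its third argument.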
cliff
  • This is helpful, but I'm curious, how do I debug the issues associated with the package? – user2205916 Nov 22 '20 at 00:04
  • org.apache.commons.math3.stat and org.apache.spark.mllib.stat are different libraries. The error message you showed is the compiler telling us that the methods you are calling don’t exist in the mllib stat package. – cliff Nov 23 '20 at 01:28