I have a dataframe whose schema looks like this:

|-- alleleFrequencies: array (nullable = true)
|    |-- element: double (containsNull = true)

so alleleFrequencies is an array of doubles.
I wish to get this data into a numpy array, which I have naively done thus:
allele_freq1 = np.array(df1.select("alleleFrequencies").collect())
but this gives
[[list([0.5, 0.5])]
 [list([0.5, 0.5])]
 [list([1.0])]...

which is a 2D object array wrapping Python lists, not the simple 1D array I want.
I've also tried
allele_freq1 = np.array(df1.select("alleleFrequencies")[0].collect())
but this gives
TypeError: 'Column' object is not callable
I've also tried
allele_freq1 = np.array(df1.select("alleleFrequencies[0]").collect())
but this gives
org.apache.spark.sql.AnalysisException: cannot resolve '`alleleFrequencies[0]`' given input columns...
How can I get the first item of each array in the alleleFrequencies column placed into a numpy array?
I checked "How to extract an element from a array in pyspark", but I don't see how the solution there applies to my situation.
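In case it clarifies what I'm after, here's a plain-numpy sketch of the flattening I want, where rows is a hand-built stand-in for what collect() returns (Spark Row objects index like tuples, so each row wraps one alleleFrequencies list):

```python
import numpy as np

# Hypothetical stand-in for df1.select("alleleFrequencies").collect();
# each tuple plays the role of a Row containing one array of doubles.
rows = [([0.5, 0.5],), ([0.5, 0.5],), ([1.0],)]

# Pull the first element of each alleleFrequencies array,
# giving a flat 1D float array instead of a 2D object array of lists.
allele_freq1 = np.array([row[0][0] for row in rows])
```

This gives the 1D shape I'm looking for, but I'd prefer to do the extraction on the Spark side rather than in a Python loop.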