I have a dataframe whose schema looks like this:

 |-- alleleFrequencies: array (nullable = true)
 |    |-- element: double (containsNull = true)

Each entry in alleleFrequencies is an array of doubles.

I wish to get this data into a numpy array, which I have naively done thus:

allele_freq1 = np.array(df1.select("alleleFrequencies").collect())

but this gives

[[list([0.5, 0.5])]
 [list([0.5, 0.5])]
 [list([1.0])]...

which isn't the simple 1D array I want.

I've also tried

allele_freq1 = np.array(df1.select("alleleFrequencies")[0].collect())

but this gives

TypeError: 'Column' object is not callable

I've also tried

allele_freq1 = np.array(df1.select("alleleFrequencies[0]").collect())

but this gives

org.apache.spark.sql.AnalysisException: cannot resolve '`alleleFrequencies[0]`' given input columns...

How can I get the first item in the column alleleFrequencies placed into a numpy array?

I checked "How to extract an element from a array in pyspark", but I don't see how the solution there applies to my situation.

con
  • Possible duplicate of [How to extract an element from a array in pyspark](https://stackoverflow.com/questions/45254928/how-to-extract-an-element-from-a-array-in-pyspark) – pault Nov 08 '19 at 20:32
  • @pault the first you give gives an error that it cannot resolve the column name, and the link you gave gives no useful information – con Nov 08 '19 at 20:33
  • ok then use `pyspark.sql.functions.col` and `getItem` (as shown in the link I gave): `np.array(df1.select(col("alleleFrequencies").getItem(0)).collect())`. *no useful information* is a pretty broad statement. – pault Nov 08 '19 at 20:36
  • thanks @pault the last comment `pyspark.sql.functions.col: np.array(df1.select(col("alleleFrequencies").getItem(0)).collect())` gets the job done – con Nov 08 '19 at 20:37
  • That's exactly what's contained in the duplicate I linked. I bet `selectExpr` would probably work here too: `np.array(df1.selectExpr("alleleFrequencies[0]").collect())` – pault Nov 08 '19 at 20:37

1 Answer

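import numpy as np
from pyspark.sql.functions import col
# getItem(0) selects the first element of each alleleFrequencies array
allele_freq1 = np.array(df1.select(col("alleleFrequencies").getItem(0)).collect())
print(allele_freq1)
print(type(allele_freq1))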
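
Note that collect() returns one Row per record, so the array built this way is likely to have shape (n, 1) rather than being strictly 1D. The sketch below shows one way to flatten it. It assumes plain 1-tuples as a stand-in for the collected Row objects (pyspark Rows are tuple subclasses) so that it runs without a Spark session; the sample values are taken from the question, and the name allele_freq1_flat is made up for illustration.

```python
import numpy as np

# Stand-in for df1.select(col("alleleFrequencies").getItem(0)).collect().
# Assumption: plain 1-tuples mimic pyspark Row objects, which subclass tuple.
rows = [(0.5,), (0.5,), (1.0,)]

allele_freq1 = np.array(rows)
print(allele_freq1.shape)  # (3, 1): one column per selected field

# Collapse the single-column 2D array into a 1D array.
allele_freq1_flat = allele_freq1.ravel()
print(allele_freq1_flat.shape)  # (3,)
```

If some arrays are null or empty, getItem(0) would presumably yield None for those rows, and the result would become an object array, so filtering those rows first may be worthwhile.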
showdev
con