PySpark: extract/collect first array element from a column

Question

I have a dataframe whose schema looks like this:

 |-- alleleFrequencies: array (nullable = true)
 |    |-- element: double (containsNull = true)

That is, alleleFrequencies is an array of doubles.

I wish to get this data into a numpy array, which I have naively done thus:

allele_freq1 = np.array(df1.select("alleleFrequencies").collect())

but this gives

[[list([0.5, 0.5])]
 [list([0.5, 0.5])]
 [list([1.0])]...

which is not the simple 1D array I want.

I've also tried

allele_freq1 = np.array(df1.select("alleleFrequencies")[0].collect())

but this gives

TypeError: 'Column' object is not callable

I've also tried

allele_freq1 = np.array(df1.select("alleleFrequencies[0]").collect())

but this gives

org.apache.spark.sql.AnalysisException: cannot resolve '`alleleFrequencies[0]`' given input columns...

How can I get the first item in the column alleleFrequencies placed into a numpy array?

I checked "How to extract an element from a array in pyspark", but I don't see how the solution there applies to my situation.

Possible duplicate of [How to extract an element from a array in pyspark](https://stackoverflow.com/questions/45254928/how-to-extract-an-element-from-a-array-in-pyspark) — pault, Nov 08 '19 at 20:32
@pault the first you give gives an error that it cannot resolve the column name, and the link you gave gives no useful information — con, Nov 08 '19 at 20:33
ok then use `pyspark.sql.functions.col` and `getItem` (as shown in the link I gave): `np.array(df1.select(col("alleleFrequencies").getItem(0)).collect())`. *no useful information* is a pretty broad statement. — pault, Nov 08 '19 at 20:36
thanks @pault the last comment `pyspark.sql.functions.col: np.array(df1.select(col("alleleFrequencies").getItem(0)).collect())` gets the job done — con, Nov 08 '19 at 20:37
That's exactly what's contained in the duplicate I linked. I bet `selectExpr` would probably work here too: `np.array(df1.selectExpr("alleleFrequencies[0]").collect())` — pault, Nov 08 '19 at 20:37

score 0 · Accepted Answer · edited Nov 09 '19 at 05:49


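import numpy as np
from pyspark.sql.functions import col
# getItem(0) picks the first element of each array; collect() returns one Row per record
allele_freq1 = np.array(df1.select(col("alleleFrequencies").getItem(0)).collect())
print(allele_freq1)
print(type(allele_freq1))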
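
Since collect() returns one Row per record, the array above comes back with shape (n, 1) rather than as a flat 1D array. A minimal sketch of one way to flatten it follows. The helper names rows_to_1d and first_elements_to_numpy are made up for illustration, and mapping None (for example from a null array) to NaN is an assumption about what you want, not something Spark does for you:

```python
import numpy as np


def rows_to_1d(rows):
    """Flatten single-field rows (Row objects or plain tuples) into a 1D float array.

    None values become NaN. This is an assumed convention for illustration.
    """
    return np.array([np.nan if r[0] is None else r[0] for r in rows], dtype=float)


def first_elements_to_numpy(df, column="alleleFrequencies"):
    """Collect the first element of an array column into a 1D NumPy array.

    `df` is assumed to be a pyspark.sql.DataFrame with an array<double> column.
    """
    # Imported here so rows_to_1d stays usable without a Spark installation.
    from pyspark.sql.functions import col

    rows = df.select(col(column).getItem(0).alias("first")).collect()
    return rows_to_1d(rows)


# Stand-in for collected rows: a pyspark Row is a tuple subclass, so plain
# tuples behave the same way for indexing purposes.
sample_rows = [(0.5,), (0.5,), (1.0,), (None,)]
flat = rows_to_1d(sample_rows)
print(flat.shape)
```

With a real dataframe you would call first_elements_to_numpy(df1). Keep in mind that collect() pulls every row to the driver, so this only suits results that fit in driver memory.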

answered Nov 08 '19 at 20:40 by con · edited by showdev