Extract columns from an arrayish type of a vector column in databricks with pyspark

Question

Does anyone have an idea how to retrieve the first value ("0") from the probability column, which indicates the probability of that prediction being correct?

After running dataframe.schema (or dataframe.printSchema()) I got the following result for the probability column:

StructField('probability', VectorUDT(), True)

Below I am attaching part of the image of the dataframe.

I tried to expand the column probability with col("probability.*") but it gave me an error:

Can only star expand struct data types. Attribute: `ArrayBuffer(probability)`.

I also tried to access a field directly, for example by calling "probability.vectorType", but I got the following error:

[INVALID_EXTRACT_BASE_FIELD_TYPE] Cannot extract a value from "probability". Need a complex type [STRUCT, ARRAY, MAP] but got "STRUCT<type: TINYINT, size: INT, indices: ARRAY<INT>, values: ARRAY<DOUBLE>>".

Does this answer your question? [How to access element of a VectorUDT column in a Spark DataFrame?](https://stackoverflow.com/questions/39555864/how-to-access-element-of-a-vectorudt-column-in-a-spark-dataframe) — Ronak Jain, Mar 20 '23 at 06:23
@Ronak Jain, thanks for your guidance. The answer marked as the "best one" did not help me much, but the answer from @Nidhi / n1tk solved the problem very cleanly: `prob_df1=lr_pred.withColumn("probability",lr_pred["probability"].cast("String"))` followed by `prob_df =prob_df1.withColumn('probabilityre',split(regexp_replace("probability", "^\[|\]", ""), ",")[1].cast(DoubleType()))` — Susy84, Mar 21 '23 at 10:21
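
To make the string-based approach from the comment above easier to follow, here is a rough plain-Python trace of the same steps: strip the square brackets, split on commas, and pick one element. It assumes the vector is dense and that casting it to a string yields text of the form `[0.9,0.1]`; the helper name `element_from_vector_string` and the sample value are hypothetical. Note that index `1` (as used in the comment) selects the second entry, while index `0` selects the first.

```python
import re


def element_from_vector_string(text, index):
    """Mimic the cast-to-string approach on a single value."""
    # Same pattern as in the comment: drop a leading "[" and any "]".
    stripped = re.sub(r"^\[|\]", "", text)
    # Split on commas and convert the chosen piece to a float.
    return float(stripped.split(",")[index])


sample = "[0.9,0.1]"  # assumed string form of a dense probability vector

first_value = element_from_vector_string(sample, 0)
second_value = element_from_vector_string(sample, 1)
print(first_value, second_value)
```

A sparse vector would likely have a different string form, so this parsing should be treated as specific to dense vectors.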
