How to get only particular attribute from a column in dataframe schema?

Question

I have this schema for the dataframe df:

root
 |-- id: long (nullable = true)
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _href: string (nullable = true)
 |    |    |-- type: string (nullable = true)

How can I modify the dataframe so that column a contains only the _href values, without _VALUE or type?
Is it possible?
I've tried something like this, but it's wrong:

df=df.withColumn('a', 'a._href')

For example, this is my data:

+---+---------------------------------------------------------------------+
| id|                                  a                                  |
+---+---------------------------------------------------------------------+
| 17|[[Gwendolyn Tucke,http://facebook.com],[i have , http://youtube.com]]|
| 23|[[letter, http://google.com],[hihow are you , http://google.co.il]]  |
+---+---------------------------------------------------------------------+

but I want it to look like this:

+---+---------------------------------------------+
| id|                      a                      |
+---+---------------------------------------------+
| 17|[[http://facebook.com],[ http://youtube.com]]|
| 23|[[http://google.com],[http://google.co.il]]  |
+---+---------------------------------------------+

P.S. I don't want to use pandas at all.

What should be the schema of the modified dataframe? (please [edit] your question to answer that) — Yaron, Dec 30 '18 at 15:06
`a` is an array of structs. You could try `col("a").getItem(0).getItem("_href")` but that doesn't generalize well. Please provide an [mcve]. — pault, Dec 30 '18 at 15:09
Hi, I have edited it, sorry for the lack of clarity. @pault I'm trying to modify the values in the dataframe, not just get the `_href`. — reeena11, Dec 30 '18 at 16:25
Possible duplicate of [TypeError: Column is not iterable - How to iterate over ArrayType()?](https://stackoverflow.com/questions/48993439/typeerror-column-is-not-iterable-how-to-iterate-over-arraytype) — pault, Dec 30 '18 at 17:28
@pault Hi, unfortunately no. I need to modify the column so that it keeps a specific attribute, not iterate over an ArrayType. — reeena11, Dec 30 '18 at 17:44
You need to iterate over your array and extract the attribute (perhaps using the code I showed in my previous comment). — pault, Dec 30 '18 at 18:51
@pault Again, I've explained that this is not what I want; I explained it clearly in the question. Thank you, but please don't mark it as answered because it doesn't help! — reeena11, Dec 30 '18 at 19:50

score 1 · Answer 1 · answered Dec 31 '18 at 10:42

You can try the code below:

from pyspark.sql.functions import explode
# explode() names its output column "col" by default, so alias it back to "a"
df.select("id", explode("a").alias("a")).select("id", "a._href", "a.type").show()

The code above returns a DataFrame with three top-level columns (id, _href, type), one row per array element, which you can use for further analysis.

I hope it helps.

Regards,

Neeraj

score 1 · Accepted Answer · answered Dec 31 '18 at 13:38

Selecting a._href on an array of structs returns an array of the _href values, so you can select it directly or assign it to a new column. Try this Scala solution.

scala> case class sub(_value:String,_href:String)
defined class sub

scala> val df = Seq((17,Array(sub("Gwendolyn Tucke","http://facebook.com"),sub("i have"," http://youtube.com"))),(23,Array(sub("letter","http://google.com"),sub("hihow are you","http://google.co.il")))).toDF("id","a")
df: org.apache.spark.sql.DataFrame = [id: int, a: array<struct<_value:string,_href:string>>]

scala> df.show(false)
+---+-----------------------------------------------------------------------+
|id |a                                                                      |
+---+-----------------------------------------------------------------------+
|17 |[[Gwendolyn Tucke, http://facebook.com], [i have,  http://youtube.com]]|
|23 |[[letter, http://google.com], [hihow are you, http://google.co.il]]    |
+---+-----------------------------------------------------------------------+


scala> df.select("id","a._href").show(false)
+---+------------------------------------------+
|id |_href                                     |
+---+------------------------------------------+
|17 |[http://facebook.com,  http://youtube.com]|
|23 |[http://google.com, http://google.co.il]  |
+---+------------------------------------------+

You can also assign it to a new column:

scala> val df2 = df.withColumn("result",$"a._href")
df2: org.apache.spark.sql.DataFrame = [id: int, a: array<struct<_value:string,_href:string>> ... 1 more field]

scala> df2.show(false)
+---+-----------------------------------------------------------------------+------------------------------------------+
|id |a                                                                      |result                                    |
+---+-----------------------------------------------------------------------+------------------------------------------+
|17 |[[Gwendolyn Tucke, http://facebook.com], [i have,  http://youtube.com]]|[http://facebook.com,  http://youtube.com]|
|23 |[[letter, http://google.com], [hihow are you, http://google.co.il]]    |[http://google.com, http://google.co.il]  |
+---+-----------------------------------------------------------------------+------------------------------------------+


scala> df2.printSchema
root
 |-- id: integer (nullable = false)
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _value: string (nullable = true)
 |    |    |-- _href: string (nullable = true)
 |-- result: array (nullable = true)
 |    |-- element: string (containsNull = true)


`df.withColumn("result",$"a._href")` doesn't work for pyspark. — pault, Dec 31 '18 at 14:24
That doesn't produce the correct output. OP posted that they tried that but it was wrong. — pault, Dec 31 '18 at 14:34
The alternatives are on the duplicate that I linked, but OP doesn't think it's what they need. — pault, Dec 31 '18 at 14:45
