I have a DataFrame df with this schema:

root
 |-- id: long (nullable = true)
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _VALUE: string (nullable = true)
 |    |    |-- _href: string (nullable = true)
 |    |    |-- type: string (nullable = true)

How can I modify the DataFrame so that column a contains only the _href values, not _VALUE or type?
Is this possible?
I tried something like this, but it doesn't work:

df=df.withColumn('a', 'a._href')

For example, this is my data:

+---+---------------------------------------------------------------------+
| id|                                  a                                  |
+---+---------------------------------------------------------------------+
| 17|[[Gwendolyn Tucke,http://facebook.com],[i have , http://youtube.com]]|
| 23|[[letter, http://google.com],[hihow are you , http://google.co.il]]  |
+---+---------------------------------------------------------------------+

but I want it to look like this:

+---+---------------------------------------------+
| id|                                  a          |
+---+---------------------------------------------+
| 17|[[http://facebook.com],[ http://youtube.com]]|
| 23|[[http://google.com],[http://google.co.il]]  |
+---+---------------------------------------------+

P.S. I don't want to use pandas at all.

pault
reeena11
  • What should be the schema of the modified dataframe? (please [edit] your question to answer that) – Yaron Dec 30 '18 at 15:06
  • `a` is an array of structs. You could try `col("a").getItem(0).getItem("_href")` but that doesn't generalize well. Please provide an [mcve]. – pault Dec 30 '18 at 15:09
  • Hi, I have edited it, sorry for the lack of clarity. @pault I'm trying to modify the values in the DataFrame, not just get the `_href` – reeena11 Dec 30 '18 at 16:25
  • Possible duplicate of [TypeError: Column is not iterable - How to iterate over ArrayType()?](https://stackoverflow.com/questions/48993439/typeerror-column-is-not-iterable-how-to-iterate-over-arraytype) – pault Dec 30 '18 at 17:28
  • @pault Hi, unfortunately no. I need to modify the column so that it keeps a specific attribute, not iterate over an ArrayType – reeena11 Dec 30 '18 at 17:44
  • You need to iterate over your array and extract the attribute (perhaps using the code I showed in my previous comment). – pault Dec 30 '18 at 18:51
  • @pault Again, I've explained that this is not what I want; I explained it clearly in the question. Thank you, but please don't mark it as answered, because it doesn't help! – reeena11 Dec 30 '18 at 19:50

2 Answers

You can try the code below:

from pyspark.sql.functions import explode
# explode() names its output column "col" by default, so alias it back to "a"
df.select("id", explode("a").alias("a")).select("id", "a._href", "a.type").show()

The code above returns a DataFrame with three top-level columns (id, _href, type), one row per array element, which you can use for further analysis.

I hope it helps.

Regards,

Neeraj

Neeraj Bhadani

You could just select a._href and assign it to a new column. Try this Scala solution.

scala> case class sub(_value:String,_href:String)
defined class sub

scala> val df = Seq((17,Array(sub("Gwendolyn Tucke","http://facebook.com"),sub("i have"," http://youtube.com"))),(23,Array(sub("letter","http://google.com"),sub("hihow are you","http://google.co.il")))).toDF("id","a")
df: org.apache.spark.sql.DataFrame = [id: int, a: array<struct<_value:string,_href:string>>]

scala> df.show(false)
+---+-----------------------------------------------------------------------+
|id |a                                                                      |
+---+-----------------------------------------------------------------------+
|17 |[[Gwendolyn Tucke, http://facebook.com], [i have,  http://youtube.com]]|
|23 |[[letter, http://google.com], [hihow are you, http://google.co.il]]    |
+---+-----------------------------------------------------------------------+


scala> df.select("id","a._href").show(false)
+---+------------------------------------------+
|id |_href                                     |
+---+------------------------------------------+
|17 |[http://facebook.com,  http://youtube.com]|
|23 |[http://google.com, http://google.co.il]  |
+---+------------------------------------------+

You can also assign it to a new column:

scala> val df2 = df.withColumn("result",$"a._href")
df2: org.apache.spark.sql.DataFrame = [id: int, a: array<struct<_value:string,_href:string>> ... 1 more field]

scala> df2.show(false)
+---+-----------------------------------------------------------------------+------------------------------------------+
|id |a                                                                      |result                                    |
+---+-----------------------------------------------------------------------+------------------------------------------+
|17 |[[Gwendolyn Tucke, http://facebook.com], [i have,  http://youtube.com]]|[http://facebook.com,  http://youtube.com]|
|23 |[[letter, http://google.com], [hihow are you, http://google.co.il]]    |[http://google.com, http://google.co.il]  |
+---+-----------------------------------------------------------------------+------------------------------------------+


scala> df2.printSchema
root
 |-- id: integer (nullable = false)
 |-- a: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _value: string (nullable = true)
 |    |    |-- _href: string (nullable = true)
 |-- result: array (nullable = true)
 |    |-- element: string (containsNull = true)

stack0114106