-1

I have generated pyspark.sql.dataframe.DataFrame with columns names cast and score.

However, I want to keep the only names in cast column, not the ids associated with them, alongside _score column.

e.g Liam Neeson, 'Dan Stevens, Marina Squerciati, Scott Frank

The structure of datatype in column cast looks like this

df_test.schema['cast'].dataType

StructType(List(StructField(Acteur$divActrice,ArrayType(StructType(List(StructField(id,StringType,true),StructField(name,StringType,true))),true),true),StructField(Regisseur,ArrayType(StructType(List(StructField(id,StringType,true),StructField(name,StringType,true))),true),true),StructField(writer,ArrayType(StructType(List(StructField(id,StringType,true),StructField(name,StringType,true))),true),true)))
    elasticsearch_performance_evaluation.drop('id', '_type', 'name').show(20, False)
+------------------------------------------------------------------------------------------------------------+---------+
|cast                                                                                                        |_score   |
+------------------------------------------------------------------------------------------------------------+---------+
|[[[3713, Liam Neeson], [24315, Dan Stevens], [37138, Marina Squerciati]], [[37136, Scott Frank]],]          |90.68749 |
|[[[365, Susan Sarandon], [1939, Tom Hanks], [13425, Halle Berry]], [[6813, Tom Tykwer]],]                   |42.601303|
|[[[1939, Tom Hanks], [3339, Julia Roberts], [8479, Sarah Mahoney]], [[1939, Tom Hanks]],]                   |42.601196|
|[[[1939, Tom Hanks], [42930, Sarita Choudhury], [44288, Alexander Black]], [[6813, Tom Tykwer]],]           |42.601196|
|[[[13379, Emilio Estevez], [13381, Martin Sheen], [13383, Deborah Kara Unger]], [[13379, Emilio Estevez]],] |22.130056|
|[[[15871, Frank Lammers]], [[30157, Mark Mertens]],]                                                        |20.221449|
|[[[44061, Erin Darke], [44548, Hansel Tan], [233, Rachel Weisz]], [[10329, Joshua Marston]],]               |19.26571 |
|[[[1535, Pim Wessels], [2327, Joosje Duk], [2329, Ydwer Bosma]], [[1537, Steven de Jong]],]                 |18.798985|
|[[[6741, Chantal Janzen], [16357, Benja Bruijning], [15865, Fedja van Huêt]], [[2769, Kees van Nieuwkerk]],]|18.79888 |
|[[[6105, Stephen Dorff], [9611, Tom Berenger], [12397, Chyler Leigh]], [[12395, Gabe Torres]],]             |17.88121 |
|[[[2309, Gene Hackman], [8955, Tom Cruise], [12971, Ed Harris]], [[30047, Sydney Pollack]],]                |17.88121 |
|[[[1105, Marisa Tomei], [12763, Hugh Grant], [43842, J.K Simmons]], [[5637, Marc Lawrence]],]               |17.88121 |
|[[[8955, Tom Cruise], [10405, Kelly McGillis], [2023, Val Kilmer]], [[2311, Tony Scott]],]                  |17.88121 |
|[[[2105, Tom Hughes], [43507, Emma Greenwell], [43508, Alex Macqueen]], [[43506, Vivienne de Courcy]],]     |17.881105|
|[[[1201, Carice van Houten], [2271, Bill Nighy], [8955, Tom Cruise]], [[18203, Bryan Singer]],]             |17.881105|
|[[[9159, Anne Hathaway], [9161, Jim Sturgess], [9163, Tom Mison]], [[9157, Lone Scherfig]],]                |17.881105|
|[[[921, Ewan McGregor], [4603, Naomi Watts], [16997, Tom Holland]], [[16995, Juan Antonio Bayona]],]        |17.881105|
|[[[44161, Edie Falco], [44162, Miles Gaston Villanueva], [44163, Gus Halper]],,]                            |17.65994 |
|[[[2327, Joosje Duk], [2329, Ydwer Bosma], [3461, Leo de Jong]], [[1537, Steven de Jong]],]                 |17.65994 |
|[[[10841, Bas Muijs], [25285, Vivian van Huiden], [28091, Tom van Kalmthout]], [[1537, Steven de Jong]],]   |16.626995|
+------------------------------------------------------------------------------------------------------------+---------+

could somebody please help how I can extract data with cast name and score column only.

Thanks in advance

user3459293
  • 320
  • 1
  • 3
  • 11

2 Answers2

1

One possible solution would be the transformation to a rdd, followed by a map function that extracts the actor name and a final transformation back to a dataframe.

df1 = spark.createDataFrame([([[["3713", "Liam Neeson"], ["24315", "Dan Stevens"],\
 ["37138", "Marina Squerciati"]], [["37136", "Scott Frank"]],], 90.687),\
([[["365", "Susan Sarandon"], ["1939", "Tom Hanks"], ["13425", "Halle Berry"]],\
 [["6813", "Tom Tykwer"]],], 42.601303)],["cast","score"])
#creation of the dataframe, e.g. first 2 rows of your example

def extract_actor(row):                    #extract function
    actors, score = row
    new_actors = list()
    for l in actors:
        res = list()
        for actor in l:
            res.append(actor[1])          #walk through all the lists and collect the names
        new_actors.append(res)
    return (new_actors,score)

df1.rdd.map(extract_actor).toDF(["cast","score"]).show(truncate=False)
#transform dataframe to rdd apply the map and transform result back to dataframe

Output:

+--------------------------------------------------------------------------------------+---------+
|cast                                                                                  |score    |
+--------------------------------------------------------------------------------------+---------+
|[WrappedArray(Liam Neeson, Dan Stevens, Marina Squerciati), WrappedArray(Scott Frank)]|90.687   |
|[WrappedArray(Susan Sarandon, Tom Hanks, Halle Berry), WrappedArray(Tom Tykwer)]      |42.601303|
+--------------------------------------------------------------------------------------+---------+
gaw
  • 1,960
  • 2
  • 14
  • 18
  • this is exactly what I wanted then the second question I have 1000 of rows to go over and get names from each row like you did. how can I accomplish that? cheers – user3459293 Feb 14 '19 at 12:57
  • the number of rows should not have an influence on the solution. Just load your data into a dataframe and apply the solution to it. It should extract the names for each row – gaw Feb 14 '19 at 13:34
  • Sorry, I meant this. I get an error as I have the structure of data more complex added to the top of your function? How I can adapt the function to this kind of data frames. – user3459293 Feb 14 '19 at 13:42
  • can you tell the schema or the structure of your dataframe, which you want to use for this? The function always takes a row of the dataframe and works on the values there, if the row contains additional fields, you need to adapt the function and get the values similar to the score variable – gaw Feb 14 '19 at 13:44
  • @ gaw| | | |-- id: string (nullable = true) | | | |-- name: string (nullable = true) |-- _score: double (nullable = true) – user3459293 Feb 14 '19 at 13:45
  • So you have a list of actors, a regisseur and a write, but it still should work since the name is always the second attribute (`actor[1]`) and it will be looped through all of them. what was the error? – gaw Feb 14 '19 at 13:50
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/188424/discussion-between-gaw-and-user3459293). – gaw Feb 14 '19 at 13:50
  • @user3459293 I've rejected [your edit](https://stackoverflow.com/review/suggested-edits/22207337) - it looks like you wanted to edit your question then to ask for clarification. – Wai Ha Lee Feb 14 '19 at 20:14
  • @WaiHaLee well I just wanted to give him the structure of my data. I did not know you can not do this way. but I don't know why the post itself scored -1 – user3459293 Feb 15 '19 at 08:59
  • @user3459293 - I can't vouch for the downvovter on your question, but I'd imagine it's because you appear not to have shown any efforts to figure it out yourself first. (unless I'm not reading your question properly - I'm not familiar with pyspark) – Wai Ha Lee Feb 15 '19 at 09:05
-1

you can use explode function, for more information see these answers

Gal Shaboodi
  • 744
  • 1
  • 7
  • 25
  • then you would have to explode several times, because its a list of a list of lists – gaw Feb 14 '19 at 12:53