I have a function that generates the graph of internal Wikipedia links. In the code I use PySpark's `collect` function, but when I run the same code on GCP it doesn't work. I was told to replace the `collect` call with a `map` and a lambda expression.
- Do you mean map as the built-in function or pyspark map? – Olav Aga Dec 06 '21 at 13:24
- It doesn't matter, either of them. – Maor Biton Dec 06 '21 at 13:30
1 Answer
.map in PySpark works similarly to the regular map, e.g.

list_of_vertices = pages.map(lambda it: it.anchor_text.id)

and

list_of_edges = pages.map(lambda it: Row(src=it.id, dst=it.anchor_text.id))
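
A minimal end-to-end sketch of the same idea, assuming `pages` is an RDD of Row objects that each carry an `id` and an `anchor_text` field (the field names are taken from the question and are assumptions, as is the toy sample data):

```python
from pyspark import SparkContext
from pyspark.sql import Row

sc = SparkContext.getOrCreate()

# Hypothetical stand-in for the real Wikipedia pages RDD: each Row has the
# page's own id and the id of the page it links to (anchor_text).
pages = sc.parallelize([
    Row(id=1, anchor_text=Row(id=2)),
    Row(id=1, anchor_text=Row(id=3)),
    Row(id=2, anchor_text=Row(id=3)),
])

# Build both RDDs with map + a lambda; no .collect() is involved.
list_of_vertices = pages.map(lambda it: it.anchor_text.id)
list_of_edges = pages.map(lambda it: Row(src=it.id, dst=it.anchor_text.id))

print(list_of_vertices.take(3))  # [2, 3, 3]
print(list_of_edges.take(3))     # [Row(src=1, dst=2), Row(src=1, dst=3), Row(src=2, dst=3)]
```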

Olav Aga
- That probably stems from trying to iterate over it. Updated the example to not use a for-loop. – Olav Aga Dec 06 '21 at 14:02
- That works well, but how do I do that for the edges, where I need to map to a Row? – Maor Biton Dec 06 '21 at 14:06
- Added an example for edges also; not entirely sure if it works with the Row objects. – Olav Aga Dec 06 '21 at 14:13
- I need to return edges and vertices, but map doesn't create a list, so I can't parallelize it. What I mean is, when I try to return the edges and the vertices it doesn't work well. – Maor Biton Dec 06 '21 at 14:24
- The function .parallelize takes a list and creates an RDD, while .collect does the opposite. An idea would be to use neither, and skip the transformations between list and RDD entirely (see the sketch after these comments). – Olav Aga Dec 06 '21 at 14:27
- But I have to return an RDD for each of them: an RDD of edges and an RDD of vertices. – Maor Biton Dec 06 '21 at 14:34
- Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/239881/discussion-between-maor-biton-and-olav-aga). – Maor Biton Dec 06 '21 at 14:34
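
Following up on the comment about skipping the list/RDD conversions: a minimal sketch of a function that returns both RDDs directly, using only transformations and no .collect() or .parallelize(). The `pages` structure and field names are the same assumptions as in the example above.

```python
from pyspark.sql import Row

def generate_graph(pages):
    """Return (vertices_rdd, edges_rdd) built purely with RDD transformations."""
    # Vertices: the ids of the linked pages.
    vertices = pages.map(lambda it: it.anchor_text.id)
    # Edges: source page id -> linked page id, kept as Row objects.
    edges = pages.map(lambda it: Row(src=it.id, dst=it.anchor_text.id))
    return vertices, edges

# Usage: both return values stay distributed RDDs; call .collect() only if a
# local Python list is really needed (e.g. for debugging on a small sample).
# vertices_rdd, edges_rdd = generate_graph(pages)
```

If duplicate vertex ids are unwanted, `.distinct()` could be chained onto the vertices RDD; that stays a transformation, so nothing is pulled back to the driver.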