
I have a function that generates the graph of internal Wikipedia links. In the code I use the collect function in PySpark, but when I run the same code on GCP it doesn't work. I was told to replace the collect call with a lambda and map expression.


Maor Biton

1 Answer


.map in PySpark works similarly to Python's built-in map, i.e.

list_of_vertices = pages.map(lambda it: it.anchor_text.id)

and

list_of_edges = pages.map(lambda it: Row(src=it.id, dst=it.anchor_text.id))
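
For context, here is a minimal runnable sketch. It assumes pages is an RDD of Rows where anchor_text is itself a Row carrying the linked page's id; that schema is only inferred from the snippets above, so adjust the field access to your actual data:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.master("local[*]").appName("wiki-links").getOrCreate()
sc = spark.sparkContext

# Hypothetical sample data matching the assumed schema.
pages = sc.parallelize([
    Row(id=1, anchor_text=Row(id=2, text="Spark")),
    Row(id=2, anchor_text=Row(id=3, text="GCP")),
])

list_of_vertices = pages.map(lambda it: it.anchor_text.id)
list_of_edges = pages.map(lambda it: Row(src=it.id, dst=it.anchor_text.id))

# .collect() here is only to inspect the small local sample;
# on the full dataset, keep the results as RDDs.
print(list_of_vertices.collect())  # [2, 3]
print(list_of_edges.collect())     # [Row(src=1, dst=2), Row(src=2, dst=3)]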

Olav Aga
  • It raises an error, "'PipelinedRDD' object is not iterable" (on pages). – Maor Biton Dec 06 '21 at 13:32
  • That probably stems from trying to iterate over it. Updated the example to not use a for-loop. – Olav Aga Dec 06 '21 at 14:02
  • That works well, but how do I do the same for the edges, where I need to map to a Row? – Maor Biton Dec 06 '21 at 14:06
  • Added an example for the edges as well; not entirely sure it works with the Row objects. – Olav Aga Dec 06 '21 at 14:13
  • I need to return edges and vertices. map doesn't create a list, so I can't parallelize it; when I try to return the edges and the vertices it doesn't work well. – Maor Biton Dec 06 '21 at 14:24
  • The function .parallelize takes a list and creates an RDD, while .collect does the opposite. An idea would be to use neither and skip the conversions between list and RDD entirely (see the sketch after this thread). – Olav Aga Dec 06 '21 at 14:27
  • But I have to return an RDD for each of them: an RDD of edges and an RDD of vertices. – Maor Biton Dec 06 '21 at 14:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/239881/discussion-between-maor-biton-and-olav-aga). – Maor Biton Dec 06 '21 at 14:34
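
To address the last point in the thread: a sketch of how the function could return both RDDs directly, again assuming the hypothetical schema from above, so no collect/parallelize round trip is needed:

from pyspark.sql import Row

def generate_graph(pages):
    # pages: RDD of Rows with fields id and anchor_text (assumed schema).
    # Both results are plain transformations, so they stay distributed
    # RDDs; nothing is pulled back to the driver.
    vertices = pages.map(lambda it: Row(id=it.anchor_text.id)).distinct()
    edges = pages.map(lambda it: Row(src=it.id, dst=it.anchor_text.id))
    return vertices, edges

If anchor_text is actually a list of links per page, replace map with flatMap so that each link yields its own vertex and edge.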