
I have a function that generates the graph of internal Wikipedia links. In the code I use the collect function in PySpark, but when I run the same code on GCP it doesn't work. I was told to replace the collect call with a lambda and map expression.


Maor Biton

1 Answer


.map in PySpark works similarly to Python's built-in map, i.e.

list_of_vertices = pages.map(lambda it: it.anchor_text.id)

and

list_of_edges = pages.map(lambda it: Row(src=it.id, dst=it.anchor_text.id))
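
For context, here is a minimal runnable sketch. It assumes pages is an RDD of Rows where anchor_text is itself a Row carrying the linked page's id; that schema is only inferred from the snippets above, so adjust the field access to your actual data:

from pyspark.sql import SparkSession, Row

spark = SparkSession.builder.master("local[*]").appName("wiki-links").getOrCreate()
sc = spark.sparkContext

# Hypothetical sample data matching the assumed schema.
pages = sc.parallelize([
    Row(id=1, anchor_text=Row(id=2, text="Spark")),
    Row(id=2, anchor_text=Row(id=3, text="GCP")),
])

list_of_vertices = pages.map(lambda it: it.anchor_text.id)
list_of_edges = pages.map(lambda it: Row(src=it.id, dst=it.anchor_text.id))

# .collect() here is only to inspect the small local sample;
# on the full dataset, keep the results as RDDs.
print(list_of_vertices.collect())  # [2, 3]
print(list_of_edges.collect())     # [Row(src=1, dst=2), Row(src=2, dst=3)]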

Olav Aga
  • It raises an error, "'PipelinedRDD' object is not iterable" (on pages). – Maor Biton Dec 06 '21 at 13:32
  • That probably stems from trying to iterate over it. Updated the example to not use a for-loop. – Olav Aga Dec 06 '21 at 14:02
  • That works well, but how do I do the same for the edges, where I need to map to a Row? – Maor Biton Dec 06 '21 at 14:06
  • Added an example for the edges as well; not entirely sure it works with the Row objects. – Olav Aga Dec 06 '21 at 14:13
  • I need to return edges and vertices. map doesn't create a list, so I can't parallelize it; when I try to return the edges and the vertices it doesn't work well. – Maor Biton Dec 06 '21 at 14:24
  • The function .parallelize takes a list and creates an RDD, while .collect does the opposite. An idea would be to use neither and skip the conversions between list and RDD entirely (see the sketch after this thread). – Olav Aga Dec 06 '21 at 14:27
  • But I have to return an RDD for each of them: an RDD of edges and an RDD of vertices. – Maor Biton Dec 06 '21 at 14:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/239881/discussion-between-maor-biton-and-olav-aga). – Maor Biton Dec 06 '21 at 14:34
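
To address the last point in the thread: a sketch of how the function could return both RDDs directly, again assuming the hypothetical schema from above, so no collect/parallelize round trip is needed:

from pyspark.sql import Row

def generate_graph(pages):
    # pages: RDD of Rows with fields id and anchor_text (assumed schema).
    # Both results are plain transformations, so they stay distributed
    # RDDs; nothing is pulled back to the driver.
    vertices = pages.map(lambda it: Row(id=it.anchor_text.id)).distinct()
    edges = pages.map(lambda it: Row(src=it.id, dst=it.anchor_text.id))
    return vertices, edges

If anchor_text is actually a list of links per page, replace map with flatMap so that each link yields its own vertex and edge.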