
I need a little help. I want to see each of the elements of the RDD (rddseparar). The idea is to count the words in a text after removing the special characters, and this is one of the steps to get there.

import re

fileName = "/databricks-datasets/cs100/lab1/data-001/shakespeare.txt"


rdd = sc.textFile(fileName)
# collect()[0] brings only the first line of the file back to the driver
separar = re.split(r"[^A-Za-z\s\d]", rdd.collect()[0])
# flatten the split fragments into individual words
separarPalabras = [word for frase in separar for word in frase.split()]
rddseparar = sc.parallelize(separarPalabras)

print(rddseparar.collect())

When I run the code, I should be able to see each of the elements of rddseparar, but I don't. The Spark execution output is just:

    (2) Spark Jobs
    ['1609']

Why can't I see the elements of rddseparar?

2 Answers


The output is correct, but it only returns one row: ['1609']. This is because you only pass in one row: rdd.collect()[0]. If you want to apply your regex to every row, you could loop through your collect output, or take a more Spark-native route using PySpark functions/UDFs.
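For example, a minimal sketch of the Spark-native route (reusing the fileName and the regex from the question; flatMap is just one way to do it) could look like this:

      import re

      # apply the regex to every line of the file, not just the first one
      rddseparar = (sc.textFile(fileName)
                    .flatMap(lambda line: [word
                                           for frase in re.split(r"[^A-Za-z\s\d]", line)
                                           for word in frase.split()]))
      print(rddseparar.take(20))  # take() avoids pulling the whole file back like collect()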

user2704177

You're not using Spark functionality to calculate the word count. You're just getting the nth row value from the RDD and passing it as an argument to another function.

So you're using the RDD as a plain data structure (an array or list, etc.).

Instead of doing it that way, you can use Spark transformations and actions to calculate the word count directly:

      val results = sc.textFile("/databricks-datasets/cs100/lab1/data-001/shakespeare.txt")
        .flatMap(line => line.split(";"))  // split each line into tokens
        .map(word => (word, 1))            // pair each word with a count of 1
        .reduceByKey(_ + _)                // sum the counts per word
        .collect()

I've used ";" as an example, but you can extend it here to add the list of characters you want to split on.
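Since the question is in PySpark, a rough Python equivalent of the snippet above (a sketch that substitutes the question's regex for the ";" placeholder) might be:

      import re

      results = (sc.textFile("/databricks-datasets/cs100/lab1/data-001/shakespeare.txt")
                 .flatMap(lambda line: re.split(r"[^A-Za-z\s\d]", line))  # drop special characters
                 .flatMap(lambda frase: frase.split())                    # break fragments into words
                 .map(lambda word: (word, 1))                             # pair each word with a count of 1
                 .reduceByKey(lambda a, b: a + b)                         # sum the counts per word
                 .collect())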

shalnarkftw