I have a Spark data frame (via the sparklyr package in R) containing word counts from 3 data sources (news, blogs, and Twitter). I'm trying to use collect() to copy the data from Spark into R's memory.
After counting the words with several of sparklyr's text-mining functions, I have the output below (a simplified sketch of the pipeline follows it):
> word_count
# Source:     spark<?> [?? x 3]
# Groups:     dataset
# Ordered by: desc(n)
   dataset word        n
   <chr>   <chr>   <int>
 1 news    said   250414
 2 blogs   one    127526
 3 twitter like   122384
 4 twitter get    112510
 5 twitter love   106122
 6 twitter good   100844
 7 blogs   like   100105
 8 twitter day     91559
 9 blogs   time    90609
10 twitter thanks  89513
# ... with more rows
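Roughly, the pipeline I use to build word_count looks like the sketch below. The file paths, the tokenization step, and the omitted stop-word handling are placeholders rather than my exact code:

library(sparklyr)
library(dplyr)

sc <- spark_connect(master = "local")

# Read the three raw text files into Spark (paths are placeholders)
news    <- spark_read_text(sc, name = "news",    path = "en_US.news.txt")
blogs   <- spark_read_text(sc, name = "blogs",   path = "en_US.blogs.txt")
twitter <- spark_read_text(sc, name = "twitter", path = "en_US.twitter.txt")

# Tag each source, stack them, tokenize the lines into words,
# and count words per dataset (stop-word removal omitted here)
word_count <- news %>% mutate(dataset = "news") %>%
  union_all(blogs %>% mutate(dataset = "blogs")) %>%
  union_all(twitter %>% mutate(dataset = "twitter")) %>%
  ft_tokenizer(input_col = "line", output_col = "word_list") %>%
  mutate(word = explode(word_list)) %>%
  group_by(dataset, word) %>%
  summarise(n = n()) %>%
  arrange(desc(n))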
Now, if I try to use collect(), I get the following error:
> full_word_count <- collect(word_count)
Error in RecordBatch__to_dataframe(x, use_threads = option_use_threads()) :
embedded nul in string: '\0\0ul437'
After researching a little (see Beginner trying to read a CSV with R: Embedded nul in string), it seems that:
The error message states that you have embedded a nul char...: \0 denotes ASCII byte == 0, which is forbidden in an R string (internally, it denotes the end of a string).
A similar question has been asked before (Sparklyr "embedded nul in string" when collecting), but it received no clear answer.
How can I get rid of this "nul" in the string? Can dplyr help me with this? Is there any function in sparklyr to tackle this issue?
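One idea I had was to strip the nul bytes on the Spark side before collecting, using Spark SQL's regexp_replace() through a dplyr mutate. This is an untested sketch, and I'm not sure "\\x00" is the right way to express the nul byte once dbplyr translates it to Spark SQL:

library(dplyr)

# Untested sketch: try to remove nul characters from `word` inside Spark
# before collecting. regexp_replace() is not evaluated in R here; dbplyr
# passes it through to Spark SQL. The escaping of the nul byte ("\\x00")
# is a guess on my part.
full_word_count <- word_count %>%
  mutate(word = regexp_replace(word, "\\x00", "")) %>%
  collect()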
Ultimately, I want to collect this Spark data frame into R's memory so that I can export it as a CSV or XLSX file for further analysis.
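As a fallback, I suppose I could skip collect() altogether and write the result to CSV directly from Spark with spark_write_csv(), along the lines of the sketch below (the output path is a placeholder, and Spark writes a directory of part files rather than a single CSV). But I would still prefer to understand and fix the collect() error itself:

library(sparklyr)
library(dplyr)

# Untested sketch: write the aggregated result straight from Spark to disk,
# bypassing collect(). "word_count_csv" is a placeholder output directory;
# sdf_coalesce(1) is only there so Spark produces a single part file.
word_count %>%
  ungroup() %>%
  sdf_coalesce(partitions = 1) %>%
  spark_write_csv(path = "word_count_csv", header = TRUE)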
Thanks!