
I have a Spark data frame (via the "sparklyr" package in R) with counts of words from three data sources (news, blogs, and Twitter). I'm trying to use collect() to copy the data from Spark into R's memory.

After counting the words with several sparklyr text-mining functions, I have the following:

> word_count

# Source:     spark<?> [?? x 3]
# Groups:     dataset
# Ordered by: desc(n)
   dataset word        n
   <chr>   <chr>   <int>
 1 news    said   250414
 2 blogs   one    127526
 3 twitter like   122384
 4 twitter get    112510
 5 twitter love   106122
 6 twitter good   100844
 7 blogs   like   100105
 8 twitter day     91559
 9 blogs   time    90609
10 twitter thanks  89513
# ... with more rows

Now, if I try to use collect(), I get the following error:

> full_word_count <- collect(word_count)

Error in RecordBatch__to_dataframe(x, use_threads = option_use_threads()) : 
  embedded nul in string: '\0\0ul437'

After a bit of research (see Beginner trying to read a CSV with R: Embedded nul in string), it seems that:

The error message states that you have embedded a nul char...: \0 denotes ASCII byte == 0, which is forbidden in an R string (internally, it denotes the end of a string).
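
Indeed, R refuses a nul even in a string literal; a quick console check:

> "a\0b"
Error: nul character not allowed (line 1)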

A similar question has been asked before (Sparklyr "embedded nul in string" when collecting), but it received no clear answer.

How can I get rid of this "nul" in the string? Can dplyr help me with this? Is there any function in sparklyr to tackle this issue?

My end goal is to collect this Spark data frame into R's memory and export it as a CSV or XLSX file for further analysis.

Thanks!

caproki
  • Could you try `iconv(x, "latin1", "ASCII", "?")` on the column that contains the nul string, to see if that gets rid of the special character? This is what I used when working with an MSSQL database that hit a similar error 3 years ago. – Sinh Nguyen Apr 18 '21 at 05:47
  • Thanks @SinhNguyen. I'm trying to use that function, but it takes a character vector as its input, and so far I haven't found a way to convert a column of the Spark data frame into some sort of character vector... – caproki Apr 18 '21 at 18:46

1 Answer


R is a bit particular about nuls.

You can strip the nuls on the Spark side before collecting; the offending string is most likely in your word column. Note that R won't even let you type "\0" in a string literal, so the nul has to be expressed inside the regular expression instead. regexp_replace() below is Spark SQL's function, which sparklyr passes through inside mutate(); the four backslashes are needed because the pattern goes through two rounds of string parsing (R's and Spark SQL's), leaving the Java regex \x00, which matches the nul byte:

word_count %>%
  mutate(word = regexp_replace(word, "\\\\x00", "")) %>%
  sdf_collect()
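
If you want to see how widespread the problem is first, you can count the affected rows entirely on the Spark side; collecting the offending rows themselves would just re-trigger the error, so count them instead (a sketch reusing the same pattern):

word_count %>%
  filter(word != regexp_replace(word, "\\\\x00", "")) %>%  # rows where stripping the nul changes the string
  tally()                                                  # per-dataset counts, since word_count is grouped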

Alternatively, you can descend into the encoding/decoding hell that comes with text mining :P
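
Since your stated end goal is a CSV file anyway, another option is to skip collect() entirely and let Spark write the file itself with sparklyr's spark_write_csv(). A sketch, with an illustrative output path; you'll still want to strip the nuls first as above, and note that Spark writes a directory of part files rather than a single CSV:

word_count %>%
  mutate(word = regexp_replace(word, "\\\\x00", "")) %>%  # strip nuls before writing
  ungroup() %>%                                           # drop the dplyr grouping metadata
  spark_write_csv(path = "word_count_csv")                # writes a directory of CSV part files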

Janna Maas