
Error in unnest_tokens.data.frame(., entity, text, token = tokenize_scispacy_entities, : Expected output of tokenizing function to be a list of length 100

unnest_tokens() works well on a sample of a few observations but fails on the entire dataset.

Reproducible example, using the cord19 package from https://github.com/dgrtwo/cord19:

library(dplyr)
library(cord19)
library(tidyverse)
library(tidytext)
library(spacyr)

Install the en_core_sci_sm model from https://github.com/allenai/scispacy:

spacy_initialize("en_core_sci_sm")

tokenize_scispacy_entities <- function(text) {
  spacy_extract_entity(text) %>%
    group_by(doc_id) %>%   # one group per document that has entities
    nest() %>%
    pull(data) %>%         # list of per-document entity tables
    map("text") %>%        # keep only the entity strings
    map(str_to_lower)
}

paragraph_entities <- cord19_paragraphs %>% 
  select(paper_id, text) %>%
  sample_n(10) %>% 
  unnest_tokens(entity, text, token = tokenize_scispacy_entities)
  • @phiver - updated the question with a reproducible example. – Sagar K Mar 23 '20 at 17:56
  • looks like unnest_tokens expects the tokenizing function to return one element per input sample (see the sketch after these comments). But looking at the documentation of spacyr, it should work the other way around: first call `spacy_parse` and then use `unnest_tokens` from tidytext. – phiver Mar 23 '20 at 18:54
  • I guess spacy_extract_entity() is doing exactly the same thing in the background: parsing and then extracting entities. Anyhow, I still have the issue I mentioned before. – Sagar K Mar 24 '20 at 20:25
  • If I use any other function it works well. I will have a better look at your function to see what it returns. I will use the janeaustenr package as a test set, as I don't have cord19; it shouldn't matter for the test. – phiver Mar 25 '20 at 09:20
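
The comments circle the likely root cause: unnest_tokens() requires a custom tokenizing function to return a list with exactly one element per input document (hence "Expected output of tokenizing function to be a list of length 100"), while spacy_extract_entity() returns one row per entity found, so a document with no entities contributes no rows at all and the nested list comes back shorter than the input. A small sample may contain no such documents; the full dataset almost certainly does. Below is a minimal sketch of a padded tokenizer; it assumes spacyr's default doc_id labels ("text1", "text2", ...) for an unnamed character vector, and the function name is hypothetical:

tokenize_scispacy_entities_padded <- function(text) {
  entities <- spacy_extract_entity(text)
  # Assumption: spacyr labels unnamed inputs "text1", "text2", ...
  doc_ids <- paste0("text", seq_along(text))
  # split() keeps a character(0) entry for every level without rows, so
  # the result always has length(text) elements; the explicit factor
  # levels also preserve input order ("text2" before "text10", which
  # plain group_by(doc_id) would not guarantee)
  split(str_to_lower(entities$text),
        factor(entities$doc_id, levels = doc_ids))
}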

1 Answer


I faced the same problem. I don't know the reason why, but after I filter out empty abstracts and very short abstract strings, everything seems to work just fine.

abstract_entities <- article_data %>%
  filter(nchar(abstract) > 30) %>%
  select(paper_id, title, abstract) %>%
  sample_n(1000) %>%
  unnest_tokens(entity, abstract, token = tokenize_scispacy_entities)
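
This fits the diagnosis in the question's comments: empty and very short abstracts are exactly the documents for which spacy_extract_entity() is likely to find no entities, and a single such document makes the tokenizer's output list shorter than unnest_tokens() expects. If filtering rows away is not acceptable, a padded tokenizer like the tokenize_scispacy_entities_padded sketch above should work without the filter (hypothetical, untested on cord19):

abstract_entities <- article_data %>%
  select(paper_id, title, abstract) %>%
  # documents with no entities now yield zero rows instead of
  # breaking unnest_tokens()'s length check
  unnest_tokens(entity, abstract, token = tokenize_scispacy_entities_padded)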