
Error in unnest_tokens.data.frame(., entity, text, token = tokenize_scispacy_entities, : Expected output of tokenizing function to be a list of length 100

unnest_tokens() works well on a sample of a few observations but fails on the entire dataset.

Reproducible example, using the cord19 package from https://github.com/dgrtwo/cord19:

library(dplyr)
library(cord19)
library(tidyverse)
library(tidytext)
library(spacyr)

Install the en_core_sci_sm model from https://github.com/allenai/scispacy:

spacy_initialize("en_core_sci_sm")

tokenize_scispacy_entities <- function(text) {
  spacy_extract_entity(text) %>%
    group_by(doc_id) %>%   # one group per document that has entities
    nest() %>%
    pull(data) %>%         # list of per-document entity tables
    map("text") %>%        # keep only the entity strings
    map(str_to_lower)
}

paragraph_entities <- cord19_paragraphs %>% 
  select(paper_id, text) %>%
  sample_n(10) %>% 
  unnest_tokens(entity, text, token = tokenize_scispacy_entities)
  • @phiver - updated the question with a reproducible example. – Sagar K Mar 23 '20 at 17:56
  • looks like unnest_tokens expects the tokenizing function to return one element per input sample (see the sketch after these comments). But looking at the documentation of spacyr, it should work the other way around: first call `spacy_parse` and then use `unnest_tokens` from tidytext. – phiver Mar 23 '20 at 18:54
  • I guess spacy_extract_entity() is doing exactly the same thing in the background: parsing and then extracting entities. Anyhow, I still have the issue I mentioned before. – Sagar K Mar 24 '20 at 20:25
  • If I use any other function it works well. I will have a better look at your function to see what it returns. I will use the janeaustenr package as a test set, as I don't have cord19; it shouldn't matter for the test. – phiver Mar 25 '20 at 09:20
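
The comments circle the likely root cause: unnest_tokens() requires a custom tokenizing function to return a list with exactly one element per input document (hence "Expected output of tokenizing function to be a list of length 100"), while spacy_extract_entity() returns one row per entity found, so a document with no entities contributes no rows at all and the nested list comes back shorter than the input. A small sample may contain no such documents; the full dataset almost certainly does. Below is a minimal sketch of a padded tokenizer; it assumes spacyr's default doc_id labels ("text1", "text2", ...) for an unnamed character vector, and the function name is hypothetical:

tokenize_scispacy_entities_padded <- function(text) {
  entities <- spacy_extract_entity(text)
  # Assumption: spacyr labels unnamed inputs "text1", "text2", ...
  doc_ids <- paste0("text", seq_along(text))
  # split() keeps a character(0) entry for every level without rows, so
  # the result always has length(text) elements; the explicit factor
  # levels also preserve input order ("text2" before "text10", which
  # plain group_by(doc_id) would not guarantee)
  split(str_to_lower(entities$text),
        factor(entities$doc_id, levels = doc_ids))
}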

1 Answer


I faced the same problem. I don't know the reason why, but after I filter out empty abstracts and very short abstract strings, everything seems to work just fine.

abstract_entities <- article_data %>%
  filter(nchar(abstract) > 30) %>%
  select(paper_id, title, abstract) %>%
  sample_n(1000) %>%
  unnest_tokens(entity, abstract, token = tokenize_scispacy_entities)
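
This fits the diagnosis in the question's comments: empty and very short abstracts are exactly the documents for which spacy_extract_entity() is likely to find no entities, and a single such document makes the tokenizer's output list shorter than unnest_tokens() expects. If filtering rows away is not acceptable, a padded tokenizer like the tokenize_scispacy_entities_padded sketch above should work without the filter (hypothetical, untested on cord19):

abstract_entities <- article_data %>%
  select(paper_id, title, abstract) %>%
  # documents with no entities now yield zero rows instead of
  # breaking unnest_tokens()'s length check
  unnest_tokens(entity, abstract, token = tokenize_scispacy_entities_padded)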