Good afternoon,

I am trying to sort a large corpus of normative texts of different lengths and to tag the parts of speech (POS). For that purpose, and given the size of the database, I have been using the tm and udpipe libraries.

The other task I need to perform is to identify the named entities. I tried the spacyr library, but it does not correctly identify the names of organizations, so I want to train a custom NER model based on a few documents from the corpus, which I have personally validated.

How can I use spacy_extract_entity() with custom data? Or could this be done with quanteda and spacyr?

Thanks in advance.

I have done the POS task this way, using a couple of functions I wrote:

suppressMessages(suppressWarnings(library(pdftools)))
suppressMessages(suppressWarnings(library(tidyverse)))
suppressMessages(suppressWarnings(library(tm)))

# load the corpus

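# "working_path" is a placeholder for the folder that holds the PDFs;
# pattern is a regular expression, so the dot is escaped and the match anchored
tm_corpus <- VCorpus(
  DirSource("working_path", pattern = "\\.pdf$"),
  readerControl = list(reader = readPDF, language = "es-419")
)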

# load udpipe

library(udpipe)
dl <- udpipe_download_model(language = "spanish", overwrite = FALSE)
str(dl)
udmodel_spanish <- udpipe_load_model(file = dl$file_model)

# functions to annotate the corpus

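# annotate document n of the corpus and return the result as a data frame
f_udpipe_anot <- function(n){
  txt <- as.character(tm_corpus[[n]]) %>% # text of document n as a character vector
    unlist()
  y <- udpipe_annotate(udmodel_spanish, x = txt, trace = TRUE)
  as.data.frame(y)
}

# annotate documents desde..hasta and stack the results in one data frame
pinkillazo <- function(desde, hasta){
  resultado <- data.frame()
  for (item in desde:hasta){
    print(item) # progress indicator
    resultado <- rbind(resultado, f_udpipe_anot(item))
  }
  return(resultado)
}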

leyes_udpipe_POS <- pinkillazo(1,13) # here I got the annotated corpus as a dataframe

To identify the named entities, I have tried this:

library(quanteda)
library(spacyr)

spacy_initialize(model = "es_core_news_sm")
quan_corpus <- corpus(tm_corpus)
POS_df_spacyr <- spacy_parse(quan_corpus, lemma = FALSE, entity = TRUE, tag = FALSE, pos = TRUE)

# output and type are set to the defaults; the trailing comma is removed
organiz <- spacy_extract_entity(
  quan_corpus,
  output = "data.frame",
  type = "all",
  multithread = TRUE
)

I am getting wrong organization names as well as other misclassifications. I thought that multithread would make this task easier, but that is not the case.

1 Answer

If you want to train your own named entity recognition model in R, you could use the R packages crfsuite and nametagger, which implement Conditional Random Fields and Maximum Entropy Models respectively. Both can be used alongside the udpipe annotation.
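As a rough illustration of the crfsuite route, here is a minimal sketch. It assumes you have hand-labelled tokens in IOB format (B-ORG, I-ORG, O). The toy sentences, the label column, and the helper add_context_features() are all made up for the example; in practice doc_id, token and upos would come from as.data.frame(udpipe_annotate(...)), with the label column added from your validated documents. The features used (token and POS of the current, previous and next word) are just one reasonable starting point, not a recommended recipe.

```r
library(crfsuite)

# Hypothetical hand-labelled training data in IOB format.
# In a real workflow doc_id/token/upos come from udpipe; label is added manually.
train <- data.frame(
  doc_id = rep(c("d1", "d2", "d3"), times = c(6, 5, 5)),
  token  = c("El", "Banco", "Central", "emitió", "la", "norma",
             "La", "Contraloría", "General", "aprobó", "informes",
             "El", "Ministerio", "publicó", "el", "decreto"),
  upos   = c("DET", "PROPN", "PROPN", "VERB", "DET", "NOUN",
             "DET", "PROPN", "PROPN", "VERB", "NOUN",
             "DET", "PROPN", "VERB", "DET", "NOUN"),
  label  = c("O", "B-ORG", "I-ORG", "O", "O", "O",
             "O", "B-ORG", "I-ORG", "O", "O",
             "O", "B-ORG", "O", "O", "O"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: add previous/next token and POS within each document.
# Each value is prefixed with its feature name so that, for example, a PROPN
# in the previous position is a different attribute from a PROPN in the current one.
add_context_features <- function(d) {
  lag1  <- function(v) c("<BOS>", head(v, -1))  # value of the previous token
  lead1 <- function(v) c(tail(v, -1), "<EOS>")  # value of the next token
  for (col in c("token", "upos")) {
    d[[paste0(col, "_prev")]] <- paste0(col, "[-1]=", ave(d[[col]], d$doc_id, FUN = lag1))
    d[[paste0(col, "_next")]] <- paste0(col, "[+1]=", ave(d[[col]], d$doc_id, FUN = lead1))
    d[[paste0(col, "_cur")]]  <- paste0(col, "[0]=", d[[col]])
  }
  d
}

feature_cols <- c("token_cur", "token_prev", "token_next",
                  "upos_cur", "upos_prev", "upos_next")

train <- add_context_features(train)

# Fit the CRF; the model file is written to a temporary location
model <- crf(y = train$label,
             x = train[, feature_cols],
             group = train$doc_id,
             method = "lbfgs",
             file = tempfile(fileext = ".crfsuite"),
             options = list(max_iterations = 25))

# Tag a new, unlabelled document (again, toy data)
new_doc <- data.frame(
  doc_id = "n1",
  token  = c("El", "Tribunal", "Supremo", "emitió", "la", "norma"),
  upos   = c("DET", "PROPN", "PROPN", "VERB", "DET", "NOUN"),
  stringsAsFactors = FALSE
)
new_doc <- add_context_features(new_doc)

pred <- predict(model, newdata = new_doc[, feature_cols], group = new_doc$doc_id)
new_doc$entity <- pred$label
print(new_doc[, c("token", "upos", "entity")])
```

With only a handful of sentences the predictions will not be reliable; the point is the shape of the data and of the calls. A real model would need many more labelled documents and a held-out set to check the results.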

If you want deep learning models, you might have to look into torch alongside tokenisers like sentencepiece and embedding techniques like word2vec to implement your own modelling flow (e.g. BiLSTM).

  • Thanks for the suggestion. I've tried nametagger before, but it only works with English and Czech, and I am working with text in Spanish. However, crfsuite seems to work with udpipe in the same languages that udpipe supports. I will try this: x <- ner_download_modeldata("conll2002-es") – Sergio A. Gottret Rios Jan 31 '23 at 13:22
  • The listed packages allow training your own model on your own data or provide building blocks for the model building which can be in any language. They don't provide an extensive suite of pretrained models. –  Feb 01 '23 at 06:38