
I know that there are many questions out there about partial matches and I've read as many as I've been able to, but I have still not managed to extract what I need using R.

In a nutshell, my problem is that I have a data set with over a million rows of Spanish trigrams and I want to find only those that contain verbs. To make this easier, I added a column with the 500 most common Spanish verbs so that I can try to match them against the trigrams.

I have a data set like this:

data <- data_frame(trigrams= c("no veo que", "no me gusta", "si habla de", "la mesa de", "el caso que"), fequency=c(112, 345, 578), verb=c("hablar", "gustar", "leer"))

The verbs in the third column ("verb") are infinitives, and I would like to partially match them against the verb forms in the first column ("trigrams"). I think the ideal approach would be a for loop that iterates through the 500 verbs and partially matches each one against my more than one million trigrams.

So in this case, "gustar" should partially match "no me gusta", and nothing should match verbless trigrams like "el caso que".

I really do hope this makes sense. I have never worked with this amount of data before, and I am too new to regular expressions to figure this out on my own.

PrisLB

1 Answer


I think this approach using stringr might help you. You might have to make some modifications in order to use it on a data frame. Basically, we convert each verb such as "hablar" into a pattern such as 'hablar*' (in a regular expression, the * lets the final "r" match zero or more times, so the pattern also matches "habla") and then call str_extract() -

library(dplyr)
library(stringr)


trigrams <- c("no veo que", "no me gusta", "si habla de", "la mesa de", "el caso que")
verb <- c("hablar", "gustar", "leer")

# Append "*" to each verb so that the final "r" becomes optional in the pattern,
# then extract the match for every pattern from every trigram.
# lapply() returns one vector per verb; unlist() flattens them into a single vector.
result <- lapply(paste0(verb, "*"), str_extract, string = trigrams) %>%
  unlist()
# drop the NA values (trigrams with no match for a given verb)
result <- result[!is.na(result)]

result
#> [1] "habla" "gusta"

Created on 2018-09-16 by the reprex package (v0.2.0).
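
If the goal is to know which trigrams contain a verb rather than to pull out the matched text, the same idea can be sketched with str_detect(). This is only a minimal sketch under a few assumptions: the verbs contain no regex metacharacters, the variable names `pattern` and `has_verb` are made up for illustration, and `\\b` word boundaries are added so that a verb is not matched inside a longer word. Like the pattern above, it only finds forms that equal the infinitive with or without its final "r" (such as "gusta" or "habla"), not other conjugations.

```r
library(stringr)

trigrams <- c("no veo que", "no me gusta", "si habla de", "la mesa de", "el caso que")
verb <- c("hablar", "gustar", "leer")

# Build one alternation pattern: \b(hablar?|gustar?|leer?)\b
# The "?" makes the final "r" of each infinitive optional.
pattern <- paste0("\\b(", paste0(verb, "?", collapse = "|"), ")\\b")

# One logical value per trigram: TRUE if any verb pattern matches
has_verb <- str_detect(trigrams, pattern)

has_verb
#> [1] FALSE  TRUE  TRUE FALSE FALSE

trigrams[has_verb]
#> [1] "no me gusta" "si habla de"
```

Combining all verbs into a single pattern means the trigrams are scanned once instead of once per verb, which may matter with a million rows.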

Suhas Hegde
  • Hi Suhas, thanks a lot for the help. However, I don't think I understand it completely. First of all, would I need to turn the trigrams and verb columns of my df into vectors? Second, would str_extract only give back the verbs in the trigrams that partially match the verbs column? What I need to know is which trigrams contain verbs. Thanks! – PrisLB Sep 17 '18 at 14:24
  • Can you run `dput(head(data))` on your data and share that sample of your data.frame here? I took this approach because your data is not reproducible: the columns have an uneven number of rows. – Suhas Hegde Sep 17 '18 at 15:52
  • `str_extract()` only accepts vectors, so you can use `data$trigrams` to get that column as a vector. – Suhas Hegde Sep 17 '18 at 15:53
  • If you don't need to extract the partial matches and only need `TRUE`/`FALSE` results, you can use the `str_detect()` function from `stringr`. – Suhas Hegde Sep 17 '18 at 15:55
  • Actually I just made it work! Thanks a lot Suhas and sorry about the bad data frame :( – PrisLB Sep 18 '18 at 01:06
  • If the answer did work, please go ahead and accept it so that the question has an "accepted answer". – Suhas Hegde Sep 18 '18 at 02:03
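
The suggestions in the comments (work on `data$trigrams` as a vector and use `str_detect()`) could be put together roughly as follows. This is a hedged sketch, not the code the asker ended up using: the verbs are kept in a separate vector called `verbs` because 500 verbs and a million trigrams cannot share one data frame column, the frequency column is left out, and the `verb` column and `with_verbs` name are made up for illustration. It uses the for loop the question asks about to record which infinitive matched each trigram.

```r
library(stringr)

# Trigrams in a data frame; verbs in their own vector (lengths may differ)
data <- data.frame(
  trigrams = c("no veo que", "no me gusta", "si habla de", "la mesa de", "el caso que"),
  stringsAsFactors = FALSE
)
verbs <- c("hablar", "gustar", "leer")

# Start with no verb assigned to any trigram
data$verb <- NA_character_

# One vectorized pass over all trigrams per verb
for (v in verbs) {
  # e.g. \bhablar?\b : the infinitive with an optional final "r", as a whole word
  pattern <- paste0("\\b", v, "?\\b")
  # Only label rows that match and have not been labeled by an earlier verb
  hit <- is.na(data$verb) & str_detect(data$trigrams, pattern)
  data$verb[hit] <- v
}

# Keep only the trigrams in which some verb was found
with_verbs <- data[!is.na(data$verb), ]

with_verbs$trigrams
#> [1] "no me gusta" "si habla de"
with_verbs$verb
#> [1] "gustar" "hablar"
```

The loop runs once per verb rather than once per row, so with 500 verbs it makes 500 vectorized calls to `str_detect()`.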