Questions tagged [stringi]

stringi is THE R package for fast, correct, consistent and convenient string/text processing in each locale and any native character encoding. The use of the ICU library gives R users a platform-independent set of functions known to Java, Perl, Python, PHP, and Ruby programmers.

R's stringi package provides a platform-independent way of manipulating strings. It is built on the ICU library and has a syntax inspired by the stringr package.

298 questions
3 votes, 1 answer

stringi package won't install in CentOS

I am trying to install the stringi package in R, but the installation never finishes. After the download and some compilation, I get the following message: Error in dyn.load(file, DLLpath = DLLpath, ...) : unable to load shared object…
Marcus Nunes
3 votes, 1 answer

Compare two large string vectors takes too long time (remove stopwords)

I am trying to prepare a dataset for machine learning. In the process I would like to remove (stop) words which have few occurrences (often related to bad OCR readings). Currently I have a list of approximately 1 million words which I want to…
3 votes, 1 answer

Filter a TermDocumentMatrix with a dictionary of regular expressions

I feel like this should be fairly easy. I have a dictionary of terms that are currently in the format of globs, which I have converted to regular expressions. The reason I've converted them to regular expressions is because I think the tm package…
spindoctor
3 votes, 2 answers

How to remove words not in caps in R?

I'm doing text analysis using R. Is there a way to remove all the words not in caps using tm or stringi? If I have something like this Albert Einstein went to the store and saw his friend Nikola Tesla ... + 200 pages to be converted into Albert…
pachadotdev
2 votes, 2 answers

Efficient string splitting on first match using data.table

I have data of the following structure: require(data.table) dt = data.table(c('string1: val1', 'gnistr2: val2', 'ingstr3: :::!val3', 'gtrins4: val4')) > dt V1 1: string1: val1 2: gnistr2: val2 3: ingstr3: :::!val3 4: …
JDG
2 votes, 2 answers

R regex to get partly match

I tried to use stri_replace_all_regex to replace a string, but it failed. I would like to know whether there are other methods to overcome this. Thanks to anyone who can help! try: the first: > library(string) > a <- c('abc2','xycd2','mnb345','tumb…
flora micy
2 votes, 1 answer

Installing stringi repeatedly fails

I am trying to install likert, which requires stringi. install.package("likert") fails to install stringi. install.package("stringi") from CRAN fails as well: trying URL 'https://cran.rstudio.com/bin/macosx/contrib/4.2/stringi_1.7.6.tgz' Content…
mshelomi
2 votes, 1 answer

stringi R ignore accents special characters to match

I have two data frame columns, one containing names with accents and the other without. I want to match them, but only exact matches are performed. For example: df<-data.frame(not_accented=c("ACARAU CE","ADRIANOPOLIS PR", "AFUA PA","AMAPARI AP","AGUA…
CelloRibeiro
2 votes, 1 answer

Fastest way to check occurrence of a set of substrings in a large collection of documents using R

I have a large collection of documents, dc, (with several million rows) with the following data.frame structure doc_id body 1 'sdfadfs...' 2 'dfadf...' 3 'sadf....' I also have about 10,000 terms (or substrings) stored in…
Ding Li
2 votes, 1 answer

str_extract regex with quotes and semicolons

I am parsing long strings with semicolons and quotes using R v4.0.0 and stringi. Here is an example string: tstr1 <- 'gene_id "APE_RS08740"; transcript_id "unassigned_transcript_1756"; gbkey "CDS"; inference "COORDINATES: protein…
acvill
2 votes, 5 answers

getting spells and statistics from sequence of numbers

I have a string where I would like to extract spells from a sequence for example, A<- c('000001111000', '0110011', '110001') I would like to get the continuous spell lengths of 0 and 1 in a sequence format. Then using the lengths of the spells I…
user3570187
2 votes, 3 answers

Add comma after first word starting with a capital letter

As the title says. I have a bunch of names and I need to add a comma after the first word that starts with a capital letter. An example: txt <- c( "de Van-Smith J", "van der Smith G.H.", "de Smith JW", "Smith JW") The result should be: [1] "de…
flee
2 votes, 1 answer

Inserting space at specific location in a string

I want to add white space after three characters in a string. I used the following code, which works well. I wonder if there is a simpler way to accomplish the same task: library(stringi) Test <- "3061660217" paste( stri_sub(str = Test, from…
MYaseen208
2 votes, 3 answers

string count all strings giving incorrect answer in R

A<- c('C-C-C','C-C', 'C-C-C-C') library(stringr) B<- str_count(A, "C-C") df<- data.frame(A,B) A B (expected) B(actual) C-C-C 2 1 C-C 1 1 C-C-C-C 3 …
user3570187
2 votes, 2 answers

Get context around extracted word

I have extracted keywords from a dataframe of sentences. I need to get a few words pre- and post- keyword to understand the context and be able to do some basic counts. I have tried multiple stringr and stringi functions and grepl functions others…
Brian Head