Questions tagged [stringi]

stringi is THE R package for fast, correct, consistent and convenient string/text processing in each locale and any native character encoding. The use of the ICU library gives R users a platform-independent set of functions known to Java, Perl, Python, PHP, and Ruby programmers.

R's stringi package provides a platform-independent way of manipulating strings. It is built on the ICU library and has a syntax inspired by the stringr package.

298 questions
1
vote
1 answer

Function for stringr pattern look around for multiple matches

I have a dataframe of ~20,0000 observations. I am focused specifically on a column that has abstracts of scientific journals. I am attempting to pull plant species names out of these abstracts. The genus was already extracted out of the abstract, so…
1
vote
1 answer

[readtext]: download files from the Internet to remove text via stringi and read the file into Quanteda

My aim is to read multiple text files into Quanteda, first removing unwanted text that is contained within # marks. Stringi code has been provided to perform this task, however, problems were encountered reading the file in Quanteda, regarding the…
bgreen
  • 63
  • 6
1
vote
2 answers

Extracting named capture groups using stringi, name lost if first child of unnamed capture group

I'm trying to determine which parts of a string match a specific named capture group, using stringi and R (and thus ICU regex). However, if the named capture group is the first child of an unnamed capture group, the name is lost in the output. The…
Erik A
  • 31,639
  • 12
  • 42
  • 67
1
vote
2 answers

How to use `regexp` to remove all characters that are not Chinese or English

Given ori_string, how can I use a regexp to remove all characters that are not Chinese or English? Thanks! ori_string<-"没a w t _ 中/国.sz" The desired result is "没awt中国sz"
anderwyang
  • 1,801
  • 4
  • 18
1
vote
2 answers

How to handle "Regular expression backtrack stack overflow. (U_REGEX_STACK_OVERFLOW)"?

I have a text from which I want to extract the first two paragraphs. The text consists of several paragraphs separated by empty lines. The paragraphs themselves can contain line breaks. What I want to extract is everything from the beginning of the…
talocodat
  • 25
  • 5
1
vote
2 answers

How to remove repeated sentences with stringi?

I have a character vector. For each of its elements I am 100% sure there is a repetition that is always located at the start of the text. A simplified example of a repeated sentence: Hello. Hello. How are you? What I aim for is just Hello. How…
GiulioGCantone
  • 195
  • 1
  • 10
1
vote
0 answers

stringi::stri_unescape_unicode() is not able to render Unicode characters in some ranges

In the context of R, I'm aware that stringi::stri_unescape_unicode() could be used for converting a Unicode code to its corresponding character. For example, the Unicode code for…
rdrg109
  • 265
  • 1
  • 8
1
vote
2 answers

detect duplicated words within string

In the string below (which is a column in a df) I want to extract strings in which TRUE is present at least two times. I guess I could do some strsplit and then detect duplicates, but is there a method to do it directly? head(df$Filter) [1]…
user2300940
  • 2,355
  • 1
  • 22
  • 35
1
vote
3 answers

Webscrape script variable and convert string into JSON in R

I scrape information with rvest and store it in a dataframe. All information on various institutions and their context characteristics is stored in one string. It looks similar to JSON, but it isn't. I followed another stack post but am not…
Marco
  • 2,368
  • 6
  • 22
  • 48
1
vote
1 answer

How to get str_sub to accept output from str_locate_all when there are multiple replacements in a string and also assign replacements, vectorized

There are a lot of string replacement questions, but I could not find one that addressed this issue specifically. I have a too long and slow if else for loop to solve this problem, but according to the str_sub documentation, the matrix output of…
Pearl
  • 123
  • 6
1
vote
1 answer

How to remove repeated sequences of symbols (characters) in stringr/stringi?

I have a text like this: Insanely good Insanely good music. Kanye West is GOAT. The sky is blue. I want a function that whatever is the first sequence of a string, remove it if it's repeated. In the case above, it would be mutated into: Insanely…
GiulioGCantone
  • 195
  • 1
  • 10
1
vote
2 answers

How to replace only characters located between numbers and leave unchanged those with different locations

How to replace "." that is located within numbers with ",", but not replace "." located elsewhere? Input data: x_input="23.344,) abcd, 12899.2, (, efg; abef. gfdc." Expected output: x_output "23,344,) abcd, 12899,2, (, efg; abef. gfdc." I…
Krantz
  • 1,424
  • 1
  • 12
  • 31
1
vote
1 answer

Extract text between specific string in a URL "/"

I am trying to collect everything before a specific set of characters i.e. I have a URL such as the following url = "https://www.somewebsiteLink.com/someDirectory/Directory/ascensor/163235494/d" url2 =…
user113156
  • 6,761
  • 5
  • 35
  • 81
1
vote
1 answer

R - Warning: "argument is not an atomic vector" when attempting to remove whitespace

I'm at the final stage of tidying my data before analysis and have encountered an issue I'm not really able to understand when removing whitespace in the data table. See complete code below for description of the steps in the code. Started from the…
EinarO
  • 27
  • 4
1
vote
1 answer

"C compiler cannot create executables" when installing stringi in R

I often install R packages from source, and need a properly configured ~/.R/Makevars to do this. I want to be able to use OpenMP, so I copied a Makevars I found online. My Makevars ended up being this: OPT_LOC = $(HOME)/homebrew/opt LLVM_LOC =…
Paul
  • 3,321
  • 1
  • 33
  • 42