I have a vector of text data (news articles). I am trying to scan each text for money amounts and extract the text surrounding each amount. I managed this for the first element of my vector, but I am struggling to use a loop and a list to repeat the process for all of the data. I use str_extract_currencies from the strex package, which does a good job of detecting numbers. It may also be possible with regular expressions, but I don't know how.
textdata <- data.frame(
  document = c(1, 2),
  txt = c(
    "Outplay today announced its $7.3M series A fundraise from Sequoa Capital India. ..., which is poised to be a $5.59B market by 2023, is a huge opportunity for Outplay.",
    "India's leading digital care ecosystem for chronic condition management – has raised USD 5.7 million in funding led by US-based venture capital firm, W Health Ventures. The funding also saw participation from e-pharmacy Unicorn PharmEasy (a Threpsi Solutions Pvt Ltd brand), Merisis VP and existing investors Orios VP, Leo Capital, and others. With around 463 million people with diabetes and $1.13 billion with hypertension across the world"
  )
)
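library(strex)  # str_extract_currencies()
library(dplyr)  # %>% and filter()
# Dollar amounts found in the first document only
numbers <- str_extract_currencies(textdata$txt[1]) %>%
  filter(curr_sym == "$")
# Print each amount with up to 20 characters of context on either side
for (i in seq_len(nrow(numbers))) {
  print(stringr::str_extract(textdata$txt[1], paste0(".{0,20}", numbers$amount[i], ".{0,20}")))
}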
A document may contain zero or multiple money amounts. I would like to store the results in a data.frame like this:
finaldata <- data.frame(
  document = c(1, 1, 2),
  money_related = c(
    "oday announced its $7.3M series A fundraise",
    " is poised to be a $5.59B market by 2023, is",
    "with diabetes and $1.13 billion with hyper"
  )
)
> finaldata
  document                                money_related
1        1  oday announced its $7.3M series A fundraise
2        1  is poised to be a $5.59B market by 2023, is
3        2   with diabetes and $1.13 billion with hyper
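For reference, here is a rough sketch of how the regular-expression route might produce a data.frame of this shape for every document at once. It is only an illustration under several assumptions: the sample texts are shortened stand-ins rather than the real data, the pattern `\$\d[\d.,]*` is a simplified guess at what counts as a dollar amount (it ignores amounts written as "USD 5.7 million"), and the 20-character window is measured from the "$" sign rather than from the number, so the context may differ by a character from the output above.

```r
library(stringr)

# Shortened, hypothetical sample texts (stand-ins for the real news data)
textdata <- data.frame(
  document = c(1, 2, 3),
  txt = c(
    "Outplay announced its $7.3M series A; a $5.59B market by 2023.",
    "The firm raised USD 5.7 million in funding.",
    "Costs reached $1.13 billion."
  ),
  stringsAsFactors = FALSE
)

# Assumed pattern: a "$" followed by digits, dots, and commas
amount_pattern <- "\\$\\d[\\d.,]*"

# One matrix of start/end positions per document (zero rows if no match)
locs <- str_locate_all(textdata$txt, amount_pattern)

rows <- lapply(seq_along(locs), function(i) {
  m <- locs[[i]]
  if (nrow(m) == 0) return(NULL)  # document without any "$" amount
  data.frame(
    document = textdata$document[i],
    # Up to 20 characters of context on each side of the match;
    # pmax() keeps the start position from dropping below 1
    money_related = str_sub(textdata$txt[i],
                            pmax(m[, "start"] - 20, 1),
                            m[, "end"] + 20),
    stringsAsFactors = FALSE
  )
})

# Stack the per-document pieces; NULL entries are dropped by rbind()
finaldata <- do.call(rbind, rows)
print(finaldata)
```

Locating the amounts first and then cutting out the window with str_sub() lets two amounts that sit close together each get their own row, whereas a single pattern such as ".{0,20}AMOUNT.{0,20}" passed to str_extract_all() could swallow the second amount inside the first match's context. If no document contains an amount, `finaldata` will be NULL rather than an empty data.frame.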
Thank you very much.