I have a data frame with 40802 gene names and a second data frame with information on 14000 articles. The article information contains the title, abstract, day, month, and year.
I have converted the date to a standard format (YYYY-MM-DD) and the abstract to character.
I want a plot with time on the x-axis showing how often each gene name appears in the abstracts. For example:
| Date | Gene Name | Frequency |
|------------|-----------|-----------|
| 2017-03-20 | GAPDH | 5 |
| 2017-03-21 | AKT | 6 |
Basically, I want to know which gene names were published most frequently in the last 100 days and have a timeline showing how the counts for those gene names evolve. Something like a trend.
library(RISmed)
##Search query - can be anything relevant to protein expression.
##Multiple search terms not tested yet
search_topic <- 'protein expression'
##Evaluate the query with reldate = days before today, retmax = maximum number of returned results
search_query <- EUtilsSummary(search_topic, retmax=15000, reldate = 100)
##explore the outcome
summary(search_query)
##get the ids for all the query results, needed to fetch the articles
QueryId(search_query)
##get all the records associated with the ID - THIS TAKES LOOONG TIME
records<- EUtilsGet(search_query)
##Analyze the structure
str(records)
summary(records)
##Create a data frame with article/abstract/date
pubmed_data <- data.frame('Title' = ArticleTitle(records), 'Abstract' = AbstractText(records),
                          "Day" = DayPubmed(records), "Month" = MonthPubmed(records), "Year" = YearPubmed(records),
                          stringsAsFactors = FALSE)  # keep Title and Abstract as character, not factor
##explore the data
head(pubmed_data,1)
##gene names
genename <- read.csv("genename.csv", header = T, stringsAsFactors = F)
##remove any rows with an NA title (logical indexing; -which() would drop every row when there are no NAs)
pubmed <- pubmed_data[!is.na(pubmed_data$Title), ]
##Coerce the date to YYYY-MM-DD (include the year; without it as.Date() fills in the current year)
pubmed$Date <- as.Date(paste(pubmed$Year, pubmed$Month, pubmed$Day, sep = "-"), format = "%Y-%m-%d")
I've read a lot and cannot figure out how to find genename[1,1] inside pubmed$Abstract and count how many times it appears per date.
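To make the counting step concrete, here is a rough sketch on toy data of what I imagine it could look like. The names `pubmed_toy`, `genes_toy`, `count_gene`, and `gene_freq` are made up for illustration, and matching gene names as whole, case-sensitive words is my assumption, not something I have validated against real abstracts.

```r
# Toy stand-ins for the real objects (hypothetical data, not PubMed output).
pubmed_toy <- data.frame(
  Date = as.Date(c("2017-03-20", "2017-03-20", "2017-03-21")),
  Abstract = c("GAPDH was used as a control; GAPDH levels were stable.",
               "AKT signalling and GAPDH expression.",
               "AKT phosphorylation; AKT1 was not measured."),
  stringsAsFactors = FALSE
)
genes_toy <- c("GAPDH", "AKT")

# Count whole-word, case-sensitive occurrences of one gene in each abstract.
# Assumes gene names contain no regex metacharacters.
count_gene <- function(gene, text) {
  pattern <- paste0("\\b", gene, "\\b")
  hits <- gregexpr(pattern, text, perl = TRUE)
  # gregexpr() returns -1 for "no match", so only positive positions count
  vapply(hits, function(h) sum(h > 0), integer(1))
}

# One row per (abstract, gene), then sum the counts per date and gene.
per_gene <- lapply(genes_toy, function(g) {
  data.frame(Date = pubmed_toy$Date,
             Gene = g,
             Count = count_gene(g, pubmed_toy$Abstract),
             stringsAsFactors = FALSE)
})
long <- do.call(rbind, per_gene)
gene_freq <- aggregate(Count ~ Date + Gene, data = long, FUN = sum)
print(gene_freq)
```

With 40802 gene names and 14000 abstracts this loop would probably be slow, and short gene symbols that are also ordinary words or abbreviations would likely produce false hits, so I am not sure this is the right approach.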
The plot would have the last 100 days on the x-axis and one line per gene name showing its frequency,
with the gene names in the legend, so that a trend can be observed.
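For the plot, this is roughly what I picture, sketched in base R on a small hand-made table. `gene_freq_toy`, `top_n`, and the other names are hypothetical, and the table simply assumes that per-date counts in Date / Gene / Count form already exist.

```r
# Hypothetical per-date counts in long format (Date, Gene, Count).
gene_freq_toy <- data.frame(
  Date  = as.Date(c("2017-03-20", "2017-03-21", "2017-03-20", "2017-03-21")),
  Gene  = c("AKT", "AKT", "GAPDH", "GAPDH"),
  Count = c(1L, 1L, 3L, 0L),
  stringsAsFactors = FALSE
)

# Keep only the top_n genes by total count so the legend stays readable.
top_n <- 2
totals <- sort(sapply(split(gene_freq_toy$Count, gene_freq_toy$Gene), sum),
               decreasing = TRUE)
top_genes <- head(names(totals), top_n)
sub <- gene_freq_toy[gene_freq_toy$Gene %in% top_genes, ]

# Reshape to a date-by-gene matrix; missing combinations count as zero.
wide <- tapply(sub$Count, list(format(sub$Date), sub$Gene), sum)
wide[is.na(wide)] <- 0
wide <- wide[, top_genes, drop = FALSE]
dates <- as.Date(rownames(wide))

# Empty plot spanning the data, then one line per gene, then the legend.
plot(range(dates), range(wide), type = "n",
     xlab = "Date", ylab = "Mentions in abstracts")
for (j in seq_len(ncol(wide))) {
  lines(dates, wide[, j], col = j)
}
legend("topright", legend = colnames(wide), col = seq_len(ncol(wide)), lty = 1)
```

Restricting the plot to a handful of top genes is my own simplification; drawing all 40802 gene names at once would not be readable.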
I would really appreciate any ideas how this can be done.
I have tried the tm package and a lot of other things, but I am still hitting a wall. Is my concept wrong?