I've got a dataframe df with research areas (Fields) and journal identifiers (ISSN). Each ISSN appears several times in the dataframe, and the corresponding Fields entries may contain the same sequence of keywords, overlapping ones, or totally different ones. For example, for the same ISSN, one entry could be "artificial intelligence, soft computing", another could be "cloud computing, soft computing, drones" and a third could be "genetic algorithms, ANN". However, terms are never repeated within a single string (e.g. "artificial intelligence, soft computing, artificial intelligence" does not occur).
Here is a representative example of the data:
dput(ptotal)
structure(list(Fields = list("inteligencia artificial, HCI, Bioinformatics",
"inteligencia artificial, BioinspiredAlgorithms", "advancedMachineLearning",
"advancedMachineLearning, Classification, Clustering", "advancedMachineLearning, applied artificial intelligence",
"inteligencia artificial, advancedMachineLearning, engineering"),
ISSN = c("19883064", "19883064", "13704621", "09574174",
"09574174", "09574174"), ExpectedResults = list("inteligencia artificial, HCI, Bioinformatics inteligencia artificial, BioinspiredAlgorithms",
"inteligencia artificial, HCI, Bioinformatics inteligencia artificial, BioinspiredAlgorithms",
"advancedMachineLearning", "advancedMachineLearning, Classification, Clustering advancedMachineLearning, applied artificial intelligence inteligencia artificial, advancedMachineLearning, engineering",
"advancedMachineLearning, Classification, Clustering advancedMachineLearning, applied artificial intelligence inteligencia artificial, advancedMachineLearning, engineering",
"advancedMachineLearning, Classification, Clustering advancedMachineLearning, applied artificial intelligence inteligencia artificial, advancedMachineLearning, engineering")), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(1L, 35L, 4L, 10L, 20L, 39L
))
The original data are the first two columns (Fields and ISSN); the last column (ExpectedResults) is my expected result, which I built by simply pasting together all the strings in the rows that share the same ISSN. You can see that if a term appears in several rows for the same ISSN, it is pasted several times in the ExpectedResults column.
My final goal is to combine all rows corresponding to the same ISSN into a single string. Afterwards, I plan to find the unique terms within that string and count how many times each term is repeated in it (term frequency, tf), so I have an idea of what the journal is really about. The problem is that I haven't managed to paste the strings together by group.
I've seen several similar posts here on Stack Overflow for numeric columns and for single character/word columns (which I have tried), but none has worked so far (I've mostly tried with plyr and dplyr). In the posted example, I just used paste to put the strings together manually, so I'm guessing the solution is pretty straightforward and I'm just coding something wrong.
Can anyone point me to an appropriate post or provide a solution? Thanks in advance!
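One possible approach, sketched in base R rather than plyr/dplyr. This is a minimal sketch under a few assumptions, not a definitive solution: Fields is treated as a plain character column (the commented unlist() line flattens the list column shown in the dput output), the column name Combined and the object term_counts are made up for illustration, and the groups are joined with ", " instead of the single space used in ExpectedResults so that the combined string can later be split on commas.

```r
# Toy data mirroring the example in the question (Fields as a character column)
ptotal <- data.frame(
  Fields = c("inteligencia artificial, HCI, Bioinformatics",
             "inteligencia artificial, BioinspiredAlgorithms",
             "advancedMachineLearning",
             "advancedMachineLearning, Classification, Clustering",
             "advancedMachineLearning, applied artificial intelligence",
             "inteligencia artificial, advancedMachineLearning, engineering"),
  ISSN = c("19883064", "19883064", "13704621",
           "09574174", "09574174", "09574174"),
  stringsAsFactors = FALSE
)

# If Fields is a list column, as in the dput output, flatten it first:
# ptotal$Fields <- unlist(ptotal$Fields)

# Step 1: paste all Fields strings that share an ISSN.
# ave() applies FUN within each ISSN group and recycles the single
# combined string to every row of that group.
ptotal$Combined <- ave(ptotal$Fields, ptotal$ISSN,
                       FUN = function(x) paste(x, collapse = ", "))

# Step 2: per ISSN, split on commas, trim whitespace, and count each term.
term_counts <- lapply(split(ptotal$Fields, ptotal$ISSN), function(x) {
  terms <- trimws(unlist(strsplit(x, ",", fixed = TRUE)))
  sort(table(terms), decreasing = TRUE)
})

term_counts[["09574174"]]
```

Step 2 works directly from the grouped Fields values, so the combined string from step 1 is only needed if the pasted text itself is wanted as a column. To get one row per ISSN instead of a repeated value on every row, aggregate(Fields ~ ISSN, data = ptotal, FUN = paste, collapse = ", ") is a base R alternative.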