
I have a data frame df with research areas (Fields) and journal identifiers (ISSN). Each ISSN appears several times in the data frame, and the corresponding Fields may contain the same sequence of keywords, overlapping ones, or totally different ones. For example, for the same ISSN, one row could be "artificial intelligence, soft computing", another could be "cloud computing, soft computing, drones", and a third could be "genetic algorithms, ANN". However, no term is repeated within a single string (e.g. "artificial intelligence, soft computing, artificial intelligence" never occurs).

Here is a representative example of the data:

dput(ptotal)
structure(list(Fields = list("inteligencia artificial, HCI, Bioinformatics", 
    "inteligencia artificial, BioinspiredAlgorithms", "advancedMachineLearning", 
    "advancedMachineLearning, Classification, Clustering", "advancedMachineLearning, applied artificial intelligence", 
    "inteligencia artificial, advancedMachineLearning, engineering"), 
    ISSN = c("19883064", "19883064", "13704621", "09574174", 
    "09574174", "09574174"), ExpectedResults = list("inteligencia artificial, HCI, Bioinformatics inteligencia artificial, BioinspiredAlgorithms", 
        "inteligencia artificial, HCI, Bioinformatics inteligencia artificial, BioinspiredAlgorithms", 
        "advancedMachineLearning", "advancedMachineLearning, Classification, Clustering advancedMachineLearning, applied artificial intelligence inteligencia artificial, advancedMachineLearning, engineering", 
        "advancedMachineLearning, Classification, Clustering advancedMachineLearning, applied artificial intelligence inteligencia artificial, advancedMachineLearning, engineering", 
        "advancedMachineLearning, Classification, Clustering advancedMachineLearning, applied artificial intelligence inteligencia artificial, advancedMachineLearning, engineering")), class = c("tbl_df", 
"tbl", "data.frame"), row.names = c(1L, 35L, 4L, 10L, 20L, 39L
))

The original data are the first two columns (Fields and ISSN); the last column, ExpectedResults, is my expected result (I simply pasted together all the strings in the rows that share the same ISSN). Note that if a term appears in several rows for the same ISSN, it is pasted several times in the ExpectedResults column.
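For reference, the grouped paste described above can be sketched in base R with `ave()`. This is only a minimal sketch: it assumes Fields is a plain character vector (the list column in the dput output would need `unlist()` first), and the column name `Combined` is made up for illustration.

```r
# Small illustrative data frame; the values are taken from the example above,
# with Fields stored as a character vector instead of a list column (assumption).
df <- data.frame(
  Fields = c("inteligencia artificial, HCI, Bioinformatics",
             "inteligencia artificial, BioinspiredAlgorithms",
             "advancedMachineLearning"),
  ISSN = c("19883064", "19883064", "13704621"),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each ISSN group and returns a vector aligned with
# the original rows, so every row of a group receives the same pasted string.
df$Combined <- ave(df$Fields, df$ISSN,
                   FUN = function(x) paste(x, collapse = " "))

print(df$Combined)
```

Using `ave()` keeps one row per original record, which matches the shape of the ExpectedResults column; `tapply()` or `aggregate()` would return one row per ISSN instead.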

My final goal is to combine all rows corresponding to the same ISSN into a single string. Afterwards, I plan to find the unique terms within that string and count how many times each term occurs in it (term frequency, tf), so I get an idea of what the journal is really about. The problem is that I can't manage to paste the strings together by group.
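The term-counting step could look roughly like the sketch below, shown for the rows of a single ISSN (09574174 in the example data). It assumes terms are separated by commas; the variable names are illustrative only.

```r
# Fields strings for one ISSN, copied from the example data.
fields <- c("advancedMachineLearning, Classification, Clustering",
            "advancedMachineLearning, applied artificial intelligence",
            "inteligencia artificial, advancedMachineLearning, engineering")

# Split each string on commas, flatten to one vector, and trim whitespace.
terms <- trimws(unlist(strsplit(fields, ",", fixed = TRUE)))

# table() counts each distinct term; sort() puts the most frequent first.
tf <- sort(table(terms), decreasing = TRUE)
print(tf)
```

One caveat: if the rows are first pasted together with a space, as in the ExpectedResults column, splitting on commas would glue the last term of one row to the first term of the next (e.g. "Clustering advancedMachineLearning"). Splitting each row separately, as here, or pasting with `collapse = ", "`, avoids that.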

I've seen several similar posts here on Stack Overflow for numeric columns and also for single character/word columns (which I have tried), but none seem to work so far (I've mostly tried with plyr and dplyr). In the posted example, I just used paste to manually put the strings together, so I'm guessing the solution is pretty straightforward and I'm just coding something wrong.

Can anyone point me to an appropriate post or provide a solution? Thanks in advance!

  • Your data shows no duplicate ISSN, and most of your `Fields` data are invariant. I'm confident this task is straight-forward in whichever dialect of R you prefer (base, `dplyr`, or `data.table`), but it's up to you to make a minimal working **and representative** question. In this case, I suggest you reduce your fields to 2-3 per row, make sure you actually have at least one duplicate in `ISSN` (but not all duplicates, having singles is good too), and demonstrate the expected output (for at least one of the dupes). – r2evans Jun 01 '21 at 19:25
  • Solved it! group_by was not working because I had loaded plyr after dplyr. I removed plyr and I get the expected result now. I apologize for the trouble. – Cris Urdiales Jun 02 '21 at 12:29
  • Haha! Yes, that's a common one. It can be rather frustrating when package functions mask others, and while R is supposed to warn you when it happens, I find it easy to gloss over that message. – r2evans Jun 02 '21 at 13:17
  • Yes, I was going crazy here, checking the same lines again and again. Your suggestion to go for a representative data subset helped me to debug and fix the issue. Thanks a lot! :) – Cris Urdiales Jun 03 '21 at 14:05
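
The masking issue mentioned in the comments can be sidestepped by calling the dplyr functions with an explicit namespace, so it no longer matters whether plyr was attached afterwards. The following is a minimal sketch under that assumption; it requires dplyr to be installed, and the column name `Combined` is illustrative.

```r
# Example rows taken from the question, with Fields as a character vector.
df <- data.frame(
  Fields = c("inteligencia artificial, HCI, Bioinformatics",
             "inteligencia artificial, BioinspiredAlgorithms",
             "advancedMachineLearning"),
  ISSN = c("19883064", "19883064", "13704621"),
  stringsAsFactors = FALSE
)

# Qualifying each call with dplyr:: guarantees the dplyr versions are used,
# even if plyr::summarise or plyr::mutate currently mask them.
combined <- dplyr::summarise(
  dplyr::group_by(df, ISSN),
  Combined = paste(Fields, collapse = " ")
)

print(combined)
```

Swapping `dplyr::summarise` for `dplyr::mutate` would keep one row per original record instead of one row per ISSN.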
