I am using the tm package.

I have a data frame with 2 columns: the first column is the ID and the second column contains text. The data frame looks as follows.

Id       Text
13456    Hi, Good morning
13457    How are you?
13456    May I know who I am speaking to?
13456    Hi, Good evening

I have used the tm package to build a document-term matrix (DTM) and extracted the top 5 words for each document. The result looks like:

Id       Term1 Term2 Term3   Term4 Term5
13456    Hi    Good  morning term4 term5
13457    How   are   you     term4 term5
13456    I     Know  may     who   to
13456    Hi    Good  Evening term4 term5

But the required output is:

Id      Term1 Term2 Term3 Term4   Term5
13456   Hi    Good  I     morning evening
13457   How   are   you   term4   term5

I could not find any previous questions on this. Thanks in advance.

Bhavya
  • Could you explicitly give the function you used to build your dtm, so that we can try it? – denis Nov 20 '17 at 10:18

1 Answer

The problem you are facing stems from the fact that each row of your data is treated as an individual document. Hence, you need to aggregate your data by `id` before feeding it into the process that generates the dtm.

The following shows an example of how to use aggregate in base R. If you have a huge number of documents, this can be done more efficiently, e.g., with the package data.table, which you might want to have a look at. However, to keep it simple, I use base R. (I have used my own example data. Next time, please use dput or provide the code that generates your data to make it easier for others to read your example data.)

# Example data: two rows share id 1
df <- data.frame(id = c(1, 1, 2), text = c("text1.", "text2.", "text3."))
#   id   text
# 1  1 text1.
# 2  1 text2.
# 3  2 text3.

# Paste all texts that belong to the same id into a single string
df <- aggregate(df$text, by = list(df$id), FUN = function(x) paste(x, collapse = " "))
#   Group.1             x
# 1       1 text1. text2.
# 2       2        text3.

# Restore the original column names
colnames(df) <- c("id", "text")
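
As a rough sketch, the same aggregation can be applied to the data from the question. Since the question does not show how the dtm was built, the term counting below uses base R (strsplit and table) as a stand-in for the tm step, and top_terms is a hypothetical helper written only for this illustration. It also does no stop-word removal or other preprocessing, so the result will not match a tm pipeline exactly.

```r
# Data from the question, rebuilt by hand (column names lower-cased here).
df <- data.frame(
  id = c(13456, 13457, 13456, 13456),
  text = c("Hi, Good morning",
           "How are you?",
           "May I know who I am speaking to?",
           "Hi, Good evening"),
  stringsAsFactors = FALSE
)

# One row per id: all texts that share an id are pasted into one document.
agg <- aggregate(text ~ id, data = df, FUN = paste, collapse = " ")

# Hypothetical helper (not part of tm): the n most frequent
# lower-cased words of a single document.
top_terms <- function(doc, n = 5) {
  words <- strsplit(tolower(doc), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]  # drop empty strings left by the split
  counts <- sort(table(words), decreasing = TRUE)
  names(counts)[seq_len(min(n, length(counts)))]
}

# One character vector of top terms per id.
top <- lapply(agg$text, top_terms)
names(top) <- agg$id
```

With tm, the aggregated agg$text would be passed to the corpus in place of the original text column, so that each id becomes exactly one document in the dtm.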
Manuel Bickel