I have compiled a dataset of tweets using the Twitter API.

The dataset basically looks as follows:

Data <- data.frame(
  X = c(1,2),
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2", "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2")
) 

Now I want to create a data.frame for social network analysis. I want to show how each of the screennames (in the case of this example "author1" etc.) is linked to users ("@User1" etc.) and hashtags ("#hashtag1", etc.).

To do so, I need to extract/copy users and hashtags from the "text" column and write them into new columns. The data.frame should look like this:

Data <- data.frame(
  X = c(1,2),
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2", "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2"),
  U1 = c("@User1", "@User2"),
  U2 = c("@User2", "@User1"),
  U3 = c("@User3", "@User3"),
  U4 = c("",""),
  U5 = c("",""),
  H1 = c("#hashtag1", "#hashtag3"),
  H2 = c("#hashtag2", "#hashtag4"),
  H3 = c("",""),
  H4 = c("",""),
  H5 = c("","")
)

How can I extract/copy this information from the "text" column and write it into new columns?

feder80
  • Why do you have empty `U4` and `U5` columns? – David Arenburg Feb 03 '15 at 11:33
  • I do not know how many users one author mentions in his/her tweet. so I inserted 5 columns for possible users mentioned in a tweet (knowing, that it can be more). – feder80 Feb 03 '15 at 11:34
  • Are you running this procedure several times or you just get this `Data` and you want to convert to `Data2` once? You can just set the amount of columns by the maximum size of Users or hashtags – David Arenburg Feb 03 '15 at 11:35
  • I have a big dataset with ten-thousands of observations. – feder80 Feb 03 '15 at 11:38
  • Ok, try my solution using `stringi` package. I've tested it on different string lengths and it works fine. It is also 99% vectorized so should work fine on a big data set. – David Arenburg Feb 03 '15 at 11:43

1 Answer

Here's my simple attempt using the stringi package. This method creates as many columns as the largest number of users (or hashtags) found in any single tweet, so it works for any number of users or hashtags mentioned. It should also be efficient, because the solution is mostly vectorized.

library(stringi)
Users <- stri_extract_all(Data$text, regex = "@[A-Za-z0-9]+")
Data[paste0("U", seq_len(max(sapply(Users, length))))] <- stri_list2matrix(Users, byrow = TRUE)
Hash <- stri_extract_all(Data$text, regex = "#[A-Za-z0-9]+")
Data[paste0("H", seq_len(max(sapply(Hash, length))))] <- stri_list2matrix(Hash, byrow = TRUE)
Data
#   X                                                       text screenname     U1     U2     U3        H1        H2
# 1 1 Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2    author1 @User1 @User2 @User3 #hashtag1 #hashtag2
# 2 2 Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4    author2 @User2 @User1 @User3 #hashtag3 #hashtag4
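
For comparison, the same idea can be sketched in base R without extra packages. This is only an illustration under a few assumptions: `extract_wide` is a hypothetical helper name, the pattern `\\w+` is swapped in so that underscores in handles are matched too, rows are padded with empty strings to mirror the desired output in the question, and at least one tweet is assumed to contain a match.

```r
# Sample data copied from the question; keep text as character, not factor.
Data <- data.frame(
  X = c(1, 2),
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2",
           "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per tweet, one column per match, padded with "".
# Assumes at least one element of `text` contains a match.
extract_wide <- function(text, pattern, prefix) {
  text <- as.character(text)
  hits <- regmatches(text, gregexpr(pattern, text))   # list of matches per tweet
  width <- max(lengths(hits))                         # widest row decides column count
  padded <- lapply(hits, function(h) c(h, rep("", width - length(h))))
  out <- do.call(rbind, padded)                       # bind row-wise, in tweet order
  colnames(out) <- paste0(prefix, seq_len(width))
  as.data.frame(out, stringsAsFactors = FALSE)
}

Data2 <- cbind(Data,
               extract_wide(Data$text, "@\\w+", "U"),
               extract_wide(Data$text, "#\\w+", "H"))
Data2
```

Binding the padded vectors row-wise keeps each tweet's matches together in their original order, which is the same job `byrow = TRUE` does in `stri_list2matrix` above.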
David Arenburg
  • Wow, that works! Just one thing: the users and hashtags are not filled in sequence, so sometimes a hashtag ends up in column H15 although there are only 5 hashtags. Do you know why? – feder80 Feb 03 '15 at 11:49
  • I couldn't tell without a reproducible example. – David Arenburg Feb 03 '15 at 11:50
  • I am not allowed to make the dataset public, but can I mail it to you? – feder80 Feb 03 '15 at 11:54
  • You can just provide an example of a single tweet where it happens and disguise the names, so the error will still occur but you won't publish the actual names – David Arenburg Feb 03 '15 at 11:56
  • it is almost impossible to create an example of 4 single tweets. there is so much information in this dataset. i am not able to extract it using dput – feder80 Feb 03 '15 at 12:24
  • I think I found the error - with your script, the Users are not copied in sequence within a row, but within a column. In the example above, the output of users in the first row should be user1, user2, user3 instead of user1, user3, user1. But how can I change that? – feder80 Feb 03 '15 at 12:38
  • Sorry, I've made a small mistake. I've edited the code, try again. – David Arenburg Feb 03 '15 at 12:43
  • One last question: what should the code look like if I want to have all the users in one column (called "target"), separated by commas? – feder80 Feb 03 '15 at 13:19
  • You'll have to show me your desired output with the provided `Data`. – David Arenburg Feb 03 '15 at 13:20
  • I want to import the csv file into Gephi. According to Gephi, I need the following file format: http://gephi.github.io/users/supported-graph-formats/csv-format/ – feder80 Feb 03 '15 at 13:26
  • It starts to get confusing, sorry. Should I edit my question and describe the whole Problem? I am really sorry for the mess of confusing questions! – feder80 Feb 03 '15 at 14:04
  • I think you should accept this solution because it answered your original question correctly. And then ask a new question maybe. – David Arenburg Feb 03 '15 at 14:11
  • You are right. Thank You. Maybe You can also answer the other question. – feder80 Feb 03 '15 at 14:12
  • You can post a link in comments and I'll take a look later today (if it still won't be answered). – David Arenburg Feb 03 '15 at 14:18
  • http://stackoverflow.com/questions/28302705/exporting-twitter-data-to-gephi-using-r – feder80 Feb 03 '15 at 15:32
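
One possible reading of the follow-up request for a single comma-separated `target` column is sketched below in base R. The column name `target` comes from the comment and the sample data is copied from the question; whether this layout is what the Gephi CSV import expects is not checked here, so treat it as an assumption.

```r
# Sample data copied from the question; keep text as character, not factor.
Data <- data.frame(
  X = c(1, 2),
  text = c("Hello @User1 #hashtag1, hello @User2 and @User3, #hashtag2",
           "Hello @User2 #hashtag3, hello @User1 and @User3, #hashtag4"),
  screenname = c("author1", "author2"),
  stringsAsFactors = FALSE
)

# Extract all mentions per tweet (same pattern as in the answer).
users <- regmatches(Data$text, gregexpr("@[A-Za-z0-9]+", Data$text))

# Collapse each tweet's mentions into one string; tweets without mentions get "".
Data$target <- vapply(users, paste, character(1), collapse = ",")

Data[c("screenname", "target")]
```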