I am trying to clean up data for social network analysis, and as a newcomer to coding, I'm having trouble writing a complex conditional. First, we have a dataframe bookinfo where the columns of interest are date, receiver, and bookId:

>head(bookinfo)
        date                             receiver          bookId readingStatus
1 2017-04-21 03cff9d7-5712-410c-a4bf-f04ceede644b asin:0062228013  ALREADY_READ
2 2017-04-18 03cff9d7-5712-410c-a4bf-f04ceede644b asin:1442449616  ALREADY_READ
3 2017-04-24 03cff9d7-5712-410c-a4bf-f04ceede644b asin:0545851904  ALREADY_READ
4 2017-04-18 03cff9d7-5712-410c-a4bf-f04ceede644b asin:0545384176  ALREADY_READ
5 2017-06-02 03cff9d7-5712-410c-a4bf-f04ceede644b asin:0763643491  ALREADY_READ
6 2017-04-24 03cff9d7-5712-410c-a4bf-f04ceede644b asin:0545851890  ALREADY_READ

Then, we have a dataframe rec where the columns of interest are date, sender, receiver, and bookId:

>head(rec)
     date                               sender                             receiver       messageType          bookId
1 4/21/17 7a28156e-950e-47b7-a4aa-241fa9cfcf1a f8b027a3-89eb-475a-83e0-eb94e24eaab4 RECOMMENDS_A_BOOK asin:0986444138
2 4/21/17 fb4eefd3-03e9-40c3-bc9e-af85ea88d827 f8b027a3-89eb-475a-83e0-eb94e24eaab4 RECOMMENDS_A_BOOK asin:1434297314
3 4/21/17 dc319e95-0e3e-461e-b02c-abab4414c741 f8b027a3-89eb-475a-83e0-eb94e24eaab4 RECOMMENDS_A_BOOK asin:1484746694
4 4/18/17 118c57b6-e946-453f-88b2-6ae1282e62ab f8b027a3-89eb-475a-83e0-eb94e24eaab4 RECOMMENDS_A_BOOK asin:1514241587
5 4/21/17 dd0de21d-889d-4bf1-9ebb-af50b6660815 f8b027a3-89eb-475a-83e0-eb94e24eaab4 RECOMMENDS_A_BOOK asin:0986444138
6 4/21/17 f85d06ea-d534-42de-a714-6dc6358d1e29 f8b027a3-89eb-475a-83e0-eb94e24eaab4 RECOMMENDS_A_BOOK asin:1484746694

In the dataframe rec, I want to create a new column Tie. The conditional would be as follows:

Tie = 1 if

  • In rec: Sender, Receiver, and bookId are in the same row AND
  • In bookinfo: that same Receiver, same bookId are in the same row AND the date here is later than the date of the referenced row in rec
  • Note that rec and bookinfo are not necessarily consistent. Whereas Sender+Receiver+bookId may be row 3 in rec, Receiver+bookId may be row 10 in bookinfo.

Otherwise, Tie=0.

The intuition is that if the Receiver shows activity with the book AFTER the date of receiving a recommendation of that book from the Sender, then they have a tie. (If they show activity before that date, it's unrelated to the Sender.)

Thanks in advance for any help and for your time!

c. lam

1 Answer

I know you have a complex data frame there, but please try to provide a reproducible example, for instance by creating some dummy data that roughly follows your own.

The piece of code you really want here is the Boolean condition inside the function. It compares the matching columns of one row from each data frame, and you can reference those columns either by integer position or by column name.

Hope this helps.

# Dummy data. bookinfo logs a receiver's activity with a book;
# rec logs a recommendation of a book from a sender to a receiver.
bookinfo <- data.frame(
  date = as.Date(c('01-08-2010', '01-08-2010', '01-08-2011', '01-09-2011', '01-08-2012'), format = '%d-%m-%Y'),
  Receiver = c(1, 2, 3, 4, 5),
  bookId = c('Dickens', 'Austen', 'Dickens', 'Austen', 'Shakespeare'),
  stringsAsFactors = FALSE
)
rec <- data.frame(
  date = as.Date(c('01-08-2016', '01-08-2004', '01-08-2014', '01-07-2011', '01-08-2015'), format = '%d-%m-%Y'),
  Sender = c('a', 'b', 'c', 'd', 'e'),
  Receiver = c(1, 2, 3, 4, 5),
  bookId = c('Dickens', 'Austen', 'Dickens', 'Austen', 'Shakespeare'),
  stringsAsFactors = FALSE
)

# x is one row of rec, y is one row of bookinfo.
# Returns 1 if y has the same Receiver and bookId as x and the activity
# in y is dated after the recommendation in x; otherwise returns 0.
booked <- function(x, y) {
  if (x[['Receiver']] == y[['Receiver']] &&
      x[['bookId']] == y[['bookId']] &&
      x[['date']] < y[['date']]) {
    result <- 1
  } else {
    result <- 0
  }

  return(result)
}

# Start every row at 0 so that rows without a match keep Tie = 0.
rec$tie <- 0

# Compare every row of rec against every row of bookinfo.
for (i in seq_len(nrow(rec))) {
  for (j in seq_len(nrow(bookinfo))) {
    if (booked(rec[i, ], bookinfo[j, ]) == 1) {
      rec[i, 'tie'] <- 1
    }
  }
}
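
One thing to watch with the real data: in the question's output, rec stores dates like 4/21/17 while bookinfo stores them like 2017-04-21, so both columns need to be parsed into Date objects before the "later than" comparison means anything. A minimal sketch, assuming those two formats hold for every row (the vectors below are made-up stand-ins for the real columns):

```r
# Hypothetical stand-ins for rec$date and bookinfo$date from the question.
rec_date      <- c("4/21/17", "4/18/17")
bookinfo_date <- c("2017-04-21", "2017-06-02")

# Parse each column with the format it is actually stored in.
rec_parsed      <- as.Date(rec_date, format = "%m/%d/%y")
bookinfo_parsed <- as.Date(bookinfo_date, format = "%Y-%m-%d")

# Date objects compare chronologically, element by element.
later <- bookinfo_parsed > rec_parsed
print(later)
```

Comparing the raw strings instead would sort them alphabetically, which does not match calendar order for the month/day/year format.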

EDIT

I have updated the code so that it now scans all rows of 'bookinfo' for the required matches to each row of 'rec'.

In general, as others will point out, we should be looking to vectorise the code via 'apply'-type functions, rather than writing loops. However, this is a tricky problem and I couldn't find a solution immediately.
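
For later readers, one way to avoid the nested loops is to join the two data frames on Receiver and bookId with base R's merge and then compare the dates. This is only a sketch on dummy data, not something tested against the real data: the row_id helper column and the suffixes .rec and .info are names chosen here for illustration, and bookinfo is deliberately in a different row order from rec.

```r
# Dummy data; bookinfo rows are in a different order from rec.
rec <- data.frame(
  date = as.Date(c("2016-08-01", "2004-08-01", "2014-08-01", "2011-07-01", "2015-08-01")),
  Sender = c("a", "b", "c", "d", "e"),
  Receiver = c(1, 2, 3, 4, 5),
  bookId = c("Dickens", "Austen", "Dickens", "Austen", "Shakespeare"),
  stringsAsFactors = FALSE
)
bookinfo <- data.frame(
  date = as.Date(c("2012-08-01", "2011-09-01", "2011-08-01", "2010-08-01", "2010-08-01")),
  Receiver = c(5, 4, 3, 2, 1),
  bookId = c("Shakespeare", "Austen", "Dickens", "Austen", "Dickens"),
  stringsAsFactors = FALSE
)

# Tag each recommendation so it can be found again after the join
# (row_id is a helper column introduced for this sketch).
rec$row_id <- seq_len(nrow(rec))

# Pair every recommendation with every activity row for the same
# Receiver and bookId. Both inputs have a 'date' column, so merge
# renames them date.rec and date.info.
pairs <- merge(rec, bookinfo, by = c("Receiver", "bookId"),
               suffixes = c(".rec", ".info"))

# Recommendations followed by at least one later activity row.
hits <- pairs$row_id[pairs$date.info > pairs$date.rec]

# 1 where a later activity exists, 0 otherwise.
rec$tie <- as.integer(rec$row_id %in% hits)
rec$row_id <- NULL

print(rec)
```

On this dummy data, rows 2 and 4 of rec get tie = 1 and the rest get 0. Because the join matches on values rather than on row position, it does not matter where the Receiver and bookId pair sits in bookinfo.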

Ollie Perkins
  • Oh! Sorry about that; thanks for answering with a more reproducible example for future viewers. A few follow-up questions - why is Sender assigned strings of characters and Receiver is assigned an array? Also, is it necessary to write out the individual titles of every unique string for bookId? Thanks:) – c. lam Apr 29 '18 at 23:23
  • Sorry, another question... I notice in your function the conditionals match up row by row. What would you do in the case that the two dataframes do not necessarily match up row-by-row? For example, Sender, Receiver, and BookId in `rec` could be row 3, but Receiver's entry with that BookId in `bookinfo` could be row 10. – c. lam Apr 30 '18 at 00:18
  • No problem. I'll take a look at this this evening (GMT). – Ollie Perkins Apr 30 '18 at 08:46
  • Hi there - don't read too much into the dummy data. I did this quickly and without much logic... I have written a loop for you that will iterate over all rows of the second data frame. If that now works it would be great if you could upvote and / or accept the answer. :) – Ollie Perkins May 01 '18 at 17:58