2

Goodafternoon!

I am having some trouble with my dataset. I am using a Google AdWords export for data analysis and I want to fit a logit regression model to the data to determine whether an experiment I have conducted impacts the conversion.

The problem is that the data is aggregated and to be able to perform logit regression, the dependent variable needs to be binary. So Instead of having a data point with (e.g.) 10 impressions, 5 Clicks and 2 conversions, I want 10 datapoints of which 5 are clicked on of which 2 have converted.

So I want to go from a data frame that looks like this (very simplified)

| Keyword      | Impressions | Clicks     | Conversions |
| SampleName   |      10     |      5     |     2       |

to this:

| Keyword      | Clicked     | Converted   |
| SampleName   |      1      |      1      |
| SampleName   |      1      |      1      |
| SampleName   |      1      |      0      |
| SampleName   |      1      |      0      |
| SampleName   |      1      |      0      |
| SampleName   |      0      |      0      |
| SampleName   |      0      |      0      |
| SampleName   |      0      |      0      |
| SampleName   |      0      |      0      |
| SampleName   |      0      |      0      |

How would I be able to do this for a very large dataset? I have looked everywhere, but I can't seem to find the solution. I would prefer to use R to do this, but I also have Excel and Stata installed.

Thanks in advance!

Edit Here is some code (extended with an extra row and column) for the data frame. I am quite new to R and this platform. This probably isn't the most clean way to code this, but here it goes:

Key <- c("Sample1", "Sample2")
Imp <- c(10, 6)
Cli <- c(5, 3)
Con <- c(2, 1)
CPC <- c(0.26, 0.15)
df1 <- data.frame(Key, Imp, Cli, Con, CPC)
colnames(df1) <- c("Keyword", "Impressions", "Clicks", "Conversions", "CostPerClick")

Also, I am now running into the problem that things like average costs per click need to be repeated for clicks, since for each click a price is paid. So in the end, I need a dataframe that looks like this:

| Keyword   | Clicked     | Converted   |     CPC     |
| Sample1   |      1      |      1      |     0.26    |
| Sample1   |      1      |      1      |     0.26    |
| Sample1   |      1      |      0      |     0.26    |
| Sample1   |      1      |      0      |     0.26    |
| Sample1   |      1      |      0      |     0.26    |
| Sample1   |      0      |      0      |     0.00    |
| Sample1   |      0      |      0      |     0.00    |
| Sample1   |      0      |      0      |     0.00    |
| Sample1   |      0      |      0      |     0.00    |
| Sample1   |      0      |      0      |     0.00    |
| Sample2   |      1      |      1      |     0.15    |
| Sample2   |      1      |      0      |     0.15    |
| Sample2   |      1      |      0      |     0.15    |
| Sample2   |      0      |      0      |     0.00    |
| Sample2   |      0      |      0      |     0.00    |
| Sample2   |      0      |      0      |     0.00    |

Edit 2 (SOLVED)

akrun's solution seems to be right one when tested on the sample dataset, but if I try to test in on my actual dataset, it is giving the following error:

> result <- setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)), 
+  Converted=rep(c(1,0), c(Conversions, Impressions-Conversions)), 
+  CPC=rep(c(CostPerClick, 0), c(Clicks,Impressions-Clicks))), Keyword]
Error in rep(c(1, 0), c(Clicks, Impressions - Clicks)) : 
  invalid 'times' argument

The keywords don't contain any duplicates and the data does not have NA's:

> length(unique(df1$Keyword))
[1] 186145
> nrow(df1)
[1] 186145
> nrow(df1[complete.cases(df1),]) == nrow(df1)
[1] TRUE

a summary of the data:

> summary(df1)
   Keyword           Impressions          Clicks        Conversions       CostPerClick  
 Length:186145      Min.   :   1.00   Min.   : 1.000   Min.   :0.00000   Min.   :0.010  
 Class :character   1st Qu.:   7.00   1st Qu.: 1.000   1st Qu.:0.00000   1st Qu.:0.130  
 Mode  :character   Median :  16.00   Median : 1.000   Median :0.00000   Median :0.210  
                    Mean   :  32.93   Mean   : 2.167   Mean   :0.03368   Mean   :0.246  
                    3rd Qu.:  39.00   3rd Qu.: 2.000   3rd Qu.:0.00000   3rd Qu.:0.320  
                    Max.   :1521.00   Max.   :91.000   Max.   :4.00000   Max.   :3.680 
Charlotte
  • 33
  • 7

1 Answers1

3

Try

library(data.table)
setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)),
 Converted=rep(c(1,0), c(Conversions, Impressions-Conversions))) , Keyword]
#       Keyword Clicked Converted
# 1: SampleName       1         1
# 2: SampleName       1         1
# 3: SampleName       1         0
# 4: SampleName       1         0
# 5: SampleName       1         0
# 6: SampleName       0         0
# 7: SampleName       0         0
# 8: SampleName       0         0
# 9: SampleName       0         0
#10: SampleName       0         0

Update

Using the updated dataset in the OP's post

setDT(df1)[, list(Clicked=rep(c(1,0), c(Clicks, Impressions-Clicks)), 
 Converted=rep(c(1,0), c(Conversions, Impressions-Conversions)), 
 CPC=rep(c(CostPerClick, 0), c(Clicks,Impressions-Clicks))), Keyword]
#    Keyword Clicked Converted  CPC
# 1: Sample1       1         1 0.26
# 2: Sample1       1         1 0.26
# 3: Sample1       1         0 0.26
# 4: Sample1       1         0 0.26
# 5: Sample1       1         0 0.26
# 6: Sample1       0         0 0.00
# 7: Sample1       0         0 0.00
# 8: Sample1       0         0 0.00
# 9: Sample1       0         0 0.00
#10: Sample1       0         0 0.00
#11: Sample2       1         1 0.15
#12: Sample2       1         0 0.15
#13: Sample2       1         0 0.15
#14: Sample2       0         0 0.00
#15: Sample2       0         0 0.00
#16: Sample2       0         0 0.00

data

 df1 <- structure(list(Keyword = "SampleName", Impressions = 10L, 
 Clicks = 5L, 
 Conversions = 2L), .Names = c("Keyword", "Impressions", "Clicks", 
 "Conversions"), class = "data.frame", row.names = c(NA, -1L))
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Thanks! This is exactly what I meant. I am trying to apply this to my larger dataset that has 21 variables and 186145 observations. However, I am having trouble with some of the numeric variables. Like "Average cost per click". This should be repeated for clicked = 1, but for clicked = 0 these should be zero. Is there a way to build this in? – Charlotte Jun 16 '15 at 07:45
  • @Charlotte Sorry, I didn't understand from your description. It is better to update your post with a small example and the expected result as you showed before. – akrun Jun 16 '15 at 07:48
  • I just edited my post. I hope that it is clearer now. – Charlotte Jun 16 '15 at 08:00
  • @Charlotte It is better to have column names without space i.e. `Cost_per_Click` is okay. When you read the dataset with `read.csv/read.table`, if you don't use `check.names=FALSE`, it will remove the space and replace it with `.` or so. – akrun Jun 16 '15 at 08:03
  • Thanks for the tip, I changed it to CamelCase. – Charlotte Jun 16 '15 at 08:08
  • @Charlotte I think you need to add `CPC=rep(c(CostPerClick, 0), c(Clicks,Impressions-Clicks))` within the `list` – akrun Jun 16 '15 at 08:13
  • @Charlotte If you have several columns that needs to be done with the same way, we can loop it with `lapply(.SD, ...)` and specify the columns using `.SDcols=columnnumber` inside the data.table – akrun Jun 16 '15 at 08:36
  • Thanks for all the help! With the sample everything works exactly like I want it to, even if I add an extra row manually. However, when I take the same variables from my dataset I get: `Error in rep(c(1, 0), c(Clicks, Impressions - Clicks)) : invalid 'times' argument` I think there's something wrong in the data, so I'll keep looking. I think I have enough to work with here. – Charlotte Jun 16 '15 at 09:01
  • @Charlotte Do you have duplicate rows for `Keyword`? It would be better if you update with a small example `4-5` rows that show the error message – akrun Jun 16 '15 at 09:03
  • Nope, the dataframe has 186145 rows and also 186145 unique values for Keyword. I'll make another edit in the original post in a minute. – Charlotte Jun 16 '15 at 09:18
  • @Charlotte Can you run on smaller subsets and check where it cause the error – akrun Jun 16 '15 at 09:49
  • @Charlotte may be you have NAs in the dataset. Suppose, if I change `df1$Clicks[2] <- NA` I get the error `Error in rep(c(1, 0), c(Clicks, Impressions - Clicks)) : invalid 'times' argument`. You could remove those rows with NA values.. ie. `df1[complete.cases(df1),]` before running the code or remove the rows that correspond to NAs in `Clicks` i.e. `df1[!is.na(df1$Clicks),]` – akrun Jun 16 '15 at 10:00
  • I already checked for NA's, but there aren't any `nrow(df1[complete.cases(df1),]) == nrow(df1)` returns `TRUE`. – Charlotte Jun 16 '15 at 10:20
  • @Charlotte That is strange. Any other values like `Inf`, `-Inf` etc that is out of the norm.? – akrun Jun 16 '15 at 10:21
  • @Charlotte Another case where it can fail is when the value is larger than the Impressions. For example. `df1$Clicks[2] <- 15`. You can check `sum(df1$Clicks>df1$Impressions)` – akrun Jun 16 '15 at 10:27
  • 1
    Nope, but I finally found the problem. Sometimes Clicks > Impressions and then it gives an error. It happens in less than 1% of the cases, usually when a keyword has not received many impressions to begin with. Removing these instances solves the problem and your solution works perfectly! Thanks! – Charlotte Jun 16 '15 at 10:43
  • @Charlotte Thanks for the feedback. Glad that you were able to find the problem. – akrun Jun 16 '15 at 10:46