I am working on a PageRank project and need to calculate, for each page, the probability of following each of its links. The data is in a text file called pagerank.txt, which is too large to open in Excel, and I am having trouble finding a way to calculate the probabilities. What is the most efficient way to do this? Below are two examples: the first with the probabilities added, the second without. For example, each link from page 0 has a probability of 25% because page 0 has 4 links. I have access to Python, R, SQL, and Excel.

0  1  0.25
0  2  0.25
0  3  0.25
0  4  0.25

1  2  
1  3  
1  4  
1  5  
1  6 

1 Answer

In R, you can do this with base R, dplyr, or data.table.

Assuming you have read the data into R as a data frame df whose two columns are V1 and V2, you can do:

#Base R
# ave() returns, for every row, the number of rows sharing its V1 (source page)
transform(df, prob = 1/ave(V2, V1, FUN = length))

#  V1 V2 prob
#1  0  1 0.25
#2  0  2 0.25
#3  0  3 0.25
#4  0  4 0.25
#5  1  2 0.20
#6  1  3 0.20
#7  1  4 0.20
#8  1  5 0.20
#9  1  6 0.20

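#dplyr
library(dplyr)
# n() is the number of rows in the current V1 group; ungroup() drops the grouping afterwards
df %>% group_by(V1) %>% mutate(prob = 1/n()) %>% ungroup()

#data.table
library(data.table)
# .N is the number of rows in each V1 group; := adds the column by reference
setDT(df)[, prob := 1/.N, by = V1]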
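
For the full pagerank.txt, the reading step and the base R calculation could be combined roughly as follows. This is only a sketch: it assumes the file has two whitespace-separated integer columns (source page, target page) and no header row, and both the helper name add_link_prob and the output file name are made up for illustration.

```r
# Hypothetical helper: read an edge list and attach a per-link probability.
# Assumption: two whitespace-separated integer columns, no header row.
add_link_prob <- function(path) {
  # The default sep = "" splits on any run of whitespace and ignores trailing blanks
  edges <- read.table(path, header = FALSE, col.names = c("V1", "V2"))
  # Each link leaving a page gets 1 / (number of links leaving that page)
  edges$prob <- 1 / ave(edges$V2, edges$V1, FUN = length)
  edges
}

# Usage (input name from the question; output name is hypothetical):
# df <- add_link_prob("pagerank.txt")
# write.table(df, "pagerank_prob.txt", row.names = FALSE,
#             col.names = FALSE, quote = FALSE)
```

If read.table turns out to be too slow for the real file, the same grouping logic should carry over to the data.table version above once the file has been read in some faster way.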

data

df <- structure(list(V1 = c(0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L), V2 = c(1L, 
2L, 3L, 4L, 2L, 3L, 4L, 5L, 6L)), class = "data.frame", row.names = c(NA, -9L))
Ronak Shah