I am working with the word2vec data from the Spanish Billion Words Corpus and Embeddings project.
The dataset looks like this:
v1 v2 v3
once 0.1 0.2
upon 0.3 0.4
a 0.5 0.6
time 0.7 0.8
... + thousands of lines and columns ...
This is my code to read the data. The important part is the last two lines:
#install.packages("R.utils")
#install.packages("readr")
#install.packages("jsonlite")
library(R.utils)
library(readr)
library(jsonlite)
##############################
# word vectors in text format
##############################
url <- "http://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.txt.bz2"
compressed <- "SBW-vectors-300-min5.txt.bz2"
file <- "SBW-vectors-300-min5.txt"
file2 <- "SBW-vectors-300-min5.RData"
if (!file.exists(compressed)) {
  print("downloading")
  download.file(url, compressed, method = "curl")
}
if (!file.exists(file) && file.exists(compressed)) {
  bunzip2(compressed, file, remove = FALSE, skip = TRUE)
}
SBW_vectors_300_min5 <- readr::read_lines(file, skip = 1, n_max = -1L)
SBW_vectors_300_min5_df = as.data.frame(do.call(rbind, strsplit(SBW_vectors_300_min5, split = " ")),
stringsAsFactors=FALSE)
When I run
SBW_vectors_300_min5 <- readr::read_lines(file, skip = 1, n_max = -1L)
SBW_vectors_300_min5_df = as.data.frame(do.call(rbind, strsplit(SBW_vectors_300_min5, split = " ")),
stringsAsFactors=FALSE)
I get the error "invalid subscript type 'list'".
However, this version runs:
SBW_vectors_300_min5 <- readr::read_lines(file, skip = 1, n_max = -1L)
SBW_vectors_300_min5 = strsplit(SBW_vectors_300_min5, split = " ")
SBW_vectors_300_min5 = as.data.frame(SBW_vectors_300_min5)
I went with the last option because it runs without error messages. The problem is that the columns are read as factors, and I cannot run Principal Component Analysis on factors unless I convert them like this:
# find the factor columns and convert them in place, keeping the data frame
index <- sapply(SBW_vectors_300_min5, is.factor)
SBW_vectors_300_min5[index] <- lapply(SBW_vectors_300_min5[index], function(x) as.numeric(as.character(x)))
That took more than 24 hours to compute. To make this reproducible, I need something much more efficient.
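To illustrate what the working version produces, here is a small sketch on a toy stand-in for the file (the four sample rows from above, not the real data). as.data.frame() on a list turns each list element into a column, so every line of the file ends up as one column rather than one row. That would mean one column per word on the real file, which may be part of why the conversion is so slow, although I have not profiled it. stringsAsFactors = TRUE is set explicitly here only to reproduce the factor columns on R versions where it is no longer the default.

```r
# Toy stand-in for the lines read from the file (not the real data).
toy_lines <- c("once 0.1 0.2", "upon 0.3 0.4", "a 0.5 0.6", "time 0.7 0.8")

# One character vector per line: word first, then the values.
parts <- strsplit(toy_lines, split = " ")

# Each list element becomes a COLUMN of the data frame,
# so the 4 input lines give 4 columns of 3 rows each.
wide <- as.data.frame(parts, stringsAsFactors = TRUE)

dim(wide)                 # rows = fields per line, columns = lines
sapply(wide, is.factor)   # every column is a factor
```

So the word and its numbers share a column, and the layout is transposed compared with the sample shown at the top.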
How can I make
SBW_vectors_300_min5_df = as.data.frame(do.call(rbind, strsplit(SBW_vectors_300_min5, split = " ")), stringsAsFactors=FALSE)
work?
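For reference, this is a minimal sketch of the result I am after, run on the four sample rows instead of the real file. The names toy_lines, words, vectors and vectors_df are only for illustration. It assumes every line splits into the same number of fields. If some lines do not, rbind() recycles the shorter ones with a warning. I do not know whether that is related to the error I get on the full file.

```r
# Toy stand-in for the lines read from the file (not the real data).
toy_lines <- c("once 0.1 0.2", "upon 0.3 0.4", "a 0.5 0.6", "time 0.7 0.8")

# Split each line on single spaces and stack the pieces row-wise.
# The result is a character matrix with one row per word.
m <- do.call(rbind, strsplit(toy_lines, split = " ", fixed = TRUE))

# First column holds the words, the remaining columns hold the values.
words   <- m[, 1]
vectors <- m[, -1, drop = FALSE]

# Convert the whole matrix to numeric in one step instead of column by column.
storage.mode(vectors) <- "double"
rownames(vectors) <- words

# Optional: a data frame with numeric columns, no factors involved.
vectors_df <- as.data.frame(vectors)
str(vectors_df)
```

A numeric matrix like vectors could be passed to prcomp() directly, so the data frame step may not even be needed for PCA.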