
I am working with the word2vec data available from the Spanish Billion Words Corpus and Embeddings.

The dataset looks like this:

v1   v2  v3 
once 0.1 0.2
upon 0.3 0.4
a    0.5 0.6
time 0.7 0.8
... + thousands of lines and columns ...

This is my code to read the data; the important part is the last two lines:

#install.packages("R.utils")
#install.packages("readr")
#install.packages("jsonlite")
library(R.utils)
library(readr)
library(jsonlite)

##############################

# word vectors in text format 

##############################

url <- "http://cs.famaf.unc.edu.ar/~ccardellino/SBWCE/SBW-vectors-300-min5.txt.bz2"
compressed <- "SBW-vectors-300-min5.txt.bz2"
file <- "SBW-vectors-300-min5.txt"
file2 <- "SBW-vectors-300-min5.RData"

if(!file.exists(compressed)) {
  print("downloading")
  download.file(url, compressed, method="curl")
}

if(!file.exists(file) & file.exists(compressed)) {
  bunzip2(compressed, file, remove = FALSE, skip = TRUE)
}

SBW_vectors_300_min5 <- readr::read_lines(file, skip = 1, n_max = -1L)
SBW_vectors_300_min5_df = as.data.frame(do.call(rbind, strsplit(SBW_vectors_300_min5, split = " ")), 
                                        stringsAsFactors=FALSE)

When I run

SBW_vectors_300_min5 <- readr::read_lines(file, skip = 1, n_max = -1L)
SBW_vectors_300_min5_df = as.data.frame(do.call(rbind, strsplit(SBW_vectors_300_min5, split = " ")), 
                                            stringsAsFactors=FALSE)

I get the error `invalid subscript type 'list'`.

But if I do

SBW_vectors_300_min5 <- readr::read_lines(file, skip = 1, n_max = -1L)
SBW_vectors_300_min5 = strsplit(SBW_vectors_300_min5, split = " ")
SBW_vectors_300_min5 = as.data.frame(SBW_vectors_300_min5)
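On a tiny made-up sample (the two lines below are stand-ins, not the real file), the behaviour of this "working" version can be reproduced like this. Note that `as.data.frame()` applied to the `strsplit()` list puts each input line in a *column*, and with `stringsAsFactors = TRUE` (the pre-R-4.0 default) the columns come back as factors:

```r
# Toy two-line sample (made up) showing what this version returns:
# each split line becomes a COLUMN, and the columns are factors.
lines <- c("once 0.1 0.2", "upon 0.3 0.4")
df <- as.data.frame(strsplit(lines, split = " "), stringsAsFactors = TRUE)
str(df)  # 3 obs. of 2 factor variables -- transposed, and all factors
```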

I went with the last option because it runs without error messages. The problem with that solution is that the columns are read as factors, and with factors I cannot do Principal Component Analysis unless I do this:

index <- sapply(SBW_vectors_300_min5, is.factor)
SBW_vectors_300_min5[index] <- lapply(SBW_vectors_300_min5[index], function(x) as.numeric(as.character(x)))

That conversion took more than 24 hours to compute. If I want to make this reproducible, I need something much more efficient.
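On the same kind of toy input (made-up lines, not the real file), a much cheaper route is to keep the result of `do.call(rbind, strsplit(...))` as a character matrix and convert all the numeric cells in one vectorised call, instead of looping over factor columns:

```r
# Made-up two-line sample standing in for the read_lines() output.
lines <- c("once 0.1 0.2", "upon 0.3 0.4")
m <- do.call(rbind, strsplit(lines, split = " ", fixed = TRUE))
words <- m[, 1]                                       # first column: the words
vecs  <- matrix(as.numeric(m[, -1]), nrow = nrow(m))  # one vectorised conversion
```

This never creates factors in the first place, so the 24-hour `as.numeric(as.character(x))` loop disappears entirely.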

How can I make `SBW_vectors_300_min5_df <- as.data.frame(do.call(rbind, strsplit(SBW_vectors_300_min5, split = " ")), stringsAsFactors = FALSE)` work?

pachadotdev
  • How many levels do you have for the categorical attributes? Are the categories real numbers or actually factors, on which you intend to apply PCA? – phoxis Sep 19 '16 at 16:40
  • Why are you using `read_lines` and then `strsplit`ting on `" "` and then converting to data frame? Wouldn't it be more direct to use `read_delim(..., delim = " ")`, straight from file to data frame? – Gregor Thomas Sep 19 '16 at 16:44
  • You can even use the `col_types` argument to ensure your columns are of the types you want. – Gregor Thomas Sep 19 '16 at 16:45
  • @phoxis those are real numbers instead of factors – pachadotdev Sep 19 '16 at 17:11
  • @Gregor I tried and the computer hangs :S – pachadotdev Sep 19 '16 at 17:12
  • In general, you might want to try *not* reading the full file until you have a system that works. Set `n_max = 100` and debug while only using the first 100 rows. – Gregor Thomas Sep 19 '16 at 17:18
  • Thanks @Gregor. I did that 2 days ago with a small test sample and when I found something that did work I did read the full file. – pachadotdev Sep 19 '16 at 18:58
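The `read_delim()` route suggested in the comments can be sketched like this, shown on a tiny temporary file rather than the real corpus. The header line and rows below are made up, and the column names `X1`, `X2`, … are readr's defaults when `col_names = FALSE`:

```r
library(readr)

# Tiny stand-in file: a fake word2vec header line plus two made-up rows.
tmp <- tempfile(fileext = ".txt")
writeLines(c("4 2", "once 0.1 0.2", "upon 0.3 0.4"), tmp)

# Straight from file to data frame: skip the header, keep column 1 as
# character (the word) and force every other column to double.
vecs <- read_delim(tmp, delim = " ", skip = 1, col_names = FALSE,
                   col_types = cols(X1 = col_character(),
                                    .default = col_double()))
```

With `col_types` fixed up front there is no factor-to-numeric cleanup step afterwards, which is the expensive part of the original approach.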
