
I am working on a project in R that calls fasttext from the command line, and I am not sure how to load the output that fasttext gives me as a data frame.

> data.train<-data.frame(index=c(rep("__label__1",3),rep("__label__2",3)),country=c("ENGLAND","BRITAIN","UNITED KINDOM","USA","AMERICA","UNITED STATES"))

> data.train
       index       country
1 __label__1       ENGLAND
2 __label__1       BRITAIN
3 __label__1 UNITED KINDOM
4 __label__2           USA
5 __label__2       AMERICA
6 __label__2 UNITED STATES

> data.test<-c("EGLND","MURICA")

> data.test
[1] "EGLND"  "MURICA"

> write.table(data.train,"data.train.txt",sep="\t",quote=FALSE,row.names=FALSE,col.names=FALSE)
> 

> write.table(data.test,"data.test.txt",sep="\t",quote=FALSE,row.names=FALSE,col.names=FALSE)
> 
> system("fasttext supervised -input data.train.txt -output model_data")
Read 0M words
Number of words:  8
Number of labels: 2
Progress: 0.0%  words/sec/thread: 103000  lr: 0.100000  loss: 0.672343  eta: -596523h-14m Progress: 100.0%  words/sec/thread: 103000  lr: 0.000000  loss: 0.672343  eta: 0h0m 
Saving model file.

> system("fasttext predict-prob model_data.bin data.test.txt 2")

__label__1 0.5 __label__2 0.498047
__label__1 0.5 __label__2 0.498047

> res<-system("fasttext predict-prob model_data.bin data.test.txt 2", intern=TRUE)

> res
[1] "__label__1 0.5 __label__2 0.498047" "__label__1 0.5 __label__2 0.498047"

The original system call simply prints the fasttext output to the console, which was the problem. As suggested in the comments, intern=TRUE lets me save the output to the variable res, but res is just a character vector of strings. What I actually need is a data frame of probabilities for each label, like this:

> want
  __label__1 __label__2
1        0.5   0.498047
2        0.5   0.498047

The question "Fasttext how to load a .csv column into model.predict" answers something similar, but for Python, and I need to do this in R.

astel
  • Maybe use `intern = TRUE`. Have you tried using `fread` from "data.table" yet? – A5C1D2H2I1M1N2O1R2T1 Dec 03 '20 at 17:06
  • Looks like intern=TRUE does mostly what I want in that it allows me to store the output as a list in R however each row is stored as a string that I will have to parse into columns later which I think I might be able to do. I hadn't tried fread though, not sure at what point I would do that, can you explain? – astel Dec 03 '20 at 19:00
  • `fread` (from data.table) should be able to read from system commands. If you `dput` the `head` of the data you managed to read in using `system` and show your desired output, it would be easier to help out. – A5C1D2H2I1M1N2O1R2T1 Dec 03 '20 at 19:24
  • Ok I edited the question to include a reproducible example and show what the end result that I want is – astel Dec 03 '20 at 19:57
  • What I had in mind with `fread` would be something like: `library(data.table); fread(cmd = "fasttext predict-prob model_data.bin data.test.txt 2")`. Looking at the resulting string in your sample data, you'd end up with a 4-column `data.table`. I don't know enough about the fasttext format to say whether this is a good answer or not. For instance, will there always be the same number of labels per element of `res`? Will the labels always be the same? – A5C1D2H2I1M1N2O1R2T1 Dec 03 '20 at 21:49
  • For instance, with the reprex you've shared, the following would work: `x <- fread(cmd = "fasttext predict-prob model_data.bin data.test.txt 2"); ind <- rep(c(FALSE, TRUE), length.out = ncol(x)); x <- setnames(x[, ..ind], sapply(x[, !ind, with = FALSE], `[[`, 1))[]; x` but I'm not sure whether that solution generalizes to the fasttext data structure in general. Hope this helps! – A5C1D2H2I1M1N2O1R2T1 Dec 03 '20 at 21:54
  • In fasttext output the number of labels varies, but it is controlled by the last argument in the call. For example, you could have 10 labels, but my example code has the number 2 at the end, which shows only the top 2 labels. The label names can also vary but will always start with `__label__`, so you could have a series of labels such as `__label__1`, `__label__03`, etc. – astel Dec 03 '20 at 22:34
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/225497/discussion-between-a5c1d2h2i1m1n2o1r2t1-and-astel). – A5C1D2H2I1M1N2O1R2T1 Dec 03 '20 at 22:36
  • When I try running the code you provided with the `fread` call, I get an error that UNC paths are not supported and that it is switching to a Windows directory; it then fails to find fasttext. The link to the chat is broken. – astel Dec 03 '20 at 22:36

1 Answer


Assuming you already have a character vector `res` from `system(..., intern = TRUE)`, you can try the following.

res3 <- c("__label__1 0.500768 __label__2 0.499252", 
          "__label__2 0.500768 __label__1 0.499252",
          "__label__3 1")

library(data.table)
x <- fread(text = res3, fill = TRUE)
# rename the columns in "variable"/"value" pairs and add a row indicator
# (the pair index uses seq_len(ncol(x) / 2) so it also works for more than 2 labels)
setnames(x, paste0(rep(c("var_", "val_"), length.out = ncol(x)), 
                   rep(seq_len(ncol(x) / 2), each = 2)))[, row := .I][]
# melt the data into a long form and cast it into a wide form
out <- melt(x, measure = patterns("var_", "val_"), na.rm = TRUE)[
  , dcast(.SD, row ~ value1, value.var = "value2")]
out
#    row __label__1 __label__2 __label__3
# 1:   1   0.500768   0.499252         NA
# 2:   2   0.499252   0.500768         NA
# 3:   3         NA         NA          1

You can add fill = 0 to dcast if you want to replace NA by 0 in your output.
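
If you would rather not depend on data.table, the same reshaping can be sketched in base R. This is only a minimal sketch: it assumes each element of res is a whitespace-separated run of alternating label/probability tokens, as in the example above. The helper name parse_fasttext_probs and its fill argument are hypothetical, made up for this illustration; they are not part of fasttext or any package.

```r
# Hypothetical helper (not part of fasttext or data.table): turn lines of
# "label prob label prob ..." into a wide data frame, one column per label.
parse_fasttext_probs <- function(lines, fill = NA_real_) {
  # split each line on whitespace into alternating label / probability tokens
  tokens <- strsplit(trimws(lines), "[[:space:]]+")
  # odd positions hold the labels, even positions hold the probabilities
  labels <- lapply(tokens, function(tk) tk[seq_along(tk) %% 2 == 1])
  probs  <- lapply(tokens, function(tk) as.numeric(tk[seq_along(tk) %% 2 == 0]))

  # one column per distinct label, pre-filled with the `fill` value
  all_labels <- sort(unique(unlist(labels)))
  out <- matrix(fill, nrow = length(lines), ncol = length(all_labels),
                dimnames = list(NULL, all_labels))

  # place each probability in the column named after its label
  for (i in seq_along(lines)) {
    if (length(labels[[i]]) > 0) {
      out[i, labels[[i]]] <- probs[[i]]
    }
  }
  as.data.frame(out)
}

res3 <- c("__label__1 0.500768 __label__2 0.499252",
          "__label__2 0.500768 __label__1 0.499252",
          "__label__3 1")
out_base <- parse_fasttext_probs(res3)
out_base
```

Calling parse_fasttext_probs(res3, fill = 0) should put 0 in place of NA for labels that were not returned for a given row. Because columns are matched by label name rather than by position, the order of labels within a line does not matter.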

A5C1D2H2I1M1N2O1R2T1
  • It appears that this no longer works. When I try to call x <- fread(text = res3, fill = TRUE) I get an error that text = res3 is an unused argument. Maybe an update to the data.table package? – astel Nov 10 '22 at 19:29