Does someone have a very well-explained example of getting data into a format usable by xgboost in R?
The get started doc doesn't help me. The data (agaricus.train and agaricus.test) are already in a specialized format (dgCMatrix):
> str(agaricus.train)
List of 2
$ data :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
.. ..@ i : int [1:143286] 2 6 8 11 18 20 21 24 28 32 ...
.. ..@ p : int [1:127] 0 369 372 3306 5845 6489 6513 8380 8384 10991 ...
.. ..@ Dim : int [1:2] 6513 126
.. ..@ Dimnames:List of 2
.. .. ..$ : NULL
.. .. ..$ : chr [1:126] "cap-shape=bell" "cap-shape=conical" "cap-shape=convex" "cap-shape=flat" ...
.. ..@ x : num [1:143286] 1 1 1 1 1 1 1 1 1 1 ...
.. ..@ factors : list()
$ label: num [1:6513] 1 0 0 1 0 0 0 1 0 0 ...
I saw this example code use sparse.model.matrix, but I'm still having a hard time putting together fairly plain data into the format xgboost needs.
For example, suppose I have two data frames: words and data_label.
The words data frame has sentence_id and word_id, with one or more words per sentence.
The data_label data frame has sentence_id and label (say, 0 or 1 for a binary classification task).
How do I get that data into a format to predict the label for a sentence?
I can split train and test.
Edit: The simplest version of words and data_label:
words <- data.frame(sentence_id=c(1, 1, 2, 2, 2),
                    word_id=c(1, 2, 1, 3, 4))
data_label <- data.frame(sentence_id=c(1, 2), label=c(0, 1))
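To make the target shape concrete, here is a minimal sketch of what I think is needed, not a confirmed recipe: one row per sentence and one column per word_id, stored as a dgCMatrix like agaricus.train$data. It assumes word_id values are small positive integers that can be used directly as column indices, and that every sentence_id in words also appears in data_label. The names row_idx, X and dtrain are only for illustration.

```r
library(Matrix)
library(xgboost)

words <- data.frame(sentence_id = c(1, 1, 2, 2, 2),
                    word_id = c(1, 2, 1, 3, 4))
data_label <- data.frame(sentence_id = c(1, 2), label = c(0, 1))

# Map each word occurrence to the row of its sentence in data_label,
# so that matrix rows and labels are in the same order.
row_idx <- match(words$sentence_id, data_label$sentence_id)

# Bag-of-words matrix: rows are sentences, columns are word ids.
# Assumption: word_id can serve directly as the column index.
# Repeated (sentence, word) pairs are summed, giving word counts.
X <- sparseMatrix(i = row_idx,
                  j = words$word_id,
                  x = 1,
                  dims = c(nrow(data_label), max(words$word_id)))

# Wrap the sparse matrix and the labels in xgboost's own container.
dtrain <- xgb.DMatrix(data = X, label = data_label$label)
```

If this is the right shape, dtrain could then presumably be passed to xgb.train with an objective such as "binary:logistic", and the test split would need to be built with the same number of columns.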