I'm working on a text classification problem: classifying tweets into one of three labels. My dataset has two columns: a Score column with the value 0 (negative), 1 (positive), or 2 (neutral), and a Statement column with the tweet text.
I used this example to create my vocabulary in R, but because I have 3 classes, I used a sequential model, set the last layer of my model to softmax, and used categorical_crossentropy in compile().
First, as I saw in that example, I converted my Score column into a one-hot matrix, so a negative tweet becomes [1,0,0], a positive one [0,1,0], and a neutral one [0,0,1].
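To double-check that mapping, I tried a small base-R sketch (the sample scores here are made up) that reproduces what to_categorical does without needing keras:

```r
# Toy check of the one-hot mapping: 0 -> [1,0,0], 1 -> [0,1,0], 2 -> [0,0,1]
score <- c(0, 1, 2, 1)           # example labels: negative, positive, neutral, positive
one_hot <- diag(3)[score + 1, ]  # index rows of the 3x3 identity matrix
print(one_hot)                   # each row is the one-hot vector for that score
```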
Here is my code:
library(keras)
library(dplyr)

df <- read.csv("tweets_data.csv", stringsAsFactors = FALSE)
df %>% count(Score)
df$Score <- to_categorical(df$Score, num_classes = 3) # one-hot matrix, one row per tweet
head(df)
df$Statement[1]
training_id <- sample.int(nrow(df), size = nrow(df)*0.8)
training <- df[training_id,]
testing <- df[-training_id,]
#distribution of the number of words in each tweet.
df$Statement %>%
strsplit(" ") %>%
sapply(length) %>%
summary()
#Alternatively, we can pad the arrays so they all have the same length,
#then create an integer tensor of shape num_examples * max_length. We can
#use an embedding layer capable of handling this shape as the first layer in our network.
num_words <- 10000
max_length <- 50
text_vectorization <- layer_text_vectorization(
max_tokens = num_words,
output_sequence_length = max_length
)
text_vectorization %>%
adapt(df$Statement)
#see the vocabulary
get_vocabulary(text_vectorization)
text_vectorization(matrix(df$Statement[1], ncol = 1))
input <- layer_input(shape = c(1), dtype = "string")
output <- input %>%
text_vectorization() %>%
layer_embedding(input_dim = num_words + 1, output_dim = 16) %>%
layer_global_average_pooling_1d() %>%
layer_dense(units = 16, activation = "relu") %>%
layer_dropout(0.5) %>%
layer_dense(units = 3, activation = "softmax")
model <- keras_model(input, output)
model %>% compile(
optimizer = 'adam',
loss = 'categorical_crossentropy',
metrics = list('accuracy')
)
history <- model %>% fit(
training$Statement,
training$Score, # the one-hot matrix; my original y, as.numeric(training$Score == c(0,1,0)), triggered the error below
epochs = 10,
batch_size = 512,
validation_split = 0.2,
verbose=2
)
And it generates this error:
Error in py_call_impl(callable, dots$args, dots$keywords) :
ValueError: in user code:
File "C:\Users\Username\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\keras\engine\training.py", line 878, in train_function *
return step_function(self, iterator)
File "C:\Users\Username\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\keras\engine\training.py", line 867, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "C:\Users\Username\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\keras\engine\training.py", line 860, in run_step **
outputs = model.train_step(data)
File "C:\Users\Username\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\keras\engine\training.py", line 810, in train_step
y, y_pred, sample_weight, regularization_losses=self.losses)
File "C:\Users\Username\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\keras\engine\compile_utils.py", line 201, in __call_
I am new to TensorFlow and probably don't completely understand all the steps, so I think something is wrong with my model (when I run this example it works properly, with no error).
So my question is: what is the structure of a sequential model for text classification with multiple classes? I found plenty of resources in Python for this question, but I couldn't figure it out in R.