
I'm working on a text classification problem: classifying tweets into one of three labels. My dataset has two columns: a Score column whose value is 0 (negative), 1 (positive), or 2 (neutral), and a Statement column containing the tweet text.

I used this example to build my vocabulary in R, but because I have 3 classes, I used a sequential model, set the last layer to softmax, and used categorical_crossentropy in compile().

First, as shown in that example, I one-hot encoded my Score column into a matrix, so that a negative tweet becomes [1,0,0], a positive one [0,1,0], and a neutral one [0,0,1].
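To check my understanding of the encoding, here is the same one-hot conversion sketched in plain base R (the labels are made up; keras::to_categorical(df$Score, num_classes = 3) produces the same rows):

```r
scores <- c(0, 1, 2, 1)           # hypothetical example labels
one_hot <- diag(3)[scores + 1, ]  # row i of the identity matrix is the unit vector for class i
one_hot
# row 1: 1 0 0   (negative)
# row 2: 0 1 0   (positive)
# row 3: 0 0 1   (neutral)
# row 4: 0 1 0   (positive)
```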

Here is my code:

library(keras)
library(dplyr)  # for %>% and count()

df <- read.csv("tweets_data.csv", stringsAsFactors = FALSE)
df %>% count(Score)

df$Score <- to_categorical(df$Score, num_classes = 3)
head(df)

df$Statement[1]

training_id <- sample.int(nrow(df), size = nrow(df)*0.8)
training <- df[training_id,]
testing <- df[-training_id,]

# distribution of the number of words in each tweet
df$Statement %>% 
  strsplit(" ") %>% 
  sapply(length) %>% 
  summary()

#Alternatively, we can pad the arrays so they all have the same length, 
#then create an integer tensor of shape num_examples * max_length. We can 
#use an embedding layer capable of handling this shape as the first layer in our network.
num_words <- 10000
max_length <- 50
text_vectorization <- layer_text_vectorization(
  max_tokens = num_words,
  output_sequence_length = max_length
)

text_vectorization %>% 
  adapt(df$Statement)

#see the vocabulary
get_vocabulary(text_vectorization)

text_vectorization(matrix(df$Statement[1], ncol = 1))

input <- layer_input(shape = c(1), dtype = "string")

output <- input %>% 
  text_vectorization() %>% 
  layer_embedding(input_dim = num_words + 1, output_dim = 16) %>%
  layer_global_average_pooling_1d() %>%
  layer_dense(units = 16, activation = "relu") %>%
  layer_dropout(0.5) %>% 
  layer_dense(units = 3, activation = "softmax")

model <- keras_model(input, output)


model %>% compile(
  optimizer = 'adam',
  loss = 'categorical_crossentropy',
  metrics = list('accuracy')
)

history <- model %>% fit(
  training$Statement,
  as.numeric(training$Score == c(0,1,0)),
  epochs = 10,
  batch_size = 512,
  validation_split = 0.2,
  verbose=2
)

And it generates this error:

Error in py_call_impl(callable, dots$args, dots$keywords) : 
ValueError: in user code:

File "C:\Users\Username\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\keras\engine\training.py", line 878, in train_function *
return step_function(self, iterator)
File "C:\Users\Username\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\keras\engine\training.py", line 867, in step_function **
outputs = model.distribute_strategy.run(run_step, args=(data,))
File "C:\Users\Username\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\keras\engine\training.py", line 860, in run_step **
outputs = model.train_step(data)
File "C:\Users\Username\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\keras\engine\training.py", line 810, in train_step
y, y_pred, sample_weight, regularization_losses=self.losses)
File "C:\Users\Username\AppData\Local\r-miniconda\envs\r-reticulate\lib\site-packages\keras\engine\compile_utils.py", line 201, in __call_

I am new to TensorFlow and probably don't completely understand all the steps, so I think something is wrong with my model (when I run the original example it works properly, with no error).

So my question is: what is the correct structure of a sequential model for text classification with multiple classes? I found plenty of resources for this in Python, but I couldn't figure it out in R.
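One more thing I noticed, which may or may not be related to the error: the expression I pass as y to fit() can be checked in plain R. Using a hypothetical 3-row one-hot matrix standing in for training$Score, the comparison recycles the vector and as.numeric() flattens the result, so the n x 3 matrix shape is lost:

```r
# hypothetical stand-in for training$Score after to_categorical()
y <- matrix(c(1, 0, 0,
              0, 1, 0,
              0, 0, 1), nrow = 3, byrow = TRUE)

cmp <- y == c(0, 1, 0)   # recycled element-wise, column-major; still a 3x3 matrix
flat <- as.numeric(cmp)  # as.numeric() drops the dim attribute
length(flat)             # 9 -- a flat vector, not an n x 3 matrix
is.null(dim(flat))       # TRUE -- the one-hot shape is gone
```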

  • In case nobody answers you could try asking your question on https://datascience.stackexchange.com/. – Erwan Dec 16 '21 at 23:35
