
Currently I am working on a text classification model that assigns one of two labels to each post.

As an example:

  • "Hey how are you doing" | Approve
  • "You are really dumb" | Disapprove

The model either approves or disapproves a post, based on toxicity or similar criteria.

Now I would like to add a second layer of labels that specifies the reason why a post should be disapproved (see the sketch after the examples).

As an example:

  • "You are really dumb" | Disapprove | Flaming
  • "You can buy this here" | Disapprove | Advertisement
  • "Hey you are cool" | Approve

So now I wonder: how can I add this multi-layer label classification to my current code?

Right now my training data (data.csv) looks like this; I split the text and each label with a ³ character:

"thank you for the good idea"³Approve
"hallo wie geht es dir heute"³Foreign Language³Disapprove

My current code looks like this:

# Imports
import numpy as np
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Flatten, Dense

# Load data (assumes exactly one label per line, i.e. the old two-field format)
def load_data(file_path):
    with open(file_path, 'r', encoding='utf-8') as f:
        lines = f.readlines()
    posts, labels = [], []
    for line in lines:
        post, label = line.strip().split('³')
        posts.append(post)
        labels.append(label)
    return posts, labels

train_posts, train_labels = load_data('mixed_train_data.csv')
test_posts, test_labels = load_data('mixed_test_data.csv')

# Label mapping
label_to_index = { 
    'Approve': 0,
    'Disapprove': 1
}

# Tokenization and Padding
tokenizer = Tokenizer()
tokenizer.fit_on_texts(train_posts)
sequences = tokenizer.texts_to_sequences(train_posts)
max_sequence_length = 7500
X = pad_sequences(sequences, maxlen=max_sequence_length)
y = np.array([label_to_index[label] for label in train_labels])

test_sequences = tokenizer.texts_to_sequences(test_posts)
test_X = pad_sequences(test_sequences, maxlen=max_sequence_length)
test_y = np.array([label_to_index[label] for label in test_labels])

# Model architecture
model = Sequential([
    Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=50, input_length=max_sequence_length),  # +1 because Tokenizer indices start at 1
    Flatten(),
    Dense(128, activation='relu'),
    Dense(2, activation='softmax')
])

# Model compilation
learning_rate = 0.001
model.compile(loss='sparse_categorical_crossentropy', optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate), metrics=['accuracy'])

# Model training (placeholder checkpoint so the snippet runs; the filename is arbitrary)
checkpoint_callback = tf.keras.callbacks.ModelCheckpoint('model_checkpoint.h5', save_best_only=True)
model.fit(X, y, epochs=100, batch_size=32, validation_data=(test_X, test_y), callbacks=[checkpoint_callback])

I could use some help updating it for the multi-layer label classification, since I don't know where to start.
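
To show what I mean, here is a rough, untested sketch of the direction I imagine: the same embedding feeding two softmax heads, one for Approve/Disapprove and one for the reason (using the reason_to_index mapping from my sketch above, plus an extra "no reason" class for approved posts). I don't know if this is the right approach:

from tensorflow.keras.layers import Input, Embedding, Flatten, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(max_sequence_length,))
x = Embedding(input_dim=len(tokenizer.word_index) + 1, output_dim=50)(inputs)
x = Flatten()(x)
x = Dense(128, activation='relu')(x)

decision_output = Dense(2, activation='softmax', name='decision')(x)
reason_output = Dense(len(reason_to_index) + 1, activation='softmax', name='reason')(x)  # +1 = "no reason" class

model = Model(inputs=inputs, outputs=[decision_output, reason_output])
model.compile(
    loss={'decision': 'sparse_categorical_crossentropy', 'reason': 'sparse_categorical_crossentropy'},
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    metrics=['accuracy']
)

# Training would then need two label arrays, e.g.:
# model.fit(X, {'decision': y_decision, 'reason': y_reason}, epochs=100, batch_size=32)

Is something like that the right direction, or is there a better way to handle the second label?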
