Short Version:
I am trying to feed my data, a sparse matrix of type scipy.sparse._csr.csr_matrix, into a TensorFlow Keras neural network model. I would highly appreciate any guidance. todense() and toarray() are not options for me, and I would prefer not to feed the data in mini-batches.
Long version (including my efforts):
The problem involves a deep learning model with text, categorical and numerical features. My TfidfVectorizer creates a huge matrix that cannot be fed into a model in dense format.
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

text_cols = ['ca_name']
categorical_cols = ['cua_name', 'ca_category_modified']
numerical_cols = ['vidim1', 'vidim2', 'vidim3', 'vim', 'vid']
title_transformer = TfidfVectorizer()
numerical_transformer = MinMaxScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')
preprocessor = ColumnTransformer(
    transformers=[
        # TfidfVectorizer expects a 1-D column, hence a single name rather than a list
        ('title', title_transformer, text_cols[0]),
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
# df['dur_linreg'] is my numerical target
X_train, X_test, y_train, y_test = train_test_split(
    df[text_cols + categorical_cols + numerical_cols], df['dur_linreg'],
    test_size=0.2, random_state=42)
# fit_transform the preprocessor on X_train, only transform X_test
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)
I can build and compile a model as following:
import tensorflow as tf

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(64, activation='relu', input_shape=(X_train_transformed.shape[1],)))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='adam', loss='mean_squared_error')
But cannot fit it:
history = model.fit(X_train_transformed, y_train, epochs=20, batch_size=32, validation_data=(X_test_transformed, y_test))
InvalidArgumentError: Graph execution error: TypeError: 'SparseTensor' object is not subscriptable
This is presumably because I am passing the model a sparse scipy.sparse._csr.csr_matrix.
The size of my matrix and my limited resources prevent me from converting it to
1) dense format:
X_train_transformed.todense()
MemoryError: Unable to allocate 205. GiB for an array with shape (275189, 100074) and data type float64
2) (obviously) an array:
X_train_transformed.toarray()
MemoryError: Unable to allocate 205. GiB for an array with shape (275189, 100074) and data type float64
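As a sanity check, the 205 GiB in the error message is consistent with the matrix shape, assuming 8 bytes per float64 value and no other overhead (the variable names are mine, for illustration only):

```python
# Shape taken from the MemoryError message above.
rows, cols = 275189, 100074
bytes_per_value = 8  # float64 uses 8 bytes per element

dense_bytes = rows * cols * bytes_per_value
dense_gib = dense_bytes / 2**30  # 1 GiB = 2**30 bytes

print(f"{dense_gib:.1f} GiB")  # prints 205.2 GiB
```

So any approach that materializes the full dense matrix at once would need roughly this much memory.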
According to the post https://stackoverflow.com/questions/41538692/using-sparse-matrices-with-keras-and-tensorflow there are two approaches:
"- Keep it as a scipy sparse matrix, then, when giving Keras a minibatch, make it dense
- Keep it sparse all the way through, and use Tensorflow Sparse Tensors"
The second approach is the one I prefer, so I also tried a model with a sparse Input layer. Again, I could build and compile the model without a problem:
from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model
input_layer = Input(shape=(X_train_transformed.shape[1],), sparse=True)
dense1 = Dense(64, activation='relu')(input_layer)
dropout1 = Dropout(0.2)(dense1)
dense2 = Dense(64, activation='relu')(dropout1)
dropout2 = Dropout(0.2)(dense2)
output_layer = Dense(1)(dropout2)
model = Model(input_layer, output_layer)
model.compile(optimizer='adam', loss='mean_squared_error')
But cannot fit it:
history = model.fit(X_train_transformed, y_train, validation_data=(X_test_transformed, y_test), epochs=5, batch_size=32)
InvalidArgumentError: Graph execution error: TypeError: 'SparseTensor' object is not subscriptable
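For reference, the explicit conversion that the second approach implies can be sketched as below. This is a minimal sketch, not a confirmed fix: the helper name csr_to_sparse_tensor and the tiny demo matrix are hypothetical, the cast to float32 is my assumption, and I have not verified that passing the result to model.fit avoids the error above.

```python
import numpy as np
import scipy.sparse as sp
import tensorflow as tf


def csr_to_sparse_tensor(matrix):
    """Convert a SciPy sparse matrix to a tf.sparse.SparseTensor without densifying it."""
    coo = matrix.tocoo()  # COO format exposes row/col index arrays directly
    # SparseTensor expects an (nnz, 2) int64 array of [row, col] pairs.
    indices = np.stack([coo.row, coo.col], axis=1).astype(np.int64)
    tensor = tf.sparse.SparseTensor(
        indices=indices,
        values=coo.data.astype(np.float32),  # assumption: float32 is enough precision
        dense_shape=coo.shape,
    )
    # Many sparse ops expect indices in canonical row-major order.
    return tf.sparse.reorder(tensor)


# Tiny hypothetical stand-in for X_train_transformed.
demo = sp.csr_matrix(np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 0.0]]))
demo_tensor = csr_to_sparse_tensor(demo)
```

Only the non-zero entries and their indices are stored, so memory use scales with the number of non-zeros rather than with rows times columns.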
Lastly, in case it is relevant, I am using TensorFlow version 2.11.0, installed in January 2023.
Many thanks in advance for your help.