
Short Version:

I am trying to feed my data, a sparse matrix of type scipy.sparse._csr.csr_matrix, into a TensorFlow Keras neural network model. I would highly appreciate any guidance. todense() and toarray() are not options for me, and feeding the data in mini-batches is not preferred either.

Long version (including my efforts):

The problem concerns a deep learning model with text, categorical and numerical features. My TfidfVectorizer creates a huge matrix which cannot be fed into a model in dense format.

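import tensorflow as tf
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler, OneHotEncoder

text_cols = ['ca_name']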
categorical_cols = ['cua_name','ca_category_modified']
numerical_cols = ['vidim1', 'vidim2', 'vidim3', 'vim', 'vid']

title_transformer = TfidfVectorizer()
numerical_transformer = MinMaxScaler()
categorical_transformer = OneHotEncoder(handle_unknown='ignore')

preprocessor = ColumnTransformer(
    transformers=[
        ('title', title_transformer, text_cols[0]),
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# df['dur_linreg'] is my numerical target
X_train, X_test, y_train, y_test = train_test_split(
    df[text_cols + categorical_cols + numerical_cols],
    df['dur_linreg'], test_size=0.2, random_state=42)

# fit_transform the preprocessor on X_train, only transform X_test
X_train_transformed = preprocessor.fit_transform(X_train)
X_test_transformed = preprocessor.transform(X_test)

I can build and compile a model as follows:

model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Dense(64, activation='relu', input_shape=(X_train_transformed.shape[1],)))
model.add(tf.keras.layers.Dense(1))
model.compile(optimizer='adam', loss='mean_squared_error')

But cannot fit it:

history = model.fit(X_train_transformed, y_train, epochs=20, batch_size=32, validation_data=(X_test_transformed, y_test))

InvalidArgumentError: Graph execution error: TypeError: 'SparseTensor' object is not subscriptable

This is presumably because I am passing a sparse scipy.sparse._csr.csr_matrix directly to model.fit.

The size of my matrix and my limited resources prevent me from converting it to either of the following:

  1. dense format:
X_train_transformed.todense()

MemoryError: Unable to allocate 205. GiB for an array with shape (275189, 100074) and data type float64

  2. (obviously) array:
X_train_transformed.toarray()

MemoryError: Unable to allocate 205. GiB for an array with shape (275189, 100074) and data type float64

According to the post "https://stackoverflow.com/questions/41538692/using-sparse-matrices-with-keras-and-tensorflow" there are two approaches:

  1. "Keep it as a scipy sparse matrix, then, when giving Keras a minibatch, make it dense"
  2. "Keep it sparse all the way through, and use Tensorflow Sparse Tensors"

The second approach is the one I prefer. Therefore, I tried the following as well:

However, again I could only build and compile the model without a problem:

from tensorflow.keras.layers import Input, Dense, Dropout
from tensorflow.keras.models import Model

input_layer = Input(shape=(X_train_transformed.shape[1],), sparse=True)
dense1 = Dense(64, activation='relu')(input_layer)
dropout1 = Dropout(0.2)(dense1)
dense2 = Dense(64, activation='relu')(dropout1)
dropout2 = Dropout(0.2)(dense2)
output_layer = Dense(1)(dropout2)
model = Model(input_layer, output_layer)
model.compile(optimizer='adam', loss='mean_squared_error')

But cannot fit it:

history = model.fit(X_train_transformed, y_train, validation_data=(X_test_transformed, y_test), epochs=5, batch_size=32)

InvalidArgumentError: Graph execution error: TypeError: 'SparseTensor' object is not subscriptable
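
For reference, the conversion that the second approach implies can be sketched as below. This is a minimal sketch and not a confirmed fix: the helper name csr_to_sparse_tensor and the tiny demo matrix are assumptions made for illustration, and whether model.fit in TensorFlow 2.11 accepts the resulting tensor together with batch_size has not been verified here.

```python
import numpy as np
import scipy.sparse as sp
import tensorflow as tf


def csr_to_sparse_tensor(matrix):
    """Convert a SciPy sparse matrix to a tf.sparse.SparseTensor without densifying.

    Hypothetical helper for illustration; not part of TensorFlow or SciPy.
    """
    coo = matrix.tocoo()  # COO exposes explicit (row, col, value) triplets
    indices = np.stack([coo.row, coo.col], axis=1).astype(np.int64)
    tensor = tf.sparse.SparseTensor(
        indices=indices,
        values=coo.data.astype(np.float32),
        dense_shape=coo.shape,
    )
    # Put the indices into canonical row-major order, which many sparse ops expect.
    return tf.sparse.reorder(tensor)


# Tiny stand-in for X_train_transformed (made-up data, for illustration only).
demo = sp.csr_matrix(np.array([[0.0, 1.5, 0.0], [2.0, 0.0, 0.0]]))
demo_tensor = csr_to_sparse_tensor(demo)
```

Only the non-zero entries are copied, so memory use scales with the number of stored values rather than with rows times columns.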

Lastly, in case it is relevant, I am using TensorFlow version 2.11.0, installed in January 2023.

Many thanks in advance for your help.

Cyamc
  • I believe tensorflow has some sort of sparse tensor of its own. But for regular tensors that expect a dense numpy array, you'll have to convert the csr with `toarray` first. – hpaulj Jan 26 '23 at 20:36
  • Hi hpaulj, Thanks for your comment. Exactly, I still look for a way to convert my sparse matrix into a sparse tensor which is usable by TensorFlow. As you also mentioned, todense() and toarray() are 2 ways that I can convert my sparse matrix and feed the model. I tried them on smaller datasets and they work fine. But for actual dataset, they are not working because of this error: 'MemoryError: Unable to allocate 205. GiB for an array with shape (275189, 100074) and data type float64' – Cyamc Feb 10 '23 at 07:50
  • Given those dimensions, that dense memory requirement is to be expected: `275189*100074*8/1e9` is about `220` (GB, which is roughly 205 GiB). – hpaulj Feb 10 '23 at 08:33
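
The arithmetic in the last comment can be checked with a few lines of plain Python. This is only a sketch: the variable names are illustrative, and the shape and dtype are taken from the MemoryError quoted in the question.

```python
rows, cols = 275189, 100074   # shape reported in the MemoryError
bytes_per_value = 8           # float64 uses 8 bytes per element

dense_bytes = rows * cols * bytes_per_value
dense_gb = dense_bytes / 1e9      # decimal gigabytes, as in the comment
dense_gib = dense_bytes / 2**30   # binary gibibytes, the unit in the MemoryError

print(f"{dense_gb:.1f} GB, {dense_gib:.1f} GiB")  # prints "220.3 GB, 205.2 GiB"
```

The 220 in the comment and the 205 in the error message describe the same allocation; they differ only in the unit (GB versus GiB).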
