
I have a dataframe that contains text data and numerical features. I have vectorized the text data and plan to concatenate it with the remaining numerical data for running machine learning algorithms.

I have vectorized the text data using TF-IDF as shown below:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

vect = TfidfVectorizer(max_features=10000)
text_vect = vect.fit_transform(myDataframe['text_column'])  # sparse CSR matrix
text_vect_df = pd.DataFrame.sparse.from_spmatrix(text_vect)

text_vect_df.shape: (250000, 9300)
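As an aside, the matrix TfidfVectorizer returns is sparse, and the sparse representation is far cheaper than a dense one. A minimal sketch of the difference, using a small random scipy.sparse matrix as a stand-in for the real TF-IDF output (the dimensions and density here are hypothetical, scaled down for illustration):

```python
from scipy import sparse

# toy stand-in for the real TF-IDF output (dimensions scaled down 250x):
# a CSR matrix with ~1% non-zero entries
X = sparse.random(1000, 930, density=0.01, format="csr", random_state=0)

# memory actually held by the sparse CSR representation
sparse_mb = (X.data.nbytes + X.indices.nbytes + X.indptr.nbytes) / 1e6

# memory the same matrix would need as dense float64
dense_mb = X.shape[0] * X.shape[1] * 8 / 1e6

print(f"sparse: {sparse_mb:.2f} MB, dense: {dense_mb:.2f} MB")
```

At 1% density the sparse form is roughly 60x smaller; anything that densifies the matrix (a dense CSV, a dense HDF5 file) pays the full float64 cost.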

I converted text_vect_df to a CSV file and used Vaex to convert that to HDF5, as shown below, since Vaex is supposed to work well with the HDF5 format.

import vaex
text_vaex_hdf5 = vaex.from_csv('text_vectorized.csv', convert=True, chunk_size=5_000_000)

The text_vectorized.csv file is 4 GB. vaex.from_csv() takes too long, and memory crashes (8 GB RAM).

I also tried on my JupyterHub (with an external GPU) with a text_vect_df of shape 200000 x 9300. The conversion writes chunks of about 7 GB each, and reading them takes too long:

text_vectorized.csv_chunk0.hdf5  7.51 GB
text_vectorized.csv_chunk1.hdf5  7.51 GB
text_vectorized.csv_chunk2.hdf5  2.5 GB

Question 1: How can the HDF5 files be larger than the original CSV file? Shouldn't they be smaller?

Question 2: How do I store a 950000 x 10000 dataframe if the smaller one is already failing/crashing?
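On Question 1, a back-of-envelope sketch (my own estimate, not from any documentation): a CSV stores a zero cell as the few characters "0.0,", while a dense HDF5 export stores every cell as an 8-byte float64, so a mostly-zero matrix can easily grow when converted:

```python
# back-of-envelope: dense float64 storage for the full 250000 x 9300 matrix
rows, cols = 250_000, 9_300
dense_gb = rows * cols * 8 / 1024**3   # 8 bytes per float64 cell
print(f"{dense_gb:.1f} GB")            # ~17.3 GB
```

That roughly matches the combined size of the three HDF5 chunks above (7.51 + 7.51 + 2.5 GB), which suggests the conversion densified the matrix.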

I have read about Vaex and it looks really cool because computations happen in seconds. I would love to keep working with it, but I am stuck. I have also tried Dask; it is not as cool as Vaex.

Already tried solutions:

  1. Pandas' to_hdf should not be used for storing a sparse matrix; per the Vaex FAQ (https://vaex.readthedocs.io/en/latest/faq.html):

When one uses the pandas .to_hdf method, the output HDF5 file has a row-based format. Vaex on the other hand expects column-based HDF5 files.

  2. Without Dask or Vaex, memory crashes while running KNN, SVM or any other ML algorithm.
  3. Tried with Dask, no luck: the worker gets killed in the local cluster.
  4. With Vaex, I am not able to move forward.
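One alternative (suggested in the comments below) that avoids the CSV/HDF5 round-trip entirely is to keep everything sparse and concatenate the TF-IDF matrix with the numeric columns using scipy.sparse.hstack. A minimal sketch with stand-in data (the names and shapes here are hypothetical):

```python
import numpy as np
from scipy import sparse

# stand-ins for the real data: text_vect would be the TF-IDF CSR matrix,
# num_feats the numeric columns (e.g. myDataframe[numeric_cols].to_numpy())
text_vect = sparse.random(100, 50, density=0.05, format="csr", random_state=0)
num_feats = np.random.default_rng(0).normal(size=(100, 3))

# hstack keeps the combined matrix sparse -- no dense DataFrame, CSV, or HDF5
X = sparse.hstack([text_vect, sparse.csr_matrix(num_feats)], format="csr")
print(X.shape)  # (100, 53)
```

Most scikit-learn estimators accept such a CSR matrix directly, so no dense intermediate is needed.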
  • What format is your sparse csv? Three column coordinate? – CJR Oct 22 '20 at 16:03
  • @CJR I don't really understand what you mean by three-column coordinate. Could you please elaborate? The matrix returned by TF-IDF is a document-term matrix (n_samples, n_features). So, the same thing is stored in the CSV. The columns in the CSV are numbers ranging from 1 to 10000 (features), and the cells hold 0.0 or the weights. – P H Oct 22 '20 at 16:21
  • I don't really understand what on earth you're doing - you have a sparse matrix that you turn into a sparse dataframe that you turn into a dense csv file that you turn into a dense hdf5 file that you try to read in? That seems like an insane workflow to me. Also you almost certainly have non-numerics that are causing your post-CSV dataframe to be objects instead of numbers. – CJR Oct 22 '20 at 16:58
  • 1) If I save the sparse matrix to h5 directly, it is stored in a row format. Reading it with vaex expects column-based h5, so this gives me errors. 2) I can't combine a sparse matrix with the remaining numeric dataframe, therefore I have to turn the sparse matrix into a sparse dataframe and then concatenate it with my other numeric dataframe. I still can't save this as h5 because I get an error while reading with vaex. 3) The only option now is to save the sparse matrix as a CSV file, then convert it to HDF5 using vaex, and then combine it with the already converted numeric.h5 dataframe. :( – P H Oct 22 '20 at 20:48
  • Personally I'd probably keep your data sparse and skip all this nonsense, but my second choice would be to chunk convert row blocks to hdf5 files and then combine them all at the end if you're determined to use vaex for some reason (I've only tried to use vaex once and I thought it was pretty useless). You should probably ask a question about your problem, not the problems that you've encountered with your solution to your problem. Making a bad idea work is not as good as making a newer, better idea. – CJR Oct 22 '20 at 22:24
  • Thank you @CJR. I have decreased the number of features from TF-IDF and used this matrix. This works fine now. I have run ML algorithms using Vaex wrappers. It is super fast compared to Dask and gives no memory errors. – P H Oct 30 '20 at 10:02

0 Answers