How to convert Countvectorized data back to text data in Python?

Question

How can I convert count-vectorized text data back to textual form? I have text data that I turned into a sparse matrix using CountVectorizer for classification. Now I want to convert that sparse matrix back into text.

My code

    cv = CountVectorizer(max_features=500, analyzer='word')
    cv_addr = cv.fit_transform(data.pop('Clean_addr'))

    for i, col in enumerate(cv.get_feature_names()):
        data[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)

You want the *original text back*? That's impossible: vector-space representations lose all positional information. There's no way to tell "the dog ate the cat" from "the cat ate the dog" from "ate dog the cat the" — juanpa.arrivillaga, Nov 05 '17 at 09:27
@juanpa.arrivillaga, I have done address classification using address text data and some other numeric data columns. I have now classified the records into two categories (Business & Residential). How can I tell which records were classified correctly and which were not? Sklearn doesn't accept raw text data in a decision tree. — aeapen, Nov 05 '17 at 09:36
I'm sorry, but that doesn't sound related to your question at all...? What exactly is the problem? You are working with labeled data, no? — juanpa.arrivillaga, Nov 05 '17 at 09:37
@juanpa.arrivillaga, How can I tell which records were classified correctly and which were not? I split the dataset into test and train sets. These datasets contain only numeric values. — aeapen, Nov 05 '17 at 09:40
@juanpa.arrivillaga, I have a labelled dataset with columns such as address, qty, weight, phone number type and Address Delivery flag (B/R). I am trying to predict the address delivery flag from the other columns, and I have successfully classified the records into B/R. In short, how do I match the B/R flag with the original address? — aeapen, Nov 05 '17 at 09:47
Then you should be able to simply check if your predicted flag is the same as your actual flag, no? — juanpa.arrivillaga, Nov 05 '17 at 09:48
@juanpa.arrivillaga I have no primary keys in the train and test data. So how can I find which records were classified correctly and which wrongly? — aeapen, Nov 05 '17 at 09:49
@ashokeapen, can you post a small sample data set and your desired data set? This would help us to understand what you are trying to achieve... — MaxU - stand with Ukraine, Nov 05 '17 at 09:50
Primary keys? When you split the training sets, you should have kept the indices, **that is your key** — juanpa.arrivillaga, Nov 05 '17 at 09:52

score 5 · Answer 1 · answered Nov 05 '17 at 09:24

I don't think it's possible: all punctuation, spaces and tabs have been removed, and all words have been converted to lower case. AFAIK there is no way to get the text back in its original format, so you'd better keep the Clean_addr column instead of dropping it.

Demo:

In [18]: df
Out[18]:
                                         txt
0                              a sample text
1  to be, or not to be, that is the question

In [19]: from sklearn.feature_extraction.text import CountVectorizer

In [20]: cv = CountVectorizer(max_features = 500, analyzer='word')

In [21]: cv_addr = cv.fit_transform(df['txt'])

In [22]: x = pd.SparseDataFrame(cv_addr, columns=cv.get_feature_names(), 
                                index=df.index, default_fill_value=0)

In [23]: x
Out[23]:
   be  is  not  or  question  sample  text  that  the  to
0   0   0    0   0         0       1     1     0    0   0
1   2   1    1   1         1       0     0     1    1   2

In [24]: df.join(x)
Out[24]:
                                         txt  be  is  not  or  question  sample  text  that  the  to
0                              a sample text   0   0    0   0         0       1     1     0    0   0
1  to be, or not to be, that is the question   2   1    1   1         1       0     0     1    1   2
