How can I convert count-vectorized text data back to textual form? I have text data that I turned into a sparse matrix using CountVectorizer for classification. Now I want to convert that sparse matrix back into text.

My code

    cv = CountVectorizer(max_features=500, analyzer='word')
    cv_addr = cv.fit_transform(data.pop('Clean_addr'))

    for i, col in enumerate(cv.get_feature_names()):
        data[col] = pd.SparseSeries(cv_addr[:, i].toarray().ravel(), fill_value=0)
aeapen
  • You want the *original text back*? That's impossible: vector-space representations lose all positional information. There's no way to tell "the dog ate the cat" from "the cat ate the dog" from "ate dog the cat the" – juanpa.arrivillaga Nov 05 '17 at 09:27
  • @juanpa.arrivillaga, I have done address classification using address text data and some other numeric columns. I have now classified the records into basically two categories (Business and Residential). How can I find out which were classified correctly and which were not? Sklearn doesn't accept raw text data in a decision tree – aeapen Nov 05 '17 at 09:36
  • I'm sorry, but that doesn't sound related to your question at all...? What exactly is the problem? You are working with labeled data, no? – juanpa.arrivillaga Nov 05 '17 at 09:37
  • @juanpa.arrivillaga, How can I find out which records were classified correctly and which were not? I split the dataset into train and test sets. These datasets contain only numeric values. – aeapen Nov 05 '17 at 09:40
  • Yes. But you *have the labels, no?* – juanpa.arrivillaga Nov 05 '17 at 09:40
  • @juanpa.arrivillaga, I have a labelled dataset with columns such as address, qty, weight, phone number type and address delivery flag (B/R). I am trying to predict the address delivery flag from the other columns, and I have successfully classified the records into B/R. In short, how can I match the B/R flag with the original address? – aeapen Nov 05 '17 at 09:47
  • Then you should be able to simply check if your predicted flag is the same as your actual flag, no? – juanpa.arrivillaga Nov 05 '17 at 09:48
  • In short, how can I match the B/R flag with the original address? – aeapen Nov 05 '17 at 09:48
  • @juanpa.arrivillaga I have no primary keys in the train and test data. So how can I find which records were classified correctly and which were not? – aeapen Nov 05 '17 at 09:49
  • @ashokeapen, can you post a small sample data set and your desired data set? This would help us understand what you are trying to achieve... – MaxU - stand with Ukraine Nov 05 '17 at 09:50
  • Primary keys? When you split the training sets, you should have kept the indices, **that is your key** – juanpa.arrivillaga Nov 05 '17 at 09:52
  • @MaxU,I have updated sample data – aeapen Nov 05 '17 at 09:59
  • @juanpa.arrivillaga,OK got it now.Thanks Mate – aeapen Nov 05 '17 at 10:11

1 Answer

I don't think it's possible: all punctuation, spaces, and tabs have been removed, all words have been converted to lower case, and word order is not stored. AFAIK there is no way to get the text back in its original form. So you'd better keep the Clean_addr column instead of dropping it.

Demo:

In [18]: df
Out[18]:
                                         txt
0                              a sample text
1  to be, or not to be, that is the question

In [19]: from sklearn.feature_extraction.text import CountVectorizer

In [20]: cv = CountVectorizer(max_features = 500, analyzer='word')

In [21]: cv_addr = cv.fit_transform(df['txt'])

In [22]: x = pd.SparseDataFrame(cv_addr, columns=cv.get_feature_names(), 
                                index=df.index, default_fill_value=0)

In [23]: x
Out[23]:
   be  is  not  or  question  sample  text  that  the  to
0   0   0    0   0         0       1     1     0    0   0
1   2   1    1   1         1       0     0     1    1   2

In [24]: df.join(x)
Out[24]:
                                         txt  be  is  not  or  question  sample  text  that  the  to
0                              a sample text   0   0    0   0         0       1     1     0    0   0
1  to be, or not to be, that is the question   2   1    1   1         1       0     0     1    1   2
MaxU - stand with Ukraine