
I have a Spark DataFrame with three columns: ['id', 'title', 'desc']. Here 'title' and 'desc' are both text, and 'id' is the id of the document. A couple of example rows look like below:

[Row(id=-33753621, title=u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', desc=u"If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.If this program is successful, it could be a big step forward on the road to automated customer service through the use of AI, notes Laurie Beaver, research associate for BI Intelligence, Business Insider's premium research service.It's noteworthy that Luvo does not operate via a third-party app such as Facebook Messenger, WeChat, or Kik, all of which are currently trying to create bots that would assist in customer service within their respective platforms.Luvo would be available through the web and through smartphones. It would also use machine learning to learn from its mistakes, which should ultimately help with its response accuracy.Down the road, Luvo would become a supplement to the human staff. It can currently answer 20 set questions but as that number grows, it would allow the human employees to more complicated issues. If a problem is beyond Luvo's comprehension, then it would refer the customer to a bank employee; however,\xa0a user could choose to speak with a human instead of Luvo anyway.AI such as Luvo, if successful, could help businesses become more efficient and increase their productivity, while simultaneously improving customer service capacity, which would consequently\xa0save money that would otherwise go toward manpower.And this trend is already starting. Google, Microsoft, and IBM are investing significantly into AI research. Furthermore, the global AI market is estimated to grow from approximately $420 million in 2014 to $5.05 billion in 2020, according to a forecast by Research and Markets.\xa0The move toward AI would be just one more way in which the digital age is disrupting retail banking. Customers, particularly millennials, are increasingly moving toward digital banking, and as a result, they're walking into their banks' traditional brick-and-mortar branches less often than ever before."),
 Row(id=-761323061, title=u'Teen sexting is prompting an overhaul in child pornography laws', desc=u"Rampant teen sexting has left politicians and law enforcement authorities around the country struggling to find some kind of legal middle ground between prosecuting students for child porn and letting them off the hook.Most states consider sexually explicit images of minors to be child pornography, meaning even teenagers who share nude selfies among themselves can, in theory at least, be hit with felony charges that can carry heavy prison sentences and require lifetime registration as a sex offender.Many authorities consider that overkill, however, and at least 20 states have adopted sexting laws with less-serious penalties, mostly within the past five years. Eleven states have made sexting between teens a misdemeanor; in some of those places, prosecutors can require youngsters to take courses on the dangers of social media instead of charging them with a crime.Hawaii passed a 2012 law saying youths can escape conviction if they take steps to delete explicit photos. Arkansas adopted a 2013 law sentencing first-time youth sexters to eight hours of community service. New Mexico last month removed criminal penalties altogether in such cases.At least 12 other states are considering sexting laws this year, many to create new a category of crime that would apply to young people.But one such proposal in Colorado has revealed deep divisions about how to treat the phenomenon. Though prosecutors and researchers agree that felony sex crimes shouldn't apply to a pair of 16-year-olds sending each other selfies, they disagree about whether sexting should be a crime at all.Colorado's bill was prompted by a scandal last year at a Canon City high school where more than 100 students were found with explicit images of other teens. The news sent shockwaves through the city of 16,000. Dozens of students were suspended, and the football team forfeited the final game of the season.Fremont County prosecutors ultimately decided against filing any criminal charges, saying Colorado law doesn't properly distinguish between adult sexual predators and misbehaving teenagers.In a similar case last year out Fayetteville, North Carolina, two dating teens who exchanged nude selfies at age 16 were charged as adults with a felony \u2014 sexual exploitation of a minor. After an uproar, the cha"),

I want to convert this 'desc' column (actual text of document) into a TF-IDF vector in Spark.

Here is what I did for that.

def tfIdf(df):
    """ This function takes the text data and converts it into a term frequency-inverse document frequency (TF-IDF) vector

    parameter: df - dataframe with the 'desc' text column
    returns: dataframe with tf-idf vectors

    """

    # Importing the feature transformation classes for doing TF-IDF 
    from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF

    # Carrying out the Tokenization of the text documents (splitting into words)

    tokenizer = Tokenizer(inputCol="desc", outputCol="tokenised_text")
    tokensDf = tokenizer.transform(df)

    # Carrying out the StopWords Removal for TF-IDF
    stopwordsremover=StopWordsRemover(inputCol='tokenised_text',outputCol='words')
    swremovedDf= stopwordsremover.transform(tokensDf)

    # Creating Term Frequency Vector for each word
    cv=CountVectorizer(inputCol="words", outputCol="tf_features", vocabSize=3, minDF=2.0)
    cvModel=cv.fit(swremovedDf)
    tfDf=cvModel.transform(swremovedDf)

    # Carrying out Inverse Document Frequency on the TF data
    idf=IDF(inputCol="tf_features", outputCol="tf-idf_features")
    idfModel = idf.fit(tfDf)
    tfidfDf = idfModel.transform(tfDf)

    tfidfDf.cache().count()

    return tfidfDf


tfidfDf=tfIdf(sdf_cleaned)

I first tokenize each text document in the 'desc' column using the Tokenizer class. Then I remove stopwords using the StopWordsRemover class. I then convert the result into a bag-of-words model and get the term frequencies using the CountVectorizer class.

Finally I use the IDF class to apply the IDF weightings to the term frequency vector.
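Equivalently, I believe the same four stages could be chained with a pyspark.ml.Pipeline so that fitting and transforming happen in one place. Below is just a minimal sketch of that, using the same column names and parameters as in the function above (the output shown next is from the tfIdf function, not this sketch):

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer, IDF

tokenizer = Tokenizer(inputCol="desc", outputCol="tokenised_text")
remover = StopWordsRemover(inputCol="tokenised_text", outputCol="words")
cv = CountVectorizer(inputCol="words", outputCol="tf_features", vocabSize=3, minDF=2.0)
idf = IDF(inputCol="tf_features", outputCol="tf-idf_features")

# fit() builds the CountVectorizer vocabulary and the IDF weights over the whole DataFrame
pipelineModel = Pipeline(stages=[tokenizer, remover, cv, idf]).fit(sdf_cleaned)
tfidfDf = pipelineModel.transform(sdf_cleaned)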

The end result of the returned dataframe is as below (showing only the first row):

[Row(id=-33753621, title=u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', desc=u"If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.If this program is successful, it could be a big step forward on the road to automated customer service through the use of AI, notes Laurie Beaver, research associate for BI Intelligence, Business Insider's premium research service.It's noteworthy that Luvo does not operate via a third-party app such as Facebook Messenger, WeChat, or Kik, all of which are currently trying to create bots that would assist in customer service within their respective platforms.Luvo would be available through the web and through smartphones. It would also use machine learning to learn from its mistakes, which should ultimately help with its response accuracy.Down the road, Luvo would become a supplement to the human staff. It can currently answer 20 set questions but as that number grows, it would allow the human employees to more complicated issues. If a problem is beyond Luvo's comprehension, then it would refer the customer to a bank employee; however,\xa0a user could choose to speak with a human instead of Luvo anyway.AI such as Luvo, if successful, could help businesses become more efficient and increase their productivity, while simultaneously improving customer service capacity, which would consequently\xa0save money that would otherwise go toward manpower.And this trend is already starting. Google, Microsoft, and IBM are investing significantly into AI research. Furthermore, the global AI market is estimated to grow from approximately $420 million in 2014 to $5.05 billion in 2020, according to a forecast by Research and Markets.\xa0The move toward AI would be just one more way in which the digital age is disrupting retail banking. 
Customers, particularly millennials, are increasingly moving toward digital banking, and as a result, they're walking into their banks' traditional brick-and-mortar branches less often than ever before.", tokenised_text=[u'if', u'you', u'hate', u'dealing', u'with', u'bank', u'tellers', u'or', u'customer', u'service', u'representatives,', u'then', u'the', u'royal', u'bank', u'of', u'scotland', u'might', u'have', u'a', u'solution', u'for', u'you.if', u'this', u'program', u'is', u'successful,', u'it', u'could', u'be', u'a', u'big', u'step', u'forward', u'on', u'the', u'road', u'to', u'automated', u'customer', u'service', u'through', u'the', u'use', u'of', u'ai,', u'notes', u'laurie', u'beaver,', u'research', u'associate', u'for', u'bi', u'intelligence,', u'business', u"insider's", u'premium', u'research', u"service.it's", u'noteworthy', u'that', u'luvo', u'does', u'not', u'operate', u'via', u'a', u'third-party', u'app', u'such', u'as', u'facebook', u'messenger,', u'wechat,', u'or', u'kik,', u'all', u'of', u'which', u'are', u'currently', u'trying', u'to', u'create', u'bots', u'that', u'would', u'assist', u'in', u'customer', u'service', u'within', u'their', u'respective', u'platforms.luvo', u'would', u'be', u'available', u'through', u'the', u'web', u'and', u'through', u'smartphones.', u'it', u'would', u'also', u'use', u'machine', u'learning', u'to', u'learn', u'from', u'its', u'mistakes,', u'which', u'should', u'ultimately', u'help', u'with', u'its', u'response', u'accuracy.down', u'the', u'road,', u'luvo', u'would', u'become', u'a', u'supplement', u'to', u'the', u'human', u'staff.', u'it', u'can', u'currently', u'answer', u'20', u'set', u'questions', u'but', u'as', u'that', u'number', u'grows,', u'it', u'would', u'allow', u'the', u'human', u'employees', u'to', u'more', u'complicated', u'issues.', u'if', u'a', u'problem', u'is', u'beyond', u"luvo's", u'comprehension,', u'then', u'it', u'would', u'refer', u'the', u'customer', u'to', u'a', u'bank', u'employee;', u'however,\xa0a', u'user', u'could', u'choose', u'to', u'speak', u'with', u'a', u'human', u'instead', u'of', u'luvo', u'anyway.ai', u'such', u'as', u'luvo,', u'if', u'successful,', u'could', u'help', u'businesses', u'become', u'more', u'efficient', u'and', u'increase', u'their', u'productivity,', u'while', u'simultaneously', u'improving', u'customer', u'service', u'capacity,', u'which', u'would', u'consequently\xa0save', u'money', u'that', u'would', u'otherwise', u'go', u'toward', u'manpower.and', u'this', u'trend', u'is', u'already', u'starting.', u'google,', u'microsoft,', u'and', u'ibm', u'are', u'investing', u'significantly', u'into', u'ai', u'research.', u'furthermore,', u'the', u'global', u'ai', u'market', u'is', u'estimated', u'to', u'grow', u'from', u'approximately', u'$420', u'million', u'in', u'2014', u'to', u'$5.05', u'billion', u'in', u'2020,', u'according', u'to', u'a', u'forecast', u'by', u'research', u'and', u'markets.\xa0the', u'move', u'toward', u'ai', u'would', u'be', u'just', u'one', u'more', u'way', u'in', u'which', u'the', u'digital', u'age', u'is', u'disrupting', u'retail', u'banking.', u'customers,', u'particularly', u'millennials,', u'are', u'increasingly', u'moving', u'toward', u'digital', u'banking,', u'and', u'as', u'a', u'result,', u"they're", u'walking', u'into', u'their', u"banks'", u'traditional', u'brick-and-mortar', u'branches', u'less', u'often', u'than', u'ever', u'before.'], words=[u'hate', u'dealing', u'bank', u'tellers', u'customer', u'service', u'representatives,', u'royal', u'bank', u'scotland', 
u'solution', u'you.if', u'program', u'successful,', u'big', u'step', u'forward', u'road', u'automated', u'customer', u'service', u'use', u'ai,', u'notes', u'laurie', u'beaver,', u'research', u'associate', u'bi', u'intelligence,', u'business', u"insider's", u'premium', u'research', u"service.it's", u'noteworthy', u'luvo', u'does', u'operate', u'third-party', u'app', u'facebook', u'messenger,', u'wechat,', u'kik,', u'currently', u'trying', u'create', u'bots', u'assist', u'customer', u'service', u'respective', u'platforms.luvo', u'available', u'web', u'smartphones.', u'use', u'machine', u'learning', u'learn', u'mistakes,', u'ultimately', u'help', u'response', u'accuracy.down', u'road,', u'luvo', u'supplement', u'human', u'staff.', u'currently', u'answer', u'20', u'set', u'questions', u'number', u'grows,', u'allow', u'human', u'employees', u'complicated', u'issues.', u'problem', u"luvo's", u'comprehension,', u'refer', u'customer', u'bank', u'employee;', u'however,\xa0a', u'user', u'choose', u'speak', u'human', u'instead', u'luvo', u'anyway.ai', u'luvo,', u'successful,', u'help', u'businesses', u'efficient', u'increase', u'productivity,', u'simultaneously', u'improving', u'customer', u'service', u'capacity,', u'consequently\xa0save', u'money', u'manpower.and', u'trend', u'starting.', u'google,', u'microsoft,', u'ibm', u'investing', u'significantly', u'ai', u'research.', u'furthermore,', u'global', u'ai', u'market', u'estimated', u'grow', u'approximately', u'$420', u'million', u'2014', u'$5.05', u'billion', u'2020,', u'according', u'forecast', u'research', u'markets.\xa0the', u'ai', u'just', u'way', u'digital', u'age', u'disrupting', u'retail', u'banking.', u'customers,', u'particularly', u'millennials,', u'increasingly', u'moving', u'digital', u'banking,', u'result,', u"they're", u'walking', u"banks'", u'traditional', u'brick-and-mortar', u'branches', u'before.'], tf_features=SparseVector(3, {}), tf-idf_features=SparseVector(3, {})),

So the first three columns are the original ['id', 'title', 'desc'], and a new column is added for each transformation used. As you can see, the Tokenizer and StopWordsRemover are working fine, since their output columns are correct.

However, I am not sure why the tf_features column from CountVectorizer and the tf-idf_features column from the IDF class are empty, with no vector of tf-idf values.

Also, in Spark we are passing one document per cell of the column. How does Spark find the vocabulary for the tf vector? The vocabulary is the set of unique words appearing across the whole corpus (all documents), not just one document, while the tf count is the frequency of terms appearing in each document. So does Spark accumulate all the cell values in 'desc', build the unique vocabulary from them, and then count the term frequency for each document in each cell?

Please advise.

Edit 1:

I changed the vocabSize, since obviously 3 doesn't make sense, and now I get tf_features as below:

tf_features=SparseVector(2000, {6: 1.0, 8: 1.0, 14: 1.0, 17: 2.0, 18: 1.0, 20: 1.0, 32: 1.0, 35: 2.0, 42: 1.0, 52: 1.0, 53: 3.0, 54: 1.0, 62: 1.0, 65: 1.0, 68: 1.0, 79: 1.0, 93: 4.0, 95: 2.0, 98: 1.0, 118: 1.0, 132: 1.0, 133: 1.0, 149: 1.0, 157: 1.0, 167: 5.0, 202: 3.0, 215: 1.0, 219: 1.0, 224: 1.0, 232: 1.0, 265: 3.0, 302: 1.0, 303: 1.0, 324: 2.0, 330: 1.0, 355: 1.0, 383: 1.0, 395: 1.0, 405: 1.0, 432: 1.0, 456: 1.0, 466: 1.0, 472: 1.0, 501: 1.0, 525: 1.0, 537: 1.0, 548: 1.0, 620: 1.0, 630: 1.0, 639: 1.0, 657: 1.0, 662: 1.0, 674: 1.0, 720: 1.0, 734: 1.0, 975: 1.0, 1003: 1.0, 1057: 1.0, 1148: 1.0, 1187: 1.0, 1255: 1.0, 1273: 1.0, 1294: 1.0, 1386: 1.0, 1400: 1.0, 1463: 1.0, 1477: 1.0, 1491: 1.0, 1724: 1.0, 1898: 1.0, 1937: 3.0, 1954: 1.0})

I am trying to understand this. The first value is the number of features (terms), which is also the vocabSize I entered. But what is the other dictionary here? Is the key the 'index' of the term (word) and the value its term frequency? If yes, then how do we map these word indices back to the original words? For example, if I want to know which words have the above counts, how do I map the keys of the dict back to word strings?

Secondly, this output looks like a dictionary rather than a vector. Is this output consumable by any ML algorithm? I would typically need a feature vector, not a dict. How does this work?

Baktaawar

1 Answer


In the example you've shown, tf_features and tf-idf_features are not null. These are valid SparseVectors with all features equal to 0.0 ([0.0, 0.0, 0.0]).

I believe the culprit is the ridiculously small configuration of the CountVectorizer. With vocabSize equal to 3, you consider only the three most frequent terms ("CountVectorizer will build a vocabulary that only considers the top vocabSize terms ordered by term frequency across the corpus"). If none of these terms is present in a particular text, you'll get the observed output.

from pyspark.ml.feature import CountVectorizer

df = sc.parallelize([
    (["a", ], ), # 'a' occurs only once in the corpus
    (["b", "c"], ), (["c", "c", "d"], ), (["b", "d"], )
]).toDF(["tokens"])

vectorizer = CountVectorizer(
    inputCol="tokens", outputCol="features", vocabSize=3
).fit(df)

# With vocabSize=3, 'a' doesn't make it into the vocabulary (only the 3 most common tokens are kept)
vectorizer.vocabulary
['c', 'd', 'b']
vectorizer.transform(df).take(3)
[Row(tokens=['a'], features=SparseVector(3, {})),
 Row(tokens=['b', 'c'], features=SparseVector(3, {0: 1.0, 2: 1.0})),
 Row(tokens=['c', 'c', 'd'], features=SparseVector(3, {0: 2.0, 1: 1.0}))]

As you can see, the first document doesn't contain any tokens from the vocabulary, and as a result all its features are equal to 0.

SparseVector(3, {}).toArray()
array([ 0.,  0.,  0.])

For comparison, the third document contains two c and one d:

v = SparseVector(3, {0: 2.0, 1: 1.0})

{vectorizer.vocabulary[i]: cnt for (i, cnt) in zip(v.indices, v.values)}
{'c': 2.0, 'd': 1.0}
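If you want to apply the same mapping to every row rather than a single vector, one possible approach (just a sketch, assuming the features column holds the SparseVectors produced by CountVectorizer) is a small UDF built on the fitted model's vocabulary:

from pyspark.sql.functions import udf
from pyspark.sql.types import MapType, StringType, DoubleType

vocab = vectorizer.vocabulary

def decode(v):
    # map each active index back to its term using the fitted vocabulary
    return {vocab[int(i)]: float(c) for i, c in zip(v.indices, v.values)}

decode_udf = udf(decode, MapType(StringType(), DoubleType()))
vectorizer.transform(df).withColumn("term_counts", decode_udf("features")).show(truncate=False)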

A more detailed explanation of the CountVectorizer behavior can be found in Handle unseen categorical string Spark CountVectorizer.

Depending on the application, vocabSize should be at least in the hundreds, and it is not uncommon to use hundreds of thousands or more, especially if you plan to apply some dimensionality reduction technique afterwards.
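In other words, a configuration along these lines is a more typical starting point than vocabSize=3 (the numbers are purely illustrative and should be tuned for your corpus):

cv = CountVectorizer(
    inputCol="words", outputCol="tf_features",
    vocabSize=1 << 16,   # about 65k terms; hundreds to hundreds of thousands is common
    minDF=2.0            # ignore terms that appear in fewer than 2 documents
)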

zero323