Wondering if there is a built-in Spark feature to combine 1-gram, 2-gram, ..., n-gram features into a single vocabulary. Setting `n=2` in `NGram` followed by invocation of `CountVectorizer` results in a dictionary containing only 2-grams. What I really want is to combine all frequent 1-grams, 2-grams, etc. into one dictionary for my corpus.
zero323
Evan Zamir
1 Answer
You can train separate `NGram` and `CountVectorizer` models and merge them using `VectorAssembler`.
from pyspark.ml.feature import NGram, CountVectorizer, VectorAssembler
from pyspark.ml import Pipeline

def build_ngrams(inputCol="tokens", n=3):
    # One NGram transformer per order: 1-grams, 2-grams, ..., n-grams
    ngrams = [
        NGram(n=i, inputCol=inputCol, outputCol="{0}_grams".format(i))
        for i in range(1, n + 1)
    ]

    # One CountVectorizer per n-gram column, each learning its own vocabulary
    vectorizers = [
        CountVectorizer(inputCol="{0}_grams".format(i),
                        outputCol="{0}_counts".format(i))
        for i in range(1, n + 1)
    ]

    # Concatenate all count vectors into a single features column
    assembler = [VectorAssembler(
        inputCols=["{0}_counts".format(i) for i in range(1, n + 1)],
        outputCol="features"
    )]

    return Pipeline(stages=ngrams + vectorizers + assembler)
Example usage:

df = spark.createDataFrame([
    (1, ["a", "b", "c", "d"]),
    (2, ["d", "e", "d"])
], ("id", "tokens"))

build_ngrams().fit(df).transform(df)

zero323
- Thanks, that makes perfect sense. – Evan Zamir Oct 01 '16 at 21:01
- An alternative would be to combine the unigrams and bigrams using `VectorAssembler` and then feed a single vector to `CountVectorizer`. I think this is more in line with scikit-learn's CountVectorizer. Not sure if it makes a real difference, though. – Daniel Nitzan Jun 09 '17 at 16:38
- @danieln If nothing has changed, `VectorAssembler` cannot assemble arrays of strings. – zero323 Jun 09 '17 at 19:19
- How would you go about feeding an n-gram range, like (1,3), to HashingTF? – Daniel Nitzan Mar 23 '18 at 18:04