Here is the case. I want to run Spark NLP in JupyterLab with a Scala kernel, and I want to use the RegexMatcher
annotator. I saved the patterns in a file named patterns.txt
in an S3 bucket, and I tried the implementation below:
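import com.johnsnowlabs.nlp.DocumentAssembler
import com.johnsnowlabs.nlp.annotators.RegexMatcher
import com.johnsnowlabs.nlp.util.io.ExternalResource
import com.johnsnowlabs.nlp.util.io.ReadAs.LINE_BY_LINE
import org.apache.spark.ml.Pipeline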
// Turn the raw "text" column into document annotations
val document = new DocumentAssembler().setInputCol("text").setOutputCol("document")
// Match the rules from the external file against each document
val regexmatcher = new RegexMatcher().
  setInputCols(Array("document")).
  setOutputCol("match").
  setStrategy("MATCH_ALL").
  setRules(ExternalResource("s3://bucket_name/patterns.txt", LINE_BY_LINE, Map("format" -> "text", "delimiter" -> " ")))
// Assemble, fit, and run the pipeline on dev_data
val pipeline_regex = new Pipeline().setStages(Array(document, regexmatcher))
val regex_match = pipeline_regex.fit(dev_data)
regex_match.transform(dev_data).select('match).show(false)
However, this does not seem to work at all, and the patterns in patterns.txt
are not used. How can I fix it?
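
For context, here is a minimal sketch of the file layout I am assuming: one rule per line, with a regex, then the delimiter (a single space here), then an identifier. parseRule is a hypothetical helper of mine, not a Spark NLP function, and the sample lines are made up for illustration rather than copied from patterns.txt. It only checks that each line splits into two non-empty parts. It does not reproduce how RegexMatcher actually parses the file.

```scala
// Hypothetical helper, NOT part of Spark NLP: splits one rules line into
// (regex, identifier) at the first occurrence of the delimiter.
def parseRule(line: String, delimiter: String): Option[(String, String)] = {
  val idx = line.indexOf(delimiter)
  if (idx <= 0) None // delimiter missing, or nothing before it
  else {
    val regex = line.substring(0, idx).trim
    val identifier = line.substring(idx + delimiter.length).trim
    if (regex.isEmpty || identifier.isEmpty) None else Some((regex, identifier))
  }
}

// Made-up sample lines, for illustration only (not the real patterns.txt).
val sampleLines = Seq(
  """\d{4}-\d{2}-\d{2} date""",
  """[A-Z]\w+ capitalized_word""",
  "no_delimiter_here"
)

// Pair every line with its parse result and report lines that do not split cleanly.
val parsed = sampleLines.map(line => line -> parseRule(line, " "))
parsed.foreach {
  case (_, Some((regex, id))) => println(s"OK   regex=[$regex] id=[$id]")
  case (line, None)           => println(s"BAD  line=[$line]")
}
```

Note that with a single space as the delimiter, a regex that itself contains a space is cut at that space in this sketch, which is one reason a less common delimiter might be safer.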