I am using MinHashLSH with approxSimilarityJoin in Scala on Spark 2.4 to find the edges of a network, i.e. link prediction based on document similarity. My problem is that as I increase the number of hash tables in MinHashLSH, my accuracy and F1 score decrease. Everything I have read about this algorithm suggests the opposite should happen, so I must have an issue somewhere.

I have tried different numbers of hash tables and different Jaccard similarity thresholds, but I get exactly the same problem: the accuracy decreases rapidly. I have also tried different samples of my dataset and nothing changed. My workflow is as follows: I concatenate all the text columns of my dataframe (title, authors, journal and abstract) and tokenize the concatenated column into words. Then I use a CountVectorizer to transform this "bag of words" into vectors. Next, I feed this column into MinHashLSH with some number of hash tables, and finally I do an approxSimilarityJoin to find similar "papers" that fall below my given threshold. My implementation is the following.

import org.apache.spark.ml.feature._
import org.apache.spark.ml.linalg._
import UnsupervisedLinkPrediction.BroutForce.join
import org.apache.log4j.{Level, Logger}
import org.apache.spark.ml.Pipeline
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf, when}
import org.apache.spark.sql.types._


object lsh {
  def main(args: Array[String]): Unit = {
    Logger.getLogger("org").setLevel(Level.ERROR) // show only errors

//    val cores=args(0).toInt
//    val partitions=args(1).toInt
//    val hashTables=args(2).toInt
//    val limit = args(3).toInt
//    val threshold = args(4).toDouble

    val cores="*"
    val partitions=1
    val hashTables=16
    val limit = 1000
    val jaccardDistance = 0.89

    val master = "local["+cores+"]"

    val ss = SparkSession.builder().master(master).appName("MinHashLSH").getOrCreate()
    val sc = ss.sparkContext

    val inputFile = "resources/data/node_information.csv"

    println("reading from input file: " + inputFile)
    println

    val schemaStruct = StructType(
      StructField("id", IntegerType) ::
        StructField("pubYear", StringType) ::
        StructField("title", StringType) ::
        StructField("authors", StringType) ::
        StructField("journal", StringType) ::
        StructField("abstract", StringType) :: Nil
    )

    // Read the contents of the csv file into a dataframe (header option set to false).
    var papers = ss.read.option("header", "false").schema(schemaStruct).csv(inputFile).limit(limit).cache()

    papers = papers.repartition(partitions) // repartition returns a new Dataset, so the result must be reassigned
    println("papers.rdd.getNumPartitions: " + papers.rdd.getNumPartitions)

    import ss.implicits._
    // Read the original graph edges (ground truth)
    val originalGraphDF = sc.textFile("resources/data/Cit-HepTh.txt").map(line => {
      val fields = line.split("\t")
      (fields(0), fields(1))
    }).toDF("nodeA_id", "nodeB_id").cache()

    val originalGraphCount = originalGraphDF.count()

    println("Ground truth count: " + originalGraphCount )

    val nullAuthor = ""
    val nullJournal = ""
    val nullAbstract = ""

    papers = papers.na.fill(nullAuthor, Seq("authors"))
    papers = papers.na.fill(nullJournal, Seq("journal"))
    papers = papers.na.fill(nullAbstract, Seq("abstract"))

    papers = papers.withColumn("nonNullAbstract", when(col("abstract") === nullAbstract, col("title")).otherwise(col("abstract")))
    papers = papers.drop("abstract").withColumnRenamed("nonNullAbstract", "abstract")
    papers.show(false)

        val filteredGt= originalGraphDF.as("g").join(papers.as("p"),(
          $"g.nodeA_id" ===$"p.id") || ($"g.nodeB_id" ===$"p.id")
        ).select("g.nodeA_id","g.nodeB_id").distinct().cache()

    filteredGt.show()

    val filteredGtCount = filteredGt.count()
    println("Filtered GroundTruth count: "+ filteredGtCount)

    //TOKENIZE

    val tokPubYear = new Tokenizer().setInputCol("pubYear").setOutputCol("pubYear_words")
    val tokTitle = new Tokenizer().setInputCol("title").setOutputCol("title_words")
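    // Authors are split on commas; the plain Tokenizer splits the other fields on whitespace.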
    val tokAuthors = new RegexTokenizer().setInputCol("authors").setOutputCol("authors_words").setPattern(",")
    val tokJournal = new Tokenizer().setInputCol("journal").setOutputCol("journal_words")
    val tokAbstract = new Tokenizer().setInputCol("abstract").setOutputCol("abstract_words")

    println("Setting pipeline stages...")
    val stages = Array(
      tokPubYear, tokTitle, tokAuthors, tokJournal, tokAbstract
      //      rTitle, rAuthors, rJournal, rAbstract
    )

    val pipeline = new Pipeline()
    pipeline.setStages(stages)

    println("Transforming dataframe\n")
    val model = pipeline.fit(papers)
    papers = model.transform(papers)

    println(papers.count())
    papers.show(false)
    papers.printSchema()

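    // Merge all the tokenized columns into a single "bag of words" per paper
    // (join is a helper imported from UnsupervisedLinkPrediction.BroutForce).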
    val udf_join_cols = udf(join(_: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String], _: Seq[String]))

    val joinedDf = papers.withColumn(
      "paper_data",
      udf_join_cols(
        papers("pubYear_words"),
        papers("title_words"),
        papers("authors_words"),
        papers("journal_words"),
        papers("abstract_words")
      )
    ).select("id", "paper_data").cache()

    joinedDf.show(5,false)

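    // Turn each bag of words into a sparse term-frequency vector; terms appearing in fewer
    // than 10 papers are dropped (minDF), and all-zero vectors are filtered out because
    // MinHashLSH cannot hash an empty set.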
    val vocabSize = 1000000
    val cvModel: CountVectorizerModel = new CountVectorizer().setInputCol("paper_data").setOutputCol("features").setVocabSize(vocabSize).setMinDF(10).fit(joinedDf)
    val isNoneZeroVector = udf({v: Vector => v.numNonzeros > 0}, DataTypes.BooleanType)
    val vectorizedDf = cvModel.transform(joinedDf).filter(isNoneZeroVector(col("features"))).select(col("id"), col("features"))
    vectorizedDf.show()

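    // Fit MinHashLSH on the count vectors; each paper gets one MinHash signature per hash table.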
    val mh = new MinHashLSH().setNumHashTables(hashTables)
      .setInputCol("features").setOutputCol("hashValues")
    val mhModel = mh.fit(vectorizedDf)

    mhModel.transform(vectorizedDf).show()

    vectorizedDf.createOrReplaceTempView("vecDf")

    println("MinHashLSH.getNumHashTables: "+mh.getNumHashTables)

    val dfA = ss.sqlContext.sql("select id as nodeA_id, features from vecDf").cache()
    dfA.show(false)
    val dfB = ss.sqlContext.sql("select id as nodeB_id, features from vecDf").cache()
    dfB.show(false)

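    // Self-join the dataset against itself, keeping only pairs whose Jaccard distance is below the threshold.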
    val predictionsDF = mhModel.approxSimilarityJoin(dfA, dfB, jaccardDistance, "JaccardDistance").cache()

    println("Predictions:")
    val predictionsCount = predictionsDF.count()
    predictionsDF.show()
    println("Predictions count: "+predictionsCount)

        predictionsDF.createOrReplaceTempView("predictions")

        val pairs = ss.sqlContext.sql("select datasetA.nodeA_id, datasetB.nodeB_id, JaccardDistance from predictions").cache()
        pairs.show(false)

        val totalPredictions = pairs.count()

        println("Properties:\n")
        println("Threshold (Jaccard distance): "+jaccardDistance+"\n")
        println("Hash tables: "+hashTables+"\n")
        println("Ground truth: "+filteredGtCount)
        println("Total edges found: "+totalPredictions +" \n")


        println("EVALUATION PROCESS STARTS\n")
        println("Calculating true positives...\n")

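        // A predicted pair counts as a true positive if it matches a ground-truth edge in either direction.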
        val truePositives = filteredGt.as("g").join(pairs.as("p"),
          ($"g.nodeA_id" === $"p.nodeA_id" && $"g.nodeB_id" === $"p.nodeB_id") || ($"g.nodeA_id" === $"p.nodeB_id" && $"g.nodeB_id" === $"p.nodeA_id")
        ).cache().count()

       println("True Positives: "+truePositives+"\n")

        println("Calculating false positives...\n")

        val falsePositives = predictionsCount - truePositives

        println("False Positives: "+falsePositives+"\n")

        println("Calculating true negatives...\n")
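        // Total number of unordered pairs among the limit papers: limit * (limit - 1) / 2.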
        val pairsPerTwoCount = (limit *(limit - 1)) / 2

        val trueNegatives = (pairsPerTwoCount - truePositives) - falsePositives
        println("True Negatives: "+trueNegatives+"\n")

        val falseNegatives = filteredGtCount - truePositives

        println("False Negatives: "+falseNegatives)

        val truePN = (truePositives+trueNegatives).toFloat
        println("TP + TN sum: "+truePN+"\n")

        val sum = (truePN + falseNegatives+ falsePositives).toFloat
        println("TP +TN +FP+ FN sum: "+sum+"\n")

        val accuracy = (truePN/sum).toFloat
        println("Accuracy: "+accuracy+"\n")

        val precision = truePositives.toFloat / (truePositives+falsePositives).toFloat
        val recall = truePositives.toFloat/(truePositives+falseNegatives).toFloat

        val f1Score = 2*(recall*precision)/(recall+precision).toFloat
        println("F1 score: "+f1Score+"\n")

    ss.stop()
  }
}

I forgot to mention that I am running this code on a cluster with 40 cores and 64 GB of RAM. Note that approxSimilarityJoin (Spark's implementation) works with Jaccard distance, not Jaccard index, so the similarity threshold I provide is the Jaccard distance, which in my case is jaccardDistance = 1 - threshold (where threshold is the Jaccard index).
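For example (illustrative numbers only, mirroring the constant in my code above):

// Illustrative only: approxSimilarityJoin takes a distance threshold,
// so a desired minimum Jaccard similarity (index) of 0.11 becomes:
val jaccardIndexThreshold = 0.11
val jaccardDistance = 1 - jaccardIndexThreshold   // = 0.89, the value passed to approxSimilarityJoin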

I was expecting accuracy and F1 score to increase as I increase the number of hash tables. Do you have any idea what my issue is?

Thank all of you in advance!

– atheodos

1 Answer


There are multiple visible problems here, and probably more hidden, so just to enumerate a few:

  • LSH is not really a classifier, and an attempt to evaluate it as one doesn't make much sense, even if you assume that text similarity is somehow a proxy for citation (which is a big if).
  • If the problem were to be framed as a classification problem, it should be treated as multi-label classification (each paper can cite or be cited by multiple sources), not multi-class classification, hence simple accuracy is not meaningful.
  • Even if it were a classification problem and could be evaluated as such, your calculations don't include the actual negatives, i.e. the pairs that don't meet the approxSimilarityJoin threshold and therefore never appear in its output (see the sketch after this list).
  • Also, setting the threshold to 1 restricts the join to either exact matches or cases of hash collisions, hence the preference towards LSH configurations with higher collision rates.
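To make the point about negatives concrete, here is a rough sketch (not a drop-in fix): pairs and vectorizedDf are the dataframes from the question, while dedupedGroundTruth is assumed to be the ground-truth edges normalised the same way (unordered, distinct, restricted to the joined papers). Negatives never appear in the approxSimilarityJoin output, so they have to be derived from the total number of candidate pairs among the rows that actually entered the join:

import org.apache.spark.sql.functions.{col, greatest, least}

// Normalise the predicted pairs: drop self-matches and count (A,B)/(B,A) only once.
val dedupedPairs = pairs
  .filter(col("nodeA_id") =!= col("nodeB_id"))
  .select(least(col("nodeA_id"), col("nodeB_id")).as("a"),
          greatest(col("nodeA_id"), col("nodeB_id")).as("b"))
  .distinct()

val n          = vectorizedDf.count()                 // papers that actually entered the join
val totalPairs = n * (n - 1) / 2                      // every unordered candidate pair

val tp = dedupedPairs.join(dedupedGroundTruth, Seq("a", "b")).count()
val fp = dedupedPairs.count() - tp
val fn = dedupedGroundTruth.count() - tp
val tn = totalPairs - tp - fp - fn                    // the negatives the join never returns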

Additionally:

  • The text processing approach you took is rather pedestrian and favours non-specific features (remember that you are not optimizing for your actual goal, but for text similarity).
  • Such an approach, especially treating every field as equal, discards the majority of the useful information in the set, primarily, but not limited to, temporal relationships (see the sketch below for one way to at least keep the fields separate).
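As one possible direction only (a sketch under the assumption that the tokenized columns from the question are kept), you could at least vectorize each field separately so that, for example, an author token and a word from the abstract do not collapse into the same dimension; everything here other than the column names is illustrative:

import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.feature.{CountVectorizer, VectorAssembler}

// One CountVectorizer per field, each with its own vocabulary, then concatenate the vectors.
val fields = Seq("title_words", "authors_words", "journal_words", "abstract_words")
val vectorizers = fields.map { c =>
  new CountVectorizer().setInputCol(c).setOutputCol(c + "_vec").setBinary(true)
}
val assembler = new VectorAssembler()
  .setInputCols(fields.map(_ + "_vec").toArray)
  .setOutputCol("features")

val stages: Array[PipelineStage] = (vectorizers :+ assembler).toArray
val featurePipeline = new Pipeline().setStages(stages)
// featurePipeline.fit(papers).transform(papers) can then feed MinHashLSH as before.

This is still a crude bag-of-words representation, but it stops unrelated fields from sharing dimensions and makes it possible to weight or drop individual fields later.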
  • Thank you for your answer. I have edited my question accordingly. I want to do link prediction based on document similarity, and I have chosen the Jaccard index as the similarity measure. But approxSimilarityJoin, as provided by Apache Spark, works with Jaccard distance, not Jaccard index, so I have passed 1 - Jaccard index (i.e. the Jaccard distance) as the approxSimilarityJoin threshold. I have taken your comments into consideration and I am trying to improve my code, but I still have not found a solution to my problem. Is it an evaluation problem or a MinHashLSH one? – atheodos Feb 16 '19 at 19:41