I have a list of file names (nearly 400,000). I need to parse each file's content and find a given string pattern.

Can anyone suggest the best way to speed up the search? At the moment I am able to process the content in 90 seconds.

Here is the piece of code that needs to be optimised.

import java.util.{ArrayList, LinkedHashMap}

/**
 * Called for each file in the list. The file is scanned char by char and
 * compared with the pattern using the prefix table (as used in the KMP algorithm).
 *
 * @param pattern
 *     Pattern to be searched for.
 *
 * @param prefixTable
 *     Prefix table built for the pattern using the KMP algorithm.
 *     Examples (pattern => table): "ababaca" => 0012301, "abcdabca" => 00001231,
 *     "aababca" => 0101001, "aabaabaaa" => 010123452
 *
 * @param file
 *     File to be parsed to find the string pattern.
 *
 * @return
 *     For the given file, a map from line number to the start positions of
 *     every occurrence of the pattern within that line.
 */
def contains(pattern: Array[Char], prefixTable: Array[Int], file: String): LinkedHashMap[Integer, ArrayList[Integer]] = {
  // Stores, per line number, the start column (1-based) of each occurrence.
  val returnValue = new LinkedHashMap[Integer, ArrayList[Integer]]()

  val source = scala.io.Source.fromFile(file, "iso-8859-1")
  val lines = try source.mkString finally source.close()

  var lineNumber = 1
  var i = 0 // index into the file content
  var k = 0 // 1-based column of lines(i) within the current line
  var j = 0 // number of pattern chars matched so far
  // Assumes a non-empty pattern that does not contain '\n'.
  while (i < lines.length) {
    val c = lines(i)
    if (c == '\n') {
      lineNumber += 1; k = 0; j = 0
    } else {
      k += 1
      // On a mismatch, fall back through the prefix table instead of rescanning.
      while (j > 0 && c != pattern(j)) j = prefixTable(j - 1)
      if (c == pattern(j)) j += 1
      if (j == pattern.length) {
        // Append to the line's list so earlier occurrences are not overwritten.
        var charAt = returnValue.get(lineNumber)
        if (charAt == null) { charAt = new ArrayList[Integer](); returnValue.put(lineNumber, charAt) }
        charAt.add(k - pattern.length + 1)
        j = 0 // restart from scratch, so overlapping occurrences are not reported
      }
    }
    i += 1
  }
  returnValue
}
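For reference, the prefix table described in the doc comment could be built along the following lines. This is only a minimal sketch: the object name `PrefixTableSketch` and the method name `prefixTable` are made up for illustration and are not part of the code above.

```scala
object PrefixTableSketch {
  /** Entry i is the length of the longest proper prefix of pattern(0..i)
    * that is also a suffix of it (the usual KMP failure function). */
  def prefixTable(pattern: Array[Char]): Array[Int] = {
    val table = new Array[Int](pattern.length)
    var k = 0 // length of the prefix matched so far
    var i = 1
    while (i < pattern.length) {
      // Shrink the matched prefix until it can be extended by pattern(i).
      while (k > 0 && pattern(i) != pattern(k)) k = table(k - 1)
      if (pattern(i) == pattern(k)) k += 1
      table(i) = k
      i += 1
    }
    table
  }
}
```

Calling `PrefixTableSketch.prefixTable("ababaca".toCharArray)` should give the same digits as the first example in the doc comment.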
Puneeth Reddy V

1 Answer

With this code:

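object HelloWorld {
  def main(args: Array[String]): Unit = {

    val name = """A""".r   // regex to search for
    val chaine = "BCDARFA" // string to search in

    val res = name.findAllIn(chaine)
    // hasNext advances to the first match, if there is one
    println("found? " + res.hasNext)

    // start is only valid when a match was found
    println("1st place " + res.start)

  }
}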

you can find the position of the first occurrence of the regex in a string. I don't know whether it is faster than yours, but in any case it could simplify your code.

EDIT: here's the final code:

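object HelloWorld {
  def main(args: Array[String]): Unit = {

    val name = """A""".r   // regex to search for
    val chaine = "BCDARFA" // string to search in

    val res = name.findAllIn(chaine)
    // hasNext advances to the first match, if there is one
    println("found? " + res.hasNext)

    // start is only valid when a match was found
    println("1st place " + res.start)

    // matchData iterates over every match, starting with the first one
    for (elt <- res.matchData) {
      println("position : " + elt.start)
    }

  }
}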
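To get the positions per line, as asked in the comments, the same idea could be applied line by line. The following is only a sketch under some assumptions: the names `LineMatchesSketch` and `positionsByLine` are invented for the example, the text is already in memory as one string, lines are separated by `\n`, and the pattern is a literal string (hence `Regex.quote`). Line numbers are 1-based and positions are 0-based, like `start`.

```scala
import scala.util.matching.Regex

object LineMatchesSketch {
  /** Maps each 1-based line number to the 0-based start offsets of `literal`
    * in that line. Lines without a match are left out. */
  def positionsByLine(text: String, literal: String): Map[Int, List[Int]] = {
    // Quote the pattern so regex metacharacters are matched literally.
    val regex = Regex.quote(literal).r
    val pairs = for {
      (line, idx) <- text.split("\n", -1).toList.zipWithIndex
      starts = regex.findAllMatchIn(line).map(_.start).toList
      if starts.nonEmpty
    } yield (idx + 1) -> starts
    pairs.toMap
  }
}
```

For instance, `LineMatchesSketch.positionsByLine("BCDARFA\nxyz\nAA", "A")` reports two matches on line 1 and two on line 3, and nothing for line 2. Whether this is faster than the hand-written KMP loop is something you would have to measure.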
lolveley
  • Is there a way to find how many occurrences there are in a line and at what positions they occur? – Puneeth Reddy V Aug 21 '15 at 17:27
  • Yes, this is possible using res.matchData in my previous code: it creates an iterator whose elements each have a start method giving the position of the regex match in the string. I can give you an example if needed; it uses a for loop. – lolveley Aug 21 '15 at 17:45