I am currently working on a personal project in which I read scanned documents with Tesseract and store the recognized content as text files. I keep both the text files and the JPEG files in my website's directory and maintain the link between them in a MySQL database. The aim of the project is to search for key terms and return the matching images.
So far I have been able to index the text files using Zend Lucene, but I have run into a lot of issues when searching the documents. The fields in my index are: the date the image was uploaded, the body (contents of the text file), and the URI of the image.
//Create document
$doc = new Zend_Search_Lucene_Document();
//Select database and get item to be indexed
mysql_select_db("database", $con);
$exampleSQL = "SELECT date_format(dateUploaded, '%Y%m%d') as formatted_date, imageLink, textLink
    FROM `mappingTable`
    WHERE imageLink='$item'";
$fileItem = mysql_fetch_assoc(mysql_query($exampleSQL));
//Add fields to document
$doc->addField(Zend_Search_Lucene_Field::UnIndexed('URL',
    $fileItem['imageLink']));
$doc->addField(Zend_Search_Lucene_Field::Keyword('created',
    $fileItem['formatted_date']));
$contents = file_get_contents("/path/to/data/".$fileItem['textLink']);
$doc->addField(Zend_Search_Lucene_Field::UnStored('body',
    $contents));
I believe all of the above is working fine. I intend to search by the content of the text files and by the date the image was uploaded to the directory, so I've devised the following queries, which for some reason keep failing to return the expected results, especially when searching by date or by content and date together.
if ($queryType === "contentSearch") {
    $term = new Zend_Search_Lucene_Index_Term($query, 'body');
    $searchQuery = new Zend_Search_Lucene_Search_Query_Term($term);
    try {
        $hits = $index->find($searchQuery);
    } catch (Zend_Search_Lucene_Exception $ex) {
        $hits = array();
    }
} elseif ($queryType === "dateSearch") {
    $searchQuery = '['.str_replace('/','',$fQuerydate)." TO ".str_replace('/','',$tQuerydate).']';
    try {
        $hits = $index->find($searchQuery);
    } catch (Zend_Search_Lucene_Exception $ex) {
        $hits = array();
    }
} elseif ($queryType === "bothSearch") {
    $searchQuery = new Zend_Search_Lucene_Search_Query_MultiTerm();
    $searchQuery->addTerm(new Zend_Search_Lucene_Index_Term($query, 'body'), true);
    $searchQuery->addTerm(new Zend_Search_Lucene_Index_Term('['.str_replace('/','',$fQuerydate).' TO '.str_replace('/','',$tQuerydate).']', 'created'), true);
    // $searchQuery = 'body:"'.$query.'" && created:'.$fQuerydate." TO ".$tQuerydate;
    try {
        $hits = $index->find($searchQuery);
    } catch (Zend_Search_Lucene_Exception $ex) {
        $hits = array();
    }
} else {
    $searchQuery = null;
}
As you can see, I've even tried passing a query string to the parser, but that does not return any results either, even though I know at least two documents should match.
For example: +body:hill +created:[20110328 TO 20110628]
returns zero documents.
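To make the shape of that query string explicit, here is a minimal sketch of how it can be assembled. The helper name buildQueryString is hypothetical (it is not in my code), and it assumes the term is a single word with no special query characters and that both dates are already in YYYYMMDD form.

```php
<?php
// Hypothetical helper, for illustration only: builds a query string of the
// form  +body:<term> +created:[<from> TO <to>]
// Assumes $term is one plain word and $from/$to are already YYYYMMDD.
function buildQueryString($term, $from, $to)
{
    return sprintf('+body:%s +created:[%s TO %s]', $term, $from, $to);
}

// Produces the example query shown above.
echo buildQueryString('hill', '20110328', '20110628'), "\n";
```

The resulting string is what gets passed to $index->find(), which parses string queries itself.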
As you can see, I removed all '/', '.' and '-' characters from my date (created) field and declared it as a Keyword field to make sure it would be searchable, but even then nothing is returned.
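To make the date handling concrete, here is a minimal sketch of the normalization described above. The helper name normalizeDate is hypothetical (it is not part of my code), and it assumes the input is already in year, month, day order, so that stripping the separators yields the same YYYYMMDD format that date_format(..., '%Y%m%d') stores in the index.

```php
<?php
// Hypothetical helper, for illustration only: strips the separators
// '/', '.' and '-' so the value matches the '%Y%m%d' format stored
// in the 'created' field. Assumes year-month-day input order.
function normalizeDate($date)
{
    return str_replace(array('/', '.', '-'), '', trim($date));
}

echo normalizeDate('2011/03/28'), "\n"; // 20110328
echo normalizeDate('2011-06-28'), "\n"; // 20110628
```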
I would also like to know how to apply my own stop-word list, because some terms that are important to my documents are currently excluded from searches and I would like them to be included.
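For context, the kind of setup I have in mind is sketched below. It assumes Zend Framework 1 is on the include path, and the stop list itself is made up for illustration. I am also assuming the same analyzer would have to be set as the default both when indexing and when searching.

```php
<?php
// Assumes Zend Framework 1 is available on the include path.
require_once 'Zend/Search/Lucene.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer.php';
require_once 'Zend/Search/Lucene/Analysis/Analyzer/Common/TextNum/CaseInsensitive.php';
require_once 'Zend/Search/Lucene/Analysis/TokenFilter/StopWords.php';

// Illustrative stop list only; the real list would contain just the
// words that should be ignored, leaving out the terms I need to search.
$stopWords = array('a', 'an', 'the');

// TextNum analyzer keeps digits as part of tokens; the stop-word filter
// then drops any token found in $stopWords.
$analyzer = new Zend_Search_Lucene_Analysis_Analyzer_Common_TextNum_CaseInsensitive();
$analyzer->addFilter(new Zend_Search_Lucene_Analysis_TokenFilter_StopWords($stopWords));

// Make this analyzer the default for subsequent indexing and query parsing.
Zend_Search_Lucene_Analysis_Analyzer::setDefault($analyzer);
```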
Since I am not working on my own server and have limited access to it, I have no option but to use either Lucene or MySQL. Would I be better off using a full-text search in my database?
Thanks in advance.