Indexing data stored in a database, with files stored on the filesystem

Question

I'm trying to use Apache Solr as a full-text search engine in my .NET app (via SolrNet). My app has this data model:

class Document 
{
    public int Id { get; set; }
    public string Name { get; set; }
    public DateTime CreateDate { get; set;}
    public Attach[] Attaches { get; set; }
}

class Attach
{
    public int Id { get; set; }
    public Document Parent { get; set; }
    // Files are stored on the filesystem; only the path is stored in the database.
    public string FilePath { get; set; }
}

Now I'm trying to index these files (using Castle.Windsor):

_container.AddFacility("solr", 
    new SolrNetFacility("http://localhost:8983/solr"));
var solr = _container.Resolve<ISolrOperations<Document>>();
solr.Delete(SolrQuery.All);

var conn = _container.Resolve<ISolrConnection>();

var docs = from o in Documents
           where o.Attaches.Count > 0
           select o;

foreach (var doc in docs)
{
    foreach (var att in doc.Attaches)
    {
        try
        {
            var file = Directory.GetFiles("C:\\Attachments\\" + doc.Id );
            foreach (var s in file)
            {
                var a = File.ReadAllText(s);
                conn.Post("/update", a);    
            }

        }
        catch (Exception)
        {           
            throw;
        }
    }
}
solr.Commit();
solr.BuildSpellCheckDictionary();

As the code shows, I look up the file paths and add the file content directly from disk. But when I post a file's text to Solr, I receive this error:

<?xml version="1.0" encoding="UTF-8"?>
<response>
    <lst name="responseHeader">
        <int name="status">400</int><int name="QTime">2</int>
    </lst>
    <lst name="error">
        <str name="msg">Unexpected character 'Т' (code 1058 / 0x422) in prolog; expected '&lt;'
 at [row,col {unknown-source}]: [1,1]</str>
        <int name="code">400</int>
    </lst>
</response>

I have these questions:

Can I post plain text to the index instead of XML?
Do I have to serialize my data objects to index them? If so, how should I represent a file in the "Attach" class?

score 2 · Accepted Answer · answered Feb 19 '13 at 14:00

To answer your questions:

Yes, you can post plain text to the index.
The items that you post must be serialized (default is XML, but JSON can also be used) in order to add them to the index.
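
To make the serialization point concrete, here is a minimal sketch of the XML envelope that Solr's XML update handler parses. The 400 response in the question is what that handler returns when the request body does not start with '<'. SolrNet builds this message for you when you call Add, so the sketch only illustrates the shape. The class name SolrXmlSketch, the method BuildAddMessage, and the sample path are hypothetical, and the "id" and "content" fields are assumed to exist in your schema.

```csharp
using System;
using System.Xml.Linq;

static class SolrXmlSketch
{
    // Wraps plain text in the <add><doc>...</doc></add> envelope
    // that Solr's XML update handler expects.
    public static string BuildAddMessage(string id, string content)
    {
        var message = new XElement("add",
            new XElement("doc",
                new XElement("field", new XAttribute("name", "id"), id),
                new XElement("field", new XAttribute("name", "content"), content)));

        // XElement escapes characters such as '<' and '&' in the text values.
        return message.ToString(SaveOptions.DisableFormatting);
    }

    static void Main()
    {
        // Hypothetical sample values, for illustration only.
        Console.WriteLine(BuildAddMessage(@"C:\Attachments\1\notes.txt", "Tom & Jerry"));
    }
}
```

The file text travels as the value of one field inside a document, not as the request body itself.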

From your example code, it looks like you are interested in just indexing the plain text of the files. Based on that, I would create the following class for passing data to Solr.

public class IndexItem
{
    [SolrField("id")]
    public string Id { get; set; }

    [SolrField("content")]
    public string Content { get; set; }
}

Use this class to hold an Id (which must be unique) and the content of each file that you read. The file name including its full path is probably unique enough to serve as the Id.

Change your example to the following:

_container.AddFacility("solr", 
    new SolrNetFacility("http://localhost:8983/solr"));
var solr = _container.Resolve<ISolrOperations<IndexItem>>();
solr.Delete(SolrQuery.All);

var docs = from o in Documents
           where o.Attaches.Count > 0
           select o;

foreach (var doc in docs)
{
    foreach (var att in doc.Attaches)
    {
        try
        {
            var file = Directory.GetFiles("C:\\Attachments\\" + doc.Id );
            foreach (var s in file)
            {
                // Directory.GetFiles returns full paths as strings,
                // so the path itself serves as the unique id.
                var indexItem = new IndexItem();
                indexItem.Id = s;
                indexItem.Content = File.ReadAllText(s);
                solr.Add(indexItem);
            }

        }
        catch (Exception)
        {           
            throw;
        }
    }
}
solr.Commit();
solr.BuildSpellCheckDictionary();

If you need to index additional properties for each file, you can add them to the IndexItem class; I noticed that you have Name and CreateDate properties on the Document class above. You just need to map each property to an appropriate Solr field. Please see the SolrNet Mapping page for more details.
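
As a rough sketch of what such a mapping could look like, the class below adds the two Document properties using SolrNet's mapping attributes. The class name IndexItemWithDetails and the Solr field names "name" and "create_date" are assumptions made for illustration; each field has to be declared in your schema.xml with a matching type.

```csharp
using System;
using SolrNet.Attributes;

public class IndexItemWithDetails
{
    // SolrUniqueKey marks the property that maps to the schema's unique key field.
    [SolrUniqueKey("id")]
    public string Id { get; set; }

    [SolrField("content")]
    public string Content { get; set; }

    // Hypothetical field names; declare them in schema.xml before indexing.
    [SolrField("name")]
    public string Name { get; set; }

    [SolrField("create_date")]
    public DateTime CreateDate { get; set; }
}
```

When filling an instance inside the indexing loop, Name and CreateDate would be copied from the parent Document.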

Paige, thanks for the answer. But how can I send the "Document" class together with all of its "Attach" detail classes? Do I have to serialize them into one XML file? And how should I set up the fields in the Solr schema.xml? The documentation describes a multivalued field as an ICollection of simple values, but in my case it is a collection of Attach objects. - lewis, Feb 19 '13 at 18:58

score 1 · Answer 2 · answered Feb 19 '13 at 12:05

I guess you intend to index text extracted from plain text, HTML, DOC, and other rich documents. Your error message came from an XML parser trying to parse something that is not XML.

Use the extracting request handler, which is mapped to the /update/extract URL.
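
A minimal sketch of how this might look from SolrNet, assuming a SolrNet version that includes the Extract operation and a Solr instance with the extracting request handler enabled in solrconfig.xml. The class ExtractSketch and the method IndexFile are hypothetical names, and which Solr field receives the extracted text depends on your handler configuration.

```csharp
using System.IO;
using SolrNet;

static class ExtractSketch
{
    // Sends one file to /update/extract so that Solr extracts
    // the text on the server side and indexes it.
    public static void IndexFile<T>(ISolrOperations<T> solr, string path)
    {
        using (var stream = File.OpenRead(path))
        {
            // The second argument becomes the document id;
            // the full path is used here as an assumption.
            solr.Extract(new ExtractParameters(stream, path)
            {
                // false: index the extracted text rather than only returning it.
                ExtractOnly = false
            });
        }
    }
}
```

Call solr.Commit() after the last file, as in the question's code, so the extracted documents become searchable.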

Jesvin Jose
