I would like to fill a Solr index from a pandas DataFrame. The DataFrame looks like this:

position        value
 5.6,-2.3        65
 -35.6,-1.2      43.1

#...

etc.

I am doing the following to convert the DataFrame to a JSON object and then add it to Solr:

import json
import pandas as pd 
import pysolr

# I have a pandas dataframe df as described above
jsonObject = json.loads(df.to_json(orient='records'))

solrServer = pysolr.Solr('pathToMySolrIndex',timeout=100)

solrServer.add(jsonObject)

I get the following error:

multiple values encountered for non multiValued field position
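
One possible reading of this error, sketched below under stated assumptions: if the position column holds (x, y) tuples rather than strings (the question does not say which), each tuple is serialized as a JSON array, and Solr treats an array as several values for one field. The records and the helper name flatten_position are hypothetical; the records mimic what df.to_dict(orient='records') might return.

```python
import json

# Hypothetical records, shaped like df.to_dict(orient='records') output
# ASSUMING the position column stores (x, y) tuples.
records = [
    {"position": (5.6, -2.3), "value": 65.0},
    {"position": (-35.6, -1.2), "value": 43.1},
]

# A tuple is serialized as a JSON array, i.e. more than one value per field.
as_json = json.dumps(records[0])


def flatten_position(record, field="position"):
    """Return a copy of record with a tuple/list position joined as 'x,y'."""
    flat = dict(record)
    pos = flat[field]
    if isinstance(pos, (list, tuple)):
        flat[field] = ",".join(str(p) for p in pos)
    return flat


# Each document now carries a single string value per field.
docs = [flatten_position(r) for r in records]
```

Documents shaped like docs could then be passed to solrServer.add(docs) in place of jsonObject. Whether Solr accepts them still depends on how the position field is defined in the schema.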

If I rename the field position to _position, then it kind of works. From pysolr's documentation page, I understand this creates a parent/child dependency, which I don't really want. Indeed, reading back from the index using:

results = solrServer.search(**{'q':'*'})
df2 = pd.DataFrame(list(results))
print(df2.head())

I get something like this:

_position        value
 [5.6,-2.3]        [65]
 [-35.6,-1.2]      [43.1]

#...

Despite this "hackish" solution, I'm still not getting a good result: each element is a list. I would have preferred tuples for position and plain floats for value. I guess this comes from the orient keyword when converting to JSON.
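
As a client-side workaround, the returned documents can be tidied before building the DataFrame. The sketch below is only an illustration: the helper names unwrap_singletons and tidy are hypothetical, and it assumes each result is a plain dict whose values may be one-element lists, with the position stored either as an "x,y" string or as a list of numbers.

```python
def unwrap_singletons(doc):
    """Replace one-element lists with their single element."""
    return {
        key: value[0] if isinstance(value, list) and len(value) == 1 else value
        for key, value in doc.items()
    }


def tidy(doc, position_field="_position"):
    """Unwrap single values and turn the position into a tuple of floats."""
    clean = unwrap_singletons(doc)
    pos = clean.get(position_field)
    if isinstance(pos, str):
        # ASSUMPTION: a string position has the form "x,y".
        pos = [float(p) for p in pos.split(",")]
    if isinstance(pos, list):
        clean[position_field] = tuple(pos)
    return clean


# Hypothetical documents covering both possible shapes of the position.
sample = [
    {"_position": ["5.6,-2.3"], "value": [65.0]},
    {"_position": [-35.6, -1.2], "value": [43.1]},
]
cleaned = [tidy(d) for d in sample]
```

With real results, pd.DataFrame([tidy(d) for d in results]) would then give tuples and floats instead of lists. This only hides the lists on the reading side; the values are still stored as multi-valued in the index.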

Questions and Expected output

First, I would like to avoid renaming position to _position . The Solr database doesn't have to contain renamed fields for the sake of pysolr.

Second, I would like to avoid getting lists when reading from the built Solr index. I know that Solr doesn't have to store numerical values as lists. The problem seems to come from the conversion from DataFrame to JSON. How can I avoid this?

ma3oun
  • So, what is your expected output? Where do you have the trouble? With `solr` or `pandas`? – cs95 Dec 16 '17 at 12:40
  • The trouble is somewhere in between. Problem with pandas: generating JSON objects that don't lead to lists as elements in the database. Problem with pysolr: adding multivalued values without "hacking" the column name – ma3oun Dec 16 '17 at 13:49
  • You can't add multiple values to a field that has set multiValued=false explicitly (or left as default) in Solr schema. Could you please share relevant solr schema parts? – Qusai Alothman Dec 18 '17 at 18:51
  • I don't know how to specify the schema using pySolr – ma3oun Dec 18 '17 at 18:52
  • Solr schema is not defined using pysolr, it is defined using schema.xml or from the Admin UI. Didn't you define the fields and their types before using them? How did you do that? – Qusai Alothman Dec 18 '17 at 18:57
  • When I created my collection using the Admin UI, I set it to data driven. This means that the schema is deduced from the data itself – ma3oun Dec 18 '17 at 18:59
  • It seems to me that solr has set `multiValued=false` for that field, and that is the root problem. When you add the "_position" field, it doesn't make any parent/child relations, it just defines a new field with `multiValued=true`. Could you please check from the Admin UI if the field "position" is defined and doesn't have `multiValued=true`? You can find that information from "schema" tab, then selecting "position" from the drop-down list. – Qusai Alothman Dec 18 '17 at 19:13
