
I get a feed file with data in the format below, separated by custom delimiters

employee_id||034100151730105|L|
employee_cd||03410015|L|
dept_id||1730105|L|
dept_name||abc|L|
employee_firstname||pqr|L|
employee_lastname||ppp|L|
|R||L|
employee_id||034100151730108|L|
employee_cd||03410032|L|
dept_id||4230105|L|
dept_name||fdfd|L|
employee_firstname||sasas|L|
employee_lastname||dfdf|L|
|R||L|
.....

So my row delimiter is |R||L|, each field delimiter is |L|, and the field name (employee_id) and field value (034100151730105) are separated by ||

I need to load and index this data into Solr using /update so that each record looks like this:

employee_id: 034100151730105
employee_cd: 03410015 
...

Can someone please help me with how I can parse and load this feed into Solr?

user1637487

1 Answer


As-is, Solr will not be able to ingest this. The easiest approach would be:

  1. use command-line tools like grep/sed to convert this format into a proper CSV that Solr's /update handler will understand. You need to replace |L| and || with a delimiter, replace |R||L| with a newline, and take care of escaping the delimiter you use, etc. (see the sketch after this list)
  2. then use /update with the usual parameters ('separator' etc.)
  3. ignore all field names with 'skip'
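
For illustration, here is a rough sketch of that conversion step, written in Java rather than sed only so that both sketches in this answer use the same language. The file names (feed.txt, employees.csv), the comma separator, and the fixed column list are assumptions you would adjust to your data.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: convert the custom-delimited feed into a CSV for Solr's /update.
// feed.txt, employees.csv, the comma separator and the column list are assumptions.
public class FeedToCsv {

    // Assumed fixed column order; adjust to the real feed.
    private static final List<String> COLUMNS = Arrays.asList(
            "employee_id", "employee_cd", "dept_id",
            "dept_name", "employee_firstname", "employee_lastname");

    public static void main(String[] args) throws Exception {
        String feed = new String(Files.readAllBytes(Paths.get("feed.txt")));

        List<String> csv = new ArrayList<>();
        csv.add(String.join(",", COLUMNS));                     // header row

        for (String record : feed.split("\\|R\\|\\|L\\|")) {    // |R||L| ends a record
            Map<String, String> fields = new HashMap<>();
            for (String pair : record.split("\\|L\\|")) {        // |L| ends a field
                String[] nv = pair.trim().split("\\|\\|", 2);    // || splits name/value
                if (nv.length == 2 && !nv[0].trim().isEmpty()) {
                    fields.put(nv[0].trim(), nv[1].trim());
                }
            }
            if (fields.isEmpty()) {
                continue;                                        // skip empty trailing chunks
            }
            List<String> row = new ArrayList<>();
            for (String col : COLUMNS) {
                row.add(fields.getOrDefault(col, ""));           // missing field -> empty cell
            }
            // NOTE: no CSV escaping here; add it if values can contain commas or quotes.
            csv.add(String.join(",", row));
        }
        Files.write(Paths.get("employees.csv"), csv);
    }
}

With a header row in the generated CSV, Solr's CSV handler can take the field names from the first line, so the 'skip'/field-name handling in steps 2 and 3 becomes optional.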

Or you can write a fairly simple piece of code that reads each doc into memory and indexes it in Solr via SolrJ or HTTP.
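
A minimal SolrJ sketch could look like the following; the Solr URL, the core name ("employees") and the feed path are assumptions. Because the field names are taken from the data itself, records that are missing some fields are handled without extra configuration.

import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.common.SolrInputDocument;

// Sketch: parse the feed and index each record directly with SolrJ.
// The Solr URL, the core name ("employees") and the feed path are assumptions.
public class FeedIndexer {

    public static void main(String[] args) throws Exception {
        String feed = new String(Files.readAllBytes(Paths.get("feed.txt")));

        try (SolrClient solr = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/employees").build()) {

            for (String record : feed.split("\\|R\\|\\|L\\|")) {   // |R||L| ends a record
                SolrInputDocument doc = new SolrInputDocument();
                for (String pair : record.split("\\|L\\|")) {       // |L| ends a field
                    String[] nv = pair.trim().split("\\|\\|", 2);   // || splits name/value
                    if (nv.length == 2 && !nv[0].trim().isEmpty()) {
                        // Field names come straight from the data, so records
                        // missing some fields need no special handling.
                        doc.addField(nv[0].trim(), nv[1].trim());
                    }
                }
                if (!doc.isEmpty()) {
                    solr.add(doc);
                }
            }
            solr.commit();
        }
    }
}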

Persimmonium
  • Thanks for your response. I have replaced |L| with |, |R||L| with a new line, and || with =. When I try to update, it treats "employee_id=034100151730105" as one entity instead of having "employee_id" as the field name and "034100151730105" as the value. Is there any way I can specify a field-level separator and say "fieldname=fieldvalue"? – user1637487 Mar 25 '17 at 14:26
  • I have updated my answer; you have to handle the employee_id etc. fields as normal fields too, just ignore them when indexing – Persimmonium Mar 25 '17 at 15:00
  • I need to rely on each record to find its field name, e.g. the field name is "employee_id" and the value is "034100151730105". The reason is that some records may not have all the fields, so the field name has to be assigned dynamically when loading the data instead of being specified in /update. If I load them as normal fields and skip them while indexing, the resulting data will just be 034100151730105=034100151730105, 03410015=03410015, etc. Can you please let me know if there is any way I can have field names assigned dynamically while parsing the data? – user1637487 Mar 26 '17 at 16:16
  • Then just write some code; it's going to be easier – Persimmonium Mar 26 '17 at 21:41