Index multiple CSV files with different headers in Solr

Question

I am trying to index multiple CSV files with different "schemas" in a Solr index. There may be some common schema elements (header columns) across these CSVs. My requirement is to be able to search across these CSVs, among other items.

From what I understand, one way to index would be to treat the entire CSV as a giant text string and index that. I am not sure how searchability is affected if I index that way.
The other way is to define a common schema, then programmatically extract the columns from each file and index it line by line, with the caveat that if a file shares nothing with the common schema I may not be able to index it. (BTW, this last part may be a non-starter for me, but let's indulge the possibility for now.)
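For illustration, the row-by-row approach might look roughly like the sketch below. It does not require a shared schema: every header is normalized and given a `_s` suffix, on the assumption that the target schema has a `*_s` dynamic string field (the default configset normally does, but check yours). The helper names `to_field_name` and `csv_to_docs`, the `source_file_s` field, and the sample data are all made up for this example. Because the values are sent as plain strings, no type guessing should be involved.

```python
import csv
import io
import re


def to_field_name(header, suffix="_s"):
    """Turn a CSV header into a lowercase field name with a dynamic-field suffix."""
    base = re.sub(r"[^0-9a-zA-Z]+", "_", header.strip().lower()).strip("_")
    return base + suffix


def csv_to_docs(source_name, csv_file):
    """Yield one dict per CSV row; each dict can be sent to Solr as a document."""
    reader = csv.DictReader(csv_file)
    for row_number, row in enumerate(reader, start=1):
        # Hypothetical id scheme: file name plus row number.
        doc = {"id": f"{source_name}-{row_number}", "source_file_s": source_name}
        for header, value in row.items():
            # Skip overflow cells (header is None) and empty or missing values.
            if not header or not value:
                continue
            doc[to_field_name(header)] = value
        yield doc


# Two in-memory "files" with different headers (sample data, not from the question).
file_a = io.StringIO("Name,Order Date\nAlice,03/10/2020\n")
file_b = io.StringIO("name,city\nBob,Oslo\n")
docs = list(csv_to_docs("a.csv", file_a)) + list(csv_to_docs("b.csv", file_b))
```

The resulting list of dicts could then be serialized as a JSON array and posted to the collection's update handler. Files with no columns in common still index, and columns that happen to share a name (here `name_s`) become searchable across files.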

Are there any other ways? Is there any advantage to one over the other?

BTW, I tried the schemaless mode but it doesn't work for me. I can index the first file, but as soon as I index the next file and it has some different columns, it gives back an error. Is this expected behaviour or am I doing something wrong?

Appreciate any pointers, thanks!

Update: the error with the schemaless mode is "Invalid date format". After doing some research, it seems this is a different issue than I'd thought: Solr is autodetecting the data as a date, expects it to be in UTC format, and it's not. Is there any way for me to turn off autodetection of dates?

ref. schemaless mode - what is the error? It might be because the same column name has different types between the documents. In that case you'll have to explicitly set up the columns first, then if you still want unknown columns to be created, continue to use the schemaless mode - but with predefined columns. — MatsLindh, Mar 10 '20 at 22:49
The error with the schemaless mode is "Invalid date format". After doing some research, it seems this is a different issue than I'd thought: Solr is autodetecting the data as a date, expects it to be in UTC format, and it's not. Is there any way for me to turn off autodetection of dates? — homtanks, Mar 10 '20 at 23:29
You should be able to change the default version used - i.e. remove the date detector, see https://lucene.apache.org/solr/guide/6_6/schemaless-mode.html#SchemalessMode-DefineanUpdateRequestProcessorChain for an example. — MatsLindh, Mar 11 '20 at 07:50
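As a complement to editing the update processor chain, the earlier suggestion to set up known columns explicitly could be done through Solr's Schema API before indexing, so that schemaless mode only has to guess the types of columns it has not seen. The sketch below only builds the JSON body for an `add-field` command. The column names, the choice of the `string` type, and the helper name `add_field_commands` are assumptions for illustration.

```python
import json


def add_field_commands(columns, field_type="string"):
    """Build a Schema API body that declares each column as an explicit field."""
    return {
        "add-field": [
            {"name": name, "type": field_type, "stored": True, "indexed": True}
            for name in columns
        ]
    }


# Hypothetical columns that would otherwise be auto-detected as dates.
payload = add_field_commands(["order_date", "ship_date"])
body = json.dumps(payload)
```

The body would be POSTed with `Content-Type: application/json` to the collection's `/schema` endpoint. Once a field is declared as `string`, a value such as `03/10/2020` should be stored as-is rather than parsed as a date.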

0 Answers