I am trying to index multiple CSV files with different "schemas" in a Solr index. There are possibly some common schema elements (header columns) across these CSVs. My requirement is to be able to provide search across these CSVs, among other content.
- From what I understand, one way to index would be to treat each entire CSV as one giant text string and index that. I am not sure what searchability aspects are impacted if I index that way.
- The other way is to define a common schema and then programmatically extract the columns from each file and index it row by row, with the caveat that if a file doesn't share any of the common schema I may not be able to index it. (BTW, this last part may be a non-starter for me, but let's indulge the possibility for now. A rough sketch of both options is below.)
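To make this concrete, here is a rough Python sketch of what I have in mind for both options, using the pysolr client. The core name "mycore", the field names, and the set of common columns are just placeholders/assumptions, not my actual setup:

```python
# Rough sketch only -- "mycore", the field names, and COMMON_COLUMNS are
# placeholders/assumptions, not my real config.
import csv
import pysolr

solr = pysolr.Solr("http://localhost:8983/solr/mycore", always_commit=True)

# Columns I'd treat as the "common schema"; everything else is ignored for now.
COMMON_COLUMNS = {"id", "title", "description"}

def index_csv_by_rows(path):
    """Option 2: map each row onto the common schema, one Solr doc per row."""
    with open(path, newline="", encoding="utf-8") as f:
        docs = []
        for i, row in enumerate(csv.DictReader(f)):
            doc = {k: v for k, v in row.items() if k in COMMON_COLUMNS and v}
            doc.setdefault("id", f"{path}-{i}")  # make sure every doc has an id
            docs.append(doc)
        solr.add(docs)

def index_csv_as_blob(path):
    """Option 1: the whole file as one big text field on a single document.

    Assumes a *_txt dynamic (text) field or similar exists in the schema.
    """
    with open(path, encoding="utf-8") as f:
        solr.add([{"id": path, "content_txt": f.read()}])
```

(With the blob variant I'd obviously lose per-column search/filtering, which is part of my concern about searchability.)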
Are there any other ways? Is there any advantage to one over the other?
BTW, I tried schemaless mode but it doesn't work for me. I can index the first file, but the moment I index the next file, which has some different columns, it gives back an error. Is this expected behaviour or am I doing something wrong?
Appreciate any pointers, thanks!
Update: the error with schemaless mode is "Invalid date format". After doing some research, it seems like this is a different issue than what I'd thought: it happens because Solr is auto-detecting the data as a date and expects it to be in UTC format, and it's not. Is there any way for me to turn off auto-detection of dates?
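From what I can tell so far, the date guessing seems to come from the "parse-date" processor (ParseDateFieldUpdateProcessorFactory) in the schemaless update chain of my configset's solrconfig.xml. My assumption (please correct me if this is wrong) is that dropping "parse-date" from that chain would leave such values as plain strings, something like:

```xml
<!-- In my configset's solrconfig.xml (based on the _default configset), the
     schemaless chain currently lists "parse-date" among its processors:
       processor="uuid,remove-blank,field-name-mutating,parse-boolean,
                  parse-long,parse-double,parse-date,add-schema-fields"
     My guess is that removing "parse-date" disables date auto-detection, so
     unrecognised values just get indexed as strings/text: -->
<updateRequestProcessorChain name="add-unknown-fields-to-the-schema"
    default="${update.autoCreateFields:true}"
    processor="uuid,remove-blank,field-name-mutating,parse-boolean,parse-long,parse-double,add-schema-fields">
  <processor class="solr.LogUpdateProcessorFactory"/>
  <processor class="solr.DistributedUpdateProcessorFactory"/>
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Is that the right knob to turn, or is there a cleaner way to do this?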