
I am using Spark 3.x in Python. I have millions of rows in CSV files that I have to index into Apache Solr. I am using the pysolr module for this purpose:

import pysolr
def index_module(row):
    ...
    solr_client = pysolr.Solr(SOLR_URI)
    solr_client.add(row)
    ...
df = spark.read.format("csv").option("sep", ",").option("quote", "\"").option("escape", "\\").option("header", "true").load("sample.csv")

df.toJSON().map(index_module).count()

The index_module function simply gets one row of the data frame as JSON and indexes it in Solr via the pysolr module. pysolr supports indexing a list of documents instead of a single one. I have to update my logic so that, instead of sending one document per request, I send a list of documents. That will definitely improve performance.

How can I achieve this in PySpark? Is there an alternative or better approach than map and toJSON?

Also, all my work is done in transformation functions. I am using count just to trigger the job. Is there an alternative dummy function (an action) in Spark to do the same?

Finally, I have to create a Solr object for each row; is there any alternative to this?
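
For clarity, something along these lines is what I am aiming for (a rough sketch only, not tested; SOLR_URI and BATCH_SIZE are placeholders I would tune for my setup):

import json
import pysolr

SOLR_URI = "http://localhost:8983/solr/my_collection"  # placeholder URI
BATCH_SIZE = 1000                                       # placeholder batch size

def index_partition(json_rows):
    # One pysolr client per partition instead of one per row.
    solr_client = pysolr.Solr(SOLR_URI)
    batch = []
    for json_row in json_rows:
        batch.append(json.loads(json_row))
        if len(batch) >= BATCH_SIZE:
            solr_client.add(batch)   # pysolr accepts a list of documents
            batch = []
    if batch:
        solr_client.add(batch)       # flush the remainder

# df is the data frame loaded from sample.csv as above.
# foreachPartition is an action, so no dummy count() is needed.
df.toJSON().foreachPartition(index_partition)

Alternatively, df.foreachPartition could be used directly on Row objects (converting each with row.asDict()), which would avoid toJSON and json.loads entirely. Is this the right direction?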

Hafiz Muhammad Shafiq
  • Technically, you should use something like `df.write.format("solr")`, which means Spark writes directly to Solr. But it is not native; it requires some additional libs. Check this [link](https://databricks.com/fr/session/integrating-spark-and-solr), it may help you (a rough sketch is below the comments). If I find something helpful, I'll post it as an answer. – Steven Jun 04 '21 at 07:55
  • Yes, but for Spark 3.x I have not found any such option. – Hafiz Muhammad Shafiq Jun 04 '21 at 08:10
  • You can send a list of documents with `solr_client.add(docs)` with docs being a list of rows, you just need to define a batch size and split the data accordingly instead of mapping each row individually. – EricLavault Jun 04 '21 at 11:17
  • Yes, I have an idea that it will work, but how? Can you please guide me with an example? – Hafiz Muhammad Shafiq Jun 04 '21 at 12:09
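
For reference, the connector approach mentioned by Steven would look roughly like this. This is only a sketch: it assumes the Lucidworks spark-solr package is on the classpath, and the zkhost and collection values are placeholders; as noted in the comments, Spark 3.x support is unclear.

# Assumes the Lucidworks spark-solr connector is available, e.g. started with
#   spark-submit --packages com.lucidworks.spark:spark-solr:<version> ...
(df.write
   .format("solr")
   .option("zkhost", "zk1:2181,zk2:2181/solr")  # placeholder ZooKeeper connection string
   .option("collection", "my_collection")       # placeholder collection name
   .save())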

0 Answers