
I want to query an SAP table from Databricks. I have installed the JDBC library for connecting to the SAP server. I am able to connect and fetch records using spark.read.jdbc(url=jdbcUrl, table=query, properties=connectionProperties). In the table parameter, I am providing the schema and table name. I want to fetch only selective records. Is there any way to provide a query?


1 Answer


You just need to use standard Spark functions such as .filter/.where - Spark will convert them into the corresponding SQL query and push it down to the database. This functionality (predicate pushdown) is enabled in Spark by default, as you can see in the documentation.

df = spark.read.jdbc(url=jdbcUrl, table=query,
                     properties=connectionProperties)
# Spark translates this predicate into a WHERE clause and pushes it
# down to the database, so only matching rows are transferred
filtered = df.filter("condition")
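If you want to push an explicit SQL query rather than a table name, Spark's JDBC reader also accepts a parenthesized subquery with an alias as the table parameter, so only the query's result set is read. A minimal sketch, with hypothetical schema, table, and column names (SAPSR3.MARA, MTART) standing in for your own:

# Hypothetical names for illustration; substitute your schema/table/columns
pushdown_query = "(SELECT MATNR, MTART FROM SAPSR3.MARA WHERE MTART = 'FERT') AS mara_subset"
df = spark.read.jdbc(url=jdbcUrl, table=pushdown_query,
                     properties=connectionProperties)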
  • Thanks for the response. But this way I am querying the dataframe; is there any way to query the table itself? The table contains a huge amount of data and I don't want to fetch all the records. – Aswad Nov 18 '21 at 05:56
  • Querying the dataframe is automatically translated into querying the table - this is the functionality called predicate pushdown. When you do .read, the table isn't read completely because Spark is lazy; you can apply filtering, etc., and the actual reading happens only when you trigger an "action". I recommend reading the first chapters of the Learning Spark book - it's freely available from the Databricks site. – Alex Ott Nov 18 '21 at 06:25
  • I have tried using a filter, but it's still taking too much time to load the data. A load that should take under a minute is taking around 30 minutes. – Aswad Nov 19 '21 at 12:53
  • What does `filtered.explain()` produce? – Alex Ott Nov 19 '21 at 13:19
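To check whether the filter is actually reaching the database, as the last comment suggests, inspect the physical plan:

# A pushed-down predicate appears in the JDBC scan node of the plan
# as a "PushedFilters: [...]" entry; if that list is empty, the
# condition is evaluated in Spark after the full table has been read
filtered.explain()

If the filter is pushed down but the load is still slow, the read may be limited to a single JDBC connection; the jdbc reader's column/lowerBound/upperBound/numPartitions arguments can split it into parallel partitioned reads.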