
I am trying to stream a Delta table from cluster A to cluster B, but I am not able to load from or write data to a different cluster:

# Read the source Delta table from cluster A as a stream
streamingDf = spark.readStream.format("delta").option("ignoreChanges", "true") \
              .load("hdfs://cluster_A/delta-table")

# Write the stream to a Delta sink on cluster B
stream = streamingDf.writeStream.format("delta").option("checkpointLocation", "/tmp/checkpoint") \
         .start("hdfs://cluster_B/delta-sink")

Then, I get the following error:
org.apache.hadoop.hdfs.BlockMissingException: Could not obtain block

So, my question is whether it is possible to stream data directly between two clusters using the Delta format, or whether additional technologies are required to achieve this.

Thanks!

Alex Ott

2 Answers


It is supported. See https://docs.delta.io/latest/quick-start.html#write-a-stream-of-data-to-a-table

The exception you are facing seems to be due to an issue with the NameNode and DataNodes in your HDFS/Delta Lake cluster. See https://www.linkedin.com/pulse/solved-mystery-blockmissingexception-hadoop-file-system-rajat-kumar
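
As a rough sketch (host names, ports, and paths below are placeholders, not taken from the question), a cross-cluster stream needs nothing beyond fully qualified HDFS URIs on both sides, plus a checkpoint location that every executor can reach:

from pyspark.sql import SparkSession

# Placeholder hosts/ports; adjust to your clusters. Delta Lake must be on
# the classpath (e.g. spark-submit --packages io.delta:delta-core_2.12:<version>).
spark = SparkSession.builder.appName("cross-cluster-delta").getOrCreate()

# Read the source table from cluster A via its NameNode RPC address
streamingDf = (spark.readStream.format("delta")
               .option("ignoreChanges", "true")
               .load("hdfs://namenode-a:8020/delta-table"))

# Keep the checkpoint on HDFS next to the sink rather than a local /tmp path,
# so it is durable and visible to every executor
stream = (streamingDf.writeStream.format("delta")
          .option("checkpointLocation", "hdfs://namenode-b:8020/checkpoints/delta-sink")
          .start("hdfs://namenode-b:8020/delta-sink"))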

Vindhya G
  • Everything in the documentation is within the same cluster; I cannot see whether it is possible to share information between clusters using this Delta streaming. I think a technology like Kafka is needed – Juan Camilo Calero Espinosa Jan 31 '22 at 14:44
  • There is nothing to indicate it must be within the same cluster. As long as both clusters are reachable from the Spark cluster, it should be possible: https://docs.delta.io/latest/delta-storage.html. The example there also indicates you can use two different clusters as URLs, as long as both are reachable from your Spark cluster – Vindhya G Jan 31 '22 at 16:18

The error was related to firewall rules: all the nodes in cluster A must have access to all the nodes in cluster B on the corresponding ports. I had only opened the ports on the NameNodes.
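
For anyone hitting the same BlockMissingException, a quick way to confirm it is a firewall problem is to probe the DataNode ports from a node in cluster A. The hosts and ports below are illustrative, not from the original setup; the DataNode data-transfer default is 50010 on Hadoop 2.x and 9866 on Hadoop 3.x:

import socket

# Illustrative hosts/ports: every node in cluster A must reach every
# DataNode in cluster B, not just the NameNode RPC port.
targets = [
    ("namenode-b", 8020),   # NameNode RPC
    ("datanode-b1", 9866),  # DataNode data transfer (Hadoop 3.x default)
    ("datanode-b2", 9866),
]

for host, port in targets:
    try:
        with socket.create_connection((host, port), timeout=5):
            print(f"OK      {host}:{port}")
    except OSError as exc:
        print(f"BLOCKED {host}:{port} ({exc})")

If the NameNode probe succeeds but the DataNode probes fail, you will see exactly this symptom: the Delta transaction log is readable, but block reads fail with BlockMissingException.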