I am new to HDFS and Spark. I have input data for some simulations that is specific to regions (a region might be a country or part of a country) and is a function of time.
Let's assume I have the following tables:
region:
id, name
-
population:
id, region_id, year_2020, year_2021, .... year_2050
-
sun_hours:
id, region_id, year_2020, year_2021, .... year_2050
(The actual situation is more complex, with more nested entities and foreign key relations).
I want to import all data from MySQL to HDFS and partition it by region_id.
Ideally, each node of my cluster would be responsible for one region, so that I can easily process the regions in parallel. If I want to consider more regions, I could scale the simulation by adding further nodes. (A basic assumption of my simulation is that regions do not interact with each other.)
I would like the data not to be passed around between nodes during the simulation, because all the data required for one specific region should already be located on one specific node. It is OK for me if some of the data is duplicated in the process of splitting/distribution. I expect the simulation for a single region to be quite demanding, so a single node probably should not calculate several regions in parallel.
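What I have in mind is roughly the following (just a sketch, assuming the tables are already loaded as DataFrames named population and sunHours; those names are made up):

import org.apache.spark.sql.functions.col

// Repartition both datasets by region_id, so that all rows belonging to
// one region end up in the same partition and can be processed by a
// single task.
val populationByRegion = population.repartition(col("region_id"))
val sunHoursByRegion = sunHours.repartition(col("region_id"))

But I do not know whether this also guarantees that the partitions of the two datasets for the same region end up on the same node.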
I found a Sqoop command to import several tables to HDFS:
sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp
https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_import_all_tables_literal
However, I found no way to specify how the data should be partitioned and distributed across the cluster. The partitioning should be based on some "main entity" (here: the region).
Can I do that with Sqoop at all? If yes, could you please provide an example command?
If not, is there some other tool available for this purpose?
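For a single table I can at least tell Sqoop which column to split the import work by, for example (the paths are made up):

sqoop import --connect jdbc:mysql://db.foo.com/corp \
    --table population \
    --split-by region_id \
    --target-dir /data/population

But as far as I understand, --split-by only controls how the rows are divided among the import mappers, not where the resulting files end up in HDFS and not whether related rows from different tables are co-located.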
If I need to do it on my own, would you recommend
a) first importing all data and then reorganizing it (see the sketch below), or
b) first reorganizing the data, writing it, for example, to region-specific text files, and importing those files into HDFS afterwards?
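For option a), what I imagine is something like this (a rough Spark sketch; it assumes the Sqoop output is available as CSV under /data/population, which is a made-up path):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("reorganize-by-region").getOrCreate()

// Read the imported table and rewrite it partitioned by region, so that
// each region ends up in its own HDFS directory.
val population = spark.read.option("header", "true").csv("/data/population")
population.write
  .partitionBy("region_id")
  .parquet("/data/population_by_region")

If I understand HDFS correctly, though, this still does not let me control on which datanode the blocks of a region's directory are placed.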
Even if I manage to reorganize the (hierarchical) MySQL data, how can I ensure that all related data can be found on a single node and is not fragmented over the whole cluster?
In other words: does HDFS, Spark, or some other big data tool have a feature to keep related content together? Let's say I have one RDD containing population and another RDD containing sun hours. The population and sun hours for region 1 should be located on node x, the population and sun hours for region 2 on node y, and so on.
(For fault tolerance it is fine, and I guess even required, to replicate the data on several nodes. I just want to ensure that during a simulation without node failures, the traffic between the nodes is as low as possible.)
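The closest thing I found so far in Spark is to key both RDDs by region_id and partition them with the same partitioner, roughly like this (a sketch; the RDD and field names are made up):

import org.apache.spark.HashPartitioner

// Using the same partitioner for both RDDs makes the join a narrow
// dependency, i.e. no shuffle is needed when they are joined.
val partitioner = new HashPartitioner(numRegions)

val populationByRegion = populationRdd.keyBy(_.regionId).partitionBy(partitioner)
val sunHoursByRegion = sunHoursRdd.keyBy(_.regionId).partitionBy(partitioner)

val joined = populationByRegion.join(sunHoursByRegion)

But I do not know whether "same partition" also implies "same node", i.e. whether Spark gives any guarantee about where the co-partitioned data physically lives.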
Edit
I just found GeoSpark, which states:
GeoSpark spatial partitioning method can significantly speed up the join query. Three spatial partitioning methods are available: KDB-Tree, Quad-Tree and R-Tree. Two SpatialRDD must be partitioned by the same way.
If you first partition SpatialRDD A, then you must use the partitioner of A to partition B.
objectRDD.spatialPartitioning(GridType.KDBTREE)
queryWindowRDD.spatialPartitioning(objectRDD.getPartitioner)
https://datasystemslab.github.io/GeoSpark/tutorial/rdd/
So maybe I should try to convert my MySQL database to a format that is compatible with GeoSpark ("can be loaded from CSV, TSV, WKT, WKB, Shapefiles, GeoJSON and NetCDF/HDF format").
Related articles: