
I am new to HDFS and Spark. I have input data for some simulations that is specific to regions (a region might be a country or part of a country) and is a function of time.

Let's assume I have the following tables:

region:
id, name

-

population:
id, region_id, year_2020, year_2021, .... year_2050

-

sun_hours:
id, region_id, year_2020, year_2021, .... year_2050

(The actual situation is more complex, with more nested entities and foreign key relations).

I want to import all data from MySQL to HDFS and partition it by region_id.

Ideally, each node of my cluster would be responsible for one region, so that I can easily process the regions in parallel. If I wanted to consider more regions, I could scale the simulation by adding further nodes. (A basic assumption of my simulation is that regions do not interact with each other.)

During the simulation, I would like the data not to be passed around between nodes, because all data required for one specific region should already be located on one specific node. It is OK for me if some of the data gets duplicated in the process of splitting/distribution. I expect the simulation for a single region to be quite demanding, so a single node should probably not have to compute several regions in parallel.
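For the layout in HDFS itself, the closest thing I have found so far is Spark's write-time partitioning. A minimal sketch of what I have in mind (spark-shell style, with placeholder paths; as far as I understand, this only controls the directory layout per region_id, not which datanodes end up holding the blocks):

    // spark-shell sketch; paths are placeholders for wherever the tables land first
    val population = spark.read.parquet("hdfs:///staging/population")
    val sunHours   = spark.read.parquet("hdfs:///staging/sun_hours")

    // one sub-directory per region_id value, e.g. .../population/region_id=1/
    population.write.partitionBy("region_id").parquet("hdfs:///data/population")
    sunHours.write.partitionBy("region_id").parquet("hdfs:///data/sun_hours")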

I found a sqoop command to import several tables to HDFS:

sqoop import-all-tables --connect jdbc:mysql://db.foo.com/corp

https://sqoop.apache.org/docs/1.4.2/SqoopUserGuide.html#_literal_sqoop_import_all_tables_literal

However, I found no way to specify how the data should be partitioned and distributed. That process should consider some "main entity" (= region).

Can I do so with sqoop at all? If yes, could you please provide an example command?

If no, is there some other tool available for my purpose?

If I need to do it on my own, would you recommend that I

a) first import all data and then reorganize it or

b) first reorganize the data, write it for example to region-specific text files, and afterwards import it to HDFS?

Even if I manage to reorganize the (hierarchical) MySQL data, how can I ensure that all related data can be found at a single node and is not fragmented over the whole cluster?

In other words: does HDFS, Spark or some other big data tool have a feature to keep related content together? Let's say I have one RDD containing population and another RDD containing sun hours. The population and sun hours for region 1 should be located on node x, the population and sun hours for region 2 on node y, and so on.

(For fault tolerance it is fine, and I guess required, to replicate the data on several nodes. I just want to ensure that during a simulation without node issues, the traffic between the nodes is as low as possible.)
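The closest built-in feature I have found so far are Spark's pair-RDD partitioners: if two RDDs keyed by region_id share the same partitioner, a join between them does not reshuffle them again, although I still cannot pin a partition to a fixed node. A sketch of what I imagine, with made-up data and an arbitrary partition count (spark-shell style):

    import org.apache.spark.HashPartitioner

    // made-up pair RDDs keyed by region_id (sc is the spark-shell SparkContext)
    val population = sc.parallelize(Seq((1, 1000.0), (2, 2000.0)))
    val sunHours   = sc.parallelize(Seq((1, 1700.0), (2, 1500.0)))

    // same partitioner on both sides => all records of a region_id
    // end up in the same partition of each RDD
    val partitioner = new HashPartitioner(8)   // partition count is arbitrary here
    val popByRegion = population.partitionBy(partitioner).cache()
    val sunByRegion = sunHours.partitionBy(partitioner).cache()

    // co-partitioned join: neither RDD is shuffled again
    val joined = popByRegion.join(sunByRegion)
    joined.collect().foreach(println)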

Edit

I just found GeoSpark, which states:

GeoSpark spatial partitioning method can significantly speed up the join query. Three spatial partitioning methods are available: KDB-Tree, Quad-Tree and R-Tree. Two SpatialRDD must be partitioned by the same way.

If you first partition SpatialRDD A, then you must use the partitioner of A to partition B.

objectRDD.spatialPartitioning(GridType.KDBTREE)
queryWindowRDD.spatialPartitioning(objectRDD.getPartitioner)

https://datasystemslab.github.io/GeoSpark/tutorial/rdd/

So maybe I should try to convert my MySQL database to a format that is compatible with GeoSpark ("can be loaded from CSV, TSV, WKT, WKB, Shapefiles, GeoJSON and NetCDF/HDF format").

1 Answer


Sqoop (which is not Spark) is geared more toward tables. It can use views, but it has been stated that for complex views the results may even be unreliable. So, that avenue is closed.

You will need to use a spark.read JDBC connection with a view in MySQL that uses region_id as the key for distribution - for your parallelism - using the numPartitions approach defined on a "driving" table. The join with the other tables needs to rely on the MySQL engine.
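A minimal sketch of such a partitioned read; the view name, bounds, credentials and partition count are assumptions, and the MySQL JDBC driver must be on the classpath. Spark then issues one query per partition, each covering a range of region_id values:

    // spark-shell sketch of a partitioned JDBC read
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:mysql://db.foo.com/corp")
      .option("dbtable", "region_view")        // hypothetical "driving" view
      .option("user", "corp_user")             // placeholder credentials
      .option("password", "secret")
      .option("partitionColumn", "region_id")
      .option("lowerBound", "1")               // min(region_id) - assumption
      .option("upperBound", "200")             // max(region_id) - assumption
      .option("numPartitions", "200")          // roughly one partition per region
      .load()

    println(df.rdd.getNumPartitions)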

I am not privy to your processing, but it seems hard to enforce a 1-to-1 region_id-to-partition mapping. Moreover, more than one partition may exist on the same node - but independently.

You could get all tables independently and then JOIN, but there would be shuffling, as there is no way to guarantee that the results of all the individual reads end up on the same node.
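If you go that route, you can at least confine the shuffling to one explicit repartition step: hash-repartition both sides by region_id with the same partition count before the JOIN, and Spark can normally reuse that partitioning for the join itself. A sketch with hypothetical paths and partition count:

    import org.apache.spark.sql.functions.col

    // spark-shell sketch; paths and partition count are hypothetical
    val population = spark.read.parquet("hdfs:///data/population")
    val sunHours   = spark.read.parquet("hdfs:///data/sun_hours")

    // hash-repartition both sides by the join key, same partition count
    val popByRegion = population.repartition(200, col("region_id"))
    val sunByRegion = sunHours.repartition(200, col("region_id"))

    // Spark can usually reuse this partitioning, so the join itself
    // does not need an additional exchange
    val joined = popByRegion.join(sunByRegion, Seq("region_id"))
    joined.write.parquet("hdfs:///data/joined_by_region")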

  • If I understand you correctly, I should not use sqoop for the import. Instead I should try to create a single huge view in MySQL where one region corresponds to a single row. Then I could use spark to read and partition the data by row. That might be a way to go... even if I am afraid that the *single monster MySQL view* might be beyond the capabilities of my database. I tried to improve my question and will wait for some more answers. If there are no additions I'll accept yours. – Stefan Jun 28 '19 at 07:24
  • I would have thought more than 1 row for a region returned, surely. But I am not in line with your idea. I just gave my view. We have no notion of what is huge. You would need to add more detail. – thebluephantom Jun 28 '19 at 07:34
  • Well, if you prefer some numbers ... let's assume that joining all input data in a single master view would result in 10 000 columns and 10 000 000 rows. – Stefan Jun 28 '19 at 07:42
  • For machine learning? That's a lot for a relational db unless you are generating those columns. In that case you could sqoop and JOIN. I could modify my answer, but when are the 10K columns materialized? But even mysql with 10M is doable - if you have the right cluster. So, when 10K rows? – thebluephantom Jun 28 '19 at 10:11
  • In general, the more info one provides, the better the answer. Nonetheless, the answer suffices for the information given. – thebluephantom Jun 28 '19 at 10:13