I have two servers. The first server (A) contains the zookeeper, a mongodb database and a drillbit. The second server (B) contains a hadoop distribution with several hive tables, a postgresql database and the other drillbit. Both drillbits can see eachother on the drill homepages, since they are both connected to the zookeeper on server A. When a query like the one below is run (the view dfs.lineorder consists of the hive and postgresql data on server B), drillbit B becomes the foreman and executes everything. It does not delegate the parts dealing with mongodb tables to the drillbit on server A, so 90% of the time required to run the query (approx. 30 minutes) is spent on sending the mongodb collection from server A to server B. Is there a way to force drillbit B to delegate the mongodb part of the query to drillbit B, or could this be a misconfiguration problem? Also, can drillbits be configured to access different databases (e.g. when each drillbit has access to a distinct subnet, so that not all drillbits can access the same databases)?
Sample query:
SELECT SUM(revenue) AS revenue
FROM (
SELECT SUM(lo_extendedprice*lo_discount) AS revenue
FROM dfs.tmp.lineorder, dfs.tmp.`date`
WHERE lo_orderdate = d_datekey
AND d_year = 1993
AND lo_discount BETWEEN 1 AND 3
AND lo_quantity < 25
UNION ALL
SELECT SUM(lo_extendedprice * lo_discount) AS revenue
FROM mongo.test.ssb_europe ssb
WHERE ssb.orderdate.d_year = 1993
AND lo_discount BETWEEN 1 AND 3
AND lo_quantity < 25
);
Configuration (drill-override.conf):
On drillbit A (Windows Server 2008): drill.exec: { cluster-id: "drillbits1", zk.connect: "serverA:2181", impersonation: { enabled: true, max_chained_user_hops: 3 } }
On drillbit B (Cloudera CDH 5.8.0): drill.exec: { cluster-id: "drillbits1", zk.connect: "serverA:2181", impersonation: { enabled: true, max_chained_user_hops: 3 } }