  1. I use fields grouping with Storm.
  2. The problem is that, because I have multiple machines and (obviously) multiple bolts, tuples move between machines, and I suspect this reduces my performance drastically.
  3. Is it possible to pin a specific fields-grouping outcome to a specific machine?
  4. In more detail: can all bolts for account1 be sticky to machine1, all bolts for account2 to machine3, all bolts for account3 to machine1, and so on, so that each account has all of its bolts running on a single machine?
  5. Note that once the first bolt processes the event, it emits accountid in its output tuple; all further bolts from that point on have accountid, meaning I would want to do fields grouping on accountid for all further bolts in the topology. (Added for clarification after seeing the first answer.)
Jas
  • http://stackoverflow.com/questions/36368224/is-there-a-way-to-apply-multiple-groupings-in-storm/36374837?noredirect=1#comment60465924_36374837 also discusses the same issue. – user2250246 Apr 05 '16 at 17:40

2 Answers


Assume you have 3 producers P1, P2, P3 and 3 consumers C1, C2, C3, with three machines each hosting a single producer-consumer pair, i.e., P1-C1. Also assume you have 3 distinct key values a, b, c, and that C1 processes all tuples with key a.

In general, tuples with key a can be emitted by all three producers. Furthermore, P1 can also emit tuples with key b or c. Thus, you cannot limit the data transfer to local machines using fields-grouping; you need to re-partition all the data.
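The point can be seen outside of Storm. Assuming fields-grouping routes a key by hashing it modulo the number of consumer tasks (a simplification of Storm's actual partitioner, used here only as a sketch), every producer ends up sending to every consumer:

```java
import java.util.*;

public class FieldsGroupingSim {
    // Simplified fields-grouping: route a key to a consumer by hashing it.
    // This mirrors the idea of hash-mod partitioning; it is not Storm's code.
    static int route(String key, int numConsumers) {
        return Math.abs(key.hashCode()) % numConsumers;
    }

    public static void main(String[] args) {
        String[] keys = {"a", "b", "c"};
        int numConsumers = 3;
        // Each of the 3 producers P1..P3 can emit any key, so each producer
        // sends tuples to whichever consumer owns each key -- i.e., to all of them.
        for (int producer = 1; producer <= 3; producer++) {
            Set<Integer> targets = new TreeSet<>();
            for (String key : keys) {
                targets.add(route(key, numConsumers));
            }
            System.out.println("P" + producer + " sends to consumers " + targets);
            // → P1 sends to consumers [0, 1, 2] (and likewise P2, P3)
        }
    }
}
```

Since every producer hits every consumer, two thirds of the traffic crosses machine boundaries no matter how the pairs are placed.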

Extension

If you have additional bolts B1 to B3 that consume data from C1 to C3 and those use the same fields-grouping key as C1 to C3 (i.e., Bx could exploit the partitioning already done for Cx), you would need to ensure that B1 to B3 are co-located on the same machines as C1 to C3 in order to avoid a re-partitioning. Co-location can be achieved by providing a custom scheduler to Storm. See here for an example: https://xumingming.sinaapp.com/885/twitter-storm-how-to-develop-a-pluggable-scheduler/
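For reference, a custom scheduler is plugged in via Nimbus configuration; a minimal sketch, where the scheduler class name is hypothetical:

```yaml
# storm.yaml on the Nimbus node -- tell Storm to use a custom scheduler.
# com.example.CoLocationScheduler stands in for your own implementation of
# org.apache.storm.scheduler.IScheduler that pins each Bx next to its Cx.
storm.scheduler: "com.example.CoLocationScheduler"
```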

However, you cannot use fields-grouping to connect C1->B1 etc., because fields-grouping is agnostic to operator co-location and to the already partitioned data (it would just re-partition that data again). Instead, you would need to use direct- or custom-grouping to ensure that all data from C1 is sent to B1, etc.
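The routing decision of such a custom grouping can be sketched in isolation. In Storm it would live in a `CustomStreamGrouping.chooseTasks` implementation; the task-id lists below are hypothetical stand-ins for what the grouping's `prepare` method would receive:

```java
import java.util.*;

public class LocalAffinityGrouping {
    // Sketch of the chooseTasks logic for a custom grouping that keeps a
    // tuple on the downstream task co-located with the emitting task.
    // Assumes the custom scheduler placed the i-th source task and the
    // i-th target task on the same machine, in the same order.
    static int chooseTask(int sourceTaskId, List<Integer> sourceTasks,
                          List<Integer> targetTasks) {
        int index = sourceTasks.indexOf(sourceTaskId);
        return targetTasks.get(index % targetTasks.size());
    }

    public static void main(String[] args) {
        List<Integer> cTasks = Arrays.asList(4, 5, 6); // tasks of C1..C3 (hypothetical ids)
        List<Integer> bTasks = Arrays.asList(7, 8, 9); // tasks of B1..B3 (hypothetical ids)
        // Tuples emitted by C1 (task 4) stay with B1 (task 7), and so on.
        System.out.println(chooseTask(4, cTasks, bTasks)); // → 7
        System.out.println(chooseTask(6, cTasks, bTasks)); // → 9
    }
}
```

With this pairing, the already-partitioned data never leaves the machine, at the cost of the grouping having to know (or assume) the scheduler's placement.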

Matthias J. Sax
  • What I thought of is having each and every `(Px,Cx)` pair on all of the machines. In my case, as soon as `C1` emits its results it has the `accountid` in the result, so from then on I can do `fields grouping` on this `accountid`; all further producers and consumers in the topology from that point on would have an `accountid` in their outputs. So I do have an `accountid` emitted in all outputs but the first one, and was planning to have all `(Px,Cx)` on all hosts. Any chance such a thing would be possible with the `storm` implementation? – Jas Feb 01 '16 at 15:05
  • I cannot follow... What do you mean by "having each and every of the `(Px,Cx)` pair on all of the machines" ? I extended my answer (hope this covers the second part of your question -- if I understand you correctly) – Matthias J. Sax Feb 01 '16 at 15:21

Can `localOrShuffleGrouping` help? https://github.com/apache/storm/blob/a4f9f8bc5b4ca85de487a0a868e519ddcb94e852/storm-core/src/jvm/org/apache/storm/topology/TopologyBuilder.java#L360
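For what it's worth, `localOrShuffleGrouping` prefers target tasks in the same worker process and only shuffles among all tasks when there is no local one. A rough simulation of that choice rule (a sketch, not Storm's code) makes clear that it is not key-based, so a given key is not guaranteed to keep hitting the same task:

```java
import java.util.*;

public class LocalOrShuffleSim {
    // Sketch of localOrShuffleGrouping's choice rule: if the emitting worker
    // hosts target tasks, pick randomly among those; otherwise shuffle over
    // all target tasks. Note that the tuple's key plays no role here.
    static int choose(List<Integer> localTargetTasks, List<Integer> allTargetTasks,
                      Random rnd) {
        List<Integer> candidates =
            localTargetTasks.isEmpty() ? allTargetTasks : localTargetTasks;
        return candidates.get(rnd.nextInt(candidates.size()));
    }

    public static void main(String[] args) {
        Random rnd = new Random(42);
        List<Integer> local = Arrays.asList(3, 4);        // hypothetical in-worker tasks
        List<Integer> all = Arrays.asList(3, 4, 5, 6);    // all target tasks
        // With local tasks available, the choice always stays local...
        for (int i = 0; i < 5; i++) {
            System.out.println("chose local task " + choose(local, all, rnd));
        }
        // ...but which local task gets a given tuple is random, not per-key sticky.
    }
}
```

So it avoids network transfer when possible, but it does not give the per-account stickiness the question asks about.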

hobgoblin
  • Is it deterministic? I mean, for a certain key `account1`, would it be possible that one tuple reaches a `local bolt` and another reaches a `remote bolt`? Or, if it sent something to a `local bolt`, will it keep doing so, and for `account2`, if sent to a `remote bolt`, would it continue sending to that same `remote bolt`? Thanks. – Jas Feb 01 '16 at 17:15