How is an export/import operation different when it happens within the same instance as opposed to when it happens between two different instances? Is there a performance difference between the two?
1 Answer
I suspect there may be confusion about what an instance is. A Streams instance is responsible for managing some set of hosts and jobs. Jobs are submitted to a Streams instance, and the instance deploys those jobs to some of the hosts that instance manages.
Import/Export only works between jobs in the same instance; jobs in separate instances have no knowledge of each other. So part of your question is not possible: there is no Import/Export between jobs in different instances.
In case you meant to ask about the performance difference between Import/Export in the same job versus in different jobs, there is none. However, there is rarely a good reason to use Import/Export inside the same job, as the purpose of Import/Export is to enable communication between jobs.

-
I meant to ask about the difference between the Export/Import operators and the TCPSource/TCPSink operators. TCPSource/TCPSink can be used to transfer data between different instances. My question is this: data can be transferred between two jobs in the same instance using Export/Import, while for jobs in different instances TCPSource/TCPSink has to be used. Are the two methods the same? Is one of them more resource-intensive than the other? – Ankit Sahay Mar 16 '18 at 08:58
-
The two methods are not the same. In fact, I would consider using TCPSource/TCPSink to allow communication between applications in _different instances_ to be an anti-pattern: if two Streams applications need to communicate, they should be in the same instance. Import/Export gives you rich publish-subscribe semantics, including filters. For TCPSource/TCPSink, if you use one of the text-based formats (`csv`, `txt` or `line`), then there will be more overhead for sending and receiving than with Import/Export or with TCPSource/TCPSink using the `bin` or `block` formats. – Scott S Mar 16 '18 at 20:35
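To illustrate why a text format costs more than a binary one, here is a minimal Python sketch. It is not Streams code and does not reproduce how Streams serializes tuples; the tuple shape (a 64-bit integer plus a double) and every function name are assumptions made up for the illustration. It only shows that a text round trip has to format and parse each field, while a fixed-layout binary round trip copies bytes.

```python
import struct
import time

# Hypothetical tuple layout for this sketch: (int64 seq, float64 value).
RECORD = struct.Struct("<qd")


def encode_csv(tuples):
    """Format every field as text, one tuple per line."""
    return "".join(f"{seq},{value!r}\n" for seq, value in tuples).encode("ascii")


def decode_csv(data):
    """Split each line and parse every field back from text."""
    decoded = []
    for line in data.decode("ascii").splitlines():
        seq, value = line.split(",")
        decoded.append((int(seq), float(value)))
    return decoded


def encode_bin(tuples):
    """Pack every tuple into a fixed 16-byte record."""
    return b"".join(RECORD.pack(seq, value) for seq, value in tuples)


def decode_bin(data):
    """Unpack consecutive fixed-size records."""
    return list(RECORD.iter_unpack(data))


def time_round_trip(encode, decode, tuples):
    """Return (elapsed seconds, decoded tuples) for one encode/decode pass."""
    start = time.perf_counter()
    decoded = decode(encode(tuples))
    return time.perf_counter() - start, decoded


if __name__ == "__main__":
    sample = [(i, i * 0.5) for i in range(100_000)]
    csv_time, _ = time_round_trip(encode_csv, decode_csv, sample)
    bin_time, _ = time_round_trip(encode_bin, decode_bin, sample)
    print(f"csv round trip: {csv_time:.4f} s")
    print(f"bin round trip: {bin_time:.4f} s")
```

The absolute numbers depend on the machine and say nothing about Streams itself; the point is only the relative cost of parsing text versus copying fixed-size records.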
-
That's what I was suspecting: a greater overhead. Is there a way to measure or quantify the overhead, or to measure the performance difference between these two methods? The whole reason for doing this is to use the different instances equally rather than deploying most of the jobs in one instance, but in the future there may be a need to communicate between the instances. – Ankit Sahay Mar 16 '18 at 20:43
-
The simplest way to measure the performance difference is to write simple applications which use Import/Export and simple applications which use TCPSource/TCPSink and measure their respective throughputs. But, again, let me restate: I think you are about to apply an anti-pattern. Instances are intended to have multiple jobs. Why do you feel the need to "use different instances equally"? I'm worried you have confused _instances_ with _hosts_. One instance can have many hosts. The only reason to have different jobs in different instances is to keep them _isolated_, often for security reasons. – Scott S Mar 17 '18 at 00:48
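The measurement itself can be as simple as "send a known number of tuples, time it, divide". The following Python sketch shows that shape using a local socket pair as a stand-in transport. It is an assumption-laden illustration, not a Streams benchmark: `measure_throughput`, the payload, and the counts are all hypothetical, and a real comparison would run actual Import/Export and TCPSource/TCPSink applications on your instance.

```python
import socket
import threading
import time


def measure_throughput(payload: bytes, count: int) -> float:
    """Send `count` copies of `payload` through a local socket pair.

    Returns the observed rate in payloads per second.
    """
    sender, receiver = socket.socketpair()
    expected = len(payload) * count
    received = 0

    def drain():
        # Read until every expected byte has arrived or the peer closes.
        nonlocal received
        while received < expected:
            chunk = receiver.recv(65536)
            if not chunk:
                break
            received += len(chunk)

    reader = threading.Thread(target=drain)
    start = time.perf_counter()
    reader.start()
    for _ in range(count):
        sender.sendall(payload)
    reader.join()
    elapsed = time.perf_counter() - start

    sender.close()
    receiver.close()
    if received != expected:
        raise RuntimeError("receiver did not get every byte that was sent")
    return count / elapsed


if __name__ == "__main__":
    rate = measure_throughput(b"x" * 16, 200_000)
    print(f"{rate:,.0f} payloads/second")
```

Run each variant several times and compare the rates; a single run is too noisy to draw conclusions from.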
-
I have 3 instances. One instance has close to 100 jobs running, while the 3rd instance has only 1 job running. Isn't this an unequal use of resources? Won't there be more load on instance 1? – Ankit Sahay Mar 17 '18 at 04:59
-
Yes, but probably not how you're thinking of it. An instance is a set of _management services_ and _application hosts_. Yes, the management services will have more work to do as the number of jobs and job-related activity increases on the instance. But it's not necessarily more load on the application hosts. It's up to you to assign application hosts to your instances; if you give an instance many hosts such that it can accommodate many running jobs, then you should be okay. – Scott S Mar 18 '18 at 20:03
-
Thank you, Scott, for your answers. Do you think removing the 3rd instance and using its resources to add application hosts to instance 1 would be a better idea? – Ankit Sahay Mar 18 '18 at 20:06
-
Yes, I think that is the best direction. The only reason to _not_ do it this way is if the amount of job-related activity is so high that your management services are overloaded. But I doubt that is the case, and I would only start having separate instances after you have proved your management services are overloaded. The simpler solution (one instance with many hosts for all of your applications) should be what you try first. – Scott S Mar 18 '18 at 21:46
-
Is there a way to find out the load on the management services? – Ankit Sahay Mar 18 '18 at 21:49