
I am not a native English speaker, but I will try to express my question as clearly as possible. I have run into a problem that has confused me for two days, and I still can't find the solution.

I have built a stream that runs in Spring Cloud Data Flow on Hadoop YARN.

The stream is composed of an HTTP source, a processor, and a file sink.

1. HTTP source
The HTTP source component has two output channels bound to two different destinations, dest1 and dest2, defined in application.properties:

spring.cloud.stream.bindings.output.destination=dest1
spring.cloud.stream.bindings.output2.destination=dest2

Below is the code snippet for the HTTP source, for your reference.

    @Autowired
    private EssSource channels; // EssSource is the interface for the multiple output channels

    // Output channel 1:
    @RequestMapping(path = "/file", method = POST, consumes = {"text/*", "application/json"})
    @ResponseStatus(HttpStatus.ACCEPTED)
    public void handleRequest(@RequestBody byte[] body, @RequestHeader(HttpHeaders.CONTENT_TYPE) Object contentType) {
        logger.info("enter ... handleRequest1 ...");
        channels.output().send(MessageBuilder.createMessage(body,
                new MessageHeaders(Collections.singletonMap(MessageHeaders.CONTENT_TYPE, contentType))));
    }

    // Output channel 2:
    @RequestMapping(path = "/test", method = POST, consumes = {"text/*", "application/json"})
    @ResponseStatus(HttpStatus.ACCEPTED)
    public void handleRequest2(@RequestBody byte[] body, @RequestHeader(HttpHeaders.CONTENT_TYPE) Object contentType) {
        logger.info("enter ... handleRequest2 ...");
        channels.output2().send(MessageBuilder.createMessage(body,
                new MessageHeaders(Collections.singletonMap(MessageHeaders.CONTENT_TYPE, contentType))));
    }
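
For completeness, EssSource is a custom binding interface that looks roughly like this (a minimal sketch; the channel names match the binding properties above):

    import org.springframework.cloud.stream.annotation.Output;
    import org.springframework.messaging.MessageChannel;

    public interface EssSource {

        String OUTPUT = "output";    // bound to dest1 in application.properties
        String OUTPUT_2 = "output2"; // bound to dest2 in application.properties

        @Output(OUTPUT)
        MessageChannel output();

        @Output(OUTPUT_2)
        MessageChannel output2();
    }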

2. Processor
The processor has two input channels and two output channels, bound to different destinations. The destination bindings are defined in application.properties in the processor component project:

# input channel bindings
spring.cloud.stream.bindings.input.destination=dest1
spring.cloud.stream.bindings.input2.destination=dest2

# output channel bindings
spring.cloud.stream.bindings.output.destination=hdfsSink
spring.cloud.stream.bindings.output2.destination=fileSink

Below is the code snippet for the processor.

    @Transformer(inputChannel = EssProcessor.INPUT, outputChannel = EssProcessor.OUTPUT)
    public Object transform(Message<?> message) {
        logger.info("enter ... transform ...");
        return "processed by transform1";
    }

    @Transformer(inputChannel = EssProcessor.INPUT_2, outputChannel = EssProcessor.OUTPUT_2)
    public Object transform2(Message<?> message) {
        logger.info("enter ... transform2 ...");
        return "processed by transform2";
    }
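
EssProcessor is a similar custom binding interface, with two inputs and two outputs (again a minimal sketch; the constants match the bindings above):

    import org.springframework.cloud.stream.annotation.Input;
    import org.springframework.cloud.stream.annotation.Output;
    import org.springframework.messaging.MessageChannel;
    import org.springframework.messaging.SubscribableChannel;

    public interface EssProcessor {

        String INPUT = "input";      // bound to dest1
        String INPUT_2 = "input2";   // bound to dest2
        String OUTPUT = "output";    // bound to hdfsSink
        String OUTPUT_2 = "output2"; // bound to fileSink

        @Input(INPUT)
        SubscribableChannel input();

        @Input(INPUT_2)
        SubscribableChannel input2();

        @Output(OUTPUT)
        MessageChannel output();

        @Output(OUTPUT_2)
        MessageChannel output2();
    }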

3. File sink

I use the official file sink component from Spring: maven://org.springframework.cloud.stream.app:file-sink-kafka:1.0.0.BUILD-SNAPSHOT

I just add the destination binding in its application.properties file:

spring.cloud.stream.bindings.input.destination=fileSink

4. Findings:

The data flow I expected looks like this:

Source.handleRequest() --> Processor.transform()

Source.handleRequest2() --> Processor.transform2() --> Sink.fileWritingMessageHandler()

Only the string "processed by transform2" should be saved to the file.

But in my testing, the actual data flow is like this:

Source.handleRequest() --> Processor.transform() --> Sink.fileWritingMessageHandler()

Source.handleRequest2() --> Processor.transform2() --> Sink.fileWritingMessageHandler()

Both the "processed by transform1" and "processed by transform2" strings are saved to the file.

5. Question:

Although the output channel in Processor.transform() is bound to the destination hdfsSink instead of fileSink, the data still flows to the file sink. I can't understand this, and it is not what I want. I want only the data from Processor.transform2() to flow to the file sink, not both. If I am not doing this right, could anyone tell me how to do it and what the solution is? It has confused me for two days.

Thank you for your kind help.

Alex

1 Answer


Is your stream definition something like this (where the '-2' versions are the ones with multiple channels)?

http-source-2 | processor-2 | file-sink

Note that Spring Cloud Data Flow will override the destinations defined in application.properties, which is why, even though spring.cloud.stream.bindings.output.destination for the processor is set to hdfsSink, it will actually be matched to the input of file-sink.
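
To illustrate, for a stream deployed as above under the name mystream, Data Flow effectively injects bindings along these lines (the generated destination names here are just a sketch of the mechanism and may differ in detail):

    # http-source-2
    spring.cloud.stream.bindings.output.destination=mystream.http-source-2

    # processor-2
    spring.cloud.stream.bindings.input.destination=mystream.http-source-2
    spring.cloud.stream.bindings.output.destination=mystream.processor-2

    # file-sink
    spring.cloud.stream.bindings.input.destination=mystream.processor-2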

The way destinations are configured from a stream definition is explained here (in the context of taps): http://docs.spring.io/spring-cloud-dataflow/docs/current/reference/htmlsingle/#spring-cloud-dataflow-stream-tap-dsl

What you can do is simply swap the meanings of channels 1 and 2 - use the side channel for HDFS. This is a bit brittle, though - the primary input/output channels of the stream will be configured automatically, while the other channels are configured via application.properties - so in this case it may be better to configure the side-channel destinations via the stream definition or at deployment time - see http://docs.spring.io/spring-cloud-dataflow/docs/current/reference/htmlsingle/#_application_properties.
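
For example, from the Data Flow shell, a deployment-time override could look roughly like this (the stream name and app label are placeholders, and the app.<label>. property prefix may vary with your SCDF version):

    stream deploy mystream --properties "app.processor-2.spring.cloud.stream.bindings.output2.destination=hdfsSink"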

It seems to me that these could just as well be two streams listening on separate endpoints, using regular components, given that the data is supposed to flow side by side.
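
For example, using the out-of-the-box http, transform, file, and hdfs apps, something along these lines (ports and options are illustrative):

    stream create file-flow --definition "http --server.port=8000 | transform | file" --deploy
    stream create hdfs-flow --definition "http --server.port=8001 | transform | hdfs" --deploy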

Marius Bogoevici
  • Hi Marius, thanks for the answer. It works now. Thank you so much; it really helps. – Alex Xu Sep 07 '16 at 11:17
  • By the way, when I did some testing on YARN this afternoon, I found that SCDF on YARN seems to cache the apps not only in the /dataflow/artifacts/cache folder but also elsewhere, perhaps in the database; I am not sure whether it caches the apps in the DB as well. Even after I destroyed the stream, unregistered the apps, and deleted all the files in the /dataflow/artifacts/cache folder, SCDF on YARN still used the cached apps to deploy. In the end I had no choice but to restart the SCDF server, and then it worked. Do you know how to clear the cached data completely without restarting the server? Thanks. – Alex Xu Sep 07 '16 at 11:19
  • This [section in the docs](http://docs.spring.io/spring-cloud-dataflow-server-yarn/docs/1.0.1.RELEASE/reference/htmlsingle/#yarn-how-it-works) could be useful. Caching is done at `/dataflow/artifacts/cache` as well as in the `hdfs` directory, so whenever you clear the cache, you may have to restart the server, too. Please feel free to share your feedback and/or improvements to this workflow. – Sabby Anandan Sep 07 '16 at 14:04
  • Thanks, Sabby, for the feedback. Apart from the /dataflow/artifacts/cache directory where caching is done, do you know where the hdfs directory is? I don't want to restart the server to clear the cache, because if we do, all the streams have to be redeployed manually even if we only update one app in one stream. – Alex Xu Sep 08 '16 at 02:08
  • Have you tried registering that "one app" with the `--force` option? This should override the existing registration so that the latest version is used. Also, the original topic of this thread has already been addressed, and this is a different discussion altogether - perhaps you could consider closing this and opening a new thread if you're still having issues. – Sabby Anandan Sep 08 '16 at 14:10
  • OK, I will try it, and if needed I will open a new thread. I will close this thread. Thanks, Sabby. – Alex Xu Sep 10 '16 at 14:57