
I'm trying to save tweets from Twitter to a MongoDB database.

I have an RDD<Status> and I'm trying to convert it to JSON format with the help of ObjectMapper, but there is some problem with this transformation.

public class Main {


    //set system credentials for access to twitter
    private static void setTwitterOAuth() {
        System.setProperty("twitter4j.oauth.consumerKey", TwitterCredentials.consumerKey);
        System.setProperty("twitter4j.oauth.consumerSecret", TwitterCredentials.consumerSecret);
        System.setProperty("twitter4j.oauth.accessToken", TwitterCredentials.accessToken);
        System.setProperty("twitter4j.oauth.accessTokenSecret", TwitterCredentials.accessTokenSecret);
    }


    public static void main(String [] args) {

        setTwitterOAuth();

        SparkConf conf = new SparkConf().setMaster("local[2]")
                                        .setAppName("SparkTwitter");
        JavaSparkContext sparkContext = new JavaSparkContext(conf);
        JavaStreamingContext jssc = new JavaStreamingContext(sparkContext, new Duration(1000));
        JavaReceiverInputDStream<Status> twitterStream = TwitterUtils.createStream(jssc);

        //Stream that contains just tweets in english
        JavaDStream<Status> enTweetsDStream = twitterStream.filter(status -> "en".equalsIgnoreCase(status.getLang()));
        enTweetsDStream.persist(StorageLevel.MEMORY_AND_DISK());


        enTweetsDStream.print();
        jssc.start();
        jssc.awaitTermination();
    }

    static void saveRawTweetsToMondoDb(JavaRDD<Status> rdd, JavaSparkContext sparkContext) {
        try {
            ObjectMapper objectMapper = new ObjectMapper();
            SQLContext sqlContext = new SQLContext(sparkContext);
            JavaRDD<String> tweet =  rdd.map(status -> objectMapper.writeValueAsString(status));

            DataFrame dataFrame = sqlContext.read().json(tweet);

            Map<String, String> writeOverrides = new HashMap<>();
            writeOverrides.put("uri", "mongodb://127.0.0.1/forensicdb.LiveRawTweets");
            WriteConfig writeConfig = WriteConfig.create(sparkContext).withJavaOptions(writeOverrides);
            MongoSpark.write(dataFrame).option("collection", "LiveRawTweets").mode("append").save();

        } catch (Exception e) {
            System.out.println("Error saving to database");
        }
    }

JavaRDD<String> tweet =  rdd.map(status -> objectMapper.writeValueAsString(status));

Here is the problem: Incompatible types: required JavaRDD<String> but map was inferred to JavaRDD<R>.


1 Answer

Java type inference isn't always super smart, unfortunately, so what I do in these cases is extract all the bits of my lambda into variables until I find one that Java can't give an accurate type for. I then give the expression the type I think it should have and see why Java is complaining about it. Sometimes it will just be a limitation in the compiler and you'll have to explicitly "cast" the expression to the desired type; other times you'll find an issue with your code. In your case the code looks fine to me, so there must be something else.
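
For instance, a minimal sketch of that extraction applied to your map call, assuming Spark's Java API where Function is org.apache.spark.api.java.function.Function (its call method declares throws Exception, so Jackson's checked JsonProcessingException is allowed):

    // Pull the lambda out into an explicitly typed variable so the
    // compiler has to check its inferred type against this declaration.
    Function<Status, String> toJsonString = status -> objectMapper.writeValueAsString(status);
    JavaRDD<String> tweet = rdd.map(toJsonString);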

I have a comment however: here you are paying the cost of JSON serialization once (from Status to JSON string) and then deserialization (from JSON string to Row). Plus, you're not providing any schema to your Dataset so it will have to make two passes of the data (or a sample of it depending on your config) to infer the schema. All that can be quite expensive if the data is large. I would advise you to write a conversion from Status to Row directly if performance is a concern and if Status is relatively simple.
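
A minimal sketch of such a direct conversion, assuming the Spark 1.6 SQLContext/DataFrame API and an illustrative subset of twitter4j.Status fields (the schema below is my own pick, not an exhaustive mapping; RowFactory is org.apache.spark.sql.RowFactory and DataTypes is org.apache.spark.sql.types.DataTypes):

    // Declare the schema up front so Spark skips the inference pass.
    StructType schema = DataTypes.createStructType(Arrays.asList(
            DataTypes.createStructField("id", DataTypes.LongType, false),
            DataTypes.createStructField("text", DataTypes.StringType, true),
            DataTypes.createStructField("lang", DataTypes.StringType, true),
            DataTypes.createStructField("screenName", DataTypes.StringType, true),
            DataTypes.createStructField("createdAt", DataTypes.TimestampType, true)));

    // Map each Status straight to a Row: no JSON serialization round-trip.
    JavaRDD<Row> rows = rdd.map(status -> RowFactory.create(
            status.getId(),
            status.getText(),
            status.getLang(),
            status.getUser().getScreenName(),
            new java.sql.Timestamp(status.getCreatedAt().getTime())));

    DataFrame dataFrame = sqlContext.createDataFrame(rows, schema);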

Another "by the way": you are implicitly serializing your ObjectMapper, chances are you don't want to do that. It seems like the class does support Java serialization, but with special logic. Since the default config for Spark is to use Kryo (which has much better performance than Java serialization), I doubt it will do the right thing when using the default FieldSerializer. You have three options:

  • make the object mapper static to avoid serializing it
  • configure your Kryo registrator to serialize/deserialize objects of type ObjectMapper with Java serialization. That would work but not worth the effort.
  • use Java serialization everywhere instead of Kryo. Bad idea! It's slow and uses a lot of space (memory and disk depending on where the serialized objects will be written).
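
A minimal sketch of the first option, using a hypothetical JsonMapperHolder class: the mapper lives in a static field, so it is created once per JVM instead of being captured and serialized with the closure:

    public class JsonMapperHolder {
        // One mapper per JVM; never captured by a Spark closure.
        private static final ObjectMapper MAPPER = new ObjectMapper();

        static String toJson(Status status) throws JsonProcessingException {
            return MAPPER.writeValueAsString(status);
        }
    }

    // The method reference carries no instance state, so nothing extra
    // is serialized with the task.
    JavaRDD<String> tweet = rdd.map(JsonMapperHolder::toJson);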
Dici
  • "I would advise you to write a conversion from Status to Row directly if performance is a concern and if Status is relatively simple."Could you please provide example)? It wasn't clear for me)Thanks) – Alexander Romanov Aug 17 '19 at 17:28
  • @AlexanderRomanov with the map function (https://spark.apache.org/docs/2.3.0/api/java/index.html?org/apache/spark/sql/Dataset.html), you could convert a `Status` into a `Row`. Actually, re-reading your code, you might be using an old version of Spark that didn't have this feature, since the `DataFrame` class you're using was removed in later versions. – Dici Aug 17 '19 at 17:33
  • Otherwise, did you try extracting the lambda out to see what the compiler says? – Dici Aug 17 '19 at 17:34
  • What do you mean by extracting the lambda? I'm using Spark version 1.6.2. – Alexander Romanov Aug 17 '19 at 17:37
  • You may want to upgrade your Spark distribution if you can, 1.6 is pretty old now. I mean extracting the lambda function in a variable instead of having it directly in the `map` statement – Dici Aug 17 '19 at 17:39
  • I'm using this Spark version because, according to https://stackoverflow.com/questions/38714256/spark-2-0-0-twitter-streaming-driver-is-no-longer-available, I can't read data from Twitter in newer versions. My apologies, I again can't understand you (extracting the lambda function into a variable instead of having it directly in the map statement); could you please provide an example? – Alexander Romanov Aug 17 '19 at 17:44
  • `status -> objectMapper.writeValueAsString(status)` this is a lambda function, I'm just suggesting to store it in a variable with the type you expect it to be and see if the compiler complains about it or not. – Dici Aug 17 '19 at 17:51
  • The StackOverflow post you linked also has a solution to fix the issue. You just have to add a custom dependency to your project instead of having the Twitter classes being distributed in the main Spark distribution. This is because Spark 2.0 has moved away from distributing a single jar to instead distributing all of its dependencies, and its own modules, in separate jars so that people can better handle classpath conflicts and also pick what they need or not. – Dici Aug 17 '19 at 17:54
  • I used to use this dependency from Bahir, but it didn't work. – Alexander Romanov Aug 17 '19 at 17:59
  • According to this guide, https://www.baeldung.com/jackson-object-mapper-tutorial, objectMapper.writeValueAsString generates JSON from a Java object and returns the generated JSON as a string or as a byte array. – Alexander Romanov Aug 17 '19 at 18:03
  • Yeah I know, but just try extracting it out as I explained and see what happens. I haven't tried to compile your code but visually it looks ok. – Dici Aug 17 '19 at 18:04
  • Everything ok) I just ran the code to check writeValueAsString() and I get a result like this: {"rateLimitStatus":null,"accessLevel":0,"createdAt":1566068331000,"id":11628":1004407330107248640,"favoriteCount":0,"inReplyToScreenName":"Piklefn","geoLocation":null,"place":null,"retweetCount":0,"lang":"en","retweetedStatus":null,"userMentionEntities":[{"start":0,"end":8,"name":"Pikle Pinned","screenName":"Piklefn","id":1004407330107248640,"text":"Piklefn"}],"hashtagEntities":[],"mediaEntities":[],"extendedMediaEntities":... – Alexander Romanov Aug 17 '19 at 19:00
  • Your question made it look like you had a compile error, but now your code is running? What's wrong with the result you had? Seems ok to me – Dici Aug 17 '19 at 19:58
  • No, everything is ok with ObjectMapper(), it is working). As you recommended, I extracted the ObjectMapper() call out and provided the result of its execution in the previous comment: String status = objectMapper.writeValueAsString(rdd.first()) – Alexander Romanov Aug 17 '19 at 21:06
  • But the error with this line stayed: JavaRDD<String> tweet = rdd.map(status -> objectMapper.writeValueAsString(status)). As you recommended, I will try the conversion from Status to Row directly tomorrow. But I really can't understand why this code doesn't work. – Alexander Romanov Aug 17 '19 at 21:10
  • I don't think this is what I meant. I suggested to write this: `Function<Status, String> toJsonString = status -> objectMapper.writeValueAsString(status); JavaRDD<String> tweet = rdd.map(toJsonString);` and see what the compiler tells you. – Dici Aug 17 '19 at 21:11
  • Function<Status, String> toJsonString = status -> objectMapper.writeValueAsString(status); JavaRDD<String> tweet = (JavaRDD<String>) rdd.map(toJsonString); Everything is ok) The compiler allows this expression, thanks) – Alexander Romanov Aug 18 '19 at 20:22
  • Cool, I don't understand what the problem was then, probably just the compiler being a bit stupid! – Dici Aug 18 '19 at 20:43
  • Thank you again, that was the first time in my career when the problem was in the compiler and not in me) – Alexander Romanov Aug 18 '19 at 21:06
  • Hi again) Could you please help me once more? The question is related to the previous one but a little different) Could you please explain in more detail what this means: "Plus, you're not providing any schema to your Dataset so it will have to make two passes of the data (or a sample of it depending on your config) to infer the schema. All that can be quite expensive if the data is large. I would advise you to write a conversion from Status to Row directly if performance is a concern and if Status is relatively simple." It wasn't clear to me) However, I'm trying to save Status to MongoDB without defining a schema. – Alexander Romanov Sep 11 '19 at 21:36
  • Is it possible to do something like that in Java? MongoSpark.save(rawTweetsDF.coalesce(1).write.format("org.apache.spark.sql.json").option("forensicdb", "LiveRawTweets").mode("append"), writeConfig) – Alexander Romanov Sep 11 '19 at 21:39
  • Please take a look at my post if you are interested) https://stackoverflow.com/questions/57649731/cant-save-dataframe-to-mongodb?noredirect=1#comment101750365_57649731 – Alexander Romanov Sep 11 '19 at 21:42
  • Hello, just saw the comments. I'll have a look at the question. – Dici Sep 12 '19 at 18:55
  • Sure, here or in this topic https://stackoverflow.com/questions/57649731/cant-save-dataframe-to-mongodb?noredirect=1#comment101750365_57649731? – Alexander Romanov Sep 12 '19 at 19:50