
I am trying to implement a real-time recommendation system using Google Cloud services. I've already built the engine using Kafka, Apache Storm and Cassandra, but now I want to build the same engine in Scala using Cloud Pub/Sub, Cloud Dataflow and Cloud Bigtable.

So far with Cassandra, since I read and write multiple times during my Apache Storm bolt operation, I've implemented the following connector, MyDatabase.scala, which initiates a singleton connection to the database. This connection is used inside the bolt to read and update the User table with the streamed data that comes from the Kafka spout. I used the Phantom Scala driver for Cassandra.

MyDatabase.scala

import scala.concurrent.Await
import scala.concurrent.duration._
import com.websudos.phantom.dsl._


object CustomConnector {

  // Contact point(s) for the Cassandra cluster; a single shared connector
  // is created here and reused by the whole application.
  val hosts = Seq("localhost")

  val connector = ContactPoints(hosts).keySpace("my_keyspace")

}

class MyDatabase(val keyspace: KeySpaceDef) extends Database(keyspace) {
  object users extends Users with keyspace.Connector
}

object MyDatabase extends MyDatabase(CustomConnector.connector) {
  Await.result(MyDatabase.autocreate.future(), 5.seconds)
}

Users.scala

import com.websudos.phantom.CassandraTable
import com.websudos.phantom.dsl._

import scala.concurrent.Future

case class User(
                 id: String,
                 items: Map[String, Int]
               )

class UsersTable extends CassandraTable[Users, User] {

  object id extends StringColumn(this) with PartitionKey[String]
  object items extends MapColumn[String, Int](this)

  def fromRow(row: Row): User = {
    User(
      id(row),
      items(row)
    )
  }
}

abstract class Users extends UsersTable with RootConnector {

  def store(user: User): Future[ResultSet] = {
    insert.value(_.id, user.id).value(_.items, user.items)
      .consistencyLevel_=(ConsistencyLevel.ALL)
      .future()
  }

  def getById(id: String): Future[Option[User]] = {
    select.where(_.id eqs id).one()
  }
}

The Dataflow pipeline will be like this:

  1. Ingest streaming data from Pub/Sub.
  2. Implement the logic in a single ParDo, where we update multiple tables in Bigtable with new values generated from the data ingested from Pub/Sub (a rough skeleton of this is sketched below).
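
Roughly, I imagine the Scio skeleton looking like the sketch below (assuming sc.pubsubSubscription is available in the Scio version I'm using; the Bigtable update step is exactly the placeholder I don't know how to fill):

import com.spotify.scio._

object RecommendationPipeline {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    // 1. Ingest streaming data from Pub/Sub.
    val events = sc.pubsubSubscription[String](args("subscription"))

    // 2. Single step that should read from and update multiple Bigtable
    //    tables for every incoming event -- this is the part in question.
    events.map { event =>
      // update Bigtable with values derived from `event`
      event
    }

    sc.close()
  }
}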

Creating the connection with Cassandra is pretty straightforward when you are working with the Phantom DSL. My question is whether there is any equivalent library to Phantom for Google Cloud Bigtable, or what the correct way is to implement this using the Google Cloud API and Scio (since I'll be implementing the Dataflow pipeline in Scala). I can't seem to find any relevant example of establishing a connection with Bigtable and using that connection inside a Dataflow pipeline in Scala.

Thanks

billiout
  • Scio 0.4.7 provides a built-in `BigtableDoFn` abstract class which you can subclass to gain access to a `BigtableSession` object for async lookups to Bigtable. I haven't confirmed whether you can actually use this for writing back to Bigtable. Ideally, you would use the `.saveAsBigtable()` method of the `BigtableSCollection` class for writing to Bigtable so that you can automatically retry on error. Internally, this just uses Apache Beam's `BigtableIO` class. Reference: http://spotify.github.io/scio/api/com/spotify/scio/bigtable/BigtableDoFn.html. – Andrew Nguonly Feb 06 '18 at 05:38
  • @Andrew thank you very much. BigtableDoFn is the solution to my question. You can read and write with it as well. – billiout Feb 13 '18 at 19:05
  • Btw, do you happen to know if Scio has a similar DoFn function to have a single connection with Datastore and asynchronously read/write from it? – billiout Feb 13 '18 at 19:07
  • As of Scio 0.4.7, there is no built-in interface similar to `BigtableDoFn` for Cloud Datastore. Before the release of 0.4.7, I would usually just subclass `ScalaAsyncDoFn` or `DoFnWithResource` and set the DoFn's resource to some connection object. I imagine you could achieve something similar for Datastore. Reference: http://spotify.github.io/scio/api/com/spotify/scio/transforms/ScalaAsyncDoFn.html. – Andrew Nguonly Feb 13 '18 at 20:55

1 Answer


The Beam way to share a database connection across the many elements processed by a DoFn is to use the @Setup and @Teardown methods. See the source code of the Beam Cassandra connector for an example.
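
As an illustration only, here is a minimal Scala sketch of that pattern for Bigtable. It assumes the BigtableSession client from the bigtable-client-core library (the exact option-builder calls may differ between client versions), and the per-element read/update logic is left as a placeholder:

import org.apache.beam.sdk.transforms.DoFn
import org.apache.beam.sdk.transforms.DoFn.{ProcessElement, Setup, Teardown}
import com.google.cloud.bigtable.config.BigtableOptions
import com.google.cloud.bigtable.grpc.BigtableSession

// Sketch only: one Bigtable connection per DoFn instance, opened in @Setup
// and closed in @Teardown. The element-level logic is a placeholder.
class BigtableUpdateFn(projectId: String, instanceId: String) extends DoFn[String, String] {

  @transient private var session: BigtableSession = _

  @Setup
  def setup(): Unit = {
    // Assumption: BigtableOptions.Builder / BigtableSession come from the
    // bigtable-client-core artifact; adjust to the client version you use.
    val options = new BigtableOptions.Builder()
      .setProjectId(projectId)
      .setInstanceId(instanceId)
      .build()
    session = new BigtableSession(options)
  }

  @ProcessElement
  def processElement(c: DoFn[String, String]#ProcessContext): Unit = {
    val event = c.element()
    // Use session.getDataClient() here to read the current row and write
    // back the updated values; the concrete request building is omitted.
    c.output(event)
  }

  @Teardown
  def teardown(): Unit = {
    if (session != null) session.close()
  }
}

In a Scio pipeline such a DoFn can be applied with applyTransform(ParDo.of(new BigtableUpdateFn(...))), although the BigtableDoFn and .saveAsBigtable() helpers mentioned in the comments above are usually the more convenient route.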

jkff