I am trying to implement a real-time recommendation system using Google Cloud services. I've already built the engine using Kafka, Apache Storm, and Cassandra, but I want to create the same engine in Scala using Cloud Pub/Sub, Cloud Dataflow, and Cloud Bigtable.
So far with Cassandra, since I read and write multiple times during my Apache Storm bolt operation, I've implemented the following connector, MyDatabase.scala, which initiates a singleton connection to the database and uses that connection inside the bolt to read and update the User table with the streamed data coming from the Kafka spout. I used the Phantom Scala driver for Cassandra.
MyDatabase.scala
import scala.concurrent.Await
import scala.concurrent.duration._
import com.websudos.phantom.dsl._
object CustomConnector {
  // Single Cassandra connector shared by the whole application
  val hosts = Seq("localhost")
  val connector = ContactPoints(hosts).keySpace("my_keyspace")
}

class MyDatabase(val keyspace: KeySpaceDef) extends Database(keyspace) {
  object users extends Users with keyspace.Connector
}

object MyDatabase extends MyDatabase(CustomConnector.connector) {
  // Create the tables on startup
  Await.result(MyDatabase.autocreate.future(), 5.seconds)
}
Users.scala
import com.websudos.phantom.CassandraTable
import com.websudos.phantom.dsl._
import scala.concurrent.Future
case class User(
  id: String,
  items: Map[String, Int]
)

class UsersTable extends CassandraTable[Users, User] {

  object id extends StringColumn(this) with PartitionKey[String]
  object items extends MapColumn[String, Int](this)

  def fromRow(row: Row): User = {
    User(
      id(row),
      items(row)
    )
  }
}

abstract class Users extends UsersTable with RootConnector {

  def store(user: User): Future[ResultSet] = {
    insert.value(_.id, user.id).value(_.items, user.items)
      .consistencyLevel_=(ConsistencyLevel.ALL)
      .future()
  }

  def getById(id: String): Future[Option[User]] = {
    select.where(_.id eqs id).one()
  }
}
The Dataflow pipeline will look like this:
- Ingest streaming data from Pub/Sub.
- Apply the logic in a single ParDo that updates multiple tables in Bigtable with new values generated from the data ingested from Pub/Sub (a rough sketch follows this list).
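Roughly, the write-only part of what I have in mind would look something like this with Scio's Pub/Sub input and the scio-bigtable saveAsBigtable sink. The project, instance, table, and topic names as well as the event format are just placeholders, and I'm not sure this is the recommended approach:

RecommendationPipeline.scala
import com.google.bigtable.v2.Mutation
import com.google.protobuf.ByteString
import com.spotify.scio._
import com.spotify.scio.bigtable._

object RecommendationPipeline {
  def main(cmdlineArgs: Array[String]): Unit = {
    val (sc, args) = ContextAndArgs(cmdlineArgs)

    // Placeholder identifiers
    val projectId = "my-project"
    val instanceId = "my-bigtable-instance"
    val tableId = "users"

    sc.pubsubTopic[String]("projects/my-project/topics/user-events")
      .map { event =>
        // Placeholder event format: "userId,itemId,count"
        val Array(userId, itemId, count) = event.split(",")

        // One SetCell mutation per event, keyed by user id
        val mutation = Mutation.newBuilder()
          .setSetCell(
            Mutation.SetCell.newBuilder()
              .setFamilyName("items")
              .setColumnQualifier(ByteString.copyFromUtf8(itemId))
              .setValue(ByteString.copyFromUtf8(count))
              .setTimestampMicros(System.currentTimeMillis() * 1000L)
          )
          .build()

        (ByteString.copyFromUtf8(userId), Iterable(mutation))
      }
      .saveAsBigtable(projectId, instanceId, tableId)

    sc.run() // sc.close() on older Scio versions
  }
}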
Creating the connection to Cassandra is pretty straightforward when you work with the Phantom DSL. My question is whether there is any equivalent library to Phantom for Google Cloud Bigtable, or what the correct way is to implement this using the Google Cloud API and Scio (since I'll be implementing the Dataflow pipeline in Scala). I can't find any relevant example anywhere of establishing a connection to Bigtable and using that connection inside a Dataflow pipeline in Scala.
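To show what I mean by "use the connection inside the pipeline": this is the kind of per-worker lazy connection I would write with the Cloud Bigtable HBase client if there is no Phantom-style wrapper. The object, table, and column-family names are made up, and I don't know whether this read-then-write pattern inside a map/ParDo is actually the right way to do it, which is exactly what I'm asking:

BigtableUpdate.scala
import com.google.cloud.bigtable.hbase.BigtableConfiguration
import com.spotify.scio.values.SCollection
import org.apache.hadoop.hbase.TableName
import org.apache.hadoop.hbase.client.{Connection, Get, Put}
import org.apache.hadoop.hbase.util.Bytes

object BigtableUpdate {

  // One connection per worker JVM, created lazily on first use,
  // similar in spirit to the Phantom singleton connector above
  private lazy val connection: Connection =
    BigtableConfiguration.connect("my-project", "my-bigtable-instance")

  // Events are (userId, itemId, count); returns the new total per user/item
  def updateUsers(events: SCollection[(String, String, Long)]): SCollection[(String, Long)] =
    events.map { case (userId, itemId, count) =>
      val table = connection.getTable(TableName.valueOf("users"))
      try {
        // Read the current value for this user/item (may be absent)
        val row = table.get(new Get(Bytes.toBytes(userId)))
        val current =
          Option(row.getValue(Bytes.toBytes("items"), Bytes.toBytes(itemId)))
            .map(Bytes.toLong)
            .getOrElse(0L)

        // Write back the updated count
        val newCount = current + count
        val put = new Put(Bytes.toBytes(userId))
        put.addColumn(Bytes.toBytes("items"), Bytes.toBytes(itemId), Bytes.toBytes(newCount))
        table.put(put)

        (userId, newCount)
      } finally {
        table.close()
      }
    }
}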
Thanks