We are using the Camel Google BigQuery component (Camel version 2.20) to stream records from a message queue on an ActiveMQ server (version 5.14.3) into a Google BigQuery table.
We have implemented and deployed the streaming mechanism as an XML route definition in a Spring application running on our primary site, and it seems to work well.
<?xml version="1.0" encoding="UTF-8"?>
<beans
    xmlns="http://www.springframework.org/schema/beans"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="
        http://www.springframework.org/schema/beans
        ./spring-beans.xsd
        http://camel.apache.org/schema/spring
        ./camel-spring.xsd">

    <!--
    # ==========================================================================
    # ActiveMQ JMS Bean Definition
    # ==========================================================================
    -->
    <bean id="jms" class="org.apache.camel.component.jms.JmsComponent">
        <property name="connectionFactory">
            <bean class="org.apache.activemq.ActiveMQConnectionFactory">
                <property name="brokerURL" value="nio://192.168.10.10:61616?jms.useAsyncSend=true" />
                <property name="userName" value="MyAmqUserName" />
                <property name="password" value="MyAmqPassword" />
            </bean>
        </property>
    </bean>

    <!--
    # ==========================================================================
    # GoogleBigQueryComponent
    # https://github.com/apache/camel/tree/master/components/camel-google-bigquery
    # ==========================================================================
    -->
    <bean id="gcp" class="org.apache.camel.component.google.bigquery.GoogleBigQueryComponent">
        <property name="connectionFactory">
            <bean class="org.apache.camel.component.google.bigquery.GoogleBigQueryConnectionFactory">
                <property name="credentialsFileLocation" value="MyDir/MyGcpKeyFile.json" />
            </bean>
        </property>
    </bean>

    <!--
    # ==========================================================================
    # Main Context Bean Definition
    # ==========================================================================
    -->
    <camelContext id="camelContext" xmlns="http://camel.apache.org/schema/spring">
        <!--
        # ==================================================================
        # Message Route:
        # 1. consume messages from my AMQ queue
        # 2. set the InsertId / INSERT_ID (it is not clear which is the correct one)
        # 3. write message to Google BigQuery table
        # see https://github.com/apache/camel/blob/master/components/camel-google-bigquery/src/main/docs/google-bigquery-component.adoc
        # ==================================================================
        <log message="${headers} | ${body}" />
        -->
        <route>
            <!-- note: "&" in an endpoint URI must be escaped as "&amp;" in XML -->
            <from uri="jms:my.amq.queue.of.output.data.for.gcp?acknowledgementModeName=DUPS_OK_ACKNOWLEDGE&amp;concurrentConsumers=20" />
            <setHeader headerName="CamelGoogleBigQuery.InsertId">
                <simple>${header.KeyValuePreviouslyGenerated}</simple>
            </setHeader>
            <setHeader headerName="GoogleBigQueryConstants.INSERT_ID">
                <simple>${header.KeyValuePreviouslyGenerated}</simple>
            </setHeader>
            <to uri="gcp:my_gcp_project:my_bq_data_set:my_bq_table" />
        </route>
    </camelContext>
</beans>
For high(er) availability, we have now deployed the same implementation to our backup site, streaming to the same destination BigQuery table. With the same records streaming from two sites into the same table, there are, as expected, duplicate records. To eliminate the duplication, we are trying to follow the guidance given here:
https://camel.apache.org/staging/components/latest/google-bigquery-component.html
The Message Headers section advises setting a message header called CamelGoogleBigQuery.InsertId to a suitable run-time key value.
However, lower down on the same page, the Ensuring Data Consistency section advises setting GoogleBigQueryConstants.INSERT_ID instead.
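Our reading of the documentation is that these may be two names for the same thing: GoogleBigQueryConstants.INSERT_ID appears to be a Java constant whose value is the literal header name (CamelGoogleBigQueryInsertId, per the component docs - worth verifying for your Camel version). If that is right, then in the XML DSL the headerName attribute is taken as a literal string, so writing headerName="GoogleBigQueryConstants.INSERT_ID" would create a header literally named that, which the component would ignore. In the Java DSL the constant can be referenced directly. A sketch of what we believe the equivalent Java route would look like (route and endpoint names are our own; this is a configuration sketch, not tested against a broker):

```java
import org.apache.camel.builder.RouteBuilder;
import org.apache.camel.component.google.bigquery.GoogleBigQueryConstants;

public class BigQueryStreamingRoute extends RouteBuilder {
    @Override
    public void configure() {
        from("jms:my.amq.queue.of.output.data.for.gcp"
                + "?acknowledgementModeName=DUPS_OK_ACKNOWLEDGE&concurrentConsumers=20")
            // reference the constant itself, so the resolved header name is used
            .setHeader(GoogleBigQueryConstants.INSERT_ID,
                       simple("${header.KeyValuePreviouslyGenerated}"))
            .to("gcp:my_gcp_project:my_bq_data_set:my_bq_table");
    }
}
```

If this reading is correct, the XML equivalent would need the resolved literal string in headerName rather than the constant's qualified name.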
We have checked that our primary and backup servers are running in the same time zone (UTC), and that we are generating what we believe to be suitable run-time unique keys: a string containing a UNIX time to the nearest second.
Our code sample above shows that we have tried both, but a review of the data landed in our target BigQuery table indicates that neither seems to work, i.e. we still have duplicate records.
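A separate concern we are investigating: BigQuery's best-effort de-duplication collapses rows only when they arrive with the same insertId. Our key is built from the local clock at processing time, so if the two sites stamp the same record in different seconds they will attach different ids and the duplicates cannot be collapsed. Deriving the id deterministically from the record content would guarantee both sites compute the same value. A minimal sketch of such a helper (our own code, not part of Camel; the hashing choice is an assumption):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class InsertIdGenerator {

    /**
     * Derives a BigQuery insertId from the record payload itself, so that
     * two sites processing the same record always produce the same id,
     * regardless of their local clocks.
     */
    public static String insertIdFor(String recordPayload) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(recordPayload.getBytes(StandardCharsets.UTF_8));
            // render the digest as lowercase hex (64 characters)
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }
}
```

The helper could feed the KeyValuePreviouslyGenerated header in place of the timestamp-based key.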
Questions
- Is there an error in the way we are setting the InsertId / INSERT_ID in the code above?
- Have you used the Camel Google BigQuery component to stream data into BigQuery?
- If so, have you used the InsertId / INSERT_ID de-duplication mechanism successfully? If so, which one, and how?
- What de-duplication time window(s) have you observed?