We have been using karaf version 2.3.3 for some months now on a system that receives data files, translates the data into objects, and persists those objects to a data store.
Recently, we've noticed that when karaf is stopped and restarted, the bundles get into some kind of locked state for a period of time.
Here is a sequence of events:
1) During chef run, bundles are deployed into the deploy directory while karaf is down
2) When karaf comes up, all bundles and blueprints resolve correctly
3) When karaf is cycled (stopped and restarted), bundles resolve correctly, but the blueprints get into a locked state: most are up, but one is stuck in a Stopping state, and several may be in a Resolved state
4) After 5 minutes (the timeout), the Stopping bundle goes to Resolved, and some other bundle moves into the Stopping state
5) Some of the time (most of the time?), if you wait long enough, all bundles will eventually move to an Active state and the system will be fully up
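The waiting in step 5 could be scripted rather than watched by hand. Below is a minimal sketch in Python, assuming a hypothetical `fetch_states` callable (not part of karaf) that returns a mapping of bundle name to state, however that is obtained. The function name and the timeout and interval values are placeholders, not anything karaf prescribes.

```python
import time


def wait_until_active(fetch_states, timeout_s=900.0, interval_s=10.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll until every bundle reports 'Active' or the timeout expires.

    fetch_states is a hypothetical callable returning {bundle_name: state}.
    Returns True if all bundles became Active, False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        states = fetch_states()
        # An empty snapshot is treated as "not up yet", not as success.
        if states and all(state == "Active" for state in states.values()):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

Since each stuck bundle seems to hold things up for about 5 minutes, the timeout would need to cover several such periods; the 900 seconds here is only a guess.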
While karaf is starting, I can use the karaf client to issue 'list' commands and watch the bundles start up. The bundles cycle through:
Installed -> Resolved -> Active,
while the blueprints cycle through:
blank -> Creating -> Created, with an occasional GracePeriod thrown in while dependent services are coming up.
After it appears that all services are Active and all blueprints are Created, one bundle will get stuck in a Stopping state while others revert to a Resolved state:
[ 136] [Active    ] [Created  ] [ 80] transformation-services (1.0.3)
[ 137] [Active    ] [Created  ] [ 80] event-services (0.1.2)
[ 138] [Active    ] [Created  ] [ 80] ftp-services (0.0.0)
[ 139] [Active    ] [Created  ] [ 80] ingest-resources (0.0.1)
[ 140] [Active    ] [Created  ] [ 80] orchestration-app (0.2.3)
[ 141] [Active    ] [Created  ] [ 80] aws-services (0.4.0)
[ 142] [Resolved  ] [         ] [ 80] point-data-service-test (0.2.0)
[ 143] [Active    ] [Created  ] [ 80] event-consumer-app (1.3.4)
[ 144] [Stopping  ] [         ] [ 80] XXXX_no_op_log_transform.xml (0.0.0)
[ 145] [Resolved  ] [         ] [ 80] persistence-app (1.3.3)
[ 146] [Active    ] [Created  ] [ 80] ftp-ingest-endpoint (1.0.2)
[ 147] [Resolved  ] [         ] [ 80] secondary_ftp.xml (0.0.0)
[ 148] [Resolved  ] [         ] [ 80] event-rest-test (0.0.0)
[ 149] [Resolved  ] [         ] [ 80] customer_credentials.xml (0.0.0)
[ 150] [Resolved  ] [         ] [ 80] customer1_xml.xml (0.0.0)
[ 151] [Active    ] [Created  ] [ 80] endpoint-services (0.0.0)
[ 152] [Active    ] [Created  ] [ 80] scheduler-services (0.1.0)
[ 153] [Active    ] [Created  ] [ 80] fourhundred_xml.xml (0.0.0)
[ 154] [Active    ] [Creating ] [ 80] point-data-service (2.3.3)
[ 155] [Installed ] [         ] [ 80] customer1_csv.xml (0.0.0)
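To spot the stuck bundles without reading the listing by eye, output like the above can be parsed with a small script. This is only a sketch: the regular expression is inferred from the lines pasted here, not from any documented karaf format, and the names `parse_list_output` and `unsettled` are made up for illustration. It treats a bundle as settled when its state is Active and its blueprint column is either blank or Created.

```python
import re

# Pattern inferred from the pasted listing (an assumption, not a documented format):
# [id] [state] [blueprint] [level] name (version)
LINE_RE = re.compile(
    r"\[\s*(?P<id>\d+)\]\s*"
    r"\[(?P<state>[^\]]*)\]\s*"
    r"\[(?P<blueprint>[^\]]*)\]\s*"
    r"\[\s*(?P<level>\d+)\]\s*"
    r"(?P<name>.+?)\s*\((?P<version>[^)]*)\)\s*$"
)

# A few lines copied from the listing in this post.
SAMPLE = """\
[ 141] [Active ] [Created ] [ 80] aws-services (0.4.0)
[ 142] [Resolved ] [ ] [ 80] point-data-service-test (0.2.0)
[ 144] [Stopping ] [ ] [ 80] XXXX_no_op_log_transform.xml (0.0.0)
[ 154] [Active ] [Creating ] [ 80] point-data-service (2.3.3)
"""


def parse_list_output(text):
    """Turn 'list' output into a list of dicts, skipping unrecognized lines."""
    rows = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # headers, blank lines, anything that does not fit the pattern
        row = {key: value.strip() for key, value in match.groupdict().items()}
        row["id"] = int(row["id"])
        row["level"] = int(row["level"])
        rows.append(row)
    return rows


def unsettled(rows):
    """Bundles that are not Active, or whose blueprint is neither blank nor Created."""
    return [
        row for row in rows
        if row["state"] != "Active" or row["blueprint"] not in ("", "Created")
    ]


for row in unsettled(parse_list_output(SAMPLE)):
    print(row["id"], row["state"], row["blueprint"] or "-", row["name"])
```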
We have around 20 custom bundles that perform a variety of services. Some describe services that run in a scheduled executor. Some expose CXF REST services. Some are simple blueprint files that have been dropped into the karaf deploy directory. We are using the whiteboard pattern to discover, register, and access the services from the blueprint files that are dropped into the hot deploy directory.
I've played around with using a feature file and with setting the bundle start levels, but still see the same behavior. I've found a few JIRAs that describe this as a blueprint synchronization problem (https://issues.apache.org/jira/browse/KARAF-1724 https://issues.apache.org/jira/browse/ARIES-1051), but they don't offer any concrete advice.
Has anyone come across this same issue and come up with a reliable way to work around it?