0

Our company uses Geode services for some of our applications, we are making use of Geode Member group configurations as well for maintaining different regions.

We have been undergoing an effort of migrating our applications from Geode version 1.6 to the latest version 1.12.

We have seen dramatic performance decrease after the upgrade, if we use the older parameters for the server and locator startup scripts, and things work fine when we remove those parameters.

We are now planning to take the route of understanding the parameters (earlier used) and available, to determine the most optimal configurations for the server and locator to get the best out of the new Geode version.

I was wondering if someone has any best practices or recommendations to follow for this task.

Below are the configurations for the Geode locator and server startup scripts for old and new versions.


Locator startup command

---Old configurations ( works great with Geode 1.6 version but not with any version after Geode 1.8)

gfsh start locator --locators=$locators_str --name=${EC2_HOSTNAME}.aws.compnaynamedigital.net --initial-heap=2G --max-heap=2G --dir=/opt/compnayname/geode/locator --J=-Dlog4j.configurationFile=/opt/compnayname/geode/log4j2-locator.xml --J=-DCLUSTER=${ECS_CLUSTER} --J='-javaagent:/opt/compnayname/geode/jmxtrans-agent-1.2.6.jar=/opt/compnayname/geode/jmxtrans-agent-locator.xml' --J=-Dgemfire.distributed-system-id=${DISTRIBUTED_SYSTEM_ID} --J=-Dgemfire.member-timeout=30000 --J=-Dgemfire.max-num-reconnect-tries=0 --J=-Dgemfire.jmx-manager=true --J=-Dgemfire.jmx-manager-start=true --J=-Dgemfire.jmx-manager-port=1099 --J=-Dgemfire.http-service-port=0 --J=-Dgemfire.log-level=info --J=-Dgemfire.log-file-size-limit=10 --J=-Dgemfire.log-disk-space-limit=10 --J=-Dgemfire.disable-auto-reconnect=true

---New configuration (works great with all versions)

gfsh start locator --locators=$locators_str --name=${EC2_HOSTNAME}.aws.compnaynamedigital.net --J=-Xmx2048m --dir=/opt/compnayname/geode/locator --J=-Dlog4j.configurationFile=/opt/compnayname/geode/log4j2-locator.xml --J='-javaagent:/opt/compnayname/geode/jmxtrans-agent-1.2.6.jar=/opt/compnayname/geode/jmxtrans-agent-locator.xml'

Server Startup command

---Old configurations ( works great with Geode 1.6 version but not with any version after Geode 1.8)

gfsh start server --locators=$locators_str --name=${EC2_HOSTNAME}.aws.compnaynamedigital.net --initial-heap=${GEODE_INIT_HEAP} --max-heap=${GEODE_MAX_HEAP} --group=${SERVER_GROUP} --dir=/opt/compnayname/geode/server --classpath=/opt/compnayname/geode/services-geode.jar --J=-Dlog4j.configurationFile=/opt/compnayname/geode/log4j2-server.xml --J=-DCLUSTER=${ECS_CLUSTER} --J='-javaagent:/opt/compnayname/geode/jmxtrans-agent-1.2.6.jar=/opt/compnayname/geode/jmxtrans-agent-server.xml' --J=-Dgemfire.distributed-system-id=${DISTRIBUTED_SYSTEM_ID} --J=-Dgemfire.member-timeout=30000 --J=-Dgemfire.max-num-reconnect-tries=0 --J=-Dgemfire.socket-buffer-size=16777215 --J=-Dgemfire.off-heap-memory-size=${GEODE_OFF_HEAP} --J=-XX:+UseParNewGC --J=-XX:+UseConcMarkSweepGC --J=-XX:CMSInitiatingOccupancyFraction=60 --eviction-heap-percentage=70 --critical-heap-percentage=90 --J=-Dgemfire.http-service-port=0 --J=-Dgemfire.log-level=info --J=-Dgemfire.log-file-size-limit=10 --J=-Dgemfire.log-disk-space-limit=10 --J=-Dgemfire.disable-auto-reconnect=true ${ADDTL_GEODE_SERVER_OPTS}

---New configuration (works great with all versions)

gfsh start server --locators=$locators_str --name=${EC2_HOSTNAME}.aws.compnaynamedigital.net --J=-Xmx${GEODE_MAX_HEAP}  --group=${SERVER_GROUP} --dir=/opt/compnayname/geode/server --classpath=/opt/compnayname/geode/services-geode.jar --J=-Dlog4j.configurationFile=/opt/compnayname/geode/log4j2-server.xml --J='-javaagent:/opt/compnayname/geode/jmxtrans-agent-1.2.6.jar=/opt/compnayname/geode/jmxtrans-agent-server.xml'

Test Environment Details

We are using the exact same environment (read AWS) for testing the old and new configurations and performing the same test to measure the response time. We are using 3 Geode locators and 3 Geode servers for the different member groups.

The only difference is the Geode version

We are actually doing a count operation (we have written a count function to execute on Geode regions to count the records existing in the downloaded data which is actually data sketches (https://datasketches.apache.org/)). This count operation on the same data in the same testing environment is giving a drastically slow response with the old configuration using any Geode version beyond 1.8

Another surprising thing is that if I use the old configurations in my local laptop (my laptop serves as locator and server both) with any Geode version greater than 1.8 (including the latest version of Geode), then I am not seeing this issue. Somehow these extra configurations are causing slowness in the AWS environment in the distributed infrastructure.

Please let me know if more information is required and I will be glad to provide more details.

Any information on this will be appreciated.

1 Answers1

0

The main difference I see is the inclusion of --J='-javaagent:/opt/xyz/geode/jmxtrans-agent-1.2.6.jar=/opt/xyz/geode/jmxtrans-agent-server.xml' in the startup parameters. This seems to be a third party Java Agent to expose several JVM metrics through JMX.

Do you know if the agent itself is modifying the byte code?, I've seen negative effects of that approach for Geode applications in the past (not performance related, though). Have you tried upgrading the agent to the latest available version (1.2.10)?. As a side note, Geode already exposes a lot of metrics and information through JMX out of the box, is there any reason why you're relying on yet another external tool for this?.

We have seen dramatic performance decrease after the upgrade

How are you measuring performance?, where do you see the degradation?, are you executing exactly the same workload on exactly the same machines, where the only difference is the Geode version?. There are several actors in play here.


That said, diagnosing and troubleshooting performance degradations can be a long and though process, so my suggestion would be to open a Geode JIRA Ticket with all the relevant information and artefacts.

Cheers.

Community
  • 1
  • 1
Juan Ramos
  • 1,421
  • 1
  • 8
  • 13
  • Juan, Thanks for the details. Let me first apologize for an incomplete and in-accurate configuration that was posted by me accidentally. I just updated it with accurate configurations and context. We are measuring performance using a JUnit test (performs a count operation on the data-sketch downloaded from S3) We are seeing slowness in S3 download and in sketch operations. Yes, everything is the same except for the geode version. One surprising thing is that if I use the old configurations in my local laptop (my laptop serves as locator and server both) then slowness isn't there. – AmitkAgrawal Jun 05 '20 at 11:19