Performance & Stability Tuning for a minilith: back to basics

Posted on: 2020-07-11

I recently assisted a medium-sized organization in resolving chronic instability and performance problems which had 'out of nowhere' begun to plague their core in-house management server application.


Infrastructure practices:

  • Keeping a tidy server
  • Logging
  • Performance Tuning
  • Scaling

I call the application a "minilith" because it is a monolith, in a scenario where a monolith makes perfect sense. Plus it is medium-sized, not huge - but "medilith" really doesn't sound great.

The application debuted in 2016 and has since expanded to cover three of the organization's business units, served by three separate instances sharing the same codebase but with different databases and parameterization. It is used by both staff and customers. It runs on Apache Tomcat with a MySQL database, accessed via a Hibernate-like (but not Hibernate) ORM framework. It exposes a traditional web interface to users, plus it integrates with a COTS accounting system via a SOAP/HTTP bridge.

The initial call for help was:

"the application has been working fine for a while but just last week it started falling over a lot - what happened?".

The answer to the question makes for an interesting story about the interaction between the organizational and technical aspects of the infrastructure, as well as the difficulty of tracking down a problem which has multiple distinct causes of failure.


First step was to discuss with key users and the dev team to *understand how the usage pattern of the application had evolved* since it was launched. This brought up several key facts:

  • the business unit had shown slow but steady organic growth over the period, meaning that the user base had grown by over 70%
  • during that time a new Operations Director had taken over the unit and steadily pushed paper-based workflows to be partially or fully digitalized via the application
  • since March 2020, with Covid-19 related measures restricting physical access, many clients who had previously used the application sparingly were now using it for all their interactions

Taking these all together, I estimated that the application was in fact handling a 5-fold increase in transaction volume compared to 2016.

Second step was to discuss the current server setup and its historical evolution with the infrastructure team. Key discoveries:

  • the application hardware had been upgraded once during this period, moving from a single-core 8 GB node to a four-core 16 GB node
  • as part of the upgrade, the node had been virtualized ... and subsequent to the upgrade, the application VM had been moved to different physical hosts a couple of times
  • the team had recently undergone severe staff turbulence: over the last six months of 2019, all staff (including the manager) but one had been replaced at different times

Third step was to look at the actual infrastructure setup and compare the reality to what had been described. Some very interesting discoveries here:

  • the application VM was sharing its host with two other low-usage nodes - which nevertheless were taking a big chunk of resources - leaving the VM with effective access to only 1 core and 8 GB
  • the Tomcat JVM had been reinstalled and had reverted to default settings - meaning a max heap size of only 1 GB and only 100 threads
  • the application had in fact been freezing intermittently for over a month - however, due to the organizational shakeup, the new sysadmin had not been clear about how and when to report it.

Fourth step was to immediately free up more resources for the application:

  • the other VMs were pruned down to a smaller resource allocation (which was in fact all they needed), freeing up 2 cores (bursting to 4) and 10 GB of RAM for the application VM
  • Tomcat's max heap was increased to 4 GB (a quick way to verify the settings actually in effect is sketched below)
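
As a quick sanity check after changes like these (this wasn't part of the original setup - it is only an illustrative sketch), a few lines of Java run from inside the container will report what the JVM actually received once the hypervisor, the VM and Tomcat's startup scripts have all had their say:

```java
// Illustrative sketch: report the resources the JVM actually sees. Handy for
// confirming that an -Xmx change (and the VM's CPU allocation) really reached
// the application, rather than trusting what the config files claim.
public class RuntimeCheck {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.out.printf("Available processors:   %d%n", rt.availableProcessors());
        System.out.printf("Max heap (MB):          %d%n", rt.maxMemory() / (1024 * 1024));
        System.out.printf("Allocated heap (MB):    %d%n", rt.totalMemory() / (1024 * 1024));
        System.out.printf("Free in allocated (MB): %d%n", rt.freeMemory() / (1024 * 1024));
    }
}
```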

This caused the application to run perceptibly faster most of the time - however at peak times - and also at random non-peak times - it would still hit a bottleneck and start degrading, until eventually seizing up altogether and requiring a Tomcat restart.

Fifth step was to take some metrics and activate more detailed logging by both MySQL and Tomcat, then watch it until the next failure:

  • MySQL: error and slow query log; also putting in a cronjob to monitor and log connection-related status variables on an hourly basis (a sketch of the kind of sampling involved follows this list)
  • Tomcat: alter the formatting of the access log to make it easy to analyse request-processing times by path and session
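
The monitoring itself was just an hourly cron job; purely to illustrate the kind of data it collected, here is a minimal Java/JDBC sketch that samples a few connection-related status variables (the URL, credentials and the exact variable list are assumptions for the example):

```java
import java.sql.*;

// Illustrative sketch, equivalent in spirit to the hourly cron job described
// above: sample a few connection-related MySQL status variables and print them
// with a timestamp. URL, credentials and variable list are placeholders.
public class ConnectionStatusSampler {
    private static final String[] VARIABLES = {
            "Threads_connected", "Max_used_connections", "Aborted_connects", "Aborted_clients"
    };

    public static void main(String[] args) throws SQLException {
        String url = "jdbc:mysql://localhost:3306/";
        try (Connection conn = DriverManager.getConnection(url, "monitor", "secret");
             Statement st = conn.createStatement()) {
            for (String variable : VARIABLES) {
                try (ResultSet rs = st.executeQuery("SHOW GLOBAL STATUS LIKE '" + variable + "'")) {
                    if (rs.next()) {
                        // Result columns are Variable_name and Value
                        System.out.printf("%tF %<tT %s=%s%n",
                                new java.util.Date(), rs.getString(1), rs.getString(2));
                    }
                }
            }
        }
    }
}
```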

Sixth step was where the fun really began:

  • we found several dozen aborted connection attempts to MySQL per day... from external public IP addresses: i.e. a dictionary attack
    • which led to us nmapping the public interface of the server - and lo and behold its MySQL service port (and several others besides) were wide open
    • it turned out that the server sat in a DMZ which had at some stage been correctly configured; however, during the reshuffle of VMs between hosts, it had been moved to a host which the firewall was exposing
    • once found, this issue was quickly sorted by adjustments to the network firewall as well as to MySQL's configuration
  • after this, we still found some aborted connections at peak times: these turned out to be due to the application's connection pool having no maximum size, while MySQL was running with its default max_connections setting (151)
    • bumping up MySQL's max_connections, and also increasing Tomcat's connection and thread limits somewhat, solved this issue - however we were still hitting failures at unexpected non-peak times
  • importing the Tomcat access logs into Excel and graphing them showed that most requests were being handled in tens of ms; at a certain point, however, massive spikes would occur (hundreds of seconds), and after this even previously fast requests would start to take hundreds or thousands of ms
  • from the same logs we could see who the users were around the time of the first spikes. An open conversation with them and the dev team revealed that "two releases ago" a new, more complex version of an existing report had been deployed, typically used in the last week of each month to monitor staff performance. In the (unstressed) test environment this report took 30-60s to generate and "sometimes didn't run the first time".
  • studying the report, we found that it was implemented not in SQL but in Java, traversing the ORM framework's object graph: the well-known "impedance mismatch" that libraries such as Slick explicitly define and avoid. This particular report nested for-loops seven levels deep, so a single run could produce as many as 900,000 database queries (depending on how the ORM framework handled its caching). Sure enough, in a controlled test a single click on the UI button to run this report caused the entire Tomcat JVM to begin to freeze up (a schematic sketch of the pattern follows this list).
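
To show the shape of the problem in schematic form (the entity names, the Dao interface and the table names below are invented for the example - the real report walked the in-house ORM's object graph), compare the nested-loop traversal with pushing the same aggregation into a single SQL statement:

```java
import java.sql.*;
import java.util.List;

// Schematic sketch only: the Dao interface, entities and tables are hypothetical.
public class ReportSketch {

    interface Dao {                          // stand-in for the ORM's data access
        List<Long> findUnitIds();
        List<Long> findTeamIds(long unitId);
        List<Long> findEmployeeIds(long teamId);
        long countTransactions(long employeeId);
    }

    // Anti-pattern: nested loops where every level issues its own queries. With
    // roughly N items per level and L levels of nesting this is on the order of
    // N^L queries - the real report nested seven levels deep.
    static long slowTotal(Dao dao) {
        long total = 0;
        for (long unit : dao.findUnitIds()) {                   // 1 query
            for (long team : dao.findTeamIds(unit)) {           // 1 query per unit
                for (long emp : dao.findEmployeeIds(team)) {    // 1 query per team
                    total += dao.countTransactions(emp);        // 1 query per employee
                }
            }
        }
        return total;
    }

    // The same aggregate as a single SQL statement: the database does the
    // joining and counting in one round trip.
    static long fastTotal(Connection conn) throws SQLException {
        String sql = "SELECT COUNT(*) FROM transactions t"
                   + " JOIN employees e ON t.employee_id = e.id"
                   + " JOIN teams tm ON e.team_id = tm.id"
                   + " JOIN units u ON tm.unit_id = u.id";
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(sql)) {
            rs.next();
            return rs.getLong(1);
        }
    }
}
```

Collapsing the traversal into one statement (or a handful of grouped queries) turns hundreds of thousands of round trips into work the database can do in a single pass.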

That report has now been deactivated, while it is re-imagined and rewritten, and several other corrective actions will be implemented over the next few weeks. The Infrastructure team will also apply a similar approach to the instances which serve the other business units. Meanwhile the application is now running smoothly and MySQL is chugging along nicely with 20-30 queries per second, and the application VM is using about 30% of its CPU capacity and 50% of its RAM.


In conclusion, we can see that the final, serious failure (as perceived by end-users and management) had multiple causes, occupying different positions on the organizational-technical spectrum and also on the temporal spectrum:

  • the long-term underlying organic growth of the business unit (a good thing) and its progressive move to a paperless approach (a good thing) put the infrastructure generally under stress
  • the medium-term turbulence in the infrastructure team caused technical knowledge to be lost and reduced its ability to see the wood for the trees
  • the short-term factor of Covid-19 remote working caused additional stress
  • and finally, the very short-term deployment of that one new - but very hefty - report was the "straw that broke the camel's back"

Technically, too, the fixes were several:

  • on the purely infrastructure side, configuring the firewall and VM hardware resources appropriately
  • lying between infrastructure and dev (and a good example of where, if this institution had a "devops" approach, it could have done better) was the configuring of connections and threads between MySQL and Tomcat (a sketch of the pool-sizing side of this follows the list)
  • on the dev side was the incremental complexification of the report, which led each new version to add another nested for loop and thus increase the query volume by an order of magnitude.
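
To illustrate that middle point (the post does not name the ORM framework's pool, so HikariCP stands in here purely as an assumption), the essential idea is that the application-side pool needs an explicit ceiling that fits comfortably under MySQL's max_connections:

```java
import com.zaxxer.hikari.HikariConfig;
import com.zaxxer.hikari.HikariDataSource;

// Illustrative sketch: cap the application's connection pool so it can never
// exceed what MySQL is configured to accept. HikariCP is a stand-in here for
// the unnamed ORM framework's own pool; URL and credentials are placeholders.
public class PoolSetup {
    public static HikariDataSource createDataSource() {
        HikariConfig cfg = new HikariConfig();
        cfg.setJdbcUrl("jdbc:mysql://db-host:3306/appdb");
        cfg.setUsername("app");
        cfg.setPassword("secret");
        // Keep this comfortably below MySQL's max_connections (151 by default),
        // leaving headroom for admin and monitoring connections.
        cfg.setMaximumPoolSize(80);
        // Fail fast when the pool is exhausted instead of queueing indefinitely.
        cfg.setConnectionTimeout(5_000);
        return new HikariDataSource(cfg);
    }
}
```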

#covid19 #ha #java #mysql #performance #security #virtualization