Setting up a High Availability for PMM

We have been working with PMM for quite a long time, we do most of the performance analysis with PMM for most of our clients. It also provides the flexibility that we have built our own custom dashboard. PMM has many advantages

  • Easy to deploy (docker based)
  • Flexible
  • Customizable
  • Query Analytics
  • One-stop solution for MySQL,Mongo,ProxySQL & PostgresSQL
  • Orchestrator
  • Rich and Deep metrics stats

Highly recommended for any production deployments its equivalent to Enterprise-grade monitoring and graphing tool.

Recently we have been working for our client on MySQL Consulting to scale peak sale of the year. Wherein we have deployed PMM to view the performance insights of DB.

We started on-boarding a considerable number of servers under PMM, Everyone liked the way PMM was able to project the performance insights with a beautiful dashboard over a period of time.

Need for HA and Backup of PMM server

When PMM started covering border group of server, we were raised on the question of HA for PMM server, of course it is good to have a High Availability for all possible solutions.

What if the server crashes due to Hardware failures?

  • Does PMM support redundancy?
  • Is PMM server a single point of failure (SPOF) ?
  • What would be the backup strategy?

We had search for many options to achieve HA, with containerized (docker) PMM server. We explored

  • lsyncd – Rsync based daemon to copy incremental changes at the block level, failed when we tried to sync the entire /var/lib/docker and start back docker services
  • DRBD – will work, but it has complexity with the setup and maintenance

The Accidental HA solution:

As the name suggests, “it was Accidental,”. lets see how it’s done.We have below servers, those were running on debian9(Stretch), with the same version of docker.

Source     : live.mydbops.com (production)
Destination: livebackup.mydbops.com (Replica)

The requirement here is to sync the metrics data between the source(live.mydbops.com) and destination (livebackup.mydbops.com), On the source server we had Prometheus data set size around 178G, so the initial sync took some time.

Stop PMM-server on source

# docker stop pmm-server

Copy the existing data within the docker volume.

#docker cp pmm-data:/opt/prometheus/data opt/prometheus/
# docker cp pmm-data:/opt/consul-data opt/
# docker cp pmm-data:/var/lib/mysql var/lib/
# docker cp pmm-data:/var/lib/grafana var/lib/

Once the initial copy is done, start back the source PMM server

# docker start pmm-server

Next, to transfer the data copy to the destination server, I have used SCP here.

#scp -r /backup/* livebackup.mydbops.com:/backup

Now on the destination server, make sure to have the same version of the PMM server as the source, I have here used version “1.17-1” Since the monitoring server used for testing does not have internet access enabled I had to do an offline load of PMM image as below

# docker load < pmm-server_1.17.tar
071d8bd76517: Loading layer [==================================================>] 210.2MB/210.2MB
6e66f289f547: Loading layer [==================================================>] 899.3MB/899.3MB
Loaded image: percona/pmm-server:latest

Creating docker volume as same as the source server

# docker create -v /opt/prometheus/data -v /opt/consul-data -v /var/lib/mysql -v /var/lib/grafana --name pmm-data percona/pmm-server:latest /bin/true
985630d4aa64375cc4834ceb0f66654bada98cda151636bcd63e9f9b0d3fe3a2

Once the data volume is created now proceed to copy back the data into docker data volume as below

# docker cp /backup/opt/prometheus/data pmm-data:/opt/prometheus/
# docker cp /backup/opt/consul-data pmm-data:/opt/
# docker cp /backp/var/lib/mysql pmm-data:/var/lib/
# docker cp /backup/var/lib/grafana pmm-data:/var/lib/

Once the data-copy is done, PMM data volume is ready to be used after the change of ownership as below.

docker run --rm --volumes-from pmm-data -it percona/pmm-server:latest chown -R pmm:pmm /opt/prometheus/data /opt/consul-data
docker run --rm --volumes-from pmm-data -it percona/pmm-server:latest chown -R grafana:grafana /var/lib/grafana
docker run --rm --volumes-from pmm-data -it percona/pmm-server:latest chown -R mysql:mysql /var/lib/mysql

Now let’s proceed with the final step to run the pmm-server

#docker run -d -p 8080:80 --volumes-from pmm-data --name pmm-server -e SERVER_USER=fhuser -e SERVER_PASSWORD=FHPassword -e METRICS_RETENTION=1500h -e ORCHESTRATOR_ENABLED=true -e ORCHESTRATOR_USER='pmm_user' -e ORCHESTRATOR_PASSWORD='PMM_pass' -e METRICS_RESOLUTION=5s --restart always percona/pmm-server:latest

Note this command should be exactly the same command executed on the master.After a few moment the PMM service was started in the destination box, we could see the live metrics data polling in (Graph below).

We wanted to check the data sync between the source and destination, so we decided to make it run for a couple of more days

After this test run period, We have verified the graphs between source and destination, which seems exactly the same, I am just sharing a sample graph for a DB server(172.31.11.11) between source and destination.

Graph from Destination

Graph from source :

How does it work?

In the first instance when we saw the data is being polled in the backup we were astonished !! and more curious to know how it’s actually working ?. We had a long discussion with our expert team on this

This is just a conceptual understanding to explain this scenario.Let’s go back to the official architecture of the PMM server by Percona, As below.

.

As the above architecture depicts PMM has two parts one is the QAN(Query Analytics) and (MM) Metrics Monitor, where we have achieved the redundancy.

Metrics monitor works by a “PULL” mechanism, ie., the Prometheus collects the metrics data from registered exporters. Since we have duplicated and created the destination server with the same registered exporter from servers. Hence destination, it makes one more pull from DB servers and stores data on to its own data-volume. What we have achieved is in the below architecture

To analyze this we decided to capture packets on the DB machine for those servers and we could

TCP packet analysis via TCP Dump:

root@mydbops-db-8046755:/home/kabilesh.pr# tcpdump -c 20 -tttt -i eth0 src livebackup.mydbops.com

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode

listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
2019-10-08 13:49:30.593668 IP livebackup.mydbops.com.59502 > mydbops-db-8046755.42002: Flags [P.], seq 1314972321:1314972594, ack 1936498908, win 1444, options [nop,nop,TS val 130344282 ecr 540951618], length 273
2019-10-08 13:49:30.614009 IP livebackup.mydbops.com.59502 > mydbops-db-8046755.42002: Flags [.], ack 4126, win 1422, options [nop,nop,TS val 130344287 ecr 540952633], length 0
2019-10-08 13:49:30.614022 IP livebackup.mydbops.com.59502 > mydbops-db-8046755.42002: Flags [.], ack 9052, win 1390, options [nop,nop,TS val 130344287 ecr 540952633], length 0
2019-10-08 13:49:30.712817 IP livebackup.mydbops.com.59516 > mydbops-db-8046755.42002: Flags [P.], seq 1684281946:1684282219, ack 3948074006, win 1444, options [nop,nop,TS val 130344312 ecr 540951623], length 273
2019-10-08 13:49:31.803979 IP livebackup.mydbops.com.42998 > mydbops-db-8046755.42000: Flags [P.], seq 703692847:703693117, ack 1617572969, win 1444, options [nop,nop,TS val 130344585 ecr 540952412], length 270
2019-10-08 13:49:31.816501 IP livebackup.mydbops.com.42998 > mydbops-db-8046755.42000: Flags [.], ack 15606, win 1436, options [nop,nop,TS val 130344588 ecr 540952933], length 0

20 packets captured
23 packets received by filter
0 packets dropped by kernel
root@mydbops-db-8046755:/home/kabilesh.pr# tcpdump -c 20 -tttt -i eth0 src live.mydbops.com
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
2019-10-08 13:49:42.615378 IP live.mydbops.com.40336 > mydbops-db-8046755.42000: Flags [P.], seq 2445371654:2445371924, ack 1808532851, win 1444, options [nop,nop,TS val 905019213 ecr 540954908], length 270
2019-10-08 13:49:42.627810 IP live.mydbops.com.40336 > mydbops-db-8046755.42000: Flags [.], ack 4126, win 1428, options [nop,nop,TS val 905019217 ecr 540955636], length 0
2019-10-08 13:49:42.627833 IP live.mydbops.com.40336 > mydbops-db-8046755.42000: Flags [.], ack 14262, win 1386, options [nop,nop,TS val 905019217 ecr 540955636], length 0
2019-10-08 13:49:43.490791 IP live.mydbops.com.55843 > mydbops-db-8046755.42002: Flags [P.], seq 611310810:611311083, ack 2742084436, win 1444, options [nop,nop,TS val 905019432 ecr 540954843], length 273
2019-10-08 13:49:43.512377 IP live.mydbops.com.55843 > mydbops-db-8046755.42002: Flags [.], ack 4126, win 1432, options [nop,nop,TS val 905019438 ecr 540955857], length 0
2019-10-08 13:49:43.512388 IP live.mydbops.com.55843 > mydbops-db-8046755.42002: Flags [.], ack 9064, win 1410, options [nop,nop,TS val 905019438 ecr 540955857], length 0

20 packets captured
20 packets received by filter
0 packets dropped by kernel

Key Takeaways:

  • This works well for replicating exactly the same set of servers which is already registered with the source. If a new server is added, I believe the samessteps have to be followed back.
  • This is not officially supported, though it works and full-fills the requirement.
  • QAN HA is not done here, but it can be done since the PMM server uses MySQL internally to store the QAN data, hence it can be replicated by exposing the MySQL port.
  • There can be minor differences or deviation in the metrics data since the time at which it polls the data might be different between the servers.

As said earlier it is good to have a High Availability for PMM as it plays a critical role.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s