Faster Logical Backup/Restore using pgcopydb – PostgreSQL

No doubt, pg_dump/pg_restore is an amazing tool to perform logical backup and restorations in PostgreSQL. We have already explained in detail about pg_dump/pg_restore here.

The tool pgcopydb is an Open Source tool that automates executing pg_dump and pg_restore between two PostgreSQL servers. It is a wrapper over pg_dump and pg_restore to simplify the process with more functionalities. In this blog post, we will look at pgcopydb in detail.

Why do we need to consider pgcopydb?

pgcopydb helps to achieve two important things that are not possible to achieve directly with pg_dump/pg_restore.

1. No intermediate files

To achieve faster dump and restore using pg_dump/pg_restore, we need to use parallelism. To achieve that, the –jobs parameter is mandatory on both. Whenever we are using the –jobs parameter, to support parallelism, it needs to write to an intermediate file.

For example, let’s plan to copy 1 TB of data from source to destination, we need to store the dump file first in local. It means we require additional disk space on the server just to place the dump. Disk space required depends on the size of the data.

By using pgcopydb, we do not require any additional disk space to place the dump because this tool will completely bypass the intermediate files for the table data and it supports parallelism.

This is achieved by connecting to the source databases during the entire operation, whereas for pg_restore, it is not required to connect to the source database always because it restores data from on disk intermediate files.

2. Create all indexes concurrently

pg_dump/pg_restore uses ALTER TABLE command directly to build both indexes and constraints. But the ALTER TABLE .. ADD CONSTRAINT command requires ACCESS EXCLUSIVE lock that prevents any concurrency.

In pgcopydb, it first executes CREATE UNIQUE INDEX statement. Once it is completed, it just adds the constraint using ALTER TABLE .. ADD CONSTRAINT .. PRIMARY KEY USING INDEX.. This is the way it helps to build all indexes in the table concurrently.

Installation

It is very straightforward to install in Ubuntu. The following are the steps

1. Import the repository key

sudo apt install curl ca-certificates gnupg
curl https://www.postgresql.org/media/keys/ACCC4CF8.asc | gpg --dearmor | sudo tee /etc/apt/trusted.gpg.d/apt.postgresql.org.gpg >/dev/null

2. Create /etc/apt/sources.list.d/pgdg.list.

sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt $(lsb_release -cs)-pgdg main" > /etc/apt/sources.list.d/pgdg.list'

3. Install pgcopydb

sudo apt update
sudo apt-get install pgcopydb

For other distributions, please refer here.

The following are the dependencies

  1. Source and Destination should be accessible from the host where pgcopydb is running. Also, Both source and destination should be available throughout the entire process for the successful migration.
  2. Even though pgcopydb does not require any storage to store data files locally, it requires a very small amount of storage on the server where pgcopydb is running. This is required to keep the schema backup and a few temporary files to keep progress. This would be very minimal in size as it contains only the structure of the schema objects. By default, it will use the path /tmp/pgcopydb. If required to use some other path, we can use the –dir option.
  3. pgcopydb depends on the pg_dump and pg_restore tools in the server where pgcopydb is running. The tool version of pg_dump and pg_restore should match the PostgreSQL version of the target database.

Workflow

The following workflow explains how a database copy happens using pgcopydb between two different PostgreSQL servers.

  1. pgcopydb first produces schema only backup using pg_dump in custom format(Not in plain text) from the source server. It will be in two files a) Pre-data section b) Post-data section
  2. pre-data section file will be restored in the destination server using the pg_restore command. It creates all the PostgreSQL objects in the destination server.
  3. pgcopydb gets the list of all tables from the source server and for each of them executes the copy process from the source to the target in a dedicated sub-processes. It will be performed for all tables until all the data has been copied over.
  4. An auxiliary process is started concurrently with the main copy process mentioned in point 3. This process is responsible for copying all the Large Objects from the source to the destination.
  5. In each copy table sub-process mentioned in point 3, as soon as the table data is copied, then the pgcopydb gets the list of indexes and creates them in parallel in the destination server.
  6. Index creations will be performed in parallel as we discussed in the previous section.
  7. As soon as the data and indexes are created, Vacuum analyze will be executed on each target table.
  8. Now, pgcopydb gets the list of sequences and updates the values accordingly in the destination server.
  9. Finally, the post-data section which contains foreign key constraints will be created in the target database.
  10. Above all successful completions makes the restoration of the database from the source to the destination successful.

Demo

In this demo, let’s see how to perform a complete database copy from the source to the destination and its comparison with pg_dump/pg_restore.

Server Type Host IP
Source 192.168.33.11
Destination192.168.33.12
pgcopydb192.168.33.13

It is not mandatory to run pgcopydb in a separate server. It can be installed in either source or destination server as well.

Now copying complete database demo from source to destination.

Setting up connection string in pgcopydb server

export PGCOPYDB_SOURCE_PGURI="port=5432 host=192.168.33.11 dbname=demo user=backup password=K5cq5M8yZtpV"
export PGCOPYDB_TARGET_PGURI="port=5432 host=192.168.33.12 dbname=demo user=backup password=RK5cq5M8ZtpV"

Starting pgcopydb

pgcopydb copy-db --table-jobs 16 --index-jobs 16


12:17:10 3971226 INFO  Running pgcopydb version 0.6-1.pgdg20.04+1 from "/usr/bin/pgcopydb"
12:17:10 3971226 INFO  [SOURCE] Copying database from "postgres://backup@192.168.33.11:5432/demo?password=****"
12:17:10 3971226 INFO  [TARGET] Copying database into "postgres://backup@192.168.33.12:5432/demo?password=****"
12:17:10 3971226 INFO  Using work dir "/tmp/pgcopydb"
12:17:10 3971226 INFO  STEP 1: dump the source database schema (pre/post data)
12:17:10 3971226 INFO  Exported snapshot "00000004-00000F28-1" from the source database
12:17:10 3971226 INFO   /usr/bin/pg_dump -Fc --snapshot 00000004-00000F28-1 --section pre-data --file /tmp/pgcopydb/schema/pre.dump 'postgres://backup@192.168.33.11:5432/demo?'
12:17:10 3971226 INFO   /usr/bin/pg_dump -Fc --snapshot 00000004-00000F28-1 --section post-data --file /tmp/pgcopydb/schema/post.dump 'postgres://backup@192.168.33.11:5432/demo?'
12:17:11 3971226 INFO  STEP 2: restore the pre-data section to the target database
12:17:11 3971226 INFO   /usr/bin/pg_restore --dbname 'postgres://backup@192.168.33.12:5432/demo?' /tmp/pgcopydb/schema/pre.dump
12:17:11 3971226 INFO  STEP 3: copy data from source to target in sub-processes
12:17:11 3971226 INFO  STEP 4: create indexes and constraints in parallel
12:17:11 3971226 INFO  STEP 5: vacuum analyze each table
12:17:11 3971226 INFO  Listing ordinary tables in source database
12:17:11 3971226 INFO  Fetched information for 10 tables, with an estimated total of 100000108 tuples and 20 GB
12:17:11 3971226 INFO  Now starting 10 processes
12:17:11 3971226 INFO  Reset sequences values on the target database
12:17:11 3971226 INFO  Listing sequences in source database
12:17:11 3971226 INFO  Fetched information for 10 sequences
12:17:11 3971243 INFO  COPY "public"."sbtest1";
12:17:11 3971244 INFO  COPY "public"."sbtest10";
12:17:11 3971247 INFO  COPY "public"."sbtest2";
12:17:11 3971245 INFO  COPY "public"."sbtest3";
12:17:11 3971249 INFO  COPY "public"."sbtest4";
12:17:11 3971251 INFO  COPY "public"."sbtest5";
12:17:11 3971253 INFO  COPY "public"."sbtest6";
12:17:11 3971256 INFO  COPY "public"."sbtest7";
12:17:11 3971255 INFO  COPY "public"."sbtest8";
12:17:11 3971257 INFO  COPY "public"."sbtest9";
12:17:11 3971242 INFO  Copying large objects
12:20:23 3971244 INFO  Creating 2 indexes for table "public"."sbtest10"
12:20:23 3972757 INFO  VACUUM ANALYZE "public"."sbtest10";
12:20:23 3972758 INFO  CREATE UNIQUE INDEX sbtest10_pkey ON public.sbtest10 USING btree (id);
12:20:23 3972759 INFO  CREATE INDEX k_10 ON public.sbtest10 USING btree (k);
12:20:23 3971243 INFO  Creating 2 indexes for table "public"."sbtest1"
12:20:23 3972764 INFO  VACUUM ANALYZE "public"."sbtest1";
12:20:23 3972765 INFO  CREATE UNIQUE INDEX sbtest1_pkey ON public.sbtest1 USING btree (id);
12:20:23 3972766 INFO  CREATE INDEX k_1 ON public.sbtest1 USING btree (k);
12:20:24 3971249 INFO  Creating 2 indexes for table "public"."sbtest4"
12:20:24 3972789 INFO  VACUUM ANALYZE "public"."sbtest4";
12:20:24 3972790 INFO  CREATE UNIQUE INDEX sbtest4_pkey ON public.sbtest4 USING btree (id);
12:20:24 3972791 INFO  CREATE INDEX k_4 ON public.sbtest4 USING btree (k);
12:20:24 3971251 INFO  Creating 2 indexes for table "public"."sbtest5"
12:20:24 3972808 INFO  VACUUM ANALYZE "public"."sbtest5";
12:20:24 3972809 INFO  CREATE UNIQUE INDEX sbtest5_pkey ON public.sbtest5 USING btree (id);
12:20:24 3972810 INFO  CREATE INDEX k_5 ON public.sbtest5 USING btree (k);
12:20:25 3971245 INFO  Creating 2 indexes for table "public"."sbtest3"
12:20:25 3972815 INFO  VACUUM ANALYZE "public"."sbtest3";
12:20:25 3972816 INFO  CREATE UNIQUE INDEX sbtest3_pkey ON public.sbtest3 USING btree (id);
12:20:25 3972817 INFO  CREATE INDEX k_3 ON public.sbtest3 USING btree (k);
12:20:25 3971253 INFO  Creating 2 indexes for table "public"."sbtest6"
12:20:25 3972822 INFO  VACUUM ANALYZE "public"."sbtest6";
12:20:25 3972823 INFO  CREATE UNIQUE INDEX sbtest6_pkey ON public.sbtest6 USING btree (id);
12:20:25 3972824 INFO  CREATE INDEX k_6 ON public.sbtest6 USING btree (k);
12:20:25 3971257 INFO  Creating 2 indexes for table "public"."sbtest9"

Summary of the total time taken

      OID | Schema |     Name | copy duration | indexes | create index duration
----------+--------+----------+---------------+---------+----------------------
392532910 | public |  sbtest1 |         3m12s |       4 |                 3m26s
392532903 | public | sbtest10 |         3m11s |       4 |                 3m11s
392532907 | public |  sbtest2 |         3m14s |       4 |                 5m10s
392532912 | public |  sbtest3 |         3m13s |       4 |                 4m28s
392532894 | public |  sbtest4 |         3m13s |       4 |                 3m14s
392532895 | public |  sbtest5 |         3m13s |       4 |                 5m54s
392532906 | public |  sbtest6 |         3m14s |       4 |                 5m55s
392532913 | public |  sbtest7 |         3m14s |       4 |                 5m59s
392532911 | public |  sbtest8 |         3m14s |       4 |                 5m04s
392532896 | public |  sbtest9 |         3m14s |       4 |                 3m17s


                                          Step   Connection    Duration   Concurrency
 ---------------------------------------------   ----------  ----------  ------------
                                   Dump Schema       source       482ms             1
                                Prepare Schema       target       110ms             1
 COPY, INDEX, CONSTRAINTS, VACUUM (wall clock)         both       6m16s       10 + 16
                             COPY (cumulative)         both      32m17s            10
                    Large Objects (cumulative)         both         7ms             1
        CREATE INDEX, CONSTRAINTS (cumulative)       target      45m42s            16
                               Finalize Schema       target        63ms             1
 ---------------------------------------------   ----------  ----------  ------------
                     Total Wall Clock Duration         both       6m17s       10 + 16
 ---------------------------------------------   ----------  ----------  ------------

It took a total of 6 minutes and 17 seconds to copy the 25GB database from one server to another. Let’s try the same data copy with pg_dump/pg_restore directly.

time pg_dump -h192.168.33.11 -Ubackup --port=5432 -Fd  demo -j 16 -f backup

real    2m34.974s
user    24m38.031s
sys     0m24.372s

time pg_restore -h192.168.33.12  -Ubackup  --port=5432  -d demo  -j 16 backup/

real    6m5.533s
user    1m58.733s
sys     0m8.671s

It took a total of 2 minutes and 34 seconds for backup, 6 minutes and 5 seconds for restore, and 22 seconds for vacuum. ie) 9 minutes. Used 16 parallel threads for both the tests.

MethodTime taken ( sec )
pgcopydb377
pgdump/pgrestore519

As per the stats, we could see how efficient pgcopydb is to perform logical backup and restore in PostgreSQL. So far, we have discussed only how to perform a complete database copy from source to destination. pgcopydb also supports filtering(copy specific objects) from its latest release. In our next blog post, we will look at how to use filtering in pgcopydb. This feature will be really helpful for daily database operations as well.

2 thoughts on “Faster Logical Backup/Restore using pgcopydb – PostgreSQL

    1. Hi Chris,

      Thanks for your time in reading the blog. Yes, I would suggest setup a VM on Linux OS and do the pgcopydb installation in it. Ensure you have connectivity of port 5432 between Linux VM and 2 Postgres Window Instances. Then proceed with the process mentioned in the blog. It should work and transfer the data between two PG servers on windows. Please try this out and let me know in case of any issues or findings. Thank you.

      Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s