Two Methods of ClickHouse Backup and Recovery

ClickHouse is known as a data analytics processing engine. It is an open-source, column-oriented database management system capable of generating analytical data reports in real time using SQL queries.

ClickHouse has come a long way since it was open-sourced in 2016.

Why does Mydbops recommend ClickHouse for Analytics?

  • ClickHouse is a columnar store built for SORT/SEARCH query performance on very large volumes of data.
  • In columnar database systems, values from different columns are stored separately and data from the same column is stored together, which benefits the performance of analytical queries (ORDER BY / GROUP BY and aggregation SQLs).
  • Columnar stores are best suited for analytics because they can retrieve just the needed columns instead of reading all of the rows and filtering out unneeded data, which makes data access faster.

Need for Backup and Restore:

  • As part of a DBA’s responsibilities, we have to back up the data regularly for safety reasons.
  • If the database crashes or some fatal error happens, a backup is the only way to restore the data and reduce the loss to a minimum.

There are multiple ways of taking backups, but they all have their own shortcomings. We will discuss the two methods below and how to perform backup and restoration with each of them.

  • ClickHouse client
  • clickhouse-backup tool

Method 1 (Using ClickHouse Client):

ClickHouse Client is a simple way to back up the data and restore it in ClickHouse without any additional tooling. We are going to take backups of the metadata and the data separately here.

Metadata Backup:

In this example, I am taking a dump of the structure of the table “test_table” from the database “testing1” in the TabSeparatedRaw format. This format is only appropriate for outputting a query result, not for parsing (retrieving data to insert into a table); i.e., rows are written without escaping.

[root@mydbopslabs202]# clickhouse-client --query="SHOW CREATE TABLE testing1.test_table" --format=TabSeparatedRaw > metadata_dump.tsv

[root@mydbopslabs202]# head -n 1 metadata_dump.tsv
 
CREATE TABLE testing1.test_table (`id` UInt64, `duration` Float64, `url` String, `created` DateTime) ENGINE = MergeTree() PRIMARY KEY id ORDER BY id SETTINGS index_granularity = 8192

Metadata Restore:

I have created the database named “testing1” and will now restore the metadata backup taken earlier.

mydbopslabs202 :) create database testing1

CREATE DATABASE testing1

Ok.

0 rows in set. Elapsed: 0.002 sec. 

mydbopslabs202 :) \q

Bye.

Restoring backup :

[root@mydbopslabs202]# clickhouse-client -d testing1 < metadata_dump.tsv

Metadata Validation :

Here is a comparison of the table structure from the dump file and from the restored table:


Table structure of dump file :

mydbopslabs202 :) show create table testing1.test_table

SHOW CREATE TABLE testing1.test_table

┌─statement────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE testing1.test_table (`id` UInt64, `duration` Float64, `url` String, `created` DateTime) ENGINE = MergeTree() PRIMARY KEY id ORDER BY id SETTINGS index_granularity = 8192 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.001 sec. 

Table Structure from the restored dump file :

mydbopslabs202 :) show create table testing1.test_table

SHOW CREATE TABLE testing1.test_table

┌─statement───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE testing1.test_table (`id` UInt64, `duration` Float64, `url` String, `created` DateTime) ENGINE = MergeTree() PRIMARY KEY id ORDER BY id SETTINGS index_granularity = 8192 │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.001 sec. 

mydbopslabs202 :) \q
Bye.

Data Backup :

Before taking the dump of the data, let us validate the count of records that are going to be backed up:


mydbopslabs202 :) select count(*) from test_table;

SELECT count(*)
FROM test_table

┌─count()─┐
│ 1681983 │
└─────────┘

1 rows in set. Elapsed: 0.009 sec. 

Alternative method to check the record count :

[root@mydbopslabs202 vagrant]# clickhouse-client --query "SELECT count(*) FROM testing1.test_table";

1681983

Here I’m taking the dump of the table “test_table” in the TabSeparated (TSV) format. In this format, data is written row by row; each row contains the values separated by tabs. Values are written in text format, without enclosing quotation marks, and with special characters escaped.

clickhouse-client --query="SELECT * FROM testing1.test_table FORMAT TabSeparated" > test_table.tsv

Data Restore :

We need to ensure the database and the table (metadata) are created first, and the table structure must match that of the source table. Once the metadata is in place, the data dump can be restored:

clickhouse-client -d testing1 --query='insert into test_table format TabSeparated' < test_table.tsv

Once the data dump is restored, I cross-checked the count of records restored in the database against the dump file:

[root@mydbopslabs202 vagrant]# clickhouse-client --query "SELECT count(*) FROM testing1.test_table";

1681983

We can write a small program to automate the metadata and data backup of each table, and the recovery procedure has to be scripted too.
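A minimal sketch of such an automation script is shown below. It only assumes `clickhouse-client` is on the PATH; the `backup_table` function, the `DRY_RUN` switch, and the destination file layout are hypothetical names chosen for illustration, not part of any ClickHouse tooling.

```shell
#!/bin/sh
# Sketch: automate the per-table metadata + data dumps shown above.
# backup_table, DRY_RUN, and the file naming are illustrative assumptions.

backup_table() {
    db="$1"; table="$2"; dest="$3"
    # Metadata dump (table DDL), same as the manual step earlier:
    meta="clickhouse-client --query=\"SHOW CREATE TABLE $db.$table\" --format=TabSeparatedRaw > $dest/${table}_schema.tsv"
    # Data dump in TabSeparated format:
    data="clickhouse-client --query=\"SELECT * FROM $db.$table FORMAT TabSeparated\" > $dest/${table}_data.tsv"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n%s\n' "$meta" "$data"   # just print what would run
    else
        eval "$meta" && eval "$data"
    fi
}

# Back up every table of a database (needs a running server, hence commented):
# for t in $(clickhouse-client --query="SHOW TABLES FROM testing1"); do
#     backup_table testing1 "$t" /backups
# done

DRY_RUN=1 backup_table testing1 test_table /backups
```

With `DRY_RUN=1` the script only prints the two commands it would run, which makes it easy to review the generated statements before pointing it at a live server.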

Method 2 (Clickhouse-backup):

clickhouse-backup is a tool for easy backup and restore with S3 (AWS) and GCS support. It is an open-source tool available on GitHub.

Features

  • Supports full and incremental backups.
  • Supports AWS, GCS, and Alibaba Cloud object stores.
  • Easy configuration with environment variables.
  • Supports backup administrative tasks like list, delete, and download.

Run the clickhouse-backup tool as the root user or the clickhouse user.

GLOBAL OPTIONS:

--config FILE, -c FILE   Config FILE name (default: "/etc/clickhouse-backup/config.yml")
--help, -h               show help
--version, -v            print the version

Default Backup Path :

Backups are stored by default under /var/lib/clickhouse/backup/ (the config file itself defaults to /etc/clickhouse-backup/config.yml, as shown above).

Important Note :

We shouldn’t change file permissions for the default path /var/lib/clickhouse/backup/, as this path contains hard links. Changing the permissions or ownership of a hard-linked file changes it in the live ClickHouse data directory too, which can lead to data corruption.
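The hard-link pitfall can be demonstrated with throwaway files (nothing below touches ClickHouse data; `stat -c` assumes GNU coreutils on Linux):

```shell
# Two hard-linked names share one inode, so chmod on either affects both.
tmp=$(mktemp -d)
echo "data" > "$tmp/original"
ln "$tmp/original" "$tmp/backup_link"   # hard link: two names, one inode
chmod 600 "$tmp/backup_link"            # change "only" the backup copy...
stat -c '%a' "$tmp/original"            # prints 600 - the original changed too
# (clean up with: rm -rf "$tmp")
```

This is exactly why loosening permissions inside /var/lib/clickhouse/backup/ silently alters the live data files the backup is hard-linked to.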

Config File :

All options can be overridden via environment variables:

general:
  remote_storage: s3           # REMOTE_STORAGE
  disable_progress_bar: false  # DISABLE_PROGRESS_BAR
  backups_to_keep_local: 0     # BACKUPS_TO_KEEP_LOCAL
  backups_to_keep_remote: 0    # BACKUPS_TO_KEEP_REMOTE

clickhouse:
  username: default            # CLICKHOUSE_USERNAME
  password: ""                 # CLICKHOUSE_PASSWORD
  host: localhost              # CLICKHOUSE_HOST
  port: 9000                   # CLICKHOUSE_PORT
  timeout: 5m                  # CLICKHOUSE_TIMEOUT
  data_path: ""                # CLICKHOUSE_DATA_PATH
  skip_tables:                 # CLICKHOUSE_SKIP_TABLES
    - system.*
  freeze_by_part: false        # CLICKHOUSE_FREEZE_BY_PART

s3:
  access_key: ""                   # S3_ACCESS_KEY
  secret_key: ""                   # S3_SECRET_KEY
  bucket: ""                       # S3_BUCKET
  endpoint: ""                     # S3_ENDPOINT
  region: us-east-1                # S3_REGION
  acl: private                     # S3_ACL
  force_path_style: false          # S3_FORCE_PATH_STYLE
  path: ""                         # S3_PATH
  disable_ssl: false               # S3_DISABLE_SSL
  part_size: 104857600             # S3_PART_SIZE
  compression_level: 1             # S3_COMPRESSION_LEVEL
  # supports 'tar', 'lz4', 'bzip2', 'gzip', 'sz', 'xz'
  compression_format: gzip         # S3_COMPRESSION_FORMAT
  # empty (default), AES256, or aws:kms
  sse: AES256                      # S3_SSE
  disable_cert_verification: false # S3_DISABLE_CERT_VERIFICATION

gcs:
  credentials_file: ""         # GCS_CREDENTIALS_FILE
  credentials_json: ""         # GCS_CREDENTIALS_JSON
  bucket: ""                   # GCS_BUCKET
  path: ""                     # GCS_PATH
  compression_level: 1         # GCS_COMPRESSION_LEVEL
  compression_format: gzip     # GCS_COMPRESSION_FORMAT

cos:
  url: ""                      # COS_URL
  timeout: 2m                  # COS_TIMEOUT
  secret_id: ""                # COS_SECRET_ID
  secret_key: ""               # COS_SECRET_KEY
  path: ""                     # COS_PATH
  compression_format: gzip     # COS_COMPRESSION_FORMAT
  compression_level: 1         # COS_COMPRESSION_LEVEL
  debug: false                 # COS_DEBUG
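Since each option maps to the environment variable shown in the comments above, a one-off run can override config.yml without editing it. The bucket name, region, and backup name below are placeholders:

```shell
# Override config.yml values for a single invocation (placeholder values).
REMOTE_STORAGE=s3 S3_BUCKET=my-backup-bucket S3_REGION=us-east-1 \
    ./clickhouse-backup create s3_dump

# Then push the local backup to the configured remote storage:
S3_BUCKET=my-backup-bucket ./clickhouse-backup upload s3_dump
```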
1. Backup the data using the tool :

From the backup tool, I have used the option “create” to create a new backup.

[root@mydbopslabs202 clickhouse-backup]# ./clickhouse-backup create -t testing1.test_table data_dump

By default, while creating a backup, the tool creates two folders under the backup directory: metadata and shadow.

The metadata directory holds the metadata file, i.e. the table structure.

The shadow directory holds the data files.

2020/05/26 04:49:14 Create backup 'data_dump'
2020/05/26 04:49:14 Freeze `testing1`.`test_table`
2020/05/26 04:49:14 Copy metadata
2020/05/26 04:49:14   Done.
2020/05/26 04:49:14 Move shadow
2020/05/26 04:49:14   Done.

By default, the dump is stored under /var/lib/clickhouse/backup/.

[root@mydbopslabs202 vagrant]# cd /var/lib/clickhouse/backup/
[root@mydbopslabs202 backup]# ll
total 0
drwxr-xr-x. 4 root root 36 May 26 04:49 data_dump
            
                     DUMP FILE :
[root@mydbopslabs202 backup]# cd data_dump/
[root@mydbopslabs202 data_dump]# ll
total 0
drwxr-xr-x. 3 root root 18 May 26 04:49 metadata
drwxr-xr-x. 2 root root 6 May 26 04:49 shadow

                     METADATA DIR :
[root@mydbopslabs202 metadata]# ll
total 1
-rw-r-----. 1 clickhouse clickhouse 39 Jun  7 13:51 test_table.sql
             
                      SHADOW DIR :
[root@mydbopslabs202 data_dump]# cd shadow/
[root@mydbopslabs202 shadow]# ll
total 0
drwxr-xr-x. 3 root root 18 Jun  7 17:26 testing1

[root@mydbopslabs202 shadow]# cd testing1/
[root@mydbopslabs202 testing1]# ll
total 0
drwxr-xr-x. 3 root root 23 Jun  7 17:26 test_table
[root@mydbopslabs202 testing1]# cd test_table/
[root@mydbopslabs202 test_table]# ll
total 0
drwxr-xr-x. 2 root root 222 Jun  7 17:26 all_1_4_1
[root@mydbopslabs202 test_table]# cd all_1_4_1/
[root@mydbopslabs202 all_1_4_1]# ll
total 48
-r--r-----. 2 clickhouse clickhouse 379 Jun  7 13:55 checksums.txt
-r--r-----. 2 clickhouse clickhouse 100 Jun  7 13:55 columns.txt
-r--r-----. 2 clickhouse clickhouse   1 Jun  7 13:55 count.txt
-r--r-----. 2 clickhouse clickhouse  43 Jun  7 13:55 created.bin
-r--r-----. 2 clickhouse clickhouse  48 Jun  7 13:55 created.mrk2
-r--r-----. 2 clickhouse clickhouse  54 Jun  7 13:55 duration.bin
-r--r-----. 2 clickhouse clickhouse  48 Jun  7 13:55 duration.mrk2
-r--r-----. 2 clickhouse clickhouse  48 Jun  7 13:55 id.bin
-r--r-----. 2 clickhouse clickhouse  48 Jun  7 13:55 id.mrk2
-r--r-----. 2 clickhouse clickhouse  16 Jun  7 13:55 primary.idx
-r--r-----. 2 clickhouse clickhouse  65 Jun  7 13:55 url.bin
-r--r-----. 2 clickhouse clickhouse  48 Jun  7 13:55 url.mrk2

[root@mydbopslabs202 testing]# cat test_table.sql 

ATTACH TABLE test_table
(
    `id` UInt64, 
    `duration` Float64, 
    `url` String, 
    `created` DateTime
)
ENGINE = MergeTree()
PRIMARY KEY id
ORDER BY id
SETTINGS index_granularity = 8192
2. List the backups :

We can check the list of backups using the “list” option of the backup tool. It shows each dump with its creation date and time.

[root@mydbopslabs202 clickhouse-backup]# ./clickhouse-backup list

Local backups:

- 'data_dump' (created at 26-05-2020 04:49:14)
3. Restore the dump file using the clickhouse-backup tool :

“restore” is the option to restore data from a dump file into the ClickHouse server.

While restoring, the tool first restores the metadata (the structure of the table) from the dump file in the metadata directory. Once the metadata is restored, it prepares the data by restoring the data files present in the shadow directory. Finally, it runs ALTER TABLE … ATTACH PART; in short, it adds the data to the table from the detached directory.

[root@mydbopslabs202 clickhouse-backup]# ./clickhouse-backup restore data_dump

2020/05/26 06:40:51 Create table `testing1`.`test_table`
2020/05/26 06:40:51 Prepare data for restoring `testing1`.`test_table`
2020/05/26 06:40:51 ALTER TABLE `testing1`.`test_table` ATTACH PART 'all_1_4_1'
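For intuition, the same three steps can be sketched by hand. This is only a sketch: the paths follow the default layout shown earlier, the part name is taken from the log above, and a running server plus the backed-up files (and correct clickhouse:clickhouse ownership on the copied part) are assumed.

```shell
# Rough manual equivalent of what clickhouse-backup restore does (sketch;
# assumes the default paths shown earlier and a running ClickHouse server).

# 1. Recreate the table from the backed-up DDL:
clickhouse-client -d testing1 \
    < /var/lib/clickhouse/backup/data_dump/metadata/testing1/test_table.sql

# 2. Place the frozen part into the table's detached directory
#    (fix ownership afterwards: chown -R clickhouse:clickhouse ...):
cp -r /var/lib/clickhouse/backup/data_dump/shadow/testing1/test_table/all_1_4_1 \
      /var/lib/clickhouse/data/testing1/test_table/detached/

# 3. Attach the part to the table:
clickhouse-client --query="ALTER TABLE testing1.test_table ATTACH PART 'all_1_4_1'"
```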

Validating the restored data :

mydbopslabs202 :) show create table `testing1`.`test_table`

SHOW CREATE TABLE testing1.test_table

┌─statement───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ CREATE TABLE testing1.test_table (`id` UInt64, `duration` Float64, `url` String, `created` DateTime) ENGINE = MergeTree() PRIMARY KEY id ORDER BY id SETTINGS index_granularity = 8192 │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

1 rows in set. Elapsed: 0.001 sec. 


mydbopslabs202 :) select count(*) from test_table;

SELECT count(*)
FROM testing1.test_table

┌─count()─┐
│ 1681983 │
└─────────┘

1 rows in set. Elapsed: 0.003 sec. 

A notable strength of the backup tool is that it separates the table structure from the data files: as mentioned earlier, the structure is kept in the metadata directory and the data files in the shadow directory under the backup directory.

The tool also has some limitations: the backup size on remote storage is limited to a maximum of 5 TB, and only the MergeTree family of table engines is supported.

These are two simple ways to back up and restore data in a ClickHouse server. We can choose the backup method based on our requirements and the size of the data. ClickHouse-Copier is another way to take a backup; we will discuss ClickHouse further in upcoming posts.
