In our work, We used to get a lot of requirements for replicating data from one data source to another. Our team provided solutions to replicate data from MySQL to Vertica, Amazon Redshift, Hadoop. Out of which Amazon Redshift replication is a bit complicated as Amazon Redshift is a Database as a service (DBaaS) and the process is not straightforward.
So, I take this opportunity to guide on how to replicate the specific set of tables from MySQL to AWS Redshift using Tungsten replicator.
1.0. Tungsten Replicator:
Tungsten Replicator is an open source replication engine supports data extract from MySQL, MySQL Variants such as RDS, Percona Server, MariaDB and Oracle and allows the data extracted to be applied on other data sources such as Vertica, Cassandra, Redshift etc.
Tungsten Replicator includes support for parallel replication, and advanced topologies such as fan-in and multi-master, and can be used efficiently in cross-site deployments.
1.1.0. General Architecture:
There are three major components in tungsten replicator
1. Extractor / Master Service
2. Transaction History Log (THL)
3. Applier / Slave Service
1.1.1. Extractor / Master Service:
The extractor component reads data from MySQL’s binary log and writes that information into the Transaction History Log (THL).
1.1.2. Transaction History Log (THL):
The Transaction History Log (THL) acts as a translator between two different data sources. It stores transactional data from different data servers in a universal format using the replicator service acting as a master, That could then be processed Applier / Slave service.
1.1.3. Applier / Slave Service:
All the raw row-data recorded on the THL logs is re-assembled or constructed into another format such as JSON or BSON, or external CSV formats that enable the data to be loaded in bulk batches into a variety of different targets.
Therefore Statement information is not supported for heterogeneous deployments. So It’s mandatory that Binary log format on MySQL is ROW.
1.2.0. Pre Requisites:
1.2.1. Server Packages:
- JDK 7 or higher
- Ant 1.8 or higher
- Ruby
- Net-SSH
1.2.2. MySQL:
- All the tables to be replicated must have a primary key.
- Following MySQL configuration should be enabled on MySQL
binlog-format = row
binlog-row-image = full
collation-server = utf8_general_ci
character-set-server = utf8
1.2.3. Redshift:
- Database name, Schema_name should be same as MySQL Database name of the tables to be replicated.
1.2.4. S3 Bucket:
- Read & write access to an AWS S3 Bucket. (Access key, Secret key is required)
2.0. Requirement:
- Consider the servers with below details are used for Demo.
AWS EC2 MySQL Server – 172.19.12.234
AWS Redshift – 172.19.12.116 (Database as a Service)
AWS S3 bucket – s3://mydbops-migration
As Redshift is a database as a service, We just have an endpoint to connect. Therefore We will be installing both the tungsten Master / Slave service on the MySQL server itself.
- We would need to replicate the tables emp, emp_records from new_year database on the MySQL server to Redshift. Structures of the table are given below.
CREATE TABLE `emp` ( `no` int(11) NOT NULL, `city` varchar(50) DEFAULT NULL, `state` varchar(50) DEFAULT NULL, PRIMARY KEY (`no`) ) ENGINE=InnoDB; CREATE TABLE `emp_records` ( `no` int(11) NOT NULL, `name` varchar(50) DEFAULT NULL, `address` varchar(50) DEFAULT NULL, PRIMARY KEY (`no`) ) ENGINE=InnoDB;
3.0. Implementation:
The implementation consists of following steps.
- Installation / Building tungsten from source
- Preparing equivalent schema for Redshift
- Configuring Master service
- Configuring Slave service
- Generating worker tables (temp tables used by tungsten) for replication to be created on redshift
- Start the replication
3.1. Installation / Building From Source:
- Download the source package from the GIT.
#git clone https://github.com/continuent/tungsten-replicator.git
- Compile this package it will generate the tungsten-replicator.tar file.
#sh tungsten-replicator/builder/build.sh #mkdir -p tungsten
- Once the tar file is generated extract the file to the folder created named tungsten and remove the old tungsten replicator package.
#tar --strip-components 1 -zxvf tungsten-replicator/builder/build/tungsten-replicator-5.2.1.tar.gz -C tungsten/
- Now we have got tungsten binaries, Clean up source packages unless required.
#rm -rf tungsten-replicator
3.2. Preparing equivalent schema for Redshift:
- Create database new_year on Redshift.
dev=# create database new_year; CREATE DATABASE
- The new database was created. Now I am going to create a new schema.
- Before creating schema first you have to switch to new_year database.
dev=# \c new_year psql (9.2.24, server 8.0.2)
- Then create tables in new_year schema.
new_year=# create table new_year.emp(no int primary key, city varchar(50),state varchar(50)); CREATE TABLE new_year=# create table new_year.emp_records(no int primary key, name varchar(50),address varchar(50) ); CREATE TABLE
Note:
- If you do not mention schema name while creating the table, the table will create inside the public schema.
- To check tables are created inside the correct new_year schema.
new_year=# \dt new_year.*; List of relationsList of relations schema | name | type | owner ----------+--------------+-------+----------- new_year | emp | table | redshift-usr new_year | emp_records | table | redshift-usr (2 rows)
3.3. Configuring Master Service:
- Create a replication user on MySQL with Replication Slave privilege to stream binlog from MySQL to Tungsten Master service.
mysql> grant replication slave on *.* to 'tungsten'@'localhost' identified by 'secret';
- Switch to tungsten directory and Reset the defaults configuration file.
#cd ~/tungsten #./tools/tpm configure defaults --reset
- Configure the Master service on the directory of your choice, We have used /opt/master
- Following commands will prepare the configuration file for Master service.
#./tools/tpm configure master \ --install-directory=/opt/master \ --enable-heterogeneous-service=true \ --members=mysql-db-master \ --master=mysql-db-master
#./tools/tpm configure master --hosts=mysql-db-master \ --replication-user=tungsten \ --replication-password=tungsten \ --skip-validation-check=MySQLUnsupportedDataTypesCheck \ --property=replicator.filter.pkey.addColumnsToDeletes=true \ --property=replicator.filter.pkey.addPkeyToInserts=true
- Once the configuration is prepared, Then we can install it using tpm.
#./tools/tpm install Configuration is now complete. For further information, please consult Tungsten documentation, which is available at docs.continuent.com. NOTE >> Command successfully completed
- Now Master service will be configured under /opt/master/
- Start the tungsten Master service.
#/opt/master/tungsten/cluster-home/bin/startall Starting Tungsten Replicator Service... Waiting for Tungsten Replicator Service. running: PID:22291
- Verify it’s working by checking the master status.
#/opt/master/tungsten/tungsten-replicator/bin/trepctl services Processing services command... NAME VALUE ---- ----- appliedLastSeqno: 0 appliedLatency : 1.667 role : master serviceName : master serviceType : local started : true state : ONLINE Finished services command...
#/opt/master/tungsten/tungsten-replicator/bin/trepctl status Processing status command... NAME VALUE ---- ----- appliedLastEventId : mysql-bin.000011:0000000000000510;-1 appliedLastSeqno : 0 appliedLatency : 1.667 autoRecoveryEnabled : false autoRecoveryTotal : 0 channels : 1 clusterName : master currentEventId : mysql-bin.000011:0000000000000510 currentTimeMillis : 1525355498784 dataServerHost : mysql-db-master extensions : host : mysql-db-master latestEpochNumber : 0 masterConnectUri : thl://localhost:/ masterListenUri : thl://mysql-db-master:2112/ maximumStoredSeqNo : 0 minimumStoredSeqNo : 0 offlineRequests : NONE pendingError : NONE pendingErrorCode : NONE pendingErrorEventId : NONE pendingErrorSeqno : -1 pendingExceptionMessage: NONE pipelineSource : jdbc:mysql:thin://mysql-db-master:3306/tungsten_master?noPrepStmtCache=true relativeLatency : 21.784 resourcePrecedence : 99 rmiPort : 10000 role : master seqnoType : java.lang.Long serviceName : master serviceType : local simpleServiceName : master siteName : default sourceId : mysql-db-master state : ONLINE timeInStateSeconds : 21.219 timezone : GMT transitioningTo : uptimeSeconds : 21.741 useSSLConnection : false version : Tungsten Replicator 5.2.1 Finished status command...
- If the master did not start properly refer to this (/opt/master/service_logs/trepsvc.log) error log.
3.4. Configuring Slave service:
- Switch to tungsten directory and Reset the defaults configuration file.
#cd ~/tungsten #./tools/tpm configure defaults --reset
- Create JSON file with name s3-config-slave.json in the format below, Fill in your AWS S3 Bucket details like Access key, Secrect key, S3 bucket path.
{ "awsS3Path" : "s3://mydbops-migration", "awsAccessKey" : "XXXXXX", "awsSecretKey" : "YYYYYYY", "gzipS3Files" : "false", "cleanUpS3Files" : "true" }
- Configure the Slave service on the directory of your choice, We have used /opt/slave
- Following commands will prepare the configuration file for Slave service.
#./tools/tpm configure slave \ --install-directory=/opt/slave \ --enable-heterogeneous-service=true \ --members=mysql-db-master
- Add the replication filter to only replicate those two tables. Use Redshift host, user, password to configure the slave service.
#./tools/tpm configure slave --hosts=mysql-db-master \ --replication-host=172.19.12.116 \ --replication-user=redshift-usr \ --replication-password='redshift-pass' --datasource-type=redshift \ --batch-enabled=true \ --batch-load-template=redshift \ --redshift-dbname=new_year \ --svc-applier-filters=dropstatementdata,replicate \ --property=replicator.filter.replicate.do=new_year.emp,new_year.emp_records \ --svc-applier-block-commit-interval=10s \ --svc-applier-block-commit-size=5 \ --rmi-port=10002 \ --thl-port=2113 \ --master-thl-port=2112 \ --master-thl-host=mysql-db-master
- Once the configuration is prepared, Then we can install it using tpm.
#./tools/tpm install Configuration is now complete. For further information, please consult Tungsten documentation, which is available at docs.continuent.com. NOTE >> Command successfully completed Once it complete copy the s3-config-slave.json file to slave (share) directory.
#cp s3-config-slave.json /opt/slave/share/
- Now the slave is configured, Before starting we need to create worker/stage table used by tungsten to replicate data on Redshift.
3.5. Generating Worker / Stage tables To Be Created On Redshift:
- Tungsten provides a utility named ddlscan to generate the Worker / Stage tables required for the replication functionality to work.
#/opt/slave/tungsten/tungsten-replicator/bin/ddlscan -db new_year -template ddl-mysql-redshift-staging.vm > staging_ddl
- Apply the schema generated from the above operation on Redshift.
- Now we have Worker / Stage tables created on redshift.
new_year=# \dt new_year.*; List of relations schema | name | type | owner ----------+-----------------------+-------+----------- new_year | emp | table | redshift-usr new_year | emp_records | table | redshift-usr new_year | stage_xxx_emp | table | redshift-usr new_year | stage_xxx_emp_pkey | table | redshift-usr new_year | stage_xxx_emp_records | table | redshift-usr new_year | stage_xxx_emp_pkey | table | redshift-usr (6 rows)
3.6. Starting Replication:
- Once the slave is configured and the stage tables are created in Redshift, then start the slave
#/opt/slave/tungsten/cluster-home/bin/startall Starting Tungsten Replicator Service... Waiting for Tungsten Replicator Service. running: PID:23968
- Verify it’s working by checking the slave status.
#/opt/slave/tungsten/tungsten-replicator/bin/trepctl services NAME VALUE ---- ----- appliedLastSeqno: -1 appliedLatency : -1.0 role : slave serviceName : slave serviceType : local started : true state : ONLINE Finished services command...
# /opt/slave/tungsten/tungsten-replicator/bin/trepctl status Processing status command... NAME VALUE ---- ----- appliedLastEventId : mysql-bin.000011:0000000000000510;-1 appliedLastSeqno : 0 appliedLatency : 251.018 autoRecoveryEnabled : false autoRecoveryTotal : 0 channels : 1 clusterName : slave currentEventId : NONE currentTimeMillis : 1525355728202 dataServerHost : 172.19.12.116 extensions : host : 172.19.12.116 latestEpochNumber : 0 masterConnectUri : thl://mysql-db-master:2112/ masterListenUri : null maximumStoredSeqNo : 0 minimumStoredSeqNo : 0 offlineRequests : NONE pendingError : NONE pendingErrorCode : NONE pendingErrorEventId : NONE pendingErrorSeqno : -1 pendingExceptionMessage: NONE pipelineSource : thl://mysql-db-master:2112/ relativeLatency : 251.202 resourcePrecedence : 99 rmiPort : 10002 role : slave seqnoType : java.lang.Long serviceName : slave serviceType : local simpleServiceName : slave siteName : default sourceId : 172.19.12.116 state : ONLINE timeInStateSeconds : 12.558 timezone : GMT transitioningTo : uptimeSeconds : 24.407 useSSLConnection : false version : Tungsten Replicator 5.2.1 Finished status command...
- If the slave did not start properly refer to this (/opt/slave/service_logs/trepsvc.log) error log.
4.0. Testing:
- Now both master & slave are in sync. Now I am going to insert a few record on MySQL server in emp and emp_records table.
insert into emp values(1,'chennai','tamilnadu'); insert into emp values (2,'Banglore','Karnataka'); insert into emp_records values(1,'suresh','Noth car street'); insert into emp_records values(2,'John','South car street');
- Above these records are inserted in the master server. At the same I have checked redshift these records are replicated or not.
new_year=# select * from new_year.emp; no | city | state ----+-----------+---------- 1 | chennai | tamilnadu 2 | Banglore | Karnataka (2 rows)
new_year=# select * from new_year.emp_records; no | name | address ----+----------+--------- 1 | suresh | Noth car street 2 | John | South car street (2 rows)
5.0. Troubleshooting:
Replication can be broken due to incorrect data types. During such scenarios, We should analyze the issue and fix the datatype and resume replication.
Sample Error :
# /opt/slave/tungsten/tungsten-replicator/bin/trepctl status Processing status command... NAME VALUE ---- ----- appliedLastEventId : NONE appliedLastSeqno : -1 appliedLatency : -1.0 autoRecoveryEnabled : false autoRecoveryTotal : 0 channels : -1 clusterName : slave currentEventId : NONE currentTimeMillis : 1526577299571 dataServerHost : 172.25.12.119 extensions : host : 172.25.12.119 latestEpochNumber : -1 masterConnectUri : thl://mysql-db-master:2112/ masterListenUri : null maximumStoredSeqNo : -1 minimumStoredSeqNo : -1 offlineRequests : NONE pendingError : Stage task failed: stage=q-to-dbms seqno=75 fragno=0 pendingErrorCode : NONE pendingErrorEventId : mysql-bin.000027:0000000000072461;-1 pendingErrorSeqno : 75 pendingExceptionMessage: CSV loading failed: schema=new table=doc_content CSV file=/opt/slave/tmp/staging/slave/staging0/yp-yp_user_doc_content-69.csv message=Wrapped org.postgresql.util.PSQLException: ERROR: Value too long for character type Detail: ----------------------------------------------- error: Value too long for character type code: 8001 context: Value too long for type character varying(256) query: 1566568 location: funcs_string.hpp:395 process: query0_75_1566568 [pid=10475] ----------------------------------------------- (/opt/slave/tungsten/tungsten-replicator/appliers/batch/redshift.js#256) pipelineSource : UNKNOWN relativeLatency : -1.0
This error info explains that value is too long for character data type for table doc_content on new database in Redshift.
- In MySQL, the table doc_content consists of column “context” with TEXT data type.
- Even in Redshift, context is a TEXT column.
- Here the catch, In Redshift, the text datatype is equivalent to varchar(256).
- So writing anything above 256 on MySQL will break replication.
So the solution is to increase the varchar length from 255 to varchar of 1000. In Redshift changing the datatype will not work.
yp=# alter table new.doc_content ALTER COLUMN content TYPE varchar(2000);
ERROR: ALTER COLUMN TYPE is not supported
- We can’t increase the column size in Redshift without recreating the table.
- The alternate solution is to add a new column with the required changes and move the data and then the old column can be dropped.
ALTER TABLE yp.yp_user_doc_content ADD COLUMN content_new VARCHAR(2000); UPDATE yp.yp_user_doc_content SET content_new = content; ALTER TABLE yp.yp_user_doc_content DROP COLUMN content; ALTER TABLE yp.yp_user_doc_content RENAME COLUMN content_new TO content;
- Now we’re good to restart the replication again.
6.0. Conclusion:
Tungsten replicator is a great tool when it comes to replication of data with heterogeneous data sources. If we understand it’s working, It’s easy to configure and operate.
One thought on “MySQL to Amazon Redshift Replication.”