Replication is not just a word; it is a technology that rules the data world. In a database, many components contribute to creating and managing replication. In this blog, I will walk you through the anatomy of a MongoDB replica set.
Types of Replication MongoDB Supports
MongoDB has supported two types of replication:
- Master-Slave Replication (removed in 4.0)
- Replica Sets
Master-Slave replication is similar to Replica Sets, but it is not capable of electing another DB node as master when the current master node goes down. It does, however, have its own set of features that replica sets do not support:
- Support filtered replication.
- Unlimited number of slave nodes.
In most other respects, a replica set is a superset of Master-Slave replication.
The most important difference between a master-slave cluster and a replica set is that a replica set does much of the administration for us: it promotes a slave to master automatically and makes sure we do not run into an inconsistent state. For developers, if they specify a replica set in the DB driver configuration, the driver automatically detects and handles failover when the current master dies. These features led people to choose replica sets over Master-Slave.
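As a small illustration of "specifying a replica set in the driver configuration", the sketch below builds a connection string that lists several seed members and names the set through the `replicaSet` option. The host names and the set name `rs0` are hypothetical, and the helper `replica_set_uri` is only an illustration, not part of any driver.

```python
from urllib.parse import urlencode


def replica_set_uri(hosts, replica_set, **options):
    """Build a MongoDB connection string that lists several seed members."""
    params = {"replicaSet": replica_set, **options}
    return "mongodb://" + ",".join(hosts) + "/?" + urlencode(params)


# Hypothetical hosts and set name, for illustration only.
uri = replica_set_uri(
    ["db1.example.net:27017", "db2.example.net:27017", "db3.example.net:27017"],
    "rs0",
)
print(uri)
```

A driver given such a string can discover the current primary from any reachable seed member, which is what lets it follow a failover without a configuration change.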
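On scaling: a MongoDB replica set can have a maximum of 50 members, of which only 7 can vote in elections. MongoDB uses a quorum-based voting protocol, so we always recommend an odd number of nodes in a replica set. Electing a primary requires a majority of the voting members, and adding one node to an odd-sized set raises the majority without letting the set survive any additional failure.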
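The arithmetic behind the odd-number recommendation can be sketched in a few lines. The function names below are my own; the limits of 50 members and 7 voters come from the paragraph above.

```python
MAX_MEMBERS = 50         # members allowed in one replica set
MAX_VOTING_MEMBERS = 7   # members allowed to vote in elections


def majority(voting_members: int) -> int:
    """Votes needed to elect a primary."""
    if not 1 <= voting_members <= MAX_VOTING_MEMBERS:
        raise ValueError("voting members must be between 1 and 7")
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """Voting members that can be lost while a majority remains."""
    return voting_members - majority(voting_members)


for n in range(1, MAX_VOTING_MEMBERS + 1):
    print(f"{n} voters: majority={majority(n)}, tolerates {fault_tolerance(n)} down")
```

Going from 3 to 4 voters raises the majority from 2 to 3 but still tolerates only one failure, so the extra node buys no extra availability.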
Internal Design of MongoDB Replica Set Communication
Replication in a MongoDB replica set is asynchronous. When a user writes data to the replica set, the write first goes to the primary server and is then synced to all the secondary servers of the set.
At a high level, we can simply say that the data syncs to the secondaries automatically. Here, though, I would like to discuss which MongoDB components help sync data from the primary to the secondaries, and how.
I would rather not describe the workflow as running prose; I believe step-by-step, algorithm-style writing makes it easier to follow. Let's begin.
Anatomy of Primary:
- The client writes data to a database on the primary node.
- Once the write completes, OpObserver, a component that runs on the MongoDB server, takes over and inserts a document describing the write into the local.oplog.rs collection.
- Once the document is inserted into local.oplog.rs, the secondaries tail the collection to pick it up.
- The Oplog is a capped collection, and it is the only collection whose documents do not include an _id field.
- When a single write performs multiple operations, each operation gets its own Oplog entry.
- Each operation in the Oplog is idempotent.
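To make the "capped" and "idempotent" properties concrete, here is a toy, in-memory model. It is not how the server stores the Oplog: the class `ToyOplog`, the entry shape (`ts`, `op`, `doc_id`, `set`) and the helper functions are all simplifications I made up for illustration. The one idea it borrows is that an increment is logged as the resulting value, so replaying the entry twice changes nothing.

```python
from collections import deque


class ToyOplog:
    """In-memory stand-in for local.oplog.rs: fixed capacity, oldest entries dropped first."""

    def __init__(self, max_entries):
        self.entries = deque(maxlen=max_entries)
        self.next_ts = 1

    def append(self, op, doc_id, new_values):
        self.entries.append({"ts": self.next_ts, "op": op, "doc_id": doc_id, "set": new_values})
        self.next_ts += 1


def primary_increment(data, oplog, doc_id, field, amount):
    """Apply an increment on the toy primary, then log it as an absolute value."""
    doc = data.setdefault(doc_id, {})
    doc[field] = doc.get(field, 0) + amount
    oplog.append("u", doc_id, {field: doc[field]})


def apply_entry(data, entry):
    """Replay one logged entry; replaying it again leaves the data unchanged."""
    data.setdefault(entry["doc_id"], {}).update(entry["set"])


primary, oplog = {}, ToyOplog(max_entries=3)
for _ in range(5):
    primary_increment(primary, oplog, "a", "views", 1)

secondary = {}
for entry in oplog.entries:
    apply_entry(secondary, entry)
    apply_entry(secondary, entry)  # applying the same entry twice is harmless

print(primary, secondary, [e["ts"] for e in oplog.entries])
```

Only the three newest entries survive in the capped log, and the toy secondary still converges to the primary's state even though every entry was applied twice.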
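The default Oplog size depends on the storage engine:
|Storage Engine|Default Oplog Size|Lower Bound|Upper Bound|
|---|---|---|---|
|In-Memory|5% of physical memory|50 MB|50 GB|
|WiredTiger|5% of free disk space|990 MB|50 GB|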
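The sizes listed above amount to "take 5% of the relevant resource, then clamp it between a lower and an upper bound". A minimal sketch of that calculation follows; the function name and the engine keys are my own labels, and I am assuming binary units (1 MB = 1024² bytes).

```python
MB = 1024 ** 2
GB = 1024 ** 3

# (lower bound, upper bound) per storage engine, taken from the values above.
OPLOG_BOUNDS = {
    "inMemory": (50 * MB, 50 * GB),     # basis: physical memory
    "wiredTiger": (990 * MB, 50 * GB),  # basis: free disk space
}


def default_oplog_size(engine: str, basis_bytes: int) -> int:
    """Return 5% of the basis, clamped to the engine's bounds, in bytes."""
    lower, upper = OPLOG_BOUNDS[engine]
    return min(max(basis_bytes * 5 // 100, lower), upper)


for free_disk in (10 * GB, 200 * GB, 2000 * GB):
    size = default_oplog_size("wiredTiger", free_disk)
    print(f"{free_disk // GB} GB free -> oplog {size / GB:.2f} GB")
```

With 10 GB free the lower bound of 990 MB applies, with 200 GB free the 5% rule gives 10 GB, and with 2000 GB free the 50 GB cap applies.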
Anatomy of Secondary:
- A server in the SECONDARY state continuously runs a BackgroundSync thread.
- BackgroundSync acts like a parent process for data synchronization on a secondary server.
- BackgroundSync sits in a while-true loop. In each iteration, it first tries to select a sync source.
- Once a sync source is selected, it starts up the OplogFetcher.
- The OplogFetcher then calls the Fetcher class to fetch data from a collection on the remote node.
- Finally, the Fetcher executes a find command on the sync source.
- The find command's query is a predicate on timestamp: it asks for entries whose timestamp is greater than or equal to that of the last Oplog entry the secondary has fetched.
- Because of this greater-than-or-equal predicate, the find command should return at least 1 document: the last entry the secondary already has.
- This query can produce three kinds of output: 1. a non-empty batch without the last fetched Oplog timestamp, 2. an empty batch, 3. a non-empty batch with the last fetched Oplog timestamp.
- When the find command returns a non-empty batch without the last fetched Oplog timestamp, the secondary's Oplog has diverged from the sync source, and the secondary should go into the ROLLBACK/RECOVERY state.
- When the find command returns an empty batch for the maximum of 3 retries, the sync source's Oplog is behind the secondary's, so the secondary needs to choose a new sync source. The BackgroundSync thread terminates the current OplogFetcher, restarts sync source selection, and creates a new OplogFetcher to carry on.
- When the find command returns a non-empty batch with the last fetched Oplog timestamp, the sync source is good and still ahead of the secondary. The OplogFetcher then uses long-polling, so each subsequent 'getMore' blocks until new data arrives or its maxTimeMS expires.
- When the OplogFetcher receives data, it does not apply it directly. It places the data in the OplogBuffer, from which another thread takes charge and applies it to disk.
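The three outcomes of the find command and the hand-off to the buffer can be sketched with plain Python lists. This is a toy model under my own assumptions: Oplogs are lists of `{"ts": ...}` dictionaries, the names `FetchResult`, `find_from`, `classify`, `sync_once` and `apply_buffer` are made up, and retries, long-polling and the actual rollback are left out.

```python
from collections import deque
from enum import Enum


class FetchResult(Enum):
    OK = "sync source is ahead and consistent"
    SOURCE_BEHIND = "empty batch: choose a new sync source"
    DIVERGED = "last fetched entry missing: rollback/recovery"


def find_from(source_oplog, last_fetched_ts):
    """Toy version of the find: entries with ts >= the last fetched timestamp."""
    return [e for e in source_oplog if e["ts"] >= last_fetched_ts]


def classify(batch, last_fetched_ts):
    """Map a returned batch to one of the three outcomes."""
    if not batch:
        return FetchResult.SOURCE_BEHIND
    if batch[0]["ts"] != last_fetched_ts:
        return FetchResult.DIVERGED
    return FetchResult.OK


def sync_once(source_oplog, secondary_oplog, buffer):
    """One fetch: on success, queue the new entries instead of applying them."""
    last_ts = secondary_oplog[-1]["ts"]
    batch = find_from(source_oplog, last_ts)
    result = classify(batch, last_ts)
    if result is FetchResult.OK:
        buffer.extend(batch[1:])  # skip the entry the secondary already has
    return result


def apply_buffer(buffer, secondary_oplog):
    """Stand-in for the separate applier thread that drains the buffer."""
    while buffer:
        secondary_oplog.append(buffer.popleft())


secondary = [{"ts": 1}, {"ts": 2}, {"ts": 3}]
buffer = deque()

healthy = [{"ts": t} for t in (1, 2, 3, 4, 5)]
print(sync_once(healthy, secondary, buffer).name)   # batch starts at ts 3
apply_buffer(buffer, secondary)
print([e["ts"] for e in secondary])

stale = [{"ts": t} for t in (1, 2, 3)]
print(sync_once(stale, secondary, buffer).name)     # nothing at ts >= 5

forked = [{"ts": t} for t in (1, 2, 3, 4, 6, 7)]
print(sync_once(forked, secondary, buffer).name)    # ts 5 is missing
```

Keeping the fetch and the apply separate, with a buffer between them, mirrors the point of the last step: fetching can keep running while another worker applies what has already arrived.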
I hope this blog gives you a basic understanding of how MongoDB continuously copies data from the primary to the secondaries. There are many more topics to cover before the strengths of the MongoDB replica set are fully clear. In the coming days, I will try to write about other components of the MongoDB replica set as well.