MongoDB Aggregation Framework an Overview

          The MongoDB Aggregation Framework is a way to query the data from MongoDB. It helps us to break the complex logics into a simple set of sequential operations. Just like a waterfall model, Output from one stage is fed as an input for the next stage until the desired result is achieved.

Each stage of transformation during this process is termed as pipelines.

Mongodb_Aggregation

Fig.1.1. Sample workflow on The Aggregation Framework.

How it works ?

Syntax:

db.<Collection Name>.aggregate(<Aggregation Operation>)

Aggregation operation should be defined as an array, Each array value is a stage.

( [ {Stage 1}, {Stage 2}, {Stage 3} ] )

It will pass the set of existing documents or it will generate the computed result based on aggregation condition that result should be passed (piped) to the next stage.

Sample Query:

db.mydbops.aggregate( [

{$match : {"AUTHOR" : "SAKTHI"}},

{ $project : {AUTHOR : 1 ,BLOG : 1 ,TAGS : 1}},

{$group: { "_id" : "$BLOG", THOUGHTS: { $push: "$TAGS" }}},

{ "$skip": 1 },

{ "$limit": 1 }

] );

Guidelines:

    • Query for each stage has to be optimised for optimal performance.
    • You may require only a subset of data from a collection to perform an aggregation operation. Therefore, use the $match, $limit, and $skip stages to filter the documents.
    • Using $match operation in the start stages of the aggregation will be helpful in certain cases as it scans only the documents is satisfied the given condition and pass only the matching documents in a collection to the next stage.
    • Ensure data set size is filtered during the initial stages than the later stages to improve the performance.

The most common stages in aggregation framework:

$project − Pass documents to specific field or newly computed field to the next stage

$match − Reduce the number of documents that is passed to the next stage

$group − Group document by a distinct key, the key can also be a computed key

$sort − Returns the input documents in a sorted order

$skip − It is possible to skip the particular amount of documents for a given amount of documents.

$limit − Limit the number of documents for the next stage

$unwind − This is used to unwind document that are using arrays. When using an array, the data is kind of pre-joined and this operation will be undone with this to have individual documents again. Thus with this stage we will increase the amount of documents for the next stage.

Explanation With Use Case:

Consider a collection named mydbops having the following documents with information containing Blogging information such as Blog name, Tag information, Author name etc.

Sample Document:

{
"_id"      : ObjectId("5c7b83fdfe01e9de2e7d4935"),
"BLOG"     : "MySQL INNODB ARCHITECTURE",
"AUTHOR"   : "SAKTHI",
"POSTED"   : ISODate("2019-03-03T07:36:29.879Z"),
"DATA"     : "{
. . . . . . . . . . . . . . . . . . . . . . . .
}",
"VIEWERS"  : 10,
"TAGS"     : [ "MySQL", "InnoDB" ],
"comments" : [ { "AUTHOR" : "KARTHIK", "TEXT" : "NICE THOUGHT!!!!" }, { "AUTHOR" : "SELVA", "TEXT" : "GOOD!!!!" } ]
}

We can run the aggregation queries against the above set of documents. Let us discuss about how the aggregation works on step by step.

Requirement:

  • Need to find the total number of hits for the blogs posted during this year so far, for each Author, With tag MySQL in order of highest hits (viewers) first.

Aggregate query:

db.mydbops.aggregate([
{ $match : { TAGS : "MySQL" , 'POSTED' : { '$gte' : ISODate("2018-12-31T23:59:59") }} },
{ $project : {AUTHOR : 1 ,VIEWERS :1}},
{"$group" : {_id:"$AUTHOR", HITS:{ $sum:"$VIEWERS"}}},
{$sort: {HITS: -1}}])

Stages Explained:

STAGE 1:

{ $match : {TAGS: "MySQL", 'POSTED': {'$gte': ISODate("2018-12-31T23:59:59")}} }

In Stage 1, the $match operator scan’s what are all  the documents satisfies the above condition only that documents should pass to the next stage

STAGE 2:

{ $project: {AUTHOR: 1, VIEWERS:1} }

In Stage 2, What are all the fields mentioned in the $project operator only that fields should pass to the next stage

STAGE 3:

{ "$group": {_id:"$AUTHOR", HITS: {$sum: "$VIEWERS"}} }

In Stage 3, $group operator group the documents by some specified expression and outputs to the next stage a document for each distinct grouping. The output documents contain an _id field which contains the distinct group by key.

STAGE 4:

{ $sort: {HITS: -1} }

In Stage 4, $sort operator takes a document that specifies the field(s) to sort by and the respective sort order. 1 to specify ascending order and -1 to specify descending order.

Aggregation flow Diagram:

Aggregatio_new19

Production Problems:

    • Using $match in initial stage is a best approach, But if the document size is huge, Then further stages will suffer and a lot of CPU cycles will be wasted.
    • In case of huge document size, It’s advisable to have required columns projected first than later.
    • Aggregate supports $lookup which is comparative to JOINS in RDBMS, Should be used only when necessary.
    • Not all complex logics can be fixed by Aggregate. Try to optimise the document structure by analysing the Frequency of operation and query pattern.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s