The Day Our MongoDB Went Down

ABOUT YOU TECH

Disclaimer: This post deals with some of the issues we encountered using and debugging MongoDB instances. We still use MongoDB and we still feel it’s a great solution if what you’re looking for is a flexible document database with a great user base and good support.

The checkout backend team handles the complete order lifecycle — from items being placed in the basket, through securing payment for an order, to the processing of refunds weeks later (with a couple of bits in between). To facilitate this, we rely on a variety of data stores — in this post I would like to focus on our MongoDB and how, in the midst of the Christmas shopping surge, it nearly brought us down for a week.

We use a number of databases to work with Order and Customer data. Since we are firm believers in using the best tool for any given job, we evaluate each one based on the characteristics of the data being stored and the advantages that the technology brings to the table. With that in mind, we ended up using MongoDB to store information regarding the processing state of an order and all of its dependencies (emails, reporting, etc.). MongoDB fits the job perfectly, as none of these documents have any hard relation to each other — once we generate the data for an email delivery, for example, that particular document becomes part of an entirely different process, completely disconnected from the Order processing.

We’ve been using MongoDB for a few years now, starting with TokuMX (still on 2.0) which worked great and with no real major issues. Eventually, we decided to take the plunge and move from the then venerable TokuMX to MongoDB 3, to handle some of the growing pains of our data storage (automated failover and easier data archiving were the big selling points for us). After lengthy data migrations and a fair share of poking and prodding, we were the proud owners of a shiny and faster MongoDB 3 cluster running PerconaFT under the hood.

At the end of the migration, in our setup we had two nodes in high availability rotation and one node as a read-slave.

Everything went smoothly and we were happy with the results. Until one fateful day, a few weeks later.

Day 1

(07–12)

It all started with a count. During a round of debugging, a developer needed to figure out how many documents we had in a collection. After running a count() query followed by a find(), we encountered something puzzling: the result of the count() differed from the actual number of documents the find() returned.

Weird. After some research we discovered that this can happen on collections with a high document throughput (as ours had), and the recommended approach is to trigger a reIndex() on the given collection. After re-indexing, however, the results got a lot stranger: the count() now came back negative.

Yes, that is indeed a negative value.
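
For reference, the diagnosis above boils down to a handful of shell commands. This is only a sketch; the collection name is a placeholder, not our actual schema.

```js
// Index/metadata-based count vs. the number of documents a cursor actually returns.
// On write-heavy collections the two can drift apart.
db.orderProcessing.count()            // the suspicious number
db.orderProcessing.find().itcount()   // walks the cursor and counts real documents

// The commonly recommended remedy: rebuild all indexes on the collection.
db.orderProcessing.reIndex()
```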

Everything else was still running smoothly, and after a superficial investigation the issue seemed to be an inexplicable idiosyncrasy without real-world impact. Something we fully intended to look into the following day.

The first real indication that MongoDB was not a happy bunny came a couple of hours later, at 10pm, when the sending of transactional mails slowed down to a crawl (our queue length alerts were doing overtime). Transactional emails include the password reset and order confirmation mails, so delays in processing these have a measurable and unacceptable impact on the quality of service to our customers.

Our next step was to deactivate all non-essential services of the Checkout to reduce the overall load on the database. We kept only the processes that ensured the delivery of critical emails and the actual ordering, so that our customers would still get their products with no delays. Everything else was stopped. This reduced the load just enough to get the customer-impacting processes back into shape. With that handled, we jumped to the next problem: the load was still way too high and we couldn't explain the cause.

Examining the increase in load throughout that day, it seemed that performance had been slowly worsening since the reIndex() we triggered that afternoon. The slow queries hitting the database also all seemed to target that same collection. In the middle of that analysis, our then-primary node died.

We were surprised but not quite alarmed: that's what keepalived is for, after all. The second MongoDB node kicked into rotation and we saw no change in behaviour. Checking the collection confirmed that MongoDB itself was not happy with the differences in the number of documents in that collection.

But we had some more hints: whatever was happening, it was related to the indexes in that collection. We proceeded to look into the indexes and their usage in the application, and triggered a reIndex() on that collection once more, which seemed to help and immediately reduced the load on the database. Sadly, however, the load crept up again within minutes, leaving us exactly where we were before.
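
We have not reproduced the exact commands from that night, but the health checks were along these lines (collection name is a placeholder; the $indexStats stage requires MongoDB 3.2 or later):

```js
// Full structural validation of the collection and its indexes.
db.orderProcessing.validate(true)

// Basic collection statistics: document count, storage size, index sizes.
db.orderProcessing.stats()

// Which indexes the application is actually using (MongoDB 3.2+).
db.orderProcessing.aggregate([{ $indexStats: {} }])
```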

Then the second database node died. At that point we were surprised and alarmed. As we went deeper into the night, we got one of the nodes back up and running, and after a lot of cursing and ranting, copying the data out of the offending collection into a new one and re-creating the original indexes helped. Since the whole cluster was out of sync by then, we decided to leave our brave (and lone) database node running on its own and kept the non-essential processes stopped for the time being. The load was still higher than usual, but back within acceptable ranges.
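
Copying a collection and rebuilding its indexes can be done in several ways; roughly, it looked like the sketch below. Collection and field names are placeholders, and on a large collection you would batch the copy rather than run a single forEach.

```js
// Copy every document into a fresh collection.
db.orderProcessing.find().forEach(function (doc) {
    db.orderProcessingNew.insert(doc);
});

// Re-create the original indexes on the new collection
// (the real definitions came from db.orderProcessing.getIndexes()).
db.orderProcessingNew.createIndex({ orderId: 1 }, { unique: true });

// Swap the collections once the copy has been verified.
db.orderProcessing.renameCollection("orderProcessingBroken");
db.orderProcessingNew.renameCollection("orderProcessing");
```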

By then, it was early morning. The battle had been long and wearying, so we all went home for a couple of hours of sleep. We would re-group later that morning.

Day 2

(08–12)

The node was still stable and running, but our non-essential processes were still halted, so we started evaluating alternatives. We got in contact with Percona, and together with their excellent support team we worked through our setup and our issues.

The support team informed us that other users had reported failures similar to ours when using the PerconaFT engine. The recommended course of action was to migrate to a different storage engine: WiredTiger. Moving the huge amount of data was challenging: because we were changing engines, we couldn't take the usual course of action of dumping the data, restoring it on the new node and letting it sync to catch up. We needed to start with a completely cold WiredTiger node and let it sync over all of the data.

The good news was that once the WiredTiger node was fully synced, we could 1) dump and restore the data on more WiredTiger nodes and 2) keep the old PerconaFT node in sync, just in case things went south.

Our Operations team sprang into action and delivered a fully provisioned MongoDB node running WiredTiger for us to use, so once the sync was started the only thing we could do was wait.
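
Mechanically, the switch amounts to starting a fresh mongod with an empty data directory and --storageEngine wiredTiger, then adding it to the replica set so that an initial sync copies everything over. A rough sketch, with a placeholder hostname:

```js
// Add the cold WiredTiger node to the existing replica set; the initial sync
// copies all data and rebuilds all indexes from scratch.
rs.add("mongodb-wt-01.internal:27017")

// Follow the sync progress via the member states (STARTUP2 -> SECONDARY).
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  " + m.optimeDate);
});
```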

Our biggest priorities at the time were to get back to a healthy cluster with a high-availability setup and to restore full functionality to our backend processes. We made some attempts to restore functionality, but we had no way of doing so without degrading the service to our customers, so all non-urgent services had to remain disabled for the time being.

Day 3

(09–12)

The third day started with good news: the WiredTiger node was fully synced and we were ready to switch over to the new, (hopefully) stable node. However, our reporting picked up some errors on the production APIs. It appeared that some Mongo documents had not made it through the whole ordeal intact.

Examining the data revealed that the initial load spikes on the cluster had had a broader impact on the health of our data than initially envisioned. It appears that as the performance of the MongoDB collection worsened, the syncing of data across the database nodes was affected, which led to some documents being broken or missing altogether.

Out of the three original PerconaFT nodes, two had document fragments. These were updates to existing documents that appeared to have been inserted as new documents instead of being applied to an existing one. This presumably happened because the sync lagged behind and the original documents were never inserted before the update query was applied. The most alarming part was that these document fragments were (theoretically) impossible to insert manually, as they were all missing an indexed unique attribute, which is something that is really not supposed to happen[1]. This, of course, also meant that other documents could have been out of date entirely, as updates might have been dropped as well.

This opened up the door to an alarming possibility: constraints were not applied to synced data and the operational order of the sync was not guaranteed. This meant that the PerconaFT nodes could theoretically quietly go out of date and sprinkle broken documents across collections, especially during high load events.
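
Because the fragments were all missing the indexed unique attribute, they were at least easy to find. A minimal sketch, with placeholder names for the collection and the unique field:

```js
// Documents that should be impossible under the unique index:
// they lack the indexed attribute entirely.
db.orderProcessing.find({ orderId: { $exists: false } }).count()

// Inspect a few of them to confirm they are partial updates rather than real documents.
db.orderProcessing.find({ orderId: { $exists: false } }).limit(5).pretty()
```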

Even worse, our new and shiny WiredTiger node had replicated data that was already broken, and attempting to fetch the affected documents resulted in the WiredTiger node simply segfaulting away.

We did have one node that we knew held the last good state of the data we needed: our master node from the first day had died and we had not brought it back up since, so it was still lying around, broken and defeated. At that point it became clear that we had to salvage the broken documents from the initial master and somehow get the data over to the WiredTiger node.

[1]: https://docs.mongodb.com/manual/core/index-unique/#unique-index-and-missing-field

Day 4

(10–12)

In the night between the third and fourth days, we developed a solution that allowed us to compare and manually move data from one instance to the other. After carefully examining all of the broken data, we were able to establish that the affected documents all fell within a specific range, which meant that we could easily identify them.

Our solution identified the broken documents based on key markers in the data they contained and synced the differences between the initial MongoDB master node and the current PerconaFT node. After some testing, we deployed it and ensured that a) all the fragments were gone and b) all the documents that were previously broken or missing were now present and correct. We double-checked the data against the database we use to store the order and customer data.
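
The salvage tool itself was application code, but the core idea can be sketched in a few lines of shell: connect to both the old master and the node currently serving traffic, and copy over every document in the affected range that is missing or incomplete. Hostnames, database, collection and field names, and the range values below are all placeholders.

```js
// Last-known-good node (the master that died on day 1) and the current node.
var goodNode    = new Mongo("mongodb-old-master.internal:27017").getDB("checkout");
var currentNode = new Mongo("mongodb-current.internal:27017").getDB("checkout");

// Affected range (placeholder values; the real range came from the analysis above).
var rangeStart = ISODate("2016-12-07T00:00:00Z");
var rangeEnd   = ISODate("2016-12-08T00:00:00Z");

goodNode.orderProcessing.find({ createdAt: { $gte: rangeStart, $lte: rangeEnd } })
    .forEach(function (goodDoc) {
        var currentDoc = currentNode.orderProcessing.findOne({ _id: goodDoc._id });
        // Missing entirely, or a fragment without the unique attribute: restore it.
        if (currentDoc === null || currentDoc.orderId === undefined) {
            currentNode.orderProcessing.save(goodDoc);   // upsert by _id
        }
    });
```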

We were finally up to date. The only thing missing: moving off the PerconaFT engine and onto the WiredTiger node. The “good” thing about the issue that allowed document fragments to be synced over to the new node was that it now also synced our clean-up over. We ensured that the WiredTiger node was completely up to date and that all the fragments were gone, and finally pointed our applications to the new engine.

Thankfully, it all worked like a charm. The node performed a lot better, load decreased drastically and we seemed to be on the road to recovery.

At this point, it was late Saturday morning. Our non-critical processes remained stopped, as we couldn't yet say with 100% certainty that WiredTiger would perform better than its predecessor. We all went to rest for a few hours, while still keeping an eye on the cluster to ensure nothing would self-combust once again.

Days 5, 6 and 7

(10–12 / 11–12 / 12–12)

During the next few days, we focused on getting all of our processes up and running, getting back to a high availability setup and taking some pre-emptive measures to reduce the potential load on the database.

We carefully re-started our internal processing while keeping a close eye on the performance. The processing backlog was big, but within a few hours, it was completely cleared up. By day 7 we were fully operational once again.

We started by removing our dependency on the original collection that caused the issues in the MongoDB cluster, replacing it with a “simpler” MySQL database. This reduced our flexibility, but had the advantage of taking pressure off the Mongo database. This change was tested and deployed by day 6.

Finally, we prepared, deployed and synced over some more WiredTiger nodes to rebuild our high-availability setup. At the end of the 7th day, we had two more WiredTiger nodes in the cluster, happily syncing away. In addition, the Percona support team kindly supplied us with a solution to check data consistency across all of our nodes, and we have had no further occurrences of fragmented documents being generated via sync or any other means.
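
We can't share Percona's tooling here, but a generic way to spot divergence between replica set members is to compare the output of the dbHash command on every node: identical collections produce identical hashes. A minimal sketch with placeholder hostnames and database name:

```js
// Compute per-collection hashes on each node and print them for comparison.
var nodes = [
    "mongodb-wt-01.internal:27017",
    "mongodb-wt-02.internal:27017",
    "mongodb-wt-03.internal:27017"
];

nodes.forEach(function (host) {
    var result = new Mongo(host).getDB("checkout").runCommand({ dbHash: 1 });
    print(host + "  md5=" + result.md5);
    printjson(result.collections);   // hash per collection
});
```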

Aftermath

Since then, we've put a structure in place that allows us to react better to similar issues. As we brought up the new machines, we scaled each of the new nodes up considerably on RAM, giving our cluster some more space to breathe (as recommended by the Percona support team at the time).

Additionally, we introduced a delayed replica set member into the cluster: a replica node that lags behind the master by an hour. This would have been a great help during the incident, acting as a sort of pre-warmed backup node that can be used, for example, to go back to a state where data had not yet been lost or altered. At the time, we also kept a “stone-cold” MongoDB node with no data as a last resort, should a catastrophic event occur.
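
In replica set terms, this is a hidden, priority-0 member with a configured delay. A sketch of how such a member is added on the MongoDB 3.x line (hostname is a placeholder; the field is called slaveDelay in these versions):

```js
// Add a member that applies the oplog one hour behind the primary and
// never serves reads or becomes primary itself.
var cfg = rs.conf();
cfg.members.push({
    _id:        cfg.members.length,   // next free member id (adjust if ids were reused)
    host:       "mongodb-delayed-01.internal:27017",
    priority:   0,       // never eligible for election
    hidden:     true,    // invisible to client reads
    slaveDelay: 3600     // delay in seconds (one hour)
});
rs.reconfig(cfg);
```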

We’ve also improved our backup strategies to ensure that they properly reflect the importance and rates of change of the data.

Finally, as mentioned before, we also moved some of the functionality that used to rest on MongoDB’s shoulders to other databases. As solid as Mongo has been over the last few years, this event served as a wake-up call, letting us know that it was time to redistribute and re-think some of our data structures to move them to other platforms, with different, better fitting characteristics.

Learnings

Sync is not a Backup

After the issues we encountered, this feels obvious. However, it is a very important learning. We had frequent backups of the databases, but they were not as frequent as they could have been, because we always felt we could rely on the synced nodes (there were three of them) to contain data in a “good” state.

The problems we encountered demonstrate quite clearly that syncing is indiscriminate: it will carry over bad states of the data just as happily as it will carry anything else. Only sufficiently frequent backups can minimize the risks here.

Data Segregation is useful

Segregating our data, or simply not having it all in a single database and platform, has some disadvantages. More systems to maintain and more knowledge to retain are definite drawbacks of keeping your data spread across two or more different systems. However, having a second data store unaffected by the outage allowed us to safely reconstruct the data in MongoDB.

Our MySQL databases hold the order and customer data that requires more structure and less flexibility. They contain a huge amount of data, and applying changes to that data and its structure comes with a few challenges of its own. MySQL does, however, offer an excellent and battle-tested solution for maintaining consistency at all costs.

About the Author: Pedro Nogueira works at ABOUT YOU as Tech Lead in our Checkout Unit.