Today’s global system outage finally gave us something to work on in these quiet few weeks of summer vacation.
Ironically, Microsoft itself has published so much guidance on avoiding a single point of failure, implementing robust testing and effective rollback/roll-forward mechanisms, designing for graceful degradation, diversifying critical infrastructure, and the list goes on.
As an Architect, I find it an apt problem to preach about and a perfect example of the anti-patterns that emerge, and of what can go wrong, when we are not careful with even simple system designs. I wanted to share some thoughts on what we should avoid to prevent similar issues in any IT system or landscape.
Don't Put All Your Eggs in One Basket
The first and foremost principle is to avoid a single point of failure. Relying too much on one vendor, service or solution is always risky. It's like putting all your eggs in one basket. If that basket falls, you're in big trouble. We need to mix things up and have backup plans.
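To make "having a backup plan" a little more concrete, here is a minimal Python sketch of failing over from a primary provider to a secondary one. The provider functions and the call_with_failover helper are hypothetical stand-ins, not tied to any real vendor SDK.

```python
# Minimal sketch: fail over from a primary provider to a secondary one.
# The provider callables below are hypothetical stand-ins for real SDK calls.

def call_with_failover(primary, secondary, request):
    """Try the primary provider first; fall back to the secondary on failure."""
    try:
        return primary(request)
    except Exception as error:  # real code would catch the provider's specific errors
        print(f"Primary failed ({error}); falling back to secondary")
        return secondary(request)

def primary_provider(request):
    raise ConnectionError("primary region is down")

def secondary_provider(request):
    return f"handled '{request}' via secondary provider"

if __name__ == "__main__":
    print(call_with_failover(primary_provider, secondary_provider, "get-user-profile"))
```

The point is not the ten lines of code; it is that the fallback path exists, is exercised, and is cheap to invoke when the basket drops.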
Test, Test, and Test Again
We have all heard the saying, "Measure twice, cut once." In IT it's more like "test a hundred times, deploy once." We can't just roll out updates and hope for the best. We need to test thoroughly in a safe, production-like environment first.
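One practical shape this takes is a staged (canary) rollout: push the update to a small group first, and only promote it to the whole fleet if the canaries stay healthy. The sketch below is illustrative; the host names and health check are made up.

```python
# Sketch of a staged (canary) rollout: deploy to a small group first,
# verify health, and only then promote to the rest of the fleet.
# Hosts and the health check are illustrative placeholders.

CANARY_HOSTS = ["host-01", "host-02"]
FLEET_HOSTS = [f"host-{i:02d}" for i in range(3, 51)]

def deploy(host, version):
    print(f"deploying {version} to {host}")

def is_healthy(host):
    # Real checks would probe service endpoints, error rates, crash counts, etc.
    return True

def staged_rollout(version):
    for host in CANARY_HOSTS:
        deploy(host, version)
    if not all(is_healthy(h) for h in CANARY_HOSTS):
        print("Canary group unhealthy; halting rollout")
        return False
    for host in FLEET_HOSTS:
        deploy(host, version)
    return True

if __name__ == "__main__":
    staged_rollout("agent-2.07")
```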
Have an "Undo" Button
Sometimes, things go wrong no matter how careful we are. That's why we need a way to undo changes quickly. It's like having a time machine for our systems. If we can't roll back or roll forward easily, small problems can soon turn into big headaches.
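One simple way to build that "undo button" is to always remember the last known-good version and make switching back to it a one-step operation. This is only a sketch of the idea, not any particular vendor's mechanism.

```python
# Sketch of a simple rollback: remember the last known-good version
# so a bad release can be reverted in one step. Purely illustrative.

class Deployment:
    def __init__(self, initial_version):
        self.current = initial_version
        self.last_known_good = initial_version

    def release(self, new_version):
        self.last_known_good = self.current
        self.current = new_version
        print(f"released {new_version}")

    def rollback(self):
        print(f"rolling back {self.current} -> {self.last_known_good}")
        self.current = self.last_known_good

if __name__ == "__main__":
    d = Deployment("1.0.0")
    d.release("1.1.0")   # new release goes out
    d.rollback()         # something broke; revert to the known-good version
    print(f"now running {d.current}")
```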
Keep the Lines of Communication Open
When things go south, we need to be able to talk to everyone affected. It's not just about fixing the problem, it's about keeping people in the loop. We should have multiple ways to reach out and give updates.
Plan for the Worst
Our systems should be like cats - able to land on their feet. Even if part of the system fails, the rest should keep working. It's about being prepared for the worst while hoping for the best.
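Graceful degradation usually means serving a reduced but still useful response when a dependency is down, instead of failing outright. Here is a hedged sketch that falls back to cached or default data; the recommendation service and cache names are hypothetical.

```python
# Sketch of graceful degradation: if the live dependency fails,
# serve stale cached data (clearly flagged) instead of an error.
# The recommendation service and cache here are hypothetical.

CACHE = {"recommendations": ["default-item-1", "default-item-2"]}

def fetch_live_recommendations(user_id):
    raise TimeoutError("recommendation service unavailable")

def get_recommendations(user_id):
    try:
        return {"source": "live", "items": fetch_live_recommendations(user_id)}
    except Exception:
        # Degrade gracefully: the page still renders, just with stale/default data.
        return {"source": "cached", "items": CACHE["recommendations"]}

if __name__ == "__main__":
    print(get_recommendations("user-42"))
```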
Know Your Weak Spots
We should regularly check our technology supply chain. Who and what third-party systems, services, and tools are we depending on? What could go wrong? It's like doing a health check-up but for our IT systems.
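A lightweight version of that health check-up can be a script that probes each third-party dependency you rely on and reports which ones are unreachable. The endpoints below are placeholders for illustration, not real URLs to monitor.

```python
# Sketch of a dependency health check: probe each third-party endpoint
# we depend on and report what is unreachable. Endpoints are placeholders.

import urllib.request

DEPENDENCIES = {
    "identity-provider": "https://idp.example.com/health",
    "payment-gateway": "https://payments.example.com/health",
    "edr-vendor-api": "https://edr.example.com/health",
}

def check(url, timeout=5):
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        return False

if __name__ == "__main__":
    for name, url in DEPENDENCIES.items():
        status = "OK" if check(url) else "UNREACHABLE"
        print(f"{name:20s} {status}")
```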
Change with Care
Rushing changes is asking for trouble, especially in production. We need a solid process for making updates. Think of it like air traffic control for our systems - everything needs to be cleared before it takes off.
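In pipeline terms, "cleared before it takes off" often translates into a pre-deployment gate: the change only proceeds if tests pass, an approval exists, and it falls inside an agreed change window. The individual checks below are illustrative placeholders, not a prescribed process.

```python
# Sketch of a pre-deployment gate: a change only proceeds when every
# check passes. The individual checks are illustrative placeholders.

from datetime import datetime, timezone

def tests_passed(change):
    return change.get("tests") == "green"

def has_approval(change):
    return bool(change.get("approved_by"))

def in_change_window(now=None):
    now = now or datetime.now(timezone.utc)
    return 2 <= now.hour < 5  # e.g. only deploy between 02:00 and 05:00 UTC

def cleared_for_takeoff(change):
    checks = {
        "tests": tests_passed(change),
        "approval": has_approval(change),
        "window": in_change_window(),
    }
    for name, ok in checks.items():
        print(f"check {name}: {'pass' if ok else 'FAIL'}")
    return all(checks.values())

if __name__ == "__main__":
    change = {"id": "CHG-1234", "tests": "green", "approved_by": "cab"}
    print("proceed" if cleared_for_takeoff(change) else "blocked")
```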
Don't Put All Your Faith in One System
Using the same operating system or platform for everything is convenient, but risky. It's good to mix things up a bit. That way, if one system has issues, not everything goes down.
In the end, it's all about being prepared and thinking ahead. For me, the CrowdStrike incident is not a surprise and it's more of a wake-up call for all of us in IT. We need to learn from this to build stronger, more reliable systems that can weather any storm.