Sunday, September 8, 2024

In search of a cost-effective strategy for modernizing a monolith - Part 1

This week, I was asked to delve into a complex monolithic Java application that appears to be more than a decade old. With its tightly coupled modules and heavy reliance on a massive Oracle database, the application is a classic example of a system that has evolved into a tangled web of dependencies carrying years of tech debt.

The task is to modernize this behemoth without disrupting business operations or incurring high costs. On top of this, only one developer knows the application, and even that knowledge rests on a lot of assumptions. The architecture, characterized by tightly coupled modules, poses significant challenges in terms of scalability, maintainability, technology fit, and modernization effort.

As in my previous monolith-to-microservices journey, the first step is to understand the intricacies of the application. It's already evident that the modules are so interdependent that even minor changes could have ripple effects across the system. All of this also needs to happen under tight budget constraints, which is a unique challenge.

The main focus is to prioritize high-impact areas, and I have started identifying the domains: core, supporting, and generic. To identify these domains accurately, stakeholders and domain experts are being engaged. I suspect it will be several days before we can even think about starting on the decomposition strategy.

To be continued....


Monday, August 12, 2024

The Burden of Network on Architecture - Part 1 Network Bandwidth

In most organizations, IT infrastructure is a separate department that operates and manages the enterprise's IT environment. As Solution Architects, we are confined to our own landscape and seldom dive deep into the hardware and network services that directly impact our applications.

One of the critical factors is network bandwidth, which can make or break an architecture's performance. Bandwidth is not infinite and directly impacts the cost, speed, and performance of applications. It is essential to understand how much bandwidth is allocated to a given set of applications and which applications share the network. As network traffic increases, bandwidth utilization rises, clogging the applications.

At a client, I once encountered a situation in an on-premises environment where a massive 10 TB file transfer brought an entire network to its knees. The transfer was initiated without proper planning, and it quickly saturated the available bandwidth. As a result, critical business applications slowed down, and some systems even crashed due to timeout errors. Employees couldn't access essential resources, and customer-facing services experienced significant delays.

This incident taught us the importance of implementing robust traffic shaping and prioritization mechanisms. After that incident, the client's network team kept a bandwidth utilization alert in place. They also ensured that large data transfers were scheduled during off-peak hours and that critical services had guaranteed bandwidth allocations.

Perils of Holiday Code Freezes

The holiday code freeze is by now an age-old practice in which deployments are stopped, aimed at reducing the risk of new bugs or issues while a majority of the support staff is away. However, it also means the servers and infrastructure are left untouched for several weeks, and surprisingly, in my experience, this can be a bit chaotic.

Application and Database Leaks

First and foremost, the most common issue is the application memory leak. I recall one instance where an e-commerce application began to slow down significantly a week into the holiday break. The application retained references to objects it no longer needed, causing the heap to fill up gradually. As memory usage increased, the application became sluggish, eventually crashing with "out of memory" errors.
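To illustrate the pattern (with hypothetical class and field names, not the actual application code), here is a minimal Java sketch of how such a leak typically creeps in: a long-lived collection that keeps accumulating objects which are never released.

    import java.util.ArrayList;
    import java.util.List;

    public class OrderHistoryCache {

        // Long-lived static collection: every processed order stays reachable forever,
        // so the garbage collector can never reclaim it and the heap slowly fills up.
        private static final List<Order> PROCESSED_ORDERS = new ArrayList<>();

        public void process(Order order) {
            // ... business logic ...
            PROCESSED_ORDERS.add(order); // reference retained long after it is needed
        }
    }

    class Order { /* fields omitted */ }

A bounded or time-evicted cache, or simply not retaining the reference at all, avoids the gradual heap growth described above.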

Leaks can also happen when connecting to a database. In another example, the same e-commerce application began experiencing intermittent outages. The root cause was traced to a connection leak where the application was not releasing database connections after they were used. As the number of open connections grew, the database server eventually refused new connections, causing the application to crash.
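A minimal sketch of the usual fix, assuming a standard JDBC DataSource (names are illustrative): try-with-resources guarantees that connections are closed and returned to the pool even when an exception is thrown.

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import javax.sql.DataSource;

    public class OrderRepository {

        private final DataSource dataSource;

        public OrderRepository(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        // The connection, statement, and result set are all closed automatically,
        // preventing the pool exhaustion described above.
        public int countOrders() throws SQLException {
            try (Connection con = dataSource.getConnection();
                 PreparedStatement ps = con.prepareStatement("SELECT COUNT(*) FROM orders");
                 ResultSet rs = ps.executeQuery()) {
                rs.next();
                return rs.getInt(1);
            }
        }
    }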

Similarly, I have experienced code freeze situations where thread leaks were consuming system resources and slowing down the application. This typically happens when threads are created but never terminated.
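As a sketch (hypothetical names), the safer pattern is to submit work to a bounded, reusable pool that is shut down with the application, instead of spawning a new thread per task:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class ReportScheduler {

        // A bounded, reusable pool instead of "new Thread(task).start()" per request,
        // which creates threads that may block forever and are never reclaimed.
        private final ExecutorService pool = Executors.newFixedThreadPool(8);

        public void submitReport(Runnable reportTask) {
            pool.submit(reportTask);
        }

        public void shutdown() {
            pool.shutdown(); // lets in-flight tasks finish, then releases the threads
        }
    }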

Array index out-of-bounds errors

Another issue I encountered during a recent freeze was the array index out-of-bounds error. The application was a CMS, and downtime started mid-week when the application tried to access an array index that didn't exist. It happened because of unexpected input and data changes that the custom code had not accounted for.

Array index out-of-bounds exceptions can also be caused by data mismatches when interacting with external services or APIs that are not under the code freeze. Once, during a holiday season, a financial reporting application began throwing these exceptions. The root cause was traced back to an external data feed that had changed its format: the application expected a certain number of fields, but the feed had added more, causing the application to access non-existent indices. The errors took the application offline until a patch was deployed after the freeze.
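A minimal defensive-parsing sketch (the field positions and names are hypothetical): never assume the external feed carries a fixed number of fields.

    public class FeedParser {

        // Check the field count before indexing instead of letting an
        // ArrayIndexOutOfBoundsException take the application down.
        public String extractAccountId(String feedLine) {
            String[] fields = feedLine.split("\\|");
            int accountIdIndex = 3; // hypothetical position of the account id
            if (fields.length <= accountIdIndex) {
                return null; // log and skip, or route the record to a dead-letter queue
            }
            return fields[accountIdIndex];
        }
    }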

Cache Corruption

Cache corruption is another way heavily cache-dependent applications can be brought down. In online, real-time applications, caches improve performance, but on several occasions I have seen that, if not cleared, caches become corrupt over time, serving stale or incorrect data.
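One way to limit the damage is a time-to-live on every cache entry, so stale data cannot be served indefinitely. A minimal sketch with hypothetical names:

    import java.time.Duration;
    import java.time.Instant;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.function.Function;

    public class ProductCache {

        private static class Entry {
            final String value;
            final Instant loadedAt;
            Entry(String value, Instant loadedAt) { this.value = value; this.loadedAt = loadedAt; }
        }

        private final Map<String, Entry> cache = new ConcurrentHashMap<>();
        private final Duration ttl = Duration.ofMinutes(30);

        // Entries older than the TTL are treated as stale and reloaded,
        // so the cache cannot keep serving outdated or corrupt data forever.
        public String get(String key, Function<String, String> loader) {
            Entry entry = cache.get(key);
            if (entry == null || entry.loadedAt.plus(ttl).isBefore(Instant.now())) {
                String fresh = loader.apply(key);
                cache.put(key, new Entry(fresh, Instant.now()));
                return fresh;
            }
            return entry.value;
        }
    }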

Conclusion

While it's funny that IT stakeholders think a code freeze maintains stability, in most cases freezes expose underlying issues that aren't apparent during regular operations. The even funnier thing is that, a majority of the time, these issues are resolved by a simple server restart.

Wednesday, August 7, 2024

Extracting running data out of NRC/Nike+ (Nike Run Club) using APIs

For the past few weeks, I have been struggling to see my running kilometers get updated in the Nike+ app. It could be a bug or a weird feature of the app, and since it was demotivating, I decided to go ahead and build my own dashboard to calculate the results. Also, for some reason, Nike discontinued viewing and editing activities on the web.

I had about 8 years of data, and you never know when apps like this stop existing or become paid versions. It's always better to persist your data to a source you control and, if required, feed it into any other application. I also went ahead and uploaded my data to Under Armour's "MapMyFitness" app, which has much better open documentation.

It turns out that the NRC app captures a lot of additional information that is typically not shown in the mobile app. Some of it includes:

  1. Total steps during the workout, including a detailed split between intervals
  2. Weather details during the workout
  3. Amount of time the workout was paused for
  4. Location details, including latitude and longitude, that can help you plot your own map

Coming to the API part, I could not get hold of any official Nike documentation, but I came across an older write-up, https://gist.github.com/niw/858c1ecaef89858893681e46db63db66, which mentions a few API endpoints to fetch historic activities. I ended up creating a spring-boot version that fetches the activities and stores them in CSV format in my Google Drive.

The code can be downloaded here ->  https://github.com/shailendrabhatt/Nike-run-stats (currently unavailable)

The code also includes a Postman collection that can be used to fetch one's activities. Just update the {{access_token}} and run the GET requests.

While the write-up describing the API was good enough, here are a few tips that can be helpful:

  • Fetching the authorization token can be tricky, and it has an expiry time. You will need an account at https://www.nike.com/se/en/nrc-app and can then grab the token from the XHR request headers of requests going to api.nike.com. There are a few requests hitting this host, and the token can be fetched from any of them.
  • The API described in the link shows details for after_time; one can also fetch before_time information (a minimal fetch sketch follows this list):
/sport/v3/me/activities/after_time/${time}
/sport/v3/me/activities/before_time/${time} 
  • Pagination can easily be achieved using before_id and after_id. These ids come in different formats, ranging from GUIDs to single-digit numbers, which can be confusing.
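Putting the endpoints above together, here is a minimal Java sketch of the fetch, assuming the bearer-token scheme and the /sport/v3/me/activities endpoints behave as described in the gist (this is not an official API, so it may change or stop working):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class NrcActivityFetcher {

        private static final String BASE_URL = "https://api.nike.com";

        // Fetches activities recorded after a given epoch-millis timestamp
        // and returns the raw JSON payload (activity list plus paging ids).
        public String fetchActivitiesAfter(long epochMillis, String accessToken) throws Exception {
            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(BASE_URL + "/sport/v3/me/activities/after_time/" + epochMillis))
                    .header("Authorization", "Bearer " + accessToken)
                    .GET()
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        }
    }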

Saturday, August 3, 2024

Instilling the idea of Sustainability into Development Teams

Instilling green coding practices and patterns in a development team is a modern-day challenge. It can go a long way toward reducing an organization's carbon footprint and meeting its long-term sustainability goals.

Good green coding practices improve the quality of a software application and directly affect the energy efficiency of the infrastructure it runs on. However, developers in today's agile work environment seldom focus on anything beyond rapid solution building in ever-shorter sprint cycles. They have all the modern frameworks and libraries at their disposal, and writing energy-efficient code is not always the focus. Furthermore, modern data centers and cloud infrastructure give developers seemingly unlimited resources, resulting in high energy consumption and environmental impact.

Below are some of the factors that improve programming practices and can have a drastic impact on the Green Index.

a) Fundamental Programming Practices

Fundamental programming practices start with proper error and exception handling. They also include paying extra attention to the modularity and structure of the code and being prepared for unexpected deviations in behavior, especially when integrating with a different component or system.

b) Efficient Code Development

Efficient code development helps make the code more readable and maintainable. It includes avoiding memory leaks and high CPU cycles, and managing network and infrastructure storage proficiently. It also includes avoiding expensive calls and unnecessary loops, and eliminating unessential operations.
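A small illustration of the point about expensive calls and unnecessary loop work (the exchange-rate client is a hypothetical example): hoisting the call out of the loop means it runs once per invocation instead of once per item.

    import java.util.List;

    public class PriceCalculator {

        public double totalInEuros(List<Double> pricesInDollars, ExchangeRateClient client) {
            double rate = client.dollarToEuroRate(); // expensive remote call made once, not N times
            double total = 0.0;
            for (double price : pricesInDollars) {
                total += price * rate;
            }
            return total;
        }
    }

    interface ExchangeRateClient {
        double dollarToEuroRate();
    }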

c) Secure Programming Mindset

A secure programming mindset ensures that the software application has no weak security features or vulnerabilities. Secure programming also includes protecting data through encoding and encryption. Awareness of the OWASP vulnerability list and performing timely penetration testing assessments ensure the application code complies with the required level of security.

d) Avoidance of Complexity

Complex code is modified the least and is the hardest to follow. A piece of code may in the future be modified by several different developers, so avoiding complexity when writing it goes a long way toward keeping it maintainable. Reducing the cyclomatic complexity of methods by dividing the code and logic into smaller reusable components helps the code remain simple and easy to understand.
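A small, hypothetical sketch of what this looks like in practice: instead of one long method full of nested branches, each rule lives in a small, well-named method, keeping the cyclomatic complexity of every unit low.

    public class ShippingService {

        public double shippingCost(double orderTotal, double weightKg, boolean remoteDestination) {
            if (qualifiesForFreeShipping(orderTotal)) {
                return 0.0;
            }
            return baseRate(weightKg) + remoteAreaSurcharge(remoteDestination);
        }

        private boolean qualifiesForFreeShipping(double orderTotal) {
            return orderTotal > 100.0;
        }

        private double baseRate(double weightKg) {
            return weightKg * 1.5;
        }

        private double remoteAreaSurcharge(boolean remoteDestination) {
            return remoteDestination ? 10.0 : 0.0;
        }
    }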

e) Clean Architecture concepts

Understanding Clean Architecture concepts is essential for allowing change in a codebase. In a layered architecture, understanding the problems of tight coupling and weak cohesion promotes reusability, minimizes the disruption caused by changes, and avoids rewriting code and capabilities.

Conclusion

As architects and developers, it is essential to collect green metrics on a regular basis and evaluate the code's compliance and violations. These coding practices can be measured using various static code analysis tools, which can be integrated into the IDE, into the compilation or deployment pipeline, or run standalone.

With organizations across several industries now focusing on their own sustainability goals, green coding practices have become an integral part of every software developer's work. Little tweaks to our development approach can immensely reduce the environmental impact in the long run.

Wednesday, July 24, 2024

AI Aspirations but lacking the Automation Foundation

I am witnessing a growing need for clarity among IT teams regarding AI and automation. They see competitors touting AI initiatives and feel pressured to follow suit, often without grasping the fundamental differences between AI and automation. Everyone wants to implement AI, not realizing that they have yet to scratch the surface of basic automation.


At a recent client event, the management heads announced an AI workshop day and their plans to bring AI into the development process. However, as the workshop started, I observed a lack of technical know-how regarding AI. Even developers struggled to differentiate between rule-based automation and the more complex, adaptive nature of AI. This knowledge gap has led to unrealistic expectations and misaligned strategies.


Let me cite another example from a client and elaborate. A year back, business management pushed to implement an AI-driven customer service chatbot, which was the need of the hour, and went live with some cutting-edge services and technology. However, since its implementation, the chatbot did not see much traffic. As I tried to understand why, the reasons turned out to be several:


  1. Poor integration with existing systems like the CRM, customer service tools, and marketing automation. This meant the chatbot could not access or update customer information in real time; everything was done manually.
  2. It lacked typical customer interaction capabilities such as personalization, order tracking, appointment scheduling, and even efficient FAQs, because the underlying processes were not automated.
  3. It could not seamlessly hand off to a human agent.
  4. Finally, the bot engine lacked sufficient training and updates.

All of the above reasons are directly related to the lack of automation in various aspects of IT and business.


One initiative that tends to work is to begin by asking teams to map out their current automated processes. This exercise usually reveals significant gaps and helps shift the focus from AI to the automation steps that are actually needed.


As we read and learn from others, successful AI implementation is a journey, not a destination. It requires a solid foundation of automated processes, clean data, and a clear understanding of organizational goals. Until this reality is grasped, AI initiatives will continue to fall short of expectations.

Friday, July 19, 2024

Learnings from the Microsoft global outage due to the CrowdStrike incident — An Architect’s view

Today’s global system outage finally gave us some action to follow in these quiet few weeks of summer vacation.

Ironically, Microsoft themselves have published so much content on avoiding single points of failure, implementing robust testing and effective rollback/roll-forward mechanisms, designing for graceful degradation, diversifying critical infrastructure, and the list goes on.

As an Architect, it's an apt problem to reflect on and a perfect example for learning about anti-patterns and what can go wrong if we are not careful with even simple system designs. I wanted to share some thoughts on what we should avoid to prevent similar issues in any IT system or landscape.

Don't Put All Your Eggs in One Basket

The first and foremost principle is to avoid a single point of failure. Relying too much on one vendor, service or solution is always risky. It's like putting all your eggs in one basket. If that basket falls, you're in big trouble. We need to mix things up and have backup plans.

Test, Test, and Test Again

We have all heard the saying "measure twice, cut once"; in IT it's more like "test a hundred times, deploy once." We can't just roll out updates and hope for the best. We need to test thoroughly in a safe, production-like environment first.

Have an "Undo" Button

Sometimes, things go wrong no matter how careful we are. That's why we need a way to undo changes quickly. It's like having a time machine for our systems. If we can't roll back or roll forward easily, small problems can soon turn into big headaches.

Keep the Lines of Communication Open

When things go south, we need to be able to talk to everyone affected. It's not just about fixing the problem, it's about keeping people in the loop. We should have multiple ways to reach out and give updates.

Plan for the Worst

Our systems should be like cats - able to land on their feet. Even if part of the system fails, the rest should keep working. It's about being prepared for the worst while hoping for the best.

Know Your Weak Spots

We should regularly check our technology supply chain. Which third-party systems, services, and tools are we depending on? What could go wrong? It's like doing a health check-up, but for our IT systems.

Change with Care

Rushing changes is asking for trouble, especially in production. We need a solid process for making updates. Think of it like air traffic control for our systems - everything needs to be cleared before it takes off.

Don't Put All Your Faith in One System

Using the same operating system or platform for everything is convenient, but risky. It's good to mix things up a bit. That way, if one system has issues, not everything goes down.

In the end, it's all about being prepared and thinking ahead. For me, the CrowdStrike incident is not a surprise; it's more of a wake-up call for all of us in IT. We need to learn from this to build stronger, more reliable systems that can weather any storm.

Saturday, May 11, 2024

AWS Migration and Modernization Gameday Experience

I was at the AWS GameDay, and it was a fun learning experience partnering with colleagues and competing teams. AWS has made this concept quite different from a typical hackathon; it is more of a gamified activity in a much more stress-free environment.

The Migration and Modernization GameDay was a 6-hour activity with an hour's break for lunch (most of us ate at our desks). We were asked to migrate a 2-tier e-commerce application to AWS in a specific region, with all AWS services at our disposal. This particular GameDay was level 300 and required at least an associate certification, but I felt even non-experts with some hands-on AWS knowledge could contribute immensely to the team.

The first part of the day went into setting up the basic infrastructure on new VPCs and following guidelines to migrate the databases (using DMS) and the web servers (using the Application Migration Service). We followed the AWS documentation for the migration part.

The fun part of the GameDay came in the latter half of the session, post-lunch, when the basic migration was complete and we had to switch DNS from the on-premises to the cloud infrastructure. That’s when the application gets hit with real-world traffic, volumetric attacks, fault injections, etc. The better your application performed, the more points you got, and vice versa.


Here are some of the learnings for folks wanting to participate in the next Gameday.

a) Be thorough with the networking concepts in AWS. Outline your end-state network architecture and naming conventions to begin with. Since you will be working in the console, this helps avoid confusion.

b) Plan all the AWS services that would be the right fit for the requirements. Since it's a real-world scenario, extra points are awarded to teams that incorporate a range of AWS services.

E.g., CloudFront as the CDN, AWS WAF for the firewall, AWS GuardDuty for threat detection, AWS CloudWatch for monitoring, etc.

c) Ensure that the architecture follows the Well-Architected Framework pillars:

Operational Excellence, Security, Reliability, Performance Efficiency, Cost Optimization, and Sustainability.

d) Segregate tasks between team members and make sure to review each other's changes, especially the networking part, which is typically confusing.

e) Last but not least, if you are stuck on any step for long, get the application running first and then work on implementing the right principles.


Saturday, February 3, 2024

The Coevolution of API Centric and Event-based Architecture

When evaluating communication between different systems, there is always an argument about choosing between an API-first approach and an event-first approach. In a distributed ecosystem, it’s not one or the other; it is the combination of both strategies that solves data transmission between systems.

APIs are the de facto way of interacting for synchronous operations, which means performing tasks one at a time in sequential order. When designing systems with a specific responsibility, APIs shield the underlying systems from being accessed directly and expose only the reusable data, ensuring no duplication of data elsewhere. When using simple APIs, all that is needed is a readable API structure and systems that follow a request-response pattern. APIs are beneficial for real-time integrations where the requesting system needs information on demand.

However, designing and scaling APIs can also get intricate. In a high-transaction microservices architecture, throttling and caching of APIs are not simple, as APIs need to scale on demand. Also, in such integrations, an API gateway becomes necessary to keep the systems loosely coupled.

The example below depicts a reporting system that creates different reports based on Customer, Order, and Catalog data. The source system exposes an API. The reporting system fetches the data via the API and sends the information to the underlying destination systems.

API First Architecture

This architecture looks fine if there are no changes to the information from the source systems. But if the order information has properties that keep getting updated, the Reporting system needs to ensure that the changed state is propagated to the downstream systems.


Handling Cascading Failures

In a chain of systems that interact using APIs, handling errors and failures can become cumbersome. Similarly, if there are multiple dependent API calls between two systems, cascading failures become complex to manage. The complexity increases further when systems need to react to dynamic state changes. This is where an event-based architecture can help address some of the issues.

The basis of an event-based strategy is asynchronous communication. An intermediate system decouples the source and destination service interfaces. This strategy is apt for applications that need near real-time communication and where scalability is a bottleneck.




With an event-based architecture, all the source system has to do is adhere to a contract and, on any state change, publish a message to the intermediate broker. One or more destination systems can subscribe to the broker to receive messages on state changes. Also, since the source system simply triggers an event, API scalability is no longer an issue.

Event First Architecture


With a pure event-based architecture, as the number of messages grows, the architecture can get complicated. Tracking whether each message has been processed becomes tricky. In this case, every order needs to be tracked to its latest state, and error handling needs to be robust. Also, the entire process can be slow, with significant end-to-end latency between systems.

Another way of simplifying the architecture is to combine the API and event designs. The diagram below illustrates the Reporting system interacting with the Order system using both APIs and events. The Order system sends a state-change notification to the broker. The Reporting system reads the state change and then triggers an API call to fetch the updated Order information. The Reporting system makes API calls to the Catalog and Customer systems to fetch the static data. It can further push the messages it creates for destination systems through the event broker.
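A minimal sketch of the hybrid interaction on the Reporting side (all names are hypothetical, and the broker subscription and HTTP wiring are omitted): the event only signals that something changed, and the fresh order details are then pulled through the API.

    public class OrderChangeListener {

        private final OrderApiClient orderApi;   // synchronous API exposed by the Order system
        private final ReportRepository reports;  // where the Reporting system stores its data

        public OrderChangeListener(OrderApiClient orderApi, ReportRepository reports) {
            this.orderApi = orderApi;
            this.reports = reports;
        }

        // Invoked by the broker subscription whenever the Order system publishes a state change.
        public void onOrderStateChanged(String orderId) {
            OrderDetails details = orderApi.getOrder(orderId); // fetch the latest state on demand
            reports.updateOrder(details);
        }
    }

    interface OrderApiClient { OrderDetails getOrder(String orderId); }
    interface ReportRepository { void updateOrder(OrderDetails details); }
    class OrderDetails { /* fields omitted */ }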




In conclusion, both APIs and events have their pros and cons and solve specific problems. They are not replacements for one another, and an architecture can be made less complex if they co-exist. In a modern microservices architecture, having both at hand helps ease the complexities of distributed system interactions.

Wednesday, January 10, 2024

Edge Tech Realities - Part 1 - Web Performance

Edge technology has improved severalfold in the last decade or so. In the late 2000s, I used Akamai to serve static content for several market-leading clients. These platforms typically reduced the number of calls reaching the application and improved the performance and latency of the websites.

For the last year or so, I have been working with Microsoft Azure Front Door, and this is when I realized in detail that, like any technological advancement, edge platforms have their fair share of disadvantages and practical challenges.

First and foremost is application performance. Recently, after caching all static pages on the CDN, a global website I was part of surprisingly saw its average page load times increase. That's when I realized that distributing resources across multiple edge locations can lead to inconsistencies in processing power and network connectivity. Page load times also differ depending on where the nearest edge locations are.

I also learned that, even on Microsoft Azure's backbone network, edge servers are not created equal. They differ in processing power, storage capacity, and network bandwidth. Based on the logs, some users had slower response times because their requests were directed to an underperforming edge server.

In a web application whose static content needs frequent updates, cache purges happen very often. This requires close synchronization between all the edge servers. Often it takes time for the edge locations to get back in sync, and the more purges there are, the more inconsistencies appear.

Lastly, when there are sudden spikes in traffic, say due to promotional campaigns or seasonal peaks, the scaling of edge servers may not be instantaneous, leading to performance bottlenecks. Though this may last only a short period, it still impacts the end user.


Wednesday, November 29, 2023

Mastering the Art of Reading Thread Dumps

For years, I have been trying to find a structured way to read thread dumps whenever there is a production issue. I have often found myself on a wild goose chase, deciphering their cryptic language. These snapshots of thread activity within a running application carry a lot of information, providing insight into performance bottlenecks, resource contention, and high memory or CPU usage.


In this article, I'll share tips and tricks based on my experience reading production thread dumps across multiple projects, hopefully demystifying the process for my fellow engineers.


Tip 1: Understand the Thread States

Thread states, such as RUNNABLE, WAITING, or TIMED_WAITING, offer a quick glimpse into what a thread is currently doing. Mastering these states helps in identifying threads that might be causing performance issues. For instance, a thread stuck in the WAITING state is a candidate for further investigation.
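For a quick first look before reading individual stack traces, the JDK's own ThreadMXBean can summarize how many threads are in each state. A minimal sketch:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;
    import java.util.EnumMap;
    import java.util.Map;

    public class ThreadStateSummary {

        // Counts threads per state (RUNNABLE, WAITING, TIMED_WAITING, BLOCKED, ...)
        // in the current JVM, the same states a thread dump shows.
        public static void main(String[] args) {
            ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
            Map<Thread.State, Integer> counts = new EnumMap<>(Thread.State.class);

            for (ThreadInfo info : threadMxBean.dumpAllThreads(false, false)) {
                counts.merge(info.getThreadState(), 1, Integer::sum);
            }
            counts.forEach((state, count) -> System.out.println(state + ": " + count));
        }
    }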


Tip 2: Identify High CPU Threads

Threads consuming a significant amount of CPU time are often the culprits behind performance degradation. Look for the "Top 5 Threads by CPU time" and dig into their stack traces; that is where the full stack trace pinpoints the exact method or task responsible for the CPU spike.


Tip 3: Leverage Thread Grouping

Grouping threads by purpose or functionality can simplify the analysis. In complex applications, the sheer number of threads can be overwhelming, so collating or grouping them helps; for example, grouping together threads related to database connections, HTTP requests, or background tasks. This approach often provides a more coherent view of the application's concurrent activities.


Tip 4: Pay Attention to Deadlocks

Deadlocks are the nightmare of multithreaded applications. Thread dumps provide clear indications of deadlock scenarios. Look for threads marked as "BLOCKED" and investigate their lock dependencies to identify the circular dependency causing the deadlock.
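The same ThreadMXBean can also confirm a suspected deadlock programmatically; a minimal sketch that reports which threads are blocked on locks held by each other:

    import java.lang.management.ManagementFactory;
    import java.lang.management.ThreadInfo;
    import java.lang.management.ThreadMXBean;

    public class DeadlockDetector {

        public static void main(String[] args) {
            ThreadMXBean threadMxBean = ManagementFactory.getThreadMXBean();
            long[] deadlockedIds = threadMxBean.findDeadlockedThreads();

            if (deadlockedIds == null) {
                System.out.println("No deadlocks detected");
                return;
            }
            for (ThreadInfo info : threadMxBean.getThreadInfo(deadlockedIds, Integer.MAX_VALUE)) {
                System.out.println(info.getThreadName() + " is blocked on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
    }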


Tip 5: Explore External Dependencies

Modern applications often rely on external services or APIs. Threads waiting for responses from these external dependencies can significantly impact performance. Identify threads in WAITING states and trace their dependencies to external services.


Tip 6: Utilize Profiling Tools

While thread dumps offer a snapshot of the application state, profiling tools like VisualVM or YourKit provide a dynamic and interactive way to analyze thread behavior. These tools allow you to trace thread activities in real time, making it easier to pinpoint performance bottlenecks.


Tip 7: Contextualize with Application Logs

Thread dumps are more powerful when correlated with application logs. Integrate logging within critical sections of your code to capture additional context. This fusion of thread dump analysis and log inspection provides a holistic view of your application's behavior.


In conclusion, reading thread dumps is both an art and a science. It requires a keen eye, a deep understanding of the application's architecture, and the ability to connect the dots between threads and their activities. By mastering this skill, one can unravel the intricacies of their applications, ensuring optimal performance and a seamless user experience.

Monday, November 27, 2023

Pitching IaC to Stakeholders

As a Cloud Architect, I have explained Infrastructure as Code (IaC) to our stakeholders several times, and why it is a no-brainer, especially for applications running on the cloud.

In every discussion, I keep explaining how it can make our development work faster and less error-prone, make environments reusable, and save us a lot of time and money in the future.

The first questions I always get are: what is wrong with the current manual ways that have served us well so far? Will it increase our short-term budget?

I take a deep breath and reiterate that having a blueprint always saves time, increases accuracy, and reduces costs in the long run. IaC simplifies future changes, and environments can be replicated without major rework.

The gap between Business and IT often arises not due to the incapability of IaC but the challenge of translating its intricacies into a language both realms can comprehend. An Architect has to be persistent and repetitive. 

With cloud-first implementations, surely there will come a time when businesses in large organizations take efficiency and automation seriously.

Saturday, November 25, 2023

The Plight of an Architect in an Agile Project

The Agile methodology in software development has emerged as a guiding light, promising flexibility, collaboration, and adaptability. But organizations have mistaken it for a luxury cruise liner while treating it like the Black Pearl from Pirates of the Caribbean on the high seas of chaos.

Agile, with its sprints, stand-ups, and user stories, was supposed to be the antidote to the rigid and often cumbersome Waterfall methodology. However, in the real world, Agile is sometimes wielded like a double-edged sword – misused by developers and misunderstood by business leaders.

The Agile coaches are like the pirate captain: mainly responsible for steering the meetings and navigating the ship without an ounce of technical know-how. Picture the Agile stand-up meetings as the meeting of the Brethren Court, which typically turns into a recitation of individual developers' achievements, each developer resolving epics and creating their own stories for their everyday chores, trying their best to please the captain.

Then there are the Product Owners, who act as the Pirate Lords holding the keys to the treasure chest of project priorities. These lords of prioritization often struggle to let go of the old Waterfall ways, treating Agile like a mere parrot on their shoulder rather than a shipboard companion. They treat technical debt as the Dead Man's Chest, never to be opened or seen.

Amidst all of them, Architects are often left in the lurch, as Agile teams treat their decisions as an afterthought. Their long-term vision gets lost in the relentless pursuit of project priorities and sprint goals. Good Architects are aware of the so-called mirage on the horizon, but often find themselves relegated to the back of the ship, much like passengers become mere spectators, watching their maps of successful navigation grow damp and tattered in this unpredictable Agile storm.

Friday, April 21, 2023

Funnel-based Architecture for application Security on the Cloud - Part 1 - The Framework

As a Solution Architect, I've had a few opportunities to work with organizations facing security challenges on the cloud, especially with public-facing applications. One of the most common issues I've encountered is a lack of visibility into and control over their cloud environments.


To solve these security issues, I've implemented a funnel-based framework for enhancing security on the cloud. The framework involves identifying the data flow within the cloud platform and implementing funnel points, which act as choke points for security controls at each layer. The last steps of the framework cover increasing observability and continuous security improvement.


Below are the different steps:




Step 1: Identify the Data Flow within the Cloud Platform


The first step in implementing a funnel-based framework for security on the cloud is to identify the data flow within the platform. This involves understanding the types of data processed through the platform and identifying the various stages in the data flow. It also means getting to know every service or layer through which the data flows.


Step 2: Implement Funnel Points


Based on the data flow, the next step is to implement funnel points throughout the platform. Funnel points are choke points in the data flow where security controls are added at each layer to protect against threats. They sit at the network, transport, and application layers and may include network gateways, data storage, web and application services, and other components.


Step 3: Implement Security Controls at Each Funnel Point


At each funnel point, security controls protect the cloud environment at that layer or service. These include access controls, encryption and decryption processes, network security controls, monitoring and logging mechanisms, vulnerability management, and incident response processes. Each security control is designed to address a specific threat or vulnerability, and together they provide comprehensive protection for the cloud environment.


Step 4: Regularly Monitor and Update the Security Controls


Once the security controls are implemented in each layer, it is critical to regularly monitor and update them to ensure they are working effectively. It involves monitoring the platform for suspicious activity, regularly reviewing access controls, updating software and security patches, and testing the security controls to identify any weaknesses or vulnerabilities.


Step 5: Continuously Improve the Framework


Finally, to continuously improve the funnel-based framework for security on the cloud, it is critical to stay ahead of emerging threats and vulnerabilities. It involves staying up-to-date on the latest security trends and best practices, regularly reviewing the security controls to identify areas for improvement, and working with clients to identify new threats and risks.


By following these steps, I was able to implement a comprehensive funnel-based framework for security on the cloud that provided good protection against a wide range of threats and vulnerabilities. I will dive deeper into the Funnel-based Architecture with examples in Part 2.

Funnel-based Architecture for Website Security on the Cloud - Part 2 - Using Microsoft Azure Services

In Part 1 of the article, I described the Funnel-based framework and various steps to improve web application security on the cloud. In this article, I will cite a real-world example of how I used the funnel-based framework and designed a Funnel-based architecture to filter and analyze malicious traffic for a web application.


The layered approach of Funnel-based Architecture is essential in providing multiple levels of security to web applications. By having multiple layers of security, each layer is responsible for detecting and blocking various attacks, making it more challenging for attackers to breach several layers at once. If an attacker bypasses one layer of defense, the other layers can still provide protection, making it harder for them to compromise the web application.


Below is an example of a multi-layered funnel that blocks malicious web requests, with each layer providing an increased level of security. The diagram illustrates:





a) The data or request flow from the browser and DNS, across the edge layers, and through all the Azure services in the background.

b) Layered funnel points that independently choke malicious traffic through IP filtering, geo-blocking, custom WAF rules, rate limiting, content caching, etc.

c) Security controls at each layer or funnel point, including access controls and restrictions through user authentication, authorization, audit trails, data encryption at rest and in transit, and an Intrusion Detection and Prevention System.

d) Deep monitoring and alerting of each layer, with custom automated ways to update infrastructure and WAF rules, analyze logs, detect threats automatically, and trigger application protection via scaling, captchas, static sites, etc.

e) Finally, continuous improvement through regular security assessments and benchmarking, penetration testing, security awareness training, incident response planning, etc.


Here are some examples of security tools that we used to create a Funnel-based Architecture on Azure:


  1. Azure Firewall: A network layer security tool that provides a managed, cloud-based firewall service to protect Azure virtual networks and resources from network-based threats.
  2. Azure Front Door: A global, scalable, and secure entry point that provides routing, caching, and load balancing of web traffic at the network layer.
  3. Azure Application Gateway: A layer-7 load balancer that provides WAF and SSL termination capabilities to protect web applications from application-layer attacks.
  4. Marketplace WAF: An Advanced WAF that provides robust in-house web application firewall protection by securing applications against layer 7 DDoS attacks, malicious bot traffic, all OWASP top 10 threats, and API protocol vulnerabilities.
  5. Azure DDoS Protection: A layer 3/4 protection service that protects against DDoS attacks by automatically mitigating them in the Azure network before they reach the targeted resource.
  6. Azure Key Vault: A cloud-based service that provides secure storage and management of cryptographic keys and secrets used by cloud applications and services.
  7. Azure Sentinel: A cloud-native SIEM and SOAR solution that provides intelligent security analytics and threat intelligence across the enterprise.



