It’s all about laying blame


I heard this in House MD

Good things happen often, Bad things happen sometimes.

If you are are talented and skilled you will try things that others are afraid to try out and because you are going to live dangerously there will be times when you will fail. What you did and What the outcome is are 2 different things. You can fail or succeed but that will have no bearing on whether you are right or wrong. So next time you have to make a tough decision where the chances of failing are high don’t be afraid – do the right thing.

There will be some who will try to lay blame but eventually you will see – people will realize that your process works because you succeed and disrupting your process makes you ineffective and no one wants to do that.

Motivated or Skilled?


It’s interesting how when we have to select a bunch of people to be on our projects we immediately latch onto the skilled ones whether or not they are motivated and the average guys are first ones to go even if they have a vested interested in our success (as they succeed along with us).

Next time when you have to make a choice of keeping someone on the team, pick the one who will do anything to make you successful and you will be surprised what that person will do to gather all the skills needed. He will do everything and a lot more to make him/her successful and in the process you will see success beyond your wildest imagination. You just need to provide them some mentorship – thats all.

Alternatively, keep finding ways to motivate people but don’t forget The Paradox of Rising Expectations.

Just Write away…


I have been meaning to write a few articles for “let’s just say someone” but that publication will mean a bunch of reviews and approvals. I get the fact that I am getting publiched in anotehr brand and they need to make sure it (if not represents) fits the bill. Having said that, the fact that a person is reviewing my article may have very different opinion on the topic and that difference of opinion will impact how my article is reviewed and if gets selected or not.

This article is my POV and what I want to tell the readers, and not what this reviewer thinks and what he wants to get out. This is not just an “edit” on punctuations and language but on my thoughts is my biggest deterrent on writing for any other publication and I realized that those 2 articles have been sitting in draft mode for 18 months now. It just gets “uuggghhh…”

So my advice to myself here and to all of you out there – “just write away.. find a way where you just get to state what you think // dont get stuck into that review trap”.

Unit Testing in AEM (thinking loud)


This is not a recommendation of any sorts but a culmination of ideas and a few options that are available for us to use if we want to do unit testing within AEM. I had done some research for a client some time back and this article is largely influenced by that work but a lot of contextual stuff has been pulled out. I have still tried my best to ensure that the article here holds it’s essence. I will try to do a follow-up soon with a lot more details.

Option 1: Use Sling tools and test in-container

Apache sling has released a set of tools http://sling.apache.org/documentation/development/sling-testing-tools.html which can assist unit testing in the application. There tools offer several ways of doing the testing like a) good old JUnits where there are no external dependencies or b) Use of mocks – sling provides readymade mocks which reduce the effort or c) we can deploy the test cases in a CQ box (or sling) and run using OSGi references.

The approach I am recommending here is where we will deploy JUnits in an already hosted CQ instances and invoke the test cases remotely. I understand that this is not “old school unit testing as i am not abstracting any dependencies and my units include dependencies” but i have a reason for doing that. As a matter of fact if you have been following up my writings on unit testing you would know that I am not a big fan of mocking and actually am very happy to do any unit testing against dependencies if i can set it up.

 To do this we need a few things to happen as follows:

  1. We will need to have a hosted CQ instance that can be used as a container for running test cases
    1. We can use embedded systems but then we will have to spend additional effort creating content and what not. Also the embedded container will be sling and not CQ and we would like to keep the environment as close to what we use as possible
  2. The CQ instance should have a pre-populated set of products and images (this setup does uses AEM eCommerce module and PIM and DAM have been integrated with external systems) and that acts for us as readymade test data. These can be achieved using our backend integrations. We can chose to do it independently or can do it automatically (automation of these things can also happen over time to allow us to start quickly)
  3. For interactions with any backend services (like Order Management, Pricing, Account information), we would need to have a backend service instance running (as i said i prefer systems over mocks if possible) with all the variables and pieces setup. This instance should also have various data setup like user accounts, products instances, availability, prices, etc to ensure our use cases work. There are obvious challenges setting up independent backend services and we can explore one of the following 2 options
    1. Capture all requests and responses for a certain request type and serialize those into a test-data store. It can be a huge XML that we can store in a key-value pair sort of a system – can be a database like mongo (even SQL would do) or we can serialize on file system or;
    2. We can use an already existing backend system

Option 2: Use selenium as the functional testing tool

In this approach I am recommending not to use JUnits at all. The idea is to use the philosophy of system testing which can test all of your units in the code. This is a big departure from the traditional way of unit testing where all dependencies are mocked out, and we can run several tests quickly. While Option 1 is also to the same effect, in this approach we go a step further and leverage our system test suits. The idea is not to do this for every single use case, but pick up business critical functions like checkouts, order management, account management and automate those. The selenium scripts can then be integrated with a JUnit runner where we can then integrate it with CI tools and can run it from Eclipse or Maven and hence can be integrated with CI itself. This saves us the time to write those JUnits and manages a whole suite independently. This approach also needs a hosted CQ instance with product data setup, some content setups, and backend integrations just like in Option 1.

Of course this is bit tricky and not really unit testing but it has some huge upsides if done right.

Don’t be afraid to course correct


When driving a car if you sense a flat tyre; do you stop and fix it or do you just go about driving with a sluggish pace which eventually will lead to an accident?

I guess we all know what we will do! So when we see a problem on a project why don’t we take the time and fix it? Why do we get afraid of telling stakeholders that something has gone wrong and we need to course correct which might lead to an increased cost or a bit of a delay in schedule?

I am sure the counter argument would be – we do fix it or inform it. I am sure if you introspect enough you will realize that we do one of the following:

  1. It’s trivial and we can deal with it ourselves – leads to no communication and eventually there are enough stacked up trivial(s) which eventually leads to cost of schedule gone wrong or;
  2. It’s a engine gone wrong and the cost is already so huge and I will blamed or it will taint my reputation of not being able to handle it well so let me fix it

Either ways you will find yourself screwed compared to if you would just tell the stakeholders who will more often than not be able to share some ideas of how to fix the problem faster. You may still end up bearing the cost (which you do anyways) but your journey towards delivering the project will be so much relaxing and i can almost guarantee your relationships with the stakeholders will be lot more trusting.

Start by getting familiar


The level of confidence a team can get for the number of functional defects that a client can find in UAT has a direct relation to how much they know of the application they are building. You can write any number of scripts (manual or automated) but unless you get yourself to be in a position where you are like a user of the site you will never know if what you are testing will meet the client’s end requirements.

So at some point in time, get to know the application upfront even before you write a single line of code or test script or find someone who does. It’s no different from how our mind works when we travel from home to office or to a new place – either we know where we are going and our eyes and brain give is immediate and instant feedback if we go wrong or we need something like Google Maps to tell us (even though it is a bit delayed).

This is a step that you need to have to be able to deliver Quality on any project

High availability design


If you have ever travelled in an Indian Railways you would have noticed that the capacity for which the train is supposed is handle holds no meaning because the number of people it will be carrying is just going to be way over. That’s how the passenger load and platforms all across the country are managed. The method mostly works fine, but from time to time there are breakdown and trains are delayed, sometime cancelled but the life goes on as people expect that this will happen with Indian Railways. 

When we design and write those big platforms/software something similar happens but the biggest difference is that customers/clients who have paid for the software don’t like those downtimes (cancellations) and slowness (delays). Last 2 years or so I have had so many conversations where 2 key NFRs intersect – Performance and Availability. I have noticed and started to realize that while these two end up being joint at the hip and need to work closely together they still mean a world of difference and what it means when we speak about performance and availability and each of these need to be addressed differently. 

Of course, eventually with all many fixes you will eradicate a lot of cases that led to failures but it would have taken you so long and the reputation that the brand holds do dear is already damaged.

 

The Start 

Most project (almost all) in today’s world have some Non Functional Requirements and the 3 that take the top most priority are Performance, Availability and Security. Some numbers that get most often thrown around are:

  • Pages should open in under 1 second and so forth
  • There should be an uptime of 99.99% – which is actually 1.01 minutes per week 

And that’s about it. Of course there are more, but 95% of the conversations revolve around the two here. then we go about designing solutions to meet those numbers. 

 

Just before GO Live

Things are al good when we are under implementation and we do everything to make sure we meet those 2 or so numbers met with out design. We do performance modeling and then we execute those performance models and prove out that the trafic model/simulation as to what we understand as client use cases work fine. So that is not we say with confidence that our system will meet the performance needs. In this model, we do take a capacity increase of 40% or so; again based on anayltic and some future growth and we incubate those numbers into our calculations and then we are even sure that our system will be able to handle a bit more if that happens at times. 

Now that we are so sure that performance is all good for the traffic we expect to get we believe that our software should continue to work fine because everything will work in the same constraints in which we have tested it. And because those constraints are well defined there should really be no problem with us meeting our availability numbers. 

 

We are in flight Houston

Then the next cool thing happens – we go live and our system that we have build with so much pride and caution and we have tested so much is Live and so many people start to use it. It certainly is an exhilarating feeling seeing your sweat and hard work go live and people are seeing and interacting with what you have build. 

Until a time comes when the system goes down. You would be sleeping in middle of the night and you will get a call and someone will be telling you to get up, switch on your laptop and get ready to debug as to why the system went down and get it up and running quick. It takes you a while to think – what the hell just happened. We did everything we had to do. There were even reviews that we did and everything was looking good. How can it go down.

Well there is something called Universe that has a different set of plans for you and those plans just went into motion.

So what happens?

Indian Railways happens :) You realize that there is a traffic pattern that has suddenly come and hit your servers which you did not know. A Chinese search engine has started to crawl all over your site and the site is not even in china. Well it’s a free ride this internet and anyone can get on. Why can some people sitting in a country not see a website that is not supposed to be targeted at them – but we did not have those specifications. We put in all checks but we never anticipated for all those search engines and bots and the crawling they will do. 

What do we do?

We fix the problem and put in either a block to stop that traffic pattern or we throttle it or we add servers to handle it. And then we go about thinking okay now i am good; this is done and dusted and wont happen again.

Universe has other plans

Next time someone will add some bad servers into WIP and cluster will fail. and the next time someone will delete the database and it will crash again and the next and the next and the next…

In the whole software development process we miss this key step – to design for this game and how we will be setup to get the system back up and running in that time frame. 

 

Fixing begins

Of course we have to do something but what we do is to start looking at our solution and check why performance is bad. Performance – really? We do everything we can do to fix the performance problem but we spend no time on availability aspect.

Going back to the India Railways analogy I drew upfront; a train – engine and bogeys are built to handle certain load and they was agreed. It cant be more precise as there are seats and there are tickets that need to be bought to get into that train. As long as the number of people that get in there are within those constraints our problems will be much less. Everything around our software (our train) needs to work in tandem. But, it is difficult to control. Internet is much more wider than an Indian Railways and who comes, when they come and how many come is just not predictable. It becomes important to acknowledge that no matter how you do there will always be a model of traffic that will come and visit you that will take your system outside of the known boundaries and more often than not once our system is operating outside of those boundaries it’s bound to fail some point. This is where the Availability and Resilience perspectives need to be brought into the picture. 

 

Next Time do something else too

Availability and Resilience

At their core this perspective asks you to set some designs, practice and most importantly an expectation with your clients as to what you are dealing with. We all know that in the last decade how we run our business and how we deal with internet hosted sites is very different than how we used to make systems in past. I paraphrase from article – “If your site is down, your business will suffer” and yet everyone will want a 24×7 uptime but yet we sold what NFRs – 99.99 (1.06 minutes) or maybe 99.9999 (0.605 seconds) thinking it’s okay. If the expectation is 24×7 why would we even start with something less?

We then need to look at the next 2 most important metrics which we miss all along and we never plan, design or test for. It’s like we take them for granted. It’s the Recovery Time Objective (RTO) and RPO (Recovery Point Objective). As we speak about uptime and outages, whenever we have an unplanned outage and we have promised a certain uptime, we need to have the Operational Ability to be do whatever is needed to get the system up and running. If we designed for 99.99% we need to have methods in place to get system back in 1.06 seconds – it feels like Minute to Win It. In the whole software development process we miss this key step – to design for this game and how we will be setup to get the system back up and running in that time frame. 

The Operational ViewPoint

Operation Viewpoint is a key architectural principle that we omit to design for when we are building software systems and platform. How we run software now has been completely changed in the last decade with the cloud hosting. As cloud makes things so easy to provision and host (AWS) we believe that everything should be easy. So where in the past when we used to focus on Availability design a lot more we almost take it for granted. This Viewpoint is something that should become our bread ad butter during the implementation phase with a dedicated team who is going to look at operational processes and tools and provide methods to make the recovery possible in the time it’s expected to be. 

Categorize and Prioritize

This is where it becomes critical to have a conversation with out clients and understand how various parts of the systems can be identified and broken down into services. A classification of sorts like “Platinum”, “Gold”, “Bronze” starts to make sense to get the business to prioritize as to what services should get the top most priority incase of an unplanned outage. The focus of the operational design and implementation team then needs to focus on how to look at the system and how to make those services up and running quickly. This is a key inputs for the implementation phase because unless those services are not known there won’t be a way those services are coded such. 

Recover and not debug

When these unplanned outages happen, the team which is responsible for managing the system more often than not start with a different mind set. They are like Cops who have reached a crime scene after a crime has happened. They start by looking around for evidence and analyze the crime scene as to what happened. The idea is to look for evidence and then solve the crime and hopefully then find the criminal and put them behind bars. Well we all know it takes so long to get there. With cops, i can see the point – you can’t have cameras in all the home and everywhere, so you will have do post-mortems. But this is a software system and we need to be fire fighters. The idea has to be to put the fire out and do it quickly before it takes the block away. The idea has to be to realize some damage has been done – thats a lost cause; let’s see how we can save what’s left of it. 

In softwares that we write and platforms that we host we need to have a something I refer to as “Last know good state” and when an outage happens what we need to do it so just ourselves back to that state. But, when do we do when the state or behavior is not under your control. Going to back to Indian Railways, what do you do when you can’t control the number of people who are coming in on the platform and onto your train – they just keep coming in; no matter if you find a way to replace the train, they will keep coming in. The other way is to of course start adding more trains on the platform. With cloud you can do that and keep adding servers until all the traffic is dealt with. This is where we move seamlessly into Performance and Scalability perspective

This is where we lose sight of the problem and we try to fix something else in modern software.

 So what should do if we can not control traffic. We need an effective mechanism on our train that wont allow everyone to get in. We need to have the ability to know who can get in. We have these ID cards and Turnstil on platform. So if our platforms do not give us those why can we not put those in in our trains. It may not stop all malicious traffic, but it will certainly stop a lot of it. Most importantly you can go back and authorize your Platinum users to get in while you block everyone else.

So in software world, you need to have a cutover switch that will stop all traffic and only allow what is key for business. Unless everything that is coming in is Platinum which is not the case most times, you will be able to recover your most important services easy. Of course there is a degradation of other services but that is something you would already set expectations with your clients and the business. They will be mad but less mad. 

The 300

If you have not seen 300, you have got to see this and learn what it can do to your systems and ability to recover if you handle your enemy (the traffic) though a funnel. You will longer and you will get a lot more time to fight the enemy. On top of it, the pressure and stress that the business creates when their platinum services are not available will also be reduced. You can then go about debugging once you have contained. 

 

Nothing is for Free

Of course we do so much more, but the more we do the more it will cost. Back to indian railways, we can either chose to save some money on building my train by not installing those ID cards turnstiles or we can invest that money and ensure we have continuity. The Turnstiles will be more and more complex and will need a lot more fine tuning to handle all scenarios. So if you need to also handle for use cases where you don’t want your turnstile to fail, then you need to install 2 of them on all bogeys which will cost more and the setup goes on. 

The point I am trying to make is that when we go from 99% to 99.99% to 99.999% we dont look at the cost drivers and what it will do to the project. We may think – it would mean a few more rounds of performance testing and we should done. We we know now what’s going to happen. 

If you fail to articulate to the clients what these numbers will mean to them in terms of cost, you wont ever get them accept the reality of internet and the universe. More often than not you will realize how business will realize that there are services they can live without. Of course, eventually with all many fixes you will eradicate a lot of cases that led to failures but it would have taken you so long and the reputation that the brand holds do dear is already damaged. If you think of this as “risk money” and how you invest in guard your reputation will justify the cost every time – that much i can assure you.