High availability design

If you have ever travelled on Indian Railways you will have noticed that the capacity a train is supposed to handle holds little meaning, because the number of people it actually carries is just going to be way over it. That is how the passenger load and platforms across the country are managed. The method mostly works fine, but from time to time there are breakdowns and trains are delayed, sometimes cancelled, and life goes on because people expect that this will happen with Indian Railways.

When we design and build those big platforms/software something similar happens, but the biggest difference is that the customers/clients who have paid for the software don’t like those downtimes (cancellations) and slowness (delays). Over the last 2 years or so I have had many conversations where 2 key NFRs intersect – Performance and Availability. I have started to realize that while these two end up joined at the hip and need to work closely together, they still mean very different things, and each of them needs to be addressed differently.

Of course, eventually, with many fixes you will eradicate a lot of the cases that led to failures, but it will have taken you so long that the reputation the brand holds so dear is already damaged.

 

The Start 

Most projects (almost all) in today’s world have some Non-Functional Requirements, and the 3 that take top priority are Performance, Availability and Security. Some numbers that most often get thrown around are:

  • Pages should open in under 1 second, and so forth
  • There should be an uptime of 99.99% – which works out to about 1.01 minutes of allowed downtime per week

And that’s about it. Of course there are more, but 95% of the conversations revolve around these two. Then we go about designing solutions to meet those numbers.

 

Just before GO Live

Things are all good while we are in implementation, and we do everything to make sure those 2 or so numbers are met by our design. We do performance modeling, then we execute those performance models and prove out that the traffic model/simulation of what we understand as the client’s use cases works fine. At that point we say with confidence that our system will meet the performance needs. In this model we also take a capacity increase of 40% or so, based on analytics and some assumed future growth, and we incorporate those numbers into our calculations, so we are even sure that our system will be able to handle a bit more if that happens at times.
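As a rough illustration of the kind of arithmetic that goes into that model (the numbers below are made up, not from any real project), the headroom simply inflates the observed peak before we size and load-test the system:

[code language="java"]
// Illustrative sketch only: the observed peak and growth factor are assumptions.
public class CapacityModel {
    public static void main(String[] args) {
        double observedPeakRps = 500;    // peak requests/second from the traffic model
        double headroomFactor = 1.4;     // ~40% buffer for analytics-based growth
        double designTargetRps = observedPeakRps * headroomFactor;
        System.out.printf("Size and load-test the platform for ~%.0f requests/second%n", designTargetRps);
    }
}
[/code]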

Now that we are so sure that performance is all good for the traffic we expect to get, we believe that our software should continue to work fine, because everything will operate within the same constraints in which we have tested it. And because those constraints are well defined, there should really be no problem with us meeting our availability numbers.

 

We are in flight Houston

Then the next cool thing happens – we go live, and the system that we have built with so much pride and caution and have tested so much is live, and so many people start to use it. It certainly is an exhilarating feeling to see your sweat and hard work go live, with people seeing and interacting with what you have built.

Until a time comes when the system goes down. You will be asleep in the middle of the night, you will get a call, and someone will be telling you to get up, switch on your laptop and get ready to debug why the system went down and get it up and running quickly. It takes you a while to think – what the hell just happened? We did everything we had to do. There were even reviews, and everything was looking good. How can it go down?

Well, there is something called the Universe that has a different set of plans for you, and those plans just went into motion.

So what happens?

Indian Railways happens 🙂 You realize that a traffic pattern you did not know about has suddenly come and hit your servers. A Chinese search engine has started to crawl all over your site, and the site is not even in China. Well, the internet is a free ride and anyone can get on. Why can people sitting in one country not see a website that is not targeted at them – but we did not have those specifications. We put in all the checks, but we never anticipated all those search engines and bots and the crawling they would do.

What do we do?

We fix the problem and either put in a block to stop that traffic pattern, or throttle it, or add servers to handle it. And then we go about thinking: okay, now I am good; this is done and dusted and won’t happen again.
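To make the idea concrete, here is a minimal, hypothetical sketch of the "block that traffic pattern" option as a servlet filter – the bot names and the choice of a 429 response are my assumptions, not details from the incident described above:

[code language="java"]
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative only: rejects requests whose User-Agent matches a known unwanted crawler.
public class CrawlerBlockFilter implements Filter {

    private static final String[] BLOCKED_AGENTS = {"Baiduspider", "SomeOtherBot"}; // assumed names

    @Override
    public void init(FilterConfig config) {
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String userAgent = ((HttpServletRequest) req).getHeader("User-Agent");
        if (userAgent != null) {
            for (String bot : BLOCKED_AGENTS) {
                if (userAgent.contains(bot)) {
                    // Stop the unexpected traffic pattern at the edge instead of letting it hit the servers.
                    ((HttpServletResponse) res).sendError(429, "Crawling not permitted");
                    return;
                }
            }
        }
        chain.doFilter(req, res);
    }

    @Override
    public void destroy() {
    }
}
[/code]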

Universe has other plans

Next time someone will add some bad servers into WIP and the cluster will fail. And the next time someone will delete the database and it will crash again, and the next and the next and the next…


 

Fixing begins

Of course we have to do something, but what we do is start looking at our solution and checking why performance is bad. Performance – really? We do everything we can to fix the performance problem, but we spend no time on the availability aspect.

Going back to the Indian Railways analogy I drew upfront: a train – engine and bogeys – is built to handle a certain load, and that load was agreed. It can’t be more precise: there are seats, and there are tickets that need to be bought to get onto that train. As long as the number of people who get on stays within those constraints, our problems will be much smaller. Everything around our software (our train) needs to work in tandem. But that is difficult to control. The internet is much wider than Indian Railways, and who comes, when they come and how many come is just not predictable. It becomes important to acknowledge that no matter what you do, there will always be a pattern of traffic that comes to visit you and takes your system outside of its known boundaries, and more often than not, once our system is operating outside of those boundaries it is bound to fail at some point. This is where the Availability and Resilience perspectives need to be brought into the picture.

 

Next Time do something else too

Availability and Resilience

At their core, these perspectives ask you to set up some designs, practices and, most importantly, an expectation with your clients as to what you are dealing with. We all know that in the last decade how we run our business and how we deal with internet-hosted sites has become very different from how we used to build systems in the past. I paraphrase from an article – “If your site is down, your business will suffer” – and yet everyone wants 24×7 uptime, while the NFRs we sold were 99.99% (about a minute of downtime per week) or maybe 99.9999% (0.605 seconds per week), thinking that is okay. If the expectation is 24×7, why would we even start with something less?

We then need to look at the next 2 most important metrics, which we miss all along and never plan, design or test for. It’s as if we take them for granted. They are the Recovery Time Objective (RTO) and the Recovery Point Objective (RPO). As we speak about uptime and outages: whenever we have an unplanned outage and we have promised a certain uptime, we need to have the operational ability to do whatever is needed to get the system up and running. If we designed for 99.99%, we need methods in place to get the system back within about a minute – it feels like Minute to Win It. In the whole software development process we miss this key step – to design for this game and how we will be set up to get the system back up and running in that time frame.
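A quick way to keep everyone honest about these targets is to convert the availability percentage into the downtime budget it actually allows – a small sketch:

[code language="java"]
// Converts an availability target into the weekly downtime budget it implies.
public class DowntimeBudget {
    public static void main(String[] args) {
        double secondsPerWeek = 7 * 24 * 60 * 60;
        double[] targets = {0.99, 0.999, 0.9999, 0.999999};
        for (double availability : targets) {
            double budgetSeconds = secondsPerWeek * (1 - availability);
            System.out.printf("%.4f%% uptime -> %.1f seconds of downtime allowed per week%n",
                    availability * 100, budgetSeconds);
        }
    }
}
[/code]

For 99.99% that comes out to roughly 60 seconds a week; for 99.9999% it is well under a second – which is exactly why the RTO has to be designed for, not hoped for.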

The Operational Viewpoint

The Operational Viewpoint is a key architectural principle that we omit to design for when we are building software systems and platforms. How we run software has completely changed in the last decade with cloud hosting. Because the cloud makes things so easy to provision and host (AWS), we believe that everything should be easy. So where in the past we used to focus a lot more on availability design, we now almost take it for granted. This viewpoint is something that should become our bread and butter during the implementation phase, with a dedicated team that looks at operational processes and tools and provides methods to make recovery possible in the time it is expected to take.

Categorize and Prioritize

This is where it becomes critical to have a conversation with our clients and understand how various parts of the system can be identified and broken down into services. A classification of sorts like “Platinum”, “Gold”, “Bronze” starts to make sense, to get the business to prioritize which services should get top priority in case of an unplanned outage. The focus of the operational design and implementation team then needs to be on how to look at the system and get those services up and running quickly. This is a key input for the implementation phase, because unless those services are known there is no way they will be coded accordingly.
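As a sketch of what that classification might look like in code (the service names and tiers here are invented for illustration), the point is simply that recovery order becomes an explicit, agreed artifact rather than a judgment call made at 3 a.m.:

[code language="java"]
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: the services and their tiers would come from the conversation with the business.
public class RecoveryPlan {

    enum Tier { PLATINUM, GOLD, BRONZE } // declared in recovery-priority order

    public static void main(String[] args) {
        Map<String, Tier> services = new LinkedHashMap<>();
        services.put("checkout", Tier.PLATINUM);        // assumed examples
        services.put("search", Tier.GOLD);
        services.put("recommendations", Tier.BRONZE);

        services.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())   // enum declaration order == recovery order
                .forEach(entry -> System.out.println("Recover next: " + entry.getKey()));
    }
}
[/code]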

Recover and not debug

When these unplanned outages happen, the team responsible for managing the system more often than not starts with the wrong mindset. They are like cops who have reached a crime scene after the crime has happened. They start by looking around for evidence and analyzing the crime scene to work out what happened. The idea is to look for evidence, solve the crime, and hopefully find the criminal and put them behind bars. Well, we all know how long it takes to get there. With cops I can see the point – you can’t have cameras in every home and everywhere, so you will have to do post-mortems. But this is a software system, and we need to be firefighters. The idea has to be to put the fire out, and to do it quickly, before it takes the whole block with it. The idea has to be to accept that some damage has been done – that’s a lost cause; let’s see how we can save what’s left.

In the software we write and the platforms we host, we need to have something I refer to as the “last known good state”, and when an outage happens what we need to do is just get ourselves back to that state. But what do we do when the state or behavior is not under our control? Going back to Indian Railways: what do you do when you can’t control the number of people who are coming onto the platform and onto your train – they just keep coming in; even if you find a way to replace the train, they will keep coming. The other way is, of course, to start adding more trains at the platform. With the cloud you can do that, and keep adding servers until all the traffic is dealt with. This is where we move seamlessly into the Performance and Scalability perspective.
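A minimal sketch of the “last known good state” idea (all names and the deployment hook are placeholders; the real wiring depends on your tooling):

[code language="java"]
// Illustrative only: keep a pointer to the last release that passed its health checks
// and revert to it when an outage hits, instead of debugging the live system first.
public class LastKnownGoodState {

    private String lastKnownGoodRelease = "v1.4.2"; // hypothetical version labels
    private String currentRelease = "v1.5.0";

    public void onHealthCheckPassed() {
        lastKnownGoodRelease = currentRelease;      // promote only after the release has proven itself
    }

    public void onOutageDetected() {
        System.out.println("Outage on " + currentRelease
                + " -> rolling back to " + lastKnownGoodRelease);
        deploy(lastKnownGoodRelease);               // recover first, analyse later
    }

    private void deploy(String version) {
        // hand off to whatever deployment/automation tooling is in place
    }
}
[/code]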

Scaling out like that is also where we lose sight of the problem and start trying to fix something else in modern software.

So what should we do if we cannot control the traffic? We need an effective mechanism on our train that won’t allow everyone to get in. We need the ability to know who can get in. We have ID cards and turnstiles on the platform. So if our platforms do not give us those, why can we not put them on our trains? It may not stop all malicious traffic, but it will certainly stop a lot of it. Most importantly, you can go back and authorize your Platinum users to get in while you block everyone else.

So in the software world, you need to have a cutover switch that will stop all traffic and only allow what is key for the business. Unless everything coming in is Platinum, which is not the case most of the time, you will be able to recover your most important services easily. Of course there is a degradation of other services, but that is something you would already have set expectations about with your clients and the business. They will be mad, but less mad.
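Here is a hypothetical sketch of such a cutover switch as a servlet filter – the flag, the paths and the 503 response are all assumptions made for illustration:

[code language="java"]
import java.io.IOException;
import javax.servlet.*;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

// Illustrative only: when degraded mode is switched on, only the business-critical
// ("Platinum") paths are served; everything else gets a polite 503.
public class CutoverSwitchFilter implements Filter {

    private volatile boolean degradedMode = false;                            // flipped by operations during an outage
    private static final String[] PLATINUM_PATHS = {"/checkout", "/account"}; // assumed critical services

    @Override
    public void init(FilterConfig config) {
    }

    @Override
    public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
            throws IOException, ServletException {
        String uri = ((HttpServletRequest) req).getRequestURI();
        if (degradedMode && !isPlatinum(uri)) {
            ((HttpServletResponse) res).sendError(503, "Temporarily serving critical services only");
            return;
        }
        chain.doFilter(req, res);
    }

    private boolean isPlatinum(String uri) {
        for (String path : PLATINUM_PATHS) {
            if (uri.startsWith(path)) {
                return true;
            }
        }
        return false;
    }

    @Override
    public void destroy() {
    }
}
[/code]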

The 300

If you have not seen 300, you have got to see it and learn what it can do for your systems and your ability to recover if you handle your enemy (the traffic) through a funnel. You will last longer and you will get a lot more time to fight the enemy. On top of that, the pressure and stress that the business creates when their Platinum services are not available will also be reduced. You can then go about debugging once you have contained the situation.

 

Nothing is for Free

Of course we can do much more, but the more we do, the more it will cost. Back to Indian Railways: we can either choose to save some money on building the train by not installing those ID-card turnstiles, or we can invest that money and ensure we have continuity. The turnstiles will get more and more complex and will need a lot more fine tuning to handle all scenarios. If you also need to handle the case where you don’t want your turnstile itself to fail, then you need to install 2 of them on every bogey, which will cost more, and so the setup goes on.

The point I am trying to make is that when we go from 99% to 99.99% to 99.999%, we don’t look at the cost drivers and what they will do to the project. We may think it would just mean a few more rounds of performance testing and we should be done. Well, we know now what’s going to happen.

If you fail to articulate to the clients what these numbers will mean to them in terms of cost, you won’t ever get them to accept the reality of the internet and the universe. More often than not, the business will realize that there are services they can live without. Of course, eventually, with many fixes you will eradicate a lot of the cases that led to failures, but it will have taken you so long that the reputation the brand holds so dear is already damaged. If you think of this as “risk money”, the investment in guarding your reputation will justify the cost every time – that much I can assure you.

Yahoo Moment: Google gets my content as #1 link on search

I have been blogging off and on in recent times, and over the last few weeks it has been all about Sling, Adobe AEM (CQ), unit testing and what not. As I continue my research and journey to define a developer workflow and unit testing for an AEM development environment, I found out that Google ranked a piece of my content as the #1 result when searching for specific keywords. The keywords are not very common and not a lot of people will see my content at #1, but it’s a mini victory for me to be ranked #1 for content without having my name or my blog’s name in the search terms.

 

Yahoo Moment

AEM Development Workflow – Part 3 (Coding Old School)

In this series I have been trying to define the various development workflows (that have existed or will arise in the near future) and what sort of problems I see with each of them. We start off with the very first use case, which has probably existed for years now: “the old school” way of coding in CQ.

Requirements

Image 1: Component Breakdown

It is always beneficial to define what the end result has to be, so that we can ensure we have achieved what we set out to do. I had finally received HTML for a carousel from my Site Development friend (download here), and our job is to take that carousel and a simple page and make them content-managed in Adobe AEM. For this purpose I am going to use AEM 6.x, as we need to use Sightly in later phases. As you can see in Image 1, that is how the carousel should look. I have broken the page down into 2 parts/components that we will have to develop:

  1. Carousel Component – this is highlighted in the blue block and marked #4
  2. Title Component – For the sake of this use case, and to explore how we can reuse 1 component for various displays, I have classified #1, #2 and #3 into a “Text” component

Implementation

CQ Project Structure Setup & Tools

Image 2: Project Structure

We are going to use the default CQ structure that allows us to handle components, templates, client libs, etc. I had also set up Eclipse for AEM development, but I was unable to use it to its full extent because of the lack of integration, and I had to fall back on CRXDE Lite several times. Following are the details for each of the folders defined:

  • aem6scratchpad: This is the project application folder and is at the root of  everything we would be coding 
  • components: this folder holds all the components that we will create in this series
  • components/page: this folder holds the implementation of the templates
  • install: this is the folder where any .jar files will get installed
  • templates: holds the templates which will be used to create pages

 

Task: Move Static Files into AEM

Image 3: Final Output in AEM

Image 3 shows that I was able to achieve what I set out to do as the very first step. I set out to take the HTML I received and convert the files into a set of files/components in CQ so that, while the content would not yet be managed, it would look the same. If you go back to where we started, this step is going to make or break how the Development Workflow looks.

Approach: Like a Novice Developer

Coming into this exercise, I did everything that I see happen between a Site Developer and a CQ Developer. I went a bit further and worked the way a CQ developer would work – just dive into the coding and don’t worry about any standards or unit testing (JIT coding, as I have come to call it). I have not coded in CQ from scratch for a long time. I have been reviewing code for best practices, but a lot of my time over the last couple of years has been spent designing solutions. I have looked at code from time to time, but it was a long time since I had coded a set of components and templates from scratch. For this exercise I believe that being in that situation was a very good thing, because that is what our developers go through. Some of them are new to this domain, but even when they are well versed with coding, the approach that is taken is more or less what I ended up doing.

 

Journey: Painful

The journey was nothing but painful all along the way. It took me 5 hours to do what should have been a few minutes’ job. The Site Developer had the code up and running in an HTML file in a browser, and all I had to do was make it work “as is” within CQ. It seemed like the forces of nature were working against me and everything I did had a problem in it. I finally got it up and running (the designs still don’t match exactly as-is), but it was excruciatingly painful. Here is a list of things that went wrong (and don’t be surprised if you find this familiar, as it happens at the start of every project):

The set of tools which I thought would do a wonderful job did not cut it

I was so excited to work with the AEM development tools for Eclipse. I got them up and running after a bit of struggle, but working in Eclipse only helped with the coding. A lot of things that are CQ specific, like creating components, templates and dialogs, weren’t doable in Eclipse. I had to move into CRXDE Lite to do all of that. The good part was that I didn’t have to worry about vault (vlt): I could easily synchronize between the hosted CQ and Eclipse. It made things just a wee bit easier.

I had to write repetitive code into components

Image 4: CQ Includes

Recall that famous set of lines that we need in every component, which, if they don’t get copied, something will stop working. If you have forgotten them you can have a look at Image 4 or download this gist. Well, I tried the shortcut – the ultra-famous shortcut that Adobe also recommends we take (they like to call it adapting an existing component), which is supposed to be easy.

In the CRXDE Lite, create a new component folder in /apps/<myProject>/components/<myComponent> by copying an existing component, such as the Text component, and renaming it.

Well, I did that, but it seemed I kept picking the wrong component, because something or the other didn’t work. There were times when the template would not come up when I wanted to create the page, or, once I opened the page, the component did not show up in the sidekick. It seemed one or another of the properties was wrong.

Where do the client libs go and where do they come from

This was probably the biggest one. The Site Developer had picked whatever version of the client libraries he wanted to work with. This was by design – I intentionally didn’t tell him what to pick. Now, when I had to get the JS and CSS into CQ along with the assets, I had to ensure what he had picked worked for me. No surprises there: once I got the carousel into CQ, the first thing I saw was initialization errors, which, once solved, led to the next problem of the wrong JS and CSS being picked up. I finally, “as a CQ developer”, ended up fixing those and got my carousel to work just as it was working in the static HTML. Yes!! All of the interactions were working.

 

End Result: Not Production-Ready Code

I am going to paste snapshots of the files I ended up creating, with links to them on GitHub. This is not final code and does not have a tenth of the quality standards I would normally deliver to. But the code here will clearly highlight how and where the problem starts.

contentpage.jsp

Image 5: Content Page Template Code

carousel.jsp

Image 6: Carousel Code

 

In Near Future: This is beyond repair

At this point we have seen how the code from Site Developers starts to get into CQ and what shape it takes. When you take the above and give it to 10-odd developers on a project, before you know it you will have 10 different standards (or maybe 5) in the project. All of this happens very fast, because all one has to do is drop in some code and make it work. And then we identify the gravity of the problem when we start doing code reviews and realize that this needs more than just refactoring – more often than not it can now only be fixed via a rewrite. I have not known many projects that have the time and budget to rewrite code. The project releases (thanks to Agile) reach a point where we get workable demos in front of clients – so it all gets tested and is done and dusted. With no automated test cases (unit, integration or functional), the risk of breaking something as we make changes is so high that many will just leave it and hope that the next component is built to a standard.

 

Fix Something

Absolutely!! That’s the intent. But do we really need to fix the technology? Do we really see that the technology was at fault here? If we had a different set of technologies, would this problem be solved for us? Or can we solve this in another way?

 

Previous > AEM Development Workflow > The problem statement

Next > AEM Development Workflow > Coding Old School with Standards (TBD)

TIP: Make Sling Testing Framework work

We have been trying to find the right mix of (automated) unit testing in our project, and I have been looking at the various options that Sling has to offer. This was done for development of AEM-based projects. I tried to follow a few articles to help me get started, and each one of them had some issue or other. I hope this article saves you from spending the 10 hours I did, only to find out that there are tweaks needed to make it work.

  1. http://docs.adobe.com/docs/en/dev-tools/aem-eclipse.html – this is Adobe’s newly released AEM Dev Tools for Eclipse. The documentation targets starting new projects, and there is also documentation for moving existing projects into Eclipse, but the documentation is pretty weak even though it has depth. When I create a new project using archetype version 7 and run the tests, either via Maven or Eclipse, 2 errors come up
    1. The dependency for slf4j is missing, so the tests don’t run
    2. Once you add that dependency, the tests still don’t run. I have tried running via the Eclipse JUnit plugin and the Maven test command. I added an assert statement to fail, and the test cases do not fail
    3. Still open – I don’t know what is wrong here and why these do not work. Still trying to unravel this mystery
  2. http://labs.sixdimensions.com/blog/2013-06-05/creating-integration-tests-apache-sling/ – First of all, this article works if you do exactly as it states. However, if you have the password set to anything but “admin” it will not work. It fails in the step where it has to check whether the bundles have been installed in the (hosted) Sling runtime.
    1. The defect is pretty stupid, and I opened it in Sling’s bug tracking system
    2. Also, the test cases will not work if you are running them via the Eclipse plugin for JUnit; it just doesn’t work

 

Tips to get going quick

If you are looking to work with server-side tests for Sling, I strongly recommend that you start with Dan Klco’s article on Sling’s integration tests and use it as is. But if you want to use a hosted server runtime, then you have to make some changes to the POM.XML in the project that you download, as follows:

Additional properties needed for a hosted server

[code language=”xml”]
<p class="p1"><span class="s2"><</span>sling.additional.bundle.2<span class="s2">></span><span class="s1">jstl</span><span class="s2"></</span>sling.additional.bundle.2<span class="s2">></span></p>
<p class="p1"><span class="s2"><</span>launchpad.http.server.url<span class="s2">></span><span class="s1">http://54.179.160.9:4502</span><span class="s2"></</span>launchpad.http.server.url<span class="s2">></span></p>
<p class="p1"><span class="s2"><</span>test.server.username<span class="s2">></span><span class="s1">admin</span><span class="s2"></</span>test.server.username<span class="s2">></span></p>
<p class="p1"><span class="s2"><</span>test.server.password<span class="s2">></span><span class="s1">admin</span><span class="s2"></</span>test.server.password<span class="s2">> <!– this password has <span class="hiddenGrammarError" pre="has " data-mce-bogus="1">to be</span> admin to work because of the defect (–>https://issues.apache.org/jira/browse/SLING-3873) in sling’s framework –></span></p>
[/code]

 

The following change to the server ready path is needed if you are using AEM 6.x

[code language=”xml”]
<server.ready.path.1>/projects.html:src="/libs/cq/gui/components/common/wcm/clientlibs/wcm.js"</server.ready.path.1>
[/code]

 

Closing thoughts

The Sling testing framework looks to have potential, but the documentation is so bleak that it makes adoption tough. I interact with a lot of CQ (Sling) developers every day, and almost every one of them has some reason not to do unit testing – unit testing seems so chaotic and joyless that it feels like a burden. While developers would like to do it, they feel they are spending much more time writing test cases than writing code, and they do not like it. This is just one example of how the CQ/Sling tooling for unit testing is so primitive and so poorly advertised that it makes things much more difficult for us. If only Sling/Adobe would improve this: not only would they get adoption, people like me would not have to spend several hours just getting it up and running.

 

 

AEM Development Workflow – Part 2 (finding the problem)

While this post is self-sufficient, I recommend that you go back and read up on the context and background of the AEM Development Workflow to be able to fit this one into the bigger picture.

I was planning to build a grand component to prove my point, but it turns out that my HTML and CSS3 skills are not that great, and I had to wait on a Site Developer to provide me with the HTML, CSS and JS for a carousel – and that can take a while. In the interest of moving forward, I adapted, and I am now going to work with the OOTB List component that AEM offers. Why did I pick this component? Because it is a very good example of what we do a lot of the time in AEM component development. We build lists of everything; actually, the way I see it, every component can inherit from a list component and render either 1 item or more – but more about that some other time.

Let’s set the background

Let’s open the OOTB List component. It has a bunch of files that do several things: render lists in different ways, do pagination, show the list as a feed, analytics, a dialog box and a bunch of other things. To show how the component looks, I created a new page and dropped in the OOTB List component. I then configured it to pick a list of child pages and show me a list of those. The following image shows the authoring dialog as well as the output – ignore the Hello World for now (that is me being lazy). We can clearly see that the OOTB component works well and does what it’s supposed to do.

List (foundation) component authoring configuration and rendering

 

List component on Geometrixx

This is all good, but now the real project requirements come into play, and the first thing we notice is that we have to show this same list in a different design; a design that was created by designers. Let’s look at Geometrixx Outdoors and open a Product Listing Page (PLP), where we are sure to find ‘a list component’. The image on the right depicts the component and its authoring configuration. You will notice a few key differences in both the consumer-facing and authoring requirements:

  1. On the authoring side, the most significant difference is that the author now selects a “Display As” value. This property is used by the component logic to apply different styles. These different styles, or views as I call them, are how the component can be rendered on a page (a small sketch of this mechanism follows after this list)
  2. On the consumer side it is the look and feel: the boxes around a list element (a T-Shirt in this case), the price and its currency, etc.
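To make item 1 concrete, here is a tiny hypothetical sketch of how an authored “Display As” value typically drives which view renders – the property name and view paths are made up, and this is not the actual Geometrixx code:

[code language="java"]
import java.util.Map;

// Illustrative only: the authored "Display As" choice selects a view script,
// so each view carries just the markup for its own style.
public class ListViewResolver {

    static String resolveViewScript(Map<String, Object> properties) {
        Object displayAs = properties.getOrDefault("displayAs", "default"); // assumed property name
        return "views/" + displayAs + ".jsp";                               // e.g. views/grid.jsp, views/carousel.jsp
    }

    public static void main(String[] args) {
        Map<String, Object> authored = Map.of("displayAs", "grid");
        System.out.println(resolveViewScript(authored));                    // -> views/grid.jsp
    }
}
[/code]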

When I open up the code I see the following resource type definition

[code language=”java”]

sling:resourceType = geometrixx-outdoors/components/nav_products

[/code]

As we go deeper into the code, we find that the code for rendering this view (read: design) is actually sitting in a localized component of the Geometrixx Outdoors site.

So now let’s look at the code for this rendition from the List component in foundation as well as in Geometrixx (you can get both from the Gist); the lines I want to highlight are as follows

Code from libs/foundation/list

[code language=”java”]

<li>
<p>
<a href="<%= listItem.getPath() %>.html" title="<%= StringEscapeUtils.escapeHtml4(listItem.getTitle()) %>"
onclick="CQ_Analytics.record({event: 'listItemClicked', values: { listItemPath: '<%= listItem.getPath() %>' }, collect: false, options: { obj: this }, componentPath: '<%=resource.getResourceType()%>'})"><%= StringEscapeUtils.escapeHtml4(listItem.getTitle()) %>
</p>
</li>

[/code]

Code from geometrixx-outdoors (list)

[code language=”java”]

<li>
<p>
<a href="<%= listItem.getPath() %>.html" title="<%= StringEscapeUtils.escapeHtml4(listItem.getTitle()) %>"
onclick="CQ_Analytics.record({event: 'listItemClicked', values: { listItemPath: '<%= listItem.getPath() %>' }, collect: false, options: { obj: this }, componentPath: '<%=resource.getResourceType()%>'})"><%= StringEscapeUtils.escapeHtml4(listItem.getTitle()) %>
</p>
</li>

[/code]

If you have not already spotted it, let me highlight the singular fact that this rendering logic embeds presentation attributes like width and height (and where one of them has none, it will render like plain old HTML, as the browser does).

Identification of the problem

My Site Development team delivers to me this great-looking markup that works very well in my browsers. I just received my carousel code from a Site Developer colleague (thanks a lot, Nitin Suri), and I see this great HTML which works fine (download the whole working copy from my GitHub). This is where it all goes haywire. Nitin does not know CQ, while he is an awesome Site Developer, one of the best I have worked with. And I am a guy who knows a little bit of HTML5 but very little CSS3; JS I do okay. I now have to take this great-looking markup from these static files and create the following in CQ:

  1. A component called “Carousel” that does everything the Site Developers have delivered
    1. Identify the set of authorable content areas in this component; I see some key areas in there like Main Heading, Sub Heading, Description and some Slides, where I presume authors can include images, text, rich text, videos, Flash (really, Flash!!) – for now I will presume images only
    2. Authors can choose how many slides they want, whether there should be an auto-rotate and what the wait interval should be, and what sort of effects this carousel needs to have (a rough model of these settings is sketched right after this list)
  2. A template called “Landing Page” where this carousel will be rendered, among other things that I don’t have yet
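Just to pin down what those authorable settings amount to, here is a rough, hypothetical model of the carousel’s content – the field names are mine, not something taken from the delivered HTML:

[code language="java"]
import java.util.List;

// Illustrative only: the authorable properties implied by the breakdown above.
public class CarouselConfig {
    String mainHeading;
    String subHeading;
    String description;
    List<String> slideImagePaths;  // images only, for now
    boolean autoRotate;
    int rotateIntervalSeconds;     // wait interval between slides
}
[/code]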

I don’t see this as the problem, even though it is quoted as the problem everyone wants to fix, and hence, as I read it, the reason this classic model of coding in CQ does not work. Remember my earlier post here, where I spoke about how even Adobe has come up with Sightly as the templating language to fix this inefficiency in the handover.

The problem is how we code this component. Rewind to the way the List component is coded, and we see all of the presentation logic, the business logic – all of it stuffed into a single JSP. Well, if you are going to stuff shit up while coding, you are going to take time to find the meat in the shit. It’s like mining: it takes so long to get to the real stones.

So the question is: why do we have to code the way we code in the classic method? Why can we not go back and write good code and make everyone’s lives easier? The AEM workflow problem is not really an inefficiency in the handover of HTML to CQ developers, but in how we should have been writing the code to begin with. We start here by seeing where the problem starts and how the code has been written. Unfortunately, we see that the OOTB code in AEM, as provided by Adobe itself, is not written to solve the problem. When I speak with Adobe they make it clear that these are reference sites and are to be used for “self-learning”, but little did they know at the time that people would take this as a practice and turn it into a culture.
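To show the direction I mean by “good code” (this is a sketch, not a prescription, and the framework wiring is deliberately left out), the list-building logic can live in a plain Java class that a thin template merely iterates over:

[code language="java"]
import java.util.ArrayList;
import java.util.List;

// Illustrative only: business logic in a plain, testable class; the JSP or Sightly
// template is then reduced to looping over the result and emitting markup.
public class ListModel {

    public static class Item {
        public final String path;
        public final String title;

        public Item(String path, String title) {
            this.path = path;
            this.title = title;
        }
    }

    // Selection/trimming logic kept out of the template, so it can be unit tested
    // without a running CQ instance.
    public List<Item> buildList(List<Item> childPages, int maxItems) {
        List<Item> result = new ArrayList<>();
        for (Item page : childPages) {
            if (result.size() >= maxItems) {
                break;
            }
            result.add(page);
        }
        return result;
    }
}
[/code]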

 

Previous > 

Next > AEM Development Workflow – Part 3 Coding Old School

Tip: Using AEM Developer Tools for Eclipse

If you have been working in AEM for a while now, you are probably aware of the tools we used to have, like CRXDE, CRXDE Lite and VLT. With the latest release of AEM 6.x, Adobe has released an Eclipse plugin that can be used going forward. This page here provides information on how you can start off new project development using these tools. However, if you are working on an existing project on a previous version of AEM (say 5.x), then you need to refer to another instruction guide here. That guide has a “Getting Started in 5 minutes” section on how to create a new project and then import it into Eclipse. If you are already on an existing project you may not need it, but if you are a Maven buff and want to start off by working with a Maven archetype, then this guide will fail at a specific step. I realized that the guide is incomplete and some key steps are missing to make this work.

If you are interested in using Eclipse for working on JSPs instead of CRXDE, then the step you will be most interested in is How to work with JSPs. For your JSPs to work you will need to add dependencies for the taglibs to compile, and the guide provides a list of dependencies to add. That list is not exhaustive, and the content bundle project will not compile. When working through that guide, you have to use the dependencies from this gist.

This is a key step: unless you get all those dependencies, and your project is brand new, your project will not compile. If you are using an existing project it may just work if you have added those dependencies, but in my experience, since we don’t compile JSPs at build time, those dependencies will not kick in. The code still works on the CQ server because those dependencies are available at runtime.

 

AEM Development Workflow – Part 1 (introduction)

If you have followed my blog recently, you will know that I have been debating the development workflow for a CMS-based application being developed on top of Adobe AEM. If you missed those posts you can read them here and here. To summarize and set context:

The challenge we have seen is too much time spent integrating the HTML/JS/CSS delivered by the Site Developers into AEM’s components and templates. And the solution that seems to be cropping up is not to fix the process, but to pick technologies that change the workflow fundamentally. More specifically, I have seen some suggestions that try to solve many other things and lose sight of this simplest of problems.

Technology should really be an enabler, and in this case, if there is a technology that can enable us to do something better and faster, why not. What I am thinking through is whether there is a universal solution to this problem: can we apply one solution pattern or technology to all solutions, CMS or not, AEM or Tridion, etc., or does there need to be a more scientific way of arriving at the solution.

Referring to a few slides from Adobe’s presentation here, I want to highlight two specific slides (merged here into one) showing where the problem now stands: how the components get built and how much time it takes (the problem itself). If you have read the entire presentation, you will already know that Adobe’s answer to fixing the problem is Sightly.

What I want to do here is compare 3 workflows and see what each one has to offer and what the best possible way is to remove this inefficiency or improve productivity.

  1. Follow the current set of technologies (JSP/Java) but change the way of working, aka a different set of tools, trainings and processes
  2. Use Sightly – the new templating language pushed by AEM
  3. Use other templating languages like Handlebars or Angular, which are more platform agnostic and go beyond just CMS and AEM (old-school application development also fits)

 

There will be a set of posts in the coming weeks as we go through each of those 3 ways of writing the application. For the sake of this exercise, I am going to take on one of the most common components ever used in the CMS world – the carousel. For each of these activities I will elaborate on the steps it takes to get from a design to the end product, aka the final working part for a consumer. The solution will be built in Adobe AEM (not going to touch any other set of CMS tools for now; maybe later). To look at the end results objectively, I will use the following parameters for comparison across the methods:

  1. Level of Effort (LOE) spent by a Site Developer in converting the design into working front-end code
  2. LOE spent by an AEM developer to take the front-end code and convert it into a CQ component
  3. LOE spent on any sort of integration and bug fixes between the two paradigms

 

As we progress, I will post sample code and screenshots of what I am trying to do. I will also publish a package (code, configuration, content) that you should be able to run on an instance of AEM if you have access to one. I will also publish the tools I use along the way – including instructions for setting them up on your boxes. If you like this, subscribe to this URL and you can directly see posts applicable only to this topic.

 

Next > Finding the Problem

(Unit Testing) – Cost vs. Benefit

I have been a big fan of unit testing for a very long time; my blog is riddled with posts about it. If I look at how automated unit testing has progressed in my projects over the last decade, I notice that the inclination to do unit testing is just not there. Recently, I came across a podcast series between Martin Fowler, Kent Beck and DHH, which was in response to DHH’s post “TDD is dead, long live Testing”. The podcast didn’t reveal anything new to me, but it helped enlighten me. I spent last evening introspecting on how I have progressed TDD/unit testing in general over my projects, and the trend I see is that where I have been more involved as a developer, a senior developer or even an architect, the effort to make unit testing happen has been there, BUT unit testing (not TDD) didn’t work very well on my last big project. And since I have taken up the responsibilities of a Solution Architect and, with those responsibilities, moved away from code a bit, I see the next layer not doing much unit testing.

As I spent time thinking about what the causes could be, this is where the podcast series helped a lot. The reasons I can attribute to unit testing (not TDD) just not happening on projects recently:

  1. Writing Automated Unit Tests is just not as fun as Coding is
  2. Writing Automated Unit Tests is just not as fun as Coding is

You get the point! I compare the case when I first did unit testing back in 2004 (my first project with an IT consultancy organization) with the last one in 2012, and I can see why that was the case. The environments in which I was doing unit testing in 2004 and in 2012 (8 years apart) were very different. Back in the day, I was an Enterprise/Core Java/.NET developer, and my life was about things like making database calls and writing services – business logic, validations, yada yada yada. The part which involved writing automated test cases for the web layer was minimal, because the code that went into the web controller (the MVC frameworks we used varied) was only the view. And we used a functional/system testing approach to test the UI. The responsibility for creating those Selenium test cases was not with the QA function but still with me as the developer. So the whole testing was “what makes sense” – we did TDD on services code and then functional testing (post coding) on the front end. But the most important part was that, as a developer, I was supposed to make sure that nothing broke when I committed the code. The CI tools were not that advanced back in the day, but they were there. We used Ant for the build, and there was a task in there which was responsible for running the test cases and failing the build otherwise (I even coded the Ant build script). We were not able to integrate the Selenium test runner into the Ant script, so that was a manual step, but we had to run the whole suite to ensure everything worked. The phenomenon of “Self Testing Code” existed and we followed it to the core. There were build-break fines in place – 50 INR as a fine if we broke a build; 3 consecutive breaks and it went up to 500 INR; 20 INR for any regression defects introduced (QA had their own suites to keep us in check). When I coded and when I committed the code, I knew that no one was going to call me at night to tell me that other members of my team were stuck. I was working in India and I had a team working from Boston, and I knew that between when I left and when they came in, they wouldn’t be gated because I fucked up a compilation or broke something – they could just go about doing their work and be productive. Everyone did so, and we were all a happy bunch of developers. We did not have Continuous Delivery, but once a story was done, a couple of rounds of QA testing and it was demoable to the clients. Our test coverage was 80%+ and we had a solid test suite. The suite, however, took 2 hours to run (I will talk about this very soon, as that was the part which made unit testing a pleasure).

That was the last time I saw the workflow work so well. Surprisingly, it was also the first time I was doing it.

Fast forward to 2007-08; I was the architect on a 40-person team (developers, QA, Site Developers) and we were working on web development – some heavy web development. It was using Backbase and JSF on JBoss, talking to MySQL via Hibernate, everything wrapped in Spring. We started the project and, like before, I wanted to do full-blown TDD (not just unit testing). We started, but very soon (Sprint 1) there was feedback from developers that unit testing was too tough, that it takes more time than it takes to code. As the architect I had produced a golden copy as a reference point which the team was using, so the feedback the team was giving me made no sense. I let the team struggle with it for another sprint, thinking they just needed time to come around, but when it started to hit our velocity points I was forced to see what was going on in that unit testing department, and what I saw shocked the soul out of me. The automated tests were there as a formality – the tests did test something, and they provided some coverage, but they were monstrous. They were riddled with mocks (almost) everywhere. No wonder it took the team so much time to write test cases. For everything that needed testing they had to create a mock. A bit of it was lack of experience as well, where the team did not create helper classes to provide those common mocks, but they were trying to mock things like HttpSession, HttpRequest, data adapters for the MySQL logic – aka every collaborator was in there. Well, they did the right thing (per the unit testing definition) – test the unit itself – so they mocked even the classes making calls to the database. It made it so much harder to test a piece of functionality. We quickly turned it around and set some ground rules

  1. We will not use mocks
  2. If we find a scenario where mocks could not be avoided we will not write automated unit tests for it

What this meant was a) this was not unit testing in its purest form (I don’t even know what unit testing is in its purest form, only what the definitions and my seniors told me), and b) we had a bit less coverage. But I was okay with those consequences. The outcome was that the team no longer needed to write stuff that was used just for testing; it made our whole test suite a bit slower because it now made calls to the DB, but the DB was local, so it was not that slow. The team started to spend less time on unit testing, we wrote fewer tests and achieved better code coverage. This wasn’t unit testing, but we did achieve Self Testing Code. Our CI build was once again telling us if a commit broke something else. The fun was coming back into the game and we started to make progress. Then, in Jan 2008, we decided to switch out Backbase and move over to Flex, and this needed a redesign not just in our front end but also in our back-end services. When we put together a team to redesign and reimplement the services, they were given a solid test suite of 300-odd tests. The implementation was thrown away and a few interfaces were changed. We fixed only the interface part of the test suite (we made a decision to let our QA team fix the test suite – which needed a QA person who could code/script in Java, and that caused a lot of pushback, but I won that battle), then started writing those services again, and all we had to do was make those 300 test cases pass. It was a breeze; we didn’t even need a QA team to test our services. Our test suite was good enough to do it.
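To make those ground rules concrete, here is a minimal, invented illustration of the style of test we settled on – the classes are defined inline just for the example; on the real project the collaborator was the local database rather than an in-memory list:

[code language="java"]
import static org.junit.Assert.assertEquals;

import java.util.ArrayList;
import java.util.List;
import org.junit.Test;

// Illustrative only: the service is exercised through its real collaborator, no mocks.
public class RosterServiceTest {

    static class AthleteStore {                       // real collaborator, not a mock
        private final List<String> names = new ArrayList<>();
        void save(String name) { names.add(name); }
        List<String> findAll() { return names; }
    }

    static class RosterService {
        private final AthleteStore store;
        RosterService(AthleteStore store) { this.store = store; }
        void register(String name) { store.save(name.trim()); }
        int rosterSize() { return store.findAll().size(); }
    }

    @Test
    public void registeredAthleteIsVisibleThroughTheRealStore() {
        RosterService service = new RosterService(new AthleteStore());
        service.register("  Usain Bolt ");
        assertEquals(1, service.rosterSize());
    }
}
[/code]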

We never achieved perfection, but I was able to meet the goals.

Fast forward to 2012-2013; I was the architect on an Adobe CQ 5.5 project and the frameworks had changed. There were no services we were writing. The traditional MVC framework was simply not there at the developers’ disposal. I saw the code and there was the monstrosity of JSPs having business logic in scriptlets, single-tiered Java classes which did everything from getting data, manipulating it and validating inputs to sending outputs in very fluid formats like Maps (ValueMaps, to be specific to the Sling implementation). Amongst all this we made some hard design decisions, like no scriptlets or business logic in JSPs. Hence there was now going to be Java code, but there were numerous Sling and JCR dependencies that we had to deal with. I once again wanted to do what I had done before – no mocks/stubs. But the frameworks to be able to do that weren’t at my disposal back then. They existed, but the community around them was thin, support was sparse, and whatever existed was not going to work. So I ended up deciding to use mocks. 3 months in and it was exactly like 2008 – it wasn’t working.

The mocks were so bad that we could change the logic and tests would still pass.

What had happened was that developers just mangled everything, and the tests were not even testing state or behavior. They just went for coverage. The tools were better this time and we had Sonar. The idea was to write a test case, make it pass, and show increased coverage in Sonar. We had used mockit and it seemed like a good idea at the time, but it just didn’t work. I understand mocks, but I just don’t get them here. We set up some state and we check for that state – all makes sense. But when everything is coming from collaborators and the object you are testing is just aggregating and assembling data into a value object (a Map in our case), testing that assembly makes no sense. It’s like saying a developer can get the following statement wrong.

[code language=”java”]

outputData.put("athletes", athleteList);

[/code]

Well, if a developer can’t get a bunch of basic Java statements right, they have no business coding; let’s send them to a training and save ourselves a world of pain.

Fast forward to 2014. I am now a solution architect and not focusing a whole lot on coding per se. I love to code, but it’s not my day job (I would like coding to be my day job some day). But now I see teams struggle a lot to get unit testing even off the ground. The environment is similar to what I had back in 2012, but now I do have a solution, and I think there is a way of doing it without mocks. Yet the struggle is there; developers don’t see this as fun any more – it looks like a burden.

The cost vs. benefit argument is a valid one in every scenario, and in our software development lifecycle, where getting any sort of confidence is just so hard, the need as a developer to have something to fall back on – knowing that we are not breaking the system with our next commit – is huge. In my development lifecycle we can probably still live with some regression, but as we reach application maturity and start to reach stabilization, not having Self Testing Code makes us like mammoths (not elephants) who move ever so slowly. In the current world of marketing, where we have clients who want to run campaigns in the next 2 weeks, we can’t be slow in how soon we release code. You can only be relevant in the industry if you can move quickly, achieve Continuous Delivery, or at least reach a point where you can get releases out to production at a reasonable speed. The days when releases happened once every 6 months are gone; or at least gone in the environment/market I am operating in.

We have to have Self Testing Code, we have to unit test, we have to have feedback loops, and we have to have those feedback loops as soon as possible. There is no avoiding it; let’s understand and embrace it, or be extinct in a few years.

 

Unit Testing – Why not?

For JUnit implementation in our project, we see a great challenge in having them implemented as we are already running behind for Sprint 2 and Sprint 3. Team was provided with all the knowledge walkthrough of the JUnit implementation and example test case. I do not see the team would be able to meet the JUnit coverage as expected. We need to first come to track in terms of delivery and can plan to write JUnits as team gets bandwidth. Please let us know your thoughts.

Really – you are going to give me that shit yet again. Does the behind-ness of your project not tell you something? Does it not even cross your mind that if developers were testing their code there might actually be fewer problems, and maybe the project would be on track?

This is how I would have responded a week back. But today I am willing to have a reasonable conversation to understand the pain points which make it difficult for the team to do “automated” unit testing (automated being the key word here). Let’s go back to the very core of unit testing, and even before that – Self Testing Code. Quoting Martin Fowler …

These kinds of benefits are often talked about with respect to TestDrivenDevelopment (TDD), but it’s useful to separate the concepts of TDD and self-testing code. I think of TDD as a particular practice whose benefits include producing self-testing code. It’s a great way to do it, and TDD is a technique I’m a big fan of. But you can also produce self-testing code by writing tests after writing code – although you can’t consider your work to be done until you have the tests (and they pass). The important point of self-testing code is that you have the tests, not how you got to them.

Read this para again and again until you imprint the 2 key messages: a) you can’t consider your work to be done until you have the tests and they pass, and b) the important point is that you have the tests, not how you got to them. Now that you understand this fully (hopefully), let me ask you this simple question – are the developers doing any unit testing at all (again, not necessarily automated)? If the answer is no, then I have a problem and I am outraged.

I am a big fan of unit testing; my first experience was on my first job, it was with TDD, and yes, it was an adrenaline rush, a dose of dopamine (as David H. says in his podcast). I saw its effectiveness when, at one stage, we decided to throw away the implementation because we realized that it would just not scale. We were 6 months into the development, we had a lot of LOC and, alongside it, a solid test suite of 400+ test cases. When we decided we needed to do something fundamentally different, we chose to keep the test suite (we used NUnit back in the day). Over the course of the next 2 months, as we wrote the new implementation, those test cases were our feedback loop. Our job (a team of 4 developers) was to ensure all of those 400+ test cases turned green. The day we saw all of them green, we knew we were exactly where (in functional delivery) we had been when we threw the code/implementation out. The speed at which we redid the implementation came from the confidence we had in the new code we were writing – we were being told almost immediately if we did something right or wrong (they were not unit tests as per the definition – but that is for later). Then a few years later, in 2008, we did the same thing again: it was a 9 month x 25 people implementation which we redid, and it took us 3 months x 20 people to catch up and meet the same functional delivery.

In this situation I read that a project is behind track, and they need to go faster with confidence to catch up and meet delivery timelines. Not doing unit testing (again, not necessarily “automated”) is like someone telling me – Kapil, you are driving to your wedding and you are late, and now you have to drive faster. But instead of putting in a few more airbags and giving you a better set of wheels and better brakes, we are going to take away the 1 airbag you have today and also replace your wheels with an older set. Now go drive, or else you will not get married. What do you think I would do – drive faster and risk my life, or start driving even slower because I hope that my wife-to-be loves me enough to know that I had no option but to drive slowly? Well, you get the point – while I may get married, she is going to stay mad at me for a very long time for ruining her perfect day.

I have not yet broken down the problem of the delay on this project (which I will have to), but I can bet my ridiculous salary that it’s because of 2 things – a) “regression” that developers keep introducing, and b) sending functionally incomplete code to QA, where defects are found and then it all comes back to them, and in some cases these defects are found even in the client’s layer of testing. In my experience, this is how my ideal day as a developer used to look:

  1. Day 1
    1. Understand requirements – 1 hour
    2. Design and Implement and unit test- 6 hours
    3. Test (integration + functional/system) – 1 hour
  2. Day 2
    1. Repeat Day 1

And when I delivered code to QA I hardly got back any defects to fix, and I would continue this cycle day in and day out. The times would stretch if someone did the estimates wrong, but the activities always remained the same, and the time spent in each of them always increased in proportion. And I have seen developers do that. But more recently, this is how a developer’s day looks:

  1. Day 1
    1. Implement new story – 12 hours
  2. Day 2
    1. Fix Defects from Day 1 – 4 hours
    2. Implement new story – 12 hours

Well, you can already see that on Day 2 the developer is fixing their technical debt and struggling to catch up. This endless cycle, let’s call it the “cyclone of technical debt”, is unrecoverable. The project will deliver one day and people will get burnt. Developers blame estimations, and architects blame skills, training and indiscipline. Well, let’s just call it a failed team.

As a developer I ask you – why would you not want to get that high? Why would you not want to be in a state of confidence where you know that your code commit is not going to break something else? Why would you not unit test?

Square pegs in round holes

This is part 2 of a series of articles I have just started to write. I previously spoke about Thick Clients (SPA) and CMS and what sort of problems we have.

The World Wide Web had a boost back in the 2000s, and more recently there has been a huge surge in web frameworks and, more importantly, in the kind of user experience we have to deliver to our consumers. The web is no longer a way to push some static content; it’s more about telling a compelling story, making people come to you and buy stuff. MyVessyl is one of the cool things that many of us would like to buy, and why we would want to buy it is partly because of its utility in daily life, but more so because of how it has been marketed to us. As you watch the site, it’s just so cool how it looks (watch the video with the sound on).

Let’s start decomposing the home page:

  1. Rich content including videos, images, animated graphics and of course text
  2. There are parallax effects on the web site, making it a very rich user experience
  3. There are Calls To Action (CTA), some changing the behavior of the site on the page itself and some taking you away
  4. Dynamic capability like “placing an order” and “My Account”

If this were a typical project, we would start by ideating on the concept, which then gets passed along to the creative team to wireframe it up and apply designs, and then it reaches the implementation team, where developers pick it up to convert it into code in whatever technology stack has been picked, and it all ends with everyone then trying to bridge the gaps and make it all come together. This process is just so complex, and it has literally no collaboration until we hit the last step, when we need to get things up and running.

 


 

This was not a very big problem a few years back when we had people who were T-shaped, but skill development has changed dramatically, to the point where we have started to have specialists in jobs. The job profiles and the need to have specialists have landed us in this situation: at some point we thought it would help us improve productivity, but we find ourselves with the problem that handshakes, and getting things to work holistically, cost us more time than they did in the past. Well, who knew this would happen – side effects of evolution, I guess.

Here is our Square Peg

 

In my last post around Thick Clients and CMS, the way we are moving is to make our web client ‘thick’, and the most common rationale I have heard is that we want to solve the handshake problem. It is true that if we allow a site developer to work amongst their own tools they will be more productive, and our backend engineers will be more productive doing what they do best (Java, .NET, PHP, or whatever). So the middle ground I hear we can achieve is to make our engineers expose content as a service and then let the front-end developer consume and showcase the data. If you are interested in seeing such an implementation, you can hop over to SapientNitro’s newly launched site. SapientNitro’s site is a good example of using similar technologies and making a SPA in the middle of a content managed system, and, to my delight, Adobe’s AEM.

Going back, the most common argument I am hearing in this space today is that we have to go to SPA applications because it helps us solve the developer productivity problem, which I don’t understand.

Here is our Round Hole

The way I see this – SPA does not seem like the logical next step in this evolutionary cycle. It looks like the easiest way to get there, and it will lead to several other problems in a CMS world. SapientNitro was redesigned to give our viewers a phenomenal experience – it had to be what SapientNitro stands for, hence the mobile-first design, and for that matter every design decision was made to meet that experience. The technology and architectural choices from there on were just to enable the experience. Now that, to me, makes sense.

Back to the dilemma – I was not sure if SPA was the way to go just because it helped us solve a developer handshake problem, and now I am confident that it is not a good way. We would simply be pushing a wrong solution onto a problem.

 

Next to come – How do we solve this in AEM