Caching Architecture (Adobe AEM) – Part 1

Cache (as defined by Wikipedia) is a component that transparently stores data so that future requests for that data can be served faster. I presume that you understand cache as a component and the architectural patterns around caching, so I will not go into the depths of caching in this article. This article covers some of the fundamentals of caching (wherever relevant) and then takes a deep dive into a point of view on caching architecture for a Content Management Platform, in the context of an Adobe AEM implementation.


Problem Statement

Principles for high performance and high availability don’t change, but for the sake of conversation let’s assume we have a website that has to meet the following needs.

  • 1 Billion hits on a weekend (hit is defined by a call to the resource and includes static resources like CSS, JS, Images, etc.)
  • 700 million hits in a day
  • 7.2 million page views in a day
  • 2.2 million page views in an hour
  • 80K hits in a second
  • 40K page views in a minute
  • 612 page views in a second
  • 24×7 site availability
  • 99.99% uptime
  • Content availability to consumers in under 5 minutes from the time editors publish content
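To put these targets in perspective, here is a quick back-of-the-envelope sketch in Python. It uses only the figures from the list above; the function names are illustrative, and the averages assume traffic spread evenly over the day, which real traffic never is.

```python
# Back-of-the-envelope checks on the targets listed above.
SECONDS_PER_DAY = 24 * 60 * 60
MINUTES_PER_YEAR = 365 * 24 * 60


def average_per_second(total_per_day: float) -> float:
    """Average rate if the daily total were spread evenly over the day."""
    return total_per_day / SECONDS_PER_DAY


def allowed_downtime_minutes_per_year(uptime_percent: float) -> float:
    """Downtime budget implied by an uptime percentage (365-day year)."""
    return MINUTES_PER_YEAR * (1 - uptime_percent / 100)


avg_hits = average_per_second(700_000_000)            # daily hits, averaged
peak_to_average = 80_000 / avg_hits                   # peak hits/s vs. average
downtime = allowed_downtime_minutes_per_year(99.99)   # minutes per year

print(f"average hits/s: {avg_hits:,.0f}")
print(f"peak-to-average ratio: {peak_to_average:.1f}")
print(f"downtime budget: {downtime:.1f} min/year")
```

The takeaway is that the peak (80K hits per second) is roughly ten times the daily average, and that 99.99% uptime leaves well under an hour of downtime per year.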

While the numbers look steep, the use case is not an uncommon one. In a world where everyone is moving to devices and digital, brands run campaigns, and when those campaigns run we need to support such steep loads. These loads don’t stay for long, but when they come they come fast and thick, and we have to support them.

For the record, this is not some random theory I am writing. I have had the opportunity of being on a project (which I can’t name) where we supported similar numbers.

The use case I picked here is a Digital Media Platform where a large portion of the content is static, but the principles I am going to talk about here apply to any other platform or application.


The problems that we want to solve here are:

  • Performance: Caching is a pattern that we employ to increase the overall performance of the application by storing the (processed) data in a store that is a) closest to the consumer of the data and b) quickly accessible
  • Scalability: When we need to make the same data set available to various consumers of the system, caching as a pattern makes it possible for us to scale the systems much better. Because the cache holds data that has already been processed, we avoid running the same processing time and again, which facilitates scalability
  • Availability: Building on similar principles as of scalability, caching allows us to put in place data in areas where systems/components can survive outages be it network or other components. While it may lead to surfacing stale data at points, the systems are still available to the end users.


Adobe AEM Perspective

Adobe AEM as an application container (let’s remember AEM is not built on top of a true application container, though you can deploy it in one) has its own nuances with scalability. In this article I will not dive into the scalability aspects, but scaling the AEM publishers horizontally increases several concerns around operations, licensing, cost, etc. The OOTB architecture and components we get with Adobe AEM themselves tell you to make use of cache on the web servers (using Dispatcher). However, when you have to support the Non-Functional Requirements (NFRs) I listed above, the standard OOTB architecture falls short unless you back it with massive infrastructure. We can’t just set up CQ with some backend servers and an Apache front end with local cache, throw hardware at it and hope it will come together magically.

As I explained, there is no magic in this world and everything needs to happen via some sort of science. Let’s put this in perspective: a standard Apache web server can handle a few thousand requests per second, and we need to handle 80K hits per second, which include resources like HTML, JS, CSS and images in a variety of sizes. Without going into the sizing aspects, it is pretty clear that you would need not just a cluster of servers but a farm of servers to cater to all that traffic. With a farm of servers, you get the nightmare of setting up an organization and processes around operations and maintenance to keep the site up and running 24×7.



Cache, cache and cache



Before we dive into the design, we need to understand some key aspects of caching. These concepts are used in the POV below and it is critical that you understand the terminology clearly.

  • Cache miss refers to a scenario where a process requests data from a cache store and the object does not exist in the store
  • Cache hit refers to a scenario where a process requests data from a cache store and the data is available in the store. This can happen only when a cache object for the request has been primed
  • Cache prime is a term associated with the process where we fill up the cache store with data. There are two ways in which this can be achieved
    • Pre-prime is the method where we proactively run a process (generally at the startup of the application) to load all the objects whose state we are aware of and can cache. This can be achieved either with an asynchronous process or by an event notification
    • On-demand prime is the method where cache objects are primed in real time, i.e. when a process that needs the resource does not find it in the cache store, the cache is primed as the resource is served back to the process
  • Expiration is a mechanism that allows the cache controller to remove objects from memory based on either a time duration or an event
    • TTL, or Time To Live, defines a time duration for which data can live on a computer. In the caching world this is a common expiration strategy where a cache object is expired (flagged to be evicted if needed) based on a time duration provided when the cache object is created
    • Event-based expiration (I can’t find a standard naming convention for this) is a method where we fire an event to mark a cache object as expired, so that if needed it can be evicted from memory to make way for new objects or re-primed on the next request
  • Eviction is a mechanism by which cache objects are removed from the cache memory to optimize for space
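The terms above can be tied together in a small sketch. This is a toy in-memory cache in Python, written only to illustrate the vocabulary; the class and method names are my own, and it picks least-recently-used eviction as one possible policy. It is not how any of the products discussed below implement caching.

```python
import time
from collections import OrderedDict


class SimpleCache:
    """Toy cache illustrating prime, hit/miss, expiration and eviction."""

    def __init__(self, loader, max_entries, ttl_seconds, clock=time.monotonic):
        self._loader = loader              # produces the value on a cache miss
        self._max_entries = max_entries
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = OrderedDict()        # key -> (value, expires_at)
        self.hits = 0
        self.misses = 0

    def prime(self, keys):
        """Pre-prime: proactively load known keys, e.g. at startup."""
        for key in keys:
            self._put(key, self._loader(key))

    def expire(self, key):
        """Event-based expiration: mark an entry as expired, e.g. on publish."""
        if key in self._store:
            value, _ = self._store[key]
            self._store[key] = (value, float("-inf"))

    def get(self, key):
        entry = self._store.get(key)
        if entry is not None and entry[1] > self._clock():
            self.hits += 1                 # cache hit
            self._store.move_to_end(key)   # mark as most recently used
            return entry[0]
        self.misses += 1                   # cache miss (absent or expired)
        value = self._loader(key)          # on-demand prime while serving
        self._put(key, value)
        return value

    def _put(self, key, value):
        self._store[key] = (value, self._clock() + self._ttl)  # TTL expiration
        self._store.move_to_end(key)
        while len(self._store) > self._max_entries:
            self._store.popitem(last=False)  # eviction: drop least recently used


now = [0.0]  # fake clock so the example is deterministic
cache = SimpleCache(lambda key: f"rendered:{key}", max_entries=2,
                    ttl_seconds=60, clock=lambda: now[0])
cache.prime(["/home"])
cache.get("/home")      # hit: the entry was pre-primed
cache.get("/news")      # miss: primed on demand
cache.expire("/news")   # event-based expiration
cache.get("/news")      # miss again: the entry was marked expired
now[0] = 61.0
cache.get("/home")      # miss: the TTL has elapsed
print(cache.hits, cache.misses)
```

Note that expiration only flags an entry; the object is physically removed either when it is re-primed or when eviction needs the space.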


The Design // Where Data should be Cached

This point of view builds on top of an architecture that was designed by a team I was a part of (I was the lead architect) for a digital media platform. The NFRs we spoke about earlier are the ones that we successfully supported on the platform during a weekend as part of a campaign the brand was running. Since then we have continued to support very high traffic every time there are such events and campaigns. The design that we have in place takes care of various layers of caching in the architecture.


Local clients

When we talk about such high traffic load, we must understand that this traffic has certain characteristics that work in our favor.

a) All this traffic is generated by a smaller subset of people who access the site. In our case, when we say that we serve 1 billion hits on a single weekend, it is worthwhile to note that there are only 300,000 visitors on the site on that day who generate this much load

b) A very large portion of all this traffic is static in nature (this is also one of the characteristics of a Digital Media Platform). It consists of resources like JavaScript and CSS, and media assets like images and videos. These are files which, once deployed, seldom change, or change with a release that doesn’t happen every day

c) As the users interact with the site/platform, there is content and data which maps back to their profile and preferences. It does not necessarily change frequently, and it is managed directly and only by the users themselves

With these characteristics in play, there are things which we can cache right off the bat on the client’s machine, i.e. the browser (or device). The MIME types which qualify for client-side caching are images, fonts, CSS and JavaScript. Some of these can be cached indefinitely, while others should be cached for a medium duration such as a couple of days.
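Client-side caching is usually driven by the HTTP `Cache-Control` response header. The sketch below shows one way to map MIME types to a header value in Python. The policy table and durations are hypothetical, chosen only to mirror the "long-lived" versus "a couple of days" split described above; they are not the values used on the project.

```python
ONE_DAY = 24 * 60 * 60  # seconds

# Hypothetical policy: MIME type -> browser cache lifetime in seconds.
CACHE_POLICY = {
    "font/woff2": 365 * ONE_DAY,            # effectively "forever"
    "image/png": 365 * ONE_DAY,
    "image/jpeg": 365 * ONE_DAY,
    "text/css": 2 * ONE_DAY,                # medium duration
    "application/javascript": 2 * ONE_DAY,
}


def cache_control_header(mime_type: str) -> str:
    """Return a Cache-Control value for a response of the given MIME type."""
    max_age = CACHE_POLICY.get(mime_type)
    if max_age is None:
        # Anything not listed must be revalidated with the server before reuse.
        return "no-cache"
    return f"public, max-age={max_age}"


for mime in ("image/png", "text/css", "text/html"):
    print(mime, "->", cache_control_header(mime))
```

Long lifetimes are only safe when a changed file gets a new URL (for example a versioned file name), otherwise browsers keep serving the old copy until the lifetime runs out.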


Akamai (CDN)

Content Delivery Networks (CDNs) are service providers that enable serving content to end consumers with high performance and availability. To know more about CDNs, you can read here. In the overall architecture the CDN plays a very critical role. Akamai, Amazon CloudFront and Cloudflare are some of the CDN providers with which we integrate very well.

Above all else, the CDN gives us a highly available architecture. Some CDNs let you configure your environment such that if the origin servers (pointing to your data center) are unavailable, they continue to serve content for a limited time from their local cache. This essentially means that while the backend services may be down, the consumer-facing site is never down. Some aspects of the platform, like delivery of newly published content, are affected, and in certain cases like breaking news we may have an impact on an SLA, but the consumers/end users never see your site as unavailable.

In our architecture we use the CDN to cache all static content, be it HTML pages, images or videos. Once static content is published via the Content Delivery Network, it is cached on the CDN for a certain duration. These durations are determined by how often the content needs to refresh, but the underlying philosophy is to break the content into tiers such as Platinum, Gold, Silver and Bronze and then assign the duration for which each tier is cached. On a platform like NFL where we are, say, pushing game feeds, these are classified as Platinum and have a TTL of 5 seconds, while content types like the Home Page and News (not breaking news) have a TTL of 10 minutes and are classified as Gold. On the same project, sections like search have a TTL of 2 hours (or so) and are classified as Bronze. The intent was to identify and classify most, if not all, of the key sections and ensure that we leverage the CDN cache effectively.
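The tiering idea can be expressed as a small lookup. The Python sketch below uses the TTLs quoted above (5 seconds, 10 minutes, 2 hours); the URL patterns and the default tier are assumptions made up for illustration, and a real CDN would hold this mapping in its own configuration rather than in application code.

```python
import re

# TTLs per tier, taken from the figures in the text.
TIER_TTL_SECONDS = {
    "platinum": 5,            # e.g. live game feeds
    "gold": 10 * 60,          # e.g. home page, regular news
    "bronze": 2 * 60 * 60,    # e.g. search sections
}

# Hypothetical URL patterns; the first matching rule wins.
TIER_RULES = [
    (re.compile(r"^/feeds/"), "platinum"),
    (re.compile(r"^/(index\.html)?$"), "gold"),
    (re.compile(r"^/news/"), "gold"),
    (re.compile(r"^/search"), "bronze"),
]

DEFAULT_TIER = "gold"  # assumption: unclassified content gets a middle tier


def cdn_ttl_seconds(path: str) -> int:
    """Return the CDN TTL for a URL path based on its content tier."""
    for pattern, tier in TIER_RULES:
        if pattern.search(path):
            return TIER_TTL_SECONDS[tier]
    return TIER_TTL_SECONDS[DEFAULT_TIER]


for path in ("/feeds/game/123.json", "/", "/news/today.html", "/search?q=cache"):
    print(path, "->", cdn_ttl_seconds(path), "seconds")
```

Keeping the tier-to-TTL table separate from the URL rules means a TTL can be tuned for a whole tier without touching the classification.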

We have observed that even for shorter TTLs like Platinum, the offload percentage (the share of all hits that the CDN serves from its cache instead of sending to the backend) grows with a spike in traffic: it touched a peak of 99.9%, while the average offload is around 98%.
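To see what those offload percentages mean for the backend, here is a small Python sketch. It treats offload as the share of all hits served by the CDN, and applies the 98% and 99.9% figures to the 80K hits per second target from the problem statement; the function names are illustrative.

```python
def offload_percent(total_hits: float, origin_hits: float) -> float:
    """Share of all hits that the CDN served without going to the origin."""
    return 100.0 * (total_hits - origin_hits) / total_hits


def origin_hits_per_second(total_hits_per_second: float, offload_pct: float) -> float:
    """Hits per second that still reach the origin at a given offload."""
    return total_hits_per_second * (1 - offload_pct / 100.0)


peak = 80_000  # hits per second, from the problem statement
for pct in (98.0, 99.9):
    print(f"offload {pct}% -> {origin_hits_per_second(peak, pct):,.0f} hits/s at origin")
```

The gap between 98% and 99.9% looks small, but it is a twentyfold difference in what the origin has to absorb.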



Varnish

Varnish is a web accelerator (if I may classify it, it is a web server on steroids). If you are hearing its name for the first time, I strongly urge you to hop over here to get to know more about it.

We had introduced Varnish as a layer to solve for the following:

  1. Boost Performance (reduce number of servers) – We have found that Varnish, being an in-memory accelerator, gives you a boost of anywhere between 5x and 10x over using Apache alone. This basically means that you can handle several times the load with Varnish sitting in front of Apache. We did rigorous testing to prove these numbers out. The x-factor depended mostly on the page size, aka the amount of content we loaded over the network
  2. Avoid DoS attacks – We have found that when you see a large influx of traffic coming into your servers (directed and intentional, or arbitrary) and you want to block all such traffic, your chances of successfully blocking it on Varnish without bringing down the server are many times higher than when doing the same on Apache. We also use Varnish as a mechanism to block any traffic that we don’t want to hit our infrastructure, such as spiders and bots from markets and regions not targeted by the campaigns we run
  3. Avoid Dog-pile effect – If you are hearing this term for the first time, then hop over here: hype-free: Avoiding the dogpile effect. In high-traffic situations, even when you have a CDN set up, it is quite normal for your infrastructure to be hit by a dog-pile as cache entries expire. The chances of a dog-pile increase as you go to lower TTLs. Using Varnish we have set up something that we call a Grace Configuration, where we don’t allow multiple requests for the same URL to pass through. These are queued, and if the primary request is still not getting through after a certain while, subsequent requests are served off of the stale cache.
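The grace idea from the last point can be sketched in a few lines of Python. This is a simplified illustration, not Varnish’s implementation and not our actual configuration: the class name and parameters are hypothetical, and instead of queueing it serves the stale copy immediately whenever a refresh for the same key is already in flight. Requests that find no stale copy at all are not coalesced here.

```python
import threading
import time


class GraceCache:
    """Serve a stale copy while one request refreshes an expired entry."""

    def __init__(self, fetch, ttl, grace, clock=time.monotonic):
        self._fetch = fetch          # call to the backend
        self._ttl = ttl              # seconds an entry is considered fresh
        self._grace = grace          # extra seconds a stale entry may be served
        self._clock = clock
        self._lock = threading.Lock()
        self._entries = {}           # key -> (value, fetched_at)
        self._refreshing = set()     # keys with a backend request in flight

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                age = self._clock() - entry[1]
                if age <= self._ttl:
                    return entry[0]                      # fresh hit
                within_grace = age <= self._ttl + self._grace
                if within_grace and key in self._refreshing:
                    return entry[0]                      # serve stale, no backend call
            self._refreshing.add(key)                    # this request refreshes
        try:
            value = self._fetch(key)                     # lock is not held here
            with self._lock:
                self._entries[key] = (value, self._clock())
            return value
        finally:
            with self._lock:
                self._refreshing.discard(key)


now = [0.0]                # fake clock so the example is deterministic
backend_calls = []
concurrent_results = []


def fetch(key):
    backend_calls.append(key)
    if len(backend_calls) > 1:
        # Simulate a second request arriving while this refresh is in flight.
        concurrent_results.append(cache.get(key))
    return f"version-{len(backend_calls)}"


cache = GraceCache(fetch, ttl=5, grace=30, clock=lambda: now[0])
first = cache.get("/home")     # miss: goes to the backend
now[0] = 6.0                   # TTL has expired, still within grace
second = cache.get("/home")    # refreshes; the simulated concurrent request gets the stale copy
print(first, second, concurrent_results, len(backend_calls))
```

The point of the design is that an expired entry costs the backend exactly one request per URL, no matter how many clients ask for it at the same moment.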



Apache (Dispatcher)

If you haven’t heard of Apache Web Server, you might have heard of httpd. If neither rings a bell, then this (Welcome! – The Apache HTTP Server Project) will explain things. AEM’s answer to scale sits on this layer and is famously known as Dispatcher. This is a neat little module which is installed on the Apache HTTP Server and acts as a reverse proxy with a local disk cache. This module supports only one model of cache eviction, which is event based. We can configure either the authoring or the publishing systems to send events that delete or invalidate cache files on these servers, in which case the next call for those files on this server is passed back to the publisher. The simplest model in the AEM world, and one also recommended by Adobe, is to let everything invalidate (set statfileslevel to 0 or 1). This design simplifies the page/component design, as we no longer have to figure out any inter-component dependencies.

While this is the simplest thing to do, supporting such complex needs calls for some sophistication in design and setup. I would not recommend this approach, as it reduces cache usage. We made sure that the site hierarchy is such that when content is published we never invalidate the entire site hierarchy; only relevant and contextual content gets evicted (invalidated, in the case of Dispatcher).
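To make the effect of the stat file level concrete (the `/statfileslevel` property in the Dispatcher configuration), here is a simplified Python model of my understanding of it: on activation, Dispatcher touches a `.stat` file in every folder from the docroot down to the configured level along the published path, and a cached file counts as invalidated when the `.stat` file that governs it has been touched. The function names and example paths are hypothetical, paths are relative to the docroot, and the Dispatcher documentation remains the authority on the exact behavior.

```python
from pathlib import PurePosixPath


def stat_dir(path: str, level: int) -> tuple:
    """Folder (as path components) whose .stat file governs the given file."""
    folders = PurePosixPath(path).parent.parts[1:]   # drop the leading "/"
    return folders[:level]


def touched_stat_dirs(published_path: str, level: int) -> set:
    """Folders whose .stat file is touched when published_path is activated."""
    deepest = stat_dir(published_path, level)
    return {deepest[:i] for i in range(len(deepest) + 1)}


def is_invalidated(cached_path: str, published_path: str, level: int) -> bool:
    """True if activating published_path invalidates cached_path."""
    return stat_dir(cached_path, level) in touched_stat_dirs(published_path, level)


published = "/content/site/en/news/article.html"
for level in (0, 1, 3):
    print(
        f"level {level}:",
        "fr page invalidated =", is_invalidated("/content/site/fr/home.html", published, level),
        "| en page invalidated =", is_invalidated("/content/site/en/sports/score.html", published, level),
    )
```

In this model, levels 0 and 1 invalidate both example pages, while level 3 keeps the French page cached when an English article is published, which is the kind of contextual invalidation a well-planned site hierarchy makes possible.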



AEM Publishers

The AEM publishing layer, which is the last layer in this layer cake, seems like something which should be the simplest and all figured out. That’s not the case. This is where you can be hit the most (and it will be below the belt). AEM’s architecture is designed to work in a specific way, and if you deviate from it you are bound to fall into this trap. There are 2 things you need to be aware of:

  1. When we start writing components that are heavily dependent on queries, it will eventually lead to the system crumbling. You should be very careful with AEM queries (which depend on AEM’s underlying Lucene implementation).
  2. This article tells us that we have about 4 layers of caching before anything should hit the publisher. This means that only a minuscule number of calls should ever hit this layer. From here on, you need to establish how many calls your servers will receive in a second/minute. We have seen that in cases where we have used search heavily, AEM’s supported TPS takes a nosedive. I have instances across multiple projects where this number is lower than 5 transactions per second.
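One way to establish how many calls reach the publishers is to multiply out the miss ratio of each cache layer in front of them. The Python sketch below does that for the peak of 612 page views per second from the problem statement. The 98% CDN figure is the average offload quoted earlier; the Varnish and Dispatcher hit ratios are placeholders I made up for illustration, and the function names are hypothetical.

```python
import math


def requests_reaching_publisher(peak_rps: float, layer_hit_ratios: list) -> float:
    """Requests per second left after each cache layer absorbs its share."""
    remaining = peak_rps
    for hit_ratio in layer_hit_ratios:
        remaining *= (1 - hit_ratio)
    return remaining


def publishers_needed(publisher_rps: float, tps_per_publisher: float) -> int:
    """Publishers required for the load, ignoring redundancy and headroom."""
    return max(1, math.ceil(publisher_rps / tps_per_publisher))


layers = [
    0.98,  # CDN: average offload quoted in this article
    0.50,  # Varnish: hypothetical hit ratio
    0.50,  # Dispatcher: hypothetical hit ratio
]
load = requests_reaching_publisher(612, layers)
print(f"page views/s reaching publishers: {load:.2f}")
print("publishers needed at 5 TPS each:", publishers_needed(load, 5))
```

With a search-heavy publisher that sustains fewer than 5 transactions per second, even these few remaining requests leave little headroom, which is why the hit ratio of every layer matters.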

The answer is to build some sort of application cache, as we used to do in a typical JEE application. This will solve these issues, assuming that content creation, either manually by authors or via ingestion, is limited, which means the load we put on search can be reduced significantly. The caveat you should be aware of is that we are adding one more layer of cache, which is difficult to manage. If you have a cluster of publishers, this layer becomes a cache distributed across servers, which can lead to old pages being cached on the dispatcher. The chances of that happening increase as the number of calls coming into the publishers increases or as the number of servers in the cluster increases.


In my next article I will explain how you can go about modeling the load your platform/site is expected to get and discuss the finer details of the architecture.


13 thoughts on “Caching Architecture (Adobe AEM) – Part 1”

  1. Pingback: Don’t Like Throttling? | Scratch Pad

    • Hey Tom, for the backend I have used distributed caching systems like Redis to achieve the same. If the objects you are caching are not going to change, you can even use in-container caching frameworks like Ehcache. However, the minute you need objects to be synchronized within a cluster, it’s much better to use a distributed cache system and not worry about synchronization and about creating Java objects that can be serialized and then shared across the cluster.

      Distributed cache technologies like Redis work best in my experience.

  2. Hi Kapil,

    I would like to know how to calculate the number of publishers required while designing the AEM solution based on number of hits per day. For example if a website to be deployed in AEM expects a million hits per day how many publishers should be configured. Is there any formula?

    • Hi Ravi, the trick to calculating the load is to figure out how much traffic will be hitting your publishers. Given that the AEM architecture has Dispatcher, which caches pages in Apache, not everything will come back to the publishers. Given that an AEM server doesn’t scale as well as Apache can, you have to be careful about how you architect your cache solution. You basically have the following options:

      1. Use Dispatcher’s OOTB capability to flush the cache and use the stat file to control how much cache is invalidated on each activation from author. While the simplest option is to set the stat file as high in the hierarchy as possible, it basically means that an activation can lead to a lot of files in Dispatcher being invalidated, which will leak a lot of traffic to the publishers, something you don’t really need unless those files are actually changing. So you need to architect your stat file carefully
      2. Don’t use stat files at all and use event-based invalidation. If a lot of pages are interlinked and there is no pattern to activation (more like a news and media site), you may not want to use a stat file at all. You can override Dispatcher’s invalidate script and write logic to selectively clear the pages that have been pushed.

      To answer your question on how to determine the capacity of the servers, what I do is find the cache offload ratio. In simple math terms, if I expect 1M page hits and I figure out my cache offload will be 90%, then all I expect is 10% of the hits to come back to my publishers. This gives me 100,000 pages / day OR about 4,166 page views / hour OR 69 page views / minute OR ONLY about 1.16 page views / second.

      So you see, the problem for me now is to design for only 2 page views a second and not worry about what my Apache can handle. If you have a CDN like Akamai in place, you don’t even need to worry about Apache scale.

      The trick is to get that cache offload right. So back to point #1: if you use a top-level stat file and someone publishes HomePage.html and everything is invalidated, your publishers will be flooded and you are done.

      P.S.: You should also worry about the Dog-Pile Effect, which in edge cases can cause flooding and, in turn, failure of the publishers.

  3. Hi Kapil, can the cached contents in one dispatcher be shared by other dispatchers in a clustered environment? In other words, is there a way to replicate contents across dispatchers?

      • We have 4 dispatchers pointing to 4 publishers in production; sharing contents across dispatchers is going to help reduce the load on the publish instances. The publishers’ repositories are in sync with each other. Knowing that each publisher is going to generate the same response for a given request, would it make sense to talk about sharing cache contents across dispatchers?

      • A 4×4 setup is fine; however, it will lead to some overhead when the authoring instance publishes content to the publishers. The events will trickle down to all publishers and dispatchers, leading to additional cache invalidation.

        Several people like the idea of a 1 (dispatcher) – 1 (publisher) mapping.

        Sharing cache between publishers is something I have always felt uncomfortable with. Cache sharing will have to happen on a shared network device like a NAS or something. These devices only add to network latency. If you do want to do that, make sure you have either Varnish or NGINX sitting as a reverse proxy in front of the cache to serve static content like images, JS and CSS, which would otherwise incur the overhead of the network device.
