Month: January 2012

Modeling Content in CQ54

CQ54 is not a a typical RDBMS where I can model a set of relationships in table and soon a pretty picture starts to present itself. CQ54 stores everything in its content repository (CRX) as nodes which follow an entirely different data model i.e. Hierarchical Structure. My experience with hierarchical databases has been with day to day applications like MS Windows File Explorer, outlook Folder structure and in application development directory services like LDAP. So, I am going to start off by listing down what I understand of hierarchical database before I ponder down to my set of questions.

A hierarchical database model means that my data is arranged into a structure that is similar to tree (organization chart). This resides on the premise of a 1:N relationship where a child can have only one parent, where in a parent can have multiple child records. It has characteristics that differ a lot from a relational database. To list a few:

1. Every node is a record
2. Data is stored as properties on the node
3. Every node can be of a different data type – a hierarchical model does not mandate to have same record types under a same parent
4. A child node can be a child to one and only one parent

Hierarchical databases have their advantages:

Performance: Navigating records in a hierarchical model is faster because the references are basically pointers to the nodes/records directly. I don’t have to search in a index or a set of indexes. This however, is true in a case only when my data model does not have a lot of references. If i am working off with a content-model that includes multi-level references, performance will head south
Easy to understand: It is a simple hierarchy; and it represents something that is “non-technical”. It naturally represents what exists.
And Hierarchical databases have their limitations:

Unable to draw complex relationships between various child nodes – Given the premise that a child node will have only one parent, they are identified only by their parents. We have the capability like XPath to navigate directly to a node, which may be faster. If we do not know the exact path, we will have to navigate the tree (up to a parent, maybe the root) and then down to all nodes before we find what we are looking for. Some questions that I am asking myself:
1. What qualifies as a reference for an object?
2. Should speed at which the data can be fetched a driver to defining a reference?
3. What are the best practices that I should be aware off, when I am modeling my domain?
4. When do i decide I need a network model instead of hierarchical model?
Difficult of maintain – hierarchical models also mean that I do not have a command like ALTER TABLE. This essentially means then if I later decide to add another property to a specific node type I will have to write code to update all the nodes
1. Is there a way where I can update a node-type thus updating all the objects which are of that node type?
2. Is there a way to avoid such situations (apart from saying that lets get it right in Release 1.0 and pray to God client will not ask for a change request :))
Lack of Flexibility – In this article, Scott Ambler quotes – “Hierarchical databases fell out of favor with the advent of relational databases due to their lack of flexibility because it wouldn’t easily support data access outside the original design of the data structure. For example, in the customer-order schema you could only access an order through a customer, you couldn’t easily find all the orders that included the sale of a widget because the schema isn’t designed to all that.”. This is a typical case of where reporting is a must and it might be in many systems.
1. Are there other scenarios?
With all the context set of Hierarchical, it is now important we look at CQ54’s content repository – CRX. While CRX is a hierarchical repository it should not be confused with a hierarchical database. CRX provides us with JCR node types which allow us to force structure. We also have the capability of creating custom nodes, but should do it with care. The principle is not to go overboard with structure.

Question remains – “how do I manage content in CQ54”. I do not have a “go-to” answer, but what I have described below is how I am going to think when I start the process.

Content modeling: Look at the requirements i.e. wireframes, creative design assets and identify various content types, structures and relationships between content types. We can take the object-oriented approach and define everything as an object or keep similar content types together. There are several things that should be considered when taking one approach over the other:

What is the business process for crating an object type. Do the content types follow same workflow?
1. Steps that are required to activate a content. An article, a blog, a discussion forum entry may have the same process flow of an author and a reviewer then there is a case of having a single abstract content type
2. However, if an article needs a legal review and can be used in several other business process than just a simple article we may want to bring article out as its own content type
1. What kind properties do they share
2. Modeling content for an education system where we have content types like a college or a school where we see a lot of similarities there is a case we can build on creating an abstract content
How does the content author wants to look at the content
1. If we have a set of users who want to manage their content as structured content like books, movies etc we should look to provide those content types very specifically
2. In another scenario if we have authors who do not worry a lot about specific objects i.e. Page-centric content creation then we can decide to club content types together


Managing Relationships: In CQ, given it has a hierarchy based data storage model which complies with JCR specifications, we do not have a way to create strict rule-based relationships. We can create relationships using one of the following ways:

Path based references: We can do this by creating properties on objects that hold a “path” or a “list of path” to which the content has relationships with
1. They are semantic
2. Not bound to an “obscure IDs”
3. Do not enforce integrity constraints which may create troubles in extensibility later
4. Being REST-ful they allow us to navigate directly to the node, thus making navigations very quick
5. Being REST-ful, they allow author to visualize their content relationships well thus providing them a business view of the content
Taxonomy based references: CQ uses tags to represent a taxonomy. However,  we can not extend tags to hold various profile information. So, you will need to have a mapping system that maps a tag to a content in CRX
1. Taxonomy is the foundation on which the IA stands. Taxonomy allows us a classification system and how the users will view the content on the site.
2. Allow us to clearly identify where in the system the content type resides
3. Is a conceptual framework allowing customers and their customer to locate what they need easily
4. It is hierarchical
Relational Database
1. Can be used in case we reach a point where relationships are too complex
2. Transactional Data should be kept out of CMS and placed in a relational database (or similar)
3. If we do not have to manage the lifecycle of the content
4. Please note that this  will make architecture complex, but if this is needed that it is

The Language of Risk « The IT Risk Manager | Kapil | Scratch Pad | Java | Architecture | Design | Open Source

The language is important because it helps you think about the problem in the right way.

This statement stuck a chord reminding me of an instance not so long ago . She asked me – “why did my project had just a couple of risks?” . She went on to probe us (my PM and me) to understand if we were not thinking of the risk.

At that time, I did not answer her in the way the Author summarizes it. We presumed that we had a functional scope documented and hence the risk of not being able to deliver what was required did not exist.

via The Language of Risk « The IT Risk Manager | Kapil | Scratch Pad | Java | Architecture | Design | Open Source.

Write Through Cache

I was a young budding developers when I was first introduced to the concept of Cache. My Senior Architect then told me

Cache is a component that will magically store data so that future requests of that same data will not be to the Remote Server, and hence it will improve the performance of our application significantly faster

We were working on a website which was to integrate with an existing application through the use of APIs and for purposes of closer integration we had decided to store the data in form of XML as artifacts in this tool as tracker items – no RDBMS. It was like running two applications joint at the hip.

I had gathered enormous experience working with SourceForge platform APIs as I had integrated them with Ms-Excel and now we were going to build an entire application on ALM space using the same APIs. Our biggest challenge was going to be performance because of the use of APIs against a remote server sitting in a different geography. And Cache was going to be instrumental in helping us solve that problem.

We had just decided to implement Write-through cache that helped us build a system which was significantly faster that anyone could have thought about. And since then this is one pattern that I have come to use (if possible) whenever I am working with diversified systems. Surprisingly, the percentage of people who have used cache have never heard of this pattern of cache implementation when the underlying issues with the systems can be solved using this pattern (I am still mystified as to why not?).

Before we dig deep, I want to run through some definitions that we are going to use during the course of this article:

  • Cache hit refers to a request to the data and finding it in the cache
  • Cache miss refers to a request to the data and not finding it in the cache
  • Dirty refers to a cached data if it has not the same as the original data
  • Lazy refers to an action if it not performed real time, but only when it is required

In its simplest form, a cache implementation is going to something similar to the image below.

Cache - Workflow

Cache – Workflow

As you can already see that this is just one part of the cache implementation and implementing this workflow alone would mean that once I have a data in the cache it would only always fetch the data from the local cache location and go back to the original data source only if the cache data is Dirty. And there are several ways we have to mark the data as dirty – it can be an action we configure in our system like – “If we are update records in the data source, we try to find the key and mark it dirty”. Another way is to decide a time after which the the cache should be expired automatically.

While, this approach is simple enough, this does presents a unique “problem” (and it may not be a problem for everyone). Lets revisit the reason for which we decided to implement cache.

 Cache is a component that will magically store data so that future requests of that same data will not be to the Remote Server, and hence it will improve the performance of our application significantly faster.

This implies that “Caching is a mechanism that is faster when compared to our data source when it comes to data loading”. I have observed cache implementations against RDBMS sitting next to a Application Server, which means that loading data from the RDBMS is still faster and hence there is really no need to improvise on the cache flow as defined earlier.

In our case, we were dealing with an external system from where we fetched XML over HTTPS and then converted the XML to an object. This entire process was time consuming – 3 seconds for one object and there was nothing we could do about reducing the transportation time. It also meant that the classic workflow for us would not work either, especially if a user would update a specific record, and if the cache would be marked dirty, the next request would mean a significant delay time.

We improvised and it was then we used the write-through cache logic that allowed us to manage the data in the cache in real-time with the data-source. The workflow was changed to the one below:

As simple as it may look it was not so. Lets see first what we did. We added a hook to the code which was required to save the data in the DataSource to do two things:

  1. Find the cache entry and mark it dirty if it was a hit and;
  2. Update the cache after a successful update to the data store

This allowed us to keep the cache in sync with the DataSource and hence not requiring to spend additional time to load the data back again.

But, as I feel that every solution will bring its challenges, this one had as well especially when we decided to scale and move over to a cluster of application server. This meant that a local cache would simply not work because it was meant that an update to the DataSource meant that the cache on other application servers was out dated and users would not get the most latest data making it impossible to work. We did use version to records to manage the concurrency checks, and not keeping the cache in sync meant that other users will see their updates fail because of that very fail-safe. Eventually, we had to find a cache that can be scaled in a cluster which only made things more complicated.

A pattern that I learn in my development adolescence, this had proved to be a powerful technique to build solutions that would work fine in a given scenario.



Java EE 6 vs. Spring Framework: A technology decision making process

I came across this article which I just want to share with others – I found this a good read.

Java EE 6 vs. Spring Framework: A technology decision making process.

OSGI: The new Toy

I heard about OSGI sometime early last year, but I did not care about it – it meant start thinking about a new way of development and deployment (thats what I heard from my friends) and I did not want to learn something else when Spring worked great for me. And, my colleagues who spoke about OSGI did not do a good job of advocating OSGI. Last month, I came across Adobe Day as a potential platform for a project implementation. Day was amongst some of the CMS platforms that I was evaluating, namely – SDL Tridion, Oracle Stellant and Interwoven.

It was Adobe Day that introduced me to OSGI and during one of the webinars, I was introduced to OSGI using a simple yet powerful graphic (see below).

OSGI - Intro

OSGI - Intro

And from that moment on, I was simply hooked on. I have been a big fan of Java and its deployment frameworks like Maven – but essentially using those frameworks and tools also meant that sooner or later I would be dealing with various versions of same library (ehCache, Log4j being the most common) and when I wen to deployment on JBoss or Apache Tomcat, more often than not I would have to tweak my project dependencies to or servers to ensure that those “dependencies” are resolved appropriately.

Suddenly, OSGI seems to be the answer to everything (or almost everything). I would be able to create and deploy components and then choose to include them as needed; I did not have to worry about backward compatibility or which service to deploy – opportunities were endless.

So today, I am going to start off my journey to read more about OSGI and explore some common available containers like Equinox (used by Eclipse) and Apache Felix (used by Adobe Day). And I hope that while I learn more I am able to share my thoughts and some of the best practices around the same.