Modeling Content in CQ54
CQ54 is not a a typical RDBMS where I can model a set of relationships in table and soon a pretty picture starts to present itself. CQ54 stores everything in its content repository (CRX) as nodes which follow an entirely different data model i.e. Hierarchical Structure. My experience with hierarchical databases has been with day to day applications like MS Windows File Explorer, outlook Folder structure and in application development directory services like LDAP. So, I am going to start off by listing down what I understand of hierarchical database before I ponder down to my set of questions.
A hierarchical database model means that my data is arranged into a structure that is similar to tree (organization chart). This resides on the premise of a 1:N relationship where a child can have only one parent, where in a parent can have multiple child records. It has characteristics that differ a lot from a relational database. To list a few:
1. Every node is a record
2. Data is stored as properties on the node
3. Every node can be of a different data type – a hierarchical model does not mandate to have same record types under a same parent
4. A child node can be a child to one and only one parent
Hierarchical databases have their advantages:
Performance: Navigating records in a hierarchical model is faster because the references are basically pointers to the nodes/records directly. I don’t have to search in a index or a set of indexes. This however, is true in a case only when my data model does not have a lot of references. If i am working off with a content-model that includes multi-level references, performance will head south
Easy to understand: It is a simple hierarchy; and it represents something that is “non-technical”. It naturally represents what exists.
And Hierarchical databases have their limitations:
Unable to draw complex relationships between various child nodes – Given the premise that a child node will have only one parent, they are identified only by their parents. We have the capability like XPath to navigate directly to a node, which may be faster. If we do not know the exact path, we will have to navigate the tree (up to a parent, maybe the root) and then down to all nodes before we find what we are looking for. Some questions that I am asking myself:
1. What qualifies as a reference for an object?
2. Should speed at which the data can be fetched a driver to defining a reference?
3. What are the best practices that I should be aware off, when I am modeling my domain?
4. When do i decide I need a network model instead of hierarchical model?
Difficult of maintain – hierarchical models also mean that I do not have a command like ALTER TABLE. This essentially means then if I later decide to add another property to a specific node type I will have to write code to update all the nodes
1. Is there a way where I can update a node-type thus updating all the objects which are of that node type?
2. Is there a way to avoid such situations (apart from saying that lets get it right in Release 1.0 and pray to God client will not ask for a change request )
Lack of Flexibility – In this article, Scott Ambler quotes – “Hierarchical databases fell out of favor with the advent of relational databases due to their lack of flexibility because it wouldn’t easily support data access outside the original design of the data structure. For example, in the customer-order schema you could only access an order through a customer, you couldn’t easily find all the orders that included the sale of a widget because the schema isn’t designed to all that.”. This is a typical case of where reporting is a must and it might be in many systems.
1. Are there other scenarios?
With all the context set of Hierarchical, it is now important we look at CQ54′s content repository – CRX. While CRX is a hierarchical repository it should not be confused with a hierarchical database. CRX provides us with JCR node types which allow us to force structure. We also have the capability of creating custom nodes, but should do it with care. The principle is not to go overboard with structure.
Question remains – “how do I manage content in CQ54″. I do not have a “go-to” answer, but what I have described below is how I am going to think when I start the process.
Content modeling: Look at the requirements i.e. wireframes, creative design assets and identify various content types, structures and relationships between content types. We can take the object-oriented approach and define everything as an object or keep similar content types together. There are several things that should be considered when taking one approach over the other:
What is the business process for crating an object type. Do the content types follow same workflow?
1. Steps that are required to activate a content. An article, a blog, a discussion forum entry may have the same process flow of an author and a reviewer then there is a case of having a single abstract content type
2. However, if an article needs a legal review and can be used in several other business process than just a simple article we may want to bring article out as its own content type
1. What kind properties do they share
2. Modeling content for an education system where we have content types like a college or a school where we see a lot of similarities there is a case we can build on creating an abstract content
How does the content author wants to look at the content
1. If we have a set of users who want to manage their content as structured content like books, movies etc we should look to provide those content types very specifically
2. In another scenario if we have authors who do not worry a lot about specific objects i.e. Page-centric content creation then we can decide to club content types together
Managing Relationships: In CQ, given it has a hierarchy based data storage model which complies with JCR specifications, we do not have a way to create strict rule-based relationships. We can create relationships using one of the following ways:
Path based references: We can do this by creating properties on objects that hold a “path” or a “list of path” to which the content has relationships with
1. They are semantic
2. Not bound to an “obscure IDs”
3. Do not enforce integrity constraints which may create troubles in extensibility later
4. Being REST-ful they allow us to navigate directly to the node, thus making navigations very quick
5. Being REST-ful, they allow author to visualize their content relationships well thus providing them a business view of the content
Taxonomy based references: CQ uses tags to represent a taxonomy. However, we can not extend tags to hold various profile information. So, you will need to have a mapping system that maps a tag to a content in CRX
1. Taxonomy is the foundation on which the IA stands. Taxonomy allows us a classification system and how the users will view the content on the site.
2. Allow us to clearly identify where in the system the content type resides
3. Is a conceptual framework allowing customers and their customer to locate what they need easily
4. It is hierarchical
1. Can be used in case we reach a point where relationships are too complex
2. Transactional Data should be kept out of CMS and placed in a relational database (or similar)
3. If we do not have to manage the lifecycle of the content
4. Please note that this will make architecture complex, but if this is needed that it is