Alfresco high availability infrastructure in real life (Part 2)

December 12, 2009

As mentioned in my previous post (part 1), the purpose of this series of articles is to provide some recommendations to help you build your own alfresco high-availability (H.A) infrastructure.

So let’s review more in details the first 3 advices I think you should consider:

1 – Read the Alfresco wiki and the editor technical documentation to give you an overview of what is and what is not possible with the latest version of the product,

2 – Work with Alfresco service team at the early beginning of the study,

3 – Define and validate the SLA with your “functional” team. Common indicators are RTO (Recovery Time Objective) and RPO (Recovery Point Objective).

1 – Read the Alfresco wiki and the editor technical documentation to give you an overview of what is and what is not possible with the latest version of the product.

To prepare your study, I think the first thing to do is to review all the existing documentation related to Alfresco clustering and H.A. This will help you to choose the “best” approach for your design, at least the high-level principles.

Whatever your Alfresco version, here is a list of documents and articles (provided by Alfresco editor) which could be of interest:

Also, you should carefully choose the target version of Alfresco you will use to build your H.A architecture. In our case, we are currently using v2.1.1 E in production, but we are waiting for the latest v3.2 E version (end of this year) to build the new H.A infra.

Of course, in most cases, using the latest version is likely to be the best choice. But for some technical reasons (e.g CPU/memory limitation of your current production infra), you might have to migrate as soon as possible to the current available version (which is 3.1). The purpose of this post is not to do a comparative study of 3.1 and 3.2, but you should be aware that even between 2 minor releases there might be significant evolutions (as Alfresco developers are working very hard).

For instance, changes you can do through the JMX console (e.g changing any server configuration parameter) can be done at runtime with both 3.1 and 3.2, but changes are persistent only in 3.2…So if you need more advanced administration capabilities (like sub-system, JMX, LDAP synchronization, etc) 3.2 is clearly the best target.

Please note however that most of the major clustering features and options are already included in 3.0 version, and that there are no significant changes between 3.0 to 3.2 regarding clustering. The next major evolutions (sharing lucene indexes between cluster nodes) is planned for 3.3 (or even 4.0) only.

2 – Work with Alfresco service team at the early beginning of the study:

To start our study, we have involved the Alfresco service team. This is clearly the best advice I can give you for your own project…Indeed, even after a single consulting day (or 2), you will not only clarify the main Alfresco cluster principles, but also (and this is more important) learn that some of the most common documentation on this topic does not necessarily reflect the limitation of the clustering or HA methodology in a “real life” context. I especially do think about the “Replicating Content Store” (RCS), which initially seems to be the most common way to replicate your documents between cluster nodes (assuming you need to have 2 content stores for failover), but which is actually a very limited process.

In our case, we plan to build a failover site, which must be the “mirror” of the live site in term of data (content store and meta-data) to avoid loosing data as much as possible when switching to the failover site is required. At the beginning of the project we thought that using the “Replicating Content Store” could be a good option to perform the document (content store) synchronization. But the problem is that the RCS component does not ensure that the failover site will contain the exact same documents. For instance, if failover site is unavailable (e.g maintenance operation), then the RCS will not be able to re-run the transaction which have not been executed during downtime period. This is not the only limitation of RCS, so clearly it is not well suited in our context.

So the service team clearly told us that the RCS is usually not used by customers for documents replication purpose. Insteed, a tool like rsync is usually recommended by the Alfresco editor. So we will study if rsync can be a good fit for our datacenter infrastructure. But, with this advice at the very beginning of the project, we have avoided a big design mistake…

So once again, involving Alfresco service team is a good investment !

 

3 – Define and validate the SLA with your “functional” team. Common indicators are RTO (Recovery Time Objective) and RPO (Recovery Point Objective):

One of the main technical constraints that will drive the design of your HA infrastructure are RTO (Recovery Time Objective) and RPO (Recovery Point Objective) .

Basically RTO is the time required to re-activate the service in case of major crash of your production system. So let’s assume your Alfresco server crashed for any reason (unexpected power shutdown, corruption of database, etc, etc), you usually need to have a DRP (Disaster Recovery Plan) to be able to restore the system as soon as possible. In our case, we will use a failover site (in another geographical location) and we will switch the users on this new system. Our “functional team” has specified that our RTO must not be more than 2 hours, so that means we must be able to start the failover Alfresco instance and switch the load-balancer in less than 2 hours.

Once your failover site is started, another indicator is “how much data have been lost” during the crash and the switch operation ? This usually depends on the frequency of your backup. In our case, our plan is to synchronize the documents content store every 30 mn (using rsync), and the database every 60 mn (using Oracle archive log feature). That means that we should not lost more than 60 mn of data creation time.

Choosing a RTO of 2 hours and RPO of 1 hour is already a challenge, but I think this is something technically (and humanly) feasible. Depending on the criticity of your documents, this might of course not be enough, or not acceptable for your end-users…but you should know that the highest your expectation are, and the biggest the infrastructure cost will be…For instance, being able to reduce RTO to less than 30 mn might need to involve some monitoring solution coupled to advanced mechanism to automatically start your Alfresco instance (and database) on the failover site. Such solution already exist (like Veritas Storage Foundation HA), but the price of such solution is usually significant. We have studied this kind of very advanced solution, but finally we have decided (mainly for cost reason) to start with a pragmatic and more simple HA solution.

To give you an overview of what our Alfresco HA infrastructure will be, here is a logical diagram:


The final design is still not completely validated but basically here are the main concepts which have been selected so far:

  • Failover (YES): 2 distinct Alfresco instances located in 2 distinct Data-Centers geographically separated. Site A is the “live” site exposed to end-users. Site B is used only for failover (only in case of crash of Site A).
  • Clustering (YES/NO): yes, because we will “simulate” a 2 cluster nodes. But this will be more a hot standby system rather than a real cluster. Here the standby instance is mainly used for backup purpose (and of course failover). The failover site will be in read only mode and it will not be synchronized in real time (the data are replicated only every 1 hour).
  • Data replication (YES): rsync will be used for files synchronization. The archive log export/import mechanism (Oracle) will be used to replicate the database meta-data. One very important constraint is that Content Store replication must be performed before the DB replication (I will detail that later on).
  • Lucene indexes management: even with 3.2 version of Alfresco, each node of the cluster needs to have a local copy of the indexes. Site A indexes are updated in real time during each transaction. On Site B, we might need to enable the lucene index tracking process to make sure the lucene indexes are “up to date”.
  • Load-balancing (NO): we have decided not to address load-balancing for the first release of the project. The main difficulty is that load-balancing requires the 2 nodes of the cluster to be “in synch” in real time which can be complex to achieve (especially if you consider the limitations of the RCS component). Also, managing CIFS and load-balancing is something we did not have the time to study for the first release.

Hope that you begin to have a good overview of what infra we currently try to setup in our case. In a next post, I will continue to describe the following recommendation:
4 – Estimate the number of concurrent users you plan to have.
5 – Estimate your documents data volume and how it will scale up in the future (this is a key point that will drive your backup strategy).
6 – Study carefully the backup and the restore strategy (DRP). Also define if you want to be able to restore a single file.
7 – Involve your internal infrastructure team (and Alfresco service team) to finalize the design of your infrastructure.