Alfresco high availability infrastructure in real life (Part 2)

As mentioned in my previous post (part 1), the purpose of this series of articles is to provide some recommendations to help you build your own Alfresco high-availability (H.A) infrastructure.

So let’s review in more detail the first 3 recommendations I think you should consider:

1 – Read the Alfresco wiki and the vendor’s technical documentation to get an overview of what is and what is not possible with the latest version of the product,

2 – Work with the Alfresco service team from the very beginning of the study,

3 – Define and validate the SLA with your “functional” team. Common indicators are RTO (Recovery Time Objective) and RPO (Recovery Point Objective).

1 – Read the Alfresco wiki and the vendor’s technical documentation to get an overview of what is and what is not possible with the latest version of the product.

To prepare your study, I think the first thing to do is to review all the existing documentation related to Alfresco clustering and H.A. This will help you to choose the “best” approach for your design, at least the high-level principles.

Whatever your Alfresco version, here is a list of documents and articles (provided by Alfresco) which could be of interest:

Also, you should carefully choose the target version of Alfresco you will use to build your H.A architecture. In our case, we are currently using v2.1.1 E in production, but we are waiting for the latest v3.2 E version (end of this year) to build the new H.A infra.

Of course, in most cases, using the latest version is likely to be the best choice. But for some technical reasons (e.g. CPU/memory limitations of your current production infra), you might have to migrate as soon as possible to the currently available version (which is 3.1). The purpose of this post is not to do a comparative study of 3.1 and 3.2, but you should be aware that even between 2 minor releases there might be significant changes (as Alfresco developers are working very hard).

For instance, changes you make through the JMX console (e.g. changing any server configuration parameter) can be applied at runtime with both 3.1 and 3.2, but they are persistent only in 3.2. So if you need more advanced administration capabilities (like subsystems, JMX, LDAP synchronization, etc.), 3.2 is clearly the best target.

Please note however that most of the major clustering features and options are already included in version 3.0, and that there are no significant changes between 3.0 and 3.2 regarding clustering. The next major evolution (sharing Lucene indexes between cluster nodes) is planned for 3.3 (or even 4.0) only.

2 – Work with the Alfresco service team from the very beginning of the study:

To start our study, we involved the Alfresco service team. This is clearly the best advice I can give you for your own project. Indeed, even after a single consulting day (or 2), you will not only clarify the main Alfresco cluster principles, but also (and this is more important) learn that some of the most common documentation on this topic does not necessarily reflect the limitations of the clustering or HA methodology in a “real life” context. I am thinking especially of the “Replicating Content Store” (RCS), which initially seems to be the most common way to replicate your documents between cluster nodes (assuming you need to have 2 content stores for failover), but which is actually a very limited process.

In our case, we plan to build a failover site, which must be the “mirror” of the live site in terms of data (content store and meta-data) to avoid losing data as much as possible when switching to the failover site is required. At the beginning of the project we thought that using the “Replicating Content Store” could be a good option to perform the document (content store) synchronization. But the problem is that the RCS component does not ensure that the failover site will contain the exact same documents. For instance, if the failover site is unavailable (e.g. for a maintenance operation), then the RCS will not be able to re-run the transactions which were not executed during the downtime period. This is not the only limitation of RCS, so clearly it is not well suited to our context.

So the service team clearly told us that the RCS is usually not used by customers for document replication purposes. Instead, a tool like rsync is usually recommended by Alfresco. So we will study whether rsync can be a good fit for our datacenter infrastructure. But, with this advice at the very beginning of the project, we have avoided a big design mistake…

So once again, involving the Alfresco service team is a good investment!

3 – Define and validate the SLA with your “functional” team. Common indicators are RTO (Recovery Time Objective) and RPO (Recovery Point Objective):

Two of the main technical constraints that will drive the design of your HA infrastructure are RTO (Recovery Time Objective) and RPO (Recovery Point Objective).

Basically, RTO is the time required to re-activate the service in case of a major crash of your production system. So let’s assume your Alfresco server crashed for any reason (unexpected power shutdown, corruption of the database, etc.): you usually need to have a DRP (Disaster Recovery Plan) to be able to restore the system as soon as possible. In our case, we will use a failover site (in another geographical location) and we will switch the users over to this new system. Our “functional team” has specified that our RTO must not be more than 2 hours, so that means we must be able to start the failover Alfresco instance and switch the load-balancer in less than 2 hours.

Once your failover site is started, the other indicator (the RPO) is “how much data has been lost” during the crash and the switch operation. This usually depends on the frequency of your backups. In our case, our plan is to synchronize the document content store every 30 min (using rsync), and the database every 60 min (using the Oracle archive log feature). That means that we should not lose more than 60 min of data.
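As a rough sanity check on these figures, the worst-case data loss can be estimated from the replication intervals. The sketch below is only an illustration in Python: the function name, the stream names and the transfer-duration parameter are my own assumptions (they are not part of any Alfresco tooling), and it assumes the failover site is only as fresh as its slowest replication stream.

```python
def worst_case_rpo_minutes(streams):
    """Return the worst-case data-loss window in minutes.

    `streams` maps a stream name to (interval_min, transfer_min):
    how often a replication run starts and how long one run takes.
    Assumption: data written just after a run starts is only safe once
    the next run has finished, so each stream can lose up to
    interval + transfer minutes, and the site as a whole is only as
    fresh as its slowest stream.
    """
    return max(interval + transfer for interval, transfer in streams.values())


# Figures from this post; transfer durations are set to 0 for simplicity.
plan = {
    "content_store_rsync": (30, 0),
    "oracle_archive_logs": (60, 0),
}
print(worst_case_rpo_minutes(plan))  # 60
```

In practice the transfer durations are not zero, so the real window is somewhat larger than the replication interval alone.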

Choosing an RTO of 2 hours and an RPO of 1 hour is already a challenge, but I think this is something technically (and humanly) feasible. Depending on the criticality of your documents, this might of course not be enough, or not acceptable for your end-users, but you should know that the higher your expectations are, the higher the infrastructure cost will be. For instance, reducing the RTO to less than 30 min might require a monitoring solution coupled with an advanced mechanism to automatically start your Alfresco instance (and database) on the failover site. Such solutions already exist (like Veritas Storage Foundation HA), but their price is usually significant. We have studied this kind of very advanced solution, but finally we have decided (mainly for cost reasons) to start with a pragmatic and simpler HA solution.

To give you an overview of what our Alfresco HA infrastructure will be, here is a logical diagram:

The final design is still not completely validated but basically here are the main concepts which have been selected so far:

Failover (YES): 2 distinct Alfresco instances located in 2 geographically separated data centers. Site A is the “live” site exposed to end-users. Site B is used only for failover (only in case of a crash of Site A).
Clustering (YES/NO): yes, because we will “simulate” a 2-node cluster. But this will be more of a hot standby system than a real cluster. Here the standby instance is mainly used for backup purposes (and of course failover). The failover site will be in read-only mode and it will not be synchronized in real time (the data is replicated only every hour).
Data replication (YES): rsync will be used for file synchronization. The archive log export/import mechanism (Oracle) will be used to replicate the database meta-data. One very important constraint is that Content Store replication must be performed before the DB replication (I will detail that later on).
Lucene indexes management: even with version 3.2 of Alfresco, each node of the cluster needs to have a local copy of the indexes. Site A indexes are updated in real time during each transaction. On Site B, we might need to enable the Lucene index tracking process to make sure the Lucene indexes are “up to date”.
Load-balancing (NO): we have decided not to address load-balancing for the first release of the project. The main difficulty is that load-balancing requires the 2 nodes of the cluster to be “in sync” in real time, which can be complex to achieve (especially if you consider the limitations of the RCS component). Also, managing CIFS and load-balancing is something we did not have the time to study for the first release.
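To illustrate the ordering constraint from the “Data replication” item (content store first, database second), here is a minimal Python sketch of one replication cycle. It is not our actual script: the paths, the `siteb` host name and the `ship_archive_logs.sh` script are hypothetical placeholders, and the Oracle step is deliberately left as a stub. The only point is that the database step never runs if the rsync step fails.

```python
import subprocess


def replicate(run=subprocess.run):
    """Run one replication cycle: content store first, database second."""
    steps = [
        # Hypothetical paths and host; rsync -a preserves file attributes.
        ["rsync", "-a", "/opt/alfresco/alf_data/contentstore/",
         "siteb:/opt/alfresco/alf_data/contentstore/"],
        # Hypothetical placeholder for the Oracle archive log shipping step.
        ["/usr/local/bin/ship_archive_logs.sh"],
    ]
    for cmd in steps:
        # check=True makes subprocess.run raise CalledProcessError on a
        # non-zero exit code, so a failed rsync stops the cycle before
        # the database step is reached.
        run(cmd, check=True)
    return [cmd[0] for cmd in steps]
```

A scheduler such as cron could call `replicate()` at the chosen interval; the `run` parameter exists only so the sequence can be exercised without touching a real server.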

I hope you now have a good overview of the infrastructure we are currently trying to set up. In the next post, I will continue with the following recommendations:
4 – Estimate the number of concurrent users you plan to have.
5 – Estimate your documents data volume and how it will scale up in the future (this is a key point that will drive your backup strategy).
6 – Study carefully the backup and the restore strategy (DRP). Also define if you want to be able to restore a single file.
7 – Involve your internal infrastructure team (and Alfresco service team) to finalize the design of your infrastructure.

This entry was posted on Saturday, December 12th, 2009 at 4:53 pm and is filed under Alfresco DM, Backup and Restore, Clustering and High-availability.

5 Responses to Alfresco high availability infrastructure in real life (Part 2)

denniszeng2008 says:

July 27, 2011 at 8:15 am

How long will it take for rsync to synchronize alf_data with 100G?

- Enguerrand SPINDLER says:
  
  July 27, 2011 at 11:45 am
  
  As far as I remember, the copy rate was approx. 20 GB of documents (alf_data) per hour.
  
  However, the speed really depends on the number of files: if you have a small number of big files, the copy process will be much faster than if you have a large number of small files.
  
jjacob says:

July 28, 2011 at 7:04 pm

Hi,

Very nice post on HA. We are also building a similar active/stand-by setup. I was just wondering how the Lucene indexes were tracked on the stand-by server. If “index.recovery.mode” is set to AUTO, will it track and rebuild the index automatically while the server is up, or will it be checked only during startup? We are planning to have a hot stand-by server and don’t want to restart the server to rebuild the indexes (from the last backup). How did you handle this?

Thanks,
Jerry

- Enguerrand SPINDLER says:
  
  August 23, 2011 at 8:51 am
  
  Our “standby” server was actually stopped by default (because the corresponding database was also not active: we were using Oracle archive log replication because we did not have a real Oracle cluster).
  So when switching to the failover site we had to rebuild the indexes partially (incremental reindexing, with the index.recovery.mode property set to “AUTO”).
  Of course we were using the latest index backup (generated on the live Alfresco instance using the Alfresco index snapshot job).
  We were only doing 1 index snapshot per day (every day at 3 AM, when only a few people were accessing the server), so the index backup could be up to 24 hours old.
  
  In most cases the incremental reindexing took from a few minutes to 1 hour. In other cases, for unknown reasons, the indexes were rebuilt completely (even with the index.recovery.mode property set to “AUTO”), but as far as I know it was due to a bug in the 3.2 E version of Alfresco, and I think it is fixed now.
  
  Hope this will help.
  Enguerrand
  
Anonymous says:

January 13, 2012 at 4:25 pm

great post on HA


Alfresco & Share Blog