Alfresco high availability infrastructure in real life (Part 1):

November 29, 2009

In our company, the Alfresco DM service is becoming more and more popular every day (+2 GB of documents added each day), so we are currently studying the implementation of a high availability (H.A) infrastructure (cluster, failover, etc.).

So I am starting this series of posts to share my comments, questions and recommendations with you, because implementing such an H.A infrastructure is a somewhat complex project. I should say that this complexity is not due to the Alfresco software itself, but to the common underlying infrastructure concerns you face when you need to install any “big scale” architecture, with a huge volume of documents to back up and hundreds or thousands of users.

I think most companies that deployed Alfresco a few years ago started the project infrastructure with a pragmatic approach, i.e. using a more or less “basic” standalone server, as we did. So after a few years in production, I’m sure they are in the same situation as we are: due to the success of the project, you now need to build a more industrialized architecture to better manage DRP (disaster recovery plan), failover, load-balancing, backup, etc.

We have almost completed the design of our H.A infrastructure. I will give you more details about it once the final design is validated (see next post), and also explain why we made these decisions, but basically here are the main concepts that have been selected so far:

  • Failover (YES): 2 distinct Alfresco instances located in 2 geographically separate data centers.
  • Clustering (YES/NO): yes, because we will have 2 cluster nodes. But this will be a hot standby system (only one of the 2 Alfresco instances is exposed to end-users). The standby instance is mainly used for backup purposes (and of course failover).
  • Load-balancing (NO): after long discussions we have decided not to address load-balancing in the first release of the project. The main constraint is that load-balancing requires the 2 nodes of the cluster to be “in sync” in real time, which can be complex to achieve (see next post). Also, combining CIFS with load-balancing is something we did not have the time to study for the first release.

Please note that these criteria do NOT describe a perfect infrastructure that matches every company's needs. Rather, this is a pragmatic approach that addresses our functional constraints (RTO, RPO), leverages existing infrastructure components (data center, backup robot), and takes into account the “cost restrictions”.
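To make the hot standby idea concrete, here is a minimal sketch in Python of the routing rule it implies: exactly one instance is exposed to end-users at any time, and the standby only takes over when the primary is down. The `Node` class, the `choose_exposed_node` function and the node names are hypothetical and only illustrate the concept; this is not Alfresco configuration, and a real setup would rely on your own monitoring and switching tools.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """One Alfresco instance in one data center (hypothetical model)."""
    name: str
    healthy: bool


def choose_exposed_node(primary: Node, standby: Node) -> Node:
    """Hot standby rule: expose the primary while it is healthy,
    otherwise fail over to the standby. Never expose both at once."""
    if primary.healthy:
        return primary
    if standby.healthy:
        return standby
    raise RuntimeError("no healthy Alfresco instance available")


# Normal operation: users reach the primary data center.
print(choose_exposed_node(Node("dc1", True), Node("dc2", True)).name)

# Primary down: users are switched to the standby data center.
print(choose_exposed_node(Node("dc1", False), Node("dc2", True)).name)
```

Because only one node serves users, the two nodes do not need to be kept in sync in real time for load-balancing, which is the simplification we are aiming for in the first release.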

At this stage of the project study, I can offer the following recommendations:

1 – Read the Alfresco wiki and the vendor's technical documentation to get an overview of what is and what is not possible with the latest version of the product.

2 – Work with the Alfresco service team from the very beginning of the study.

3 – Define and validate the SLA with your “functional” team. Common indicators are RTO (Recovery Time Objective) and RPO (Recovery Point Objective).

4 – Estimate the number of concurrent users you plan to have.

5 – Estimate your documents data volume and how it will scale up in the future (this is a key point that will drive your backup strategy).

6 – Carefully study the backup and restore strategy (DRP). Also decide whether you need to be able to restore a single file.

7 – Involve your internal infrastructure team (and the Alfresco service team) to finalize the design of your infrastructure.
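As a rough illustration of the volume estimate in point 5, the sketch below projects the size of the content store and the time a full copy would take. Only the 2 GB/day growth rate comes from this post; the current store size (1000 GB) and the sustained backup throughput (50 MB/s) are made-up assumptions that you should replace with your own measurements.

```python
from dataclasses import dataclass


@dataclass
class GrowthEstimate:
    current_gb: float       # assumed size of the content store today
    daily_growth_gb: float  # 2 GB/day is the figure quoted in this post

    def projected_gb(self, days: int) -> float:
        """Linear projection of the store size after the given number of days."""
        return self.current_gb + self.daily_growth_gb * days


def full_copy_hours(volume_gb: float, throughput_mb_per_s: float) -> float:
    """Hours needed to copy volume_gb at a sustained throughput (1 GB = 1024 MB)."""
    seconds = (volume_gb * 1024) / throughput_mb_per_s
    return seconds / 3600


# Hypothetical starting point: 1000 GB today, growing by 2 GB per day.
estimate = GrowthEstimate(current_gb=1000.0, daily_growth_gb=2.0)

for years in (1, 2, 3):
    size = estimate.projected_gb(365 * years)
    hours = full_copy_hours(size, throughput_mb_per_s=50.0)
    print(f"after {years} year(s): {size:.0f} GB, full copy takes about {hours:.1f} h")
```

Comparing the full-copy duration with your backup window and your RTO gives a first hint of whether full backups remain realistic or whether an incremental or replication-based strategy is needed.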

I will detail each of these items in the next blog post.

But the conclusion for us, at this stage of the H.A design study, is clearly that there is no “perfect” architecture: the final infrastructure design is only the best compromise between all your internal company constraints (functional and technical). Once again, I can say that most of these compromises are not due to limitations of the Alfresco software itself, but rather to the huge volume of documents that we have to manage.

So go to the next blog post if you want to follow the next episode of the story…