Understanding Alfresco document life cycle for backup strategy

December 17, 2009

Before implementing your Alfresco backup strategy, it is highly recommended that you first fully understand the Alfresco document life cycle (i.e. what happens once a file is deleted).

The following diagram summarizes the complete life cycle of a document and the kind of operations that occur during the deletion process:


So at the end of the life cycle, the document binary is still not deleted from the file system, but rather moved to the designated ‘deletedContentStore’ (usually ‘./alf_data/contentstore.deleted’). On the storage file system, the doc can then be removed via a script or cron job once an appropriate backup has been performed.
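
As an illustration, here is a minimal sketch of such a cleanup job (the path and retention period below are assumptions to adapt to your installation; this is not an official Alfresco script):

    #!/bin/sh
    # Purge binaries from the deletedContentStore once they are older than
    # the last verified backup cycle.
    DELETED_STORE=/opt/alfresco/alf_data/contentstore.deleted
    RETENTION_DAYS=14   # keep at least one full backup cycle

    # delete .bin files older than the retention period, then prune empty dirs
    find "$DELETED_STORE" -type f -name '*.bin' -mtime +"$RETENTION_DAYS" -delete
    find "$DELETED_STORE" -mindepth 1 -type d -empty -delete

Such a script can then be scheduled via cron, e.g. weekly, right after the backup has been verified.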

For more information about the modules involved in the document life cycle, you can refer to the following resources:

Alfresco Trashcan Cleaner Module:
By default, the user trashcan is never cleaned up automatically, so a deleted document can stay there forever. The purpose of this module is to automate that cleanup.
The “retention period” can be configured through the protectedDays parameter (trashcan-cleaner-context.xml).
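
For illustration, the configuration looks something like this (the bean id and class name below are assumptions; only the protectedDays property matters here):

    <!-- excerpt of trashcan-cleaner-context.xml: purge items that have
         been sitting in the trashcan for more than 7 days -->
    <bean id="trashcanCleaner" class="org.alfresco.module.trashcan.TrashcanCleaner">
        <property name="protectedDays">
            <value>7</value>
        </property>
    </bean>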

For more details:
trashcancleaner readme
trashcancleaner code

Content Store Cleaner Module:
Once a document has been removed from the trashcan (i.e. when the trashcan has been cleaned up, manually or automatically), it becomes an orphan. The purpose of the Content Store Cleaner Module is to move orphan documents to the designated deletedContentStore, usually contentstore.deleted.

The orphan document retention period is configurable through the protectDays parameter.
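
With Alfresco 3.2, for instance, this can be set in alfresco-global.properties (the property name below is the 3.2 one; in older releases the equivalent setting lives in the contentStoreCleaner bean configuration, so check your version):

    # keep orphaned binaries for 14 days before moving them to
    # the deletedContentStore
    system.content.orphanProtectDays=14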

For more details:
Content Store Cleaner Module

————

Basically, the conclusion is that you can configure the Alfresco system so that a file binary is deleted only once it has been properly backed up. Otherwise, you can also choose to never delete files (and move them to a low-cost file storage solution). So you have several options available…

I will detail the backup strategy in another dedicated post (because it is relatively complex), but I hope this one will help you!


Alfresco high availability infrastructure in real life (Part 2)

December 12, 2009

As mentioned in my previous post (part 1), the purpose of this series of articles is to provide some recommendations to help you build your own Alfresco high-availability (HA) infrastructure.

So let’s review in more detail the first 3 recommendations I think you should consider:

1 – Read the Alfresco wiki and the vendor’s technical documentation to get an overview of what is and is not possible with the latest version of the product,

2 – Work with the Alfresco services team from the very beginning of the study,

3 – Define and validate the SLA with your “functional” team. Common indicators are RTO (Recovery Time Objective) and RPO (Recovery Point Objective).

1 – Read the Alfresco wiki and the vendor’s technical documentation to get an overview of what is and is not possible with the latest version of the product.

To prepare your study, I think the first thing to do is to review all the existing documentation related to Alfresco clustering and HA. This will help you choose the “best” approach for your design, or at least its high-level principles.

Whatever your Alfresco version, here is a list of documents and articles (provided by the vendor) which could be of interest:

Also, you should carefully choose the target version of Alfresco you will use to build your HA architecture. In our case, we are currently using v2.1.1 E in production, but we are waiting for the latest v3.2 E version (due at the end of this year) to build the new HA infrastructure.

Of course, in most cases, using the latest version is likely to be the best choice. But for some technical reasons (e.g. CPU/memory limitations of your current production infrastructure), you might have to migrate as soon as possible to the currently available version (which is 3.1). The purpose of this post is not to do a comparative study of 3.1 and 3.2, but you should be aware that even between 2 minor releases there can be significant changes (as Alfresco developers are working very hard).

For instance, changes made through the JMX console (e.g. changing a server configuration parameter) can be applied at runtime in both 3.1 and 3.2, but they are persisted across restarts only in 3.2… So if you need more advanced administration capabilities (subsystems, JMX, LDAP synchronization, etc.), 3.2 is clearly the best target.

Please note however that most of the major clustering features and options are already included in version 3.0, and that there are no significant changes between 3.0 and 3.2 regarding clustering. The next major evolution (sharing Lucene indexes between cluster nodes) is planned for 3.3 (or even 4.0) only.

2 – Work with the Alfresco services team from the very beginning of the study:

To start our study, we involved the Alfresco services team. This is clearly the best advice I can give you for your own project… Indeed, even after a single consulting day (or two), you will not only clarify the main Alfresco cluster principles, but also (and this is more important) learn that some of the most common documentation on this topic does not necessarily reflect the limitations of the clustering or HA mechanisms in a “real life” context. I am thinking especially of the “Replicating Content Store” (RCS), which at first seems to be the most common way to replicate your documents between cluster nodes (assuming you need to have 2 content stores for failover), but which is actually a very limited process.

In our case, we plan to build a failover site, which must be a “mirror” of the live site in terms of data (content store and metadata), to avoid losing data as much as possible when switching to the failover site is required. At the beginning of the project we thought that using the “Replicating Content Store” could be a good option to perform the document (content store) synchronization. But the problem is that the RCS component does not guarantee that the failover site will contain exactly the same documents. For instance, if the failover site is unavailable (e.g. during a maintenance operation), the RCS will not be able to replay the transactions that were missed during the downtime period. This is not the only limitation of the RCS, so it is clearly not well suited to our context.

The services team clearly told us that the RCS is usually not used by customers for document replication purposes. Instead, a tool like rsync is usually recommended by the vendor. So we will study whether rsync can be a good fit for our datacenter infrastructure. Thanks to this advice at the very beginning of the project, we avoided a big design mistake…
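
To give an idea, the kind of invocation we will be testing looks like this (host name and paths are placeholders, not a validated production setup):

    # mirror the live content store to the failover site; -a preserves
    # ownership/timestamps, --delete propagates removals
    rsync -a --delete /opt/alfresco/alf_data/contentstore/ \
          failover-site:/opt/alfresco/alf_data/contentstore/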

So once again, involving the Alfresco services team is a good investment!

 

3 – Define and validate the SLA with your “functional” team. Common indicators are RTO (Recovery Time Objective) and RPO (Recovery Point Objective):

Two of the main technical constraints that will drive the design of your HA infrastructure are the RTO (Recovery Time Objective) and the RPO (Recovery Point Objective).

Basically, the RTO is the time required to reactivate the service after a major crash of your production system. So let’s assume your Alfresco server crashes for any reason (unexpected power outage, database corruption, etc.): you usually need a DRP (Disaster Recovery Plan) to be able to restore the system as quickly as possible. In our case, we will use a failover site (in another geographical location) and we will switch the users to this new system. Our “functional team” has specified that our RTO must not exceed 2 hours, which means we must be able to start the failover Alfresco instance and switch the load balancer in less than 2 hours.

Once your failover site is started, the other indicator is “how much data has been lost” during the crash and the switch operation. This usually depends on the frequency of your backups. In our case, the plan is to synchronize the documents content store every 30 minutes (using rsync), and the database every 60 minutes (using the Oracle archive log feature). That means we should not lose more than 60 minutes of data creation time.
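
In cron terms, the schedule would look something like this (the script names are hypothetical wrappers around our rsync call and the Oracle archive-log shipping, not tools shipped with Alfresco):

    # content store sync on the half hour
    30 * * * *  /opt/alfresco/scripts/sync_contentstore.sh
    # hourly replication: content store first, then the Oracle archive logs
    0 * * * *   /opt/alfresco/scripts/replicate_to_failover.sh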

Choosing an RTO of 2 hours and an RPO of 1 hour is already a challenge, but I think this is something technically (and humanly) feasible. Depending on the criticality of your documents, this might of course not be enough, or not acceptable for your end-users… but you should know that the higher your expectations are, the bigger the infrastructure cost will be… For instance, being able to reduce the RTO to less than 30 minutes might require a monitoring solution coupled with advanced mechanisms to automatically start your Alfresco instance (and database) on the failover site. Such solutions already exist (like Veritas Storage Foundation HA), but their price is usually significant. We studied this kind of very advanced solution, but finally we decided (mainly for cost reasons) to start with a pragmatic and simpler HA solution.

To give you an overview of what our Alfresco HA infrastructure will be, here is a logical diagram:


The final design is not yet completely validated, but here are the main concepts which have been selected so far:

  • Failover (YES): 2 distinct Alfresco instances located in 2 geographically separated data centers. Site A is the “live” site exposed to end-users. Site B is used only for failover (only in case of a crash of Site A).
  • Clustering (YES/NO): yes, because we will “simulate” a 2-node cluster. But this will be more of a hot standby system than a real cluster. Here the standby instance is mainly used for backup purposes (and of course failover). The failover site will be in read-only mode and will not be synchronized in real time (the data are replicated only every hour).
  • Data replication (YES): rsync will be used for file synchronization. The archive log export/import mechanism (Oracle) will be used to replicate the database metadata. One very important constraint is that the content store replication must be performed before the DB replication (I will detail that later on; a sketch of the ordering follows this list).
  • Lucene index management: even with version 3.2 of Alfresco, each node of the cluster needs to have a local copy of the indexes. Site A indexes are updated in real time during each transaction. On Site B, we might need to enable the Lucene index tracking process to make sure the indexes stay “up to date”.
  • Load-balancing (NO): we have decided not to address load-balancing in the first release of the project. The main difficulty is that load-balancing requires the 2 nodes of the cluster to be “in sync” in real time, which can be complex to achieve (especially if you consider the limitations of the RCS component). Also, managing CIFS together with load-balancing is something we did not have time to study for the first release.
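
To make the ordering constraint concrete, here is a rough sketch of the hypothetical replicate_to_failover.sh wrapper mentioned above: the content store is mirrored before the database, so that every content URL referenced by the replicated metadata already exists on the failover site.

    #!/bin/sh
    # abort immediately if the content sync fails: never ship a "newer"
    # database that references binaries missing on the failover site
    set -e

    # 1. mirror the content store first (rsync wrapper)
    /opt/alfresco/scripts/sync_contentstore.sh

    # 2. only then ship the Oracle archive logs to the failover database
    /opt/alfresco/scripts/ship_oracle_archivelogs.sh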

I hope you are starting to get a good overview of the infrastructure we are currently trying to set up in our case. In a future post, I will continue with the following recommendations:
4 – Estimate the number of concurrent users you plan to have.
5 – Estimate your documents data volume and how it will scale up in the future (this is a key point that will drive your backup strategy).
6 – Carefully study the backup and restore strategy (DRP). Also define whether you need to be able to restore a single file.
7 – Involve your internal infrastructure team (and the Alfresco services team) to finalize the design of your infrastructure.


3 ways to find and restore a (single) “lost” document stored in Alfresco

December 2, 2009

This week, someone on my team told me he had “lost” a document while working in Alfresco.
He told me he did some cut and copy actions using CIFS (Windows Explorer) and that he experienced some unexpected system behaviour… in the end, he was not able to find the document anywhere in the Alfresco space…
I looked into the trashcan (Manage Deleted Items) => nothing.
I ran a Lucene search on the full repository => nothing.

So yes, the document was not available from the Alfresco interface. But as the document had not been deleted, I was quite sure that it could be in an “orphan” state (i.e. no metadata associated with the binary on the storage file system). Maybe a “transaction failure” caused this situation…

Fortunately, the user gave me the following information about the doc:
– Filename: team_metodology,
– Document type: MS .ppt document,
– The document topic: it is about “methodology”,
– The last time he saved it in Alfresco: “2009/12/1” at 10 AM,
– The space path: “Company Home/Space1/Space2/Space3”.

In this case, you have several solutions. Please note that none of them is a perfect approach, but these are the only solutions you have with the current 3.2 version.
They are ordered by complexity (from most complex to least complex):

———-

1/ Restore a full backup of day D-1. This approach requires that you have:
– A full backup of the database and content store for D-1 (or any backup which contains the doc).
– An Alfresco server dedicated to the restore operation (which means it should have the same disk space available as your current production server).

Obviously, if you have a large repository, this is a very “heavy” operation just to restore a single file.

———-

2/ Restore only a database backup of day D-1:
– With this approach, you will restore only the DB, not the content store (because you might not have enough disk space to restore all the documents),
– Basically, the approach is to:
   – Find the filesystem path of the doc in the DB (like “alfresco/alf_data/contentstore/2009/12/1/hh/mm/”),
   – Try to find the doc in the content store backup based on that path,
– I assume here that you have a database server dedicated to the restore operation (with enough disk space).
– Of course, you will not be able to use the Alfresco application here => you will have to start the DB only and run some SQL queries to find the corresponding path of the document.
– I never tested this approach, so I cannot give you the exact SQL queries. But basically, you should first use the filename (team_metodology.ppt) to find the corresponding entry in the DB, and then find its path on the filesystem (like “2009/12/1/hh/mm/”).
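
Purely as a hypothetical illustration, the queries could look like this on a pre-3.2 schema (table and column names vary between Alfresco versions, and in 3.2 the content URLs move to a dedicated alf_content_url table, so verify against your schema):

    -- find the node carrying the file name
    SELECT node_id
    FROM   alf_node_properties
    WHERE  string_value = 'team_metodology.ppt';

    -- read the serialized content property of that node: it embeds the
    -- on-disk path, e.g. contentUrl=store://2009/12/1/hh/mm/<uuid>.bin|...
    SELECT string_value
    FROM   alf_node_properties
    WHERE  node_id = 12345   -- id returned by the first query
    AND    string_value LIKE 'contentUrl=store://%';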

Once again, this has never been tested, so it is a purely theoretical approach…

———-

3/ Search the doc binary on the Alfresco storage (filesystem):
– In our case, we are using a Linux filesystem as the Alfresco storage. So the corresponding .bin object we are looking for is somewhere under the “alfresco/alf_data/contentstore/” directory.
– We know that the last time the user saved the doc in Alfresco was “2009/12/1” at 10 AM.
– We also know that the doc is about “methodology” (so I assume here that the string “methodology” can be found in the doc content),
– You should know that the document filename and extension are changed by Alfresco when stored on the filesystem (e.g. “team_metodology.ppt” becomes something like “f8001f06-ddda-11de-a614-ad54d765801e.bin”).
– So the approach is to:
   – go to the directory “alfresco/alf_data/contentstore/2009/12/1/10”,
   – run a grep for “methodology” (grep -ir "methodology" *); you might get several results:
 
Binary file 3/cc1eaf57-dddc-11de-ba7b-ad54d765801e.bin matches
Binary file 19/f18bad09-dddc-11de-ba7b-ad54d765801e.bin matches
Binary file 10/8b555d79-ddda-11de-a614-ad54d765801e.bin matches
Binary file 11/844c0916-dddd-11de-83bf-ad54d765801e.bin matches
Binary file 5/f8001f06-ddda-11de-a614-ad54d765801e.bin matches

   – Based on the files’ last modification time, these 2 files match:

   Binary file 10/8b555d79-ddda-11de-a614-ad54d765801e.bin matches
   Binary file 11/844c0916-dddd-11de-83bf-ad54d765801e.bin matches

   – Download the files to your computer, and then change the extension to .ppt:

   8b555d79-ddda-11de-a614-ad54d765801e.ppt
   844c0916-dddd-11de-83bf-ad54d765801e.ppt

   – Try to open these files (through Windows Explorer; opening them directly through the WinSCP client might not work).

   – One of them should be the doc you are looking for!

This last approach has been tested, so I can “guarantee” that it works.
But obviously, this is a “workaround” solution… Also, you really need to know the last time the document was updated, otherwise your grep search might take far too long.
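
If you only know an approximate time window, one way to keep the search manageable (assuming GNU find, which provides the -newermt test) is to filter the candidates by modification time before grepping:

    # grep only the .bin files modified between 9 AM and 11 AM on 2009/12/01
    find alf_data/contentstore -name '*.bin' \
         -newermt '2009-12-01 09:00' ! -newermt '2009-12-01 11:00' \
         -exec grep -il "methodology" {} +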

———-

I talked with an Alfresco services consultant last week, and he agreed that being able to quickly find and restore an “orphan” document would be a very useful evolution.

Our company’s employees are used to working with Windows Server facilities, and Microsoft does provide such a feature to easily find and restore a single file from a backup.

He told me that a basic custom development could help here: each time a file is created or saved, Alfresco could add an entry to a log file recording the mapping between the doc path (like “Company Home/Space1/Space2/Space3/team_metodology.ppt”) and the corresponding path on the filesystem (like “alfresco/alf_data/contentstore/2009/12/1/hh/mm/”). This would greatly simplify searching for a document in the content store backup (or within the live content store itself).
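
To illustrate the idea, here is a rough, untested sketch of what such a custom behaviour could look like with the Alfresco 3.x foundation API (Spring bean wiring omitted; treat every class and method name as an assumption to verify against your version):

    import org.alfresco.model.ContentModel;
    import org.alfresco.repo.content.ContentServicePolicies;
    import org.alfresco.repo.policy.Behaviour;
    import org.alfresco.repo.policy.JavaBehaviour;
    import org.alfresco.repo.policy.PolicyComponent;
    import org.alfresco.service.cmr.repository.ContentReader;
    import org.alfresco.service.cmr.repository.ContentService;
    import org.alfresco.service.cmr.repository.NodeRef;
    import org.alfresco.service.cmr.repository.NodeService;

    // Logs "repository path -> content URL" each time a node's content is
    // created or updated, so a single binary can be located later.
    public class ContentUrlLogger implements ContentServicePolicies.OnContentUpdatePolicy
    {
        private PolicyComponent policyComponent;
        private NodeService nodeService;
        private ContentService contentService;

        // called from Spring once the services have been injected
        public void init()
        {
            policyComponent.bindClassBehaviour(
                    ContentServicePolicies.ON_CONTENT_UPDATE,
                    ContentModel.TYPE_CONTENT,
                    new JavaBehaviour(this, "onContentUpdate",
                            Behaviour.NotificationFrequency.TRANSACTION_COMMIT));
        }

        public void onContentUpdate(NodeRef nodeRef, boolean newContent)
        {
            ContentReader reader =
                    contentService.getReader(nodeRef, ContentModel.PROP_CONTENT);
            if (reader != null)
            {
                // e.g. "/app:company_home/cm:Space1/... -> store://2009/12/1/..."
                System.out.println(nodeService.getPath(nodeRef)
                        + " -> " + reader.getContentUrl());
            }
        }

        public void setPolicyComponent(PolicyComponent policyComponent) { this.policyComponent = policyComponent; }
        public void setNodeService(NodeService nodeService) { this.nodeService = nodeService; }
        public void setContentService(ContentService contentService) { this.contentService = contentService; }
    }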

Of course, a more industrialized solution provided by the vendor would be even more appropriate…
I think I will open a JIRA for this evolution…