We had a call with Mike Farman (Alfresco Product Manager) this morning. Here is a summary of our discussion about the “basic” changes that can be made to Alfresco DM to “quickly” improve performance:
- Mike Farman confirmed that OpenOffice processing (plain-text transformation for indexing) and the Lucene indexing process are the most CPU-intensive operations.
- So these are the 2 main things we should work on (in our case) if we need to improve the scalability of the current platform (along with JVM tuning, see below).
- For OpenOffice (OO):
– As expected, OO has a big impact on CPU/RAM (the impact is similar whatever version is used),
– OO is used for:
– File transformation (e.g. .ppt to .pdf),
– Plain-text transformation (e.g. .ppt to plain text), which is what allows Lucene indexing,
– Some specific meta-data extraction: OO is not mandatory here, though, and it can be “replaced” by another mechanism,
– Technically, it is possible not to use OO at all,
– Reminder: OO is currently disabled in our environment (with some limitations: the content of .ppt files is no longer indexed by Lucene, and pdf transformation does not work).
– The OO process can be run on a separate remote server/JVM.
– This is documented here: http://wiki.alfresco.com/wiki/Running_OpenOffice_From_Remote_Machine
– We will check if this is something feasible in our current architecture,
– If OO is run on a separate server, the benefits for us will be:
– Lucene indexing of .ppt documents (and others?) can be re-enabled,
– File transformation to .pdf can be re-enabled.
- For Lucene indexing:
– Lucene has 2 major impacts on Alfresco CPU/RAM:
– The Lucene indexing process (indexing of new documents) is resource-consuming. This is the heaviest operation.
– Lucene search requests (end users searching for a document).
– The Lucene search engine cannot be completely disabled (it is used for document search but also for some internal technical search operations).
– The Lucene search engine cannot be run on a separate server (neither in version 2.x nor 3.x).
– Lucene indexing of document content can be adapted to:
– Either do the job asynchronously (scheduled during the night using cron). Meta-data will still be searchable in real time.
– Or be completely “disabled” (i.e. search form removed from the web UI): we would then need to implement a third-party search solution. Meta-data will still be searchable in real time.
– Mike will provide a custom solution (AMP) to do the job asynchronously or to completely disable Lucene indexing of document content.
– Technically: the solution consists of applying an aspect to each document (flag indexing = no/yes, default value = no). To allow indexing, the flag must be set to yes programmatically (solution to be studied?).
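
To make the flag idea above concrete, here is a minimal sketch in plain Java. It is not the AMP Mike will provide and it does not use any Alfresco API: the class `DeferredIndexingSketch`, the `Doc` type, the `indexContent` field and the `selectForContentIndexing` method are all hypothetical names, used only to illustrate "flag defaults to no, a nightly job indexes only the documents flagged yes".

```java
import java.util.ArrayList;
import java.util.List;

public class DeferredIndexingSketch {

    /** Hypothetical stand-in for a document carrying the indexing aspect. */
    static final class Doc {
        final String name;
        boolean indexContent = false; // default value = no

        Doc(String name) {
            this.name = name;
        }
    }

    /** Returns the names of the documents whose content a nightly job would index. */
    static List<String> selectForContentIndexing(List<Doc> docs) {
        List<String> selected = new ArrayList<>();
        for (Doc d : docs) {
            if (d.indexContent) {
                selected.add(d.name);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        Doc slides = new Doc("slides.ppt");  // flag left at its default (no)
        Doc report = new Doc("report.pdf");
        report.indexContent = true;          // flag set to yes programmatically

        List<Doc> docs = new ArrayList<>();
        docs.add(slides);
        docs.add(report);

        // Only the flagged document is picked up for content indexing.
        System.out.println(selectForContentIndexing(docs)); // prints [report.pdf]
    }
}
```

In this sketch, meta-data search is untouched: only the content-indexing step looks at the flag, which matches the "meta-data will still be searchable in real time" point above.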
- JVM tuning also seems to be very important: we should check whether the following recommendations can/should be applied (http://wiki.alfresco.com/wiki/Repository_Hardware).
It is recommended to do some JVM profiling to see whether some tuning parameters can improve performance (also check GC cycles).
- Database tuning should also be considered (no recommendation provided here).
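
As a starting point for the GC check mentioned above, the following sketch reads the garbage-collection counters that the JVM exposes through the standard `java.lang.management` API. The class name `GcSnapshot` and its helper methods are hypothetical. To be useful, code like this would have to run inside the Alfresco JVM (or the same MXBeans would be read remotely over JMX), which is an assumption about our setup and was not discussed on the call.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcSnapshot {

    /** Total number of collections across all collectors (undefined values are skipped). */
    static long totalGcCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long count = gc.getCollectionCount(); // -1 when the collector does not report it
            if (count > 0) {
                total += count;
            }
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds across all collectors. */
    static long totalGcTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long usedMb = (rt.totalMemory() - rt.freeMemory()) / (1024 * 1024);
        long maxMb = rt.maxMemory() / (1024 * 1024);

        System.out.println("Heap used/max (MB): " + usedMb + "/" + maxMb);
        System.out.println("GC count: " + totalGcCount());
        System.out.println("GC time (ms): " + totalGcTimeMillis());
    }
}
```

Taking two snapshots a few minutes apart during indexing and comparing the GC time delta with the elapsed wall-clock time gives a rough idea of how much time the JVM spends collecting.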
- Other performance considerations: the more rules you have, the higher the CPU consumption will be…
– In our case, we only use a few rules (mainly for meta-data extraction),
– Moving OO to a separate server can reduce the CPU impact when the file-to-.pdf transformation is used (this rule is still available on our server; it may or may not be used by end users…).