Wednesday, November 14, 2018

Externalizing JCR Version Storage with VFSFileSystem

A while ago, I wrote a blog article, "Can't we store huge amount of binary data in JCR?". It was about switching the Apache Jackrabbit DataStore from DbDataStore to either S3DataStore or VFSDataStore. Depending on your database usage pattern, that switch can save a huge amount of database space just by changing the DataStore component configuration in repository.xml.

In some cases, the version history data in the VERSION_BUNDLE table can grow as big as the DATASTORE table. The following is an excerpt from https://www.onehippo.org/library/administration/maintenance/cleaning-up-version-history.html, explaining what happens when you (de)publish a document, which creates revisions in the version history:
Each time a document is published, a copy of the current state of the document is stored as a new version. While this feature enables users to restore any previously published version of their document, it comes at the cost of an ever increasing size of the version history storage.
So if your users update and publish documents regularly, the version history grows in proportion to the number of publications, which can lead to a very large database at some point. Administrators need to monitor it, and they might need to remove old revisions just to reduce the database size.

This is the same story as the binary storage issue in the database that I dealt with in my previous blog article. Is there a solution for this? Do we really need to worry about the database growing because of the version history?

Yes, we have a solution in Apache Jackrabbit: VFSFileSystem.

The JackrabbitRepository component uses two distinct internal components: Workspace and VersionManager. (I'm using logical names here instead of physical class names such as org.apache.jackrabbit.core.RepositoryImpl.WorkspaceInfo.) See the diagram below:


Whenever a version needs to be made, the node data is copied to the VersionManager, which saves the data in its own FileSystem. That FileSystem is a DatabaseFileSystem by default if you use RDBMS persistence for Apache Jackrabbit. That's why, with the default configuration, the database size increases whenever a version is made.
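For contrast with the VFSFileSystem setup, a database-backed Versioning section might look roughly like the sketch below. This is only an illustration, not the configuration of any particular project: the MySQL-specific persistence manager, the JNDI data source name java:comp/env/jdbc/repositoryDS, and the version_ prefix are assumptions. The prefix is chosen here because it would produce a table named VERSION_BUNDLE. DbFileSystem is the concrete Jackrabbit class behind what I call DatabaseFileSystem above.

```xml
<!-- Sketch only: the data source name, schema and prefix are assumed values. -->
<Versioning rootPath="${rep.home}/version">

  <!-- File system abstraction of the version storage, kept in the database -->
  <FileSystem class="org.apache.jackrabbit.core.fs.db.DbFileSystem">
    <param name="driver" value="javax.naming.InitialContext" />
    <param name="url" value="java:comp/env/jdbc/repositoryDS" />
    <param name="schema" value="mysql" />
    <param name="schemaObjectPrefix" value="version_" />
  </FileSystem>

  <!-- Version bundles end up in a table named <prefix>BUNDLE -->
  <PersistenceManager
    class="org.apache.jackrabbit.core.persistence.pool.MySqlPersistenceManager">
    <param name="driver" value="javax.naming.InitialContext" />
    <param name="url" value="java:comp/env/jdbc/repositoryDS" />
    <param name="schemaObjectPrefix" value="version_" />
  </PersistenceManager>

</Versioning>
```

With a setup like this, every new version adds rows to the database, which is the growth described above.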

Now, if you switch the internal FileSystem of the VersionManager to VFSFileSystem with an SFTP or WebDAV backend, then all the version data (the copies from the Workspace) will be stored in that external file system instead of the database.

Switching the VersionManager to VFSFileSystem is straightforward. See the following snippet from a repository.xml configuration:

<Repository>


  <!-- SNIP -->


  <Versioning rootPath="${rep.home}/version">

    <FileSystem class="org.apache.jackrabbit.vfs.ext.fs.VFSFileSystem">
      <param name="config" value="${catalina.base}/conf/vfs2-filesystem-sftp.properties" />
    </FileSystem>

    <PersistenceManager
      class="org.apache.jackrabbit.core.persistence.bundle.BundleFsPersistenceManager">
    </PersistenceManager>

    <!-- SNIP -->

  </Versioning>

  <!-- SNIP -->

</Repository>

Just replace the FileSystem element and the PersistenceManager element inside the Versioning element as shown, so that versioning uses a VFSFileSystem configured through a properties file that specifies the SFTP credentials or a private key identity file.
The Apache Jackrabbit repository will then store all the version history data in the backend SFTP file system instead of the database.

You can find a working demo project on GitHub at https://github.com/woonsanko/hippo-davstore-demo. The demo project shows how to use VFSFileSystem with an SFTP backend for version history data, as well as how to use a binary DataStore backed by either a VFS file system or an AWS S3 bucket. Just follow its README.md.

