Wednesday, November 14, 2018

Apache Jackrabbit Database Usage Patterns and Options to Reduce Database size

Recently, I wrote about how to externalize version storage to an SFTP server backend to reduce database size: https://woonsanko.blogspot.com/2018/11/externalizing-jcr-version-storage-with.html. It is kind of similar case to how to keep the binary content in either AWS S3 bucket or virtual file system such as SFTP or WebDAV server as I described before in https://woonsanko.blogspot.com/2016/08/cant-we-store-huge-amount-of-binary.html. The only difference is, in high level, the former is about version history database table, VERSION_BUNDLE, whereas the latter is about the binary table, DATASTORE.

I'd like to explain how those tables make a significant impact on database size by showing database usage patterns from several real CMS systems. Also, I'd like to list the benefits by reducing the database size at last.

Pattern 1: Huge DATASTORE table for a Simple Website



In the chart, it shows more than 95% of database is consumed by DATASTORE table which stores only binary content such as images, PDF files, etc, not document or configuration nodes and properties. The project implements a CMS based website serving huge amount of binaries. But business users do not probably edit and publish documents often. It is also possible that they migrate some binary data such as images and PDF files from external sources to CMS in order to serve those through website easily.

If they switch the Apache Jackrabbit DataStore component from the default DbDataStore to either S3DataStore or VFSDataStore, they can save more than 95% of database.

Pattern 2: Big DATASTORE table with Modest Document/Node Updates



This site shows modest amount of document and node content in DEFAULT_BUNDLE table which contains the node bundle data of the default Jackrabbit workspace. It means that business users update and publish content in modest size. But still more than 90% of database is consumed for binary content only in DATASTORE table.

The same story goes. If they switch the Apache Jackrabbit DataStore component from the default DbDataStore to either S3DataStore or VFSDataStore, they can save more than 90% of database.

Pattern 3: More Document Oriented CMS



In this site, the DEFAULT_BUNDLE table is relatively bigger than other sites, taking more than 50% of database. It means that content document updates and publication is very important to business users with their CMS system. Business users probably need to update and (re)publish content more frequently for their websites.

As the default workspace data needs to be queried and accessed frequently in the delivery web applications, there's nothing to do more with the DEFAULT_BUNDLE table.
However, they still have consumed more than 20% of database only for binary content in DATASTORE table, and they have consumed up to 20% of database for version history in VERSION_BUNDLE table.
Therefore, if they switch both DataStore component and FileSystem component of VersionManager to alternatives -- S3DataStore / VFSDataStore and VFSFileSystem -- then they can save more than 40% of database.

Pattern 4: More Versioning or Periodic Content Ingestion to CMS



In this site, more than 55% of database is consumed for version history in VERSION_BUNDLE table, and up to 30% of database is consumed for binary content in DATASTORE table.
There are two possibilities: (a) business users update and publish document very often so that it results in a lot of version history data, or (b) there is a batch job periodically running to import external content into CMS with publishing the updated document after imports.
In either case, if they switch both DataStore component and FileSystem component of VersionManager to alternatives -- S3DataStore / VFSDataStore and VFSFileSystem -- then they can save more than 85% of database.

Benefits by Reducing Database Size


What are the benefits by reducing the repository database size by the way?
Here's my list:
  1. Transparent JCR API
    • As you're switching only Apache Jackrabbit internal components, it doesn't affect applications. You don't need to write or use a plugin to manage binary content in a different storage by yourself. The existing JCR API still works transparently.
    • Indexing still works transparently. If you upload a PDF file, it will be indexed and searchable. However, if you implement a custom solution, you need to take care of it by yourself.
  2. Almost unlimited storage for binaries
    • If you use either S3 bucket or SFTP gateway for Google Cloud Platform or even SFTP server directly, then you can store practically almost unlimited amount of binaries and version history in modern cloud computing world.
  3. Cheaper storage
    • Amazon S3 or SFTP server is a lot cheaper than database option. For example, Amazon RDS is more expensive than S3 storage for binary content.
  4. Faster backup, import, migration
    • Apache Jackrabbit DataStore component allows you to do hot-backup and restoration from the backup files to the backend system at runtime.
  5. Build new environment quickly from production data.
    • As the database is small enough in most cases, you can build a new environment from from other environment's backups more quickly.
  6. Save backup storage
    • If you do nightly backup, weekly backup, etc. and you have to keep those backup files for some period (e.g, 1 year), then you might need to worry about the backup disk storage sometimes. If the database size is small enough, your concerns will be more relieved by taking advantage of S3 backup capabilities.
  7. Encryption at rest
    • If you have sensitive PDF files for example, you might want to take advantage of Encryption at REST provided by Amazon S3 or Linux file system.


Externalizing JCR Version Storage with VFSFileSystem

A while ago, I wrote a blog article, Can't we store huge amount of binary data in JCR?. It was about switching Apache Jackrabbit DataStore from DbDataStore to either S3DataStore or VFSDataStore. Depending on your database usage pattern, it will allow you to save huge amount of database just by switching DataStore component configuration in the repository.xml.

In some cases, the version history data in VERSION_BUNDLE could be as big as DATASTORE table. The following is an excerpt from https://www.onehippo.org/library/administration/maintenance/cleaning-up-version-history.html, explaining what's happening when you (de)publish a document, causing revisions in version history:
Each time a document is published, a copy of the current state of the document is stored as a new version. While this feature enables users to restore any previously published version of their document, it comes at the cost of an ever increasing size of the version history storage.
So if your users update and publish documents regularly, the version history data size will increase proportionally as time goes by, which might cause a big database size at some point. Administrators need to monitor it and they might need to remove old revisions just to reduce the database size.

The same story goes here as we have dealt with binary storage issue in database in my previous blog article. Is there a solution for this? Do we really need to care about database size increases for the version history?

Yes, we have a solution in Apache Jackrabbit: VFSFileSystem.

JackrabbitRepository component uses two distinct internal components: Workspace and VersionManager. (I'm using logical names instead of physical class names such as org.apache.jackrabbit.core.RepositoryImpl.WorkspaceInfo here.) See the diagram below:


Whenever a version needs to be made, the node data is copied to VersionManager, which saves the data in its own FileSystem -- DatabaseFileSystem by default if you use RDBMS persistence for Apache Jackrabbit. That's why the database size should increase by default whenever a version is made.

Now if you switch the internal FileSystem of the VersionManager to VFSFileSystem with SFTP or WebDAV backend, then all the version data, the copies from the Workspace, will be stored in an external file system such as SFTP or WebDAV backend instead.

Switching it to VFSFileSystem for VersionManager is straightforward. See the following snippets from repository.xml configuration:

<Repository>


  <!-- SNIP -->


  <Versioning rootPath="${rep.home}/version">

    <FileSystem class="org.apache.jackrabbit.vfs.ext.fs.VFSFileSystem">
      <param name="config" value="${catalina.base}/conf/vfs2-filesystem-sftp.properties" />
    </FileSystem>

    <PersistenceManager
      class="org.apache.jackrabbit.core.persistence.bundle.BundleFsPersistenceManager">
    </PersistenceManager>

    <!-- SNIP -->

  </Versioning>

  <!-- SNIP -->

</Repository>

Just replace FileSystem element and PersistenceManager element inside the Versioning element to use VFSFileSystem which is configured with a properties file specifying SFTP credentials or private key identity file.
Then it will make Apache Jackrabbit Repository to store all the version history data in the backend SFTP file system instead of database.

Please find a working demo project in my GitHub project at https://github.com/woonsanko/hippo-davstore-demo. The demo project shows how to use VFSFile system for an SFTP backend system option for version history data as well as binary DataStore option with either VFS file system or AWS S3 bucket backend. Just follow its README.md.