Wednesday, November 14, 2018

Apache Jackrabbit Database Usage Patterns and Options to Reduce Database Size

Recently, I wrote about how to externalize version storage to an SFTP server backend to reduce database size: https://woonsanko.blogspot.com/2018/11/externalizing-jcr-version-storage-with.html. It is a similar case to keeping the binary content in either an AWS S3 bucket or a virtual file system such as an SFTP or WebDAV server, as I described before in https://woonsanko.blogspot.com/2016/08/cant-we-store-huge-amount-of-binary.html. The only high-level difference is that the former is about the version history table, VERSION_BUNDLE, whereas the latter is about the binary table, DATASTORE.

I'd like to explain how those tables can make a significant impact on database size by showing database usage patterns from several real CMS systems. At the end, I'll also list the benefits of reducing the database size.

Pattern 1: Huge DATASTORE table for a Simple Website



The chart shows that more than 95% of the database is consumed by the DATASTORE table, which stores only binary content such as images and PDF files, not document or configuration nodes and properties. The project implements a CMS-based website serving a huge amount of binaries, but business users probably do not edit and publish documents often. It is also possible that they migrated some binary data such as images and PDF files from external sources into the CMS in order to serve those through the website easily.

If they switch the Apache Jackrabbit DataStore component from the default DbDataStore to either S3DataStore or VFSDataStore, they can save more than 95% of the database size.

Pattern 2: Big DATASTORE table with Modest Document/Node Updates



This site shows a modest amount of document and node content in the DEFAULT_BUNDLE table, which contains the node bundle data of the default Jackrabbit workspace. That means business users update and publish a modest amount of content. But still, more than 90% of the database is consumed by binary content alone in the DATASTORE table.

The same story goes: if they switch the Apache Jackrabbit DataStore component from the default DbDataStore to either S3DataStore or VFSDataStore, they can save more than 90% of the database size.

Pattern 3: More Document Oriented CMS



In this site, the DEFAULT_BUNDLE table is relatively bigger than in the other sites, taking more than 50% of the database. That means content document updates and publication are very important to the business users of this CMS system; they probably need to update and (re)publish content more frequently for their websites.

As the default workspace data needs to be queried and accessed frequently by the delivery web applications, there is nothing more to be done about the DEFAULT_BUNDLE table.
However, they still consume more than 20% of the database for binary content alone in the DATASTORE table, and up to 20% of the database for version history in the VERSION_BUNDLE table.
Therefore, if they switch both the DataStore component and the FileSystem component of the VersionManager to the alternatives -- S3DataStore / VFSDataStore and VFSFileSystem -- they can save more than 40% of the database size.

Pattern 4: More Versioning or Periodic Content Ingestion to CMS



In this site, more than 55% of the database is consumed by version history in the VERSION_BUNDLE table, and up to 30% by binary content in the DATASTORE table.
There are two possibilities: (a) business users update and publish documents very often, resulting in a lot of version history data, or (b) a batch job runs periodically to import external content into the CMS, publishing the updated documents after each import.
In either case, if they switch both the DataStore component and the FileSystem component of the VersionManager to the alternatives -- S3DataStore / VFSDataStore and VFSFileSystem -- they can save more than 85% of the database size.

Benefits of Reducing Database Size


What are the benefits of reducing the repository database size, by the way?
Here's my list:
  1. Transparent JCR API
    • As you're switching only Apache Jackrabbit internal components, it doesn't affect your applications. You don't need to write or use a plugin to manage binary content in a different storage yourself; the existing JCR API still works transparently (see the sketch after this list).
    • Indexing also still works transparently: if you upload a PDF file, it will be indexed and searchable. With a custom solution, you would have to take care of that yourself.
  2. Almost unlimited storage for binaries
    • If you use an S3 bucket, an SFTP gateway for Google Cloud Platform, or even an SFTP server directly, then you can store a practically unlimited amount of binaries and version history in the modern cloud computing world.
  3. Cheaper storage
    • An Amazon S3 bucket or an SFTP server is a lot cheaper than the database option; Amazon RDS, for example, is more expensive than S3 storage for binary content.
  4. Faster backup, import, migration
    • The Apache Jackrabbit DataStore component allows you to do hot backups and to restore from the backup files to the backend system at runtime.
  5. Build a new environment quickly from production data
    • As the database is small enough in most cases, you can build a new environment from another environment's backups more quickly.
  6. Save backup storage
    • If you do nightly backups, weekly backups, etc., and have to keep those backup files for some period (e.g., 1 year), then you sometimes need to worry about backup disk storage. If the database size is small enough, that concern is relieved, and you can also take advantage of S3 backup capabilities.
  7. Encryption at rest
    • If you have sensitive PDF files, for example, you might want to take advantage of the encryption at rest provided by Amazon S3 or the Linux file system.
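
To illustrate the first benefit, here is a minimal sketch of uploading a PDF file through the plain JCR API, using the standard nt:file / nt:resource node types (the folder path is hypothetical). The exact same code works whether the repository is configured with DbDataStore, S3DataStore or VFSDataStore, because the DataStore is swapped purely in repository.xml:

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Calendar;

import javax.jcr.Binary;
import javax.jcr.Node;
import javax.jcr.Session;

public class BinaryUploadExample {

    // Stores a PDF file under a folder node using only the standard JCR API.
    // Which DataStore ends up holding the bytes is decided by repository.xml,
    // not by this code.
    public static void uploadPdf(Session session, String folderPath, File pdfFile) throws Exception {
        Node folder = session.getNode(folderPath);
        Node file = folder.addNode(pdfFile.getName(), "nt:file");
        Node content = file.addNode("jcr:content", "nt:resource");

        try (InputStream input = new FileInputStream(pdfFile)) {
            Binary binary = session.getValueFactory().createBinary(input);
            content.setProperty("jcr:data", binary);
        }
        content.setProperty("jcr:mimeType", "application/pdf");
        content.setProperty("jcr:lastModified", Calendar.getInstance());

        session.save();
    }
}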


Externalizing JCR Version Storage with VFSFileSystem

A while ago, I wrote a blog article, Can't we store huge amount of binary data in JCR?. It was about switching the Apache Jackrabbit DataStore from DbDataStore to either S3DataStore or VFSDataStore. Depending on your database usage pattern, this allows you to save a huge amount of database space just by switching the DataStore component configuration in repository.xml.

In some cases, the version history data in the VERSION_BUNDLE table can be as big as the DATASTORE table. The following is an excerpt from https://www.onehippo.org/library/administration/maintenance/cleaning-up-version-history.html, explaining what happens when you (de)publish a document, creating revisions in the version history:
Each time a document is published, a copy of the current state of the document is stored as a new version. While this feature enables users to restore any previously published version of their document, it comes at the cost of an ever increasing size of the version history storage.
So if your users update and publish documents regularly, the version history data will grow proportionally as time goes by, which might result in a big database at some point. Administrators need to monitor it, and they might need to remove old revisions just to reduce the database size -- for example, with a pruning routine like the sketch below.
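
For example, without externalizing the version storage, an administrator might prune old revisions with a routine like the following sketch, which uses only the standard JCR 2.0 versioning API (the node path and retention count are hypothetical; Hippo also documents its own version cleanup tooling on the page quoted above):

import java.util.ArrayList;
import java.util.List;

import javax.jcr.Session;
import javax.jcr.version.Version;
import javax.jcr.version.VersionHistory;
import javax.jcr.version.VersionIterator;
import javax.jcr.version.VersionManager;

public class VersionPruneExample {

    // Removes all but the most recent 'keepCount' versions of the node at 'absPath'.
    // The root version is skipped because JCR never allows removing it.
    public static void pruneOldVersions(Session session, String absPath, int keepCount) throws Exception {
        VersionManager versionManager = session.getWorkspace().getVersionManager();
        VersionHistory history = versionManager.getVersionHistory(absPath);

        List<String> versionNames = new ArrayList<>();
        for (VersionIterator it = history.getAllLinearVersions(); it.hasNext(); ) {
            Version version = it.nextVersion();
            if (!"jcr:rootVersion".equals(version.getName())) {
                versionNames.add(version.getName()); // linear versions come oldest-first
            }
        }

        // Remove from the oldest until only 'keepCount' versions remain.
        for (int i = 0; i < versionNames.size() - keepCount; i++) {
            history.removeVersion(versionNames.get(i));
        }
    }
}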

The same story goes here as with the binary storage issue in the database from my previous blog article. Is there a solution for this? Do we really need to care about database size increases for the version history?

Yes, we have a solution in Apache Jackrabbit: VFSFileSystem.

The JackrabbitRepository component uses two distinct internal components: Workspace and VersionManager. (I'm using logical names here instead of physical class names such as org.apache.jackrabbit.core.RepositoryImpl.WorkspaceInfo.) See the diagram below:


Whenever a version needs to be made, the node data is copied to the VersionManager, which saves the data in its own FileSystem -- DatabaseFileSystem by default if you use RDBMS persistence for Apache Jackrabbit. That's why the database size increases by default whenever a version is made.
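
In JCR API terms, a version is made when a versionable node is checked in; each checkin copies the node state into the VersionManager's FileSystem. Here's a minimal sketch using the standard JCR 2.0 versioning API (the node path and property name are hypothetical, and the node is assumed to be mix:versionable):

import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.version.Version;
import javax.jcr.version.VersionManager;

public class CheckinExample {

    // Updates a versionable (mix:versionable) node and checks it in,
    // which copies the node state into the version storage.
    public static Version updateAndVersion(Session session, String absPath) throws Exception {
        VersionManager versionManager = session.getWorkspace().getVersionManager();

        versionManager.checkout(absPath);                    // make the node writable
        Node node = session.getNode(absPath);
        node.setProperty("example:title", "Updated title");  // hypothetical property
        session.save();

        return versionManager.checkin(absPath);              // creates a new version in version storage
    }
}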

Now, if you switch the internal FileSystem of the VersionManager to VFSFileSystem with an SFTP or WebDAV backend, then all the version data -- the copies from the Workspace -- will be stored in that external file system instead.

Switching the VersionManager to VFSFileSystem is straightforward. See the following snippet from a repository.xml configuration:

<Repository>


  <!-- SNIP -->


  <Versioning rootPath="${rep.home}/version">

    <FileSystem class="org.apache.jackrabbit.vfs.ext.fs.VFSFileSystem">
      <param name="config" value="${catalina.base}/conf/vfs2-filesystem-sftp.properties" />
    </FileSystem>

    <PersistenceManager
      class="org.apache.jackrabbit.core.persistence.bundle.BundleFsPersistenceManager">
    </PersistenceManager>

    <!-- SNIP -->

  </Versioning>

  <!-- SNIP -->

</Repository>

Just replace the FileSystem element and the PersistenceManager element inside the Versioning element to use VFSFileSystem, configured with a properties file specifying the SFTP credentials or a private key identity file.
The Apache Jackrabbit repository will then store all the version history data in the backend SFTP file system instead of the database.

Please find a working demo project in my GitHub repository at https://github.com/woonsanko/hippo-davstore-demo. The demo project shows how to use VFSFileSystem with an SFTP backend for the version history data, as well as the binary DataStore options with either a VFS file system or an AWS S3 bucket backend. Just follow its README.md.


Friday, January 12, 2018

Recipe for Migrating Hippo CMS Database from One to Another

Sometimes people want to migrate an existing Hippo CMS database from one platform to another. For example, they have been running Hippo CMS on an Oracle database, but after a while they start thinking about moving their on-premise system and database to a cloud platform. Sounds like a typical use case, and there must be some solutions already out there, right?

Well, surprisingly, many people don't know that Apache Jackrabbit has provided a repository copying (or "backup" or "migration", as it is called in the documentation) tool since v1.6, dating back to 2010!

There are some reasons why people don't know about this useful tool:
  • Many people use a vendor-specific Apache Jackrabbit repository implementation from a specific project or product, not the Apache Jackrabbit Standalone Server itself. So, even though the backup and migration feature is well documented on the Apache Jackrabbit Standalone Server page, it is hard for them to follow.
  • Each vendor-specific implementation on top of Apache Jackrabbit, such as Hippo CMS, has some tweaks for its own purposes, including extra libraries on top of the default Apache Jackrabbit modules. So, if users don't know which extra libraries to add themselves, it can hardly work for them.

That's why I created a 'recipe' project in one of my GitHub repositories:

The recipe provides a step-by-step guide with Hippo CMS specific examples. I think it should be helpful for other Apache Jackrabbit derivatives too. Please browse the source.

Last but not least, many thanks to Apache Jackrabbit Standalone Server tool! Cheers!

Tuesday, May 23, 2017

Remoting for Automation via Apache Jackrabbit JCR Webdav Server from Command Lines

Sometimes we need to create, update or even delete data in JCR in an automated way. For example, we may need to update some properties on specific configuration nodes just after resetting the database and restarting the server for a specific environment. Or we may need to import some data from XML into a remote JCR repository just after startup. Obviously you can do all of this manually through the UI, but concerns arise when you need to do it in an automated way through a batch job or script.
I'd like to introduce the Apache Jackrabbit JCR WebDAV Server, which provides an advanced remoting feature, and show how you can take advantage of it in an automated way, such as from the command line.

Apache Jackrabbit JCR WebDAV Server

The Apache Jackrabbit JCR WebDAV Server was designed to support remote JCR API calls over the underlying WebDAV protocol. You can create, read, update or delete data in a JCR content repository through the JCR WebDAV Server via either (a) the JCR client API or (b) direct WebDAV requests from the client.




It is really nice to be able to use JCR APIs directly from a remote client without having to care about the details of the WebDAV/HTTP payloads -- a good topic to cover in depth later -- but in this article, I'd like to focus on the use cases from a command line client because they're more related to the "automation" topic of this article.
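
Just to give a quick taste of option (a) anyway: the following minimal sketch uses org.apache.jackrabbit.commons.JcrUtils with the jackrabbit-jcr2dav client module on the classpath. Whether it connects depends on the remoting endpoint your server actually exposes, so treat the URL (borrowed from the curl examples later in this article) as an assumption for your own setup:

import javax.jcr.Node;
import javax.jcr.Repository;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;

import org.apache.jackrabbit.commons.JcrUtils;

public class RemoteJcrClientExample {

    public static void main(String[] args) throws Exception {
        // Assumed endpoint; see the curl examples below for the same base URL.
        Repository repository = JcrUtils.getRepository("http://localhost:8080/cms/server");
        Session session = repository.login(new SimpleCredentials("admin", "admin".toCharArray()));

        try {
            // Plain JCR API from here on; the WebDAV payloads are handled under the hood.
            Node groups = session.getNode("/hippo:configuration/hippo:groups");
            System.out.println("Found node: " + groups.getPath());
        } finally {
            session.logout();
        }
    }
}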

Command Line Examples through WebDAV/HTTP

I don't want to copy every example again here. Jukka Zitting, a former chairman of the Apache Jackrabbit project and incubator PMCs, already explained it with very intuitive examples in one of his great blog articles in the past.
Jukka's article explains how to create a node, how to read a node, how to update a single-valued property of different types such as date or string, and how to delete a node.

I'd like to just add two more helpful examples below.

Updating Multiple Values Property from Command Lines


If you want to update a multiple-valued property, like the hipposys:members property in the following example CND, you can't reuse the single-valued property update example from Jukka's blog article:

[hipposys:group] > nt:base
- hipposys:system (boolean)
- hipposys:members (string) multiple
- hipposys:description (string)
...

To update a multiple-valued property, you need to wrap the values in a <values xmlns='http://www.day.com/jcr/webdav/1.0'>...</values> element in the data argument. Here's an example curl command to update the property:

  curl --request PUT --header "Content-Type: jcr-value/undefined" \
    --data "<values xmlns='http://www.day.com/jcr/webdav/1.0'><value>editor</value><value>john</value><value>jane</value></values>" \
    --user admin:admin \
    http://localhost:8080/cms/server/default/jcr:root/hippo:configuration/hippo:groups/editor/hipposys:members

Importing System View XML file to JCR from Command Lines


This example is basically just a variation of the node creation shown in Jukka's blog article, using an external system view XML file instead.
Suppose you have the following system view XML file (e.g., editor.xml):

<?xml version="1.0" encoding="UTF-8"?>
<sv:node xmlns:sv="http://www.jcp.org/jcr/sv/1.0" sv:name="editor">
  <sv:property sv:name="jcr:primaryType" sv:type="Name">
    <sv:value>hipposys:group</sv:value>
  </sv:property>
  <sv:property sv:name="hipposys:members" sv:type="String" sv:multiple="true">
    <sv:value>editor</sv:value>
    <sv:value>john</sv:value>
  </sv:property>
  <sv:property sv:name="hipposys:securityprovider" sv:type="String">
    <sv:value>internal</sv:value>
  </sv:property>
</sv:node>

You can pipe the content of the input file, editor.xml, into a curl command by specifying the --data argument as @-, which means the data is read from the standard input.

  cat editor.xml | curl -v --request MKCOL --data @- --user admin:admin \
    http://localhost:8080/cms/server/default/jcr:root/hippo:configuration/hippo:groups/editor

The command can be rewritten as follows, just as a different way of supplying the standard input:

  curl --request MKCOL --data @- --user admin:admin \
    http://localhost:8080/cms/server/default/jcr:root/hippo:configuration/hippo:groups/editor \
    < editor.xml

Or you can specify the input file directly by prefixing the file path with '@', as in the following example:

  curl --request MKCOL --data @editor.xml --user admin:admin \
    http://localhost:8080/cms/server/default/jcr:root/hippo:configuration/hippo:groups/editor

Therefore, if you want to remove an existing /hippo:configuration/hippo:groups/editor node and recreate it from the XML file, you can execute a delete command like the following, followed by one of the create commands explained above:

  curl --request DELETE --user admin:admin \
    http://localhost:8080/cms/server/default/jcr:root/hippo:configuration/hippo:groups/editor


I guess you already grasp the whole idea of how we can take advantage of the JCR WebDAV Server for automation from the command line. Basically, you can easily create, read, update or delete any content in JCR from command lines!
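
And if you'd rather drive the same WebDAV requests from Java code instead of curl, the JDK's built-in java.net.http.HttpClient (Java 11+) is sufficient. Here is a sketch mirroring the DELETE example above (the URL and credentials are the same assumptions as in the curl examples):

import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class JcrWebDavCleanupExample {

    public static void main(String[] args) throws Exception {
        // Same endpoint and credentials as in the curl examples above.
        String url = "http://localhost:8080/cms/server/default/jcr:root"
                + "/hippo:configuration/hippo:groups/editor";
        String basicAuth = Base64.getEncoder()
                .encodeToString("admin:admin".getBytes(StandardCharsets.UTF_8));

        HttpClient client = HttpClient.newHttpClient();

        // DELETE the existing node, mirroring the curl --request DELETE example.
        HttpRequest delete = HttpRequest.newBuilder(URI.create(url))
                .header("Authorization", "Basic " + basicAuth)
                .DELETE()
                .build();

        HttpResponse<String> response = client.send(delete, HttpResponse.BodyHandlers.ofString());
        System.out.println("DELETE status: " + response.statusCode());

        // For the MKCOL re-creation, use .method("MKCOL", HttpRequest.BodyPublishers.ofFile(...))
        // with the system view XML shown above.
    }
}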

How to Enable JCR WebDAV Server in Hippo CMS Repository?

By default, the JCR WebDAV Server is not enabled in the Hippo CMS Repository, but you can easily install and configure it by following the community forge plugin documentation.
Please let me know if you have any further questions.

Thursday, October 20, 2016

Playing with an Apache Jackrabbit DataStore Migration Tool

A while ago, I posted a blog article (Can't we store huge amount of binary data in JCR?) about why the Apache Jackrabbit VFSDataStore or S3DataStore is useful and how to use it when storing a huge amount of binary data in JCR. But we already have many running JCR systems with different DataStores (e.g. DbDataStore), so we need to be able to migrate an existing DataStore to VFSDataStore or S3DataStore. That's what I wanted to do with a migration tool (https://github.com/woonsan/jackrabbit-datastore-migration).
In this article, I'd like to share my experience migrating a DbDataStore to VFSDataStore in a real project with the tool.

The Problem


One of my projects (based on Hippo CMS) uses DbDataStore, which is configured in repository.xml as the default option like the following:

<DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
  <param name="url" value="java:comp/env/jdbc/repositoryDS" />
  <param name="driver" value="javax.naming.InitialContext" />
  <param name="databaseType" value="mysql" />
  <param name="minRecordLength" value="1024" />
  <param name="maxConnections" value="5" />
  <param name="copyWhenReading" value="true" />
</DataStore>

Basically, I want to replace the DbDataStore with a VFSDataStore backed by an SFTP server once the data migration is done:

<DataStore class="org.apache.jackrabbit.vfs.ext.ds.VFSDataStore">
  <param name="config" value="${catalina.base}/conf/vfs2-datastore-sftp.properties" />
  <param name="asyncWritePoolSize" value="10" />
  <param name="secret" value="123456"/>
  <param name="minRecordLength" value="1024"/>
  <param name="recLengthCacheSize" value="10000" />
</DataStore>

And, vfs2-datastore-sftp.properties should look like the following:

# SFTP base folder URL
baseFolderUri = sftp://tester:secret@localhost/vfsds
# when the identity file (your private key file) is used instead of password
#fso.sftp.identities = /home/tester/.ssh/id_rsa

So, we need to migrate all the data managed by DbDataStore to the SFTP location before switching to VFSDataStore.

Data Migration Steps


First of all, we need to download the latest version of the migration tool from https://github.com/woonsan/jackrabbit-datastore-migration/releases.
After uncompressing the downloaded file into a folder, we can build it with `mvn package`, which generates the `jackrabbit-datastore-migration-x.x.x.jar` file under the `target` folder.

Second, we need to configure the "source" DataStore and "target" DataStore in a YAML file like the following example (e.g. config/migration-db-to-vfs.yaml):

logging:
    level:
        root: 'WARN'
        com.github.woonsan.jackrabbit.migration.datastore: 'INFO'

batch:
    minWorkers: '10'
    maxWorkers: '10'

source:
    dataStore:
        homeDir: 'target/storage-db'
        className: 'org.apache.jackrabbit.core.data.db.DbDataStore'
        params:
            url: 'jdbc:mysql://localhost:3306/hippodb?autoReconnect=true&characterEncoding=utf8'
            user: 'hippo'
            password: 'hippo'
            driver: 'com.mysql.jdbc.Driver'
            databaseType: 'mysql'
            minRecordLength: '1024'
            maxConnections: '10'
            copyWhenReading: 'true'
            tablePrefix: ''
            schemaObjectPrefix: ''
            schemaCheckEnabled: 'false'

target:
    dataStore:
        homeDir: 'target/storage-vfs'
        className: 'org.apache.jackrabbit.vfs.ext.ds.VFSDataStore'
        params:
            asyncUploadLimit: '0'
            baseFolderUri: 'sftp://tester:secret@localhost/vfsds'
            minRecordLength: '1024'

As you can see, the "source" DataStore is configured with a DbDataStore backed by a MySQL database, and the "target" DataStore is configured with a VFSDataStore backed by an SFTP location.
Please note that the configuration style for each DataStore is equivalent to how it is set in repository.xml, if you compare the two configurations.
In addition, the YAML configuration includes settings for logging and thread pool worker counts, since logging and multi-threaded workers are important in this kind of batch application.

Now it's time to execute the migration tool.
Assuming you have the JDBC driver jar file in the lib/ directory (e.g. lib/mysql-connector-java-5.1.38.jar), you can execute the tool like the following:

$ java -Dloader.path="lib/" \
       -jar target/jackrabbit-datastore-migration-0.0.1-SNAPSHOT.jar \
       --spring.config.location=config/migration-db-to-vfs.yaml

Or, if you know a specific location where the JDBC driver jar file exists, you can run it like this instead:

java -Dloader.path=/home/tester/.m2/repository/mysql/mysql-connector-java/5.1.38/ \
     -jar target/jackrabbit-datastore-migration-0.0.1-SNAPSHOT.jar \
     --spring.config.location=config/migration-db-to-vfs.yaml

If your configuration is okay and the tool runs properly, you will see result logs like the following:

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v1.4.0.RELEASE)

...
2016-10-17 23:14:44.785  INFO 5071 --- [           main] .w.j.m.d.b.MigrationJobExecutionReporter :
===============================================================================================================
Execution Summary:
---------------------------------------------------------------------------------------------------------------
Total: 22383, Processed: 22383, Read Success: 22383, Read Fail: 0, Write Success: 22383, Write Fail: 0, Duration: 1887607ms
---------------------------------------------------------------------------------------------------------------
Details (in CSV format):
---------------------------------------------------------------------------------------------------------------
SEQ,ID,READ,WRITE,SIZE,ERROR
1,000082f676bf6ed3a39debd6b656287efa6687b6,true,true,54395,
2,00030e05db40739611fcd06d1af26cc7a6afd5b0,true,true,626789,
3,00076b2f5e4e43245928accbcc90fcf738121652,true,true,4097,
...
22382,fff8b902c59f0c310ff53952d86b17d383805355,true,true,258272,
22383,fffc167ef45efdc9b3bea7ed3953fea8ccdb294f,true,true,518903,
===============================================================================================================

2016-10-17 23:14:44.820  INFO 5071 --- [           main] c.g.w.j.migration.datastore.Application  : Started Application in 1892.767 seconds (JVM running for 1893.449)

Spring Boot configures the logging very nicely by default. You can also change the logging configuration; please see the Spring Boot documentation for that.
As shown above, after the execution logging lines the tool prints the result, including each record's sequence number, read/write status, byte size and error information, in CSV format.

Switching to VFSDataStore and Restart


Once all the binary data is migrated from the DbDataStore to the VFSDataStore (the SFTP location), we can switch to VFSDataStore by replacing the old <DataStore> element in repository.xml with this:

<DataStore class="org.apache.jackrabbit.vfs.ext.ds.VFSDataStore">
  <param name="config" value="${catalina.base}/conf/vfs2-datastore-sftp.properties" />
  <param name="asyncWritePoolSize" value="10" />
  <param name="secret" value="123456"/>
  <param name="minRecordLength" value="1024"/>
  <param name="recLengthCacheSize" value="10000" />
</DataStore>

Restart the server, and the binary data will now be served from the SFTP server through the new VFSDataStore component!



Tuesday, August 30, 2016

Can't we store huge amount of binary data in JCR?

Can't we store huge amount of binary data in JCR? If, as a software architect, you have ever been asked a question like this (e.g., a requirement to store a huge amount of binary data such as PDF files in JCR), you have probably spent a moment weighing candidate solutions. What is technically feasible and what's not? What is most appropriate to fulfill all the different quality attributes (such as scalability, performance and security) with acceptable trade-offs? Furthermore, what is cost-effective and what's not?

Surprisingly, many people have tried to avoid JCR storage for binary data if the amount is going to be really huge. Instead of using JCR, in many cases they have implemented a custom (UI) module to store binary data directly in a different storage such as SFTP, S3 or WebDAV through storage-specific backend APIs.



It somewhat makes sense to separate the binary data store if the amount is going to be really huge. Otherwise, the size of the database used by JCR can grow too much, which makes it harder and more costly to maintain, back up, restore and deploy as time goes by. Also, if your application is required to serve the binary data in a very scalable way, it will be more difficult to keep everything in a single database than to separate the binary data store somewhere else.

But there is a big disadvantage to this custom (UI) module approach. If you store a PDF file through a custom (UI) module, you won't be able to search its content through the standard JCR query API any more, because JCR (Jackrabbit) is never involved in storing/indexing/retrieving the binary data. If you could use the JCR API to store the data, then Apache Jackrabbit would index your binary node automatically and you would be able to search the content very easily. Being unable to search PDF documents through the standard JCR API can be a big disappointment.

Let's face the initial question again: can't we store huge amount of binary data in JCR?
Actually... yes, we can. We can store a huge amount of binary data through JCR in a standard way if we choose the right Apache Jackrabbit DataStore for a different backend such as SFTP, WebDAV or S3. Apache Jackrabbit was designed to allow plugging in a different DataStore, and it provides various DataStore components for various backends. As of Apache Jackrabbit 2.13.2 (released on August 29, 2016), it even supports an Apache Commons VFS based DataStore component, which enables SFTP and WebDAV to be used as backend storage. That's what I'm going to talk about here.

DataStore Component in Apache Jackrabbit

Before jumping into the details, let me explain what the DataStore was designed for in Apache Jackrabbit. Basically, the Apache Jackrabbit DataStore was designed to support large binary storage with good performance and reduced disk usage. Normally, all node and property data is stored through the PersistenceManager, but relatively large binaries such as PDF files are stored separately through the DataStore component.



DataStore enables:
  • Fast copy (only the identifier is stored by PersistenceManager, in database for example),
  • No blocking in storing and reading,
  • Immutable objects in DataStore,
  • Hot backup support, and
  • All cluster nodes using the same DataStore.
Please see https://wiki.apache.org/jackrabbit/DataStore for more detail. In particular, note that a binary data entry in the DataStore is immutable, so it cannot be changed after creation. This makes it a lot easier to support caching, hot backup/restore and clustering. Binary data entries that are no longer used are deleted automatically by the Jackrabbit garbage collector.

Apache Jackrabbit has several DataStore implementations as shown below:


FileDataStore uses a local file system, DbDataStore uses a relational database, and S3DataStore uses Amazon S3 as the backend. Very interestingly, VFSDataStore uses a virtual file system provided by the Apache Commons VFS module.

FileDataStore cannot be used if you don't have a stable shared file system between cluster nodes. DbDataStore has been used by Hippo Repository by default because it works well in a clustered environment, unless the amount of binary data grows extremely large. S3DataStore and VFSDataStore look more interesting because you can store binary data in external storage. In the following diagrams, binary data is handled by Jackrabbit through standard JCR APIs, so Jackrabbit has a chance to index even binary data such as PDF files. Jackrabbit invokes S3DataStore or VFSDataStore to store or retrieve binary data, and the DataStore component invokes its internal Backend component (S3Backend or VFSBackend) to write/read to/from the backend storage.


One important thing to note is that both S3DataStore and VFSDataStore extend Apache Jackrabbit's CachingDataStore. This gives a big performance benefit, because a CachingDataStore caches binary data entries in the local file system so that it doesn't have to communicate with the backend unnecessarily.


As shown in the preceding diagram, when Jackrabbit needs to retrieve a binary data entry, it invokes the DataStore (a CachingDataStore such as S3DataStore or VFSDataStore, in this case) with an identifier. The CachingDataStore first checks whether the binary data entry already exists in its LocalCache. [R1] If it is not found there, it invokes its Backend (such as S3Backend or VFSBackend) to read the data from the backend storage such as S3, SFTP, WebDAV, etc. [B1] When reading the data entry, it stores the entry in the LocalCache as well and serves the data back to Jackrabbit. The CachingDataStore keeps the LRU cache, LocalCache, up to 64GB by default, in a local folder that can be changed in the configuration. Therefore, it should be very performant when a binary data entry is requested multiple times, because it is most likely to be served from the local file cache. Serving binary data from a locally cached file is probably much faster than serving it through DbDataStore, since DbDataStore doesn't extend CachingDataStore nor have a local file cache concept at all (yet).
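
Conceptually, that read path boils down to the following sketch. Note this is illustrative only, not the actual Jackrabbit classes: the real CachingDataStore keeps an LRU-managed local file cache (LocalCache), not an in-memory map:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Illustrative sketch of the cache-first read flow described above.
public class CacheFirstReadSketch {

    interface Backend {
        byte[] read(String identifier) throws IOException; // e.g. S3Backend or VFSBackend
    }

    private final Map<String, byte[]> localCache = new ConcurrentHashMap<>();
    private final Backend backend;

    public CacheFirstReadSketch(Backend backend) {
        this.backend = backend;
    }

    public InputStream getRecord(String identifier) throws IOException {
        byte[] data = localCache.get(identifier);   // [R1] try the local cache first
        if (data == null) {
            data = backend.read(identifier);        // [B1] fall back to the backend (S3, SFTP, ...)
            localCache.put(identifier, data);       // populate the cache for next time
        }
        return new ByteArrayInputStream(data);
    }
}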

Using VFSDataStore in a Hippo CMS Project

To use VFSDataStore, you need the following properties in the root pom.xml:

  <properties>

    <!--***START temporary override of versions*** -->
    <!-- ***END temporary override of versions*** -->
    <com.jcraft.jsch.version>0.1.53</com.jcraft.jsch.version>

    <!-- SNIP -->

  </properties>

Apache Jackrabbit has supported VFSDataStore since version 2.13.2. You also need to add the following dependencies in cms/pom.xml:

    <!-- Adding jackrabbit-vfs-ext -->
    <dependency>
      <groupId>org.apache.jackrabbit</groupId>
      <artifactId>jackrabbit-vfs-ext</artifactId>
      <version>${jackrabbit.version}</version>
      <scope>runtime</scope>
      <!--
        Exclude jackrabbit-api and jackrabbit-jcr-commons since those were pulled
        in by Hippo Repository modules.
      -->
      <exclusions>
        <exclusion>
          <groupId>org.apache.jackrabbit</groupId>
          <artifactId>jackrabbit-api</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.jackrabbit</groupId>
          <artifactId>jackrabbit-jcr-commons</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- Required to use SFTP VFS2 File System -->
    <dependency>
      <groupId>com.jcraft</groupId>
      <artifactId>jsch</artifactId>
      <version>${com.jcraft.jsch.version}</version>
    </dependency>

And, we need to configure VFSDataStore in conf/repository.xml like the following example:

<Repository>

  <!-- SNIP -->

  <DataStore class="org.apache.jackrabbit.vfs.ext.ds.VFSDataStore">
    <param name="config" value="${catalina.base}/conf/vfs2.properties" />
    <!-- VFSDataStore specific parameters -->
    <param name="asyncWritePoolSize" value="10" />
    <!--
      CachingDataStore specific parameters:
        - secret : key to generate a secure reference to a binary.
    -->
    <param name="secret" value="123456789"/>
    <!--
      Other important CachingDataStore parameters with default values, just for information:
        - path : local cache directory path. ${rep.home}/repository/datastore by default.
        - cacheSize : The number of bytes in the cache. 64GB by default.
        - minRecordLength : The minimum size of an object that should be stored in this data store. 16KB by default.
        - recLengthCacheSize : In-memory cache size to hold DataRecord#getLength() against DataIdentifier. One item for 140 bytes approximately.
    -->
    <param name="minRecordLength" value="1024"/>
    <param name="recLengthCacheSize" value="10000" />
  </DataStore>

  <!-- SNIP -->

</Repository>

The VFS connectivity is configured in ${catalina.base}/conf/vfs2.properties like the following for instance:

baseFolderUri = sftp://tester:secret@localhost/vfsds

So, in this specific example, the VFSDataStore uses the SFTP backend storage configured in the properties file to store and read binary data.

If you want to see more detailed information, examples and other backend usages such as WebDAV through the VFSBackend, please visit my demo project at https://github.com/woonsanko/hippo-davstore-demo.

Note: Hippo CMS 10.x and 11.0 pull in modules of Apache Jackrabbit 2.10.x at the moment. However, there have not been any significant or incompatible changes in org.apache.jackrabbit:jackrabbit-data and org.apache.jackrabbit:jackrabbit-vfs-ext between Apache Jackrabbit 2.10.x and 2.13.x. Therefore, it seems to be no problem to pull the org.apache.jackrabbit:jackrabbit-vfs-ext:jar:2.13.x dependency into cms/pom.xml as shown above, at the moment. But it would be more ideal to match all the versions of the Apache Jackrabbit modules some day soon.
Update: Note that Hippo CMS 12.x pulls in Apache Jackrabbit 2.14.0+. Therefore, you can simply use ${jackrabbit.version} for the dependencies mentioned in this article.

Configuration for S3DataStore

In case you want to use S3DataStore instead, you need the following dependency:

    <!-- Adding jackrabbit-aws-ext -->
    <dependency>
      <groupId>org.apache.jackrabbit</groupId>
      <artifactId>jackrabbit-aws-ext</artifactId>
      <!-- ${jackrabbit.version} or a specific version like 2.14.0-h2. -->
      <version>${jackrabbit.version}</version>
      <scope>runtime</scope>
      <!--
        Exclude jackrabbit-api and jackrabbit-jcr-commons since those were pulled
        in by Hippo Repository modules.
      -->
      <exclusions>
        <exclusion>
          <groupId>org.apache.jackrabbit</groupId>
          <artifactId>jackrabbit-api</artifactId>
        </exclusion>
        <exclusion>
          <groupId>org.apache.jackrabbit</groupId>
          <artifactId>jackrabbit-jcr-commons</artifactId>
        </exclusion>
      </exclusions>
    </dependency>

    <!-- Consider using the latest AWS Java SDK for latest bug fixes. -->
    <dependency>
      <groupId>com.amazonaws</groupId>
      <artifactId>aws-java-sdk-s3</artifactId>
      <version>1.11.95</version>
    </dependency>

And, we need to configure S3DataStore in conf/repository.xml like the following example (excerpt from https://github.com/apache/jackrabbit/blob/trunk/jackrabbit-aws-ext/src/test/resources/repository_sample.xml):

<Repository>

  <!-- SNIP -->

  <DataStore class="org.apache.jackrabbit.aws.ext.ds.S3DataStore">
    <param name="config" value="${catalina.base}/conf/aws.properties"/>
    <param name="secret" value="123456789"/>
    <param name="minRecordLength" value="16384"/>
    <param name="cacheSize" value="68719476736"/>
    <param name="cachePurgeTrigFactor" value="0.95d"/>
    <param name="cachePurgeResizeFactor" value="0.85d"/>
    <param name="continueOnAsyncUploadFailure" value="false"/>
    <param name="concurrentUploadsThreads" value="10"/>
    <param name="asyncUploadLimit" value="100"/>
    <param name="uploadRetries" value="3"/>
  </DataStore>

  <!-- SNIP -->

</Repository>

The AWS S3 connectivity is configured in ${catalina.base}/conf/aws.properties in the above example.

Please find an example aws.properties in the demo project linked above and adjust the configuration for your environment.

Comparisons with Different DataStores

DbDataStore (the default DataStore used by most Hippo CMS projects) provides a simple clustering capability based on a centralized database, but it can increase the database size, and as a result it can increase maintenance/deployment cost and make it relatively harder to use hot backup/restore if the amount of binary data becomes really huge. Also, because DbDataStore doesn't maintain a local file cache for the "immutable" binary data entries, it is relatively less performant when serving binary data retrieved from JCR. Maybe you can argue that the application is responsible for all the cache control in order not to burden JCR, though.

S3DataStore uses Amazon S3 as backend storage, and VFSDataStore uses a virtual file system provided by the Apache Commons VFS module. They obviously help reduce the database size, so system administrators can save time and cost in maintenance and new deployments with these DataStores. They are internal plugged-in components, as designed by Apache Jackrabbit, so clients can simply use standard JCR APIs to write/read binary data. More importantly, Jackrabbit is able to index binary data such as PDF files internally into its Lucene index, so clients can use standard JCR queries to retrieve data without having to implement custom code depending on specific backend APIs.
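
For example, once a PDF file is stored through JCR, a standard JCR-SQL2 full-text query finds it regardless of which DataStore holds the bytes. A minimal sketch (the search term is arbitrary):

import javax.jcr.Node;
import javax.jcr.NodeIterator;
import javax.jcr.Session;
import javax.jcr.query.Query;
import javax.jcr.query.QueryManager;

public class BinarySearchExample {

    // Finds nt:resource nodes (e.g. the jcr:content of an uploaded PDF)
    // whose extracted full text contains the given term.
    public static void searchBinaries(Session session, String term) throws Exception {
        QueryManager queryManager = session.getWorkspace().getQueryManager();
        Query query = queryManager.createQuery(
                "SELECT * FROM [nt:resource] AS res WHERE CONTAINS(res.*, $term)",
                Query.JCR_SQL2);
        query.bindValue("term", session.getValueFactory().createValue(term));

        for (NodeIterator it = query.execute().getNodes(); it.hasNext(); ) {
            Node node = it.nextNode();
            System.out.println("Found: " + node.getPath());
        }
    }
}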

One of the notable differences between S3DataStore and VFSDataStore is that the former requires cloud-based storage (Amazon S3), which might not be allowed in some highly secured environments, whereas the latter allows various cost-effective backend storages, including SFTP and WebDAV, that can be deployed wherever you want. You can take full advantage of flexible cloud-based storage with S3DataStore, though.

Summary

Apache Jackrabbit VFSDataStore can be a very feasible, cost-effective and secure option in many projects where a huge amount of binary data must be hosted in JCR. VFSDataStore makes it possible to use SFTP, WebDAV, etc. as backend storage at a moderate cost, deployed wherever you want. Also, it allows the use of standard JCR APIs to read and write binary data, so it should save more development effort and time than implementing a custom (UI) plugin communicating directly with a specific backend storage.

Other Materials

I once presented this topic to my colleagues, and I'd like to share the slides with you as well.

Please leave a comment if you have any questions or remarks.

Thursday, May 28, 2015

Hiding Hippo Channel Manager toolbar when unnecessary

WARNING: The solution described in this article is applicable only to Hippo CMS v10.x. Since Hippo CMS rewrote many parts of the Channel Manager using the Angular framework in v11, it is no longer applicable to v11 and later.


In some use cases, content editors don't want to be distracted by the toolbar when editing a page in the Hippo Channel Manager. In such use cases, they're okay with using the Channel Manager just as a simple preview tool for the content being edited.

So, it is not surprising to hear that they want the toolbar to be hidden in a project unless the current user is really a power user like the 'admin' user.
"Yes, that should be easy. I'll look for possible configuration options or ask around about how to hide the toolbar based on the user."
Well, I initially expected that there would be a configuration option somewhere to show the toolbar only to some groups of users. That's why I said so. But unfortunately, there's no such option at the moment (at least up to 7.9).

Actually, someone suggested that I hack around some CSS classes to hide it, but it would be really hard to set CSS classes properly based on the group memberships of the current user. Also, it sounded really hacky and unmaintainable, which I always try to avoid.

After digging in for a while, the following article caught my eye: Add custom button to the template composer toolbar (http://www.onehippo.org/library/development/add-custom-button-to-the-template-composer-toolbar.html).
After reading that article, it took me only minutes to think of adding an invisible toolbar widget that does a small JavaScript trick to hide the whole toolbar. Right? That should be a really easy and maintainable solution!

I followed the guideline described in the article and was able to implement a solution which, by default, hides the whole toolbar unless the user is in the 'admin' group. I even added a plugin configuration option to set which groups are allowed to see the toolbar.

Here's my plugin source:

// cms/src/main/java/com/example/cms/channelmanager/templatecomposer/ToolbarHidingPlugin.java

package com.example.cms.channelmanager.templatecomposer;

import java.text.MessageFormat;
import java.util.Arrays;
import java.util.Collection;
import java.util.HashSet;
import java.util.Set;

import javax.jcr.NodeIterator;
import javax.jcr.RepositoryException;
import javax.jcr.query.Query;
import javax.jcr.query.QueryResult;

import org.apache.commons.collections.CollectionUtils;
import org.apache.commons.lang.ArrayUtils;
import org.apache.commons.lang.StringUtils;
import org.apache.wicket.Component;
import org.apache.wicket.markup.head.IHeaderResponse;
import org.apache.wicket.markup.head.JavaScriptHeaderItem;
import org.apache.wicket.request.resource.JavaScriptResourceReference;
import org.hippoecm.frontend.plugin.IPluginContext;
import org.hippoecm.frontend.plugin.config.IPluginConfig;
import org.hippoecm.frontend.session.UserSession;
import org.json.JSONException;
import org.json.JSONObject;
import org.onehippo.cms7.channelmanager.templatecomposer.ToolbarPlugin;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.wicketstuff.js.ext.util.ExtClass;

/**
 * Invisible Channel Manager Page Editor toolbar widget plugin
 * in order to do some javascript trick like hiding the toolbar
 * based on user's group information.
 * <P>
 * By default, this plugin compares the group names of the current user
 * with the configured {@code groupNamesWithToolbarEnabled} group names.
 * 'admin' group is added to {@code groupNamesWithToolbarEnabled} by default.
 * If there's any common between both, this shows the toolbar.
 * Otherwise, this hides the toolbar.
 * </P>
 * @see http://www.onehippo.org/library/development/add-custom-button-to-the-template-composer-toolbar.html
 */
@ExtClass("Example.ChannelManager.ToolbarHidingPlugin")
public class ToolbarHidingPlugin extends ToolbarPlugin {

    private static Logger log = LoggerFactory.getLogger(ToolbarHidingPlugin.class);

    /**
     * Ext.js plugin JavaScript code.
     */
    private static final JavaScriptResourceReference TOOLBAR_HIDING_PLUGIN_JS =
        new JavaScriptResourceReference(ToolbarHidingPlugin.class, "ToolbarHidingPlugin.js");

    /**
     * JCR query statement to retrieve all the group names of the current user.
     */
    private static final String GROUPS_OF_USER_QUERY =
        "//element(*, hipposys:group)[(@hipposys:members = ''{0}'' or @hipposys:members = ''*'') and @hipposys:securityprovider = ''internal'']";

    /**
     * The names of the groups which the toolbar should be enabled to.
     */
    private Set<String> groupNamesWithToolbarEnabled = new HashSet<String>();

    public ToolbarHidingPlugin(IPluginContext context, IPluginConfig config) {
        super(context, config);

        String param = config.getString("group.names.with.toolbar.enabled", "admin");
        String [] groupNames = StringUtils.split(param, ",");

        if (ArrayUtils.isNotEmpty(groupNames)) {
            groupNamesWithToolbarEnabled.addAll(Arrays.asList(groupNames));
        }
    }

    @Override
    public void renderHead(final Component component, final IHeaderResponse response) {
        super.renderHead(component, response);
        response.render(JavaScriptHeaderItem.forReference(TOOLBAR_HIDING_PLUGIN_JS));
    }

    @Override
    protected JSONObject getProperties() throws JSONException {
        JSONObject properties = super.getProperties();

        if (groupNamesWithToolbarEnabled.contains("*")) {
            properties.put("toolbarEnabled", true);
        } else {
            Set<String> groupNames = getGroupNamesOfCurrentUser();
            Collection intersection = CollectionUtils.intersection(groupNames, groupNamesWithToolbarEnabled);
            properties.put("toolbarEnabled", CollectionUtils.isNotEmpty(intersection));
        }

        return properties;
    }

    private Set<String> getGroupNamesOfCurrentUser() {
        Set<String> groupNames = new HashSet<String>();

        try {
            final String username = UserSession.get().getJcrSession().getUserID();
            String statement = MessageFormat.format(GROUPS_OF_USER_QUERY, username);

            Query q = UserSession.get().getJcrSession().getWorkspace().getQueryManager().createQuery(statement, Query.XPATH);
            QueryResult result = q.execute();
            NodeIterator nodeIt = result.getNodes();
            String groupName;

            while (nodeIt.hasNext()) {
                groupName = nodeIt.nextNode().getName();
                groupNames.add(groupName);
            }
        } catch (RepositoryException e) {
            log.error("Failed to retrieve group names of the current user.", e);
        }

        return groupNames;
    }
}

Basically, the plugin class compares the group memberships of the current user with the configured group names for which the toolbar should be enabled, and it simply sets a flag value in the JSON properties in the #getProperties() method. The JSON properties are passed to the Ext.js class in the end.

Because the Hippo Channel Manager components are mostly implemented in Ext.js as well, I need the following Ext.js class, which reads the flag variable passed from the plugin class and hides or shows the toolbar HTML element.

// cms/src/main/resources/com/example/cms/channelmanager/templatecomposer/ToolbarHidingPlugin.js

Ext.namespace('Example.ChannelManager');

Example.ChannelManager.ToolbarHidingPlugin = Ext.extend(Ext.Container, {
  constructor: function(config) {

    // hide first and show if the current user has a group membership to which it is allowed.
    $('#pageEditorToolbar').hide();
    if (config.toolbarEnabled) {
      $('#pageEditorToolbar').show();
    }

    // show an empty invisible container widget.
    Example.ChannelManager.ToolbarHidingPlugin.superclass.constructor.call(this, Ext.apply(config, {
      width: 0,
      renderTo: Ext.getBody(),
      border: 0
    }));
  }
});

I used a simple jQuery trick to hide/show the toolbar (#pageEditorToolbar):

  • $('#pageEditorToolbar').hide();
  • $('#pageEditorToolbar').show();

Now, I need to bootstrap this custom toolbar plugin into repository like the following:

<?xml version="1.0" encoding="UTF-8"?>

<!-- bootstrap/configuration/src/main/resources/configuration/frontend/hippo-channel-manager/templatecomposer-toolbar-hiding.xml -->

<sv:node sv:name="templatecomposer-toolbar-hiding" xmlns:sv="http://www.jcp.org/jcr/sv/1.0">
  <sv:property sv:name="jcr:primaryType" sv:type="Name">
    <sv:value>frontend:plugin</sv:value>
  </sv:property>
  <sv:property sv:name="plugin.class" sv:type="String">
    <sv:value>com.example.cms.channelmanager.templatecomposer.ToolbarHidingPlugin</sv:value>
  </sv:property>
  <sv:property sv:name="position.edit" sv:type="String">
    <sv:value>first</sv:value>
  </sv:property>
  <sv:property sv:name="position.view" sv:type="String">
    <sv:value>after template-composer-toolbar-pages-button</sv:value>
  </sv:property>
</sv:node>

Of course, the bootstrap XML should be added by a hippo:initializeitem in hippoecm-extension.xml like the following:

<!-- bootstrap/configuration/src/main/resources/hippoecm-extension.xml -->

    <!-- SNIP -->

    <sv:node sv:name="example-hippo-configuration-hippo-frontend-cms-hippo-channel-manager-templatecomposer-toolbar-hiding">
        <sv:property sv:name="jcr:primaryType" sv:type="Name">
            <sv:value>hippo:initializeitem</sv:value>
        </sv:property>
        <sv:property sv:name="hippo:sequence" sv:type="Double">
            <sv:value>30000.3</sv:value>
        </sv:property>
        <sv:property sv:name="hippo:contentresource" sv:type="String">
            <sv:value>configuration/frontend/hippo-channel-manager/templatecomposer-toolbar-hiding.xml</sv:value>
        </sv:property>
        <sv:property sv:name="hippo:contentroot" sv:type="String">
            <sv:value>/hippo:configuration/hippo:frontend/cms/hippo-channel-manager</sv:value>
        </sv:property>
        <sv:property sv:name="hippo:reloadonstartup" sv:type="Boolean">
            <sv:value>true</sv:value>
        </sv:property>
    </sv:node>

    <!-- SNIP -->
All right. That's it! Enjoy taming your Hippo!