Thursday, October 20, 2016

Playing with an Apache Jackrabbit DataStore Migration Tool

A while ago, I posted a blog article (Can't we store huge amount of binary data in JCR?) about why the Apache Jackrabbit VFSDataStore or S3DataStore is useful and how to use it when storing huge amounts of binary data in JCR. However, many JCR systems are already running with a different DataStore (e.g. DbDataStore), so we need to be able to migrate an existing DataStore to VFSDataStore or S3DataStore. That's what I wanted to achieve with a migration tool (https://github.com/woonsan/jackrabbit-datastore-migration).
In this article, I'd like to share my experience of migrating a DbDataStore to a VFSDataStore in a real project with the tool.

The Problem


One of my projects (based on Hippo CMS) uses DbDataStore, which is configured in repository.xml as the default option, like the following:

<DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
  <param name="url" value="java:comp/env/jdbc/repositoryDS" />
  <param name="driver" value="javax.naming.InitialContext" />
  <param name="databaseType" value="mysql" />
  <param name="minRecordLength" value="1024" />
  <param name="maxConnections" value="5" />
  <param name="copyWhenReading" value="true" />
</DataStore>

Ultimately, once the data has been migrated, I want to replace the DbDataStore with a VFSDataStore backed by an SFTP server:

<DataStore class="org.apache.jackrabbit.vfs.ext.ds.VFSDataStore">
  <param name="config" value="${catalina.base}/conf/vfs2-datastore-sftp.properties" />
  <param name="asyncWritePoolSize" value="10" />
  <param name="secret" value="123456"/>
  <param name="minRecordLength" value="1024"/>
  <param name="recLengthCacheSize" value="10000" />
</DataStore>

And, vfs2-datastore-sftp.properties should look like the following:

# SFTP base folder URL
baseFolderUri = sftp://tester:secret@localhost/vfsds
# when the identity file (your private key file) is used instead of password
#fso.sftp.identities = /home/tester/.ssh/id_rsa

So, we need to migrate all the data managed by DbDataStore to the SFTP location before switching to VFSDataStore.
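Before migrating anything, it can help to double-check where the properties file actually points. The following is a minimal Python sketch, not part of the migration tool: `parse_properties` and `describe_base_folder` are hypothetical helper names, and the parser only handles simple `key = value` lines with `#` comments (real Java properties files allow more syntax than this).

```python
from urllib.parse import urlsplit


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blank lines and '#' comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def describe_base_folder(props):
    """Split baseFolderUri into the parts worth checking before a migration."""
    uri = urlsplit(props["baseFolderUri"])
    return {
        "scheme": uri.scheme,
        "host": uri.hostname,
        "user": uri.username,
        "path": uri.path,
        "uses_password": uri.password is not None,
    }


# Same content as the vfs2-datastore-sftp.properties example above.
SAMPLE = """\
# SFTP base folder URL
baseFolderUri = sftp://tester:secret@localhost/vfsds
# when the identity file (your private key file) is used instead of password
#fso.sftp.identities = /home/tester/.ssh/id_rsa
"""

if __name__ == "__main__":
    print(describe_base_folder(parse_properties(SAMPLE)))
```

With the sample above, the commented-out `fso.sftp.identities` line is ignored, so the check reports password-based access to the `/vfsds` folder on `localhost`.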

Data Migration Steps


First of all, we need to download the latest version of the migration tool from https://github.com/woonsan/jackrabbit-datastore-migration/releases.
After uncompressing the downloaded file in a folder, we can build it with `mvn package`, which generates the `jackrabbit-datastore-migration-x.x.x.jar` file under the `target` folder.

Second, we need to configure the "source" DataStore and "target" DataStore in a YAML file like the following example (e.g. config/migration-db-to-vfs.yaml):

logging:
    level:
        root: 'WARN'
        com.github.woonsan.jackrabbit.migration.datastore: 'INFO'

batch:
    minWorkers: '10'
    maxWorkers: '10'

source:
    dataStore:
        homeDir: 'target/storage-db'
        className: 'org.apache.jackrabbit.core.data.db.DbDataStore'
        params:
            url: 'jdbc:mysql://localhost:3306/hippodb?autoReconnect=true&characterEncoding=utf8'
            user: 'hippo'
            password: 'hippo'
            driver: 'com.mysql.jdbc.Driver'
            databaseType: 'mysql'
            minRecordLength: '1024'
            maxConnections: '10'
            copyWhenReading: 'true'
            tablePrefix: ''
            schemaObjectPrefix: ''
            schemaCheckEnabled: 'false'

target:
    dataStore:
        homeDir: 'target/storage-vfs'
        className: 'org.apache.jackrabbit.vfs.ext.ds.VFSDataStore'
        params:
            asyncUploadLimit: '0'
            baseFolderUri: 'sftp://tester:secret@localhost/vfsds'
            minRecordLength: '1024'

As you can see, the "source" DataStore is configured with DbDataStore backed by a MySQL database, and the "target" DataStore is configured with VFSDataStore backed by an SFTP location.
Note that each DataStore is configured in the same style as in repository.xml: className corresponds to the class attribute of the <DataStore> element, and each entry under params corresponds to a <param> element.
In addition, the YAML configuration contains settings for logging and thread pool worker counts, since logging and multi-threaded workers are important in this kind of batch application.
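To make the role of the worker counts more concrete, here is a conceptual Python sketch of a batch copy between two stores using a thread pool. It is not the tool's implementation (the tool is a Java Spring Boot application working against the Jackrabbit DataStore API). `InMemoryDataStore` and `migrate` are hypothetical names, and the toy store assumes records are identified by the SHA-1 digest of their content, which is what 40-character hexadecimal record IDs suggest.

```python
import hashlib
import threading
from concurrent.futures import ThreadPoolExecutor


class InMemoryDataStore:
    """Toy stand-in for a DataStore: maps a content digest to the stored bytes."""

    def __init__(self):
        self.records = {}
        self._lock = threading.Lock()

    def add_record(self, data):
        # Assumption: the record ID is the SHA-1 hex digest of the content.
        identifier = hashlib.sha1(data).hexdigest()
        with self._lock:
            self.records[identifier] = data
        return identifier

    def get_all_identifiers(self):
        return sorted(self.records)

    def get_record(self, identifier):
        return self.records[identifier]


def migrate(source, target, workers=10):
    """Copy every record from source to target; return one result row per record."""

    def copy_one(identifier):
        row = {"id": identifier, "read": False, "write": False, "size": 0, "error": ""}
        try:
            data = source.get_record(identifier)
            row["read"] = True
            row["size"] = len(data)
            target.add_record(data)
            row["write"] = True
        except Exception as exc:
            # Record the failure and keep going instead of aborting the batch.
            row["error"] = str(exc)
        return row

    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the rows in the same order as the identifiers.
        return list(pool.map(copy_one, source.get_all_identifiers()))


if __name__ == "__main__":
    source = InMemoryDataStore()
    for payload in (b"alpha", b"beta", b"gamma"):
        source.add_record(payload)
    target = InMemoryDataStore()
    for row in migrate(source, target, workers=3):
        print(row)
```

Each record is handled independently, so a failing read or write only marks that one row as failed. That per-record bookkeeping is the kind of information a batch job needs in order to print a summary at the end.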

Now, it's time to execute the migration tool.
Assuming you have the JDBC driver jar file in the lib/ directory (e.g. lib/mysql-connector-java-5.1.38.jar), you can execute the tool as follows:

$ java -Dloader.path="lib/" \
       -jar target/jackrabbit-datastore-migration-0.0.1-SNAPSHOT.jar \
       --spring.config.location=config/migration-db-to-vfs.yaml

Or, if you know the specific location of the JDBC driver jar file, you can run it like this instead:

java -Dloader.path=/home/tester/.m2/repository/mysql/mysql-connector-java/5.1.38/ \
     -jar target/jackrabbit-datastore-migration-0.0.1-SNAPSHOT.jar \
     --spring.config.location=config/migration-db-to-vfs.yaml

If your configuration is correct and the tool runs properly, you will see result logs like the following:

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v1.4.0.RELEASE)

...
2016-10-17 23:14:44.785  INFO 5071 --- [           main] .w.j.m.d.b.MigrationJobExecutionReporter :
===============================================================================================================
Execution Summary:
---------------------------------------------------------------------------------------------------------------
Total: 22383, Processed: 22383, Read Success: 22383, Read Fail: 0, Write Success: 22383, Write Fail: 0, Duration: 1887607ms
---------------------------------------------------------------------------------------------------------------
Details (in CSV format):
---------------------------------------------------------------------------------------------------------------
SEQ,ID,READ,WRITE,SIZE,ERROR
1,000082f676bf6ed3a39debd6b656287efa6687b6,true,true,54395,
2,00030e05db40739611fcd06d1af26cc7a6afd5b0,true,true,626789,
3,00076b2f5e4e43245928accbcc90fcf738121652,true,true,4097,
...
22382,fff8b902c59f0c310ff53952d86b17d383805355,true,true,258272,
22383,fffc167ef45efdc9b3bea7ed3953fea8ccdb294f,true,true,518903,
===============================================================================================================

2016-10-17 23:14:44.820  INFO 5071 --- [           main] c.g.w.j.migration.datastore.Application  : Started Application in 1892.767 seconds (JVM running for 1893.449)

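Spring Boot produces nicely formatted logging by default. You can also change the logging configuration; please see the Spring Boot documentation for that.
After the regular logging lines, the tool prints an execution summary followed by per-record details in CSV format: the record sequence number, the record ID, the read and write status, the size in bytes, and any error information. In this run, all 22383 records were read and written successfully in 1887607 ms, which is about 31 minutes.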
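Since the details are plain CSV, they are easy to post-process, for example to list the records that need a retry. The following is a minimal Python sketch under stated assumptions: the CSV block has been copied out of the log into a string, `parse_details`, `failed` and `throughput` are hypothetical helper names, and the `SIZE` column is treated as 0 when empty, because the log above shows no failed row to confirm how failures are printed.

```python
import csv
import io


def parse_details(csv_text):
    """Parse the 'Details (in CSV format)' rows into dictionaries."""
    rows = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Skip anything that is not a data row, such as an elided '...' line.
        if not (row["SEQ"] or "").isdigit():
            continue
        rows.append({
            "seq": int(row["SEQ"]),
            "id": row["ID"],
            "read": row["READ"] == "true",
            "write": row["WRITE"] == "true",
            "size": int(row["SIZE"] or 0),  # assumption: empty size means 0
            "error": row["ERROR"] or "",
        })
    return rows


def failed(rows):
    """Return the rows whose read or write did not succeed."""
    return [r for r in rows if not (r["read"] and r["write"])]


def throughput(total_records, duration_ms):
    """Average number of records processed per second."""
    return total_records / (duration_ms / 1000.0)


# Rows copied from the log output above, including the elided '...' line.
SAMPLE = """\
SEQ,ID,READ,WRITE,SIZE,ERROR
1,000082f676bf6ed3a39debd6b656287efa6687b6,true,true,54395,
2,00030e05db40739611fcd06d1af26cc7a6afd5b0,true,true,626789,
3,00076b2f5e4e43245928accbcc90fcf738121652,true,true,4097,
...
22382,fff8b902c59f0c310ff53952d86b17d383805355,true,true,258272,
22383,fffc167ef45efdc9b3bea7ed3953fea8ccdb294f,true,true,518903,
"""

if __name__ == "__main__":
    rows = parse_details(SAMPLE)
    print("parsed rows:", len(rows))
    print("failed rows:", failed(rows))
    print("records/sec: %.2f" % throughput(22383, 1887607))
```

For the summary line above (22383 records in 1887607 ms), the average works out to roughly 12 records per second.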

Switching to VFSDataStore and Restart


Once all the binary data has been migrated from DbDataStore to VFSDataStore (i.e. to the SFTP location), we can switch to VFSDataStore by replacing the old <DataStore> element in repository.xml with the following:

<DataStore class="org.apache.jackrabbit.vfs.ext.ds.VFSDataStore">
  <param name="config" value="${catalina.base}/conf/vfs2-datastore-sftp.properties" />
  <param name="asyncWritePoolSize" value="10" />
  <param name="secret" value="123456"/>
  <param name="minRecordLength" value="1024"/>
  <param name="recLengthCacheSize" value="10000" />
</DataStore>

Restart the server, and now the binary data will be served from the SFTP server through the new VFSDataStore component!