Thursday, October 20, 2016

Playing with an Apache Jackrabbit DataStore Migration Tool

A while ago, I posted a blog article (Can't we store huge amount of binary data in JCR?) about why Apache Jackrabbit VFSDataStore or S3DataStore is useful and how to use it when storing huge amounts of binary data in JCR. But we already have many running JCR systems with different DataStores (e.g. DbDataStore), so we need to be able to migrate an existing DataStore to VFSDataStore or S3DataStore. That's what I wrote a migration tool for (https://github.com/woonsan/jackrabbit-datastore-migration).
In this article, I'd like to share my experience of using the tool to migrate a DbDataStore to a VFSDataStore in a real project.

The Problem


One of my projects (based on Hippo CMS) uses DbDataStore, which is configured in repository.xml as the default option like the following:

<DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
  <param name="url" value="java:comp/env/jdbc/repositoryDS" />
  <param name="driver" value="javax.naming.InitialContext" />
  <param name="databaseType" value="mysql" />
  <param name="minRecordLength" value="1024" />
  <param name="maxConnections" value="5" />
  <param name="copyWhenReading" value="true" />
</DataStore>

Basically, after the data migration, I want to replace the DbDataStore with a VFSDataStore backed by an SFTP server:

<DataStore class="org.apache.jackrabbit.vfs.ext.ds.VFSDataStore">
  <param name="config" value="${catalina.base}/conf/vfs2-datastore-sftp.properties" />
  <param name="asyncWritePoolSize" value="10" />
  <param name="secret" value="123456"/>
  <param name="minRecordLength" value="1024"/>
  <param name="recLengthCacheSize" value="10000" />
</DataStore>

And vfs2-datastore-sftp.properties should look like the following:

# SFTP base folder URL
baseFolderUri = sftp://tester:secret@localhost/vfsds
# when the identity file (your private key file) is used instead of password
#fso.sftp.identities = /home/tester/.ssh/id_rsa
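
By the way, before wiring this into the repository, it may be worth sanity-checking that the SFTP URI is resolvable. Here is a minimal sketch using plain Commons VFS2 code (SftpUriCheck is just a hypothetical test class, not part of the tool; it assumes commons-vfs2 and jsch are on the classpath):

import org.apache.commons.vfs2.FileObject;
import org.apache.commons.vfs2.FileSystemManager;
import org.apache.commons.vfs2.VFS;

public class SftpUriCheck {

    public static void main(String[] args) throws Exception {
        FileSystemManager fsManager = VFS.getManager();
        // the same base folder URI as configured in vfs2-datastore-sftp.properties
        FileObject baseFolder = fsManager.resolveFile("sftp://tester:secret@localhost/vfsds");
        System.out.println(baseFolder.getName() + " exists: " + baseFolder.exists());
    }
}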

So, we need to migrate all the data managed by the DbDataStore to the SFTP location before switching to the VFSDataStore.

Data Migration Steps


First of all, we need to download the latest version of the migration tool from https://github.com/woonsan/jackrabbit-datastore-migration/releases.
After uncompressing the downloaded file in a folder, we can build it with `mvn package`, which generates the `jackrabbit-datastore-migration-x.x.x.jar` file under the `target` folder.

Second, we need to configure the "source" DataStore and the "target" DataStore in a YAML file (e.g. config/migration-db-to-vfs.yaml) like the following example:

logging:
    level:
        root: 'WARN'
        com.github.woonsan.jackrabbit.migration.datastore: 'INFO'

batch:
    minWorkers: '10'
    maxWorkers: '10'

source:
    dataStore:
        homeDir: 'target/storage-db'
        className: 'org.apache.jackrabbit.core.data.db.DbDataStore'
        params:
            url: 'jdbc:mysql://localhost:3306/hippodb?autoReconnect=true&characterEncoding=utf8'
            user: 'hippo'
            password: 'hippo'
            driver: 'com.mysql.jdbc.Driver'
            databaseType: 'mysql'
            minRecordLength: '1024'
            maxConnections: '10'
            copyWhenReading: 'true'
            tablePrefix: ''
            schemaObjectPrefix: ''
            schemaCheckEnabled: 'false'

target:
    dataStore:
        homeDir: 'target/storage-vfs'
        className: 'org.apache.jackrabbit.vfs.ext.ds.VFSDataStore'
        params:
            asyncUploadLimit: '0'
            baseFolderUri: 'sftp://tester:secret@localhost/vfsds'
            minRecordLength: '1024'

As you can see, the "source" DataStore is configured with a DbDataStore backed by a MySQL database, and the "target" DataStore is configured with a VFSDataStore backed by an SFTP location.
Note that the configuration for each DataStore is equivalent to how it would be set in repository.xml, as you can see by comparing the two configurations.
In addition, the YAML configuration also covers logging and thread pool worker counts, since logging and multi-threaded workers are important in this kind of batch application.
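
Under the hood, the tool basically creates the two DataStore instances from this configuration, iterates over all the records in the source, and adds each record's stream to the target. Just to illustrate the idea, the core copy loop with the Jackrabbit DataStore API would look something like the following. This is a minimal sketch, not the tool's actual code; DataStoreCopy is a hypothetical class, and both DataStore instances are assumed to have been created and initialized already:

import java.io.InputStream;
import java.util.Iterator;

import org.apache.jackrabbit.core.data.DataIdentifier;
import org.apache.jackrabbit.core.data.DataRecord;
import org.apache.jackrabbit.core.data.DataStore;

public class DataStoreCopy {

    // copies every record from the source DataStore to the target DataStore
    public static void copyAll(DataStore source, DataStore target) throws Exception {
        for (Iterator<DataIdentifier> it = source.getAllIdentifiers(); it.hasNext(); ) {
            DataRecord record = source.getRecord(it.next());
            InputStream input = record.getStream();
            try {
                // DataStores are content-addressed, so adding the same content
                // produces the same identifier in the target store
                target.addRecord(input);
            } finally {
                input.close();
            }
        }
    }
}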

Now, it's time to execute the migration tool.
Assuming you have the JDBC driver jar file in the lib/ directory (e.g. lib/mysql-connector-java-5.1.38.jar), you can execute the tool like the following:

$ java -Dloader.path="lib/" \
       -jar target/jackrabbit-datastore-migration-0.0.1-SNAPSHOT.jar \
       --spring.config.location=config/migration-db-to-vfs.yaml

Or, if you know a specific location where the JDBC driver jar file already exists, you can run it like this instead:

$ java -Dloader.path=/home/tester/.m2/repository/mysql/mysql-connector-java/5.1.38/ \
       -jar target/jackrabbit-datastore-migration-0.0.1-SNAPSHOT.jar \
       --spring.config.location=config/migration-db-to-vfs.yaml

If your configuration is okay and the tool runs properly, you will see result logs like the following:

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v1.4.0.RELEASE)

...
2016-10-17 23:14:44.785  INFO 5071 --- [           main] .w.j.m.d.b.MigrationJobExecutionReporter :
===============================================================================================================
Execution Summary:
---------------------------------------------------------------------------------------------------------------
Total: 22383, Processed: 22383, Read Success: 22383, Read Fail: 0, Write Success: 22383, Write Fail: 0, Duration: 1887607ms
---------------------------------------------------------------------------------------------------------------
Details (in CSV format):
---------------------------------------------------------------------------------------------------------------
SEQ,ID,READ,WRITE,SIZE,ERROR
1,000082f676bf6ed3a39debd6b656287efa6687b6,true,true,54395,
2,00030e05db40739611fcd06d1af26cc7a6afd5b0,true,true,626789,
3,00076b2f5e4e43245928accbcc90fcf738121652,true,true,4097,
...
22382,fff8b902c59f0c310ff53952d86b17d383805355,true,true,258272,
22383,fffc167ef45efdc9b3bea7ed3953fea8ccdb294f,true,true,518903,
===============================================================================================================

2016-10-17 23:14:44.820  INFO 5071 --- [           main] c.g.w.j.migration.datastore.Application  : Started Application in 1892.767 seconds (JVM running for 1893.449)

Spring Boot generates the logging very nicely by default. You can also change the logging configuration; see the Spring Boot documentation for that.
Anyway, at the end of the execution log, it shows the result in CSV format, including the record sequence number, read/write status, byte size, and error information for each record.

Switching to VFSDataStore and Restart


Once all the binary data is migrated from the DbDataStore to the VFSDataStore (i.e., to the SFTP location), we can switch to the VFSDataStore by replacing the old <DataStore> element in repository.xml with this:

<DataStore class="org.apache.jackrabbit.vfs.ext.ds.VFSDataStore">
  <param name="config" value="${catalina.base}/conf/vfs2-datastore-sftp.properties" />
  <param name="asyncWritePoolSize" value="10" />
  <param name="secret" value="123456"/>
  <param name="minRecordLength" value="1024"/>
  <param name="recLengthCacheSize" value="10000" />
</DataStore>

Restart the server, and the binary data will now be served from the SFTP server through the new VFSDataStore component!



7 comments:

  1. Hi Woonsan, is the migration tool also going to work for migrating data from FileDataStore to DBDataStore?

    1. It should work. Basically, the tool creates two DataStore (interface) instances - one as the source and the other as the target - based on the configuration, and copies data from the source to the target. So if you configure the FileDataStore for either, it should work.

  2. Hi,
    I'm currently using your migration tool, which is really great and was pretty easy to use. But I think it does not delete the migrated records. To me it seems like there is no easy way of doing this, since the DataStore API does not allow it. And since I can't map the record IDs to UUIDs in the JCR (are they equal?), I see no way of deleting the migrated records, though it generally should be doable, since the JCR console allows deletion of nodes and properties. So if I can't delete the large records from the old DataStore, I have a great amount of useless data in it. Any thoughts on how I could do this? It would be really cool, because in most cases such a migration to a new DataStore only takes place when the old DataStore has grown relatively large. And when I cannot reduce that size, the migration is halfway to uselessness :(

    Thanks,
    Jan

    1. Hi Jan,

      Not 100% sure if I understood the problem correctly, but the existing Jackrabbit DataStore Garbage Collector API [1] might be helpful. Basically, what the DataStore GC API does is remove all binary items from the DataStore that are no longer referenced by the PersistenceManager, i.e., by JCR node properties.
      So, in case the JCR binary properties (those stored in the DataStore because they exceed the minRecordLength threshold) have been removed after migrating to somewhere else, you can use the DataStore GC API to clean up.
      As far as I know, most CMS vendors using Jackrabbit 2 do not enable the DataStore GC API in any way (yet); I wonder why they do not expose it in a management UI.

      Regards,

      Woonsan

      [1] https://jackrabbit.apache.org/archive/wiki/JCR/DataStore_115513387.html#DataStore-DataStoreGarbageCollection
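
      For reference, a minimal sketch of invoking the GC through that API could look like the following (runDataStoreGC is just a hypothetical helper method; it assumes the session can be cast to Jackrabbit's SessionImpl, and the credentials are only an example):

      import javax.jcr.Repository;
      import javax.jcr.SimpleCredentials;
      import org.apache.jackrabbit.api.management.DataStoreGarbageCollector;
      import org.apache.jackrabbit.core.SessionImpl;

      public static void runDataStoreGC(Repository repository) throws Exception {
          // example credentials; use an account allowed to scan all workspaces
          SessionImpl session = (SessionImpl) repository.login(
                  new SimpleCredentials("admin", "admin".toCharArray()));
          try {
              DataStoreGarbageCollector gc = session.createDataStoreGarbageCollector();
              try {
                  gc.mark();   // scan and mark every data record still referenced by node properties
                  gc.sweep();  // delete the data records that were not marked
              } finally {
                  gc.close();
              }
          } finally {
              session.logout();
          }
      }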

    2. Regarding "can't map the record IDs to UUIDs in the JCR (are they equal?)": they are not equal. A JCR item (a JCR binary property) may store a data record ID; you can find examples for both the Database and FileSystem cases on page 8 of [2].
      So, suppose you migrated a DbDataStore to a FileDataStore and the repository configuration no longer uses the DbDataStore. Then you can just delete the DataStore table from the database, as it's not used by the repository any more.

      [2] https://www.slideshare.net/woonsan/hidden-gems-in-apache-jackrabbit-and-bloomreach-forge

    3. Thank you :) I will definitely have a look at it. For the time being, I managed to free at least a big part of the DB's disk space by simply deleting the "DATASTORE" table in the source DB, even though I did change the minimum record size for the target data store. We'll have another test run soon, and then I'll try the garbage collector.

    4. Hi,
      unfortunately I was unable to run the GC. In contrast to the migration tool, it needs a full repository.xml, but the one from brXM does not work directly. So I guess it can be done, but for me it took too long, so I still stick to just deleting the DATASTORE table in the DB. Seems to work fine. To be extra sure, I did not change the minimal file size from the old datastore to the new one.
