Woonsan on Open Source Software

Saturday, May 25, 2019

Apache Tomcat 9 Translation

1. Translation Processing Time

For people who have not learned and used English extensively, messages are easier and faster to understand in their own language. The Apache Tomcat project, for example, has the following message in English (ref: https://github.com/apache/tomcat/blob/master/java/org/apache/catalina/core/LocalStrings.properties#L20):

applicationContext.addListener.iae.sclNotAllowed=Once the first ServletContextListener has been called, no more ServletContextListeners may be added.

The message means, as you eventually figure out after translating it in your head, that once any ServletContextListener has been invoked, you can no longer add a new ServletContextListener. It can be translated into Korean like this (ref: https://github.com/apache/tomcat/blob/master/java/org/apache/catalina/core/LocalStrings_ko.properties#L20):

applicationContext.addListener.iae.sclNotAllowed=첫번째 ServletContextListener가 호출되고 나면, 더 이상 ServletContextListener들을 추가할 수 없습니다.

Basically, English messages cost me translation processing time in my brain, because my cognitive system developed with my mother tongue, Korean. It probably takes less time for some people, but it takes more time for me. I have to spend more time to get the meaning than native or native-like English speakers do. And it does not happen just once or twice; it keeps occurring again and again, and the small gaps accumulate into hours, days or even months. Message translations in software projects may help avoid or reduce those gaps.
Of course, the translation should be correct. A long time ago, the books published by the "ㅅ" publishing company in South Korea were very difficult to understand even though they had been translated into Korean. Perhaps the translators had enough English skills but had no experience in software development, or had never asked proficient engineers to review the translations before publication. I don't think it happened only in the IT field. Whether they were about economics or statistics, some books translated into Korean were hard to understand. Some things were out of context, used terminology that was never used in real practice, coined new words from weird combinations of Chinese characters, or read unnaturally because of passive voice carried over from overly literal translation. So some people tried to read the original books in English instead, and others, including myself, just had to rush in head first.
One thing is clear to me: once messages are translated correctly, it saves a lot of the translation time that many people would otherwise have to spend. The more popular the software, the more valuable a correct translation is to people.

2. Apache Tomcat Translation with Korean examples

Since Apache Tomcat 9.0.15, almost every English message has been translated into Korean. If you set the default language of the JVM to Korean (`CATALINA_OPTS="-Duser.country=KR -Duser.language=ko"`) as in the following example, you can see all the internal information, warning and error messages in Korean. I ran Apache Tomcat simply with `bin/catalina.sh run` below.

$ export CATALINA_OPTS="-Duser.country=KR -Duser.language=ko"

$ bin/catalina.sh run

Using CATALINA_BASE: /Users/tester/tomcat

Using CATALINA_HOME: /Users/tester/tomcat

Using CATALINA_TMPDIR: /Users/tester/tomcat/temp

...

24-Apr-2019 23:51:08.477 정보 [main] org.apache.catalina.startup.VersionLoggerListener.log 서버 버전 이름: Apache Tomcat/9.0.18-dev

24-Apr-2019 23:51:08.481 정보 [main] org.apache.catalina.startup.VersionLoggerListener.log Server 빌드 시각: Apr 20 2019 19:48:52 UTC

24-Apr-2019 23:51:08.481 정보 [main] org.apache.catalina.startup.VersionLoggerListener.log Server 버전 번호: 9.0.18.0

24-Apr-2019 23:51:08.481 정보 [main] org.apache.catalina.startup.VersionLoggerListener.log 운영체제 이름: Mac OS X

24-Apr-2019 23:51:08.481 정보 [main] org.apache.catalina.startup.VersionLoggerListener.log 운영체제 버전: 10.14.4

24-Apr-2019 23:51:08.481 정보 [main] org.apache.catalina.startup.VersionLoggerListener.log 아키텍처: x86_64

24-Apr-2019 23:51:08.481 정보 [main] org.apache.catalina.startup.VersionLoggerListener.log 자바 홈: /Library/Java/JavaVirtualMachines/jdk1.8.0_144.jdk/Contents/Home/jre

24-Apr-2019 23:51:08.481 정보 [main] org.apache.catalina.startup.VersionLoggerListener.log JVM 버전: 1.8.0_144-b01

24-Apr-2019 23:51:08.481 정보 [main] org.apache.catalina.startup.VersionLoggerListener.log JVM 벤더: Oracle Corporation

24-Apr-2019 23:51:08.481 정보 [main] org.apache.catalina.startup.VersionLoggerListener.log CATALINA_BASE: /Users/tester/tomcat

24-Apr-2019 23:51:08.481 정보 [main] org.apache.catalina.startup.VersionLoggerListener.log CATALINA_HOME: /Users/tester/tomcat

...

24-Apr-2019 23:51:08.488 정보 [main] org.apache.catalina.core.AprLifecycleListener.lifecycleEvent 프로덕션 환경들에서 최적의 성능을 제공하는, APR 기반 Apache Tomcat Native 라이브러리가, 다음 java.library.path에서 발견되지 않습니다: [/Users/tester/Library/Java/Extensions:/Library/Java/Extensions:/Network/Library/Java/Extensions:/System/Library/Java/Extensions:/usr/lib/java:.]

24-Apr-2019 23:51:08.749 정보 [main] org.apache.coyote.AbstractProtocol.init 프로토콜 핸들러 ["http-nio-8080"]을(를) 초기화합니다.

24-Apr-2019 23:51:08.775 정보 [main] org.apache.coyote.AbstractProtocol.init 프로토콜 핸들러 ["ajp-nio-8009"]을(를) 초기화합니다.

24-Apr-2019 23:51:08.778 정보 [main] org.apache.catalina.startup.Catalina.load [526] 밀리초 내에 서버가 초기화되었습니다.

24-Apr-2019 23:51:08.806 정보 [main] org.apache.catalina.core.StandardService.startInternal 서비스 [Catalina]을(를) 시작합니다.

24-Apr-2019 23:51:08.807 정보 [main] org.apache.catalina.core.StandardEngine.startInternal 서버 엔진을 시작합니다: [Apache Tomcat/9.0.18-dev]

24-Apr-2019 23:51:08.814 정보 [main] org.apache.catalina.startup.HostConfig.deployDirectory 웹 애플리케이션 디렉토리 [/Users/tester/tomcat/webapps/docs]을(를) 배치합니다.

24-Apr-2019 23:51:09.052 정보 [main] org.apache.catalina.startup.HostConfig.deployDirectory 웹 애플리케이션 디렉토리 [/Users/tester/tomcat/webapps/docs]에 대한 배치가 [237] 밀리초에 완료되었습니다.

24-Apr-2019 23:51:09.055 정보 [main] org.apache.catalina.startup.HostConfig.deployDirectory 웹 애플리케이션 디렉토리 [/Users/tester/tomcat/webapps/manager]을(를) 배치합니다.

24-Apr-2019 23:51:09.117 정보 [main] org.apache.catalina.startup.HostConfig.deployDirectory 웹 애플리케이션 디렉토리 [/Users/tester/tomcat/webapps/manager]에 대한 배치가 [62] 밀리초에 완료되었습니다.

24-Apr-2019 23:51:09.117 정보 [main] org.apache.catalina.startup.HostConfig.deployDirectory 웹 애플리케이션 디렉토리 [/Users/tester/tomcat/webapps/examples]을(를) 배치합니다.

24-Apr-2019 23:51:09.911 정보 [main] org.apache.catalina.startup.HostConfig.deployDirectory 웹 애플리케이션 디렉토리 [/Users/tester/tomcat/webapps/examples]에 대한 배치가 [793] 밀리초에 완료되었습니다.

24-Apr-2019 23:51:09.911 정보 [main] org.apache.catalina.startup.HostConfig.deployDirectory 웹 애플리케이션 디렉토리 [/Users/tester/tomcat/webapps/ROOT]을(를) 배치합니다.

24-Apr-2019 23:51:09.969 정보 [main] org.apache.catalina.startup.HostConfig.deployDirectory 웹 애플리케이션 디렉토리 [/Users/tester/tomcat/webapps/ROOT]에 대한 배치가 [57] 밀리초에 완료되었습니다.

24-Apr-2019 23:51:09.969 정보 [main] org.apache.catalina.startup.HostConfig.deployDirectory 웹 애플리케이션 디렉토리 [/Users/tester/tomcat/webapps/host-manager]을(를) 배치합니다.

24-Apr-2019 23:51:10.019 정보 [main] org.apache.catalina.startup.HostConfig.deployDirectory 웹 애플리케이션 디렉토리 [/Users/tester/tomcat/webapps/host-manager]에 대한 배치가 [50] 밀리초에 완료되었습니다.

24-Apr-2019 23:51:10.022 정보 [main] org.apache.coyote.AbstractProtocol.start 프로토콜 핸들러 ["http-nio-8080"]을(를) 시작합니다.

24-Apr-2019 23:51:10.029 정보 [main] org.apache.coyote.AbstractProtocol.start 프로토콜 핸들러 ["ajp-nio-8009"]을(를) 시작합니다.
24-Apr-2019 23:51:10.031 정보 [main] org.apache.catalina.startup.Catalina.start 서버가 [1,252] 밀리초 내에 시작되었습니다.

Almost every message is now served in Korean: "... 밀리초 내에 서버가 초기화되었습니다" (meaning "the server was initialized in ... ms"), "웹 애플리케이션 디렉토리" (meaning "Web Application Directory"), "배치가 ... 완료되었습니다" (meaning "Deployment ... completed"), etc.
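
The `export` shown earlier only lasts for the current shell session. As a rough sketch of one way to make the locale setting persistent: `catalina.sh` sources `bin/setenv.sh` under CATALINA_BASE when that file exists, so the option can be written there. The helper name `write_locale_setenv` is only my illustration, not part of Tomcat, and note that it overwrites any existing `setenv.sh`.

```shell
#!/bin/sh
# Hypothetical helper (not part of Tomcat): write a setenv.sh that selects
# the JVM default locale for Tomcat.
# Arguments: <catalina_base> <language> <country>
# Caution: this overwrites an existing bin/setenv.sh.
write_locale_setenv() {
  base="$1"; lang="$2"; country="$3"
  mkdir -p "$base/bin"
  # The single quotes keep $CATALINA_OPTS literal, so it is expanded only
  # when catalina.sh sources the generated file at startup.
  printf 'CATALINA_OPTS="$CATALINA_OPTS -Duser.language=%s -Duser.country=%s"\n' \
    "$lang" "$country" > "$base/bin/setenv.sh"
}

# Example (not executed here):
#   write_locale_setenv /Users/tester/tomcat ko KR
```

Switching back to English is then a matter of calling the helper again with `en US`, or simply deleting the generated file.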
The screenshot below was taken from the HelloWorldExample servlet page in the default examples web application. To get there, visit http://localhost:8080/, click the "Examples" menu at the top, then the "Servlet Examples" link, and finally the "Hello World" example link.

The HelloWorld servlet example (/examples/servlets/servlet/HelloWorldExample)

The Request Info servlet example ("RequestInfoExample") is served in Korean, too:

The RequestInfo servlet example (/examples/servlets/servlet/RequestInfoExample)

When you stop Apache Tomcat by pressing Control-C in the command line console, the messages about the shutdown process are served in Korean, too:

^C
25-Apr-2019 00:08:47.580 정보 [Thread-5] org.apache.coyote.AbstractProtocol.pause 프로토콜 핸들러 ["http-nio-8080"]을(를) 일시 정지 중

25-Apr-2019 00:08:47.589 정보 [Thread-5] org.apache.coyote.AbstractProtocol.pause 프로토콜 핸들러 ["ajp-nio-8009"]을(를) 일시 정지 중

25-Apr-2019 00:08:47.595 정보 [Thread-5] org.apache.catalina.core.StandardService.stopInternal 서비스 [Catalina]을(를) 중지시킵니다.

25-Apr-2019 00:08:47.615 정보 [Thread-5] org.apache.coyote.AbstractProtocol.stop 프로토콜 핸들러 ["http-nio-8080"]을(를) 중지시킵니다.

25-Apr-2019 00:08:47.618 정보 [Thread-5] org.apache.coyote.AbstractProtocol.stop 프로토콜 핸들러 ["ajp-nio-8009"]을(를) 중지시킵니다.

25-Apr-2019 00:08:47.619 정보 [Thread-5] org.apache.coyote.AbstractProtocol.destroy 프로토콜 핸들러 ["http-nio-8080"]을(를) 소멸시킵니다.

25-Apr-2019 00:08:47.620 정보 [Thread-5] org.apache.coyote.AbstractProtocol.destroy 프로토콜 핸들러 ["ajp-nio-8009"]을(를) 소멸시킵니다.
$

3. How Was It Started?

As you may know, The Apache Software Foundation (https://apache.org) has helped and nurtured great open source software projects and communities based on voluntary contributions. People get involved in a community through the mailing lists of the project they are interested in. They ask questions or try to answer them to help other people; those who are interested in testing, development or documentation also discuss on the mailing lists how to improve the software and the process, and report bugs through the bug tracking systems. The community invites people to become committers once they have made a substantial amount of contributions in various forms such as reporting bugs, helping others on the mailing lists, providing patches, helping with documentation, and so on. Committers make changes in the source. Furthermore, committers who share the vision of the community may become members of the Project Management Committee (PMC) and participate in the decision-making process for the project on behalf of the community. This governance model is known as The Apache Way. See https://www.apache.org/theapacheway/index.html for more detail.
Anyway, the Apache Tomcat translation initiative started from this community culture of voluntary contributions from individuals. On Nov. 12, 2018, Mark Thomas, a long-time Apache Tomcat committer and PMC member who also contributes a lot to the Apache Software Foundation, posted the following message to the users mailing list (ref: https://lists.apache.org/thread.html/d53034694855fcc346e660fb688ddb7886574e0168d6eca70e4ece37@%3Cusers.tomcat.apache.org%3E). Long story short: many people had hit the same fundamental problem, that it is very hard to find which resource files to patch unless you are an expert in the Apache Tomcat project. To solve it, the Apache Tomcat PMC set up a POEditor project (see the screenshot below) to encourage more people to contribute translations collectively, hoping to ship the contributed resources in Apache Tomcat 9 releases.

From: Mark Thomas
Subject: Translation help wanted
Date: 2018/11/12 11:49:51
List: users@tomcat.apache.org

All,

Apache Tomcat includes some translations for error messages and parts of
the user interface - primarily the Manager web application. We would
like to improve the coverage and quality of these translations.

Accordingly, the Tomcat project has been set up on POEditor, a web-based
service for managing the translation of resource files.

The aim is that anyone who wants to contribute to the translations (it
could be anything from fixing a typo in an existing translation to
adding support for a new language) can create an account and contribute.

If you would like to contribute in this way then the
The Tomcat project can be found here:
https://poeditor.com/join/project/NUTIjDWzrl

Anyone should be able to join up as a contributor. If you are
interested, please sign up and start contributing.

Note: All contributions will be taken as being made under the terms of
the Apache License version 2.

I'm aiming to export the translations on a regular basis to the Tomcat
source code. How regularly will depend on the rate of new/updated
translations but as a minimum, I'm aiming to get any updates into the
next Tomcat 9 release.

If you have any difficulties or questions, please ask here.

Thanks,

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org

If you look at the message thread, an unbelievable number of people responded, willing to volunteer. And it took only a few days to show how great an outcome the community could achieve. Here is Mark Thomas's follow-up message, posted just 9 days later (ref: https://lists.apache.org/thread.html/3dfab1b732e4223bd846617086b788ee41d35228b684f4714404f2a3@%3Cusers.tomcat.apache.org%3E):

From: Mark Thomas
To: Tomcat Users List
Subject: Translations update
Date: 2018/11/21 09:58:15
List: users@tomcat.apache.org

Hi all,

I wanted to let you know about the amazing progress that is being made
on the Tomcat translations at
https://poeditor.com/join/project/NUTIjDWzrl

In the short time since this effort has started the community has
achieved the following:

- French has increased from 18% to 64% coverage
- Simplified Chinese has been added and has already reached 32% coverage
- Korean has been added and has reached 10% coverage
- German has increased from 2% to 7% coverage
- Brazilian Portuguese has been added and has reached 4% coverage
- Spanish has increased from 42% to 44% coverage

as well as a smaller number of additions and corrections to another 6
languages.

A big thank you to everyone who has contributed.

There is still lots to do so if you would like to help out please join
us at:
https://poeditor.com/join/project/NUTIjDWzrl

Thanks,

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@tomcat.apache.org
For additional commands, e-mail: users-help@tomcat.apache.org

In less than 10 days, the portion of translated messages increased from 18% to 64% for French, from 42% to 44% for Spanish, and from 2% to 7% for German. Even better, new languages that had never been in the project before were added: 32% for Simplified Chinese, 10% for Korean, and 4% for Brazilian Portuguese.
As of today, April 28, 2019, while I'm writing this article, more than 99% of the messages have been translated into Korean, and more than 140 volunteers have made more than 3,044 contributions in 17 different languages! And the collective work continues. See the POEditor project homepage for details: https://poeditor.com/join/project/NUTIjDWzrl.
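
The coverage figures above come from POEditor. For a single package, a rough local check can be sketched as follows, assuming a checkout of the Tomcat source. The function names are hypothetical, and the parsing is deliberately simplified: it assumes one `key=value` pair per line and ignores line continuations and other separators that the properties format allows.

```shell
#!/bin/sh
# List the message keys defined in a .properties file, one per line, sorted.
# Simplification: comment lines (# or !) are skipped and everything before
# the first "=" on a line is treated as the key.
property_keys() {
  grep -v '^[[:space:]]*[#!]' "$1" | grep '=' | sed 's/[[:space:]]*=.*$//' | LC_ALL=C sort -u
}

# Print the keys that exist in the base file but not in the translated file.
# Arguments: <base.properties> <translated.properties>
untranslated_keys() {
  base_keys=$(mktemp); locale_keys=$(mktemp)
  property_keys "$1" > "$base_keys"
  property_keys "$2" > "$locale_keys"
  # comm -23 keeps only the lines unique to the first (sorted) input.
  LC_ALL=C comm -23 "$base_keys" "$locale_keys"
  rm -f "$base_keys" "$locale_keys"
}

# Example (not executed here), run from the root of a Tomcat source checkout:
#   untranslated_keys java/org/apache/catalina/core/LocalStrings.properties \
#                     java/org/apache/catalina/core/LocalStrings_ko.properties
```

An empty result means every key in the English file also has an entry in the Korean one; it says nothing about the quality of the translations.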

4. And It Continues

Have you ever seen weird translations in software messages during your IT career? Even when the English messages were translated into your language, haven't some of them felt awkward to you?
Sharing these concerns, the Apache Tomcat community suggests that we fix this problem together. The suggestion comes not as an abstract principle but as a very concrete and practical solution: the POEditor project (https://poeditor.com/join/project/NUTIjDWzrl). It is not difficult at all; editing is really easy. If you don't understand why a message is used or in which context, you can ask questions by commenting in the POEditor project. You can share ideas with each other. Committers may answer your questions, and you can discuss with other translators, too.

If you want to join this shared experience of helping each other in the community, feel free to join the Apache Tomcat POEditor project (https://poeditor.com/join/project/NUTIjDWzrl). Choose the language you want to translate into.
Also, to join the users' or developers' mailing lists to ask questions or discuss anything, see https://tomcat.apache.org/lists.html.

In today's world, where everything is digitally connected, people who are far apart geographically, in different time zones and speaking different languages, start contributing to the open source projects that interest them. It is like the way people have collaborated for thousands of years to build shared reservoirs and plant trees on the dams to protect them as commons. People who have already experienced this now try to figure out how to make it easier for others to participate. Easy tools such as POEditor help people get involved more easily. They know that things become easier together and that together they can achieve more for the community.

Wednesday, November 14, 2018

Apache Jackrabbit Database Usage Patterns and Options to Reduce Database size

Recently, I wrote about how to externalize version storage to an SFTP server backend to reduce database size: https://woonsanko.blogspot.com/2018/11/externalizing-jcr-version-storage-with.html. It is a similar case to keeping the binary content in either an AWS S3 bucket or a virtual file system such as an SFTP or WebDAV server, as I described before in https://woonsanko.blogspot.com/2016/08/cant-we-store-huge-amount-of-binary.html. The only difference, at a high level, is that the former is about the version history database table, VERSION_BUNDLE, whereas the latter is about the binary table, DATASTORE.

I'd like to explain how those tables make a significant impact on database size by showing database usage patterns from several real CMS systems. Finally, I'd like to list the benefits of reducing the database size.

Pattern 1: Huge DATASTORE table for a Simple Website

The chart shows that more than 95% of the database is consumed by the DATASTORE table, which stores only binary content such as images and PDF files, not document or configuration nodes and properties. The project implements a CMS-based website serving a huge amount of binaries, but business users probably do not edit and publish documents often. It is also possible that they migrated binary data such as images and PDF files from external sources into the CMS in order to serve them easily through the website.

If they switch the Apache Jackrabbit DataStore component from the default DbDataStore to either S3DataStore or VFSDataStore, they can save more than 95% of the database.

Pattern 2: Big DATASTORE table with Modest Document/Node Updates

This site shows a modest amount of document and node content in the DEFAULT_BUNDLE table, which contains the node bundle data of the default Jackrabbit workspace. It means that business users update and publish a modest amount of content. Still, more than 90% of the database is consumed by binary content in the DATASTORE table alone.

The same story goes here. If they switch the Apache Jackrabbit DataStore component from the default DbDataStore to either S3DataStore or VFSDataStore, they can save more than 90% of the database.

Pattern 3: More Document Oriented CMS

In this site, the DEFAULT_BUNDLE table is relatively bigger than in the other sites, taking more than 50% of the database. It means that content document updates and publication are very important to the business users of this CMS system. They probably need to update and (re)publish content more frequently for their websites.

As the default workspace data needs to be queried and accessed frequently in the delivery web applications, there is nothing more to do about the DEFAULT_BUNDLE table.
However, they still consume more than 20% of the database for binary content in the DATASTORE table alone, and up to 20% of the database for version history in the VERSION_BUNDLE table.
Therefore, if they switch both the DataStore component and the FileSystem component of VersionManager to the alternatives -- S3DataStore / VFSDataStore and VFSFileSystem -- they can save more than 40% of the database.

Pattern 4: More Versioning or Periodic Content Ingestion to CMS

In this site, more than 55% of the database is consumed by version history in the VERSION_BUNDLE table, and up to 30% of the database is consumed by binary content in the DATASTORE table.
There are two possibilities: (a) business users update and publish documents so often that it results in a lot of version history data, or (b) a batch job runs periodically to import external content into the CMS and publishes the updated documents after each import.
In either case, if they switch both the DataStore component and the FileSystem component of VersionManager to the alternatives -- S3DataStore / VFSDataStore and VFSFileSystem -- they can save more than 85% of the database.

Benefits of Reducing Database Size

So what are the benefits of reducing the repository database size?
Here's my list:

Transparent JCR API

As you're switching only Apache Jackrabbit internal components, it doesn't affect applications. You don't need to write or use a plugin to manage binary content in a different storage by yourself. The existing JCR API still works transparently.
Indexing still works transparently. If you upload a PDF file, it will be indexed and searchable. However, if you implement a custom solution, you need to take care of it by yourself.

Almost unlimited storage for binaries

If you use an S3 bucket, an SFTP gateway for Google Cloud Platform, or even an SFTP server directly, you can store a practically unlimited amount of binaries and version history in the modern cloud computing world.

Cheaper storage

Amazon S3 or an SFTP server is a lot cheaper than the database option. For example, Amazon RDS is more expensive than S3 storage for binary content.

Faster backup, import, migration

The Apache Jackrabbit DataStore component allows you to do hot backups and to restore from the backup files to the backend system at runtime.

Build new environment quickly from production data

As the database is small enough in most cases, you can build a new environment from another environment's backups more quickly.

Save backup storage

If you do nightly backups, weekly backups, etc. and you have to keep those backup files for some period (e.g., 1 year), you might sometimes need to worry about backup disk storage. If the database is small enough, you can ease those concerns by taking advantage of S3 backup capabilities.

Encryption at rest

If you have sensitive PDF files, for example, you might want to take advantage of the encryption at rest provided by Amazon S3 or the Linux file system.

Externalizing JCR Version Storage with VFSFileSystem

A while ago, I wrote a blog article, "Can't we store huge amount of binary data in JCR?". It was about switching the Apache Jackrabbit DataStore from DbDataStore to either S3DataStore or VFSDataStore. Depending on your database usage pattern, it allows you to save a huge amount of database space just by switching the DataStore component configuration in repository.xml.

In some cases, the version history data in the VERSION_BUNDLE table can be as big as the DATASTORE table. The following is an excerpt from https://www.onehippo.org/library/administration/maintenance/cleaning-up-version-history.html, explaining what happens when you (de)publish a document, causing revisions in the version history:

Each time a document is published, a copy of the current state of the document is stored as a new version. While this feature enables users to restore any previously published version of their document, it comes at the cost of an ever increasing size of the version history storage.

So if your users update and publish documents regularly, the version history data size will increase proportionally as time goes by, which might cause a big database size at some point. Administrators need to monitor it and they might need to remove old revisions just to reduce the database size.

The same story goes here as with the binary storage issue in the database that I dealt with in my previous blog article. Is there a solution for this? Do we really need to worry about the database growing because of the version history?

Yes, we have a solution in Apache Jackrabbit: VFSFileSystem.

JackrabbitRepository component uses two distinct internal components: Workspace and VersionManager. (I'm using logical names instead of physical class names such as org.apache.jackrabbit.core.RepositoryImpl.WorkspaceInfo here.) See the diagram below:

Whenever a version needs to be made, the node data is copied to VersionManager, which saves the data in its own FileSystem -- DatabaseFileSystem by default if you use RDBMS persistence for Apache Jackrabbit. That's why the database grows by default whenever a version is made.

Now if you switch the internal FileSystem of the VersionManager to VFSFileSystem with an SFTP or WebDAV backend, all the version data, the copies from the Workspace, will be stored in that external file system instead.

Switching it to VFSFileSystem for VersionManager is straightforward. See the following snippets from repository.xml configuration:

<Repository>


  <!-- SNIP -->


  <Versioning rootPath="${rep.home}/version">

    <FileSystem class="org.apache.jackrabbit.vfs.ext.fs.VFSFileSystem">
      <param name="config" value="${catalina.base}/conf/vfs2-filesystem-sftp.properties" />
    </FileSystem>

    <PersistenceManager
      class="org.apache.jackrabbit.core.persistence.bundle.BundleFsPersistenceManager">
    </PersistenceManager>

    <!-- SNIP -->

  </Versioning>

  <!-- SNIP -->

</Repository>

Just replace the FileSystem element and the PersistenceManager element inside the Versioning element to use VFSFileSystem, which is configured with a properties file specifying the SFTP credentials or a private key identity file.
Apache Jackrabbit Repository will then store all the version history data in the backend SFTP file system instead of the database.

Please find a working demo project on GitHub at https://github.com/woonsanko/hippo-davstore-demo. The demo project shows how to use VFSFileSystem with an SFTP backend for version history data, as well as the binary DataStore option with either a VFS file system or an AWS S3 bucket backend. Just follow its README.md.

Friday, January 12, 2018

Recipe for Migrating Hippo CMS Database from One to Another

Sometimes people want to migrate an existing Hippo CMS database from one database system to another. For example, they have been running Hippo CMS on an Oracle database, but after a while they start thinking about moving their on-premise system and database to a cloud platform. It sounds like a typical use case, so there must be some solutions already out there, right?

Well, surprisingly, many people don't know that Apache Jackrabbit has provided a repository copying (or "backup" or "migration", as the documentation calls it) tool since v1.6, dating back to 2010!

There are some reasons why people don't know about the useful tool:

Many people use a vendor-specific Apache Jackrabbit repository implementation from a specific project or product, not the Apache Jackrabbit Standalone Server itself. So, even though the backup and migration feature is well documented on the Apache Jackrabbit Standalone Server page, it is hard for them to follow.
Each vendor-specific implementation built on Apache Jackrabbit, such as Hippo CMS, has some tweaks for its own purposes, including extra libraries on top of the default Apache Jackrabbit modules. So, if users don't know which extra libraries to add, it can hardly work for them.

That's why I created a 'recipe' project in one of my GitHub repositories:

https://github.com/woonsanko/recipe-for-hippo-db-migration

The recipe introduces a step-by-step guide, with Hippo CMS specific examples. I think it should be helpful to other Apache Jackrabbit derivatives too. Please browse the source.
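
For orientation, the underlying tool is driven from the command line of the Jackrabbit standalone jar with its backup options. Below is a minimal sketch that only prints the command so it can be reviewed before running. The function name, the jar file name and the directory names are placeholders of mine, and it assumes each repository directory holds its own repository.xml, with the target one pointing at the new database. As explained above, Hippo CMS needs extra libraries on top of this, which is what the recipe covers, so this is not a complete procedure.

```shell
#!/bin/sh
# Hypothetical helper: print (not run) a backup/migration command line for
# the Jackrabbit standalone jar.
# Arguments: <standalone_jar> <source_repo_dir> <target_repo_dir>
# Assumption: each repository directory contains its own repository.xml.
backup_command() {
  printf 'java -jar %s --backup --repo %s --conf %s/repository.xml --backup-repo %s --backup-conf %s/repository.xml\n' \
    "$1" "$2" "$2" "$3" "$3"
}

# Example (prints the command only):
#   backup_command jackrabbit-standalone.jar source-repo target-repo
```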

Last but not least, many thanks to Apache Jackrabbit Standalone Server tool! Cheers!

Tuesday, May 23, 2017

Remoting for Automation via Apache Jackrabbit JCR Webdav Server from Command Lines

Sometimes we need to create, update or even delete data in JCR in an automated way. For example, we need to update some properties on specific configuration nodes just after resetting the database and restarting the server for a specific environment. Or sometimes you need to import some data from XML into a remote JCR just after startup. Obviously you can do these manually through the UI, but concerns arise when you need to do them in an automated way through a batch job or script.
I'd like to introduce the Apache Jackrabbit JCR WebDAV Server, which provides an advanced remoting feature, and show how you can take advantage of that feature in an automated way, for example by executing commands from the command line.

Apache Jackrabbit JCR WebDAV Server

Apache Jackrabbit JCR WebDAV Server was basically designed to support remote JCR API calls over the underlying WebDAV protocol. You can create, read, update or delete data in a JCR content repository through JCR WebDAV Server via either a) the JCR Client API or b) direct WebDAV requests from the client.

It is really good to be able to use JCR APIs directly from a remote client without having to care about the details of the WebDAV/HTTP payloads, which would be a good topic to cover later. In this article, though, I'd like to focus only on the command line client use cases because they are more related to the "automation" topic of this article.

Command Line Examples through WebDAV/HTTP

I don't want to copy every example again here. Jukka Zitting, a former chairman of the Apache Jackrabbit project and incubator PMCs, already explained it with very intuitive examples in one of his great blog articles:

https://jukkaz.wordpress.com/2009/11/24/jackrabbit-over-http/

The blog article of Jukka's explains how to create a node, how to read a node, how to update one single-valued property in different types such as date or string, and how to delete a node.

I'd like to just add two more helpful examples below.

Updating Multiple Values Property from Command Lines

If you want to update a multi-valued property such as the hipposys:members property in the following example CND, you can't use the single-valued property example from Jukka's blog article:

[hipposys:group] > nt:base
- hipposys:system (boolean)
- hipposys:members (string) multiple
- hipposys:description (string)
...

To update a multi-valued property, you need to wrap the values in a <values xmlns='http://www.day.com/jcr/webdav/1.0'>...</values> element in the data argument. Here's an example curl command to update the property:

  curl --request PUT --header "Content-Type: jcr-value/undefined" \
    --data "<values xmlns='http://www.day.com/jcr/webdav/1.0'><value>editor</value><value>john</value><value>jane</value></values>" \
    --user admin:admin \
    http://localhost:8080/cms/server/default/jcr:root/hippo:configuration/hippo:groups/editor/hipposys:members
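
When the member list comes from a script, typing the payload by hand gets error-prone. Here is a small sketch of a helper that builds the same `<values>` payload from its arguments. The function name is hypothetical, and the values are inserted as-is, so this assumes they contain no XML special characters such as `&` or `<`.

```shell
#!/bin/sh
# Hypothetical helper: build the <values> payload for a multi-valued property
# from the given arguments.
# Assumption: values contain no XML special characters (&, <, >); they are
# not escaped here.
jcr_values_payload() {
  payload="<values xmlns='http://www.day.com/jcr/webdav/1.0'>"
  for v in "$@"; do
    payload="$payload<value>$v</value>"
  done
  printf '%s</values>\n' "$payload"
}

# Example (not executed here), equivalent to the curl command above:
#   curl --request PUT --header "Content-Type: jcr-value/undefined" \
#     --data "$(jcr_values_payload editor john jane)" \
#     --user admin:admin \
#     http://localhost:8080/cms/server/default/jcr:root/hippo:configuration/hippo:groups/editor/hipposys:members
```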

Importing System View XML file to JCR from Command Lines

This example is basically a variation of how to create a node, as shown in Jukka's blog article, using an external system view XML file instead.
Suppose you have the following system view XML file (e.g., editor.xml):

<?xml version="1.0" encoding="UTF-8"?>
<sv:node xmlns:sv="http://www.jcp.org/jcr/sv/1.0" sv:name="editor">
  <sv:property sv:name="jcr:primaryType" sv:type="Name">
    <sv:value>hipposys:group</sv:value>
  </sv:property>
  <sv:property sv:name="hipposys:members" sv:type="String" sv:multiple="true">
    <sv:value>editor</sv:value>
    <sv:value>john</sv:value>
  </sv:property>
  <sv:property sv:name="hipposys:securityprovider" sv:type="String">
    <sv:value>internal</sv:value>
  </sv:property>
</sv:node>

You can pipe the content of the input file, editor.xml, into a curl command by specifying the --data argument as @-, which means the data is read from the standard input.

  cat editor.xml | curl -v --request MKCOL --data @- --user admin:admin \
    http://localhost:8080/cms/server/default/jcr:root/hippo:configuration/hippo:groups/editor

The command can also be written as follows, redirecting the file to the standard input instead of piping it:

  curl --request MKCOL --data @- --user admin:admin \
    http://localhost:8080/cms/server/default/jcr:root/hippo:configuration/hippo:groups/editor \
    < editor.xml

Or, you can specify the input file directly by prefixing the file path with '@', as in the following example:

  curl --request MKCOL --data @editor.xml --user admin:admin \
    http://localhost:8080/cms/server/default/jcr:root/hippo:configuration/hippo:groups/editor

So, if you want to remove an existing /hippo:configuration/hippo:groups/editor node and recreate it from the XML file, you can execute a delete command like the following, followed by one of the create commands shown above:

  curl --request DELETE --user admin:admin \
    http://localhost:8080/cms/server/default/jcr:root/hippo:configuration/hippo:groups/editor
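For repeated use, the delete and create steps can be wrapped in a small script. This is a minimal sketch, not a definitive implementation: the function names jcr_url and recreate_node and the JCR_BASE_URL and JCR_USER variables are assumptions made for illustration, with defaults copied from the examples above.

```shell
#!/bin/sh
# Assumed defaults, copied from the examples above; override via environment.
JCR_BASE_URL="${JCR_BASE_URL:-http://localhost:8080/cms/server/default/jcr:root}"
JCR_USER="${JCR_USER:-admin:admin}"

# Map an absolute JCR path to its WebDAV URL.
jcr_url() {
  printf '%s%s' "$JCR_BASE_URL" "$1"
}

# Delete the node at JCR path $1, then recreate it from system view XML file $2.
# The DELETE runs without --fail, so an HTTP error such as 404 for a missing
# node does not stop the sequence; only transport errors (e.g. server down) do.
# The MKCOL uses --fail so that an HTTP error makes the function return non-zero.
recreate_node() {
  curl --silent --show-error --request DELETE \
    --user "$JCR_USER" "$(jcr_url "$1")" &&
  curl --silent --show-error --fail --request MKCOL --data "@$2" \
    --user "$JCR_USER" "$(jcr_url "$1")"
}

# Usage:
#   recreate_node /hippo:configuration/hippo:groups/editor editor.xml
```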

I guess you've already grasped the whole idea of how we can take advantage of JCR WebDAV Server for automation from the command line. Basically, you can easily create, read, update or delete any content in JCR from the command line!

How to Enable JCR WebDAV Server in Hippo CMS Repository?

By default, JCR WebDAV Server is not enabled in Hippo CMS Repository, but you can easily install and configure it by following this community forge plugin documentation:

https://onehippo-forge.github.io/hippo-jcr-over-webdav/

Please let me know if you have any further questions.

Thursday, October 20, 2016

Playing with an Apache Jackrabbit DataStore Migration Tool

A while ago, I posted a blog article (Can't we store huge amount of binary data in JCR?) about why Apache Jackrabbit VFSDataStore or S3DataStore is useful and how to use it when storing a huge amount of binary data in JCR. However, we already have many running JCR systems with different DataStores (e.g. DbDataStore), so we need to be able to migrate an existing DataStore to VFSDataStore or S3DataStore. That's what I wanted to do with a migration tool (https://github.com/woonsan/jackrabbit-datastore-migration).
In this article, I'd like to share my experience of migrating a DbDataStore to a VFSDataStore with the tool in a real project.

The Problem

One of my projects (based on Hippo CMS) uses DbDataStore, which is configured in repository.xml as the default option like the following:

<DataStore class="org.apache.jackrabbit.core.data.db.DbDataStore">
  <param name="url" value="java:comp/env/jdbc/repositoryDS" />
  <param name="driver" value="javax.naming.InitialContext" />
  <param name="databaseType" value="mysql" />
  <param name="minRecordLength" value="1024" />
  <param name="maxConnections" value="5" />
  <param name="copyWhenReading" value="true" />
</DataStore>

Basically, I want to replace the DbDataStore with a VFSDataStore backed by an SFTP server once the data migration is done:

<DataStore class="org.apache.jackrabbit.vfs.ext.ds.VFSDataStore">
  <param name="config" value="${catalina.base}/conf/vfs2-datastore-sftp.properties" />
  <param name="asyncWritePoolSize" value="10" />
  <param name="secret" value="123456"/>
  <param name="minRecordLength" value="1024"/>
  <param name="recLengthCacheSize" value="10000" />
</DataStore>

And, vfs2-datastore-sftp.properties should look like the following:

# SFTP base folder URL
baseFolderUri = sftp://tester:secret@localhost/vfsds
# when the identity file (your private key file) is used instead of password
#fso.sftp.identities = /home/tester/.ssh/id_rsa

So, we need to migrate all the data managed by DbDataStore to the SFTP location before switching to VFSDataStore.

Data Migration Steps

First of all, we need to download the latest version of the migration tool from https://github.com/woonsan/jackrabbit-datastore-migration/releases.

After uncompressing the downloaded file into a folder, we can build it with `mvn package`, which generates the `jackrabbit-datastore-migration-x.x.x.jar` file under the `target` folder.

Second, we need to configure the "source" DataStore and "target" DataStore in a YAML file like the following example (e.g. config/migration-db-to-vfs.yaml):

logging:
    level:
        root: 'WARN'
        com.github.woonsan.jackrabbit.migration.datastore: 'INFO'

batch:
    minWorkers: '10'
    maxWorkers: '10'

source:
    dataStore:
        homeDir: 'target/storage-db'
        className: 'org.apache.jackrabbit.core.data.db.DbDataStore'
        params:
            url: 'jdbc:mysql://localhost:3306/hippodb?autoReconnect=true&characterEncoding=utf8'
            user: 'hippo'
            password: 'hippo'
            driver: 'com.mysql.jdbc.Driver'
            databaseType: 'mysql'
            minRecordLength: '1024'
            maxConnections: '10'
            copyWhenReading: 'true'
            tablePrefix: ''
            schemaObjectPrefix: ''
            schemaCheckEnabled: 'false'

target:
    dataStore:
        homeDir: 'target/storage-vfs'
        className: 'org.apache.jackrabbit.vfs.ext.ds.VFSDataStore'
        params:
            asyncUploadLimit: '0'
            baseFolderUri: 'sftp://tester:secret@localhost/vfsds'
            minRecordLength: '1024'

As you can see, the "source" DataStore is configured with DbDataStore backed by a MySQL database, and the "target" DataStore is configured with VFSDataStore backed by an SFTP location.

Please note that the configuration style for each DataStore is actually equivalent to how it is set in repository.xml if you compare both configurations.

In addition, the YAML configuration contains settings for logging and thread pool worker counts, since logging and multi-threaded workers are important in this kind of batch application.

Now, it's time to execute the migration tool.

Assuming you have the JDBC driver jar file in the lib/ directory (e.g. lib/mysql-connector-java-5.1.38.jar), you can execute the tool like the following:

java -Dloader.path="lib/" \
     -jar target/jackrabbit-datastore-migration-0.0.1-SNAPSHOT.jar \
     --spring.config.location=config/migration-db-to-vfs.yaml

Or, if you know the specific location where the JDBC driver jar file exists, you can run it like this instead:

java -Dloader.path=/home/tester/.m2/repository/mysql/mysql-connector-java/5.1.38/ \
     -jar target/jackrabbit-datastore-migration-0.0.1-SNAPSHOT.jar \
     --spring.config.location=config/migration-db-to-vfs.yaml

If your configuration is okay and the tool runs properly, you will see result logs like the following:

  .   ____          _            __ _ _
 /\\ / ___'_ __ _ _(_)_ __  __ _ \ \ \ \
( ( )\___ | '_ | '_| | '_ \/ _` | \ \ \ \
 \\/  ___)| |_)| | | | | || (_| |  ) ) ) )
  '  |____| .__|_| |_|_| |_\__, | / / / /
 =========|_|==============|___/=/_/_/_/
 :: Spring Boot ::        (v1.4.0.RELEASE)

...
2016-10-17 23:14:44.785  INFO 5071 --- [           main] .w.j.m.d.b.MigrationJobExecutionReporter :
===============================================================================================================
Execution Summary:
---------------------------------------------------------------------------------------------------------------
Total: 22383, Processed: 22383, Read Success: 22383, Read Fail: 0, Write Success: 22383, Write Fail: 0, Duration: 1887607ms
---------------------------------------------------------------------------------------------------------------
Details (in CSV format):
---------------------------------------------------------------------------------------------------------------
SEQ,ID,READ,WRITE,SIZE,ERROR
1,000082f676bf6ed3a39debd6b656287efa6687b6,true,true,54395,
2,00030e05db40739611fcd06d1af26cc7a6afd5b0,true,true,626789,
3,00076b2f5e4e43245928accbcc90fcf738121652,true,true,4097,
...
22382,fff8b902c59f0c310ff53952d86b17d383805355,true,true,258272,
22383,fffc167ef45efdc9b3bea7ed3953fea8ccdb294f,true,true,518903,
===============================================================================================================

2016-10-17 23:14:44.820  INFO 5071 --- [           main] c.g.w.j.migration.datastore.Application  : Started Application in 1892.767 seconds (JVM running for 1893.449)

Spring Boot generates the logging very nicely by default. You can also change the logging configuration. Please see the Spring Boot documentation for that.
After the execution log lines, the tool prints the result at the end in CSV format, including the record sequence number, record ID, read/write status, byte size and error information.
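With tens of thousands of records, it helps to filter the CSV details for rows that did not migrate cleanly. The following is a minimal sketch, assuming the row layout shown above (SEQ,ID,READ,WRITE,SIZE,ERROR) and assuming that a failed read or write is reported as anything other than true. The sample output above contains no failures, so that format is a guess. The function name failed_records and the file name migration.log are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the CSV detail rows whose READ or WRITE column
# is not "true". Rows are recognized by a purely numeric first field (SEQ),
# which skips the header, the summary line and ordinary log lines.
failed_records() {
  awk -F, '$1 ~ /^[0-9]+$/ && NF >= 5 && ($3 != "true" || $4 != "true")' "$@"
}

# Usage, assuming the tool's output was saved to migration.log:
#   failed_records migration.log
```

An empty result means that every record was both read and written successfully, which should match the "Read Fail: 0" and "Write Fail: 0" counters in the execution summary.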

Switching to VFSDataStore and Restarting

Once all the binary data is migrated from DbDataStore to VFSDataStore (to the SFTP location), we can switch to VFSDataStore by replacing the old <DataStore> element with the following in repository.xml:

<DataStore class="org.apache.jackrabbit.vfs.ext.ds.VFSDataStore">
  <param name="config" value="${catalina.base}/conf/vfs2-datastore-sftp.properties" />
  <param name="asyncWritePoolSize" value="10" />
  <param name="secret" value="123456"/>
  <param name="minRecordLength" value="1024"/>
  <param name="recLengthCacheSize" value="10000" />
</DataStore>

Restart the server, and now the binary data will be served from the SFTP server through the new VFSDataStore component!