Introducing Migrate Source JCR

October 30, 2019
By Joseph Purcell,
Lead Architect

Migrating content from one technology stack to another can be an arduous and error-prone process. On enterprise sites, the challenges are magnified by the amount and complexity of content, and that is certainly true of a migration from Adobe Experience Manager (AEM) to Drupal.

AEM and Drupal both leverage open-source technologies for data storage. With our deep expertise in both and our involvement in open-source communities, we knew there had to be a reliable, repeatable solution. There was none.

Now there is! We’ve created Migrate Source JCR, a Drupal module that lets you extract AEM data, transform it, and load it directly into Drupal. Let’s look at how it works in more detail.

Why Use the Migrate Source JCR Module

When migrating content from AEM to Drupal, you may choose to manually re-author some of the content or even drop some pages to consolidate. But for areas of content that are similar and plentiful, such as a blog section, you likely want to migrate the content programmatically. It’s for this activity that the Migrate Source JCR module provides significant benefit.

First, let’s look at what options currently exist for migrating content into Drupal. There are three tools:

  • CSV Importer
  • Feeds
  • Migrate API

A comparison of these tools merits its own conversation. For the sake of brevity, let’s skip to the conclusion: we want to use Migrate API because it is the most robust option for large and complex data sets.

Yet, regardless of the tool used, there is a major flaw: they all require an intermediary format. None of those options supports reading directly from AEM, so you would need to extract AEM data into a format like CSV, XML, or JSON, and then migrate that into Drupal. If you’ve done enough migrations, you know how risky this is.

Enter Migrate Source JCR! It lets you read data directly from AEM’s data storage, the Java Content Repository (JCR). And that’s how the module gets its name: it allows Migrate API to read data from a JCR source. On the surface, it’s not obvious how that works, so let’s take a deeper look.

What is the Migrate Source JCR Module

The Migrate Source JCR module is just 150 lines of code! So to understand what it is, we really want to look at the two functionalities it ties together: Drupal’s Migrate API and the Jackalope library.

Drupal’s Migrate API

Let’s start with an explanation of Migrate API, which will help you understand what the Migrate Source JCR module actually is. Migrate API is extract-transform-load (ETL) infrastructure within Drupal. Each step of the ETL process is easily extended by using plugins, and there are many plugins already built-in: see the documentation for a list of supported source plugins, process plugins, and destination plugins.

The drupal.org documentation explains this with a simple diagram:

diagram of the migrate etl process

Source: https://www.drupal.org/docs/8/api/migrate-api/migrate-api-overview#s-mi…
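
To make the three stages concrete, here is a minimal Python sketch of the general ETL pattern. It is not Drupal's Migrate API, which is written in PHP and built on plugins; the function names `extract`, `transform`, and `load`, and the sample row, are hypothetical and exist only for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, str]


def extract(records: Iterable[Row]) -> Iterator[Row]:
    """Source step: yield raw rows one at a time."""
    for record in records:
        yield dict(record)


def transform(row: Row, mapping: Dict[str, Callable[[Row], str]]) -> Row:
    """Process step: build each destination field from the source row."""
    return {dest_field: build(row) for dest_field, build in mapping.items()}


def load(rows: Iterable[Row], storage: List[Row]) -> None:
    """Destination step: persist the transformed rows."""
    storage.extend(rows)


# Hypothetical source data and field mapping.
source_rows = [{"jcr:title": " Blog Post 1 ", "text": "<p>Hello</p>"}]
mapping = {
    "title": lambda r: r["jcr:title"].strip(),
    "body": lambda r: r["text"],
}

saved: List[Row] = []
load((transform(r, mapping) for r in extract(source_rows)), saved)
```

Each stage only knows about the shape of a row, which is what allows one stage (for example, the source) to be swapped out without touching the others.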

In addition to the highly extensible nature of Migrate API, there are four other features important to note:

  • Native Drupal support: Migrate API has destination plugins for all entity types in Drupal.
  • Composable migrations: If you have hierarchical or relational data, you can specify that a certain field in a migration is actually a reference to the result of another migration.
  • Continuous migrations: Every migration has a unique key defined which allows updates to existing data on subsequent migrations. For large datasets, a highwater mark can be used to ensure you’re only migrating what has changed.
  • Rollbacks: If there are any issues during a migration, it can be rolled back to its pre-migration state.

The first and second points are critical.

First, native Drupal support means the “Load” step is already built for you. You don’t have to write a single line of code to take the transformed data, create an entity, and save it to Drupal’s storage. This effort is already handled by the destination plugins Migrate API provides.

Second, composable migrations mean you can model your AEM data in a similar way in Drupal without causing more difficulty in the migration process. Certainly, there’s still a lot of effort in ensuring the related data is keyed on the correct fields and verifying there are no issues, but at least the logic of making the relationship comes out of the box.
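
The continuous-migration and rollback features listed above can also be sketched in a few lines. The Python below is an illustration of the idea only, under the assumption that each source row has a unique key (`path`) and a last-modified value (`modified`); the class name `IdMapMigration` and all field names are hypothetical and do not reflect how Migrate API stores its state.

```python
from typing import Dict, List, Optional


class IdMapMigration:
    """Toy migration that tracks source keys, a highwater mark, and rollback."""

    def __init__(self) -> None:
        self.destination: Dict[int, dict] = {}   # stand-in for Drupal storage
        self.id_map: Dict[str, int] = {}         # source key -> destination id
        self.high_water: Optional[str] = None    # ISO dates compare as strings
        self._next_id = 1

    def run(self, rows: List[dict]) -> int:
        """Import rows changed since the last run; return how many were processed."""
        threshold = self.high_water
        processed = 0
        for row in rows:
            if threshold is not None and row["modified"] <= threshold:
                continue  # unchanged since the previous run
            dest_id = self.id_map.get(row["path"])
            if dest_id is None:  # first time we see this key: create
                dest_id = self._next_id
                self._next_id += 1
                self.id_map[row["path"]] = dest_id
            self.destination[dest_id] = {"title": row["title"]}  # create or update
            if self.high_water is None or row["modified"] > self.high_water:
                self.high_water = row["modified"]
            processed += 1
        return processed

    def rollback(self) -> None:
        """Remove everything this migration created and forget its state."""
        for dest_id in self.id_map.values():
            self.destination.pop(dest_id, None)
        self.id_map.clear()
        self.high_water = None


migration = IdMapMigration()
rows = [
    {"path": "/blog/a", "modified": "2019-10-01", "title": "A"},
    {"path": "/blog/b", "modified": "2019-10-02", "title": "B"},
]
first_run = migration.run(rows)
second_run = migration.run(rows)  # nothing changed, so nothing is processed
```

The unique key is what turns a re-run into an update rather than a duplicate, and the same key map is what makes a rollback possible.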

All of that was just explaining Migrate API, but it allows us to explain the Migrate Source JCR module very simply: it is a source plugin that handles the extract step of the ETL process within Migrate API. Specifically, it brings AEM data into runtime so that it can be transformed and then loaded into Drupal. How does it bring AEM data into runtime? That’s where Jackalope comes into play.

Jackalope

In 2009, Liip introduced Jackalope to the PHP world. Their documentation succinctly explains what this library is:

Jackalope is an open source implementation of the PHPCR API, which is a PHP adaption of the Java Content Repository (JCR) standard, an open API specification defined in JSR-170/283.

While there is a longer history with PHPCR and the many people who helped make Jackalope possible, the part to focus on is that Jackalope is an adaptation of the JCR standard. AEM uses Jackrabbit for its data storage, which is also built on JCR. Consequently, we can use Jackalope to query AEM’s storage for data. If JCR were not a standard that both implemented, there would be no way to directly extract the data with PHP.

Now, there is a very important detail here: we only need to read from AEM. Remember we mentioned that Jackalope and AEM use adaptations of the JCR standard? That means there is not complete feature parity, and certain operations may result in an error. Because we are only reading data, and the APIs for queries and data types are well defined, we would not expect an issue during migration.

In conclusion, we can say the Migrate Source JCR module is a source plugin that uses the Jackalope library to read from JCR storage during the extract step of the ETL process within Migrate API.

How to Use the Migrate Source JCR Module

Using the module requires expertise in both AEM and Drupal’s Migrate API. The module includes documentation that explains how to migrate from a JCR source and that will give you all you need to get started. But, when migrating from AEM, you’ll need to know more than just the basics. So, here we’re going to dig into understanding how AEM structures its data and how to reference it in your migration.

Understanding the AEM Database

Drupal uses a relational database management system (RDBMS) for its database, such as Microsoft SQL Server or MySQL. Data is stored in sets of tables, similar to how you would think of tabs in a spreadsheet. Each row of a table is a record, and the structure of that record is represented by the columns. The record’s properties are the values stored in each column, i.e., the “cells” if you are still picturing a table in a spreadsheet.

AEM uses JCR, which is quite different; it’s an object database. Data is stored in nodes that are part of a tree. Each node in a branch of the tree is a record, and the structure of that record is not always guaranteed. The record’s values are stored as properties on the node, but which properties are allowed varies by the type of node. Some nodes may even be an “unstructured” type, which would be similar to what you would expect with the “NoSQL” class of databases.

These different structures mean they will store data about a blog post in different ways. When Drupal stores a blog post it will normalize data across many tables. The blog’s title may be stored in one table while the body content is stored in another. If you were to explore the database for a blog post, you would need to look across all these tables for the data.

AEM, on the other hand, will not normalize the data. Instead, the data is stored on the node itself and on its sub-nodes. In many ways, this makes exploring data far simpler. If you were to look for a blog post, all you need to do is find the node for the blog post and inspect that node and its sub-nodes.
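
The contrast can be sketched in Python. This is a simplified illustration, not a dump of either system: the two dictionaries named after Drupal's `node_field_data` and `node__body` tables keep only the columns needed here, and the layout of the tree (a `text` sub-node with a `text` property) is an assumption made for the example.

```python
# Relational style (simplified): one post is spread across tables keyed by id.
node_field_data = {1: {"type": "blog", "title": "Blog Post 1"}}
node__body = {1: {"body_value": "<p>Hello</p>"}}


def load_from_tables(nid: int) -> dict:
    """Reassemble a post by looking it up in every table that holds a piece of it."""
    return {
        "title": node_field_data[nid]["title"],
        "body": node__body[nid]["body_value"],
    }


# Tree style (simplified): the post is one node; data sits on it and its sub-nodes.
# Properties and sub-nodes share one dict here purely to keep the sketch short.
blog_post_node = {
    "jcr:title": "Blog Post 1",
    "text": {"text": "<p>Hello</p>"},  # sub-node "text" with a property "text"
}


def load_from_tree(node: dict) -> dict:
    """Read a post by inspecting the node and walking into its sub-nodes."""
    return {"title": node["jcr:title"], "body": node["text"]["text"]}
```

Both functions return the same post; what differs is where you have to look to assemble it.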

It may surprise you to know that they both support SQL queries. The difference, of course, is when you have a record returned from a SQL query in AEM, it’s a node with properties and sub-nodes. Keeping that in mind, let’s go exploring!

Querying the AEM Database

AEM comes with a browser interface called CRXDE Lite which provides a way for us to navigate and query the JCR database. Depending on your AEM setup, you can access it at “example.com:4502/crx/de/index.jsp”. Go there and under Tools → Query, you’ll find an interface that lets you write queries to explore the JCR data. Supported queries include XPath, SQL, and SQL2.

What query shall we write? Let’s explore AEM’s “WeRetail” sample data; specifically, let’s look at the blog pages. There’s some AEM knowledge that will help us know where to look first. Content is generally found under “/content/{sitename}”, where “{sitename}” is the name of the site you’re exploring content for, and assets stored in AEM’s Digital Asset Manager (DAM) are found under “/content/dam”. Also, the JCR path in AEM closely mirrors the URL path. So, knowing that the blog URL is http://weretail.com/community/blog.html, we can browse in CRX to “/content/we-retail/us/en/community/blog” to find its content.

We know the path in the tree where the blog content lives, but we can’t just query for all nodes in that path. Remember it’s a tree, so we need to search that path for a specific type of node so that we avoid retrieving all its sub-nodes as well. And here is where more AEM knowledge is useful. The “sling:resourceType” property of any given node can help classify that node as a particular type. Looking at the properties of these blog nodes we can see they all have a type “weretail/components/structure/page”.

Putting all of that information together, we can write the following query:

SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE(node, "/content/we-retail/us/en/community/blog/") AND [sling:resourceType] = "weretail/components/structure/page"

We can plug this query into CRX and run it to verify that it returns just the blog nodes we want to migrate:

screen shot of AEM to Drupal Query

Perfect, this shows us the three blog nodes that we want. We can see there is a 1:1 relationship between these nodes and the Drupal node entity we will be creating. In our migration configuration file, we would specify the query like so:

source:
  plugin: jcr
  host: "http://localhost:8080/server"
  query: 'SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE(node, "/content/we-retail/us/en/community/blog/") AND [sling:resourceType] = "weretail/components/structure/page"'
  type: "JCR-SQL2"
  user: "admin"
  pass: "admin"
  workspace: "default"
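
To see what the two conditions in that query select, here is a small Python sketch that applies the same logic to an in-memory stand-in for the repository. The paths below the blog folder (`post-1`, the `about` page) and the `weretail/components/content/text` resource type are hypothetical sample data, and the functions only imitate the descendant and property checks; they are not how JCR evaluates queries.

```python
from typing import Dict, List

Repo = Dict[str, Dict[str, str]]

BLOG = "/content/we-retail/us/en/community/blog"
PAGE_TYPE = "weretail/components/structure/page"

# Hypothetical, flattened repository: node path -> properties.
repository: Repo = {
    BLOG + "/post-1": {"jcr:primaryType": "cq:Page"},
    BLOG + "/post-1/jcr:content": {
        "sling:resourceType": PAGE_TYPE,
        "jcr:title": "Blog Post 1",
    },
    BLOG + "/post-1/jcr:content/root/responsivegrid/text": {
        "sling:resourceType": "weretail/components/content/text",
        "text": "<p>Hello</p>",
    },
    # Same resource type, but outside the blog path.
    "/content/we-retail/us/en/about/jcr:content": {
        "sling:resourceType": PAGE_TYPE,
        "jcr:title": "About",
    },
}


def is_descendant(path: str, ancestor: str) -> bool:
    """True if `path` lies strictly below `ancestor` in the tree."""
    return path.startswith(ancestor.rstrip("/") + "/")


def select(repo: Repo, ancestor: str, resource_type: str) -> List[str]:
    """Return paths below `ancestor` whose sling:resourceType matches."""
    return sorted(
        path
        for path, props in repo.items()
        if is_descendant(path, ancestor)
        and props.get("sling:resourceType") == resource_type
    )


matches = select(repository, BLOG + "/", PAGE_TYPE)
```

The path condition alone would also return every sub-node of each post, and the resource type condition alone would return pages from the whole site; it is the combination that narrows the result to one node per blog post.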

Now, let’s look at one of those blog nodes to see its properties in CRX:

screen shot of AEM to Drupal Properties

You can see that the property “jcr:title” contains the title of the post, “Blog Post 1.” We’ll take note of this for use in writing the migration.

Now, where is the content on the page? For that, we’ll need to explore sub-nodes. Clicking the plus sign to expand the menu on the left side lets us see there is a sub-node called “text” and on that, we can see there are some properties:

screen shot AEM to Drupal Properties Text

Here we can see a property called “text” contains the markup on the page. So, starting with the node that our query matched, “jcr:content”, the sub-path to this node is “root/responsivegrid/text” and the property we want on it is called “text”.
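
Reading a value by sub-path and property amounts to walking down the tree from the matched node. The Python sketch below shows that lookup under stated assumptions: the node structure (`properties` and `children` dictionaries), the helper names `make_node` and `read_property`, and the sample markup are all hypothetical, and this is not the module's actual implementation.

```python
from typing import Any, Dict, Optional

Node = Dict[str, Any]  # {"properties": {...}, "children": {...}}


def make_node(properties: Optional[dict] = None, children: Optional[dict] = None) -> Node:
    """Build a node with explicit properties and named child nodes."""
    return {"properties": properties or {}, "children": children or {}}


def read_property(node: Node, subpath: str, prop: str) -> Optional[Any]:
    """Walk `subpath` (slash-separated child names) from `node`, then read `prop`.

    An empty subpath means the property sits on the matched node itself.
    Returns None if any step of the path or the property is missing.
    """
    current = node
    for name in filter(None, subpath.split("/")):
        current = current["children"].get(name)
        if current is None:
            return None
    return current["properties"].get(prop)


# Hypothetical stand-in for the "jcr:content" node matched by the query.
matched = make_node(
    {"jcr:title": "Blog Post 1"},
    {
        "root": make_node(children={
            "responsivegrid": make_node(children={
                "text": make_node({"text": "<p>Hello</p>"}),
            }),
        }),
    },
)

title = read_property(matched, "", "jcr:title")
body = read_property(matched, "root/responsivegrid/text", "text")
```

Returning `None` for a missing step is a deliberate choice in this sketch, since the structure of a node is not guaranteed and some posts may lack a given sub-node.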

With this information, you would start to build a field mapping from AEM to Drupal like so:

Bundle: Blog
Query: SELECT * FROM [nt:unstructured] AS node WHERE ISDESCENDANTNODE(node, "/content/we-retail/us/en/community/blog/") AND [sling:resourceType] = "weretail/components/structure/page"

AEM | Drupal
Subpath: (empty); Property: jcr:title | title
Subpath: root/responsivegrid/text; Property: text | body

In our migration configuration, the source field definitions would look like this:

fields:
    -
      name: title
      subpath: ''
      property: 'jcr:title'
    -
      name: body
      subpath: root/responsivegrid/text
      property: text

This is a simple example showing a field-to-field mapping from nodes to a single entity in Drupal. As you build out the mapping, you’ll start to find cases where you need a field that is a relationship to another entity or is a sub-entity, such as a paragraph (see “Using Drupal Paragraphs with a Component-Based Approach”). In those cases, you may start to use composable migrations (i.e., migration lookups), which are more complex and merit their own write-up. But the principles are the same, and you now have the fundamentals to write the queries and specify the sub-paths and properties in order to build out those mappings.

Final Thoughts

You should now have a working concept of what the Migrate Source JCR module is, how to use it, and why it’s useful. Migrating from AEM, or any JCR storage, to Drupal is complex, so it may be necessary to collaborate. Check out our module documentation or our guide, How to Plan a Content Management System Migration, as a reference.
 
Additionally, you can connect with us in the Drupal community. Since this project is open source, you can submit questions, improvements, and bug fixes in the Drupal issue queue.