Flexible and Scalable Content Management using Apache Chemistry OpenCMIS and Couchbase
This blog was originally posted by Cecile Le Pape.
Is it possible to build a content management system that is both flexible and scalable? Flexible means that I can choose my file storage and my metadata storage independently. Each of these parts should be scalable, and by that I mean consistently and horizontally scalable. The idea is to build a CMS where I can add nodes when the number of documents or the number of customers grows, so that the response time to retrieve a document stays constant. And of course, the solution should follow standard CMS specifications for compatibility with existing clients.
Architecture
Yes, it is. Let’s have a look at the architecture below:
At the bottom you can see two different scalable components:
- Couchbase cluster: document and folder metadata are stored as simple JSON documents with unique identifiers. The JSON also contains the tree structure (parent id and children). Couchbase Server is well suited for this since it provides a built-in cache for high and consistent performance on key/value queries and is able to store documents in JSON format. Couchbase stores documents of up to 20 MB and can query them using views or N1QL (a SQL-like language for querying JSON documents).
- Distributed file system or distributed blob store: files are stored in a single container and retrieved by their identifier. No hierarchical folder storage is needed. AWS S3 and OpenStack Swift are examples of this kind of storage. A local file system implementation is available for testing purposes. The blob store can store large files as binary content.
At the application layer, the server implements the CMIS specification using the Apache Chemistry OpenCMIS framework. The server is a web application with a custom repository containing a metadata service, which interacts with the Couchbase cluster for metadata storage, and a storage service, which interacts with a distributed blob store for file content storage.
On the client side, you can use AtomPub, SOAP or JSON to interact with the application. Apache Chemistry OpenCMIS provides several clients for testing (browser, workbench).
To make your application layer scalable, you can simply set up multiple application servers and add a load balancer on top of them, because each server is RESTful. Each request is sent to the load balancer, which chooses the application server that will respond to it, as shown here:
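Since no application server holds state that another one lacks, any of them can answer any request. The choice made by the load balancer can be sketched as a simple round-robin selection. This is only an illustration: the class RoundRobinBalancer and the server names are hypothetical, and a real deployment would normally rely on an existing load balancer rather than custom code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration: hands out application servers in turn.
class RoundRobinBalancer {
    private final List<String> servers;
    private final AtomicInteger next = new AtomicInteger(0);

    RoundRobinBalancer(List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("at least one server is required");
        }
        // Defensive copy so later changes to the caller's list have no effect.
        this.servers = new ArrayList<>(servers);
    }

    // Returns the server that should handle the next request.
    String choose() {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), servers.size());
        return servers.get(index);
    }
}
```

Adding capacity then amounts to registering one more server in the list, which is what makes the application layer horizontally scalable.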
Data modelling
The CMIS specification’s model includes document, folder, item and relationship objects. Item and relationship are optional. For now, let’s consider documents and folders. Let’s assume that each document belongs to a single folder and that a folder is composed of subfolders and documents.
Each object has a unique identifier (for instance a generated UUID). The root folder is a special document with a special identifier (for instance ‘@root@’). Each folder knows its path, its parentId and its children (folders and documents), together with its name, its last modification date, etc.
Each file is a JSON document that looks similar to a folder object, except that it has no children and it contains information about the content stream (length, name, MIME type).
For instance, suppose the root folder contains a subfolder folderA, and folderA contains a document doc1.pdf. Here is a sample of what the JSON documents can look like:
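The three documents of this example can be sketched in Java with plain maps standing in for Couchbase JSON objects. The property names starting with cmis: are the standard CMIS property ids. Everything else is an assumption made for illustration: the "children" key, the ids id-folderA and id-doc1 (which would be generated UUIDs), the timestamp, the file length and the class SampleDocuments itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for Couchbase JSON documents.
class SampleDocuments {

    // Properties shared by folders and documents.
    static Map<String, Object> base(String name, String typeId, String parentId, String path) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("cmis:name", name);
        doc.put("cmis:objectTypeId", typeId);
        if (parentId != null) {
            doc.put("cmis:parentId", parentId); // the root folder has no parent
        }
        doc.put("cmis:path", path);
        doc.put("cmis:lastModificationDate", 1450000000000L); // illustrative epoch milliseconds
        return doc;
    }

    static Map<String, Object> folder(String name, String parentId, String path, String... childIds) {
        Map<String, Object> doc = base(name, "cmis:folder", parentId, path);
        List<String> children = new ArrayList<>(Arrays.asList(childIds));
        doc.put("children", children); // assumed key name for the list of child ids
        return doc;
    }

    static Map<String, Object> document(String name, String parentId, String path,
                                        String mimeType, long length) {
        Map<String, Object> doc = base(name, "cmis:document", parentId, path);
        doc.put("cmis:contentStreamFileName", name);
        doc.put("cmis:contentStreamMimeType", mimeType);
        doc.put("cmis:contentStreamLength", length);
        return doc;
    }

    // Minimal JSON rendering: handles strings, numbers and lists only.
    static String toJson(Map<String, Object> doc) {
        StringBuilder sb = new StringBuilder("{");
        String sep = "";
        for (Map.Entry<String, Object> e : doc.entrySet()) {
            sb.append(sep).append(quote(e.getKey())).append(":").append(value(e.getValue()));
            sep = ",";
        }
        return sb.append("}").toString();
    }

    static String value(Object v) {
        if (v instanceof Number) {
            return v.toString();
        }
        if (v instanceof List) {
            StringBuilder sb = new StringBuilder("[");
            String sep = "";
            for (Object item : (List<?>) v) {
                sb.append(sep).append(value(item));
                sep = ",";
            }
            return sb.append("]").toString();
        }
        return quote(String.valueOf(v));
    }

    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    public static void main(String[] args) {
        // Each line shows "document key -> JSON content".
        System.out.println("@root@ -> " + toJson(folder("root", null, "/", "id-folderA")));
        System.out.println("id-folderA -> "
                + toJson(folder("folderA", "@root@", "/folderA", "id-doc1")));
        System.out.println("id-doc1 -> "
                + toJson(document("doc1.pdf", "id-folderA", "/folderA/doc1.pdf", "application/pdf", 52428L)));
    }
}
```

The identifier is used as the Couchbase document key, so it does not need to be repeated inside the JSON content. Navigating down the tree means reading the children list, and navigating up means following cmis:parentId.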
The repository is used by the CMIS framework to serve client queries, as part of its RESTful architecture. There are several methods associated with each client query.
The getObject method retrieves a folder or a document and fills the objectInfos:
    public ObjectData getObject(CallContext context, String objectId,
            String versionServicesId, String filter,
            Boolean includeAllowableActions, Boolean includeAcl,
            ObjectInfoHandler objectInfos) {
        boolean userReadOnly = checkUser(context, false);
        // get the file or folder metadata from Couchbase
        CmisObject data = this.cbService.getCmisObject(objectId);
        // set defaults if values not set
        boolean iaa = CouchbaseUtils.getBooleanParameter(includeAllowableActions, false);
        boolean iacl = CouchbaseUtils.getBooleanParameter(includeAcl, false);
        // gather properties; filterCollection is not defined in this excerpt,
        // presumably it holds the property ids parsed from the filter argument
        return compileObjectData(context, data, filterCollection, iaa, iacl, userReadOnly, objectInfos);
    }
This method:
- gets the object identifier objectId from the request
- asks the CouchbaseService cbService for the corresponding JsonDocument, mapped into a CmisObject
- fills the ObjectInfoHandler response and sends it back to the client
Couchbase service
The metadata service is very simple: basically, it performs CRUD operations on Couchbase. It also maps CMIS objects to JsonDocument and vice versa.
To connect to Couchbase, create an instance of CouchbaseService:
    public class CouchbaseService {
        private Cluster cluster = null;
        private Bucket bucket = null;

        public CouchbaseService() {
            cluster = CouchbaseCluster.create();
            bucket = cluster.openBucket(BUCKET);
            // creation of root node if not exist yet
            createRootFolderIfNotExists();
        }
    }
To disconnect from Couchbase, call the close method:
    public void close() {
        if (cluster != null) cluster.disconnect();
    }
The CouchbaseService implements CRUD operations and the mapping between Couchbase JSON documents and CMIS objects. For instance, let’s take a look at the getCmisObject method, which retrieves the metadata of a folder or document from its identifier.
    public CmisObject getCmisObject(String objectId) {
        CmisObject data = new CmisObject(objectId);
        JsonDocument jsondoc = this.bucket.get(objectId);
        // no document stored under this id
        if (jsondoc == null) return null;
        JsonObject doc = jsondoc.content();
        java.util.Set<String> names = doc.getNames();
        for (String propId : names) {
            if (PropertyIds.NAME.equals(propId)) {
                data.setName(doc.getString(propId));
                data.setFileName(doc.getString(propId));
            } else if (PropertyIds.OBJECT_TYPE_ID.equals(propId)) {
                data.setType(doc.getString(propId));
            } else if (PropertyIds.CREATED_BY.equals(propId)) {
                data.setCreatedBy(doc.getString(propId));
            } else if (PropertyIds.LAST_MODIFIED_BY.equals(propId)) {
                data.setLastModifiedBy(doc.getString(propId));
            } else if (PropertyIds.CONTENT_STREAM_MIME_TYPE.equals(propId)) {
                data.setContentType(doc.getString(propId));
            } else if (PropertyIds.CREATION_DATE.equals(propId)) {
                // dates are stored as milliseconds since the epoch
                Long time = doc.getLong(propId);
                GregorianCalendar cal = new GregorianCalendar();
                cal.setTimeInMillis(time);
                data.setCreationDate(cal);
            } else if (PropertyIds.LAST_MODIFICATION_DATE.equals(propId)) {
                Long time = doc.getLong(propId);
                GregorianCalendar cal = new GregorianCalendar();
                cal.setTimeInMillis(time);
                data.setLastModificationDate(cal);
            } else if (PropertyIds.PARENT_ID.equals(propId)) {
                data.setParentId(doc.getString(propId));
            } else if (PropertyIds.PATH.equals(propId)) {
                data.setPath(doc.getString(propId));
            } else if (CHILDREN.equals(propId)) {
                JsonArray jsa = doc.getArray(CHILDREN);
                int count = jsa.size();
                for (int i = 0; i < count; i++) {
                    data.addChildren((String) jsa.get(i));
                }
            }
        }
        // return only once every property has been mapped
        return data;
    }
This method:
- retrieves the JSON document from Couchbase using bucket.get(objectId)
- for each property in the JSON document, checks whether it is a CMIS property based on the PropertyIds CMIS constants, converts the value if needed (for instance, dates are stored as long values and converted to GregorianCalendar) and fills the new CMIS object with the property value.
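The date conversion mentioned in the last step can be sketched in isolation. The class DateMapping and its method names are hypothetical. Only the long-to-GregorianCalendar direction appears in getCmisObject; the reverse direction is an assumption about how the value would be produced when a document is written.

```java
import java.util.GregorianCalendar;

// Hypothetical helper isolating the date mapping between JSON and CMIS values.
class DateMapping {

    // JSON -> CMIS: epoch milliseconds become a calendar, as in getCmisObject.
    static GregorianCalendar toCalendar(long epochMillis) {
        GregorianCalendar cal = new GregorianCalendar();
        cal.setTimeInMillis(epochMillis);
        return cal;
    }

    // CMIS -> JSON: assumed inverse, used when storing the metadata.
    static long toEpochMillis(GregorianCalendar cal) {
        return cal.getTimeInMillis();
    }
}
```

Storing the date as a number keeps the JSON document compact and makes the value easy to compare or sort in a query.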
Storage service
There are currently two storage implementations: local (using the local file system) and remote (using AWS S3 storage). Each class implements the StorageService interface:
    public interface StorageService {
        public String getStorageId();

        public void writeContent(String dataId, ContentStream contentStream)
                throws StorageException;

        public boolean deleteContent(String dataId);

        public ContentStream getContent(String dataId, BigInteger offset,
                BigInteger length, String filename) throws StorageException;

        public boolean exists(String dataId);
    }
You can see that the storage is unaware of the folder structure. It stores binary content identified by a unique id.
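To illustrate how little the storage needs to know, here is a minimal sketch of a flat local store. It is not the class from the repository: FlatFileStore is a hypothetical name, and it exchanges byte arrays instead of the OpenCMIS ContentStream so that the example only depends on the JDK.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: every blob is a file named after its id, in one flat directory.
class FlatFileStore {
    private final Path container;

    FlatFileStore(Path container) {
        try {
            this.container = Files.createDirectories(container);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Maps an id to a file directly inside the container, rejecting path tricks.
    private Path resolve(String dataId) {
        if (dataId.contains("/") || dataId.contains("\\") || dataId.contains("..")) {
            throw new IllegalArgumentException("invalid id: " + dataId);
        }
        return container.resolve(dataId);
    }

    void writeContent(String dataId, byte[] content) {
        try {
            Files.write(resolve(dataId), content);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    byte[] getContent(String dataId) {
        try {
            return Files.readAllBytes(resolve(dataId));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    boolean exists(String dataId) {
        return Files.exists(resolve(dataId));
    }

    // Returns true if a blob was actually removed.
    boolean deleteContent(String dataId) {
        try {
            return Files.deleteIfExists(resolve(dataId));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Moving a document to another folder only changes its metadata in Couchbase. The blob stays where it is, because its key never depended on the folder path.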
Where can I find the code?
The code implementing this CMIS server on top of Couchbase is available on GitHub:
https://github.com/cecilelepape/cmis-couchbase/tree/master/chemistry-opencmis-server-couchbase
Special thanks to David Maier and David Ostrovsky, who helped me with the architecture and the S3 storage.