The ResourceSync PHP Library

Over the past year, thanks to funding from the Jisc, I’ve been involved with the NISO / OAI ResourceSync initiative.  The aim of ResourceSync is to provide mechanisms for large-scale synchronisations of web resources.  There are lots of use cases for this, and many reasons why it is an interesting problem.  For some background reading, I’d suggest:

The specification itself can be read at http://www.openarchives.org/rs.  Even a quick read shows that it is based on sitemaps (http://www.sitemaps.org/).  That is no surprise, given that sitemaps were developed as an easy and efficient way of listing web resources for search engine crawlers to harvest, which is itself a specialised form of resource synchronisation.

As with anything new, the proof is always in the pudding.  In this context that means reference implementations are required, both to test that the standard can be implemented and fulfils the use cases it was designed for, and to smooth off any rough edges that only appear once you use it in anger.

My role therefore has been to develop a PHP ResourceSync client library.  The role of a client library is to allow other software systems to easily interact with a technology – in this case, web servers that support ResourceSync.  The client library therefore provides the facility to connect to a web server and synchronise the contents, and then to stay up to date by loading lists of resources that have been created, updated, or deleted.

The PHP library can be downloaded from: https://github.com/stuartlewis/resync-php

The rest of this blog post steps through the different parts of ResourceSync, and shows how they can be accessed through the PHP client library:

The first step is to discover whether a site supports ResourceSync.  This is done using the well-known URI specification (see RFC 5785).  Put simply, if a server supports ResourceSync, it places a file at http://www.example.com/.well-known/resourcesync which points to where the capability list exists.
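
To make the discovery step a little more concrete, here is a minimal sketch that parses a hand-written example of the kind of document that might sit at the well-known URI, and pulls out the capability list URIs.  The XML layout is my reading of the specification (a sitemap whose entries are tagged with rs:md elements), and the sample document, its URLs, and the function name findCapabilityLists are all assumptions made for illustration.  None of this is part of the PHP library.

```php
<?php
// Namespaces used by sitemaps and by the ResourceSync extensions.
const SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';
const RS_NS = 'http://www.openarchives.org/rs/terms/';

// Hypothetical sample of a document served at /.well-known/resourcesync.
// The structure is an assumption based on the ResourceSync specification.
$wellKnown = <<<'XML'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="description"/>
  <url>
    <loc>http://example.com/capabilitylist.xml</loc>
    <rs:md capability="capabilitylist"/>
  </url>
</urlset>
XML;

// Illustrative helper (not part of resync-php): returns the <loc> of every
// entry whose rs:md element marks it as a capability list.
function findCapabilityLists(string $xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return [];
    }
    $found = [];
    foreach ($doc->getElementsByTagNameNS(SITEMAP_NS, 'url') as $url) {
        $loc = $url->getElementsByTagNameNS(SITEMAP_NS, 'loc')->item(0);
        $md = $url->getElementsByTagNameNS(RS_NS, 'md')->item(0);
        if ($loc !== null && $md !== null
            && $md->getAttribute('capability') === 'capabilitylist') {
            $found[] = trim($loc->textContent);
        }
    }
    return $found;
}

foreach (findCapabilityLists($wellKnown) as $uri) {
    echo ' - ' . $uri . "\n";
}
```

An empty array here corresponds to the "site doesn't support ResourceSync" case.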

The first function of the PHP ResourceSync library is therefore to support this discovery:

include('ResyncDiscover.php');
$resyncdiscover = new ResyncDiscover('http://example.com/');
$capabilitylists = $resyncdiscover->getCapabilities();
echo ' - There were ' . count($capabilitylists) .
     ' capability lists found:' . "\n";
foreach ($capabilitylists as $capabilitylist) {
    echo ' - ' . $capabilitylist . "\n";
}

Zero, one, or more capability list URIs are returned.  If none are returned, then the site doesn’t support ResourceSync.  If one or more are returned, the next step is to examine each capability list to see which parts of the ResourceSync protocol are supported:

include('ResyncCapabilities.php');
$resynccapabilities = new ResyncCapabilities('http://example.com/capabilitylist.xml');
$capabilities = $resynccapabilities->getCapabilities();
echo 'Capabilities' . "\n";
foreach($capabilities as $capability => $type) {
    echo ' - ' . $capability . ' (capability type: ' . $type . ')' . "\n";
}

This returns the specific ResourceSync capabilities supported by that server.  Typically a resourcelist and a changelist will be shown.
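
For readers who want to see what sits behind that call, the sketch below reads a made-up capability list and builds a URI-to-type map of the same general shape as the loop above.  The sample XML, the URLs, and the function name readCapabilities are assumptions of mine, and the library may well do this differently internally.

```php
<?php
// Hypothetical capability list; the layout is an assumption based on the
// ResourceSync specification, and the URLs are invented.
$capabilityList = <<<'XML'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="capabilitylist"/>
  <url>
    <loc>http://example.com/resourcelist.xml</loc>
    <rs:md capability="resourcelist"/>
  </url>
  <url>
    <loc>http://example.com/changelist.xml</loc>
    <rs:md capability="changelist"/>
  </url>
</urlset>
XML;

// Illustrative helper (not part of resync-php): maps each listed URI to the
// capability type declared on its rs:md element.
function readCapabilities(string $xml): array
{
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return [];
    }
    $xpath = new DOMXPath($doc);
    $xpath->registerNamespace('sm', 'http://www.sitemaps.org/schemas/sitemap/0.9');
    $xpath->registerNamespace('rs', 'http://www.openarchives.org/rs/terms/');

    $capabilities = [];
    foreach ($xpath->query('/sm:urlset/sm:url') as $url) {
        // string(...) yields '' when the node or attribute is missing.
        $uri = trim($xpath->evaluate('string(sm:loc)', $url));
        $type = $xpath->evaluate('string(rs:md/@capability)', $url);
        if ($uri !== '' && $type !== '') {
            $capabilities[$uri] = $type;
        }
    }
    return $capabilities;
}

foreach (readCapabilities($capabilityList) as $uri => $type) {
    echo ' - ' . $uri . ' (capability type: ' . $type . ')' . "\n";
}
```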

The next step is often to perform a baseline sync (complete download of all resources).  Again, the PHP library supports this:

include 'ResyncResourcelist.php';
$resourcelist = new ResyncResourcelist('http://example.com/resourcelist.xml');
$resourcelist->enableDebug(); // Show progress
$resourcelist->baseline('/resync');
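
Conceptually, a baseline sync walks the resource list and mirrors each URL into the local directory.  The sketch below shows one way the URL-to-path mapping could work.  It is only an assumption about a sensible layout: the function name localPathFor, the fallback file name, and the sample URLs are mine, and the library may store files differently.

```php
<?php
// Illustrative helper (not part of resync-php): derive a local file path
// under $baseDir from the path component of a resource URL.
function localPathFor(string $baseDir, string $resourceUrl): string
{
    $path = parse_url($resourceUrl, PHP_URL_PATH);
    if (!is_string($path) || $path === '' || $path === '/') {
        // Hypothetical fallback name for a URL with no path.
        $path = '/index';
    }
    // Refuse paths that try to climb out of the base directory.
    if (in_array('..', explode('/', $path), true)) {
        throw new InvalidArgumentException('Unsafe path in URL: ' . $resourceUrl);
    }
    return rtrim($baseDir, '/') . $path;
}

// Invented URLs standing in for the entries of a resource list.
$resources = [
    'http://example.com/docs/report.pdf',
    'http://example.com/images/logo.png',
];

foreach ($resources as $url) {
    echo $url . ' -> ' . localPathFor('/resync', $url) . "\n";
}
```

Rejecting ".." segments matters because the URLs come from a remote server, and a mirror should never write outside the directory it was given.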

It is possible to ask the library how many files it has downloaded, and how large they were:

echo $resourcelist->getDownloadedFileCount() . ' files downloaded, and ' .
     $resourcelist->getSkippedFileCount() . ' files skipped' . "\n";
echo $resourcelist->getDownloadSize() . 'Kb downloaded in ' .
     $resourcelist->getDownloadDuration() . ' seconds (' .
     ($resourcelist->getDownloadSize() /
      $resourcelist->getDownloadDuration()) . ' Kb/s)' . "\n";

It is also possible to restrict the files to be downloaded to those from a certain date.  This can be useful if you only want to synchronise recently created files:

$from = new DateTime("2013-05-18 00:00:00.000000");
$resourcelist->baseline('/resync', $from);
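
The idea behind the date restriction can be sketched as a simple filter on each entry's last-modified time.  The sample entries and the function name modifiedSince are assumptions for illustration, and I am assuming here that the comparison is made against the lastmod value and includes the cut-off itself.

```php
<?php
// Invented resource list entries: a URL plus its last-modified timestamp.
$resources = [
    ['loc' => 'http://example.com/a.txt', 'lastmod' => '2013-05-17T09:00:00Z'],
    ['loc' => 'http://example.com/b.txt', 'lastmod' => '2013-05-18T12:30:00Z'],
    ['loc' => 'http://example.com/c.txt', 'lastmod' => '2013-05-20T08:15:00Z'],
];

// Illustrative helper (not part of resync-php): keep only the entries
// modified at or after the given moment.
function modifiedSince(array $resources, DateTimeInterface $from): array
{
    return array_values(array_filter(
        $resources,
        function (array $resource) use ($from): bool {
            return new DateTimeImmutable($resource['lastmod']) >= $from;
        }
    ));
}

$from = new DateTimeImmutable('2013-05-18 00:00:00', new DateTimeZone('UTC'));
foreach (modifiedSince($resources, $from) as $resource) {
    echo ' - ' . $resource['loc'] . "\n";
}
```

One thing to watch: a DateTime created without an explicit timezone uses PHP's default timezone, so it is worth being explicit if the server publishes its timestamps in UTC.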

Once a baseline sync has taken place, all of the files exposed via the ResourceSync interface will exist on the local computer.  The next step is to routinely keep this set of resources up to date.  To do this, the server's change lists should be processed about as often as the server produces them, in order to download new or updated files and to delete old files:

include 'ResyncChangelist.php';
$changelist = new ResyncChangelist('http://example.com/changelist.xml');
$changelist->enableDebug(); // Show progress
$changelist->process('/resync');
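
To show roughly what processing a change list involves, here is a sketch that reads a made-up change list and sorts the entries into files to download and files to delete.  The XML layout follows my reading of the specification, while the sample document and the function name planChanges are assumptions and not the library's own code.

```php
<?php
const SITEMAP_NS = 'http://www.sitemaps.org/schemas/sitemap/0.9';
const RS_NS = 'http://www.openarchives.org/rs/terms/';

// Hypothetical change list with one entry per kind of change.
$changeList = <<<'XML'
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist"/>
  <url>
    <loc>http://example.com/new.txt</loc>
    <rs:md change="created"/>
  </url>
  <url>
    <loc>http://example.com/edited.txt</loc>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.com/old.txt</loc>
    <rs:md change="deleted"/>
  </url>
</urlset>
XML;

// Illustrative helper (not part of resync-php): group the listed URIs by
// the action a client would need to take locally.
function planChanges(string $xml): array
{
    $plan = ['download' => [], 'delete' => []];
    $doc = new DOMDocument();
    if (!$doc->loadXML($xml)) {
        return $plan;
    }
    foreach ($doc->getElementsByTagNameNS(SITEMAP_NS, 'url') as $url) {
        $loc = $url->getElementsByTagNameNS(SITEMAP_NS, 'loc')->item(0);
        $md = $url->getElementsByTagNameNS(RS_NS, 'md')->item(0);
        if ($loc === null || $md === null) {
            continue; // Skip entries without a location or change type.
        }
        $uri = trim($loc->textContent);
        switch ($md->getAttribute('change')) {
            case 'created':
            case 'updated':
                $plan['download'][] = $uri;
                break;
            case 'deleted':
                $plan['delete'][] = $uri;
                break;
        }
    }
    return $plan;
}

print_r(planChanges($changeList));
```

As far as I understand the specification, the same resource can appear more than once in a change list, so a fuller version would apply the entries in order and not just group them.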

Again, there are options to see what files have been processed:

echo ' - ' . $changelist->getCreatedCount() . ' files created' . "\n";
echo ' - ' . $changelist->getUpdatedCount() . ' files updated' . "\n";
echo ' - ' . $changelist->getDeletedCount() . ' files deleted' . "\n";
echo $changelist->getDownloadedFileCount() . ' files downloaded, and ' .
     $changelist->getSkippedFileCount() . ' files skipped' . "\n";
echo $changelist->getDownloadSize() . 'Kb downloaded in ' .
     $changelist->getDownloadDuration() . ' seconds (' .
     ($changelist->getDownloadSize() /
      $changelist->getDownloadDuration()) . ' Kb/s)' . "\n";

As before, it is possible to process only the changes since a particular date.  If you keep a note of when the sync was last attempted, this means that only changes made since then are processed:

$from = new DateTime("2013-05-18 00:00:00.000000");
$changelist->process('/resync', $from);
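
One simple way of keeping that note is to write the time of each run to a small state file and read it back at the start of the next one.  The sketch below is only a suggestion: the file location and the function names loadLastSync and saveLastSync are hypothetical, and the call into the library is left as a comment.

```php
<?php
// Illustrative helpers (not part of resync-php) for remembering when the
// last sync was started.
function saveLastSync(string $stateFile, DateTimeInterface $when): void
{
    file_put_contents($stateFile, $when->format(DATE_ATOM));
}

function loadLastSync(string $stateFile): ?DateTime
{
    if (!is_file($stateFile)) {
        return null; // No previous run recorded.
    }
    $raw = trim((string) file_get_contents($stateFile));
    $parsed = DateTime::createFromFormat(DATE_ATOM, $raw);
    return $parsed === false ? null : $parsed;
}

// Hypothetical location for the state file.
$stateFile = sys_get_temp_dir() . '/resync-last-sync.txt';

// Note the start time before doing any work, so that changes published
// while the sync is running are picked up next time.
$startedAt = new DateTime('now', new DateTimeZone('UTC'));
$from = loadLastSync($stateFile);

if ($from === null) {
    echo "No previous sync recorded; process the whole change list\n";
    // $changelist->process('/resync');
} else {
    echo 'Processing changes since ' . $from->format(DATE_ATOM) . "\n";
    // $changelist->process('/resync', $from);
}

saveLastSync($stateFile, $startedAt);
```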

In a few steps, each consisting of a few lines of code, the PHP library allows a local copy to be kept in sync with the contents of a ResourceSync-enabled server.

A further two blog posts will be published in this series.  The next will show how to interact with the library so that more complex actions can be performed when resources are created, updated, or deleted.  The final blog post will show this in action, with an application of the PHP ResourceSync library making use of the resources it processes.
