ResourceSync and SWORD

This is the third post in a short series of blog posts about ResourceSync.  Thanks to Jisc funding, a small number of us from the UK have been involved in the NISO / OAI ResourceSync Initiative.  This has involved attending several meetings of the Technical Committee to help design the standard, documenting some of the different ResourceSync use cases, and working on some trial implementations.  As mentioned in the previous blog posts, I’ve been creating a PHP API library that makes it easy to interact with ResourceSync-enabled services.

In order to really test the library, it is good to think of a real end-to-end use case and implement it.  The use case I chose to do this was to mirror one repository to another, and to then keep it up to date.  This first involves a baseline sync to gather all the content, followed by an incremental sync of changes made each day.

ResourceSync provides the mechanism for gathering the resources from the remote repository.  However, a second mechanism is then required to take those resources and put them into the destination repository.  The obvious choice for this is SWORD v2.

ResourceSync is designed to list all files (or changed files) on a server.  These are then transferred using good old HTTP, but to get them into another repository requires a deposit protocol – in this case, SWORD.  In other words, ResourceSync is used to harvest the resources onto my computer, and SWORD is then used to deposit them into a destination repository.

The challenge here is linking resources together.  An ‘item’ in a repository is typically made up of a metadata resource, along with one or more associated file resources.  Because these are separate resources, they are listed independently in the ResourceSync resource lists.  However they contain attributes that link them together: ‘describes’ and ‘describedBy’.  The metadata ‘describes’ the file, and the file is ‘describedBy’ the metadata.  A good example of this is given in the CottageLabs description of how the OAI-PMH use case can be implemented using ResourceSync:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">

    <rs:ln rel="resourcesync" href="http://example.com/capabilitylist.xml"/>
    <rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>

    <url>
        <loc>http://example.com/metadata-resource</loc>
        <lastmod>2013-01-02T17:00:00Z</lastmod>
        <rs:ln rel="describes" href="http://example.com/bitstream1"/>
        <rs:ln rel="describedBy" href="http://purl.org/dc/terms/"/>
        <rs:ln rel="collection" href="http://example.com/collection1"/>
        <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="application/xml"/>
    </url>

    <url>
        <loc>http://example.com/bitstream1</loc>
        <lastmod>2013-01-02T17:00:00Z</lastmod>
        <rs:ln rel="describedBy" href="http://example.com/metadata-resource"/>
        <rs:ln rel="describedBy" href="http://example.com/other-metadata"/>
        <rs:ln rel="collection" href="http://example.com/collection1"/>
        <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"
             length="14599"
             type="application/pdf"/>
    </url>
</urlset>
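To make the linking concrete, here is a minimal sketch that reads a cut-down copy of the resource list above and groups each file under the metadata record that describes it.  It uses plain SimpleXML rather than the ResourceSync library, and the function name groupFilesByMetadata is made up for this illustration.  It also assumes that a 'describedBy' link pointing at the Dublin Core terms namespace names a metadata format rather than an owning record, and it compares the rel value case-insensitively.

```php
<?php
// Cut-down copy of the resource list shown above (illustration only).
$xml = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
    <url>
        <loc>http://example.com/metadata-resource</loc>
        <rs:ln rel="describes" href="http://example.com/bitstream1"/>
        <rs:ln rel="describedBy" href="http://purl.org/dc/terms/"/>
    </url>
    <url>
        <loc>http://example.com/bitstream1</loc>
        <rs:ln rel="describedBy" href="http://example.com/metadata-resource"/>
    </url>
</urlset>
XML;

const RS_NS = 'http://www.openarchives.org/rs/terms/';
const DC_TERMS = 'http://purl.org/dc/terms/';

// Returns an array keyed by metadata URL, each entry listing the file URLs
// that declare themselves 'describedBy' that metadata record.
function groupFilesByMetadata(string $xml): array
{
    $urlset = simplexml_load_string($xml);
    $groups = [];
    foreach ($urlset->url as $url) {
        $loc = (string) $url->loc;
        $owner = null;
        foreach ($url->children(RS_NS)->ln as $ln) {
            // The attributes are un-namespaced, so read them via attributes()
            $rel = strtolower((string) $ln->attributes()->rel);
            $href = (string) $ln->attributes()->href;
            if ($rel === 'describedby' && $href !== DC_TERMS) {
                $owner = $href;
            }
        }
        if ($owner === null) {
            // No owning record: treat this resource as a metadata record
            $groups[$loc] = $groups[$loc] ?? [];
        } else {
            $groups[$owner][] = $loc;
        }
    }
    return $groups;
}

print_r(groupFilesByMetadata($xml));
```

The grouping works whichever order the resources are listed in, because a file that appears before its metadata record simply creates the group early.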

So here’s the recipe (and here’s the code) for syncing a resource list such as this, and then depositing it into a remote repository using SWORD.  Both use PHP libraries, which makes the code quite short.

The recipe

include_once('../../ResyncResourcelist.php');

// Lists filled by the callback: metadata records, and files keyed by their URL
$metadataitems = array();
$objectitems = array();

$resourcelist = new ResyncResourcelist('http://93.93.131.168:8080/rs/resourcelist.xml');
$resourcelist->registerCallback(function($file, $resyncurl) {
  // Work out if this is a metadata object or a file
  global $metadataitems, $objectitems;
  $type = 'metadata';
  $namespaces = $resyncurl->getXML()->getNameSpaces(true);
  if (!isset($namespaces['sm'])) $namespaces['sm'] = 'http://www.sitemaps.org/schemas/sitemap/0.9';
  $lns = $resyncurl->getXML()->children($namespaces['rs'])->ln;
  $key = '';
  $owner = '';
  foreach ($lns as $ln) {
    // A 'describedby' link to another resource (rather than to the Dublin Core
    // terms namespace) marks this resource as a file owned by that metadata record
    if (($ln->attributes()->rel == 'describedby') && ($ln->attributes()->href != 'http://purl.org/dc/terms/')) {
      $type = 'object';
      $key = $resyncurl->getLoc();
      $owner = $ln->attributes()->href;
    }
  }

  echo ' - New file saved: ' . $file . "\n";
  echo '  - Type: ' . $type . "\n";

  if ($type == 'metadata') {
    $metadataitems[] = $resyncurl;
  } else {
    $objectitems[(string)$key] = $resyncurl;
    $resyncurl->setOwner($owner);
  }
});

This piece of code performs a baseline sync, using the callback registration option mentioned in the last blog post.  The callback does just one thing: it sorts the metadata objects into one list, and the file objects into another.  These are then processed later.
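As an aside, the same sorting can be written without global variables by letting the closure capture the two lists by reference.  The sketch below shows only that pattern: FakeHarvester is a hypothetical stand-in invented for this example, not part of the ResourceSync library, and its callback receives plain strings rather than the library's URL objects.

```php
<?php
// Hypothetical stand-in for a harvester: it calls the registered callback
// once for every resource it is given.
final class FakeHarvester
{
    private $callback;

    public function registerCallback(callable $callback): void
    {
        $this->callback = $callback;
    }

    // $resources maps each resource URL to the URL of the metadata record
    // that describes it, or to null if the resource is itself metadata.
    public function run(array $resources): void
    {
        foreach ($resources as $loc => $describedBy) {
            ($this->callback)($loc, $describedBy);
        }
    }
}

$metadataItems = [];
$objectItems = [];

$harvester = new FakeHarvester();
$harvester->registerCallback(
    // Capture both lists by reference so the callback can append to them
    function (string $loc, ?string $describedBy) use (&$metadataItems, &$objectItems) {
        if ($describedBy === null) {
            $metadataItems[] = $loc;
        } else {
            $objectItems[$loc] = $describedBy;
        }
    }
);

$harvester->run([
    'http://example.com/metadata-resource' => null,
    'http://example.com/bitstream1' => 'http://example.com/metadata-resource',
]);

print_r($metadataItems);
print_r($objectItems);
```

Capturing by reference keeps the lists local to the calling code, which helps if the harvest is ever wrapped in a function or class.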

Next, each metadata item is processed in order to deposit that metadata object into the destination repository using SWORD v2:

$counter = 0;
foreach ($metadataitems as $item) {
  echo " - Item " . ++$counter . ' of ' . count($metadataitems) . "\n";
  echo "  - Metadata file: " . $item->getFileOnDisk() . "\n";

  // Load the harvested metadata file and pull out the Dublin Core fields
  $xml = simplexml_load_file($item->getFileOnDisk());
  $namespaces = $xml->getNameSpaces(true);
  if (!isset($namespaces['dc'])) $namespaces['dc'] = 'http://purl.org/dc/elements/1.1/';
  if (!isset($namespaces['dcterms'])) $namespaces['dcterms'] = 'http://purl.org/dc/terms/';
  $dc = $xml->children($namespaces['dc']);
  $dcterms = $xml->children($namespaces['dcterms']);
  $title = $dc->title[0];
  $contributor = $dc->contributor[0];
  $id = $dc->identifier[0];
  $date = $dcterms->issued[0];
  echo '   - Location: ' . $item->getLoc() . "\n";
  echo '   - Author: ' . $contributor . "\n";
  echo '   - Title: ' . $title . "\n";
  echo '   - Identifier: ' . $id . "\n";
  echo '   - Date: ' . $date . "\n";

  // Create the atom entry
  $test_dirin = 'atom_multipart';
  $atom = new PackagerAtomTwoStep($resync_test_savedir, $sword_deposit_temp, '', '');
  $atom->setTitle($title);
  $atom->addMetadata('creator', $contributor);
  $atom->setIdentifier($id);
  $atom->setUpdated($date);
  $atom->create();

  // Deposit the metadata record, leaving the deposit 'in progress'
  $atomfilename = $resync_test_savedir . '/' . $sword_deposit_temp . '/atom';
  echo '  - About to deposit metadata: ' . $atomfilename . "\n";
  $deposit = $sword->depositAtomEntry($sac_deposit_location,
                                      $sac_deposit_username,
                                      $sac_deposit_password,
                                      '',
                                      $atomfilename,
                                      true);

  // (the loop body continues in the snippets below)

The approach used here is to first create an Atom entry that contains the metadata, and then deposit it.  The SWORD v2 ‘in-progress’ flag is set to TRUE, which indicates that further activity will take place on the record.
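For readers who want to see what goes over the wire, the following is a rough sketch of the kind of Atom entry and HTTP headers involved, built with PHP's DOM extension rather than the SWORD library's packager.  The function name buildAtomEntry and the sample values are invented for this example; the Content-Type and In-Progress headers are the ones the SWORD v2 profile defines for an Atom entry deposit.

```php
<?php
// Build a minimal Atom entry carrying a few Dublin Core terms.
function buildAtomEntry(string $title, string $creator, string $identifier, string $updated): string
{
    $atomNs = 'http://www.w3.org/2005/Atom';
    $dctermsNs = 'http://purl.org/dc/terms/';

    $doc = new DOMDocument('1.0', 'UTF-8');
    $doc->formatOutput = true;
    $entry = $doc->createElementNS($atomNs, 'entry');
    $doc->appendChild($entry);

    $fields = [
        [$atomNs, 'title', $title],
        [$atomNs, 'id', $identifier],
        [$atomNs, 'updated', $updated],
        [$dctermsNs, 'dcterms:creator', $creator],
    ];
    foreach ($fields as [$ns, $name, $value]) {
        $element = $doc->createElementNS($ns, $name);
        // createTextNode escapes characters such as & and < for us
        $element->appendChild($doc->createTextNode($value));
        $entry->appendChild($element);
    }
    return $doc->saveXML();
}

// Headers for the first step of a two-step deposit: metadata only, with the
// item left open so that files can be added afterwards.
$headers = [
    'Content-Type: application/atom+xml;type=entry',
    'In-Progress: true',
];

$entryXml = buildAtomEntry('Cats & dogs', 'Lewis, Stuart', 'urn:example:1', '2013-01-02T17:00:00Z');
echo implode("\n", $headers), "\n\n", $entryXml;
```

The entry would be POSTed to the collection's deposit URL; the In-Progress header is what tells the server not to archive the item yet.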

The code then needs to look through the list of file resources, and find any that are ‘describedBy’ the metadata record in question.  Any that are, are deposited to the same record using SWORD v2:

// Find related files for this metadata record
foreach ($objectitems as $object) {
  if ((string)$object->getOwner() == (string)$item->getLoc()) {
    // Detect the MIME type from the contents of the harvested file
    $finfo = finfo_open(FILEINFO_MIME_TYPE);
    $mime = finfo_file($finfo, $object->getFileOnDisk());
    echo '    - Related object: ' . $object->getLoc() . "\n";
    echo '     - File: ' . $object->getFileOnDisk() . ' (' . $mime . ")\n";

    // Deposit file ($edit_media is the edit-media URL returned by the
    // metadata deposit above; extracting it is not shown here)
    $deposit = $sword->addExtraFileToMediaResource($edit_media,
                                                   $sac_deposit_username,
                                                   $sac_deposit_password,
                                                   '',
                                                   $object->getFileOnDisk(),
                                                   $mime);
  }
}

Using the SWORD v2 API library is very easy: once you have the file and its MIME type, it is a single line of code to add that file to the record in the destination repository.
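The MIME type lookup is the only fiddly part, so here it is as a small standalone sketch.  The helper name detectMimeType is made up for this post, and falling back to application/octet-stream when detection fails is my own choice rather than anything SWORD requires.  It relies on PHP's fileinfo extension.

```php
<?php
// Return the MIME type of a file based on its contents, falling back to a
// generic binary type if the type cannot be determined.
function detectMimeType(string $path): string
{
    if (!is_file($path)) {
        throw new InvalidArgumentException("Not a file: $path");
    }
    $finfo = new finfo(FILEINFO_MIME_TYPE);
    $mime = $finfo->file($path);
    return $mime === false ? 'application/octet-stream' : $mime;
}

// Try it on a small temporary text file
$tmp = tempnam(sys_get_temp_dir(), 'rs');
file_put_contents($tmp, "Plain text body\n");
echo detectMimeType($tmp), "\n";
unlink($tmp);
```

Detecting the type from the file contents avoids trusting file extensions, which harvested files may not have.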

Once all the related files have been added, the final step is to set the ‘in-progress’ flag to FALSE to indicate that the object is complete, and that it can be formally archived into the repository.  This is as simple as:

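  // Complete the deposit ($edit_iri is the edit URL returned by the
  // metadata deposit above; extracting it is not shown here)
  $deposit = $sword->completeIncompleteDeposit($edit_iri,
                                               $sac_deposit_username,
                                               $sac_deposit_password,
                                               '');
} // end of the foreach over $metadataitems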

The end-to-end process has now taken place: the items have been harvested using ResourceSync, and then deposited back using SWORD v2.

Limitations

The default DSpace implementation of the SWORD v2 protocol allows items to be deposited, updated, and deleted.  It does this by keeping items in the workflow, and when the ‘In-progress’ flag is set to false, the deposit is completed by moving it out of the workflow and into the main archive.  Once the item is moved into the main archive, it can no longer be edited using SWORD.

This is a sensible approach for most situations.  Once an item has been formally ingested, it is under the control of the archive manager, and the original depositor should probably not have the rights to make further changes.

However, in the case of performing a synchronisation with ResourceSync, the master copy of the data is in a remote repository, and that should therefore be allowed to overwrite data that is formally archived in the repository.  This is an implementation option though, and if an alternative WorkflowManager was written, this could be changed.

[Update: 20th June 2013.  I have now edited the default WorkflowManager, to make one that permits updates to items that are in workflow or in the archive.  This overcomes this limitation.  I hope to add this as a configurable option to a future release of DSpace.]

Conclusion

ResourceSync and SWORD are two complementary interoperability protocols. ResourceSync can be used to harvest all content from one site, and SWORD used to deposit that content into another.

ResourceSync can differentiate between new, updated, and deleted content.  SWORD v2 also allows these interactions, so can be used to reflect those changes as they happen.
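As a rough illustration of that last point, the sketch below reads a small, made-up ResourceSync change list and works out which HTTP method a SWORD v2 client would use for each change.  The function name planSwordActions and the example URLs are invented, and the sketch assumes each change is marked with a change attribute on the entry's rs:md element.  It only plans the actions; no SWORD requests are sent.

```php
<?php
// Made-up change list with one created, one updated and one deleted resource.
$changeList = <<<XML
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
    <rs:md capability="changelist"/>
    <url>
        <loc>http://example.com/res1</loc>
        <rs:md change="created"/>
    </url>
    <url>
        <loc>http://example.com/res2</loc>
        <rs:md change="updated"/>
    </url>
    <url>
        <loc>http://example.com/res3</loc>
        <rs:md change="deleted"/>
    </url>
</urlset>
XML;

// Map each change in the list to the HTTP method SWORD v2 uses for it:
// POST creates a deposit, PUT replaces content, DELETE removes the item.
function planSwordActions(string $changeListXml): array
{
    $rsNs = 'http://www.openarchives.org/rs/terms/';
    $methods = ['created' => 'POST', 'updated' => 'PUT', 'deleted' => 'DELETE'];

    $plan = [];
    foreach (simplexml_load_string($changeListXml)->url as $url) {
        $rs = $url->children($rsNs);
        if (!isset($rs->md)) {
            continue; // no change information for this entry
        }
        $change = (string) $rs->md->attributes()->change;
        if (!isset($methods[$change])) {
            continue; // unknown or missing change type
        }
        $plan[] = [
            'loc' => (string) $url->loc,
            'change' => $change,
            'method' => $methods[$change],
        ];
    }
    return $plan;
}

foreach (planSwordActions($changeList) as $action) {
    echo $action['method'], ' for ', $action['loc'], ' (', $action['change'], ")\n";
}
```

In a real mirror, the client would also need to remember which destination record each source URL was deposited into, so that updates and deletions can be sent to the right place.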
