
ResourceSync and SWORD

This is the third post in a short series of blog posts about ResourceSync.  Thanks to Jisc funding, a small number of us from the UK have been involved in the NISO / OAI ResourceSync Initiative.  This has involved attending several meetings of the Technical Committee to help design the standard, documenting some of the different ResourceSync use cases, and working on some trial implementations.  As mentioned in the previous blog posts, I've been creating a PHP API library that makes it easy to interact with ResourceSync-enabled services.

In order to really test the library, it is good to think of a real end-to-end use case and implement it.  The use case I chose was to mirror one repository to another, and then to keep it up to date.  This involves a baseline sync to gather all the content, followed by an incremental sync of the changes made each day.
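In outline, the two phases look something like this (a sketch using the ResyncResourcelist and ResyncChangelist classes from the library, described in the posts further down this page; the URLs, paths, and daily cadence are placeholders):

include_once('ResyncResourcelist.php');
include_once('ResyncChangelist.php');

// Phase 1 (run once): baseline sync of every resource the source advertises
$resourcelist = new ResyncResourcelist('http://example.com/resourcelist.xml');
$resourcelist->baseline('/resync');

// Phase 2 (run daily, e.g. from cron): apply only the changes since the last run
$changelist = new ResyncChangelist('http://example.com/changelist.xml');
$changelist->process('/resync', new DateTime('-1 day'));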

ResourceSync provides the mechanism for gathering the resources from the remote repository.  However, another function is then required to take those resources and put them into the destination repository.  The obvious choice for this is SWORD v2.

ResourceSync is designed to list all files (or changed files) on a server.  These are then transferred using good old HTTP, but getting them into another repository requires a deposit protocol – in this case, SWORD.  In other words, ResourceSync is used to harvest the resources onto my computer, and SWORD is then used to deposit them into a destination repository.

The challenge here is linking resources together.  An 'item' in a repository is typically made up of a metadata resource, along with one or more associated file resources.  Because these are separate resources, they are listed independently in the ResourceSync resource lists.  However, they contain attributes that link them together: 'describes' and 'describedBy'.  The metadata 'describes' the file, and the file is 'describedBy' the metadata.  A good example of this is given in the CottageLabs description of how the OAI-PMH use case can be implemented using ResourceSync:

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">

    <rs:ln rel="resourcesync" href="http://example.com/capabilitylist.xml"/>
    <rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>

    <url>
        <loc>http://example.com/metadata-resource</loc>
        <lastmod>2013-01-02T17:00:00Z</lastmod>
        <rs:ln rel="describes" href="http://example.com/bitstream1"/>
        <rs:ln rel="describedBy" href="http://purl.org/dc/terms/"/>
        <rs:ln rel="collection" href="http://example.com/collection1"/>
        <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
             length="8876"
             type="application/xml"/>
    </url>

    <url>
        <loc>http://example.com/bitstream1</loc>
        <lastmod>2013-01-02T17:00:00Z</lastmod>
        <rs:ln rel="describedBy" href="http://example.com/metadata-resource"/>
        <rs:ln rel="describedBy" href="http://example.com/other-metadata"/>
        <rs:ln rel="collection" href="http://example.com/collection1"/>
        <rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"
             length="14599"
             type="application/pdf"/>
    </url>
</urlset>

So here's the recipe (and here's the code) for syncing a resource list such as this, and then depositing its contents into a remote repository using SWORD.  Both steps use PHP libraries, which keeps the code quite short.

The recipe

include_once('../../ResyncResourcelist.php');
$resourcelist = new ResyncResourcelist('http://93.93.131.168:8080/rs/resourcelist.xml');
$resourcelist->registerCallback(function($file, $resyncurl) {
  // Work out whether this resource is a metadata record or a file
  global $metadataitems, $objectitems;
  $type = 'metadata';

  // Default the ResourceSync namespace prefix if the parser did not report it
  $namespaces = $resyncurl->getXML()->getNamespaces(true);
  if (!isset($namespaces['rs'])) $namespaces['rs'] = 'http://www.openarchives.org/rs/terms/';

  // A 'describedBy' link pointing at another resource (rather than at a
  // metadata schema such as Dublin Core) marks this resource as a file
  // owned by that metadata record
  $lns = $resyncurl->getXML()->children($namespaces['rs'])->ln;
  $key = '';
  $owner = '';
  foreach ($lns as $ln) {
    if (($ln->attributes()->rel == 'describedBy') &&
        ($ln->attributes()->href != 'http://purl.org/dc/terms/')) {
      $type = 'object';
      $key = $resyncurl->getLoc();
      $owner = $ln->attributes()->href;
    }
  }

  echo ' - New file saved: ' . $file . "\n";
  echo '  - Type: ' . $type . "\n";

  // Sort metadata records into one list, and file objects into another,
  // keyed by URL so they can be matched up later
  if ($type == 'metadata') {
    $metadataitems[] = $resyncurl;
  } else {
    $objectitems[(string)$key] = $resyncurl;
    $resyncurl->setOwner($owner);
  }
});

This piece of code performs a baseline sync, using the callback registration option mentioned in the last blog post.  The callback does just one thing: it sorts the metadata records into one list, and the file objects into another.  These are then processed later.

Next, each metadata item is processed in order to deposit that metadata object into the destination repository using SWORD v2:

$counter = 0;
foreach ($metadataitems as $item) {
  echo " - Item " . ++$counter . ' of ' . count($metadataitems) . "\n";
  echo "  - Metadata file: " . $item->getFileOnDisk() . "\n";

  // Load the downloaded metadata file (assumed to be Dublin Core XML)
  $xml = simplexml_load_file($item->getFileOnDisk());
  $namespaces = $xml->getNamespaces(true);
  if (!isset($namespaces['dc'])) $namespaces['dc'] = 'http://purl.org/dc/elements/1.1/';
  if (!isset($namespaces['dcterms'])) $namespaces['dcterms'] = 'http://purl.org/dc/terms/';
  $dc = $xml->children($namespaces['dc']);
  $dcterms = $xml->children($namespaces['dcterms']);
  $title = $dc->title[0];
  $contributor = $dc->contributor[0];
  $id = $dc->identifier[0];
  $date = $dcterms->issued[0];
  echo '   - Location: ' . $item->getLoc() . "\n";
  echo '   - Author: ' . $contributor . "\n";
  echo '   - Title: ' . $title . "\n";
  echo '   - Identifier: ' . $id . "\n";
  echo '   - Date: ' . $date . "\n";

  // Create the atom entry
  $atom = new PackagerAtomTwoStep($resync_test_savedir, $sword_deposit_temp, '', '');
  $atom->setTitle($title);
  $atom->addMetadata('creator', $contributor);
  $atom->setIdentifier($id);
  $atom->setUpdated($date);
  $atom->create();

  // Deposit the metadata record (In-Progress set to true: files will follow)
  $atomfilename = $resync_test_savedir . '/' . $sword_deposit_temp . '/atom';
  echo '  - About to deposit metadata: ' . $atomfilename . "\n";
  $deposit = $sword->depositAtomEntry($sac_deposit_location,
                                      $sac_deposit_username,
                                      $sac_deposit_password,
                                      '',
                                      $atomfilename,
                                      true);

The approach used here is first to create an Atom entry containing the metadata, and to deposit that.  The SWORD v2 'In-Progress' flag is set to TRUE, which indicates that further activity will take place on the record.

The code then needs to look through the list of file resources and find any that are 'describedBy' the metadata record in question.  Any that match are deposited to the same record using SWORD v2:

// Find related files for this metadata record
foreach ($objectitems as $object) {
  if ((string)$object->getOwner() == (string)$item->getLoc()) {
    // Determine the MIME type of the downloaded file
    $finfo = finfo_open(FILEINFO_MIME_TYPE);
    $mime = finfo_file($finfo, $object->getFileOnDisk());
    echo '    - Related object: ' . $object->getLoc() . "\n";
    echo '     - File: ' . $object->getFileOnDisk() . ' (' . $mime . ")\n";

    // Deposit the file ($edit_media is the Edit-Media IRI taken from the
    // deposit receipt returned by the metadata deposit above)
    $deposit = $sword->addExtraFileToMediaResource($edit_media,
                                                   $sac_deposit_username,
                                                   $sac_deposit_password,
                                                   '',
                                                   $object->getFileOnDisk(),
                                                   $mime);
  }
}

Using the SWORD v2 API library is very easy: once you have the file and its MIME type, adding it to the record in the destination repository takes a single line of code.

Once all the related files have been added, the final step is to set the 'In-Progress' flag to FALSE to indicate that the object is complete, and that it can be formally archived into the repository.  This is as simple as:

// Complete the deposit ($edit_iri is the Edit IRI from the deposit receipt)
$deposit = $sword->completeIncompleteDeposit($edit_iri,
                                             $sac_deposit_username,
                                             $sac_deposit_password,
                                             '');

The end-to-end process has now taken place – the items have been harvested using ResourceSync, and then deposited using SWORD v2.

Limitations

The default DSpace implementation of the SWORD v2 protocol allows items to be deposited, updated, and deleted.  It does this by keeping items in the workflow; when the 'In-Progress' flag is set to false, the deposit is completed by moving the item out of the workflow and into the main archive.  Once an item has moved into the main archive, it can no longer be edited using SWORD.

This is a sensible approach for most situations.  Once an item has been formally ingested, it is under the control of the archive manager, and the original depositor should probably not have the rights to make further changes.

However, in the case of performing a synchronisation with ResourceSync, the master copy of the data is in a remote repository, and that copy should therefore be allowed to overwrite data that is formally archived in the destination repository.  This is an implementation option though, and if an alternative WorkflowManager were written, this could be changed.

[Update: 20th June 2013.  I have now edited the default WorkflowManager, to make one that permits updates to items that are in workflow or in the archive.  This overcomes this limitation.  I hope to add this as a configurable option to a future release of DSpace.]

Conclusion

ResourceSync and SWORD are two complementary interoperability protocols. ResourceSync can be used to harvest all content from one site, and SWORD used to deposit that content into another.

ResourceSync can differentiate between new, updated, and deleted content.  SWORD v2 also allows these interactions, so it can be used to reflect those changes as they happen.
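To sketch how the two could be wired together, the three changelist callbacks (described in the post below) map naturally onto SWORD v2 operations.  The bodies here are placeholder comments, since how you record each item's Edit IRI at deposit time is application-specific:

// Map each kind of change to the corresponding SWORD v2 action
$changelist->registerCreateCallback(function($file, $resyncurl) {
    // New resource: deposit it as in the recipe above, i.e. depositAtomEntry(),
    // addExtraFileToMediaResource(), then completeIncompleteDeposit()
});

$changelist->registerUpdateCallback(function($file, $resyncurl) {
    // Updated resource: look up the Edit-Media IRI recorded when the item
    // was first deposited, and re-deposit the new version of the file
});

$changelist->registerDeleteCallback(function($file, $resyncurl) {
    // Deleted resource: remove the corresponding record from the
    // destination repository via its Edit IRI
});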

ResourceSync: Making things happen with callbacks

In a previous blog post I introduced the ResourceSync PHP API library.  This is a code library written in PHP that makes it easy to interact with web sites that support the new ResourceSync standard.  The default behaviour of the code when synchronising with a server, either during a baseline sync (a complete sync) or an incremental sync (of only the files changed since the last baseline sync), is simply to download the files and store them on disk in the same directories as they exist on the server.

However, unless you just want to store the files for backup purposes, the chances are that you'll want to process them in some way.  There are two ways to do this: either perform the synchronisation and then process the files, or process them as they are downloaded.

From the last post, you’ll know that by using the ResourceSync PHP library, performing a sync can be as simple as:

include 'ResyncResourcelist.php';
$resourcelist = new ResyncResourcelist('http://example.com/resourcelist.xml');
$resourcelist->baseline('/resync');

This will process the resourcelist file by file, and download them to the /resync/ directory.

In order to process these, you need to register a ‘callback’ function with the library.  Each time an item is synchronised, the code in the callback function will be executed.

The following code snippet shows a very simple example of a callback.  This example displays the filename of the resource that has been downloaded, and prints the XML that described the file in the ResourceSync resourcelist.  The XML can be useful as it provides contextual information about the file, such as its size, checksum, last modified date, and links to related items.  Of course some of these will have already been checked by the library (such as the last modified date when using the date range option, and the checksum to make sure the file has been retrieved successfully).

$resourcelist->registerCallback(function($file, $resyncurl) {
    echo '  - Callback given value of ' .$file . "\n";
    echo '   - XML:' . "\n" . $resyncurl->getXML()->asXML() . "\n";
});
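As a more concrete example, a callback could re-verify the advertised checksum against the downloaded file.  This is just a sketch: it assumes the <rs:md> element carries an md5 hash, as in the resourcelist example in the post above, and the library may already perform this check for you:

$resourcelist->registerCallback(function($file, $resyncurl) {
    // Read the checksum advertised in <rs:md>, e.g. hash="md5:1584abdf..."
    $rs = 'http://www.openarchives.org/rs/terms/';
    $md = $resyncurl->getXML()->children($rs)->md;
    $hash = (string)$md->attributes()->hash;
    if (strpos($hash, 'md5:') === 0) {
        $match = (substr($hash, 4) == md5_file($file));
        echo ' - Checksum ' . ($match ? 'OK' : 'MISMATCH') . ': ' . $file . "\n";
    }
});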

When performing a baseline sync using the ResyncResourcelist class, it is only possible to register a single callback.  This is called whenever any file is downloaded.

However the ResyncChangelist class allows three different callbacks to be registered, depending on the action: CREATED, UPDATED, or DELETED.

$changelist->registerCreateCallback(function($file, $resyncurl) {
    echo '  - CREATE Callback given value of ' .$file . "\n";
    echo '   - XML:' . "\n" . $resyncurl->getXML()->asXML() . "\n";
});

$changelist->registerUpdateCallback(function($file, $resyncurl) {
    echo '  - UPDATE Callback given value of ' .$file . "\n";
    echo '   - XML:' . "\n" . $resyncurl->getXML()->asXML() . "\n";
});

$changelist->registerDeleteCallback(function($file, $resyncurl) {
    echo '  - DELETE Callback given value of ' .$file . "\n";
    echo '   - XML:' . "\n" . $resyncurl->getXML()->asXML() . "\n";
});

Depending on the purpose of your code, it is likely that you would want to handle these three types of events in different ways, hence the three callback options.

In the next blog post, I’ll show an example of this code in action, as it uses the callback to look at each resource’s XML to discover whether it is a metadata file or a related resource.  It then uses this information to deposit the item into a repository using SWORD.

The ResourceSync PHP Library

Over the past year, thanks to funding from Jisc, I've been involved with the NISO / OAI ResourceSync initiative.  The aim of ResourceSync is to provide mechanisms for large-scale synchronisation of web resources.  There are lots of use cases for this, and many reasons why it is an interesting problem.

The specification itself can be read at http://www.openarchives.org/rs, and a quick read will highlight that it is based on sitemaps (http://www.sitemaps.org/).  This is no surprise, given that sitemaps were developed for the easy and efficient listing of web resources for search engine crawlers to harvest – which is itself a specialised form of resource synchronisation.

As with anything new, the proof of the pudding is in the eating.  In this context that means reference implementations are required, both to test that the standard can be implemented and can fulfil the use cases it was designed for, and to smooth off any rough edges that only appear once it is used in anger.

My role therefore has been to develop a PHP ResourceSync client library.  The role of a client library is to allow other software systems to easily interact with a technology – in this case, web servers that support ResourceSync.  The client library therefore provides the facility to connect to a web server and synchronise the contents, and then to stay up to date by loading lists of resources that have been created, updated, or deleted.

The PHP library can be downloaded from: https://github.com/stuartlewis/resync-php

The rest of this blog post will step through the different parts of ResourceSync, and show how they can be accessed using the PHP client library:

The first step is to discover whether a site supports ResourceSync.  The mechanism for this uses the well-known URI specification (see RFC 5785).  Put simply, if a server supports ResourceSync, it places a file at http://www.example.com/.well-known/resourcesync which points to where the capability list exists.
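For illustration, the document at the well-known URI is a small 'source description' along these lines (a sketch based on the specification; the URL is a placeholder):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
    <rs:md capability="description"/>
    <url>
        <loc>http://www.example.com/capabilitylist.xml</loc>
        <rs:md capability="capabilitylist"/>
    </url>
</urlset>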

The first function of the PHP ResourceSync library is therefore to support this discovery:

include('ResyncDiscover.php');
$resyncdiscover = new ResyncDiscover('http://example.com/');
$capabilitylists = $resyncdiscover->getCapabilities();
echo ' - There were ' . count($capabilitylists) .
     ' capability lists found:' . "\n";
foreach ($capabilitylists as $capabilitylist) {
    echo ' - ' . $capabilitylist . "\n";
}

Zero, one, or more capability list URIs are returned.  If none are returned, then the site doesn’t support ResourceSync.  If one is returned, the next step is to examine the capability list to see which parts of the ResourceSync protocol are supported:

include('ResyncCapabilities.php');
$resynccapabilities = new ResyncCapabilities('http://example.com/capabilitylist.xml');
$capabilities = $resynccapabilities->getCapabilities();
echo 'Capabilities' . "\n";
foreach($capabilities as $capability => $type) {
    echo ' - ' . $capability . ' (capability type: ' . $type . ')' . "\n";
}

This returns the specific ResourceSync capabilities supported by the server.  Typically a resourcelist and a changelist will be shown.
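A capability list for such a server might look something like this (again a sketch following the specification, with placeholder URLs):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
    <rs:ln rel="up" href="http://example.com/.well-known/resourcesync"/>
    <rs:md capability="capabilitylist"/>
    <url>
        <loc>http://example.com/resourcelist.xml</loc>
        <rs:md capability="resourcelist"/>
    </url>
    <url>
        <loc>http://example.com/changelist.xml</loc>
        <rs:md capability="changelist"/>
    </url>
</urlset>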

The next step is often to perform a baseline sync (complete download of all resources).  Again, the PHP library supports this:

include 'ResyncResourcelist.php';
$resourcelist = new ResyncResourcelist('http://example.com/resourcelist.xml');
$resourcelist->enableDebug(); // Show progress
$resourcelist->baseline('/resync');

It is possible to ask the library how many files it has downloaded, and how large they were:

echo $resourcelist->getDownloadedFileCount() . ' files downloaded, and ' .
     $resourcelist->getSkippedFileCount() . ' files skipped' . "\n";
echo $resourcelist->getDownloadSize() . 'Kb downloaded in ' .
     $resourcelist->getDownloadDuration() . ' seconds (' .
     ($resourcelist->getDownloadSize() /
      $resourcelist->getDownloadDuration()) . ' Kb/s)' . "\n";

It is also possible to restrict the files to be downloaded to those from a certain date onwards.  This can be useful if you only want to synchronise recently created files:

$from = new DateTime("2013-05-18 00:00:00.000000");
$resourcelist->baseline('/resync', $from);

Once a baseline sync has taken place, all of the files exposed via the ResourceSync interface will exist on the local computer.  The next step is to keep this set of resources up to date.  To do this, change lists should be processed regularly (depending on the frequency at which the server produces them) to download new or updated files, and to delete old files:

include 'ResyncChangelist.php';
$changelist = new ResyncChangelist('http://example.com/changelist.xml');
$changelist->enableDebug(); // Show progress
$changelist->process('/resync');
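The changelist itself is another sitemap-based document, in which each entry carries a change attribute saying what happened to the resource (a sketch per the specification, with placeholder URLs and dates):

<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
    <rs:md capability="changelist"
           from="2013-05-17T00:00:00Z"/>
    <url>
        <loc>http://example.com/new-resource</loc>
        <lastmod>2013-05-18T13:00:00Z</lastmod>
        <rs:md change="created"/>
    </url>
    <url>
        <loc>http://example.com/old-resource</loc>
        <lastmod>2013-05-18T14:00:00Z</lastmod>
        <rs:md change="deleted"/>
    </url>
</urlset>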

Again, there are options to see what files have been processed:

echo ' - ' . $changelist->getCreatedCount() . ' files created' . "\n";
echo ' - ' . $changelist->getUpdatedCount() . ' files updated' . "\n";
echo ' - ' . $changelist->getDeletedCount() . ' files deleted' . "\n";
echo $changelist->getDownloadedFileCount() . ' files downloaded, and ' .
     $changelist->getSkippedFileCount() . ' files skipped' . "\n";
echo $changelist->getDownloadSize() . 'Kb downloaded in ' .
     $changelist->getDownloadDuration() . ' seconds (' .
     ($changelist->getDownloadSize() /
      $changelist->getDownloadDuration()) . ' Kb/s)' . "\n";

Once again, it is possible to only see changes made since a particular date.  By keeping a note of when the sync was last attempted, only the changes made since then need be processed:

$from = new DateTime("2013-05-18 00:00:00.000000");
$changelist->process('/resync', $from);

In a few steps, each consisting of just a few lines of code, the PHP library thus allows the contents of a ResourceSync-enabled server to be kept in sync with a local copy.
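Putting the pieces together, a minimal client might look something like this.  It is only a sketch: it assumes, as the loops above suggest, that getCapabilities() on a capability list returns a map of URL to capability type, and it uses a marker file to decide whether the baseline sync has already been done; error handling is omitted:

include('ResyncDiscover.php');
include('ResyncCapabilities.php');
include('ResyncResourcelist.php');
include('ResyncChangelist.php');

// 1. Discover whether the site supports ResourceSync at all
$resyncdiscover = new ResyncDiscover('http://example.com/');
$capabilitylists = $resyncdiscover->getCapabilities();
if (count($capabilitylists) == 0) {
    exit('No ResourceSync support found' . "\n");
}

// 2. Find the resourcelist and changelist URLs from the capability list
$resynccapabilities = new ResyncCapabilities($capabilitylists[0]);
$resourcelisturl = $changelisturl = null;
foreach ($resynccapabilities->getCapabilities() as $url => $type) {
    if ($type == 'resourcelist') $resourcelisturl = $url;
    if ($type == 'changelist') $changelisturl = $url;
}

// 3. Baseline sync on the first run, incremental sync thereafter
if (!file_exists('/resync/.baseline-done')) {
    $resourcelist = new ResyncResourcelist($resourcelisturl);
    $resourcelist->baseline('/resync');
    touch('/resync/.baseline-done');
} else {
    $changelist = new ResyncChangelist($changelisturl);
    $changelist->process('/resync', new DateTime('-1 day'));
}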

A further two blog posts will be published in this series.  The next will show how to interact with the library so that more complex actions can be performed when resources are created, updated, or deleted.  The final blog post will show this in action, with an application of the PHP ResourceSync library making use of the resources it processes.

Facebook advertising Open Access “Are you a researcher?”

2012 has been a busy year in the world of Open Access.  From a UK funding point of view, the big news has included the Finch Report and the RCUK's reaction to it in its new Policy on Access to Research Outputs.  To cut a very long story short, the RCUK is now providing £17+ million to UK institutions (pro-rata to the size of their grants) to help fund Gold Open Access: that is, payments of Article Processing Charges (APCs) to make journal papers free at the point of use, from the publishers' websites, with a Creative Commons Attribution (CC-BY) licence, at the time of first publication.  There are many ongoing debates about how to apportion this money, exactly what it covers, and how best to administer and report the spending.

An unsurprising reaction to this has come from the hybrid open access publishers.  Pure Open Access publishers (BioMed Central, PLoS, etc.) already run their business model this way.  Traditional publishers have had to introduce hybrid approaches to allow Gold APCs to be paid to make papers available which would normally have been funded by subscriptions.  The latest changes for hybrid publishers have been to take into account the requirement for the CC-BY licence.  One example is the Nature Publishing Group, which has introduced differential pricing based on the Creative Commons licence selected, for example to make up for the shortfall in income from reprints.  Another example is Wiley and their new Open Access schemes.

However the point of this blog post was my surprise at logging into Facebook this morning…

[Screenshot: Facebook sidebar adverts]

In case you missed it, here is one advert in particular that I’ve not seen before…

[Screenshot: Springer's 'Are you a researcher?' advert]

Clicking on this takes you to Springer’s web page on Open Access:

[Screenshot: Springer's Open Access web page]

Springer is advertising on Facebook to let authors know about their journals and open access publishing options, and most importantly, that there is money from RCUK to back it up (for RCUK-funded outputs).

I don't want to pass judgement on this (I don't really have an opinion on it), but it is an interesting development!  Those of us who work closely with these Open Access initiatives and the RCUK block grants need to be aware of the messages that are being put out there.  This is a new message in a new medium!

A prize will be offered for the first (genuine!!!) enquiry received about Open Access and the RCUK funding from an author who ‘saw it on Facebook’!  It will be interesting to see how well this message propagates and is understood.

Oh, the admin and the coder should be friends!

This is a tongue-in-cheek blog post a few days before the Open Repositories 2012 conference that is being held here in Edinburgh. I’ll give a bit of background first, a disclaimer, a video, then the main content of this post.

First, the background: I have a slight love/hate relationship with the repository community and the Open Repositories conference, related to the strong distinction it makes between 'Repository Managers' and 'Developers'.  It's nice that we do this, as it allows for innovative conference strands such as the 'developers challenge', where developers can come and show off their wares.  However, I also hate this segregation and the labelling of delegates into these categories.  Personally I see myself as straddling the two, and I feel that we should be looking for our shared interests (developing open repository services) rather than highlighting differences between our roles.

However, I won't rant or get on my soapbox; instead I'll butcher a song from the Rodgers and Hammerstein musical 'Oklahoma!'.  One of its most famous songs is about how the farmers and the cowmen don't get along and look for all the differences between themselves, rather than trying to work together to make the most of being settlers in a new territory.  (See any similarities?!)

The disclaimer: the song assumes that the farmers and cowmen are all male, and that the women stay at home cooking, with the daughters waiting to get married.  In my re-working of the lyrics I've been equally sexist and made the repository managers female and the developers male.  This is not representative of my views or of reality, but it fits the song!  So please don't hold it against me – this is just a light-hearted piece!

The song also fits the OR2012 conference, as it talks about the admins and coders ('repository managers' and 'developers' are too long for the song!) dancing together.  The conference dinner will end with a ceilidh, where hopefully there will be much dancing and fun!  If you've never seen Oklahoma! you can watch a performance of the song below:

So sing along (preferably in your head if you work in a shared office!)

The admin and the coder should be friends.
Oh the admin and the coder should be friends.
One of them likes to bulk upload, the other likes to cut some code,
But that’s no reason why they can’t be friends!

Repository folks should stick together,
Repository folks should all be pals.
Admins dance with the coders’ daughters,
Coders dance with the admins’ gals.
(repeat)

I’d like to say a word for the admin,
She come out west when repos were in beta,
She came out west and built a lot of services,
And uploads PDFs with metadata!

The admin is a good and thrifty citizen,
no matter what the coder says or thinks.
You sometimes see ‘em drinkin’ in the tea room.
And always wants download stats when she rings.

But the admin and the coder should be friends.
Oh, the admin and the coder should be friends.
The coder writes a script with ease, the admin holds the OA keys,
But that’s no reason why they can’t be friends.

Repository folks should stick together,
Repository folks should all be pals.
Admins dance with the coders’ daughters,
Coders dance with the admins’ gals.

I’d like to say a word for the coder,
the road he treads is difficult and stoney.
He codes for days on end with just a keyboard for a friend.
I sure do find he’s often tired and moany!

The coder should be sociable with the admin.
If he drops in looking like he needs bath water,
Don’t treat him like a louse make him welcome in your house.
But be sure that you lock up your wife and daughters!

I’d like to teach you all a little saying.
And learn the words by heart the way you should.
I don’t say I’m no better than anybody else,
But I’ll be damned if I ain’t just as good!

Repository folks should stick together,
Repository folks should all be pals.
Admins dance with the coders’ daughters,
Coders dance with the admins’ gals.

Suggestions for better lyrics are most welcome!

[If you want to see the original lyrics, you can view them at: http://www.stlyrics.com/lyrics/oklahoma/thefarmerandthecowman.htm]

Is the Repository Developer a dying breed?

Is the Repository Developer a dying breed, and should we care?

Cast your mind back, perhaps seven or eight years.  It was the heyday of repository development.  Projects such as DSpace and EPrints were taking off, and institutions around the world were watching the area closely and with excitement to see where this glorious new world would take us.

But back in those days repositories were similar to the early motor car – you needed a lot of money, several years, and your own mechanic/driver (developer) to make it work.  Luckily, back then these resources were often available, money was perhaps a little easier to come by, and there were many funding opportunities from the likes of JISC to help out too.

As repository developers worked with the repository software for a few years, they became intimately familiar with it – they knew how it worked, how it was structured, what it could and couldn't do, and how to structure data within the repository, and they often became key players in the development of the open source platforms by taking on roles such as DSpace Committership.  Life was good, and I was lucky to be part of this, riding on the waves of e-theses, JISC projects (Repository Bridge, ROAD, RSP, Deposit Plait, SWORD) and the start of local institutional open access advocacy movements.

However… life moves on.  The early repository developers have taken different career paths, and now find themselves in different situations.

  • Some have left the domain as project funding slowed down, repositories could be implemented without a dedicated developer, and new areas of interest arose.
  • Some have progressed in their careers, and chosen to take a non-technical route up the tree.
  • Some have taken a commercial route, choosing to take their skills into the commercial sector and providing development services back to repository-using institutions.
  • Others have specialised as repository developers, but often find their emphasis has to be on compliance or marketing issues such as statistics, research assessment, or branding, rather than continuing to develop and apply core repository functionality.

It is rare to find a role these days where a developer can specialise in repositories, spending the majority of their time in that area.

I believe there is still a need for repository developers, as they bring many benefits such as:

  • An understanding of the technology that helps them to know when repository technology can and should be applied, and when it should not.  Often repositories do not appear to be suitable choices for some of our system requirements as we’re used to confining them to electronic theses and journal papers, but they have great potential in new areas.
  • They know the underlying technology and data structures used by repositories, and how these can be mapped onto new domains.  This can save institutions time and money, as they can re-use their existing repository infrastructure and expertise rather than investing in new systems.
  • Equally, and conversely, they know the weaknesses of repositories, and where current or future functionality will not be suitable.
  • They provide technical credibility.  Often repositories are run by libraries, but in environments where there are IT departments who may hold varying views on the technical development competence of the library.
  • They make the running of the current repository or repositories smoother, and can help manipulate the data they contain (import / export / update / delete) in ways that are not supported by the native interfaces.
  • Repositories are starting to become integration targets for enterprise systems such as CRIS systems.  Having a repository developer around can make these integrations easier.

I think we’re seeing a downward spiral in the availability of repository developers.  From time to time you see job adverts seeking experienced repository developers, and unfortunately they seem to be becoming a rare breed – a breed which I think we should protect, recognise, foster, and grow.  I’m lucky to have worked in several large institutions lately where there have been small, effective, embedded, and valued repository development teams.  However these types of teams are starting to become fewer and harder to find.

Opportunities for repository developers to network, learn, and share are decreasing.  There are exciting events such as Open Repositories with its Developers Challenge, and the Repository Fringe, but these are only annual events, and they do not provide the opportunity for repository developers to show their skills and the potential of repository technologies to anyone outside the repository community.

How will this affect the repository community?  I think that there will be increasing problems in the repository world if repository developers become a very rare breed:

  • The large open source platforms that we have come to rely upon (installations of platforms such as DSpace, EPrints, and Fedora number in the thousands) will find it harder to continue to develop and keep pace with current requirements.  The large amount of development effort that has gone into these systems over the past decade could be wasted, and we’ll fail to see some of the benefits that only come to fruition after this length of maturity.
  • There will be fewer exemplars of good practice of repository use to inspire and drive forward the innovative use of repositories.
  • Those who administer repositories will lose their local allies who are able to provide the tools and integrations to make repositories a local success.
  • The potential for repositories to be involved in new hot topics such as Research Data Management, the resurgence of interest in open access publishing, or the need for better digital preservation may be missed.
  • It will be even harder to recruit experienced and passionate repository developers, and without well-established teams of these, new developers thrust into the arena will find it harder to grow their skills and knowledge.

What should we, and can we, do about it?  I think we need to value the role of the repository developer, and continue to recognise that although there is less requirement for a repository developer in order to run a repository, the absence of development skills may stop a good repository service from becoming a great one.  Just as we value multi-talented systems librarians, we should value the repository developer as a multi-skilled employee who allows us to correctly apply and integrate repository technologies.

Looking around at commercial companies that offer repository development services (for example Cottage Labs and atmire), we see the sort of innovative thinking that has so much potential in this area, and when I talk to staff involved with these companies there seems to be no shortage of people wanting their skills.  This is good, and shows that there is demand.  But equally, I feel we need to keep growing these skills within institutions, and not let the local Repository Developer become a dying breed.

Repositories are in their teenage years: we nurtured them through birth, messy childhoods, and promising early years, and now we're starting to get a glimpse of how they can become powerful embedded tools.  But without the continued availability of skilled parents to shepherd their development, they may never reach their full adult potential.

[This blog post was written on the way to a DevCSI event for managers of developers, where we shall be looking at how we can show the positive impact of having local development teams within universities, and from my perspective and passion, their particular value to libraries.]