Tag Archives: dspace

ResourceSync and SWORD

This is the third post in a short series of blog posts about ResourceSync.  Thanks to Jisc funding, a small number of us from the UK have been involved in the NISO / OAI ResourceSync Initiative.  This has involved attending several meetings of the Technical Committee to help design the standard, working on documenting some of the different ResourceSync use cases, and working on some trial implementations.  As mentioned in the previous blog posts, I’ve been creating a PHP API library that makes it easy to interact with ResourceSync-enabled services.

In order to really test the library, it is good to think of a real end-to-end use case and implement it.  The use case I chose to do this was to mirror one repository to another, and to then keep it up to date.  This first involves a baseline sync to gather all the content, followed by an incremental sync of changes made each day.

ResourceSync provides the mechanism by which to gather the resources from the remote repository.  However another function is then required to take those resources and put them into the destination repository.  The obvious choice for this is SWORD v2.

ResourceSync is designed to list all files (or changed files) on a server.  These are then transferred using good old HTTP, but to get them into another repository requires a deposit protocol – in this case, SWORD.  In other words, ResourceSync is used to harvest the resources onto my computer, and SWORD is then used to deposit them into a destination repository.

The challenge here is linking resources together.  An ‘item’ in a repository is typically made up of a metadata resource, along with one or more associated file resources.  Because these are separate resources, they are listed independently in the ResourceSync resource lists.  However they contain attributes that link them together: ‘describes’ and ‘describedBy’.  The metadata ‘describes’ the file, and the file is ‘describedBy’ the metadata.  A good example of this is given in the CottageLabs description of how the OAI-PMH use case can be implemented using ResourceSync:

[xml]
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
xmlns:rs="http://www.openarchives.org/rs/terms/">

<rs:ln rel="resourcesync" href="http://example.com/capabilitylist.xml"/>
<rs:md capability="resourcelist" modified="2013-01-03T09:00:00Z"/>

<url>
<loc>http://example.com/metadata-resource</loc>
<lastmod>2013-01-02T17:00:00Z</lastmod>
<rs:ln rel="describes" href="http://example.com/bitstream1"/>
<rs:ln rel="describedBy" href="http://purl.org/dc/terms/"/>
<rs:ln rel="collection" href="http://example.com/collection1"/>
<rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6"
length="8876"
type="application/xml"/>
</url>

<url>
<loc>http://example.com/bitstream1</loc>
<lastmod>2013-01-02T17:00:00Z</lastmod>
<rs:ln rel="describedBy" href="http://example.com/metadata-resource"/>
<rs:ln rel="describedBy" href="http://example.com/other-metadata"/>
<rs:ln rel="collection" href="http://example.com/collection1"/>
<rs:md hash="md5:1e0d5cb8ef6ba40c99b14c0237be735e"
length="14599"
type="application/pdf"/>
</url>
</urlset>
[/xml]
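To make the linking concrete, here is a minimal sketch (not the library's own code) of how the ‘describedBy’ links in a resource list like the one above could be followed using PHP's SimpleXML.  The function name groupByMetadata is hypothetical, and skipping links that point at the Dublin Core namespace is an assumption based on the example above, where such a link names the metadata format rather than an owning record.

[php]
// Hypothetical helper: map each metadata URL to the URLs of the files it describes.
function groupByMetadata($resourcelistXml) {
    $rsNs = 'http://www.openarchives.org/rs/terms/';
    $groups = array();
    $urlset = simplexml_load_string($resourcelistXml);
    foreach ($urlset->url as $url) {
        $loc = (string)$url->loc;
        foreach ($url->children($rsNs)->ln as $ln) {
            // Compare case-insensitively so 'describedBy' and 'describedby' both match
            $rel = strtolower((string)$ln->attributes()->rel);
            $href = (string)$ln->attributes()->href;
            // Assumption: a link to the Dublin Core namespace is a format hint, not an owner
            if ($rel == 'describedby' && $href != 'http://purl.org/dc/terms/') {
                $groups[$href][] = $loc;
            }
        }
    }
    return $groups;
}
[/php]

Calling groupByMetadata() with the contents of a resource list returns an array keyed by metadata URL, so each metadata record can be looked up together with its files.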

So here’s the recipe (and here’s the code) for syncing a resource list such as this, and then depositing it into a remote repository using SWORD.  Both use PHP libraries, which makes the code quite short.

The recipe

[php]
include_once('../../ResyncResourcelist.php');

// Lists filled in by the callback: metadata records, and files keyed by URL
$metadataitems = array();
$objectitems = array();

$resourcelist = new ResyncResourcelist('http://93.93.131.168:8080/rs/resourcelist.xml');
$resourcelist->registerCallback(function($file, $resyncurl) {
    // Work out if this is a metadata object or a file
    global $metadataitems, $objectitems;
    $type = 'metadata';
    $namespaces = $resyncurl->getXML()->getNameSpaces(true);
    if (!isset($namespaces['sm'])) $namespaces['sm'] = 'http://www.sitemaps.org/schemas/sitemap/0.9';
    $lns = $resyncurl->getXML()->children($namespaces['rs'])->ln;
    $key = '';
    $owner = '';
    foreach ($lns as $ln) {
        // Lower-case the rel value so both 'describedBy' and 'describedby' match
        $rel = strtolower((string)$ln->attributes()->rel);
        $href = (string)$ln->attributes()->href;
        if (($rel == 'describedby') && ($href != 'http://purl.org/dc/terms/')) {
            // A resource described by another resource is a file, not metadata
            $type = 'object';
            $key = $resyncurl->getLoc();
            $owner = $href;
        }
    }

    echo ' - New file saved: ' . $file . "\n";
    echo '  - Type: ' . $type . "\n";

    if ($type == 'metadata') {
        $metadataitems[] = $resyncurl;
    } else {
        $objectitems[(string)$key] = $resyncurl;
        $resyncurl->setOwner($owner);
    }
});
[/php]

This piece of code is performing a baseline sync, and is using the callback registration option mentioned in the last blog.  The callback is just doing one thing: sorting the metadata objects into one list, and the file objects into another.  These will then be processed later.

Next, each metadata item is processed in order to deposit that metadata object into the destination repository using SWORD v2:

[php]
// $sword and the $sac_deposit_* settings are assumed to be set up elsewhere
$counter = 0;
foreach ($metadataitems as $item) {
    echo " - Item " . ++$counter . ' of ' . count($metadataitems) . "\n";
    echo "  - Metadata file: " . $item->getFileOnDisk() . "\n";

    // Assumption: $xml is parsed from the harvested metadata file on disk
    $xml = simplexml_load_file($item->getFileOnDisk());
    $namespaces = $xml->getNameSpaces(true);
    if (!isset($namespaces['dc'])) $namespaces['dc'] = 'http://purl.org/dc/elements/1.1/';
    if (!isset($namespaces['dcterms'])) $namespaces['dcterms'] = 'http://purl.org/dc/terms/';
    $dc = $xml->children($namespaces['dc']);
    $dcterms = $xml->children($namespaces['dcterms']);
    $title = $dc->title[0];
    $contributor = $dc->contributor[0];
    $id = $dc->identifier[0];
    $date = $dcterms->issued[0];
    echo '   - Location: ' . $item->getLoc() . "\n";
    echo '   - Author: ' . $contributor . "\n";
    echo '   - Title: ' . $title . "\n";
    echo '   - Identifier: ' . $id . "\n";
    echo '   - Date: ' . $date . "\n";

    // Create the atom entry
    $test_dirin = 'atom_multipart';
    $atom = new PackagerAtomTwoStep($resync_test_savedir, $sword_deposit_temp, '', '');
    $atom->setTitle($title);
    $atom->addMetadata('creator', $contributor);
    $atom->setIdentifier($id);
    $atom->setUpdated($date);
    $atom->create();

    // Deposit the metadata record (the final argument sets 'in-progress' to true)
    $atomfilename = $resync_test_savedir . '/' . $sword_deposit_temp . '/atom';
    echo '  - About to deposit metadata: ' . $atomfilename . "\n";
    $deposit = $sword->depositAtomEntry($sac_deposit_location,
                                        $sac_deposit_username,
                                        $sac_deposit_password,
                                        '',
                                        $atomfilename,
                                        true);
[/php]

The option used here is to first create an atom entry that contains the metadata, and then deposit that.  The SWORD v2 ‘in-progress’ flag is set to TRUE, which indicates that further activity will take place on the record.

The code then needs to look through the list of file resources, and find any that are ‘describedBy’ the metadata record in question.  Any that are, are deposited to the same record using SWORD v2:

[php]
// Find related files for this metadata record
foreach($objectitems as $object) {
if ((string)$object->getOwner() == (string)$item->getLoc()) {
$finfo = finfo_open(FILEINFO_MIME_TYPE);
$mime = finfo_file($finfo, $object->getFileOnDisk());
echo ‘    – Related object: ‘ . $object->getLoc() . "\n";
echo ‘     – File: ‘ . $object->getFileOnDisk() . ‘ (‘ . $mime . ")\n";

// Deposit file
$deposit = $sword->addExtraFileToMediaResource($edit_media,
$sac_deposit_username,
$sac_deposit_password,
”,
$object->getFileOnDisk(),
$mime);
}
}
[/php]

Using the SWORD v2 API library is very easy: once you have the file and its MIME type, it is a single line of code to add that file to the record in the destination repository.

Once all the related files have been added, the final step is to set the ‘in-progress’ flag to FALSE to indicate that the object is complete, and that it can be formally archived into the repository.  This is as simple as:

[php]
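    // Complete the deposit.
    // $edit_iri is assumed to hold the edit IRI returned by the metadata deposit.
    $deposit = $sword->completeIncompleteDeposit($edit_iri,
                                                 $sac_deposit_username,
                                                 $sac_deposit_password,
                                                 '');
} // end of foreach ($metadataitems as $item)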
[/php]

The end to end process has now taken place – the items have been harvested using ResourceSync, and then deposited back using SWORD v2.

Limitations

The default DSpace implementation of the SWORD v2 protocol allows items to be deposited, updated, and deleted.  It does this by keeping items in the workflow, and when the ‘in-progress’ flag is set to false, the deposit is completed by moving it out of the workflow and into the main archive.  Once the item is moved into the main archive, it can no longer be edited using SWORD.

This is a sensible approach for most situations.  Once an item has been formally ingested, it is under the control of the archive manager, and the original depositor should probably not have the rights to make further changes.

However, in the case of performing a synchronisation with ResourceSync, the master copy of the data is in a remote repository, and that should therefore be allowed to overwrite data that is formally archived in the repository.  This is an implementation option though, and if an alternative WorkflowManager was written, this could be changed.

[Update: 20th June 2013.  I have now edited the default WorkflowManager, to make one that permits updates to items that are in workflow or in the archive.  This overcomes this limitation.  I hope to add this as a configurable option to a future release of DSpace.]

Conclusion

ResourceSync and SWORD are two complementary interoperability protocols. ResourceSync can be used to harvest all content from one site, and SWORD used to deposit that content into another.

ResourceSync can differentiate between new, updated, and deleted content.  SWORD v2 also allows these interactions, so can be used to reflect those changes as they happen.
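As a rough sketch of that mapping, assume a ResourceSync change list in which each entry carries an rs:md element with a change attribute of created, updated, or deleted.  The code below works out which kind of SWORD v2 operation each change would correspond to.  The function name planSwordActions and the action labels are illustrative only (they are not method names from either library), and no SWORD calls are actually made.

[php]
// Hypothetical helper: turn a change list into a list of planned actions.
function planSwordActions($changelistXml) {
    $rsNs = 'http://www.openarchives.org/rs/terms/';
    // Illustrative labels only, not calls into a SWORD library
    $actions = array(
        'created' => 'deposit',
        'updated' => 'replace',
        'deleted' => 'delete',
    );
    $plan = array();
    $urlset = simplexml_load_string($changelistXml);
    foreach ($urlset->url as $url) {
        $md = $url->children($rsNs)->md;
        if (count($md) == 0) {
            continue; // no change information for this entry
        }
        $change = (string)$md[0]->attributes()->change;
        if (isset($actions[$change])) {
            $plan[] = array('loc' => (string)$url->loc, 'action' => $actions[$change]);
        }
    }
    return $plan;
}
[/php]

Each planned action could then be carried out with the matching SWORD v2 client call, in the order in which the changes are listed.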

The collection is dead! Long live the collection!

“The collection is dead! Long live the collection!”  That summarises my current thoughts and feelings about the collection hierarchy structure in DSpace.

When first installed, DSpace shows its need for a community and collection hierarchy, because without at least one community, containing at least one collection, it is impossible to even submit an item.  Therefore, from day one, DSpace repository managers get used to creating collections, giving them names, and creating a hierarchy.  Having created a first community with its first collection, it seems silly to have just a single community and collection.  So the repository manager creates a larger hierarchy – often mimicking the organisational structure of their institution.  Often this hierarchy extends down to departments, research groups, even individuals.

And the repository manager feels good.  There is structure which will give people a sense of belonging, and more importantly, ownership of ‘their collection’.  This will help them gather content for their repository by helping the individual or research group feel that it is their space.

Very often of course it means that a repository has a lot of structure with empty collections.  Also it leads to other problems – what about theses?  We want them to appear in their department’s collection, but also to appear in our central library thesis collection for harvesting by a service such as EThOS.  Or what about an article that has been co-authored by people across different departments?  Or departments that move around in the organisation?  This collection structure is starting to feel a bit inflexible and restrictive!  What started off as a useful tool that made us feel ‘organised’ now feels the opposite.

Luckily, help is now at hand!  Since version 1.7 DSpace has included the ‘discovery’ module.  This is nothing ground-breaking as such, just a faceted search feature using Solr, contributed to DSpace by atmire.  The real beauty and power of faceted search comes from its ability to make ‘virtual collections’.  You want a collection of theses published by the faculty of science: sure – just link to the type:thesis + faculty:science facet search result page.  You want a collection for Prof. S Smith?  No problem, link to the author:s. smith faceted search result page.  Want a collection of the ‘recent’ publications of a department?  Just link to the department:foobar + year:2011 facet search result page.
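To illustrate the idea, here is a minimal sketch of building such a link from a set of facet values.  It is not DSpace's actual URL scheme: the function name virtualCollectionLink, the base URL, and the ‘query’ parameter name are all assumptions made for the example.

[php]
// Hypothetical helper: build a link to a faceted search result page.
// The 'query' parameter name is an assumption, not DSpace's real URL scheme.
function virtualCollectionLink($baseUrl, array $facets) {
    $terms = array();
    foreach ($facets as $field => $value) {
        // Quote each value so that spaces and punctuation stay within one term
        $terms[] = $field . ':"' . $value . '"';
    }
    // http_build_query() takes care of URL-encoding the combined query
    return $baseUrl . '?' . http_build_query(array('query' => implode(' AND ', $terms)));
}
[/php]

For example, virtualCollectionLink('http://example.com/discover', array('type' => 'thesis', 'faculty' => 'science')) would give a single link that can be bookmarked or placed on a departmental web page as that department's ‘collection’.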

Facets give us the ability to create ‘views’ over the data, based on properties (metadata) of items.  Maybe this is why Google and other search engines are more popular than http://www.dmoz.org/?  We like to have our own collections defined instantly by a search, not be forced to traverse a hierarchy dictated by others.  Of course this does rely on quality (consistent / present / correct) metadata to ensure that items all appear in their virtual collections.  To conclude – sometimes it feels like “The collection is dead!”.  We have better ways to create structure over the repository rather than through it.

But wait!  I cry “Long live the collection!”.

Whilst in our main ‘institutional repository’ at The University of Auckland Library (http://researchspace.auckland.ac.nz/) we have been rationalising the number of collections we have and removing most of the organisational structure, we have been making use of DSpace collections in another very useful way…

We’re halfway through a project known internally as the SuperIndex.  Not (just!) because it is ‘super’, but because we are creating a super-index of many of our disparate bibliographic and digital special collections.  We have many databases of collections all over the place, and the SuperIndex project aims to bring them all together into a single system.  This will make management of the collections more consistent, while reducing the number of systems to maintain.  DSpace is our chosen central management system.

This is where the collection structure is becoming very useful. Each collection of items really is a collection.  An item in a database about fisheries in New Zealand will not (or is unlikely to!) appear in any of our other special collections.  This collection structure makes it easier to manage each collection separately.  We can run curation tasks on a collection, or control who has rights to edit a collection.  The management of this repository is much wider than the institutional repository that is just administered by one team.  We will have staff in many areas editing the items.  It also allows us to create individual websites for each collection, each with their own URL structure and branding – the end user does not know that the item they are viewing is actually managed in a DSpace somewhere, and that the DSpace contains thousands of other items in different collections.

So the ‘collection’ is starting to become less useful in the standard institutional repository of research outputs (which is, in a way, a single collection) but is having a new life for us in managing what could be seen as more traditional ‘collections’ in a single DSpace repository.

I’d be interested to hear what you think are the strengths and weaknesses, or reasons for and against forced collection hierarchies in DSpace.

“The collection is dead! Long live the collection!”

Life used to be so simple / unconfigurable

I had a spare hour this afternoon, so I thought I’d take a quick look at how dspace.cfg has changed over time. Anyone who has had the pleasure of looking after a DSpace server will know about dspace.cfg. It is the main configuration file for DSpace where many of the configurable options reside. These vary from core settings such as the name of your database server or mail server, to minor tweaks that make differences to your repository that nobody would ever notice!

DSpace 1.7 has just been released, and as it stands, dspace.cfg is a whopping 2268 lines long, containing over 200 required or preset configuration options, and a further 250+ optional parameters.  That’s quite a configuration file!

(It’s not so bad though – getting a new system up and running requires fewer than five settings to be edited to match your local environment.  The rest are there for a rainy day.)

But… how has it grown over time?

  • Version 1.0: 31 required / preset, 2 optional
  • Version 1.1: 30 required / preset, 5 optional (-1, +3)
  • Version 1.2: 34 required / preset, 9 optional (+4, +4)
  • Version 1.3: 55 required / preset, 32 optional (+21, +23)
  • Version 1.4: 101 required / preset, 58 optional (+46, +26)
  • Version 1.5: 157 required / preset, 104 optional (+56, +46)
  • Version 1.6: 195 required / preset, 227 optional (+38, +123)
  • Version 1.7: 215 required / preset, 257 optional (+20, +30)

Or if you want to see it as a chart:

(N.B.: The gaps between releases on the chart do not always reflect the amount of time or the code changes in that version, but rather revisions to the Subversion repository as a whole.  The darker area is the number of required / preset configuration options, and the lighter shaded area the optional settings.  The number of optional settings is a rough calculation, looking in the configuration file for any line that starts with a ‘#’ (a comment) and contains an equals sign.)
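The rough calculation described above can be sketched in a few lines of PHP.  This is a reconstruction for illustration rather than the script actually used to produce the figures: the function name countConfigOptions is hypothetical, and treating every uncommented line containing an equals sign as a required / preset option is an assumption.

[php]
// Hypothetical helper: count settings in the text of a dspace.cfg-style file.
function countConfigOptions($cfgText) {
    $counts = array('preset' => 0, 'optional' => 0);
    foreach (preg_split('/\R/', $cfgText) as $line) {
        $line = trim($line);
        if ($line === '' || strpos($line, '=') === false) {
            continue; // blank lines and lines without '=' are not settings
        }
        if ($line[0] === '#') {
            $counts['optional']++; // commented-out setting
        } else {
            $counts['preset']++;   // active setting
        }
    }
    return $counts;
}
[/php]

Passing in the result of file_get_contents() on a copy of dspace.cfg from each release would give one pair of numbers per version.  Like the original calculation it is only approximate, because ordinary comments that happen to contain an equals sign are counted as optional settings.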

That’s obviously quite a change – from humble beginnings.  I think everyone agrees that something needs to be done to help the system administrator / repository manager navigate and understand the plethora of configuration options.  However there are many different options / preferences / views about how this is best tackled: multiple configuration files, configuration stored in the database, configuration managed via spring services, DSpace installers, etc. One will have to win…

The SWORD course videos now online

I recently blogged about ‘The SWORD Course’, as the slides had been put onto Slideshare.  Thanks to UKOLN’s Adrian Stevenson, the videos are now available too:

  1. An Introduction to SWORD: Gives an overview of SWORD, the rationale behind its creation, and details of the first three funded SWORD projects
  2. SWORD Use Cases: Provides an introduction to use cases, and examines some of the use cases that SWORD can be used for
  3. How SWORD Works: A high level overview of the SWORD protocol, lightly touching on a few technical details in order to explain how it works
  4. SWORD Clients: The reasons for needing SWORD clients are shown, followed by a tour of some of the current SWORD clients
  5. Create Your Own SWORD Client: An overview of the EasyDeposit SWORD client creation toolkit, including the chance to try it out

The complete set of videos can be found at http://vimeo.com/channels/swordapp

The SWORD course slides now online

As part of the JISC-funded SWORD 3 project, I created ‘The SWORD Course’ and presented it during a two hour workshop at the recent Open Repositories 2010 conference in Madrid. The aim of the course was to empower repository managers and repository developers who knew what SWORD was, but who were not yet using it, to be able to go back to their institutions and start using it.
The course, entitled ‘Adding SWORD To Your Repository Armoury’ is made up of 5 modules:
  1. An Introduction to SWORD: Gives an overview of SWORD, the rationale behind its creation, and details of the first three funded SWORD projects
  2. SWORD Use Cases: Provides an introduction to use cases, and examines some of the use cases that SWORD can be used for
  3. How SWORD Works: A high level overview of the SWORD protocol, lightly touching on a few technical details in order to explain how it works
  4. SWORD Clients: The reasons for needing SWORD clients are shown, followed by a tour of some of the current SWORD clients
  5. Create Your Own SWORD Client: An overview of the EasyDeposit SWORD client creation toolkit, including the chance to try it out

The slides from each presentation have now been uploaded to Slideshare with a Creative Commons Attribution NonCommercial Sharealike licence. The workshop was video recorded too, and hopefully this will be posted online soon.


DSpace 1.6 released!

As you will have hopefully read in the publicity that has gone out – DSpace 1.6 is finally released! It has been 9 or 10 months in the making, has involved and benefited from a lot of input from dozens of developers, users, and testers, and contains some fantastic new features. It has been a privilege to co-ordinate the release, and I hope it proves a useful upgrade to current users, and a useful product to new users.

But progress waits for no one – the community is already thinking about what comes next! We’ve seen changes in the way development has taken place for 1.6 (see: ‘On DSpace Development’) and we want to continue to innovate both in terms of the way development is steered, and how it takes place. We’ve got some ‘special meetings’ planned to look at this: how to manage releases and how to manage the release cycle. Exciting times. If you’ve got any interest in DSpace, keep an eye out for these meetings and join in. Associated with this is the call for more committers. If you’re interested in developing DSpace, read the call.

[A small erratum from the DSpace newsletter: It states that the ability for item exports to take place via the user interfaces has been added in 1.6. This facility has been there since 1.5.1 and was written by Scott Philips. What has been added in 1.6 is the ability (making use of some of Scott’s code) for the command line version of the batch importer and exporter to handle zip files of multiple items. This means you can export multiple items to just one zip file, transfer the single file to a new server, and re-import it.]