Monthly Archives: July 2009

Email your repository

What modern information handling system do we interact with most each day? For the majority of us, it is probably our email. We send and receive dozens of emails each day. So how about enabling repository deposit via email?

It has certainly been talked about from time to time, and a plugin to the Thunderbird email client has even been written that allows you to deposit attachments into repositories using SWORD (no mention is made of metadata). This plugin should work with any SWORD enabled repository, but only works with one email client. Wouldn’t it be great if there was a more general solution that worked with all repositories, and all email clients?

Now there is! The latest version of the SWORD PHP library (version 0.8) contains an example script showing SWORD and the PHP library in use. To make it work, just fill out a configuration file with your email address, password and IMAP mailbox details, and your repository login, password and deposit URL. After the configuration file has been filled in, all you need to do is run the script ‘imap-mail.php’ on the command line. The script will connect to your mailbox and look at each unread message. It will package each one up and deposit it into the repository.

How does it work?

It uses the standard PHP IMAP library to connect to your inbox. For each unread message it finds, it extracts the name of the sender of the email, the email subject, and the body of the message. It uses these for metadata:

  • From name -> Author
  • Subject -> Title
  • Message body -> Abstract

Along with the metadata, the script adds each email attachment to the deposited item. An example of an email deposited this way can be seen at: http://dspace.swordapp.org/jspui/handle/123456789/318
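
For anyone curious about what that mapping looks like in code, here is a rough sketch in Python rather than PHP. It is not the code from imap-mail.php: the function name and the dictionary keys are my own inventions for illustration, and it works on a raw message that has already been fetched rather than talking to an IMAP server.

```python
from email import message_from_bytes, policy
from email.utils import parseaddr


def email_to_metadata(raw_bytes):
    """Map one raw email message to the fields described above (illustrative only)."""
    msg = message_from_bytes(raw_bytes, policy=policy.default)

    # From name -> Author: use the display name, falling back to the address.
    name, address = parseaddr(str(msg["From"] or ""))
    author = name or address

    # Message body -> Abstract: prefer the plain-text part if there is one.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    abstract = body_part.get_content().strip() if body_part else ""

    # Each attachment becomes a (filename, bytes) pair to add to the deposit.
    attachments = []
    for part in msg.iter_attachments():
        payload = part.get_payload(decode=True) or b""
        attachments.append((part.get_filename() or "attachment", payload))

    return {
        "author": author,
        "title": str(msg["Subject"] or ""),  # Subject -> Title
        "abstract": abstract,
        "attachments": attachments,
    }
```

The resulting dictionary would then be turned into a package and sent to the repository, which is the part the SWORD PHP library handles in the real script.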

If you want to try it out, but don’t want to set it all up, for a limited time I’ll leave it running against the deposit@swordapp.org mailbox. Send an email to deposit@swordapp.org and when I periodically run the script, your email will be deposited into the test DSpace/SWORD repository at http://dspace.swordapp.org/. Please bear in mind that the repository is open access to the world, so anyone can see what your email and optional attachments contain! (Please consider working hours / timezones etc when working out when I am likely to next run the script! You will receive a confirmation email when your deposit has taken place.)

A few further thoughts:

  • What is the use of this script? I can think of a few. If you want to allow faculty members to deposit full texts easily, and you are happy to update the record with better metadata, this script could help you out. Alternatively if you have a system that for example generates weekly reports, and you want to archive these, you may find it easier to get the system to email the reports to the SWORD script than to develop a SWORD deposit interface to the system.
  • n for the price of 1: As far as I know, none of the major SWORD enabled repository platforms has an email deposit facility. This script shows one aspect of the real power of interoperability and SWORD’s part in creating an interoperable environment. Rather than developing an email deposit facility for just one repository, I have developed one for any SWORD compliant repository.
  • Extensions to the script: The following idea came up in a conversation with Kim Shepherd, the LCoNZ DSpace programmer, who suggested that the ‘local part’ of an email address could be used to set further options. Email addresses can have extra tags applied following a ‘+’ or ‘-’ (for example username-XYZ@example.com). So if, for example, you wanted your users to be able to choose which collection in the repository they wanted to deposit an item into, they could change the deposit email to something like deposit+datasets@example.com or deposit-chemistrylearningobjects@example.com. These emails would all end up in the same mailbox, but the script could process them differently. Or of course the parameters could be used to set other options.
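
The address tagging idea in the last point could be sketched roughly as follows. This is Python rather than PHP and is not part of the current script; the function names, the collection table and the deposit URLs are all hypothetical placeholders.

```python
def split_deposit_address(address, separators="+-"):
    """Split 'deposit+datasets@example.com' into (mailbox, tag, domain)."""
    local_part, _, domain = address.rpartition("@")
    # Cut at the earliest '+' or '-' in the local part, if there is one.
    cuts = [local_part.index(sep) for sep in separators if sep in local_part]
    if not cuts:
        return local_part, None, domain
    cut = min(cuts)
    return local_part[:cut], local_part[cut + 1:], domain


# Hypothetical tag-to-collection table; these URLs are placeholders only.
COLLECTION_URLS = {
    "datasets": "http://repository.example.com/sword/deposit/datasets",
    "chemistrylearningobjects": "http://repository.example.com/sword/deposit/chemistry",
}
DEFAULT_URL = "http://repository.example.com/sword/deposit/general"


def deposit_url_for(address):
    """Pick a deposit URL from the tag, falling back to a default collection."""
    _, tag, _ = split_deposit_address(address)
    return COLLECTION_URLS.get(tag, DEFAULT_URL)
```

One thing to watch with this approach: a mailbox name that itself contains a hyphen would be split at that hyphen, so in practice you might only want to honour one of the two separators.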

If I get time, my next extension to the PHP SWORD library will be a basic web client (similar to http://client.swordapp.org/, except that it will be written in PHP and will create packages from files for you). If you have any other suggestions, please leave a comment!

Direct from MS Word to DSpace via SWORD

As a member of the SWORD project, it has been great seeing Microsoft’s External Research group integrate SWORD into Word 2007, their Zentity repository, and their online journal hosting system. There is a good overview of this work in a presentation given by Pablo Fernicola at the Open Repositories 2009 conference entitled ‘Connecting Authors and Repositories Through SWORD’.

This blog post is about the functionality I have added to DSpace to allow it to accept deposits from within Microsoft Word using SWORD.

If you are unaware of the authoring add-in, then before reading the rest of this blog, take a look at Pablo’s YouTube video ‘Integrating with repositories and journal submissions’ at http://www.youtube.com/watch?v=2_M2gfUyVzU. The video explains the authoring add-in, so I’ll not duplicate that information in this blog post. The rest of this post explains how I extended DSpace to work with the add-in…

In order for DSpace to be able to ingest a package, it needs an ingester that understands the format and knows how to unpack it and extract the metadata and file(s). In the case of .docx files created by Microsoft Word, it needs to know how to extract the metadata from within the file, and to archive the file as-is. This is a pretty easy task as a .docx file is actually just a zip file (try renaming it from .docx to .zip and then take a peek inside!). So I wrote an ingester that unzips the file, extracts the NLM metadata that the add-in inserted in the file, and then creates a new DSpace item with that metadata. Finally it adds the complete .docx file as a bitstream for people to download.
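
If you would rather peek inside programmatically than rename the file, a few lines of Python will do it. This is only a sketch to show that a .docx really is a zip archive; the DSpace ingester itself is written in Java, and the function name here is made up for illustration.

```python
import zipfile


def read_docx_xml_parts(docx_file):
    """Return {member name: raw bytes} for every .xml member of a .docx file.

    `docx_file` may be a path or a file-like object opened in binary mode.
    """
    parts = {}
    with zipfile.ZipFile(docx_file) as archive:
        for name in archive.namelist():
            if name.endswith(".xml"):
                parts[name] = archive.read(name)
    return parts
```

Calling sorted(read_docx_xml_parts("paper.docx")) on a real document (the file name is just an example) lists the XML parts that make up the file.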

Some of the metadata, such as the authors’ identities, is held in the customXml/item*.xml files inside the .docx file, and other parts, such as the article title and abstract, are held in the actual document contents in word/document.xml. The ingester extracts these values for use in the new DSpace item.

<w:t>Add an S to Microsoft Word and you get SWORD</w:t>
<my:name.>
<my:name.content-type.datatypeattribute.attribute.></my:name.content-type.datatypeattribute.attribute.>
<my:name.name-style.datatypeattribute.attribute.></my:name.name-style.datatypeattribute.attribute.>
<my:surname.>Lewis</my:surname.>
<my:given-names.>Stuart</my:given-names.>
</my:name.>
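
To give a feel for how values like these can be pulled out, here is a small Python sketch (again, not the Java code in the ingester). It matches elements by their local name, so it does not need to know the namespace URIs behind the w: and my: prefixes. The function names are hypothetical, and it simply collects every matching element; how the real ingester decides which text is the title or abstract is not shown here.

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def texts_by_local_name(xml_bytes, wanted):
    """Collect the non-empty text of every element whose local name is `wanted`."""
    root = ET.fromstring(xml_bytes)
    return [
        el.text.strip()
        for el in root.iter()
        if isinstance(el.tag, str)
        and local_name(el.tag) == wanted
        and el.text
        and el.text.strip()
    ]


def extract_authors(item_xml):
    """Pair up given names and surnames, assuming they appear in matching order."""
    given = texts_by_local_name(item_xml, "given-names.")
    surnames = texts_by_local_name(item_xml, "surname.")
    return list(zip(given, surnames))
```

Note that the element names in the custom XML end with a full stop, as in the sample above, so the names passed in have to include it.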

I then configured the DSpace ingesters to use the docx ingester when it encountered .docx files:

plugin.named.org.dspace.content.packager.PackageIngester = \
org.dspace.content.packager.PDFPackager  = Adobe PDF, PDF, \
org.dspace.content.packager.DSpaceMETSIngester = METS, \
org.dspace.content.packager.DSpaceDocxIngester = DOCX

I then configured the SWORD package to expose the fact that it supported .docx files in its SWORD service document:

sword.accept-packaging.Docx.identifier = application/vnd.openxmlformats-officedocument.wordprocessingml.document
sword.accept-packaging.Docx.q = 1.0

Finally the DSpace SWORD interface needed to know which packager to use for .docx files based on their MIME type:

plugin.named.org.dspace.sword.SWORDIngester = \
org.dspace.sword.SWORDMETSIngester = http://purl.org/net/sword-types/METSDSpaceSIP \
org.dspace.sword.SimpleFileIngester = SimpleFileIngester \
org.dspace.sword.DocxIngester = application/vnd.openxmlformats-officedocument.wordprocessingml.document

All that is needed to use this is a copy of the authoring add-in (http://research.microsoft.com/en-us/projects/authoring/), and a suitably formatted template for the repository that you wish to deposit the document into (dspace-swordapp-org.docx). The template is preconfigured to deposit directly into the DSpace SWORD demo repository which I have upgraded with the new code to accept .docx deposits. Feel free to create an account in that repository, install the add-in, load the template, and try out a deposit!

This complete end-to-end process allows you to create Word templates, and to mark them up with required and optional fields. It also allows you to embed details of the SWORD deposit repository URL (so the users do not need to know what it is) within the template for easy deposit. This could be used, for example, by a journal editor to provide a template and a deposit location for new paper submissions all-in-one. And this use case could be extended: for example, if a faculty member wants all their students to submit an assignment with a template, they could do so and use the repository as the end point rather than a traditional VLE. And unlike a VLE, the repository will probably provide search and indexing facilities across the deposited documents. I’m sure as this tool gets used more, there will be a lot of new ideas for how it can be used.

Comments welcome! :)