Daily Archives: July 4, 2009

Direct from MS Word to DSpace via SWORD

As a member of the SWORD project, it has been a great seeing Microsoft’s External Research group integrate SWORD into Word 2007, their Zentity repository, and their online journal hosting system. There is a good overview of this work in a presentation given by Pablo Fernicola at the Open Repositories 2009 conference entitled ‘Connecting Authors and Repositories Through SWORD‘.

This blog post is about the functionality I have added to DSpace to allow it to accept deposits from within Microsoft Word using SWORD.

If you are unaware of the authoring add-in, then before reading the rest of this blog, take a look at Pablo’s YouTube video ‘Integrating with repositories and journal submissions’ at http://www.youtube.com/watch?v=2_M2gfUyVzU. The video explains the authoring add-in, so I’ll not duplicate that information in this blog post. The rest of this post explains how I extended DSpace to work with the add-in…

In order for DSpace to be able to ingest a package, it needs an ingester that understands the format and knows how to unpack it and extract the metadata and file(s). In the case of .docx files created by Microsoft Word, it needs to know how to extract the metadata from within the file, and to archive the file as-is. This is a pretty easy task as a .docx file is actually just a zip file (try renaming it from .docx to .zip and then take a peek inside!). So I wrote an ingester than unzips the file, extracts the NLM metadata that the add-in inserted in the file, and then creates a new DSpace item with that metadata. Finally it adds the complete .docx file as a bitstream for people to download.

Some of the metadata such as the authors identities are held in the .docx file is held in the customXml/item*.xml files, and other parts such as the article title and abstract are held in the actual document contents in word/document.xml. The ingester extracts these values for use in the new DSpace item.

<w:t>Add an S to Microsoft Word and you get SWORD</w:t>
<my:name.>
<my:name.content-type.datatypeattribute.attribute.></my:name.content-type.datatypeattribute.attribute.>
<my:name.name-style.datatypeattribute.attribute.></my:name.name-style.datatypeattribute.attribute.>
<my:surname.>Lewis</my:surname.>
<my:given-names.>Stuart</my:given-names.>
</my:name.>

I then configured the DSpace ingesters to use the docx ingester when it encountered .docx files:

plugin.named.org.dspace.content.packager.PackageIngester = \
org.dspace.content.packager.PDFPackager  = Adobe PDF, PDF, \
org.dspace.content.packager.DSpaceMETSIngester = METS, \
org.dspace.content.packager.DSpaceDocxIngester = DOCX

I then configured the SWORD package to expose the fact that it supported .docx files in its SWORD service document:

sword.accept-packaging.Docx.identifier = application/vnd.openxmlformats-officedocument.wordprocessingml.document
sword.accept-packaging.Docx.q = 1.0

Finally the DSpace SWORD interface needed to know which packager to use for .docx files based on their MIME type:

plugin.named.org.dspace.sword.SWORDIngester = \
org.dspace.sword.SWORDMETSIngester = http://purl.org/net/sword-types/METSDSpaceSIP \
org.dspace.sword.SimpleFileIngester = SimpleFileIngester \
org.dspace.sword.DocxIngester = application/vnd.openxmlformats-officedocument.wordprocessingml.document

All that is needed to use this is a copy of the authoring add-in (http://research.microsoft.com/en-us/projects/authoring/), and a suitable formatted template for the repository that you wish to deposit the document into (dspace-swordapp-org.docx). The template is preconfigured to deposit directly into the DSpace SWORD demo repository which I have upgraded with the new code to accept .docx deposits. Feel free to create an account in that repository, install the add-in, load the template, and try out a deposit!

This complete end to end process allows you to create Word templates, and to mark them up with required and optional fields. It also allows you to embed details of the SWORD deposit repository URL (so the users do not need to know what it is) within the template for easy deposit. This could be used for example for a journal editor to provide a template and a deposit location for new paper submissions all-in-one. And this use case could be extended: for example if a faculty member wants all their students to submit an assignment with a template, they could do so and use the repository as the end point rather than a traditional VLE. And unlike a VLE, the repository will probably provide search and indexing facilities across the deposited documents. I’m sure as this tool gets used more, there will be a lot of new ideas for how it can be used.

Comments welcome! :)