Direct from MS Word to DSpace via SWORD

As a member of the SWORD project, I have found it great to see Microsoft’s External Research group integrate SWORD into Word 2007, their Zentity repository, and their online journal hosting system. There is a good overview of this work in a presentation given by Pablo Fernicola at the Open Repositories 2009 conference entitled ‘Connecting Authors and Repositories Through SWORD’.

This blog post is about the functionality I have added to DSpace to allow it to accept deposits from within Microsoft Word using SWORD.

If you are unaware of the authoring add-in, then before reading the rest of this blog, take a look at Pablo’s YouTube video ‘Integrating with repositories and journal submissions’ at http://www.youtube.com/watch?v=2_M2gfUyVzU. The video explains the authoring add-in, so I’ll not duplicate that information in this blog post. The rest of this post explains how I extended DSpace to work with the add-in…

In order for DSpace to be able to ingest a package, it needs an ingester that understands the format and knows how to unpack it and extract the metadata and file(s). In the case of .docx files created by Microsoft Word, it needs to know how to extract the metadata from within the file, and to archive the file as-is. This is a pretty easy task as a .docx file is actually just a zip file (try renaming it from .docx to .zip and then take a peek inside!). So I wrote an ingester that unzips the file, extracts the NLM metadata that the add-in inserted in the file, and then creates a new DSpace item with that metadata. Finally it adds the complete .docx file as a bitstream for people to download.
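
As a quick illustration of the zip-container point, the Python sketch below builds a tiny stand-in archive in memory and lists its parts. It is not the DSpace ingester: `list_docx_parts` and `make_standin_docx` are hypothetical helpers, and the stand-in archive only mimics the part names discussed in this post (it is not a valid Word document).

```python
import io
import zipfile


def list_docx_parts(docx_bytes):
    """Return the names of the parts stored inside a .docx (zip) container."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.namelist()


def make_standin_docx():
    """Build a minimal stand-in archive (hypothetical content, for illustration only)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("word/document.xml", "<document/>")
        archive.writestr("customXml/item1.xml", "<article/>")
    return buffer.getvalue()


if __name__ == "__main__":
    for name in list_docx_parts(make_standin_docx()):
        print(name)
```

Pointing `list_docx_parts` at the bytes of a real .docx file should show the same kind of layout, with many more parts.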

Some of the metadata, such as the authors’ identities, is held in the customXml/item*.xml files inside the .docx file, while other parts, such as the article title and abstract, are held in the actual document contents in word/document.xml. The ingester extracts these values for use in the new DSpace item.

<w:t>Add an S to Microsoft Word and you get SWORD</w:t>
<my:name.>
<my:name.content-type.datatypeattribute.attribute.></my:name.content-type.datatypeattribute.attribute.>
<my:name.name-style.datatypeattribute.attribute.></my:name.name-style.datatypeattribute.attribute.>
<my:surname.>Lewis</my:surname.>
<my:given-names.>Stuart</my:given-names.>
</my:name.>
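
To give a feel for this extraction step, here is a minimal Python sketch that pulls the title text and the author name out of fragments like the ones above. It is only an illustration, not the ingester's actual code: the URI bound to the `my` prefix is a placeholder (the real one is not given in this post), the fragments are trimmed, and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

# The WordprocessingML namespace used by word/document.xml.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Placeholder only: the real URI behind the "my" prefix is not shown in this post.
MY_NS = "urn:example:my"

TITLE_XML = (
    f'<w:p xmlns:w="{W_NS}"><w:r>'
    "<w:t>Add an S to Microsoft Word and you get SWORD</w:t>"
    "</w:r></w:p>"
)

AUTHOR_XML = f"""<my:name. xmlns:my="{MY_NS}">
  <my:surname.>Lewis</my:surname.>
  <my:given-names.>Stuart</my:given-names.>
</my:name.>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_text_runs(xml_text):
    """Concatenate the text of every w:t element in a document fragment."""
    root = ET.fromstring(xml_text)
    return "".join(el.text or "" for el in root.iter(f"{{{W_NS}}}t"))


def extract_author(xml_text):
    """Return 'Surname, Given names', matching on local element names only."""
    root = ET.fromstring(xml_text)
    fields = {local_name(el.tag): (el.text or "").strip() for el in root.iter()}
    return f"{fields.get('surname.', '')}, {fields.get('given-names.', '')}"


if __name__ == "__main__":
    print(extract_text_runs(TITLE_XML))
    print(extract_author(AUTHOR_XML))
```

Matching on local names sidesteps the unknown namespace URI; a real ingester would more likely match on the full qualified names.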

I then configured the DSpace ingesters to use the docx ingester when it encountered .docx files:

plugin.named.org.dspace.content.packager.PackageIngester = \
org.dspace.content.packager.PDFPackager  = Adobe PDF, PDF, \
org.dspace.content.packager.DSpaceMETSIngester = METS, \
org.dspace.content.packager.DSpaceDocxIngester = DOCX

I then configured the SWORD package to expose the fact that it supported .docx files in its SWORD service document:

sword.accept-packaging.Docx.identifier = application/vnd.openxmlformats-officedocument.wordprocessingml.document
sword.accept-packaging.Docx.q = 1.0
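
A client could check for this by reading the service document. The Python sketch below parses a hand-written sample (not captured from a real server) and lists the advertised packaging formats. It assumes the SWORD 1.3 layout as I understand it, with `sword:acceptPackaging` elements carrying a `q` attribute inside each collection; the collection URL and the helper name `accepted_packaging` are illustrative only.

```python
import xml.etree.ElementTree as ET

SWORD_NS = "http://purl.org/net/sword/"

# Hand-written sample for illustration; a real service document has more fields.
SAMPLE_SERVICE_DOCUMENT = """<service xmlns="http://www.w3.org/2007/app"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:sword="http://purl.org/net/sword/">
  <workspace>
    <collection href="http://localhost:8080/sword/deposit/123456789/2">
      <atom:title>Example collection</atom:title>
      <sword:acceptPackaging q="1.0">http://purl.org/net/sword-types/METSDSpaceSIP</sword:acceptPackaging>
      <sword:acceptPackaging q="1.0">application/vnd.openxmlformats-officedocument.wordprocessingml.document</sword:acceptPackaging>
    </collection>
  </workspace>
</service>"""


def accepted_packaging(service_document_xml):
    """Return (identifier, q) pairs for every sword:acceptPackaging element."""
    root = ET.fromstring(service_document_xml)
    formats = []
    for element in root.iter(f"{{{SWORD_NS}}}acceptPackaging"):
        identifier = (element.text or "").strip()
        quality = float(element.get("q", "1.0"))
        formats.append((identifier, quality))
    return formats


if __name__ == "__main__":
    for identifier, quality in accepted_packaging(SAMPLE_SERVICE_DOCUMENT):
        print(quality, identifier)
```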

Finally the DSpace SWORD interface needed to know which packager to use for .docx files based on their MIME type:

plugin.named.org.dspace.sword.SWORDIngester = \
org.dspace.sword.SWORDMETSIngester = http://purl.org/net/sword-types/METSDSpaceSIP \
org.dspace.sword.SimpleFileIngester = SimpleFileIngester \
org.dspace.sword.DocxIngester = application/vnd.openxmlformats-officedocument.wordprocessingml.document
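
For context, this is roughly what a matching deposit request might look like from the client side. The Python sketch below only builds the request object and does not send it. The deposit URL is a local example, `build_deposit_request` is a hypothetical helper, and the `Content-Disposition: filename=...` form reflects my understanding of SWORD 1.3 rather than anything specific to the add-in; authentication is left out.

```python
import urllib.request

DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def build_deposit_request(deposit_url, docx_bytes, filename):
    """Build (but do not send) an HTTP POST carrying a .docx file.

    The Content-Type value is the one the SWORD ingester mapping above keys on.
    """
    headers = {
        "Content-Type": DOCX_MIME,
        "Content-Disposition": f"filename={filename}",
    }
    return urllib.request.Request(
        deposit_url, data=docx_bytes, headers=headers, method="POST"
    )


if __name__ == "__main__":
    request = build_deposit_request(
        "http://localhost:8080/sword/deposit/123456789/2",  # example URL only
        b"placeholder bytes, not a real document",
        "paper.docx",
    )
    print(request.get_method(), request.full_url)
    # urllib stores header names capitalised as "Content-type".
    print(request.get_header("Content-type"))
```

Sending the request would additionally need credentials for the repository, for example with `urllib.request.urlopen` and an `Authorization` header.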

All that is needed to use this is a copy of the authoring add-in (http://research.microsoft.com/en-us/projects/authoring/), and a suitably formatted template for the repository that you wish to deposit the document into (dspace-swordapp-org.docx). The template is preconfigured to deposit directly into the DSpace SWORD demo repository, which I have upgraded with the new code to accept .docx deposits. Feel free to create an account in that repository, install the add-in, load the template, and try out a deposit!

This complete end-to-end process allows you to create Word templates, and to mark them up with required and optional fields. It also allows you to embed the SWORD deposit URL of the repository within the template (so the users do not need to know what it is) for easy deposit. A journal editor could use this, for example, to provide a template and a deposit location for new paper submissions all in one. And this use case could be extended: for example, if a faculty member wants all their students to submit an assignment using a template, they could do so and use the repository as the end point rather than a traditional VLE. And unlike a VLE, the repository will probably provide search and indexing facilities across the deposited documents. I’m sure that as this tool gets used more, there will be a lot of new ideas for how it can be used.

Comments welcome! :)

24 thoughts on “Direct from MS Word to DSpace via SWORD”

  1. Pingback: Noticias Edición Digital » Blog Archive » Word + SWORD + Ingester = Word to DSpace Deposit

  2. Bharat

    I want to configure SWORD with my DSpace 1.5.1 instance. How is that possible?
    Please guide me.

  3. stuart Post author

    Within [dspace]/webapps/ there should be a ‘sword’ directory. Publish this in Tomcat like you do with jspui/xmlui and oai directories. If you have specific questions about how to use SWORD, try sending them to the dspace-tech email list or to the swordapp-tech email list as there are many SWORD experts on those lists. If you are able to upgrade to DSpace 1.5.2 then you will be able to use the latest version of SWORD.

  4. Steve

    Very cool! Can you point me in the right direction to change the Dspace URL in the template?

  5. stuart Post author

    Hi Steve.

    Follow this method for editing the .docx files, as it fixes an issue that can occur when unzipping and rezipping the file: http://www.bizsupportonline.net/blog/2008/12/office-open-xml-file-cannot-opened-problems-contents/

    Once you are inside the .docx, look for the customXml directory, and find the infoX.xml (where X is a number, which varies from time to time) which contains a line similar to:

    <article ms:DepositURL="http://localhost:8080/sword/deposit/123456789/2" ms:JournalName="test" ms:SignupURL="http://localhost:8080/jspui/" ms:PasswordRequired="True" ms:PreferredFormat="docx" ms:SupportedFormats="docx" ms:Category="" ms:SubCategory="">

    Edit as appropriate, and re-save.
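
If you would rather script the edit than do it by hand, something along these lines might work. It is an untested-against-Word sketch, not a supported tool: `set_deposit_url` is a hypothetical helper, it assumes the customXml part is UTF-8 encoded, and it simply rewrites every part into a fresh zip, so the unzip/rezip issue described in the link above may still apply.

```python
import html
import io
import re
import zipfile

# Matches the ms:DepositURL attribute as it appears in the line shown above.
DEPOSIT_URL_ATTR = re.compile(r'(ms:DepositURL=")[^"]*(")')


def set_deposit_url(docx_bytes, new_url):
    """Return a copy of the .docx with ms:DepositURL replaced in customXml parts."""
    escaped_url = html.escape(new_url, quote=True)  # keep the attribute well-formed
    output = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as source, \
            zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for info in source.infolist():
            data = source.read(info.filename)
            if info.filename.startswith("customXml/") and b"ms:DepositURL" in data:
                text = data.decode("utf-8")  # assumption: the part is UTF-8
                text = DEPOSIT_URL_ATTR.sub(
                    lambda match: match.group(1) + escaped_url + match.group(2),
                    text,
                )
                data = text.encode("utf-8")
            target.writestr(info.filename, data)
    return output.getvalue()


def read_part(docx_bytes, part_name):
    """Read one part of the archive as text (helper for checking the result)."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        return archive.read(part_name).decode("utf-8")
```

Keep a backup of the original template, and check that Word still opens the result before handing it out.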

    Hope that helps,

    Stuart

  6. jonathan

    Hello
    I want to configure the SWORD interface on our DSpace 1.4.2.
    I integrated the SWORD classes and changed my Word file the way you suggest.
    When I upload my paper, it is correctly deposited into DSpace.
    DSpace answers the following to Word:

    <?xml version="1.0" encoding="UTF-8"?>
    <atom:entry xmlns:atom="http://www.w3.org/2005/Atom" xmlns:sword="http://purl.org/net/sword/">
    <atom:id>http://hdl.handle.net/2013/177</atom:id>
    <atom:author>
    <atom:name>jonathan</atom:name>
    </atom:author>
    <atom:content type="application/vnd.openxmlformats-officedocument.wordprocessingml.document" src="https://localhost:8443/dspace/bitstream/2013/177/2/dspace-swordapp-org.docx"/>
    <atom:generator uri="http://www.dspace.org/ns/sword/1.3.1" version="1.3"/>
    <atom:link href="https://localhost:8443/dspace/bitstream/2013/177/1/dspace-swordapp-org.docx" rel="part" type="application/vnd.openxmlformats-officedocument.wordprocessingml.document"/>
    <atom:link href="http://hdl.handle.net/2013/177" rel="alternate" type="text/html"/>
    <atom:published>2010-05-27T08:19:16Z</atom:published>
    <atom:rights type="text"/>
    <atom:summary type="text">This paper examines the new features in Microsoft Word to upload using SWORD.</atom:summary>
    <atom:title type="text">Add an S to Microsoft Word and you get SWORD</atom:title>
    <atom:updated>2010-05-27T08:19:16Z</atom:updated>
    <atom:category>ULBREFONLY ULBPUB</atom:category>
    <sword:treatment>The file has been stored ‘as deposited’, with the metadata extracted from the file</sword:treatment>
    <sword:verboseDescription/>
    <sword:noOp>false</sword:noOp>
    <sword:packaging/>
    </atom:entry>

    However MS-Word always returns the following message:

    Unable to Upload document.
    Details:
    The value can not be null.
    Parameter name: uriString

    I have been looking into this problem for the past few days, but can’t find out why MS Word answers this. Maybe it is linked to the fact that I want to run this on the older 1.4.2 version of DSpace? Can you help me out please?

    Thank you in advance,
    Jonathan

  7. Stuart Post author

    Hi Jonathan,

    The problem could be related to the version of DSpace you are using. The latest version of SWORD (version 1.3) was introduced in DSpace 1.5.2. If you have a chance to try a DSpace 1.5.2 / 1.6.0 or 1.6.1 version, then it might work.

    Thanks,

    Stuart

  8. Hrvoje

    Dear Stuart,

    I am using the latest DSpace 1.6.2 and can’t get the repository to accept docx through Word.
    I have followed your instructions closely several times.
    Here is the log of my error:
    2010-10-25 00:38:15,218 INFO org.dspace.sword.SWORDService @ [2010-10-25 00:38:15.218] Authenticated user: hrvoje@zsem.hr;
    2010-10-25 00:38:15,218 INFO org.dspace.sword.SWORDService @ [2010-10-25 00:38:15.218] Initialising depositor for an Item in a Collection;
    2010-10-25 00:38:15,218 ERROR org.dspace.sword.CollectionDepositor @ Unacceptable content type detected: application/vnd.openxmlformats-officedocument.wordprocessingml.document for collection 2
    2010-10-25 00:38:15,218 ERROR org.purl.sword.server.DepositServlet @ org.purl.sword.base.SWORDErrorException: Unacceptable content type in deposit request: application/vnd.openxmlformats-officedocument.wordprocessingml.document

    I made a fresh build after that, but no luck. I can successfully submit through Word to your demo repository, but when I try depositing into mine, I get 415 – unsupported media type.
    I can successfully deposit a docx over Facebook (http://apps.facebook.com/swordapp/deposit/start/),
    here is the log:
    2010-10-25 00:36:17,828 INFO org.dspace.search.DSIndexer @ Wrote Item: 123456789/8 to Index
    2010-10-25 00:36:17,828 INFO org.purl.sword.base.DepositResponse @

    http://hdl.handle.net/123456789/8

    hrvoje@zsem.hr

    2010-10-25
    http://193.198.217.97:8080/xmlui/bitstream/123456789/8/2/license.txt
    gfgfd
    vx
    2010-10-24T22:36:17Z
    The package has been deposited into DSpace. Each file has been unpacked and provided with a unique identifier. The metadata in the manifest has been extracted and attached to the DSpace item, which has been provided with an identifier leading to an HTML splash page.

    false
    SWORDAPP PHP library (version 0.9) http://php.swordapp.org/
    http://purl.org/net/sword-types/METSDSpaceSIP

    I can only assume that the whole thing doesn’t work on this version of DSpace.
    Here is the SWORD part of my dspace.cfg:

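    #--------------SWORD SPECIFIC CONFIGURATIONS--------------------#
    #---------------------------------------------------------------#
    # These configs are only used by the SWORD interface            #
    #---------------------------------------------------------------#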

    # tell the SWORD METS implementation which package ingester to use
    # to install deposited content. This should refer to one of the
    # classes configured for:
    #
    # plugin.named.org.dspace.content.packager.PackageIngester
    #
    # The value of sword.mets-ingester.package-ingester tells the
    # system which named plugin for this interface should be used
    # to ingest SWORD METS packages
    #
    # The default is METS
    #
    # sword.mets-ingester.package-ingester = METS

    # Define the metadata type EPDCX (EPrints DC XML)
    # to be handled by the SWORD crosswalk configuration
    #
    mets.submission.crosswalk.EPDCX = SWORD

    # define the stylesheet which will be used by the self-named
    # XSLTIngestionCrosswalk class when asked to load the SWORD
    # configuration (as specified above). This will use the
    # specified stylesheet to crosswalk the incoming SWAP metadata
    # to the DIM format for ingestion
    #
    crosswalk.submission.SWORD.stylesheet = crosswalks/sword-swap-ingest.xsl

    # The base URL of the SWORD deposit. This is the URL from
    # which DSpace will construct the deposit location urls for
    # collections.
    #
    # The default is {dspace.url}/sword/deposit
    #
    # In the event that you are not deploying DSpace as the ROOT
    # application in the servlet container, this will generate
    # incorrect URLs, and you should override the functionality
    # by specifying in full as below:
    #
    # sword.deposit.url = http://www.myu.ac.uk/sword/deposit

    # The base URL of the SWORD service document. This is the
    # URL from which DSpace will construct the service document
    # location urls for the site, and for individual collections
    #
    # The default is {dspace.url}/sword/servicedocument
    #
    # In the event that you are not deploying DSpace as the ROOT
    # application in the servlet container, this will generate
    # incorrect URLs, and you should override the functionality
    # by specifying in full as below:
    #
    # sword.servicedocument.url = http://www.myu.ac.uk/sword/servicedocument

    # The base URL of the SWORD media links. This is the URL
    # which DSpace will use to construct the media link urls
    # for items which are deposited via sword
    #
    # The default is {dspace.url}/sword/media-link
    #
    # In the event that you are not deploying DSpace as the ROOT
    # application in the servlet container, this will generate
    # incorrect URLs, and you should override the functionality
    # by specifying in full as below:
    #
    # sword.media-link.url = http://www.myu.ac.uk/sword/media-link

    # The URL which identifies the sword software which provides
    # the sword interface. This is the URL which DSpace will use
    # to fill out the atom:generator element of its atom documents.
    #
    # The default is:
    #
    # http://www.dspace.org/ns/sword/1.3.1
    #
    # If you have modified your sword software, you should change
    # this URI to identify your own version. If you are using the
    # standard dspace-sword module you will not, in general, need
    # to change this setting
    #
    # sword.generator.url = http://www.dspace.org/ns/sword/1.3.1

    # The metadata field in which to store the updated date for
    # items deposited via SWORD.
    #
    sword.updated.field = dc.date.updated

    # The metadata field in which to store the value of the slug
    # header if it is supplied
    #
    sword.slug.field = dc.identifier.slug

    # The accept packaging properties, along with their associated
    # quality values where appropriate.
    #
    # Global settings; these will be used on all DSpace collections
    #
    sword.accept-packaging.METSDSpaceSIP.identifier = http://purl.org/net/sword-types/METSDSpaceSIP
    sword.accept-packaging.METSDSpaceSIP.q = 1.0

    sword.accept-packaging.Docx.identifier = application/vnd.openxmlformats-officedocument.wordprocessingml.document
    sword.accept-packaging.Docx.q = 1.0

    # A comma separated list of MIME types that SWORD will accept
    sword.accepts = application/zip

    # Collection Specific settings: these will be used on the collections
    # with the given handles
    #
    # sword.accept-packaging.[handle].METSDSpaceSIP.identifier = http://purl.org/net/sword-types/METSDSpaceSIP
    # sword.accept-packaging.[handle].METSDSpaceSIP.q = 1.0

    # Should the server offer up items in collections as sword deposit
    # targets. This will be effected by placing a URI in the collection
    # description which will list all the allowed items for the depositing
    # user in that collection on request
    #
    # NOTE: this will require an implementation of deposit onto items, which
    # will not be forthcoming for a short while
    #
    sword.expose-items = false

    # Should the server offer as the default the list of all Communities
    # to a Service Document request. If false, the server will offer
    # the list of all collections, which is the default and recommended
    # behaviour at this stage.
    #
    # NOTE: a service document for Communities will not offer any viable
    # deposit targets, and the client will need to request the list of
    # Collections in the target before deposit can continue
    #
    sword.expose-communities = false

    # The maximum upload size of a package through the sword interface,
    # in bytes
    #
    # This will be the combined size of all the files, the metadata and
    # any manifest data. It is NOT the same as the maximum size set
    # for an individual file upload through the user interface. If not
    # set, or set to 0, the sword service will default to no limit.
    #
    sword.max-upload-size = 0

    # Should DSpace store a copy of the original sword deposit package?
    #
    # NOTE: this will cause the deposit process to run slightly slower,
    # and will accelerate the rate at which the repository consumes disk
    # space. BUT, it will also mean that the deposited packages are
    # recoverable in their original form. It is strongly recommended,
    # therefore, to leave this option turned on
    #
    # When set to “true”, this requires that the configuration option
    # “upload.temp.dir” above is set to a valid location
    #
    sword.keep-original-package = true

    # The bundle name that SWORD should store incoming packages under if
    # sword.keep-original-package is set to true. The default is “SWORD”
    # if not value is set
    #
    # sword.bundle.name = SWORD

    # Should the server identify the sword version in deposit response?
    #
    # It is recommended to leave this enabled.
    #
    sword.identify-version = true

    # Should we support mediated deposit via sword? Enabled, this will
    # allow users to deposit content packages on behalf of other users.
    #
    # See the SWORD specification for a detailed explanation of deposit
    # On-Behalf-Of another user
    #
    sword.on-behalf-of.enable = true

    # Configure the plugins to process incoming packages. The form of this
    # configuration is as per the Plugin Manager’s Named Plugin documentation:
    #
    # plugin.named.[interface] = [implementation] = [package format identifier] \
    #
    # Package ingesters should implement the SWORDIngester interface, and
    # will be loaded when a package of the format specified above in:
    #
    # sword.accept-packaging.[package format].identifier = [package format identifier]
    #
    # is received.
    #
    # In the event that this is a simple file deposit, with no package
    # format, then the class named by “SimpleFileIngester” will be loaded
    # and executed where appropriate. This case will only occur when a single
    # file is being deposited into an existing DSpace Item
    #
    plugin.named.org.dspace.sword.SWORDIngester = \
    org.dspace.sword.SWORDMETSIngester = http://purl.org/net/sword-types/METSDSpaceSIP \
    org.dspace.sword.SimpleFileIngester = SimpleFileIngester \
    org.dspace.sword.DocxIngester = application/vnd.openxmlformats-officedocument.wordprocessingml.document

    Any help would be highly appreciated!
    Cheers!
    hrvoje

  9. Stuart Post author

    Hi hrvoje,

    Have you got the following section in dspace.cfg?

    plugin.named.org.dspace.content.packager.PackageIngester = \
    org.dspace.content.packager.PDFPackager = Adobe PDF, PDF, \
    org.dspace.content.packager.DSpaceMETSIngester = METS, \
    org.dspace.content.packager.DSpaceDocxIngester = DOCX

    (The first three lines probably exist already, you’ll need to add the final line, and the ‘,\’ to the end of the METS line)

    Thanks,

    Stuart

  10. Hrvoje

    Hi Stuart,
    Yes:
    # Packager Plugins:
    plugin.named.org.dspace.content.packager.PackageDisseminator = \
    org.dspace.content.packager.DSpaceMETSDisseminator = METS

    plugin.named.org.dspace.content.packager.PackageIngester = \
    org.dspace.content.packager.PDFPackager = Adobe PDF, PDF, \
    org.dspace.content.packager.DSpaceMETSIngester = METS, \
    org.dspace.content.packager.DSpaceDocxIngester = DOCX
    I have also followed DS-316 and DS-325 closely and added and deleted all the “+” and “-” lines.

    The only thing I had missed was to set:
    sword.accepts = application/zip, application/vnd.openxmlformats-officedocument.wordprocessingml.document
    instead of:
    sword.accepts = application/zip

    I have added it and now it works! :)
    http://193.198.217.97:8080/jspui/handle/123456789/9?mode=full&submit_simple=Show+full+item+record

    Actually, in both cases (on the SWORD demo repository and on my repository) Word throws this error at the end of depositing:
    “Unexpected end of file while parsing Name has occurred. Line 1, position 8.”
    But the document ends up properly deposited anyway. I changed the docx according to the directions, so I don’t think I broke it. Even your original Word template ends up with the same error.
    But I guess that is a Word add-in problem.

    Thank you for your help!
    hrvoje

  11. Stuart Post author

    Hi Kavari,

    It uses the value of dspace.url from dspace.cfg. However, dspace.url often makes use of the value of dspace.baseUrl. So make sure you use these settings:

    # DSpace base host URL. Include port number etc.
    dspace.baseUrl = http://mydspace.com

    # DSpace base URL. Include port number etc., but NOT trailing slash
    # Change to xmlui if you wish to use the xmlui as the default, or remove
    # “/jspui” and set webapp of your choice as the “ROOT” webapp in
    # the servlet engine.
    dspace.url = ${dspace.baseUrl}/jspui

    Thanks,

    Stuart

  12. Pavel Mika

    Hello,

    I have been trying to implement deposit to a DSpace repository through the SWORD interface, but
    I keep getting this error:

    415 Unsupported Media type
    When I look in dspace.log, I see this:

    org.dspace.sword.CollectionDepositor @ Unacceptable content type detected: null for collection 1

    I have tried supplying several different Content-Type values but nothing changes: it always reports null. It is also strange that SWORD logs the content type as null instead of the value provided (like text/richtext or the long identifier for docx).
    I have DSpace 1.6.2, and uploading through the JSPUI works without a problem.
    It is also possible to deposit a document into an existing item through SWORD as an additional file, but not into the collection itself, as that gives the error.
    I have added further header values such as Content-MD5 without any change, except that if I send a wrong MD5 sum, I get a checksum error response.
    It seems the interface doesn’t parse the HttpWebRequest properly. Any idea where the problem could be?

  13. Pavel Mika

    You were right: I have packed the files in the proper format, and it is now possible to deposit those packages.
    I still have a problem with the mets.xml file.

    If I use the EPDCX METS format, I am not able to assign keywords to the document (dc.subject in DSpace); the EPDCX schema doesn’t support it.

    If I use the MODS format, I have the keywords available (listed as subjects), but for some reason I am not able to assign the type (dc.type in DSpace) in any way. I have tried different parameters, but either they don’t display in DSpace, or DSpace gives internal server error 500 – wrong argument.

    You can see the mets.xml I use below:
    ———————————-

    author

    AUTHOR

    text
    technical reports – internal server error 500 in dspace

    PUBLISHER

    ABSTRACT

    en_US

    KEYWORDS

    TITLE

    thesis – ignored in dspace

  14. Pavel Mika

    Small update: EPDCX does have a subject element in the schema (dc.subject, used for DSpace keywords), but for some reason it is not parsed by DSpace.

    In the MODS metadata, I have tried several combinations of <mods:genre>text</mods:genre> and
    <mods:typeOfResource>software, multimedia</mods:typeOfResource>

    with different type values, and with dct or usecollection attributes, without any change. It gets parsed without error, but the DSpace type element is still empty.

    I have tried to implement a JSON metadata file, but it is not well documented and I couldn’t find the proper settings in dspace.cfg or the HTTP headers to send to SWORD.

    Is there any other metadata format I can use with SWORD? A small example would be best.

    thank you
    with regards
    Pavel

  15. Stuart Post author

    Take a look at [dspace]/config/crosswalks/sword-swap-ingest.xsl

    This is the crosswalk file that converts the ingested metadata from mets.xml into DSpace metadata elements. You should be able to edit this to ingest extra metadata.

  16. Pavel Mika

    Thank you Stuart, it’s exactly the thing I was looking for.
    I wasn’t aware such a script was available.
    I have added the missing genre->type element to the mods.submission.xsl file and it works correctly now.
    cheers
    Pavel

  17. Claire Knowles

    Hi Stuart, does the SWORD submission work when using the template and demo repository and the 2.0 version of the MS Word authoring tool? As I’ve not been able to make a deposit.

    Thanks

    Claire

  18. Claire Knowles

    Hi Stuart, I got it working and can now deposit from Word to DSpace. I’m going to be working on the Deposit MO project so wanted to get this up and running. Claire
