Tag Archives: roadproject

If SWORD is the answer, what is the question?

I’ve just had a new collaborative paper published: ‘If SWORD is the answer, what is the question?’ (DOI: 10.1108/00330330910998057). It covers the most recent iteration of the SWORD repository deposit standard, looks briefly at some issues around the present lack of adoption of SWORD, and most usefully presents seven use cases of SWORD written by their developers:

Lewis, S., Hayes, L., Newton-Wade, V., Corfield, A., Davis, R., Donohue, T. and Wilson, S. (2009), 'If SWORD is the answer, what is the question?: Use of the Simple Web-service Offering Repository Deposit protocol', Program: electronic library and information systems, Vol. 43, Issue 4, pp. 407-418, DOI: 10.1108/00330330910998057, Emerald Group Publishing Limited.

Of course a copy is available open access in our repository: http://hdl.handle.net/2292/5315

Abstract:

Purpose – The purpose of this paper is to describe the repository deposit protocol, Simple Web-service Offering Repository Deposit (SWORD), its development iteration, and some of its potential use cases. In addition, seven case studies of institutional use of SWORD are provided.

Design/methodology/approach – The paper describes the recent development cycle of the SWORD standard, with issues being identified and overcome with a subsequent version. Use cases and case studies of the new standard in action are included to demonstrate the wide range of practical uses of the SWORD standard.

Findings – SWORD has many potential use cases and has quickly become the de facto standard for depositing items into repositories. By making use of a widely-supported interoperable standard, tools can be created that start to overcome some of the problems of gathering content for deposit into institutional repositories. They can do this by changing the submission process from a “one-size-fits-all” solution, as provided by the repository’s own user interface, to customised solutions for different users.

Originality/value – Many of the case studies described in this paper are new and unpublished, and describe methods of creating novel interoperable tools for depositing items into repositories. The description of SWORD version 1.3 and its development give an insight into the processes involved with the development of a new standard.

The seven case studies include a thesis submission system, a SWORD plugin for Moodle, an automated laboratory data repository deposit tool, a desktop deposit tool, the BibApp repository integration module, a custom deposit tool for a technical report series, and the Facebook SWORD deposit tool.

DSpace at a third of a million items

As part of the JISC-funded ROAD (Robot-generated Open Access Data) project we are load testing DSpace, EPrints and Fedora to see how they cope with holding large numbers of items. For a bit of background, see an earlier blog post: 'About to load test DEF repositories'.

The project programmer Antony Corfield has created a SWORD deposit tool for this purpose. It is a configurable tool that allows you to define a set of SWORD packages to deposit, how fast you want them to be deposited, how many you want to deposit at a time, etc. We decided to deposit using SWORD so that we can deposit a common SIP into each repository without too much extra work.
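For illustration, here is a minimal sketch (in Python) of the kind of concurrent SWORD deposit client such a tool implements. This is not Antony's actual tool: the deposit URL, credentials and packaging identifier are placeholders, and the headers shown are the ones a SWORD 1.3 deposit would typically carry.

    import hashlib
    import time
    from concurrent.futures import ThreadPoolExecutor

    import requests  # third-party: pip install requests

    DEPOSIT_URL = "http://repository.example.org/sword/deposit/collection-1"  # placeholder
    CONCURRENCY = 4      # number of concurrent deposits; 4 turned out optimal in our early tests (see below)
    NUM_DEPOSITS = 1000  # how many packages to deposit in this run

    def deposit(package_path):
        """POST one SWORD package and return the elapsed time in seconds."""
        with open(package_path, "rb") as f:
            body = f.read()
        headers = {
            "Content-Type": "application/zip",
            "Content-MD5": hashlib.md5(body).hexdigest(),  # lets the server verify the transfer
            "X-Packaging": "http://purl.org/net/sword-types/METSDSpaceSIP",  # assumed packaging type
        }
        start = time.time()
        resp = requests.post(DEPOSIT_URL, data=body, headers=headers,
                             auth=("depositor", "secret"))  # placeholder credentials
        resp.raise_for_status()
        return time.time() - start

    if __name__ == "__main__":
        packages = ["experiment.zip"] * NUM_DEPOSITS  # the same ~9MB SIP each time
        with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
            for i, elapsed in enumerate(pool.map(deposit, packages)):
                print(i, round(elapsed, 2))  # deposit number vs. time taken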

Early tests with this software depositing into DSpace on our server (8 processors, 16GB RAM) suggested that the optimal rate of deposit is 4 concurrent deposits. (This may be due to having 8 processors, with each deposit optimally requiring two processors, one for the database and one for the web application, but further tests would be required to confirm this.)

The tool was left running over Christmas depositing SWORD packages of approximately 9MB, each containing the results of an experiment. It is now almost up to a third of a million deposits. Whilst we would have liked to keep the experiment running to take the deposits to maybe 1 million, it would take more time than we have. Below is a graph of the time it took to deposit each item:

[Figure: time taken per deposit against number of items already in the repository, showing the 'banding' discussed below]

Some observations from this chart:

  • As expected, the more items that were in the repository, the longer an average deposit took to complete.
  • On average, deposits into an empty repository took about one and a half seconds.
  • On average, deposits into a repository with three hundred thousand items took about seven seconds.
  • If this linear-looking relationship between the number of deposits and the speed of deposit were to continue at the same rate, an average deposit into a repository containing one million items would take about 19 to 20 seconds.
  • Extrapolating this to work out throughput per day, that is about 10MB deposited every 20 seconds, 30MB per minute, or 43GB of data per day (a quick sanity check of this arithmetic is sketched after this list).
  • The ROAD project proposal suggested we wanted to deposit about 2GB of data per day, which is therefore easily possible.
  • If we extrapolate this further, then DSpace could theoretically hold 4 to 5 million items and still accept 2GB of data per day deposited via SWORD.
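To make the extrapolation concrete, here is the sanity check promised above. The fit parameters are read off the chart by eye (about 1.5 seconds at zero items, about 7 seconds at 300,000 items), so treat the outputs as rough estimates rather than measurements:

    PACKAGE_MB = 10                      # approximate SIP size (9MB, rounded up)
    SECS_AT_0, SECS_AT_300K = 1.5, 7.0   # average deposit times read off the chart

    slope = (SECS_AT_300K - SECS_AT_0) / 300_000   # extra seconds per item already held
    secs_at_1m = SECS_AT_0 + slope * 1_000_000     # predicted deposit time at 1M items
    mb_per_day = PACKAGE_MB / secs_at_1m * 86_400  # throughput at that deposit rate

    print("deposit time at 1M items: %.1f s" % secs_at_1m)  # ~19.8 s
    print("throughput: %.0f GB/day" % (mb_per_day / 1024))  # ~43 GB/day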

Some notes on how the experiment was undertaken:

  • The deposits took place in one go, with no break.
  • Therefore no maintenance that might normally have taken place, and that might be expected to improve performance (e.g. vacuuming the database), was performed during the test.
  • The network between the machine performing the deposits and the machine running the repository was a dedicated 1Gbit/s (crossover cable).

What is most interesting in the chart is the ‘banding’ that seems to be taking place. The time taken to perform a deposit seems to fall into 12 or so ‘bands’. For example, at three hundred thousand deposits, a deposit takes 2.5 OR 3.5 OR 4.5 OR 5.5 OR … seconds, but NOT 3 OR 4 OR 5 OR … seconds. Why is this? One hypothesis might be that when a deposit starts, the other three deposits that are also taking place can each be in one of a number of ‘states’. Depending upon the combination of states that the other deposits are in, the time the new deposit takes is affected. For example, if the other deposits are all in a disk-intensive state, then initiating another deposit where the SIP is being uploaded (requiring even more disk activity) would be slow, leading to a slow deposit time. However, if the other three happen to be in a processor-intensive state, the deposit may be faster.

But… with three other deposits going on at once, if each of those had two states that they could be in, you might reasonably expect there to be only eight bands (2³). Or if there were three states, you might reasonably expect there to be 27 bands (3³). But the graph suggests there are more than 8 bands, yet far fewer than 27.
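As a hedged back-of-envelope for this, the snippet below counts how many distinct ‘contexts’ a new deposit could see if the three other in-flight deposits are each in one of k states: k³ if the identity of each deposit matters, or the number of multisets, C(k+2, 3), if only the mix of states matters. The ~12 observed bands sit between several of these counts, so the count alone does not pin down the mechanism:

    from math import comb

    for k in (2, 3, 4):
        ordered = k ** 3            # the three other deposits are distinguishable
        unordered = comb(k + 2, 3)  # only the combination of states matters
        print(k, "states:", ordered, "ordered,", unordered, "unordered contexts")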

Of course, the banding may be an artifact of the way the deposit tool itself works. Hopefully it isn't, but until we run the same test depositing into EPrints and Fedora, we won't know.

It is useful to note the overhead that SWORD puts on the deposit process. Those of you who have run normal batch imports into repositories such as DSpace will know that they zip along quite fast, probably several times (if not more) faster than deposit by SWORD. The reasons for this are as follows (a sketch of the extra steps appears after the list):

  • With batch import, the files are already on the repository server. With SWORD, they need to be transferred onto the server using an HTTP POST.
  • With batch import, the files are ready to be imported into the asset store. With SWORD, the files are zipped up in a package. First the package needs to be read in order for its MD5 checksum to be computed and compared to the checksum sent by the client, to ensure the package has been transmitted without any errors. Then the package has to be unzipped.
  • With batch import, the metadata is held in an XML file which is parsed and read. With SWORD, the metadata is also held in an XML file, but is first transformed using an XML stylesheet which takes a bit more time.
  • With batch import, the results are displayed on the command line. With SWORD, an AtomPub response has to be sent back over the network and then processed by the client.
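To make the overhead concrete, here is a minimal sketch of that extra server-side work: re-reading the package to verify the client's MD5 checksum, unzipping it, and transforming the packaged metadata with a stylesheet. The file names and stylesheet are placeholders, and lxml is assumed for the XSLT step; this is an illustration, not DSpace's actual ingest code:

    import hashlib
    import zipfile

    from lxml import etree  # third-party: pip install lxml

    def ingest_sword_package(package_path, client_md5, workdir):
        # 1. Re-read the whole package to check it arrived intact.
        with open(package_path, "rb") as f:
            if hashlib.md5(f.read()).hexdigest() != client_md5:
                raise ValueError("checksum mismatch: package corrupted in transit")

        # 2. Unzip the SIP before anything can go into the asset store.
        with zipfile.ZipFile(package_path) as zf:
            zf.extractall(workdir)

        # 3. Transform the packaged metadata into the repository's internal
        #    schema with an XSLT stylesheet; a batch import, which reads its
        #    own native XML directly, gets to skip this step.
        transform = etree.XSLT(etree.parse(workdir + "/crosswalk.xsl"))  # placeholder
        return transform(etree.parse(workdir + "/mets.xml"))             # placeholder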

(Over the next few months we'll run similar tests with EPrints and Fedora, and, depending on time, try other tests, such as performing the same tests on a server that is under user load: we'll run load-testing software that simulates user activity such as viewing and downloading items, performing searches, and accessing the OAI-PMH interface, to see how that compares. Before we simulate user load, we'll have to consider what the expected load on a data repository might be. My initial guess is that it would be perhaps lower than on a general institutional repository, but that when a user does download data, they will probably want to download many hundreds of experiments and several gigabytes of data. So general use would be lower, but each use would have a higher impact on the repository.)
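As a taste of what that user-load simulation might look like, here is a minimal sketch that weights a handful of request types to mimic the pattern guessed at above (occasional heavyweight data downloads among lighter browsing, searching and OAI-PMH harvesting). The URLs and weights are placeholders, not measured behaviour:

    import random
    import threading
    import time

    import requests  # third-party: pip install requests

    BASE = "http://repository.example.org"  # placeholder

    ACTIONS = [  # (probability, URL) pairs; weights must sum to 1
        (0.5, BASE + "/browse"),                                      # casual browsing
        (0.3, BASE + "/search?query=experiment"),                     # searching
        (0.1, BASE + "/oai?verb=ListRecords&metadataPrefix=oai_dc"),  # OAI-PMH harvest
        (0.1, BASE + "/bitstream/experiment-1.zip"),                  # bulk data download
    ]

    def simulated_user(duration_s):
        """Issue weighted random requests until duration_s has elapsed."""
        end = time.time() + duration_s
        while time.time() < end:
            r, cumulative = random.random(), 0.0
            for weight, url in ACTIONS:
                cumulative += weight
                if r < cumulative:
                    requests.get(url, timeout=60)
                    break
            time.sleep(random.uniform(1, 10))  # think time between requests

    # Run 20 concurrent simulated users for an hour.
    for _ in range(20):
        threading.Thread(target=simulated_user, args=(3600,)).start()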

About to load test DEF repositories

One of the core aims of the ROAD project is to load test DSpace, EPrints and Fedora repositories to see how they scale when it comes to using them as repositories to archive large amounts of data (in the form of experimental results and metadata). According to ROAR, the largest repositories (housing open access materials) based on these platforms hold 191,510, 59,715 and 85,982 items respectively (as of 18th July 2008). We want to push them further and see how they fare.

DSpace, for instance, has in the past suffered from ongoing bad publicity, and from its own honesty, relating to some issues in early versions, where it suffered from some instability and slowness under load (both user load and content load). One of the downsides of the web (well, of some of its users really) is that old reports stay archived on the web, and are read and believed with no consideration of changes that may have taken place in the interim. Many or most of these issues have now been sorted at the sort of scale that used to cause problems (100,000 items to 1/4 million items), and we need to re-evaluate the platform to see where it now breaks. Indeed, the following report set out to test DSpace with 1 million items and found no particular issues:

Testing the Scalability of a DSpace-based Archive, Dharitri Misra, James Seamans, George R. Thoma, National Library of Medicine, Bethesda, Maryland, USA

I've not looked very hard, and nothing obvious about EPrints scalability appeared on the first page of Google results, but for Fedora I found this useful page: http://fedora.fiz-karlsruhe.de/docs/Wiki.jsp?page=Main

Our new load testing hardware has arrived. We have a standard spec server to perform the testing, and a beefy little number on which to run the repositories:

  • Two quad-core Xeon processors
  • 16GB RAM
  • 6TB raw SATA disk (yes, it's slow, but cheap!)

We've not yet decided what tests we'll run (get in contact if you have any suggestions!), but we have decided we'll be using SWORD to perform the test deposits, as it allows us to throw identical packages at all three repositories, which provides us with a level playing field.

We've done some initial work which showed that some of the repositories fell down as soon as we tried to deposit more than a couple of items concurrently using SWORD, and others fell down at 50 concurrent deposits, but these were small implementation issues which have now been fixed, so full testing can start taking place.

More details will be blogged once we start getting some useful comparative data; however, seeing as the report cited above took about 10 days to deposit 1 million items, it may be some weeks before we're able to report data from meaningful tests on each platform.

These results will inform the next stage of the ROAD project, which is to choose one of the repositories upon which to build a repository for the Robot Scientist, so the stakes are high!