DSpace at a third of a million items

As part of the JISC-funded ROAD (Robot-generated Open Access Data) project we are load testing DSpace, EPrints and Fedora to see how they cope with holding large numbers of items. For a bit of background, see an earlier blog post: ‘About to load test DEF repositories’.

The project programmer Antony Corfield has created a SWORD deposit tool for this purpose. It is a configurable tool that allows you to define a set of SWORD packages to deposit, how fast you want them to be deposited, how many you want to deposit at a time, etc. We decided to deposit using SWORD so that we can deposit a common SIP into each repository without too much extra work.

Early tests with this software depositing into DSpace on our server (8 processors, 16GB RAM) suggested that the optimal rate of deposit is 4 concurrent deposits. (This may be due to having 8 processors, with each deposit optimally requiring two processors, one for the database and one for the web application, but further tests would be required to confirm this.)
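To give a feel for what such a tool does, here is a minimal Python sketch of timing deposits with a fixed number running concurrently. It is not the project's tool: the function names are made up for illustration, and the HTTP headers (in particular X-Packaging and its value) are assumptions based on the SWORD 1.x profile, so check your repository's service document for what it actually accepts.

```python
import base64
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def build_deposit_request(deposit_url, package_bytes, filename, user, password):
    """Build (but do not send) an HTTP POST carrying one zipped package."""
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Content-Type": "application/zip",
        "Content-Disposition": f"filename={filename}",
        # Assumed packaging header and identifier; verify against the
        # service document of the repository being tested.
        "X-Packaging": "http://purl.org/net/sword-types/METSDSpaceSIP",
        "Authorization": f"Basic {credentials}",
    }
    return urllib.request.Request(
        deposit_url, data=package_bytes, headers=headers, method="POST"
    )


def send_over_http(request):
    """Send one deposit and read the whole response."""
    with urllib.request.urlopen(request) as response:
        response.read()


def timed_deposit(send, request):
    """Return the wall-clock seconds taken by a single deposit."""
    start = time.perf_counter()
    send(request)
    return time.perf_counter() - start


def run_deposits(requests, send=send_over_http, concurrency=4):
    """Deposit all requests, at most `concurrency` at a time.

    Returns one duration per request, in the same order as `requests`.
    """
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(lambda r: timed_deposit(send, r), requests))
```

The `send` parameter is there so the timing logic can be tried out with a stand-in function before pointing it at a real server. Plotting the returned durations against deposit number would give a chart of the same kind as the one discussed below.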

The tool was left running over Christmas depositing 9Mb (approx) SWORD packages, each containing the results of an experiment. The total is now approaching a third of a million deposits. Whilst we would have liked to keep the experiment running to take the deposits to maybe 1 million, it would take more time than we have. Below is a graph of the time it took to deposit each item:

[Graph: dspace-banding, showing the time taken to deposit each item]

Some observations from this chart:

Some notes on how the experiment was undertaken:

What is most interesting in the chart is the ‘banding’ that seems to be taking place. The time taken to perform a deposit seems to fall into 12 or so ‘bands’. For example at three hundred thousand deposits, deposits either take 2.5 OR 3.5 OR 4.5 OR 5.5 OR … seconds to deposit, but NOT 3 OR 4 OR 5 OR … seconds to deposit. Why is this? One hypothesis might be that when a deposit starts, the other three deposits that are also taking place can each be in one of a number of ‘states’. The combination of states that the other deposits are in affects the time the new deposit takes. For example if the other deposits are all in a disk intensive state, then initiating another deposit where the SIP is being uploaded (requiring even more disk activity) would be slow, leading to a slow deposit time. However if the other three happen to be in a processor intensive state, the deposit may be faster.

But… with three other deposits going on at once, if each of those had two states that they could be in, you might reasonably expect there to be only eight bands (2³). Or if there were three states, you might reasonably expect there to be 27 bands (3³). But the graph suggests there are more than 8 bands, but far fewer than 27.
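The arithmetic above can be checked with a few lines of Python. The sketch also counts the combinations under one extra assumption that is not tested here: that it only matters how many of the other deposits are in each state, not which deposit is in which. The function names are illustrative only.

```python
from itertools import combinations_with_replacement, product


def ordered_combinations(num_states, other_deposits=3):
    """Count combinations where each deposit's state is tracked separately."""
    return len(list(product(range(num_states), repeat=other_deposits)))


def unordered_combinations(num_states, other_deposits=3):
    """Count combinations where only the number in each state matters."""
    return len(
        list(combinations_with_replacement(range(num_states), other_deposits))
    )


for states in (2, 3):
    print(
        states,
        "states:",
        ordered_combinations(states),
        "ordered,",
        unordered_combinations(states),
        "unordered",
    )
```

With three states the unordered count is 10, which is at least in the same region as the 12 or so bands in the graph, but this is only a counting exercise and not evidence for the hypothesis.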

Of course it may be an artifact of the way the deposit tool is working that causes this. Hopefully it isn’t, but until we run the same test depositing into EPrints and Fedora, we’ll not know.

It is useful to note the overhead that SWORD puts on the deposit process. Those of you who have run normal imports into repositories such as DSpace will know that they zip along quite fast, probably several times (if not more) faster than the deposit by SWORD. The reasons for this are:

(Over the next few months we’ll run similar tests with EPrints and Fedora. Depending on time, we can also try other tests, such as performing the same tests on a server which is under user load, to see how that compares. For that we’ll run load testing software that simulates user activity such as viewing and downloading items, performing searches, accessing the OAI-PMH interface etc. Before we simulate user load, we’ll have to consider what the expected load on a data repository might be. My initial guess is that it would be perhaps lower than a general institutional repository, but when a user does download data, they will probably want to download many hundreds of experiments and several gigabytes of data. So the general use would be lower, but each use would have a higher impact on the repository.)

Posted on January 19, 2009 at 10:32 am by Stuart

20 Responses


  1. Written by Stevan Harnad
    on January 19, 2009 at 7:12 pm

    Bravo for benchmarking the load capacity of Institutional Repositories (IRs) (and I look forward to seeing your results for EPrints).

    But why the concern with how much one can cram into *one* IR, and how fast and how long? Why not just branch to another IR, and another, as each gets topped up? CalTech has at least 26 EPrints IRs! The whole point of the OAI-PMH was to make multiple distributed IRs interoperable, as if they were one big virtual (harvested) repository.

  2. Written by Paul Hartr
    on January 20, 2009 at 3:44 pm

    Hi Stuart,
    Very interesting research and glad to see you took the time to try this. We use Intralibrary and the forthcoming version 3 supports SWORD, so I must try something similar.

  3. Written by Les Carr
    on January 21, 2009 at 9:38 am

    Can you compile DSpace with profiling information to get some accurate data about what activities are taking place? For example, how much of the ingest process is spent on communicating with the database, communicating with the storage server, creating thumbnails, indexing the fulltext, making a checksum etc.

  4. Written by stuart
    on January 21, 2009 at 2:48 pm

    Hi Les: Yes. There are different logging ‘levels’ that can be used. A normal DSpace instance would use the INFO log level which just logs high level information, errors and warnings. If we set this to DEBUG, we get very detailed logging information, right down to the individual SQL statements being executed. This log file would need analysing to derive the actual profiling information though.

  5. Written by stuart
    on January 21, 2009 at 2:57 pm

    Hi Stevan: Thanks for your comments. My take on this is that i) It is good to load test repositories. Whilst we could use multiple repositories, we need to make sure we are getting the best use out of our current software and hardware, and that we are not being stifled by bad software. Unless we run load tests, we’ll never know if that is the case.

    ii) Yes, OAI-PMH can be used to pull together multiple repositories into a common search system, but somewhere (such as a common search system) all the data needs to be held in one place, so we might as well see if the repositories can handle this themselves without the complexity and cost of having to run multiple repositories and a search service.

  6. Written by Jim Downing
    on January 27, 2009 at 1:15 pm

    Hi Stuart, good work!

    Can the banding be explained by GC activity?

    Did you try dropping all the indices before depositing, then rebuilding them afterwards? A commonly employed trick if the game is bulk ingest…

  7. Written by stuart
    on January 29, 2009 at 9:37 am

    Hi Jim. Thanks for your comment.

    I suppose garbage collection is likely to play a role here, but to what extent it contributes towards the banding I don’t know.

    We didn’t drop the indexes as we felt that would have been cheating a bit. Our aim was to simulate normal deposit over time, rather than just seeing how quickly we can get the deposits into the repository.

  8. Written by Debashree Pati
    on February 10, 2009 at 5:16 pm

    Hi Stuart,

    What do you mean by batch import? Does SWORD provide for bulk posting similar to the itemImport of DSpace? From the SWORD source code and sword client options, all I see is a method to post a single item at a time.

    Thanks,
    Deb

  9. Written by stuart
    on February 12, 2009 at 6:32 am

    Hi Deb,

    Where the post mentions batch import, I was comparing the performance of using SWORD to using DSpace’s (or EPrints’, or Fedora’s) command line batch import tool.

    What the project has been using is a tool we’ve written to mimic that sort of batch behaviour, not only in depositing multiple items serially (one at a time) but also in parallel (many at a time).

    Thanks,

    Stuart

  10. Written by Kai
    on February 18, 2009 at 9:27 am

    Hi Stuart,

    We’ve also seen the banding with Fedora batch ingests. I believe these bands emerge mostly due to the hard disk(s) being under heavy load and having to operate in a predominantly non-sequential way. It got significantly better (fewer bands with narrower gaps) once we began distributing different parts of the system, most notably the database, to separate machines or at least separate disks.

    Thanks,
    Kai

  11. Written by Luca
    on November 13, 2009 at 5:36 am

    Hi Stuart,

    Great post. Are you going to test retrieval time also? Are there performance issues with Lucene search capabilities?

    Thanks,
    Luca.

  12. Written by Stuart
    on November 13, 2009 at 2:20 pm

    Hi Luca,

    The project has now finished so there won’t be any further tests run by the project.

    I would imagine that other aspects of the system would suffer from performance issues (for example the db-backed browse system) before the Lucene search index suffers from problems (but this is just a guess).

    Thanks,

    Stuart

  13. Written by Abhi
    on November 18, 2009 at 7:43 pm

    Thanks a lot for such useful information. I need your expertise. Could you please tell me how to build or configure SWORD?

  14. Written by Stuart
    on November 19, 2009 at 7:18 am

    SWORD is usually now part of the latest repository downloads. For example SWORD with DSpace is contained in the download and doesn’t need any configuration to get it to work.

  15. Written by Abhi
    on November 20, 2009 at 10:21 pm

    Stuart, thanks once again for the prompt reply. I downloaded sword-client-1.1 and sword-webapp-1.1, so what next? Is there a build process like there is for DSpace? If yes, please let me know. Or do I just have to place it in the webapps folder of Tomcat?

    Please guide me accordingly.

  16. Written by Abhi
    on November 20, 2009 at 10:22 pm

    One more thing Stuart, can we have a step-by-step installation procedure?

  17. Written by Stuart
    on November 21, 2009 at 7:01 am

    As long as you are running DSpace 1.5 or above, then it comes as part of the distribution. When you compile DSpace, you’ll find a ‘sword’ directory is created in [dspace]/webapps/ along with ‘jspui’, ‘xmlui’, ‘oai’ etc. You just need to deploy that webapp to [tomcat]/webapps/ like you do the others.

    The service document can be found at http://dspace.your-server.com/sword/servicedocument and the deposit URL is http://dspace.your-server.com/sword/deposit/your-handle/collection-id (e.g. 12345676879/45)

  18. Written by Abhi
    on November 23, 2009 at 6:23 pm

    Stuart, I have SWORD (it came with DSpace). After that I downloaded sword-client-1.1 and sword-webapp-1.1. What next? I want to make my own deposit as well as a service document.
    Could you please tell me what the steps are after deploying sword in [tomcat]/webapps/?

  19. Written by Abhi
    on November 25, 2009 at 5:40 pm

    Also, there is no index file available in that folder. I think it requires a build; if yes, what steps do I need to follow?

  20. Written by Mark Diggory
    on December 11, 2009 at 8:53 am

    Stuart,

    Even though your tests have finished in this area, I would make the following comment. Disabling the Browse and Search Event Consumers during the ingest process may lead to some very interesting results for comparison, showing whether it is those systems where the most time is being spent on larger loads. Once those are disabled, calls to update the Browse and Search indexes would not happen (calling index-init afterward would rebuild the full index).

    (Smells like a feature for a future release)

    Cheers,
    Mark
