As part of the JISC-funded ROAD (Robot-generated Open Access Data) project we are load testing DSpace, EPrints and Fedora to see how they cope with holding large numbers of items. For a bit of background, see an earlier blog post: ‘About to load test DEF repositories’.
The project programmer Antony Corfield has created a SWORD deposit tool for this purpose. It is a configurable tool that allows you to define a set of SWORD packages to deposit, how fast you want them to be deposited, how many you want to deposit at a time, etc. We decided to deposit using SWORD so that we can deposit a common SIP into each repository without too much extra work.
Early tests with this software depositing into DSpace on our server (8 processors, 16GB RAM) suggested that the optimal rate of deposit is 4 concurrent deposits. (This may be due to having 8 processors, with each deposit optimally requiring two: one for the database and one for the web application. Further tests would be required to confirm this.)
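To give a flavour of how the tool drives the repository, here is a minimal sketch (in Python, and emphatically not Antony’s actual code) of keeping four SWORD deposits in flight at once. The endpoint URL, credentials and package names are all hypothetical:

```python
import time
from concurrent.futures import ThreadPoolExecutor

import requests  # third-party HTTP library

# Hypothetical values: the real tool makes the endpoint, credentials,
# package list and concurrency all configurable.
SWORD_ENDPOINT = "http://repository.example.org/sword/deposit/123456789/1"
PACKAGES = ["package-%05d.zip" % i for i in range(1000)]
CONCURRENCY = 4  # the rate we found optimal on our 8-processor server

def deposit(path):
    """POST one zipped SIP to the SWORD endpoint and time it."""
    with open(path, "rb") as f:
        data = f.read()
    start = time.time()
    response = requests.post(
        SWORD_ENDPOINT,
        data=data,
        headers={"Content-Type": "application/zip"},
        auth=("depositor", "password"),
    )
    return path, response.status_code, time.time() - start

# A pool of four workers keeps four deposits in flight at all times.
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    for path, status, elapsed in pool.map(deposit, PACKAGES):
        print("%s -> HTTP %d in %.2fs" % (path, status, elapsed))
```

Recording the elapsed time of each deposit in this way is what produces the data points plotted below.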
The tool was left running over Christmas depositing 9MB (approx.) SWORD packages, each containing the results of an experiment. It is now almost up to a third of a million deposits. Whilst we would have liked to keep the experiment running to take the deposits to maybe 1 million, it would take more time than we have. Below is a graph of the time it took to deposit each item:
Some observations from this chart:
- As expected, the more items that were in the repository, the longer an average deposit took to complete.
- On average, deposits into an empty repository took about one and a half seconds.
- On average, deposits into a repository with three hundred thousand items took about seven seconds.
- If this linear-looking relationship between repository size and deposit time were to continue at the same rate, an average deposit into a repository containing one million items would take about 19 to 20 seconds.
- Extrapolating this to work out throughput per day: that is about 10MB deposited every 20 seconds, 30MB per minute, or 43GB of data per day (see the worked calculation after this list).
- The ROAD project proposal suggested we wanted to deposit about 2GB of data per day, which is therefore easily possible.
- If we extrapolate this further, then DSpace could theoretically hold 4 to 5 million items, and still accept 2GB of data per day deposited via SWORD.
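For anyone who wants to check the arithmetic in the last three bullets, it boils down to the following (the 20-second figure is the projection at one million items, not a measurement):

```python
# Extrapolated throughput at one million items in the repository.
package_size_mb = 10       # approx. size of one SWORD package
seconds_per_deposit = 20   # projected average deposit time

mb_per_minute = package_size_mb * (60 / seconds_per_deposit)
gb_per_day = mb_per_minute * 60 * 24 / 1000

print("%.0fMB per minute" % mb_per_minute)  # 30MB per minute
print("%.1fGB per day" % gb_per_day)        # 43.2GB per day

# The proposal's target of 2GB per day needs only ~200 such deposits,
# i.e. one roughly every seven minutes -- comfortably within this rate.
print("deposits/day for 2GB: %.0f" % (2000 / package_size_mb))
```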
Some notes on how the experiment was undertaken:
- The deposits took place in one go, with no break.
- Therefore no maintenance took place during the test that would normally have been done, and that might be expected to improve performance (e.g. vacuuming the database).
- The network between the machine performing the deposits and the machine running the repository was a dedicated 1Gbit/s (crossover cable).
What is most interesting in the chart is the ‘banding’ that seems to be taking place. The time taken to perform a deposit seems to fall into 12 or so ‘bands’. For example, at three hundred thousand deposits, a deposit takes 2.5 OR 3.5 OR 4.5 OR 5.5 OR … seconds, but NOT 3 OR 4 OR 5 OR … seconds. Why is this? One hypothesis might be that when a deposit starts, the other three deposits also taking place can each be in one of a number of ‘states’, and the combination of states the other deposits are in affects the time the new deposit takes. For example, if the other deposits are all in a disk-intensive state, then initiating another deposit, whose SIP upload requires even more disk activity, would be slow, leading to a slow deposit time. However, if the other three happen to be in a processor-intensive state, the deposit may be faster.
But… with three other deposits going on at once, if each of those had two states that they could be in, you might reasonably expect there to be only eight bands (2³). Or if there were three states, you might reasonably expect there to be 27 bands (3³). But the graph suggests there are more than 8 bands, yet a lot fewer than 27.
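Out of curiosity, the gap between 8 and 27 narrows if you assume the three other deposits are interchangeable, so that only how many of them are in each state matters, not which one is in which. A quick enumeration (pure speculation, not something measured in the test):

```python
from itertools import combinations_with_replacement, product

# If each of the three other in-flight deposits can be in one of k
# states, how many distinct combinations -- and hence bands -- are there?
for k in (2, 3, 4):
    ordered = len(list(product(range(k), repeat=3)))                   # k**3
    unordered = len(list(combinations_with_replacement(range(k), 3)))  # C(k+2, 3)
    print("k=%d states: %d ordered, %d unordered" % (k, ordered, unordered))

# k=2 states: 8 ordered, 4 unordered
# k=3 states: 27 ordered, 10 unordered
# k=4 states: 64 ordered, 20 unordered
```

If the deposits were interchangeable and had three or four states each, you would expect 10 or 20 bands, which is at least in the right region for the 12 or so visible in the chart.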
Of course, the banding may be an artifact of the way the deposit tool itself works. Hopefully it isn’t, but until we run the same test depositing into EPrints and Fedora, we won’t know.
It is useful to note the overhead that SWORD puts on the deposit process. Those of you who have run normal batch imports into repositories such as DSpace will know that they zip along quite fast, probably several times faster (if not more) than deposit via SWORD. The reasons for this are:
- With batch import, the files are already on the repository server. With SWORD, they need to be transferred onto the server using an HTTP POST.
- With batch import, the files are ready to be imported into the asset store. With SWORD, the files are zipped up in a package. First the package needs to be read so that its MD5 checksum can be computed and compared to the checksum sent by the client, to ensure the package has been transmitted without any errors. Then the package has to be unzipped. (A sketch of the client-side half of this exchange follows this list.)
- With batch import, the metadata is held in an XML file which is parsed and read. With SWORD, the metadata is also held in an XML file, but it is first transformed using an XSLT stylesheet, which takes a bit more time.
- With batch import the results are displayed on the command line. With SWORD, and Atompub response has to be sent back over the network and then processed by the client.
Over the next few months we’ll run similar tests with EPrints and Fedora and, depending on time, try other tests, such as running the same deposits on a server which is under user load (we’ll run load-testing software that simulates user activity: viewing and downloading items, performing searches, accessing the OAI-PMH interface, etc.) to see how that compares. Before we simulate user load, we’ll have to consider what the expected load on a data repository might be. My initial guess is that it would perhaps be lower than for a general institutional repository, but when a user does download data, they will probably want to download many hundreds of experiments and several gigabytes of data. So general use would be lower, but each use would have a higher impact on the repository.