Posts Tagged ‘dspace’

Launch of ‘The DSpace Course’

Wednesday, August 27th, 2008

This afternoon we (the JISC-funded Repositories Support Project) formally launched ‘The DSpace Course’ - a Creative Commons licensed course for new DSpace repository administrators and developers.

There are currently 20 modules, and a Live CD that can be used for the training. The course is designed to be taught by a trainer, and used in a mix-and-match way so that courses can be designed around the attendees and their desired outcomes. Each module has a set of slides and a student workbook.

We’d be glad to receive any feedback on the course in order to improve it!

The press release says:

Today the JISC-funded Repositories Support Project (http://rsp.ac.uk/) have formally launched a modular training course for DSpace - “The DSpace Course”. The course materials have been published with a Creative Commons licence in order to facilitate their re-use.

The course is suitable for DSpace administrators and developers, with the choice of modules being dependent on the people taking the course. The course tutor can mix-and-match the modules to create a custom course. Each module comes with a set of PowerPoint slides, and an associated student workbook. The course has been successfully taught in the UK and Italy.

There are 20 modules in the course, with more modules due to be added soon. The modules include:

 - An Introduction to DSpace

 - How to Get Help

 - Repository Structure

 - Identifiers

 - DSpace Configuration

 - User management and authentication options

 - Metadata Input Customisation

 - Look and Feel Customisation

 - Language Customisation

 - Item Submission Workflows

 - Import and Export

 - Configuring LDAP

 - Upgrading from 1.4 to 1.5

In addition to the course materials the RSP has released a DSpace ‘Live CD’.

The CD allows any PC to be used as a training machine with a copy of DSpace pre-installed, along with all of the files required to perform a new installation.

The CD is inserted into a computer upon boot, and will load a live version of the DSpace software without installation to the hard drive. Upon completion of the training course, remove the CD and the normal operating system will be loaded upon restart of the PC. 

The course materials can be downloaded from:

 - http://hdl.handle.net/2160/615

The Live CD can be downloaded from:

 - http://hdl.handle.net/2160/563

The course has been written by Stuart Lewis (DSpace committer, developer and trainer) and Chris Yates (DSpace developer, support provider and trainer), and has benefited from input by Claudia Jürgen (DSpace committer, developer and trainer).

For help and support, please direct all enquiries related to the course to support@rsp.ac.uk.

In addition, the support team may be able to put you in touch with suitable trainers who could teach the course in your area.

Test LDAP service upgraded - now with branches

Monday, August 18th, 2008

A few weeks ago I made a test LDAP service available (read the blog post) in order to allow people without an LDAP service to test their LDAP-related DSpace patches, or to help people configuring their DSpace LDAP settings by showing them an example with the correct configuration settings.

I’ve been working recently to upgrade the LDAP support in DSpace to allow it to support sub-tree searching. At present it can only authenticate users within a single OU, but many institutions separate their users across a large tree of OUs.

So, I have now released a patch that does this, which will either be included in the upcoming DSpace 1.5.1, or will have to wait for 1.5.2 or 1.6 etc.

In order for me to test this I have had to include more users in my test LDAP service, which you are welcome to use too! The patch allows you to specify the DN and password of a user who has full read and search rights over the LDAP tree; DSpace binds as that user in order to identify the DN of the person who is trying to log in. If you have anonymous access enabled on your server you could comment out the user’s details. The patch then uses that DN and the password provided by the user to re-bind to the LDAP server to make sure their credentials are correct. If you want to make use of this service, here are the settings you’ll need:

  • ldap.provider_url = ldap://ldap.testathon.net:389/
  • ldap.id_field = cn
  • ldap.object_context = OU=users,DC=testathon,DC=net
  • ldap.search_context = OU=users,DC=testathon,DC=net
  • ldap.email_field = mail
  • ldap.surname_field = sn
  • ldap.givenname_field = givenName
  • ldap.phone_field = telephoneNumber
  • ldap.search_scope = 2
  • ldap.search.user = CN=stuart,OU=users,DC=testathon,DC=net
  • ldap.search.password = stuart
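
The search-then-bind flow described above can be sketched roughly as follows. This is only an illustration in Python using an in-memory stand-in for the directory, not a real LDAP client and not the patch itself (which is Java code inside DSpace). The names DIRECTORY, bind, search_subtree and authenticate are hypothetical, as are the 'staff' sub-OU and the user 'alice'.

```python
# Hypothetical in-memory stand-in for an LDAP server: DN -> password.
# Only the 'stuart' entry comes from the settings above; the 'staff'
# sub-OU and its user are invented purely for illustration.
DIRECTORY = {
    "CN=stuart,OU=users,DC=testathon,DC=net": "stuart",
    "CN=alice,OU=staff,OU=users,DC=testathon,DC=net": "alice",
}


def bind(dn, password):
    """Return True if the DN exists and the password matches."""
    return dn in DIRECTORY and DIRECTORY[dn] == password


def search_subtree(base_dn, id_field, netid):
    """Find a DN at any depth under base_dn whose first RDN is id_field=netid."""
    wanted = f"{id_field}={netid}".lower()
    suffix = "," + base_dn.lower()
    for dn in DIRECTORY:
        first_rdn = dn.split(",", 1)[0].lower()
        if first_rdn == wanted and dn.lower().endswith(suffix):
            return dn
    return None


def authenticate(netid, password, search_user, search_password,
                 search_context="OU=users,DC=testathon,DC=net", id_field="cn"):
    # Step 1: bind as the account that has read and search rights.
    if not bind(search_user, search_password):
        return False
    # Step 2: search the whole sub-tree for the DN of the person logging in.
    user_dn = search_subtree(search_context, id_field, netid)
    if user_dn is None:
        return False
    # Step 3: re-bind with that DN and the password the person supplied.
    return bind(user_dn, password)
```

The point of the second step is that the user's DN no longer has to sit directly inside one fixed OU, because it is looked up rather than assembled from the username.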

There are now nine users, structured as shown below:

As before, all passwords are the same as usernames. 

I hope this is a useful service. Comments welcome!

About to load test DEF repositories

Friday, July 18th, 2008

One of the core aims of the ROAD project is to load test DSpace, EPrints and Fedora repositories to see how they scale when it comes to using them as repositories to archive large amounts of data (in the form of experimental results and metadata). According to ROAR, the largest repositories (housing open access materials) based on these platforms hold 191,510, 59,715 and 85,982 items respectively (as of 18th July 2008). We want to push them further and see how they fare.

DSpace, for instance, has in the past suffered from ongoing bad publicity, and from its own honesty, about issues in early versions, which suffered from some instability and slowness under load (user load and content load). One of the downsides of the web (well, of some of its users really) is that old reports stay archived on the web, and are read and believed with no consideration of changes that may have taken place in the interim. Many or most of these issues have now been sorted for the sort of scale that used to cause problems (100,000 items to 1/4 million items) and we need to re-evaluate the platform to see where it now breaks. Indeed the following report set out to test DSpace with 1 million items, and found no particular issues:

Testing the Scalability of a DSpace-based Archive, Dharitri Misra, James Seamans, George R. Thoma, National Library of Medicine, Bethesda, Maryland, USA

I’ve not looked very hard, and there was nothing obvious on the first page of Google results about EPrints scalability, but for Fedora I found this useful page: http://fedora.fiz-karlsruhe.de/docs/Wiki.jsp?page=Main

Our new load testing hardware has arrived. We have a standard spec server to perform the testing, and a beefy little number on which to run the repositories:

  • Two quad-core XEON processors
  • 16GB RAM
  • 6TB raw SATA disk (yes it’s slow, but cheap!)

We’ve not yet decided what tests we’ll run (get in contact if you have any suggestions!), but we have decided we’ll be using SWORD to perform the test deposits, as it allows us to throw identical packages at all three repositories, which provides us with a level playing field.

We’ve done some initial work which showed some of the repositories fell down as soon as we tried to deposit more than a couple of items concurrently using SWORD, and others fell down at 50 concurrent deposits, but these are small implementation issues which have now been fixed, so full testing can start taking place.
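
A concurrency test of this kind could be driven by something like the sketch below. It is not the harness used in the project: run_concurrent_deposits and fake_deposit are hypothetical names, and fake_deposit is only a placeholder where a real SWORD deposit (an HTTP POST of the package to the repository) would go.

```python
from concurrent.futures import ThreadPoolExecutor


def run_concurrent_deposits(deposit, packages, concurrency):
    """Call deposit(package) for every package, `concurrency` at a time.

    Returns (successes, failures). An exception raised by deposit is
    counted as a failure rather than stopping the whole run.
    """
    def attempt(package):
        try:
            deposit(package)
            return True
        except Exception:
            return False

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(attempt, packages))
    successes = sum(results)
    return successes, len(results) - successes


def fake_deposit(package):
    # Placeholder for a real SWORD deposit; it fails on purpose for one
    # package name so that the failure counting can be seen working.
    if package == "broken.zip":
        raise RuntimeError("simulated deposit failure")


print(run_concurrent_deposits(fake_deposit, ["a.zip", "broken.zip", "c.zip"], 2))
```

Passing the deposit function in as a parameter means the same driver can be pointed at each of the three repositories in turn with identical packages.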

More details will be blogged once we start getting some useful comparative data. However, seeing as the report cited above took about 10 days to deposit 1 million items, it may be some weeks before we’re able to report data from meaningful tests on each platform.
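
To put that figure in perspective, here is the arithmetic as a small Python snippet. It treats the "about 10 days" as exactly 10 days of continuous depositing, which is an assumption, so the result is only a rough average rate.

```python
ITEMS = 1_000_000
DAYS = 10  # "about 10 days", taken here as exactly 10

items_per_day = ITEMS / DAYS
items_per_second = ITEMS / (DAYS * 24 * 60 * 60)

print(f"{items_per_day:,.0f} items per day")        # 100,000 items per day
print(f"{items_per_second:.2f} items per second")   # 1.16 items per second
```

At a little over one item a second, loading three repositories one after the other at that rate would take around a month.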

These results will inform the next stage of the ROAD project which is to choose one of the repositories upon which to build a repository for the Robot Scientist, so the stakes are high!

Test LDAP service

Monday, July 7th, 2008

One of the first integration tasks undertaken on a new repository installation is to plug it in to the local authentication system. More often than not this is LDAP. It allows users to use their usual local username and password in the repository rather than having to remember another password. LDAP services can be provided by Microsoft Active Directory (run by most institutions who have Microsoft desktop systems) or by a dedicated LDAP service (e.g. OpenLDAP).

One thing I’ve noticed with the DSpace testathons is that often LDAP does not get tested because many of the developers do not have access to an LDAP system. Bugs slip through as a result: for example, in DSpace 1.5 LDAP authentication does not work with Manakin or SWORD. (I have fixed both in the upcoming 1.5.1 though :) ). With this in mind, and because I have to teach a DSpace technical course in 4 days’ time where we’ll be covering LDAP configuration, I’ve created an open LDAP server which can be used for testing and training.

Details:

  • ldap.provider_url = ldap://ldap.testathon.net:389/
  • ldap.id_field = cn
  • ldap.object_context = OU=users,DC=testathon,DC=net
  • ldap.search_context = OU=users,DC=testathon,DC=net
  • ldap.email_field = mail
  • ldap.surname_field = sn
  • ldap.givenname_field = givenName
  • ldap.phone_field = telephoneNumber

Users and their passwords are:

  • stuart / stuart
  • john / john
  • carol / carol

Each user has a full name (Stuart Lewis / John Smith / Carol Jones), a telephone number and email address so should be fully functional.
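
With settings like these and no sub-tree searching, the DN used for the bind is, as far as I am aware, simply the id field, the username and the object context joined together. The sketch below shows that assumed behaviour in Python; build_bind_dn is a hypothetical name, not a DSpace function, and it does no escaping of special characters.

```python
def build_bind_dn(netid, id_field="cn",
                  object_context="OU=users,DC=testathon,DC=net"):
    """Join id field, username and object context into a DN (assumed behaviour).

    No escaping is done, so this is only suitable for simple usernames
    like the three test accounts.
    """
    return f"{id_field}={netid},{object_context}"


for user in ("stuart", "john", "carol"):
    print(build_bind_dn(user))
```

This is why every user has to live directly inside the one OU named in ldap.object_context for the login to work.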

If you make use of this server, please drop me a line or leave a comment so I know. Otherwise it might get turned off…!

Preserving reactions to Lord Of The Rings

Friday, June 27th, 2008

‘Preserving reactions to Lord Of The Rings’ is a funny blog posting title, but I’ll explain…

Back in 2003 to 2004, our department of Theatre, Film and Television Studies undertook the biggest audience response survey to a film ever. They collected just short of 25,000 responses to the films from speakers of 14 different languages. The project is now finished, published, and they’re hoping to move on to even bigger projects of the same type. So the work is ready to archive in our repository, and it’s my job to archive the data in such a way as to enable and ensure preservation.

Now, I’m no preservation expert, so the following details what I did to archive the data, which was given to us in the form of a Microsoft Access database and a Word document explaining the structure of the database and the codings it used:

  • The database: Well, nothing wrong as such with archiving an Access database - it can easily be used by people today. So that gets archived. But what about a long-term copy for archival and preservation purposes? Access has a nice handy ‘Export to XML’ feature. That looks good! It even gives the option to ensure the file is correctly encoded in UTF-8 to preserve the audience responses in different character sets. (As an aside, the XML file is about 40MB, so I found that in order to get an XML editor to open the file to validate it and check the encoding I had to upgrade the RAM on my Vista workstation from 2GB to 4GB!).
  • The guidance notes: These came in Microsoft Word format, nice and easy, so that gets archived. A PDF/A copy is then created using Microsoft Word’s ‘Export to PDF’ option, and that is archived too.
  • The repository: All this is stored in a DSpace-powered repository, which runs daily file checksum checks to detect bit-rot and is backed up nightly to disk and tape, with off-site copies of the tapes stored.

Now to a non-preservation expert, this all sounds too easy. Have I been naive and missed anything out? (Wouldn’t surprise me! :) )
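
For anyone wondering what a checksum check for bit-rot boils down to, here is a minimal sketch in Python. It is not DSpace's own checksum checker: the function names are hypothetical and the choice of SHA-256 is an assumption made for the example.

```python
import hashlib


def file_checksum(path, algorithm="sha256", chunk_size=1024 * 1024):
    """Return the hex digest of a file, read in chunks to keep memory use low."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_changed_files(recorded):
    """Given {path: checksum recorded at deposit}, return paths that no longer match."""
    return [path for path, expected in recorded.items()
            if file_checksum(path) != expected]
```

The idea is to record a checksum when the file is deposited, recompute it on a schedule, and treat any mismatch as a sign that the stored bytes have changed and a copy should be restored from backup.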

Tracking repository searches from the inside

Friday, June 6th, 2008

One of the many great features of Google Analytics is that it can show the search terms that visitors to your site have used in search engines. This is a great tool for finding out what brings users to your repository.

Seven months ago Google launched a new feature in Google Analytics that also allows you to track the search terms used by visitors within your repository. It’s very easy to set up: all you need to do is enable the feature and set the query parameter used by your repository. Follow these steps from the help pages:

  1. Log in to your Google Analytics account.
  2. Click ‘Edit’ under Website Profiles for the profile you would like to enable Site Search for.
  3. Click ‘Edit’ from the ‘Main Website Profile Information’ section of the Profile Settings page.
  4. Select the ‘Do Track Site Search’ radio button in the Site Search section of the Edit Profile Information page.
  5. Enter your ‘Query Parameter’ in the field provided. Please enter only the word or words that designate an internal query parameter such as “term,search,query”. Sometimes the word is just a letter, such as “s” or “q”. You may provide up to five parameters, separated by a comma.
  6. Select whether or not you want Google Analytics to strip out the query parameter from your URL. Please note that this will only strip out the parameters you provided, and not any other parameters in the same URL. This has the same functionality as excluding URL Query Parameters in your Main Profile - if you strip the query parameters from your Site Search Profile, you don’t have to exclude them again from your Main Profile.

[Image: Google Analytics Site Search]

For DSpace you need to set the query parameter to query, and for EPrints set it to simple.
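
The query parameter is simply the name of the URL parameter that carries the search term. The Python sketch below shows the idea of pulling the term out of a search URL; it is not how Google Analytics is implemented, site_search_term is a hypothetical name, and the hostname is made up.

```python
from urllib.parse import urlparse, parse_qs


def site_search_term(url, query_parameter):
    """Return the first value of query_parameter in url, or None if absent or empty."""
    values = parse_qs(urlparse(url).query).get(query_parameter)
    return values[0] if values else None


# Hypothetical DSpace-style search URL using the 'query' parameter.
url = "http://repository.example.org/simple-search?query=sailing+robot"
print(site_search_term(url, "query"))
```

If the parameter name configured in Google Analytics does not match the one your repository actually uses, nothing is extracted and no site searches are recorded.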

To view the results, follow the links shown in the image (Content -> Site Search) and explore the results.

Here are some interesting statistics from our repository as an example of the extra stats it can provide:

  • 89% of visits did not make use of a site search, whilst the remaining 11% did.
  • 39% of search users left the system having performed the search without going any further (e.g. looking at one of the items found by the search).
  • 22% of searches resulted in search refinements being undertaken by the searcher.
  • 50% of searches were performed from the repository homepage, the remainder from item, collection and community pages.
  • Following a search, the average visitor stayed on the site for a further 1 minute and 30 seconds.
  • 8% of searches were performed without the visitor having entered a search term.

Lessons from teaching DSpace

Tuesday, June 3rd, 2008

Yesterday I spent the day with a colleague delivering a training day aimed at new or potential DSpace administrators as part of my role working with the Repositories Support Project (known as the RSP).

We had a fun, interesting and busy day talking about DSpace, but with a few hiccups along the way.

With each event we run, we learn new things about planning and delivering events. Whilst we’ve never had a bad event, there have been issues from time to time. With this event, the main issue was the hardware provided to us in the training suite at New Horizons in Birmingham. The staff were great, the facilities good and the food excellent, but the PCs were, ummm, a little on the old side! We were teaching DSpace, which is a piece of server software, so requires quite a bit of ‘infrastructure’ in terms of software prerequisites. Each trainee had their own PC with a copy of DSpace installed so that they had their own copy to mess about with, configure, and populate. To make this a little easier, we used the Ubuntu Linux distribution.

So when we combined Ubuntu (not a lightweight distro) with Postgres, Java, Tomcat, and Cocoon, let’s just say that the poor 7 year old Pentium 866s with 256 MB of RAM couldn’t quite cope. In fact, they couldn’t cope at all! Our other problem was that we’d configured the machines to launch Firefox as soon as they booted, so that the users were presented with DSpace straight away. Firefox had two tabs opened automatically upon startup, one for the JSP interface, and one for Manakin, the XML interface. This meant that Tomcat then had to start up these web applications, at which point the whole machine became unusable and started swapping like crazy.

Our solution was to run round each machine and delete the more resource-intensive Manakin user interface, and teach the course using the JSP interface instead. Even then, the machines were slow, so we had a lot of the trainees using one of our test servers back in the office instead.

So what is the lesson to be learnt from this?

Make sure you agree (in writing) the spec of the machines that you’ll be provided with at a training suite.

It sounds obvious, and we were probably just naive to assume that a PC training company would have machines that weren’t quite so old. But we live and learn, and have now negotiated a better room for our next course. Now to get the agreement in writing….

Shibboleth, SWORD, and DSpace 1.5

Tuesday, May 27th, 2008

It was nice to see the announcement recently from the MAMS (Meta Access Management System) project at Macquarie University in Australia that they have implemented Shibboleth authentication for DSpace 1.5. It makes use of the stackable authentication system, and is therefore very nicely integrated with the DSpace architecture.

I’ve been playing with Shibboleth a bit recently, trying to get Blackboard on Windows to work with it. Apparently it works on Unix, but no one knows about Windows. To cut a long story short, I got it almost working. Shibboleth will get our local identity provider to perform the authentication, but unfortunately the IIS Shibboleth ISAPI filter sets the username in the HTTP_REMOTE_USER header, rather than the REMOTE_USER header. Blackboard isn’t configurable enough to let you choose which one it reads. So an enhancement request has been submitted!
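
The kind of fallback that would get round this is easy to sketch. The Python below is purely illustrative (Blackboard is not written in Python, and remote_user is a hypothetical name): it takes a dictionary of server variables and returns the username from whichever of the two names is set.

```python
def remote_user(environ, names=("REMOTE_USER", "HTTP_REMOTE_USER")):
    """Return the first non-empty username found under any of the given names."""
    for name in names:
        value = environ.get(name)
        if value:
            return value
    return None


print(remote_user({"HTTP_REMOTE_USER": "stuart"}))
```

A fallback like this is only sensible when the web server in front strips any client-supplied header of that name, otherwise a visitor could simply send the header themselves.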

Anyway, I was thinking it would be nice to convert our DSpace instance to use Shibboleth authentication rather than LDAP. The great power of Shibboleth will be when all our systems use it, and we only have to log in to the IdP once. But I hit a snag in my thoughts…

I’ve just told a professor in our university that he could use our SWORD interface to remotely deposit some data that he is creating on a geographically remote server. This means he can periodically automatically archive the data in order to abide by the terms and conditions of the AHRC grant that is funding his work. How would SWORD work with Shibboleth? SWORD works with HTTP basic, and it would be hard to delegate this in the background to Shibboleth. So does that mean I can’t use Shibboleth?

But then I remembered the modular structure of DSpace. Each module (such as the user interface, the OAI-PMH interface, the SWORD interface) is deployed to the application server separately. Each module can therefore have its own configuration. Normally they would share a single configuration, but I could use different configurations for each one. The normal user interface can use Shibboleth for authentication, and the SWORD interface can keep using LDAP, or what might be even better would be to use the local in-built password system in DSpace so that the professor doesn’t have to embed his university username and password in a script.
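
To see why a separate local account is attractive for a scripted deposit, it helps to look at what HTTP Basic actually sends. The Python sketch below builds the header value; basic_auth_header is a hypothetical helper name and the account details shown are invented.

```python
import base64


def basic_auth_header(username, password):
    """Build the value of an HTTP Basic 'Authorization' header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


# Invented repository-only account for a deposit script.
print(basic_auth_header("deposit-script", "not-a-university-password"))
```

The username and password are only base64-encoded, not encrypted, and they have to be stored where the script can read them, so an account that is only valid in the repository limits the damage if they leak.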

So to conclude - we can use Shibboleth with DSpace whilst still having a working SWORD interface. Nice!

Repository bounce rates

Monday, May 26th, 2008

I’ve often wondered about what people do when they visit a repository, and whether what they are doing while visiting the repository could be considered ‘good’ in terms of the usefulness and general aims of the repository. Let me explain… I’m a big fan of Google Analytics, and one of the things it lets you see is what people do once they get to your repository. For each page it can show where they came from, how long each user stayed there, and whether they ‘bounced’ straight off to another web site afterwards (that is, Google Analytics on your repository did not encounter another view from that user in their browsing session), or whether they stayed within your repository to hopefully view more items.

The help file for Google Analytics describes the bounce rate as:

Bounce Rate: Bounce rate is the percentage of single-page visits (i.e. visits in which the person left your site from the entrance page). Bounce rate is a measure of visit quality and a high bounce rate generally indicates that site entrance (landing) pages aren’t relevant to your visitors. You can minimize Bounce Rates by tailoring landing pages to each keyword and ad that you run. Landing pages should provide the information and services that were promised in the ad copy.

If you consider an e-commerce website such as Amazon, then this description, and the aim of reducing the bounce rate must hold true. If your visitor searched for an item in a search engine, came to your website, viewed the item, and then ‘bounced’ away, you have lost the sale and the visitor took their business elsewhere. That is ‘bad’.
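
As a worked example of the definition, the bounce rate is just the share of visits that viewed exactly one page. The Python sketch below uses made-up visit data, and bounce_rate is a hypothetical function name rather than anything from Google Analytics.

```python
def bounce_rate(pages_per_visit):
    """Percentage of visits that viewed exactly one page."""
    if not pages_per_visit:
        return 0.0
    bounces = sum(1 for pages in pages_per_visit if pages == 1)
    return 100.0 * bounces / len(pages_per_visit)


# Four made-up visits: two viewed a single page, two viewed more.
print(bounce_rate([1, 3, 1, 2]))  # 50.0
```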

However, what is the purpose of a repository?

If you take the view that a repository (of the open access persuasion) is there to provide access to resources, then a bounce may not be so bad after all. Imagine the following scenario:

“I’m a researcher in the field of building robotic sailing boats. I’ve read an article that cites a paper by the title of ‘An Autonomous Sailing Robot for Ocean Observation’. So I duly perform a search using Google Scholar and see that a paper by that title is the top result. I visit the link and find myself in a repository which holds that paper. I download the paper, and go on my way, happy to have found what I wanted.”

Within Google Analytics we would see several different aspects of this visit:

  1. We’d see the visit to the metadata jump-off page.
  2. We’d see that the visitor came from Google Scholar.
  3. We’d see the search term that was used by the user within Google Scholar.
  4. We’d see that the visitor stayed on the metadata jump-off page for say 20 seconds.
  5. Then… nothing. In other words, it would be registered as a bounce.

So in traditional analytics terms this looks like a bad visit. However, was it? Clearly not. The visitor got what they wanted, and the repository has done its job. Why did Google Analytics not register the fact that the visitor read the PDF version of the paper though?

Unlike website log file analysis software (e.g. AWStats), Google Analytics can’t see every single interaction between the user and the web server. It can only see pages which include a small bit of JavaScript that sends the details of the visit to Google. So in the case of the repository, the metadata jump-off page contains the code so Google Analytics knows about the visit, but the PDF cannot contain the code. Google Analytics therefore doesn’t know about the successful download of the PDF. Maybe one day Google will address this issue in some way? It would be great if they could.
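
The server log, on the other hand, does record the PDF request. The Python sketch below counts successful PDF downloads in log lines that follow the usual Apache common or combined layout; it is a rough illustration rather than a replacement for a proper log analyser, and the sample lines (including the paths) are invented.

```python
import re

# Picks the request path and status code out of an Apache-style log line.
REQUEST = re.compile(r'"GET (?P<path>\S+) [^"]*" (?P<status>\d{3})')


def count_pdf_downloads(log_lines):
    """Count GET requests with status 200 whose path ends in .pdf."""
    count = 0
    for line in log_lines:
        match = REQUEST.search(line)
        if not match:
            continue
        path = match.group("path").split("?", 1)[0]
        if match.group("status") == "200" and path.lower().endswith(".pdf"):
            count += 1
    return count


sample = [
    '192.0.2.1 - - [26/May/2008:10:00:00 +0100] "GET /handle/123/456 HTTP/1.1" 200 8123',
    '192.0.2.1 - - [26/May/2008:10:00:20 +0100] "GET /bitstream/123/456/1/paper.pdf HTTP/1.1" 200 51234',
    '192.0.2.7 - - [26/May/2008:10:05:00 +0100] "GET /bitstream/123/999/1/gone.pdf HTTP/1.1" 404 512',
]
print(count_pdf_downloads(sample))  # 1
```

In the scenario above, the first two sample lines are the visit that Google Analytics records as a bounce: the jump-off page view followed twenty seconds later by the download it never sees.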

The repository has served its purpose, and the visitor got what they were after, but is it also the job of the repository to hold the user and to attract them to other related items in the repository? There are many ways this could be done, a subject for another day, but these will no doubt include elements of Web 2.0, social networking and item suggestion. This issue does though highlight one of the origins and ongoing features of Google Analytics - that of supporting e-commerce sites, particularly those that make use of its AdWords scheme.

But for me, for now, I think I’m reasonably happy with a bounce!