Monthly Archives: July 2008

Would iTunes make a good journal article repository?

I’ve just read a few blog and news posts (none from the repository world) about a paper published in The American Journal of Roentgenology (I had to look it up too: “radiology employing x-rays”). The paper is entitled ‘An Easy and Effective Approach to Manage Radiologic Portable Document Format (PDF) Files Using iTunes’ and describes how radiologists at Shanghai’s Renji Hospital and Shanghai Jiaotong University School of Medicine have used iTunes to store and manage PDF files containing radiological documents.


The abstract says:

OBJECTIVE: The objective of this article is to explain an easy and effective approach for managing radiologic files in portable document format (PDF) using iTunes.

CONCLUSION: PDF files are widely used as a standard file format for electronic publications as well as for medical online documents. Unfortunately, there is a lack of powerful software to manage numerous PDF documents. In this article, we explain how to use the hidden function of iTunes (Apple Computer) to manage PDF documents as easily as managing music files.

Apparently (I’ve not actually read it, as it costs $10 :( ) the paper describes using iTunes’ built-in features, such as tagging items with keywords and rating items with stars. Since many of us use iTunes every day to manage our music and iPods, could it make a useful software tool for personal repositories?

(There is a useful page about using PDFs in iTunes at http://lifehacker.com/software/pdf/geek-to-live–organize-your-pdf-library-with-itunes-240447.php)

Google Analytics is not a statistics package!

As everyone knows, I’m a big fan of using Google Analytics with repositories to see what is happening with respect to visitors: what they are looking at, which links they are following, where they are coming from, how many people are visiting the site, and so on.

However, from time to time I come across concerns about data that is not captured by Google Analytics. This includes users who do not allow JavaScript or cookies, and visitors who click directly on ‘files’ (e.g. PDF files). In the second case, the data isn’t tracked because there is no web page shown from which to run the Google Analytics tracking code. In an attempt to help collect some of this information I have used a script by Patrick H. Lauke which fires when a user clicks to download a file from a metadata jump-off page: it registers the click with Google Analytics, and the download is recorded. But, as I said, it cannot record direct hits on a file that did not first go via the repository.

Is this a problem? Personally I don’t think so:

  • At least some of the data is now being recorded, which is better than none. It might not be numerically accurate, but hopefully it is still representative of user behaviour.
  • Remember that Google Analytics is an analytics package, not a statistics package. It does not claim to record every click, but is intended to help with analysing and improving the user experience (e.g. “Do I get more file downloads if I place the list of files above the metadata or below it?” or “Do users who land on a browse page download more files than those who arrive directly on an item page?”).
  • If you want raw download figures, use a proper statistics system that works from web server logs (e.g. IRStats, or a general web stats package such as AWStats). Most likely you’ll want to use both.
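To illustrate the log-based approach, here is a minimal Python sketch (not IRStats or AWStats themselves, and assuming a hypothetical repository URL layout) that counts successful PDF downloads per file from an Apache combined-format access log:

```python
import re
from collections import Counter

# Matches the request line of an Apache combined-format log entry, e.g.
# 192.0.2.1 - - [18/Jul/2008:10:00:00 +0100] "GET /bitstream/123/paper.pdf HTTP/1.1" 200 1024
LOG_LINE = re.compile(r'"GET (?P<path>\S+\.pdf) HTTP/[\d.]+" (?P<status>\d{3})')

def count_pdf_downloads(log_lines):
    """Return a Counter of successful (HTTP 200) PDF downloads per path."""
    counts = Counter()
    for line in log_lines:
        m = LOG_LINE.search(line)
        if m and m.group("status") == "200":
            counts[m.group("path")] += 1
    return counts

sample = [
    '192.0.2.1 - - [18/Jul/2008:10:00:00 +0100] "GET /bitstream/123/paper.pdf HTTP/1.1" 200 1024',
    '192.0.2.2 - - [18/Jul/2008:10:01:00 +0100] "GET /bitstream/123/paper.pdf HTTP/1.1" 200 1024',
    '192.0.2.3 - - [18/Jul/2008:10:02:00 +0100] "GET /handle/123 HTTP/1.1" 200 512',
]
print(count_pdf_downloads(sample))  # paper.pdf counted twice; the HTML page hit ignored
```

Unlike the Google Analytics approach, this counts every request the web server saw, including direct hits that never passed through a repository page.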

Microsoft SWORD announcements

It has been very encouraging to see two announcements within the last 24 hours from Microsoft regarding SWORD:

The first came in the form of an email from Lee Dirks to the American Scientist Open Access Forum. In the email Lee says:

Microsoft Research and arXiv.org have been working closely on the adoption/utilization of the SWORD protocol and arXiv.org already has a preliminary service up and running.   I would encourage you to follow-up directly with Simeon and Paul for additional detail, but I think the bulk of the work is largely done.

In an important related point, I would like to add that we (Microsoft Research) is actively working on the ability to push from Word 2007 directly to repositories that can consume the SWORD protocol.

And in the second, a blog posting by Microsoft’s Pablo Fernicola gives details of the alpha preview of the Microsoft eJournal Service (a hosted eJournal management system). In it he writes:

The service is open to all file formats, article submissions can be of any format, as configured as part of the site settings.  On the archival side, the service supports depositing into any Information Repository that uses the SWORD protocol (for example, the ArXiv repository and EPrints based IRs).

These are both encouraging developments for repository interoperability.

About to load test DEF repositories

One of the core aims of the ROAD project is to load test DSpace, EPrints and Fedora repositories to see how they scale when used as repositories to archive large amounts of data (in the form of experimental results and metadata). According to ROAR, the largest repositories (housing open access materials) based on these platforms hold 191,510, 59,715 and 85,982 items respectively (as of 18th July 2008). We want to push them further and see how they fare.

DSpace, for instance, has in the past suffered from ongoing bad publicity, partly due to its own honesty about issues in early versions, which showed some instability and slowness under load (both user load and content load). One of the downsides of the web (well, of some of its users really) is that old reports stay archived, and are read and believed with no consideration of changes that may have taken place in the interim. Many or most of these issues have now been sorted at the sort of scale that used to cause problems (100,000 to a quarter of a million items), and we need to re-evaluate the platform to see where it now breaks. Indeed, the following report set out to test DSpace with 1 million items and found no particular issues:

Testing the Scalability of a DSpace-based Archive, Dharitri Misra, James Seamans, George R. Thoma, National Library of Medicine, Bethesda, Maryland, USA

I’ve not looked very hard, and nothing obvious about EPrints scalability appeared on the first page of Google results, but for Fedora I found this useful page: http://fedora.fiz-karlsruhe.de/docs/Wiki.jsp?page=Main

Our new load testing hardware has arrived. We have a standard spec server to perform the testing, and a beefy little number on which to run the repositories:

  • Two quad-core XEON processors
  • 16GB RAM
  • 6TB raw SATA disk (yes, it’s slow, but it’s cheap!)

We’ve not yet decided what tests we’ll run (get in contact if you have any suggestions!), but we have decided we’ll be using SWORD to perform the test deposits: it allows us to throw identical packages at all three repositories, which gives us a level playing field.
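As an illustration (not our actual test harness), a concurrent SWORD-style deposit run could be sketched in Python like this. The endpoint URL is a placeholder, and the packaging identifier shown follows the SWORD 1.x profile, so treat both as assumptions:

```python
import concurrent.futures
import urllib.request

# Hypothetical SWORD deposit endpoint for the collection under test.
DEPOSIT_URL = "http://repository.example.org/sword/deposit/test-collection"

def build_deposit(package_bytes, filename):
    """Build an HTTP POST carrying one deposit package, with SWORD-style headers."""
    req = urllib.request.Request(DEPOSIT_URL, data=package_bytes, method="POST")
    req.add_header("Content-Type", "application/zip")
    req.add_header("X-Packaging", "http://purl.org/net/sword-types/METSDSpaceSIP")
    req.add_header("Content-Disposition", "filename=%s" % filename)
    return req

def deposit_concurrently(packages, workers=10):
    """Fire `workers` deposits at the repository simultaneously to simulate load.

    `packages` is a list of (filename, bytes) pairs; returns the HTTP responses.
    """
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(urllib.request.urlopen, build_deposit(data, name))
            for name, data in packages
        ]
        return [f.result() for f in futures]
```

Because the same package bytes can be posted to each repository’s endpoint, the workload is identical across platforms, which is exactly the level playing field mentioned above.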

Some initial work showed that some of the repositories fell over as soon as we tried to deposit more than a couple of items concurrently using SWORD, and others failed at 50 concurrent deposits, but these were small implementation issues which have now been fixed, so full testing can start taking place.

More details will be blogged once we start getting some useful comparative data. However, seeing as the report cited above took about 10 days to deposit 1 million items, it may be some weeks before we’re able to report data from meaningful tests on each platform.

These results will inform the next stage of the ROAD project, which is to choose one of the repositories on which to build a repository for the Robot Scientist, so the stakes are high!

Test LDAP service

One of the first integration tasks undertaken on a new repository installation is to plug it into the local authentication system. More often than not this is LDAP, which allows users to use their usual local username and password in the repository rather than having to remember another password. LDAP services can be provided by Microsoft Active Directory (run by most institutions with Microsoft desktop systems) or by a dedicated LDAP server (e.g. OpenLDAP).

One thing I’ve noticed with the DSpace testathons is that LDAP often does not get tested, because many of the developers do not have access to an LDAP system – for example, in DSpace 1.5 LDAP authentication does not work with Manakin or SWORD. (I have fixed both for the upcoming 1.5.1 though :) ). With this in mind, and because in four days’ time I have to teach a DSpace technical course where we’ll be covering LDAP configuration, I’ve created a publicly accessible LDAP server which can be used for testing and training.

Details:

  • ldap.provider_url = ldap://ldap.testathon.net:389/
  • ldap.id_field = cn
  • ldap.object_context = OU=users,DC=testathon,DC=net
  • ldap.search_context = OU=users,DC=testathon,DC=net
  • ldap.email_field = mail
  • ldap.surname_field = sn
  • ldap.givenname_field = givenName
  • ldap.phone_field = telephoneNumber

Users and their passwords are:

  • stuart / stuart
  • john / john
  • carol / carol

Each user has a full name (Stuart Lewis / John Smith / Carol Jones), a telephone number and an email address, so should be fully functional.

If you make use of this server, please drop me a line or leave a comment so I know. Otherwise it might get turned off…!