Monthly Archives: February 2010

On DSpace development

Over the past 9 months I’ve had the privilege to hold the position of ‘DSpace Release Coordinator’. This has meant that I’ve been able not only to work with a group of dedicated and talented repository developers and to act as liaison with the user community, but also to watch the development process happen from close quarters.

With each release of DSpace a release coordinator is elected to manage the process of releasing the software. In most (all?) cases however there has only been one volunteer, meaning a vote has not been used. Each release coordinator brings with them their own unique style of working, collaborating, and decision-making. These inputs mean that the development process has changed slowly over time from the beginnings when it was a funded project hosted by MIT and HP Research Labs, to the truly community-driven open source project that it is today.

My contribution to the change has come through the form of a survey at the start of the 1.6 development process to collect the top three new features that the community wanted in the release. It was my desire that these three features should be completed before 1.6 was released. I am pleased to say that this has happened. Prior to 1.6 the way in which the feature set for the next release was decided was usually just based on the effort available within the community, and the interests of those with the effort available. Whilst this has inevitably continued to happen (and to great effect – 1.6 will include many excellent features not in the top 3 list) it did give us a focus for our developments.

This blog post outlines some of the ways that the DSpace development process has changed over the past year, and some ways in which I think it should continue to change.

On the past year:

Almost a year ago we started holding weekly development meetings using IRC (Internet Relay Chat). These meetings were spearheaded by our technical director Bradley McLean and have resulted in the ability for us to have much more co-ordination between developers. Before the weekly meetings any committer could commit any code they chose to, and there was little discussion about the general direction that any release would take. To some extent this freedom continues, however it has become much more of the norm for developers to discuss their plans, and to get the approval of their fellow developers before doing any work.

We have started to use a new issue tracking system called JIRA. JIRA allows us more flexibility than our previous trackers that were provided by SourceForge. The biggest change however has not come through the direct use of JIRA, as much of its functionality is identical to SourceForge, but more through the interaction between the weekly development meetings and JIRA. The first 15 minutes of each development meeting is typically devoted to reviewing all new issues (bugs, suggestions, patches) and deciding what to do with them. Sometimes this involves us closing the issue immediately (e.g. if it is out of scope for DSpace), asking for further input from the contributor, or assigning a developer to work with the contributor to resolve the issue (fix the bug / apply the patch / diagnose the problem). One of the problems we used to suffer from with the SourceForge tracker (a problem with our lack of processes rather than with the software), and for which we used to get bad publicity, was the average time for which issues stayed on the tracker. The average was usually over two years! It should be stressed that this didn’t mean contributions took that long to be assessed, just that no one ever cleared out old issues, which meant that the average was somewhat skewed.

The final change during the past year has been the addition of some more committers. The most notable of the new committers has been Jeffrey Trimble. His addition to the group is notable as he is the first non-developer to join the group. He was invited to join as he has been an active member of the community for many years, and has been working with us as a ‘documentation gardener’ to ensure that our documentation gets the attention it needs.

I hope that the result of these changes will be that the community finds DSpace 1.6 to be a better piece of software, more in line with their needs, and with better documentation. In addition contributors should have received better and quicker feedback on their contributions.

But what about the future? Where next? I’m sure no one thinks that we have the perfect development community and processes. These are my thoughts on what we could change, although many people and discussions have influenced them heavily, so I can’t claim most of them to be my original thoughts. Some of these thoughts have been expressed elsewhere and by others so won’t be new, others perhaps will be new to you.

On the release coordinator role:

I’m quite often asked, “What does the release coordinator do?” This is perhaps hard to define as each release coordinator has their own style, but I’ll explain what it has meant to me. Primarily it has involved coordination. This isn’t a technical job; it is more akin to project management: knowing what needs doing, knowing how far through the process of doing these things we are, knowing who is doing what, finding volunteers to undertake tasks, and reporting progress to the community. Performing the role has also cost time and, although this was purely a personal choice, has cost me money, as I have paid for the server that allows users to test the release.

Being a developer myself has no doubt helped in this process, but I don’t think the role needs to be held by a developer. The developers of DSpace have often been criticised unfairly for not listening to users of the software when deciding what features to develop. Perhaps having a non-developer in this role could help, as their actions are less likely to be perceived this way? I know from experience at recent DSpace conferences, and after requests for feedback, that very often repository managers do not respond to requests for input. If this were due to any perceived barriers between the development and user communities, perhaps this would help break them down?

Traditionally the release coordinator has been chosen once the previous release has been made. It would be more effective if they were chosen three or four months earlier. This would give the benefit of them learning the ropes from the current release coordinator, but more importantly the current and next release coordinators could work together to help decide what is in and out of scope for the current and next versions. This means that even before a release is made, we know where we’re heading with the next version, which may influence some decisions we make today.

On the decision making process:

The introduction of the weekly development meetings has helped the decision making process in a dramatic way. Where developers once worked in isolation, they now work together much more closely. However the day-to-day decisions are still made by developers and the release coordinator. We need to find a way to involve the wider community in these decisions. For example it was a personal decision of mine that we should wait until all three of the requested new features were completed before we release 1.6. This has undoubtedly delayed the release of 1.6. If the community had decided that two of these features would have been sufficient to make a new release, with the third feature waiting for the subsequent release, then the software could perhaps have been released three months earlier.

How do we get this extra input? It would be impractical to involve the whole community in all decisions, so we need a representative sample of the entire community to form a team who can make these decisions. The team needs to include developers, users, and Duraspace staff. A team of 8 to 12 should ensure enough breadth of experience whilst remaining small enough to be effective. Duraspace would decide the Duraspace members, and elections could be held for the two categories of other members (developers and users).

On committers:

First off I want to express the privilege I feel of being a DSpace committer and being trusted to help steer the development of the most widely used repository platform. I say this first because I’m about to launch into a mini rant!

Committers often get criticised for a lot of different things. We get criticised for not listening to users enough (we listen, although often they don’t talk!), we get criticised for following our own agendas (most of the time we follow the agenda of our employers), we get criticised for developing the software too slowly (most of us have day jobs, and DSpace development is only a small part of our roles, and at any given time there is usually only a third of the committers active due to other work pressures). One of the email lists that the developers use keeps us up to date with when code changes are made in our code repository. I know all the committers, and what time zones they operate in, and without exception a good percentage of these code changes occur outside of each committer’s working day. We work hard to give as much of our time as we can to DSpace, often at the expense of our own time. Just ask my wife how often I’m working on DSpace code either before or after work, or at weekends and holidays! Committers try their best, are a very friendly bunch, and don’t deserve the criticism they sometimes get. Rant over – time for some more productive thoughts!

It is useful to explain how the committers group has evolved over time. The first committers were members of the original HP / MIT project team who initially developed DSpace. The next group of developers who became committers were typically members of funded projects to create some of the first installations of DSpace. They developed some of the early features added to the application. Later still, the new committers were usually asked to join the group because they spent a large amount of their time working with DSpace. These days we do not have so many developers who devote so much of their time to DSpace. This is probably because it is no longer such a big job to install and configure an institutional repository. So how does this affect the committers group?

As I mentioned in my rant, it means that at any time, there is probably only a third of the group ‘active’ and able to give development time and effort, and even then it is likely that most will only be able to give a very small amount of time (not nearly enough to develop large new features). We need to adjust the way the committers group is composed to account for that.

Something else I’d like to note is perhaps the perceived ‘separateness’ of the committers group. Because the committers group isn’t open like the rest of the community, and because there is no way to ‘become’ a committer (other than contribute over time, and wait to be invited) there is probably a perceived barrier. Some developers may think there is no chance of them becoming a committer so will not get involved at all. This is a loss to the community and something we need to address.

My thoughts are that if a community decision-making team existed, then the committers group could change its direction. At the moment being a ‘committer’ conflates the original meaning, namely the right to add code to DSpace in our code repository, with the role of decision-making. If this decision-making role is given to the newly formed group, then being a ‘committer’ goes back to the traditional meaning. We can then open this group up to anyone who asks for (and needs) commit rights.

Of course allowing anyone to commit code to the code repository comes with the potential for trouble, and this would have to be managed with processes. For example developers who wanted to be granted commit rights would need to have contributed three patches, then for the first six months would have to get the express permission of the decision making group before applying patches, and then they would have finally earned their wings and be granted the full freedom the current committers have.

On the release schedule:

Releases have traditionally happened when everything was ready, and have not followed any prescribed timelines. This has ensured the software evolves at its own pace, but has many negative points such as users not being able to predict when they’ll need technical effort in the future for upgrades, and has slowed down the release of some features that could have been released earlier.

Our development practice has so far been to make a large release every year or two, and to make two or three minor releases between them. In a traditional software development model minor releases are only used for small changes or bug fixes. Because our releases are so spread out, minor DSpace releases tend to include much more.

I’d like to see us move to more regular releases, and to keep to only those minor fixes and features for minor releases. If we were to do that, then we should start work on the development of 1.7 as soon as 1.6 is released, whereas previously we would have gone on to 1.6.1. No doubt we’ll need a 1.6.1, but in coding terms this should be developed on a branch of our code repository, not in trunk.

On other roles:

So far I’ve concentrated on the technical development of DSpace. However there are many other roles that could be filled to improve the community and software further. Whilst we’ve been lucky to have Jeff working on the documentation for 1.6, if we had a small team of documentation specialists working on it, the results would be wonderful documentation complete with screenshots, howtos, usage tips etc. The same goes for other areas such as help screens in DSpace, translations, publicity, training materials, screen casts etc. We need to find good ways of encouraging more contributions from the community, playing to people’s strengths and interests. If all 700+ institutions could donate a small amount of effort, just think what we could do!

These are my thoughts. I’m sure yours are different in some or all aspects. I’m open to any comments, and no doubt my views will continue to change as this subject is discussed further and we get more people’s input. I’d love to know what you think!

DSpace 1.6 – What will be in it for me?

Soon after the release of DSpace 1.5.2 in April 2009 I wrote a blog article ‘DSpace 1.5.2: What’s in it for me?’. The final release of DSpace 1.6 is due shortly, and as the release co-ordinator I thought it might be good to write a similar blog post outlining the key changes and new features that will make it into DSpace 1.6. Soon after 1.5.2 was released we issued a survey asking the DSpace community what three new features they would like to see in DSpace. We shortlisted the responses and there were three clear winners for features that people were asking for. We therefore decided to base the release of DSpace 1.6 around those features. Once those features had been developed and tested, we’d release 1.6.

Those three features were:

Better statistics: The current statistical reporting capabilities of DSpace, whilst sufficient at the time when they were developed, have now become a bit long in the tooth. They are limited to basic reports of metrics such as how many items are in a repository, how many times each item has been downloaded (with no filtering out of automatic search engine spiders which often account for over two thirds of the hits), or how many times different search terms have been used.

When we analysed the requirements that users wanted, the biggest requirement was item-level statistics. This feature has now been developed (by @mire) and works in an innovative way that we’d not thought of before they developed it. Rather than storing item views in a log file, or in a database table, they store the item view data in a solr index. What does that mean? Basically they are stored in a search engine index that can be queried very fast and efficiently and in powerful ways.
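To give a feel for what querying a search index for statistics can look like, here is a minimal sketch in Python that builds a Solr query URL counting views of one item, grouped by country. The parameters `q`, `fq`, `rows`, `facet`, `facet.field` and `wt` are standard Solr query parameters, but the URL and the field names (`item_id`, `is_bot`, `country`) are assumptions made for illustration and are not necessarily those used by the DSpace statistics index.

```python
from urllib.parse import urlencode


def build_item_views_query(solr_select_url, item_id):
    """Build a Solr query URL that counts views of one item, faceted by country.

    The field names below are hypothetical placeholders.
    """
    params = [
        ("q", f"item_id:{item_id}"),   # only view events for this item
        ("fq", "is_bot:false"),        # filter out search engine spiders
        ("rows", "0"),                 # we only want counts, not the events
        ("facet", "true"),
        ("facet.field", "country"),    # one count per country of origin
        ("wt", "json"),
    ]
    return f"{solr_select_url}?{urlencode(params)}"


# Hypothetical local Solr URL, used only to show the shape of the request.
url = build_item_views_query("http://localhost:8080/solr/statistics/select", 123)
print(url)
```

Because each view is stored as a document in the index, a single faceted query like this can return a total and a per-country breakdown without scanning log files.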

Out-the-box simple statistical views are available for each item, collection, and community in both the JSPUI and the XMLUI. Information is given about item views, bitstream downloads, and user metadata such as the location the users of the repository came from. The reports are quite basic, but fulfil the requirements we were given. In future versions there will no doubt be work undertaken to make the reports look better and provide more information. The solr index holds a lot of statistical information, we just need to find the best way of displaying it. Along with the new statistics feature comes a script to convert your old dspace.log files into the new format. This means that you can import statistics from old log files, back as far as you have kept them for.

Embargo feature: The lack of embargo functionality in DSpace has been a problem for a long time as universities in particular often need this to either manage open access journal articles that may be under a 6, 9 or 12 month embargo, or for theses that cannot be made public for a certain period. However, when we listened to further input about the requirements, it became obvious that lots of people require subtly different methods of embargoes.

The embargo feature written by Richard Rodgers and Larry Stone (MIT and Harvard respectively) takes this into account. The embargo feature has been written as a framework rather than a fixed implementation. This means that it is possible to write your own embargo rules (in Java classes). A simple implementation is included out of the box that should fulfil the needs of many users by allowing an embargo lift date to be set during the submission of the item. The bitstreams (but not item metadata) are locked from public view until that date has passed.
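The idea of a framework with pluggable rules can be sketched as follows. This is an illustration in Python rather than the actual Java interfaces, and every name in it (`fixed_date_rule`, `months_after_deposit_rule`, `bitstreams_are_public`) is hypothetical. It also assumes that bitstreams become public on the lift date itself, which is a choice made for the sketch.

```python
import calendar
from datetime import date


def add_months(start, months):
    """Return the date `months` calendar months after `start`, clamping the day."""
    month_index = start.month - 1 + months
    year = start.year + month_index // 12
    month = month_index % 12 + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(start.day, last_day))


def fixed_date_rule(terms, deposited):
    """Simple rule: the embargo terms are the lift date itself (YYYY-MM-DD)."""
    return date.fromisoformat(terms)


def months_after_deposit_rule(terms, deposited):
    """Custom rule: the terms are a number of months (e.g. 6, 9 or 12) after deposit."""
    return add_months(deposited, int(terms))


def bitstreams_are_public(rule, terms, deposited, today):
    """Bitstreams stay locked until the lift date computed by the chosen rule."""
    return today >= rule(terms, deposited)


deposited = date(2010, 3, 1)
print(bitstreams_are_public(fixed_date_rule, "2010-09-01", deposited, date(2010, 8, 31)))
print(months_after_deposit_rule("6", deposited))
```

Swapping the rule function changes how the terms are interpreted without touching the code that enforces the embargo, which is the point of building a framework rather than a fixed implementation.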

Batch metadata editing: The third of the big three features requested was for a facility to enable batch metadata editing. The users who requested this fell into two camps, and had two different requirements. One was for the ability to edit a lot of metadata easily and in bulk, whilst the other was to perform global changes across the repository (e.g. update all records with the author name ‘Stewart Lewis’ to ‘Stuart Lewis’).

Because the former of these could be used to achieve the latter, we chose to implement it. I developed this feature at the University of Auckland where we are already using it regularly, and the XMLUI interface was developed by Kim Shepherd at the Library Consortium of New Zealand. The batch metadata editing tool is based around the assumption that there are better tools than DSpace for editing large amounts of metadata, so rather than trying to make DSpace provide these features, let’s enable the import and export of large amounts of metadata into these tools. This is achieved through the use of CSV (comma separated values) files. CSV files can be opened by most spreadsheet packages such as Microsoft Excel or OpenOffice. These tools have features such as global find and replace, spelling checkers, copy and paste etc which all help with the editing of the metadata.

Metadata can be exported for whole collections, whole communities, search results, browse results, or for the whole repository. Once changes have been made, the file is uploaded back into DSpace which detects the changes and displays them to the user. If the user confirms that the changes are correct, then they will be made. The batch metadata editing feature can also be used to enable the creation of new metadata-only records.
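The export, edit, and detect-changes cycle described above can be sketched in a few lines of Python. This is not the DSpace implementation, and the column headings (`id`, `dc.title`, `dc.contributor.author`) and sample rows are assumptions made for the example.

```python
import csv
import io

# A tiny stand-in for an exported metadata file (hypothetical columns and data).
EXPORTED = (
    "id,dc.title,dc.contributor.author\n"
    "1,First thesis,Stewart Lewis\n"
    "2,Second report,Jane Doe\n"
)


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


def replace_value(rows, column, old, new):
    """Global find and replace on one column, as a spreadsheet would do."""
    return [{**row, column: new if row[column] == old else row[column]} for row in rows]


def detect_changes(before, after):
    """List (id, column, old value, new value) for every edited cell."""
    original = {row["id"]: row for row in before}
    changes = []
    for row in after:
        old_row = original.get(row["id"], {})
        for column, value in row.items():
            if column != "id" and old_row.get(column) != value:
                changes.append((row["id"], column, old_row.get(column), value))
    return changes


before = read_rows(EXPORTED)
after = replace_value(before, "dc.contributor.author", "Stewart Lewis", "Stuart Lewis")
changes = detect_changes(before, after)
for change in changes:
    print(change)
```

Only the cells that differ are reported, which is what allows the changes to be shown to the user for confirmation before anything is written back.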

Our intention was to ship DSpace 1.6 once these features were completed. However, whilst waiting for this, the DSpace community worked its magic once again, and came up with loads of new features for us to include. This list isn’t exhaustive, but contains some of the other key features that we’ve been able to include:

  • Authority control: A new authority control framework has been included which allows authority sources to be developed for metadata input. For example you may wish to link up author names with a local or national identity database, or link up publications to their ISSNs. In addition to the raw functionality, AJAX lookups are enabled to allow autocomplete functionality to show users matches to the data as they are typing (Larry Stone / Andrea Bollini).
  • Delegated administration: For a long time DSpace has suggested via some options in the user interface that it supports devolved administration of parts of the repository to different users. In some ways this was true, but it was very limited and didn’t include basic options such as delegating the ability to delete items to other users. This has now been included and is fully configurable (Andrea Bollini / Tim Donohue).
  • OpenSearch: An open XML search results system (Richard Rodgers).
  • OAI-PMH harvesting support: This isn’t the ability for DSpace to expose its items via OAI-PMH (which it has done since version 1), but instead is a facility that allows DSpace to harvest other repositories and import their data into DSpace. This could be useful if you want to mirror all or parts of another repository. (Alexey Maslov).
  • Batch imports and exports: These can now make use of zip files instead of directory hierarchies (Stuart Lewis).
  • Command launcher: A new command launcher has been written to replace all of the old DSpace command line scripts. This means that one script can be used to perform all command line functions, and works on all platforms as in the past we’ve not shipped scripts for Windows, only Unix (Stuart Lewis).
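As an illustration of the OAI-PMH harvesting item in the list above, this is roughly what a harvester asks of a remote repository. `ListRecords`, `metadataPrefix`, `set` and `from` are part of the OAI-PMH protocol; the base URL and set name are hypothetical, and the sketch only builds the request URL rather than showing how DSpace performs the harvest.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None, from_date=None):
    """Build an OAI-PMH ListRecords request URL for a remote repository."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # harvest only part of the repository
    if from_date:
        params["from"] = from_date    # only records changed since this date
    return f"{base_url}?{urlencode(params)}"


# Hypothetical remote repository and set name.
harvest_url = list_records_url("http://repository.example.org/oai/request", set_spec="theses")
print(harvest_url)
```

Restricting the request to a set is one way of mirroring only part of another repository.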

In addition to these, there have been literally dozens of other new features, improvements to current features, and bug fixes. We think that you’ll be happy! When you start using these features, remember to say a “Thank you” to the two-dozen developers who have worked to bring you these new tools. Also say “Thank you” to the other dozens of users who have provided input to the development of these features, who have tested them, and provided feedback. DSpace really couldn’t exist without the community around it.

Your biggest question is now probably “When will it be released?”. Later this week we hope to release a final ‘release candidate’ which can be used for some last-minute testing. Assuming this all goes well and no show-stopping bugs are found, we plan to release it during the first week of March. All this is tentative, but we’ll keep you

EasyDeposit – SWORD deposit tool creator

The development of the SWORD (Simple Web-service Offering Repository Deposit) protocol has enabled repositories to start accepting deposits from remote systems and interfaces. If you’re unsure of the basics of SWORD, read one of the following:

However, to date there has not been a great deal of use of SWORD. One of the reasons is a lack of SWORD clients that can deposit items into repositories. Demonstration clients were created by the SWORD project, and a PHP SWORD library was created by the SWORD2 project, but no client that can easily be set up by web developers or repository administrators to be used by depositors has been created.

A bit of background:

Last year as part of my job at the University of Auckland Library, I had to create a SWORD deposit client to allow PhD candidates to submit an electronic copy of their thesis. We wanted to use SWORD to do this as it means the PhD students do not have to create a repository account and learn how to submit in the repository. The SWORD client was written in PHP and made use of the SWORD PHP library. The client was made up of a very small number of pages: login, enter title of thesis, upload file, select embargo and licensing options, verify, submit.

I then had to create a second similar deposit interface to allow a department to archive a technical report series. This deposit interface was similar, but didn’t have the embargo option, asked for more metadata, and returned the URL of the deposited item in a format that could be inserted into their own web publishing system.

Developing and maintaining two similar but not identical systems seemed to be wasteful, therefore I decided to create a generic SWORD deposit interface toolkit that allowed new deposit systems to be easily created. EasyDeposit was born!

What is EasyDeposit?

EasyDeposit is a toolkit for easily creating SWORD deposit web interfaces using PHP. To start using EasyDeposit, follow the installation instructions.

How does EasyDeposit work?

EasyDeposit allows you to create customised SWORD deposit interfaces by configuring a set of ‘steps’. A typical flow of steps may be: login, select a repository, enter some metadata, upload a file, verify the information is correct, perform the deposit, send a confirmation email. Alternatively a deposit flow may just require a file to be uploaded and a title entered. A configuration file is used to list the steps you require.

EasyDeposit makes use of the CodeIgniter MVC PHP framework. This means each ‘step’ is made up of two files: a ‘controller’ which looks after the validation and processing of any data entered, and a ‘view’ which controls the web page that a user sees. This separation of concerns makes it easy for web programmers to edit the controllers, and web designers to tinker with the look and feel of the interface in the views.
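The configured list of steps and the controller/view split can be sketched like this. EasyDeposit itself is written in PHP on CodeIgniter, so this Python version is only an illustration of the idea: the step names are taken from the steps EasyDeposit ships with, but the function names and the way a controller signals the next step are assumptions.

```python
# Hypothetical configuration: the ordered list of steps for one deposit flow.
STEPS = ["title", "uploadfile", "verify", "deposit", "thankyou"]


def next_step(current):
    """Return the step that follows `current`, or None at the end of the flow."""
    position = STEPS.index(current)
    return STEPS[position + 1] if position + 1 < len(STEPS) else None


def title_controller(form, session):
    """Controller: validate the submitted data and decide where to go next."""
    title = form.get("title", "").strip()
    if not title:
        return "title"  # validation failed, so show the same step again
    session["title"] = title
    return next_step("title")


def title_view(session):
    """View: only concerned with what the user sees on the page."""
    return f"<label>Title <input name='title' value='{session.get('title', '')}'></label>"


session = {}
print(title_controller({"title": "  "}, session))         # stays on the title step
print(title_controller({"title": "My thesis"}, session))  # moves on to the next step
print(title_view(session))
```

Changing the deposit flow then only means editing the list of steps, while a designer can restyle a view without touching the validation logic in its controller.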

What ‘steps’ come with EasyDeposit?

EasyDeposit comes with 14 different steps, including:

  • ldaplogin: Allows login to take place against an LDAP directory
  • nologin: Allows preset login information to be provided if you don’t wish users to have to login, then forwards the user on to the next step
  • depositcredentials: Sets credentials to be used for the deposit if you wish to use a generic set of credentials, then forwards the user on to the next step
  • selectrepository: Allows a user to select between multiple repositories
  • servicedocument: Displays a service document to the user to allow them to decide which collection to deposit into
  • title: Requires the user to enter a title for the item they are depositing
  • metadata: Requires the user to enter metadata for the item they are depositing
  • uploadfile: Allows the user to upload files to deposit
  • verify: Allows the user to verify their submission before the deposit
  • deposit: Performs the deposit, then forwards the user on to the next step
  • email: Sends an email confirmation of the deposit, then forwards the user on to the next step
  • thankyou: Displays a confirmation of the deposit to the user

Extra steps can be easily added just by adding a controller and a view for each new step.

Is EasyDeposit open source?

Yes! It is published with a modified BSD licence.

How do I use EasyDeposit?

Follow the installation instructions! If you have any questions, please leave a comment on this blog entry to get in touch with me.