Tag Archives: analytics

Surfacing Google Analytics stats in DSpace

In the recent survey asking the DSpace community for their top 3 feature requests for DSpace 1.6, the number one most requested feature was statistics. As you’ll know from previous posts, I’m a big fan of Google Analytics.

For the uninitiated, you insert a small bit of JavaScript in your web pages, and Google provide a very rich and powerful analytics service for viewing your site statistics.

Recently Google announced the launch of an analytics API that allows you to remotely query and download the statistics it holds about your site.

I like playing with APIs, so I thought I’d write a solution that downloads item splash screen view statistics from Google Analytics and displays them on the item page.


The solution is quite simple. It requires the addition of one Java class to DSpace. This class should be run daily to download the statistics. The same class is used by the user interface to display the statistics. If you want to implement this solution, follow the instructions below:

  • Create a new directory (java package) at [dspace-src]/dspace-api/src/main/java/org/dspace/app/googleanalytics
  • Download the code shown at the bottom of this post, and save it as GoogleAnalyticsHitCounter.java in the directory that you just created.
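  • Edit [dspace-src]/dspace-api/pom.xml to add dependencies on the three Google API libraries (gdata-core, gdata-analytics and google-collect), using the same groupId, artifactId and version values as in the Maven install commands below.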
  • Then download gdata-src.java-1.32.1.zip and extract the following jar files to somewhere handy: gdata-core-1.0.jar, gdata-analytics-1.0.jar and google-collect-1.0.jar (named google-collect-1.0-rc1.jar in the zip file).
  • Install each of these into your local Maven repository by running the following Maven commands, adjusting paths as appropriate:
    • mvn install:install-file -DgroupId=com.google.gdata -DartifactId=gdata-core -Dversion=1.0 -Dfile=gdata-core-1.0.jar -Dpackaging=jar
    • mvn install:install-file -DgroupId=com.google.gdata -DartifactId=gdata-analytics -Dversion=1.0 -Dfile=gdata-analytics-1.0.jar -Dpackaging=jar
    • mvn install:install-file -DgroupId=com.google.collect -DartifactId=google-collect -Dversion=1.0 -Dfile=google-collect-1.0.jar -Dpackaging=jar
  • Next, edit [dspace-src]/dspace-jspui/dspace-jspui-webapp/src/main/webapp/display-item.jsp, and somewhere in the page (choose where you want the counter to appear), add the following code:

<%@ page import="org.dspace.app.googleanalytics.GoogleAnalyticsHitCounter" %>
<%
// See if we can display a counter
String path = "/handle/" + item.getHandle();
String count = GoogleAnalyticsHitCounter.getPageCount(path);
if ((count != null) && (!"".equals(count)))
{
%>
<table align="center" class="miscTable">
<tr><td class="oddRowEvenCol" align="center">
This item has been viewed <strong><%= count %></strong> times
</td></tr>
</table>
<% } %>

  • If you don’t deploy your user interface as the ROOT webapp, then you’ll have to add the context path to the line String path = "/handle/" + item.getHandle(); (for example, "/jspui/handle/" if the webapp happens to be deployed at /jspui).
  • Now build and deploy DSpace as you would normally (mvn package; ant update; etc…)
  • Edit dspace.cfg and add in the following entries:
    • googleanalytics.username = your-google-analytics@email.address.com
    • googleanalytics.password = your-google-analytics-password
    • googleanalytics.siteid = 123456789
    • googleanalytics.filename = analyticscounts.properties
    • googleanalytics.startdate = 2007-07-17
  • Adjust the email address and password as appropriate.
  • Log in to Google Analytics and find out the first date that you have statistics for. Set this in the googleanalytics.startdate entry, in the form yyyy-mm-dd.
  • View the dashboard of your Google Analytics profile, and look at the URL. Part of it will include ‘id=nnnnnnn’. Copy the id number and enter it in the googleanalytics.siteid entry in dspace.cfg.
  • Download and compile your statistics by running (from [dspace]/bin/)
    • dsrun org.dspace.app.googleanalytics.GoogleAnalyticsHitCounter
  • If everything worked as it should, you should now have a file [dspace]/analyticscounts.properties. If you look in this file, you will find entries in the form of ‘/handle/xxxx/yyyy=55’.
  • Now start tomcat, view an item, and if the handle appears in the downloaded stats, you should see the item count!
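
To make the file format concrete, here is a minimal Java sketch of how a count could be looked up from text in that key=value form, including the context path prefix mentioned in the steps above. The class name HitCountLookupSketch, its methods and the "/jspui" context path are all made up for illustration; they are not part of DSpace or of the class listed at the end of this post.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

/** Hypothetical helper, for illustration only. */
final class HitCountLookupSketch {

    /** Parses text in the same key=value form as analyticscounts.properties. */
    static Properties parse(String text) {
        Properties counts = new Properties();
        try {
            counts.load(new StringReader(text));
        } catch (IOException e) {
            // Not expected when reading from an in-memory string
            throw new UncheckedIOException(e);
        }
        return counts;
    }

    /**
     * Looks up the view count for an item handle such as "123/456".
     * contextPath is "" for a ROOT deployment, or an assumed value such
     * as "/jspui" otherwise. Returns "" when no count was recorded.
     */
    static String countFor(Properties counts, String contextPath, String handle) {
        return counts.getProperty(contextPath + "/handle/" + handle, "");
    }
}
```

For a file containing the line /handle/123/456=55, countFor(counts, "", "123/456") returns "55", while a handle that never appeared in the downloaded statistics gives an empty string, which is the case where the JSP shows no counter.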

As with the DSpace video player solution I wrote about earlier this week, the code is not perfect, and needs to be improved a bit to make it solid, but is a good start if you wanted to use this type of solution. Enjoy!

package org.dspace.app.googleanalytics;

import java.io.IOException;
import java.io.FileOutputStream;
import java.io.FileReader;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Properties;
import java.util.Calendar;
import java.util.Date;
import java.text.SimpleDateFormat;

import com.google.gdata.client.analytics.AnalyticsService;
import com.google.gdata.data.analytics.DataEntry;
import com.google.gdata.data.analytics.DataFeed;
import com.google.gdata.data.analytics.Metric;
import com.google.gdata.util.AuthenticationException;
import com.google.gdata.util.ServiceException;
import org.dspace.core.ConfigurationManager;
import org.apache.log4j.Logger;

public class GoogleAnalyticsHitCounter {

    /** log4j category */
    private static Logger log = Logger.getLogger(GoogleAnalyticsHitCounter.class);

    /** Hit counter */
    private static Properties counts;

    /** When was the counter last loaded? */
    private static Date lastloaded;

    /** The filename of the counter file */
    private static String filename;

    /**
     * Initialise the system
     */
    public static void init()
    {
        // Pretend the counter was last loaded yesterday, so that the first
        // call to loadCounter() reads the properties file
        Calendar yesterday = Calendar.getInstance();
        yesterday.add(Calendar.DATE, -1);
        lastloaded = yesterday.getTime();
        filename = ConfigurationManager.getProperty("dspace.dir") +
                   System.getProperty("file.separator") +
                   ConfigurationManager.getProperty("googleanalytics.filename");
        counts = new Properties();
    }

    /**
     * Get the count for a particular page (e.g. /handle/123/456)
     *
     * @param page The page path
     * @return The count. Empty String if unknown
     */
    public static String getPageCount(String page)
    {
        // Check we're initialised
        if (lastloaded == null)
        {
            init();
        }

        // Reload the hits
        loadCounter();

        // Get the value
        if (page == null)
        {
            page = "";
        }
        String count = counts.getProperty(page);

        // Return the value
        if (count != null)
        {
            return count;
        }
        return "";
    }

    /**
     * (Re)load the counter. It is reloaded every hour.
     */
    private static void loadCounter()
    {
        // Do we need to load it?
        Calendar hourago = Calendar.getInstance();
        hourago.add(Calendar.HOUR, -1);
        if (lastloaded.before(hourago.getTime()))
        {
            try {
                counts.load(new FileReader(filename));
                lastloaded = Calendar.getInstance().getTime();
            } catch (Exception e) {
                log.warn("Unable to load google hit counter from " + filename);
            }
        }
    }

    /**
     * Command line method to collect the statistics from Google Analytics.
     *
     * @param args No arguments used
     */
    public static void main(String args[])
    {
        // Set up the variables
        String username = ConfigurationManager.getProperty("googleanalytics.username");
        String password = ConfigurationManager.getProperty("googleanalytics.password");
        String siteid = ConfigurationManager.getProperty("googleanalytics.siteid");
        String startdate = ConfigurationManager.getProperty("googleanalytics.startdate");
        String handle = ConfigurationManager.getProperty("handle.prefix");
        String root = ConfigurationManager.getProperty("dspace.url");
        String filename = ConfigurationManager.getProperty("dspace.dir") +
                          System.getProperty("file.separator") +
                          ConfigurationManager.getProperty("googleanalytics.filename");

        // Get the local path
        String path = "";
        try {
            URL localURL = new URL(root);
            path = localURL.getPath();
            if (path.endsWith("/"))
            {
                path = path.substring(0, path.length() - 1);
            }
        } catch (MalformedURLException e) {
            System.err.println("Invalid dspace.url URL (" + root + ")");
        }

        AnalyticsService as = new AnalyticsService("gaExportAPI_acctSample_v1.0");
        String baseUrl = "https://www.google.com/analytics/feeds/";

        // Login to Google
        try {
            as.setUserCredentials(username, password);
        } catch (AuthenticationException e) {
            System.err.println("Authentication failed : " + e.getMessage());
            return;
        }

        // The results
        Properties counts = new Properties();

        // Keep requesting pages of results from Google until a blank page is found
        // pages of 1,000 results at a time
        URL queryUrl;
        int i = 1;
        boolean found = true;
        int total = 0;

        // Get stats up until yesterday
        Calendar yesterday = Calendar.getInstance();
        yesterday.add(Calendar.DATE, -1);
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");
        String enddate = format.format(yesterday.getTime());

        while (found)
        {
            found = false;
            try {
                String q = baseUrl +
                           "data?start-index=" + i +
                           "&ids=ga:" + siteid +
                           "&start-date=" + startdate +
                           "&end-date=" + enddate +
                           "&metrics=ga:pageviews" +
                           "&dimensions=ga:pagePath" +
                           "&filters=ga:pagePath%3D~" + path + "/handle/" + handle + "/[0-9]%2B$";
                queryUrl = new URL(q);
            } catch (MalformedURLException e) {
                System.err.println("Malformed URL: " + baseUrl);
                return;
            }

            // Send our request to the Analytics API and wait for the results to come back
            DataFeed dataFeed;
            try {
                dataFeed = as.getFeed(queryUrl, DataFeed.class);
            } catch (IOException e) {
                System.err.println("Network error trying to retrieve feed: " + e.getMessage());
                return;
            } catch (ServiceException e) {
                System.err.println("Analytics API responded with an error message: " + e.getMessage());
                return;
            }

            for (DataEntry entry : dataFeed.getEntries()) {
                // Strip the fixed-length prefix of the entry ID, then everything
                // from the next '&', to leave just the page path
                String id = entry.getId().substring(70);
                id = id.substring(0, id.indexOf('&'));
                for (Metric metric : entry.getMetrics()) {
                    counts.put(id, metric.getValue());
                    total = total + Integer.parseInt(metric.getValue());
                }
                found = true;
            }

            i = i + 1000;
        }

        // Save the properties file
        try {
            counts.put("total", "" + total);
            counts.store(new FileOutputStream(filename), null);
            System.out.println("Saved " + total + " total hits in " + filename);
        } catch (IOException e) {
            System.err.println("Error saving results to file: " + filename);
        }
    }
}
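
One fragile spot in the code above is entry.getId().substring(70), which relies on the entry ID having a prefix of exactly 70 characters. Assuming the ID looks something like http://www.google.com/analytics/feeds/data?ids=ga:NNNNNNN&ga:pagePath=/handle/123/456&start-date=... (an assumption on my part, not something the code confirms), that offset would only line up for a seven-digit site ID. A hedged alternative is to search for the ga:pagePath= marker instead. The class and method names below are hypothetical.

```java
/** Hypothetical helper, shown only to illustrate the idea. */
final class EntryIdParserSketch {

    // Assumed marker that introduces the page path inside an entry ID
    private static final String MARKER = "ga:pagePath=";

    /** Returns the page path embedded in an entry ID, or "" if none is found. */
    static String pagePath(String entryId) {
        int start = entryId.indexOf(MARKER);
        if (start < 0) {
            return "";
        }
        start += MARKER.length();

        // The path runs up to the next '&', or to the end of the ID
        int end = entryId.indexOf('&', start);
        return (end < 0) ? entryId.substring(start) : entryId.substring(start, end);
    }
}
```

Under the assumed ID format this gives the same result as the fixed offset for a seven-digit site ID, and keeps working if the ID is longer or shorter.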

Update your Google Analytics Tracking Code

Just in case you missed the announcement from Google Analytics, they have just updated their tracking code snippet. The snippet is a couple of bits of JavaScript: the first downloads the relevant copy of the analytics code (over SSL if your site is SSL protected, so that visitors don’t see security warnings from their browser) and the second registers the visit with Google.

The change is simply to add some error handling to the JavaScript, so that if something goes wrong users will not see a warning message.

The new snippet is shown below; the additional code is the try/catch block in the second script:

<script type="text/javascript">
    var gaJsHost = (("https:" == document.location.protocol) ?
                     "https://ssl." : "http://www.");
    document.write(unescape("%3Cscript src='" + gaJsHost +
        "google-analytics.com/ga.js' type='text/javascript'%3E%3C/script%3E"));
</script>
<script type="text/javascript">
    try {
        var pageTracker = _gat._getTracker("UA-50020-1");
        pageTracker._trackPageview();
    } catch(err) {}
</script>

(I’ve just applied the change to the DSpace code repository, ready for the upcoming 1.5.2 release.)

Google Analytics is not a statistics package!

As everyone knows I’m a big fan of using Google Analytics with repositories in order to see what is happening with your repository with respect to visitors – what they are looking at / which links they are following / where they are coming from / how many people are visiting the site etc.

However, from time to time I come across concerns about the data that Google Analytics does not capture. Such data includes users who do not allow JavaScript or cookies, and visitors who click directly on ‘files’ (e.g. PDF files). In this second case, the data isn’t tracked because there is no web page shown from which to run the Google Analytics tracking code. In an attempt to help collect some of this information I have used a script by Patrick H. Lauke which triggers when a user clicks to download a file from a metadata jump-off page. It registers the click with Google Analytics and the download is recorded. But as I said, it doesn’t record direct hits to a file that did not first go via the repository.

Is this a problem? Personally I don’t think so:

  • At least some of the data is now being recorded, which is better than none. It might not be numerically accurate, but hopefully it is still representative of user behaviour.
  • Remember that Google Analytics is an analytics package, not a statistics package. It does not claim to record every click, but is more intended to help with analysing and improving the user experience (e.g. “Do I get more file downloads if I place the list of files above the metadata or below it” or “Do users that land on a browse page download more files than those that arrive directly on an item page”).
  • If you want raw download figures, use a proper statistics system that works from web server logs (e.g. IRStats or a common web stats system such as AWStats). Most likely you’ll want to use both.
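
As a rough illustration of the log-based approach, the Java sketch below counts successful PDF requests in web server access log lines. It assumes Common Log Format style entries and is a hypothetical helper of my own naming, not part of IRStats or AWStats, which do a great deal more than this.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, for illustration only. */
final class LogDownloadCountSketch {

    // Captures the requested path and the status code of a GET request,
    // e.g. "GET /some/file.pdf HTTP/1.1" 200 2326
    private static final Pattern REQUEST =
            Pattern.compile("\"GET (\\S+) HTTP/[0-9.]+\" (\\d{3}) ");

    /** Counts log lines that are successful (status 200) GET requests for a .pdf path. */
    static int countPdfDownloads(Iterable<String> logLines) {
        int downloads = 0;
        for (String line : logLines) {
            Matcher m = REQUEST.matcher(line);
            if (m.find()
                    && "200".equals(m.group(2))
                    && m.group(1).toLowerCase(Locale.ROOT).endsWith(".pdf")) {
                downloads++;
            }
        }
        return downloads;
    }
}
```

Unlike the JavaScript approach, this sees direct hits on files, since every request reaches the web server whether or not a tracked page was viewed first.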


Tracking repository searches from the inside

One of the many great features of Google Analytics is that it can show the search terms that visitors to your site have used in search engines. This is a great tool for finding out what brings users to your repository.

Seven months ago Google launched a new feature in Google Analytics that also allows you to track the search terms used by visitors within your repository. It’s very easy to set up: all you need to do is enable the feature and set the query parameter used by your repository. Follow these steps from the help pages:

  1. Log in to your Google Analytics account.
  2. Click ‘Edit’ under Website Profiles for the profile you would like to enable Site Search for.
  3. Click ‘Edit’ from the ‘Main Website Profile Information’ section of the Profile Settings page.
  4. Select the ‘Do Track Site Search’ radio button in the Site Search section of the Edit Profile Information page.
  5. Enter your ‘Query Parameter’ in the field provided. Please enter only the word or words that designate an internal query parameter such as “term,search,query”. Sometimes the word is just a letter, such as “s” or “q”. You may provide up to five parameters, separated by a comma.
  6. Select whether or not you want Google Analytics to strip out the query parameter from your URL. Please note that this will only strip out the parameters you provided, and not any other parameters in the same URL. This has the same functionality as excluding URL Query Parameters in your Main Profile – if you strip the query parameters from your Site Search Profile, you don’t have to exclude them again from your Main Profile.

For DSpace you need to set the query parameter to ‘query’, and for EPrints set it to ‘simple’.
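
To illustrate what the query parameter setting refers to, the following Java sketch pulls the search term out of a search URL by parameter name. The class name, method and example URL are hypothetical, and this is only an illustration of the idea, not a description of how Google Analytics itself is implemented.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLDecoder;

/** Hypothetical helper, for illustration only. */
final class SiteSearchTermSketch {

    /**
     * Returns the decoded value of the named query parameter in the URL,
     * or null if the URL has no such parameter.
     */
    static String searchTerm(String url, String parameter) {
        String query = URI.create(url).getRawQuery();
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String name = (eq < 0) ? pair : pair.substring(0, eq);
            if (name.equals(parameter)) {
                String raw = (eq < 0) ? "" : pair.substring(eq + 1);
                try {
                    // Turns '+' and %XX escapes back into plain text
                    return URLDecoder.decode(raw, "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    throw new IllegalStateException(e);
                }
            }
        }
        return null;
    }
}
```

For a made-up URL such as http://repository.example.org/simple-search?query=sailing+robots&start=10, asking for the parameter "query" gives "sailing robots". An empty value corresponds to a search performed without the visitor entering a search term.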

To view the results, follow the menu links (Content -> Site Search) and explore the reports.

Here are some interesting statistics from our repository as an example of the extra stats it can provide:

  • 89% of visits did not make use of a site search, whilst the remaining 11% did.
  • 39% of search users left the system having performed the search without going any further (e.g. looking at one of the items found by the search).
  • 22% of searches resulted in search refinements being undertaken by the searcher.
  • 50% of searches were performed from the repository homepage, the remaining from item, collection and community pages.
  • Following a search, the average visitor stayed on the site for a further 1 minute and 30 seconds.
  • 8% of searches were performed without the visitor having entered a search term.


Repository bounce rates

I’ve often wondered about what people do when they visit a repository, and whether what they are doing while visiting the repository could be considered ‘good’ in terms of the usefulness and general aims of the repository. Let me explain… I’m a big fan of Google Analytics, and one of the things it lets you see is what people do once they get to your repository. For each page it can show where they came from, how long each user stayed there, and whether they ‘bounced’ straight off to another web site afterwards (that is, Google Analytics on your repository did not encounter another view from that user in their browsing session), or whether they stayed within your repository to hopefully view more items.

The help file for Google Analytics describes the bounce rate as:

Bounce Rate: Bounce rate is the percentage of single-page visits (i.e. visits in which the person left your site from the entrance page). Bounce rate is a measure of visit quality and a high bounce rate generally indicates that site entrance (landing) pages aren’t relevant to your visitors. You can minimize Bounce Rates by tailoring landing pages to each keyword and ad that you run. Landing pages should provide the information and services that were promised in the ad copy.
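
As a worked example of that definition, the Java sketch below computes a bounce rate from the number of tracked page views in each visit. The class name and the sample figures are invented for illustration; Google Analytics does this calculation for you.

```java
/** Hypothetical helper, for illustration only. */
final class BounceRateSketch {

    /**
     * pagesPerVisit holds the number of tracked page views in each visit.
     * Returns the percentage of visits that consisted of a single page view.
     */
    static double bounceRate(int[] pagesPerVisit) {
        if (pagesPerVisit.length == 0) {
            return 0.0;
        }
        int bounces = 0;
        for (int pages : pagesPerVisit) {
            if (pages == 1) {
                bounces++;
            }
        }
        return 100.0 * bounces / pagesPerVisit.length;
    }
}
```

With four visits of 1, 3, 1 and 2 page views, two of the four are single-page visits, so the bounce rate is 50%.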

If you consider an e-commerce website such as Amazon, then this description, and the aim of reducing the bounce rate must hold true. If your visitor searched for an item in a search engine, came to your website, viewed the item, and then ‘bounced’ away, you have lost the sale and the visitor took their business elsewhere. That is ‘bad’.

However, what is the purpose of a repository?

If you take the view that a repository (of the open access persuasion) is there to provide access to resources, then a bounce may not be so bad after all. Imagine the following scenario:

“I’m a researcher in the field of building robotic sailing boats. I’ve read an article that cites a paper by the title of ‘An Autonomous Sailing Robot for Ocean Observation’. So I duly perform a search using Google Scholar and see that a paper by that title is the top result. I visit the link and find myself in a repository which holds that paper. I download the paper, and go on my way, happy to have found what I wanted.”

Within Google Analytics we would see several different aspects of this visit:

  1. We’d see the visit to the metadata jump-off page.
  2. We’d see that the visitor came from Google Scholar.
  3. We’d see the search term that was used by the user within Google Scholar.
  4. We’d see that the visitor stayed on the metadata jump-off page for say 20 seconds.
  5. Then… nothing. In other words, it would be registered as a bounce.

So in traditional analytics terms this looks like a bad visit. However, was it? Clearly not. The visitor got what they wanted, and the repository has done its job. Why did Google Analytics not register the fact that the visitor read the PDF version of the paper though?

Unlike website log file analysis software (e.g. AWStats), Google Analytics can’t see every single interaction between the user and the web server. It can only see pages which include a small bit of JavaScript that sends the details of the visit to Google. So in the case of the repository, the metadata jump-off page contains the code so Google Analytics knows about the visit, but the PDF cannot contain the code. Google Analytics therefore doesn’t know about the successful download of the PDF. Maybe one day Google will address this issue in some way? It would be great if they could.

The repository has served its purpose, and the visitor got what they were after, but is it also the job of the repository to hold the user and to attract them to other related items in the repository? There are many ways this could be done, a subject for another day, but these will no doubt include elements of Web 2.0, social networking and item suggestion. This issue does, though, highlight one of the origins and ongoing features of Google Analytics: that of supporting e-commerce sites, particularly those that make use of its AdWords scheme.

But for me, for now, I think I’m reasonably happy with a bounce!