Google Analytics and Privacy

Collecting web usage data through services like Google Analytics is a top priority for any library. But what about user privacy?

Most libraries (and websites, for that matter) lean on Google Analytics to measure website usage and learn how people access their online content. It’s a great tool. You can learn where visitors are coming from (the geolocation of their IP addresses, anyway), what devices, browsers, and operating systems they are using, and even how big their screens are. You can identify your top pages and much, much more.

Google Analytics is really indispensable for any organization with an online presence.

But then there’s the privacy issue.

Is Google Analytics a Privacy Concern?

The question is often asked: what personal information is Google Analytics actually collecting? And how does this data collection jibe with our organization’s privacy policies?

It turns out that as a user of Google Analytics, you’ve already agreed to publish a privacy document on your site outlining the why and what of your analytics program. So if you haven’t done so, you probably should, if only for the sake of transparency.

Personally Identifiable Data

Fact is, if someone really wanted to learn about a particular person, it’s not entirely outside the realm of possibility that they could glean a limited set of personal attributes from the generally anonymized data Google Analytics collects. IP addresses can be loosely linked to people. If you wanted to, you could set up filters in Google Analytics that look at a single IP.

Of course, on the Google side, any user who is logged into their Gmail, YouTube or other Google account is already being tracked and identified by Google. This is a broadly underappreciated fact, and a critical one when it comes to how we approach the privacy question.

In both the case of what your organization collects with Google Analytics and what all those web trackers, including Google’s trackers, collect, the onus falls entirely on the user.

The Internet is Public

Over the years, the Internet has become a public space, and users of the Web should understand it as such. Everything you do is recorded and seen. Companies like Google, Facebook, Microsoft, Yahoo! and many, many others are all in the data mining business. Carriers and Internet Service Providers are also in this game. They deploy technologies in websites that identify you, then sell your interests, shopping habits, web searches and other activities to companies interested in selling to you. They’ve made billions on your data.

Ever done a search on Google and then seen ads all over the Web trying to sell you that thing you searched last week? That’s the tracking at work.

Only You Can Prevent Data Fires

The good news is that with a little effort, individuals can stop most (but not all) of this data collection. Browsers like Chrome and Firefox support plugins like Ghostery, Avast and many others that will block trackers.

Google Analytics can be stopped cold by these plugins. But that won’t solve all the problems. Users also need to set up their browsers to delete the cookies websites save. And moving off of accounts provided “for free” by data mining companies, like Facebook and Gmail, can also help.

But you’ll never be completely anonymous. Super cookies are a thing and are very difficult to stop without breaking websites. And some trackers are required in order to load content. So sometimes you need to pay with your data to play.

Policies for Privacy Conscious Libraries

All of this means that libraries wishing to be transparent and honest about their data collection need to contextualize that information within the broader data mining debate.

First and foremost, we need to educate our users on what it means to go online. We need to let them know it’s their responsibility alone to control their own data. And we need to provide instructions for doing so.

Unfortunately, this isn’t an opt-in model. That’s too bad. It actually would be great if the world worked that way. But don’t expect the moneyed interests involved in data mining to allow the US Congress to pass anything that cuts into their bottom line. This ain’t Germany, after all.

There are ways, with a little JavaScript, to add a temporary opt-in/opt-out feature to your site. This will toggle tags added by Google Tag Manager on and off with a single click. But let’s be honest: most people will ignore it. And even if they do opt out, it will be very easy for them to overlook the choice every time without much more robust opt-in/opt-out functionality baked into your site. For most sites and users, this is asking a lot. Meanwhile, it diverts attention from the real solution: users concerned about privacy need to protect themselves and not take a given website’s word for it.
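As a sketch of the single-click toggle idea: Google’s analytics script honors a per-property window flag named ga-disable-&lt;PROPERTY ID&gt; that, when true, stops hits from being sent. The property ID below is a placeholder, and the window-like object is passed in as a parameter so the logic can be exercised outside a browser:

```javascript
// Google Analytics checks a window property named
// 'ga-disable-<PROPERTY_ID>' before sending hits; setting it to true
// opts the visitor out. The property ID below is a placeholder.
var GA_PROPERTY_ID = 'UA-XXXXXXX-1';

// Toggle the opt-out flag. `win` stands in for the browser `window`
// so the function can be tested outside a browser.
function setAnalyticsOptOut(win, optedOut) {
  win['ga-disable-' + GA_PROPERTY_ID] = optedOut;
  return win['ga-disable-' + GA_PROPERTY_ID];
}

// Simulate a visitor clicking the toggle: opt out, then back in.
var fakeWindow = {};
setAnalyticsOptOut(fakeWindow, true);   // opted out
setAnalyticsOptOut(fakeWindow, false);  // opted back in
```

In a real page, the toggle’s click handler would call setAnalyticsOptOut(window, …) and persist the choice in a cookie so it survives page loads, which is exactly the part most visitors will never bother with.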

We actually do our users a service by going with the opt-out model. This underlines the larger privacy problems on the Wild Wild Web, which our sites are a part of.

Best Practices for Google Analytics Profiles and Filters

Recently, I heard from a colleague at another institution asking how best to configure their Google Analytics profiles, especially with regard to filters.

A Google Analytics profile is a view into one of your web properties with a specific set of filters applied. For example, one profile I keep of the university library website I manage filters out everyone except users coming through an IP range used by our wireless users. This profile lets me see how our wireless users differ from our lab computer users. It’s particularly important, in fact, because it turns out that the browser and OS choices our wireless users make are very different from what our campus IT provides on the lab PC images (Apple vs. Microsoft, Google Chrome vs. IE and Firefox).

The most important best practice is to maintain multiple profiles. You should always have one profile with absolutely no filters, so that if your data ever looks weird, you can see right away whether your filters are causing the problem. So we have one unfiltered profile and then multiple profiles with various filters applied. For example, my main analytics report profile has the following filters:

  • Filter Name: Case sensitive; Filter Type: Lowercase – Remove
  • Filter Name: LibraryIPfilterout (our IP filter); Filter Type: Exclude – Remove (this filter excludes library staff machines from our data)

Here’s how you create a filter:

  1. Go to Google Analytics, and select the Admin tab on the right of the orange bar
  2. On the Account Administration screen, you’ll see a list of accounts (one should be your library site’s account) – select it
  3. On the account’s page, you’ll see your properties (which should include your main website’s URL). You’ll also see a Filters tab – select that
  4. You should now see a list of all filters applied to that web site account. You will also see a “+ New Filter” button. Click that.
  5. You will then need to fill in the parameters of your filter and then Save it.
  6. Once created, it can be applied to any profile. Simply go back to Step 3 and select the URL you want to create your profile for.
  7. Select “+ New Profile” and create this profile
  8. Then locate the Filters tab and apply any filters that you want, including the one you created.
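The Exclude filter created in step 5 is typically defined as a regular expression matched against each visitor’s IP address. A minimal sketch of that kind of pattern (the IP range shown is purely illustrative, not our actual range):

```javascript
// The kind of pattern GA's "Exclude traffic from IP addresses" filter
// matches against each hit's IP. This example covers 192.168.1.10
// through 192.168.1.25; the range is a placeholder.
var staffIpPattern = /^192\.168\.1\.(1[0-9]|2[0-5])$/;

console.log(staffIpPattern.test('192.168.1.15')); // true  (excluded)
console.log(staffIpPattern.test('192.168.2.15')); // false (kept)
```

Testing the expression against a few known staff and non-staff IPs before saving the filter is a good way to avoid silently throwing away real user data.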

Library Who Dunnit

The joke lately at my library is that our website just doesn’t matter anymore. Today, I saw this sentiment come alive in our analytics data.

As you can see from the graph below, we have a typical academic usage pattern, with activity building up during a quarter, peaking just before finals and then crashing as students rush off on their inter-quarterly wanderings.

What was interesting in this last quarter’s data was that this activity failed to build to its usual tempo. In fact, it appeared to build and then flatten out.

The culprit? I think it’s our hosted research help portal, LibGuides, which we soft-launched in the middle of the quarter…about the very time student research needs start to draw them to the library. Pre-LibGuides, we would see a lot of traffic to our locally hosted Subject Guides, which showed up in our analytics data. But LibGuides, which is replacing those Subject Guides, is off the radar as far as Google Analytics is concerned (for now), which may be why we saw such low numbers.

Anyway, with the WorldCat Local configuration almost complete and LibGuides readying for a hard launch in the fall, the library website will likely see even further erosion in usage. And this is a very good thing, because it means we’ve gotten all the other stuff out of the way, improved the signal-to-noise ratio and simplified the act of accessing our resources (which are also all off-site, out in the cloud).

That’s what they call usability, folks!

Better Living Through Good Data, Part 2

[Image: Crazy Egg confetti view with cookie variable filters]

A few months ago, I blogged about a cookie-based filter I deployed to screen out librarian machines from my library’s Google Analytics and Crazy Egg data. Well, after running this experiment for a while, it looks like our IP data was actually good enough after all.

The original problem that sparked this experiment was that the computers in our library were all on dynamic IPs, and thus we could not reliably screen those machines out to get a pure glimpse into what non-librarians thought was important. A few options were considered, such as putting all library machines on static IPs. The university IT department didn’t like that idea, so I had to go back to the drawing table…or baking table, if you will.

The solution was to deploy a browser cookie on each machine used by staff, which Google and Crazy Egg would then use to identify the librarians visiting various library pages. The added bonus was that we could actually see which web elements and pages were mostly used by staff and which ones were mostly used by others.

Problem is, deploying the cookie across the staff computers was a real hassle. Each computer might have multiple users, like the Reference computers, which meant going around chasing down librarians (not an easy task!) and having them log into each computer, launch each browser and change their default homepage to our cookie page (which sets the cookie and then redirects to the library homepage). In some areas, ten or more people might use a single computer.

Needless to say, the pay off for all this work had to be really good.

And as it turned out, the IP range filters were not all that different from the cookie-filtered data.

In Google Analytics, we set up multiple views, one with no filters, one with a cookie filter and one with an IP range filter. We let the data come in for two months and then checked the results to see what the discrepancies were.

What we found was actually surprising. The IP filters were doing a pretty good job…in fact, they were doing too good of a job. It appears that the IP range covers the librarians, but also many of the public machines in our computer labs and elsewhere. This isn’t too bad, since users working at home are very likely not librarians…so, we essentially have our librarians out of our data using the IP ranges.

And statistically, the IP ranges were only a few percentage points off the cookie filters, so the differences were fairly benign.

For example, on Saturday, March 5, 2011, we had a total of 6,751 people come to our library homepage. Of those, 49 were librarians according to our pretty-much-perfect cookie-filtered data. The IP filters identified 131 users as librarians because they counted some computers used by students in the labs. Doing a little math, we found an average 1.5% difference between the two filters. Not significant enough to worry about…especially given how onerous deploying and monitoring the cookie was.
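For that single example day, the arithmetic works out like this (a quick check using the numbers above):

```javascript
// Difference between the IP filter and the cookie filter for the
// example day, as a share of total homepage visitors.
var total = 6751;
var cookieStaff = 49;  // librarians per the cookie filter
var ipStaff = 131;     // "librarians" per the IP filter

var diffPct = (ipStaff - cookieStaff) / total * 100;
console.log(diffPct.toFixed(1) + '%'); // "1.2%"
```

Roughly a 1.2% gap for that day, in line with the ~1.5% average we saw across the whole two-month test.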

SharePoint Migration

At long last, my university IT Department is gearing up to initiate campus-wide migration of all websites to SharePoint. This includes the library’s website.

Say what? SharePoint for a public-facing library website?

I’ll admit, I was fairly incredulous about this concept and continue to wonder how it will play out. This is a Microsoft product, after all. But the experimental nature of the idea intrigues me, and I’m actually looking forward to it. A lot of this excitement, of course, also arises from our current CMS conundrum. You see, we’re on Serena Collage, perhaps one of the clunkiest content management systems still in operation…In fact, that’s the problem: it’s not really supported anymore. Fact is, it often doesn’t work at all.

So, I’m excited about getting out of Collage. And, I’m coming around to SharePoint as well…kind of.

Much of my remaining hesitation stems from not knowing how this will all work. Case in point, my team currently has access to the server, where we can do a fair amount of development, despite Collage’s problems. It remains unclear exactly how much access to the server we will have in the future to do the kinds of web service development that we’re accustomed to.

But, let’s focus on what is known…and it’s much to be happy about. SharePoint has a pretty good governance structure to it, which will allow our library to assign sections of the site to curators responsible for keeping information up to date, while allowing me to keep on top of what they’re doing. There are other features that I’m extremely happy about:

  • multiple instances of blogs
  • on-the-fly database creation
  • intranet capabilities

Another cool aspect of this migration is that, as far as I can tell, no other library has used SharePoint as its public site’s CMS. Think journal articles and presentations here. I like that…

So, as you can imagine, there is quite a bit of activity at the library right now. We’ve got until the summer before the migration begins. In the meantime, I’m working on cleaning up our site as much as possible so that we get rid of the clutter and worst usability issues. My sense is that once we begin working with IT, there won’t be much bandwidth for working out a whole new architecture; I’ll have to come back to that later on my own. But I am using some Google Analytics and Crazy Egg data to help re-architect some areas of the site ahead of the migration, particularly the unwieldy Special Collections areas…a sort of site weeding, if you will.

As such, my plan post migration is to do immediate user testing, followed up by iteration, more testing and then iteration. We’ll keep that up until we get where we need to be.

Along the way, I plan on keeping a few channels open to our stakeholders, essentially following the model established by North Carolina State University’s library redesign project: a Web Advisory Committee composed of librarians from across our organization, focused librarian interview sessions, and a public blog aimed at university stakeholders. But I’ll also add commentary here when appropriate.

Wish us luck!

Better Living Through Good Data

Coming on board at my library a few months ago, I immediately went to Google Analytics to get the details on exactly what this online beast I had inherited was all about. As I later surmised, the data I was getting from Google was likely not all that reliable.

Google Analytics measures invaluable data points: the number of users coming to the site, where they came from, what setups they used, and what exactly they were doing and for how long.

To get myself a better view of these users, I started right away by creating some funnels to track likely navigation scenarios: for example, tracking how many people who landed on the home page followed the quickest path to the library hours page and, if they went another route, how they got there.

To some, this may sound boring, but it’s crucial knowledge when your aim is to improve a site’s usability and functionality.

[Image: Crazy Egg confetti view with cookie variable filters]

Not long after setting up my funnel reports, I was at the LITA conference, a forum populated by like-minded web librarians, where Tabatha Farney reviewed (and made the case for) click analytics. This kind of data fills in some areas Google doesn’t do all that well, namely visualizing user clicks on your website through confetti views (right) or heat maps.

Talk about an ah-ha moment! My first order of business when I got back to the library was to get me some click analytics. I went with Crazy Egg. Then, sitting back, I let the magic happen and was soon able to review Google data and click analytics without holding a single user interview!

All this sounds wonderful, except that I soon learned that our librarian staff computers were using dynamic IPs…egad! This meant that the nice little filters in our Google Analytics reports were likely not filtering out 100% of our librarians. How many librarians were getting through remained unknown, but I hoped to find out soon.

I thought long and hard on this muddying of my otherwise pristine view into my users and then it occurred to me that I might be able to deploy a browser cookie on our staff computers that could be used to filter out the librarians. As it turned out, I could.

Doing a little research, I found that others had had similar problems with dynamic IPs and that a solution was already worked out. To my ego’s satisfaction, the method was the very cookie-based approach I had come up with (pat on the back):

  1. Post a new page on your server with a little JavaScript that plants a cookie on any browser visiting that page
  2. Add some more JavaScript to your site to check for the cookie and pass the result to Google Analytics
  3. Inside Google Analytics, add a custom filter to disregard any users with the cookie
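A minimal sketch of steps 1 and 2 follows. The cookie name and custom-variable labels are assumptions for illustration; the classic _setCustomVar call was Google Analytics’ documented way at the time to attach a visitor-level value that filters could act on:

```javascript
// Step 1: the "cookie page" sets a long-lived staff cookie. The
// cookie name ('staff_member') and lifetime are illustrative.
function buildStaffCookie(days) {
  var expires = new Date(Date.now() + days * 864e5).toUTCString();
  return 'staff_member=yes; expires=' + expires + '; path=/';
}

// Step 2: every page checks the raw cookie string for the marker.
// The string is passed in so the check can be tested outside a browser.
function hasStaffCookie(cookieString) {
  return cookieString.split('; ').some(function (c) {
    return c.indexOf('staff_member=') === 0;
  });
}

// In the browser, the two would be wired together roughly like this:
// document.cookie = buildStaffCookie(365);        // on the cookie page
// if (hasStaffCookie(document.cookie)) {          // on every site page
//   _gaq.push(['_setCustomVar', 1, 'visitor', 'staff', 1]);
// }
```

Step 3 is then just a custom filter in Google Analytics that excludes hits where that custom variable equals the staff value.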

Of course, all good things go to waste until you get a student worker. And so this solution sat idle for a couple of months while I managed the gazillion other projects falling like hail over my desk. All of this changed when I got the green light to hire a computer science student to help out. The cookie filter project was back on.

To help test the reliability of this method, we created three Google Analytics profiles: one with no filters, one with the old IP filter and one with the new cookie filter.

On the Crazy Egg side, things looked a little less plug-and-play. But then, as if the Cosmic Cactus wanted to signal that it really does care (if only a little prickly), I got a Crazy Egg email announcing a new custom variable feature. This worked quite nicely, although the variable feature cannot be used with some Crazy Egg reports, like my favorite, the Heat Map. Still, with a little tinkering, we were able to deploy another cookie filter to work with Crazy Egg.

Proof’s in the pudding, and as pudding requires a little time to set up in the cooler, we’re eager to see how our data looks once we get this running. Tomorrow, I’m announcing the cookie solution to the library staff and hope to start getting all staff computers cookied so we can let the filters do their work.