spiderweb

Wikipedia and Web Content Analysis with Google Sheets

I’ve led a couple of content analysis and thematic analysis projects where the group used Google Sheets to collaborate, gather data, etc. (e.g. this analysis of how anonymity-seekers contribute to Wikipedia [non-paywalled preprint]). Here are some tips & tricks that I learned along the way.

In this set of tips, I’m working with a scenario where I have a list of search terms, URLs or Wikipedia pages or the like, and I want to split them among a set of people and then ask them to look at them in various ways.

  1. Make a ‘template’ tab. Lock it and version it. You’ll end up making changes to the format over time, and you don’t want to have to walk back through the revision history of the document to sort out any changes that happen.
  2. There’s a build-in random shuffle function you can use to randomize the sample among your reviewers (unless you’ve already done some randomization).
  3. Limit inputs if the project is quantitative or mixed-methods. You can either do this by creating drop-down menus inside each sheets cell or, if the project has quite a few fields, by fronting it with a Google Form.
  4. Do as much URL magic as possible. Ideally, set it up so that your team can click the url in the cell and instantly see the page they are supposed to be analyzing. This can be done with a bit of cell formula magic. Solutions will vary by dataset, but in general you’ll notice that URLs for various sites follow a pattern. Sometimes this is documented as an API, but you don’t need to know how to write code in order to use it. If you can figure out the pattern, you can automatically generate URLs using your data and the ‘CONCATENATE’ function. Not only is this faster for the team, it helps eliminate copy-paste errors. Control-D (“fill down”) lets you populate a whole column with clickable URLs that are customized with your data.
    1. Example 1: Google searches. If you run a search, you might notice that the URL bar of the results page will change from https://www.google.com to https://www.google.com/search?q=(your search term here)&(a whole bunch of junk). If you experiment you’ll see that the ‘&(a whole bunch of junk)’ part can be left out. If you have a list of search terms, you can use CONCATENATE to write a formula that will plug in the search terms, one on each line, in your data gathering spreadsheet. If the first entry in column F is ‘thing’, then =CONCATENATE(“https://www.google.com/search?q=”, F2) will give you the URL https://www.google.com/search?q=thing. Yay!
    2. Example 2: Wikipedia pages. If you have a list of revision IDs (old versions of a page) that you want to look at, again in column F, then =CONCATENATE(“https://en.wikipedia.org/w/index.php?oldid=”, F2) will again give you a magic URL that takes someone right to the page.
    3. Example 3: Wayback machine/Internet Archive. Here’s where the power of magic URL making can really help, because you can plug in multiple values into a concatenate, looking at the archive of a given page on a given date. And if you search for a page with a datestamp that doesn’t match what they have, the site will return the closest date without going past what you asked for. If you wanted to look at the front page of a list of websites in column A on a list of dates in column B, then =CONCATENATE(‘https://web.archive.org/web/’, B2, ‘/’, A2) would give you URLs like https://web.archive.org/web/20210122015640/https://www.cnn.com/ — dates are formatted YYYYMMDDHHMMSS — which would give you the front page of CNN on Jan 22nd, 2021 at 1:56:40.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s