Advanced OpenRefine Tutorial

by admin
in Tutorials
— 1 Oct, 2018

This tutorial will extend the earlier Introduction to OpenRefine. It will lead you through more sophisticated operations in the exploration of OpenRefine.

This tutorial uses a sample dataset available at http://digitalnomad.ie/ACIrishBrews2018.tsv

Note that this tutorial assumes that you have OpenRefine installed on your workstation and are familiar with the skills taught in the earlier tutorial.

Tutorial 2 (approximately 20 minutes)

Split One Column into Many Columns

You will often have data columns that contain data that we need to separate out to operate on these data individually. In the dataset we are working with, the Location column contains both a Town and a County. To gain analytical flexibility we may want to separate these out.

1. Choose the drop-down menu from the Location column and choose Edit Column/Separate into multiple columns.

2. In the dialog box, indicate that the separator is ‘, ‘ (comma and space) and also limit the number of columns created to 2.

3. Click the OK button at the foot of the screen to split the column.
4. Now that we have two new columns, we need to rename them to provide more context. Rename the first new column ‘Town’ by choosing Edit Column/Rename this column and entering ‘Town’ into the dialog box.

5. Do the same with the second new column, choosing to rename it to County.

6. This now gives us the ability to work with new forms of data. For example, I can very quickly find out the number of specific beers/county in Ireland, by simply choosing facet on County. Give that a try now.

What County produces the greatest variety of beers – according to this dataset?

Does your result look like this? (Hint: did you choose to sort by count or is the facet sorted by name?)

Changing a Column Case

As we have seen in the earlier tutorial, sometimes spelling errors occur during the data entry process. Often when we deal with the vagaries of human data entry, another common challenge is the variety of cases of words in the various data. Although humans might equate ‘Apple’ with ‘apple’ to a computer these are different terms and might as well be ‘oranges.’
OpenRefine automates the process of cleaning your data to ensure that like is compared with like. In our dataset, the short brewers’ description contains a variety of these challenges.

In this case, we are working with a particular column, but are interested in the individual cells, some of which need to be edited and others not.

Choose Edit cells/Common transforms/Collapse consecutive whitespace on the column MfGDescShort. This will ensure that any occurrences of multiple spaces will be removed and condensed into a single space and normalise our descriptions. Note that OpenRefine tells us that it found this issue in 17 cells out of the 1372 records and made the alternations selectively. If you are curious it also shares some cryptic code with us:
This is OpenRefine GREL (regular expression language) which is very powerful and something that you may wish to explore as you become more familiar with OpenRefine.
To ensure that our capitalisation is uniform in the type column, choose Edit cells/Common transforms/To titlecase. This will ensure that all words in those cells are uniform. As you can see from the summary, in this case, OpenRefine found 36 examples that needed attention.
There are a wide variety of pre-scripted and useful transformations. Note that you can immediately undo any operation you perform by clicking undo on the status window as above. Additionally, you can always step back in the actions you perform as we discussed earlier by choosing undo/redo on the left side panel. So you are always free to explore and simply see what happens.

Linking to External Data Sources (Advanced)

OpenRefine is a very extensible tool. Many users develop their own extensions and routines which are regularly added to the OpenRefine repository. While we are not going to extend OpenRefine during this tutorial, we want to look at one of these extensions.

OpenRefine was developed very early as a means to populate Wikidata (a crowdsourced repository of data that is freely accessible to the world). By connecting your dataset to existing data in Wikidata, it is sometimes possible to augment and expand your dataset with objects that are shared in Wikidata.

The Wikidata operations are accessed via the drop-down menu at the top right of your workspace. Any extensions that you add are made available in this space. You may note above that I have added an extension to produce RDF data and it is shown beside the Wikidata menu.

The process of matching cells in your dataset with data in Wikidata is called reconciliation. To connect to Wikidata, you must register and have confirmed account. You log into Wikidata via the drop-down menu under Wikidata. Once you are logged in you can choose to begin reconciliation with any of the columns in your dataset. In this case, I chose BAA Ontology 1 as it is a very generalised descriptor for beers throughout the world.

2. OpenRefine has quickly identified that I may be talking about Beer and has chosen the beer-style Wikidata source. I then choose Start Reconciling, and OpenRefine begins to discover whether it can automatch with my data to supply additional context.

3. Although this process is advanced and starts to get complicated – it is apparent right away that OpenRefine was able to locate a series of matches for the terms in our data and applies the best guess to help normalise our data to that contained in Wikidata.

Simply to tease you, with this reconciliation made, data contained about various beer types in Wikidata can be pulled from Wikidata, and once you establish authority, you can also augment data in Wikidata.

For more information about working with Wikidata and extending OpenRefine through access to additional public datasets, please look to:

Wikidata and OpenRefine

Discover a Dataset on your own

You now have tried a variety of techniques to use OpenRefine to both discover and modify a large dataset. Let’s see how you might put them to use.

There are two additional datasets available:

http://digitalnomad.ie/louisiana-elected-officialsA.csv – try facetting to describe and understand context.
http://digitalnomad.ie/universityData.csv – try using cluster editing to clean up variations.

SaveSave

Tutorial 2 (approximately 20 minutes)

Split One Column into Many Columns

Changing a Column Case

Linking to External Data Sources (Advanced)

Discover a Dataset on your own

Leave a Reply Cancel reply