Flat File to Thesaurus: Improving Terminologies at the National Gallery

A paper presented at the Collections Trust Conference, New Walk Museum, Leicester, 12 September 2019.


What do you do when you’ve inherited a flat list of keywords, but need a structured vocabulary to drive public information retrieval? In this pragmatic case study, Rupert Shepherd looks at the steps involved in moving the National Gallery from a digitised copy of the index to a printed catalogue, to a set of structured, hierarchical vocabularies used to control the terminology in its collections management system. He explains the practical benefits, including improved discoverability on the Gallery’s website, and also considers what could have been done better.


My subject today is what to do when you’ve inherited a set of keywords – and find that you’re not very happy with them. I’ll talk about the steps we’ve taken when faced with this problem at the National Gallery.

First, a warning: the National Gallery is not typical of most museums. We have hardly any objects (2,373 in the main collection), and they are all very similar. Our problem is not how to categorise a great breadth of objects, but how to describe a very few in great detail.

The challenge

But our collection isn’t very well indexed just now. Historically, we’ve used our collections management system (Gallery Systems’ TMS) to manage our collection, rather than catalogue it. This is all changing, as we’re in the middle of projects to provide new digital descriptions of all our paintings, and to rebuild the collection pages on our website.

What we do have in TMS is the index to the second edition of our Complete Illustrated Catalogue, published in 2000; or rather, the index to the CD-ROM version of the catalogue. These are the keywords that we imported into TMS’s thesaurus module.

Our key problem is that this is just a list of terms, all jumbled up. If you only look at our entry for Titian’s Bacchus and Ariadne you’ll see that we have:

  • Former owners – the Aldobrandini and Este collections
  • Things that you can see in the painting – animals, snakes, Bacchanalia, cheetahs
  • People represented in the painting – the mythological characters Bacchus and Ariadne
  • Literary sources for the painting – the Ars Amatoria, and Catullus
  • Types of subject matter – coastal scenes and landscape backgrounds
  • A place-and-time – Italy 1500-25
  • And other artists who are referred to in the catalogue text – Bellini and Dossi

But this doesn’t reflect the organisation of our collections management system, which is (of course) based upon Spectrum and has separate sections for:

  • People and organisations (divided between previous owners and people related to the painting)
  • Places
  • Dates
  • Materials and techniques
  • Events, and
  • Keywords

If we later want to add further contextual data to our terms – for example, people’s dates, or places’ coordinates – then we need the terms to be stored in the part of TMS designed to describe that particular kind of entity.

There’s also the problem of data retrieval. How can a user find the thing that interests them, if they don’t know what the museum has called it, or even what it is called? One solution is to use a hierarchical structure which can let people browse or filter from broad headings to increasingly specific terms. But our list was effectively a flat file. Here are the first 22 terms in our keyword list, arranged in their hierarchical order.

'A Jew Merchant'
└Murder of
└horses of
Acts of Charity
Acts of Pilate
Adam and Eve
└Expulsion from Paradise
Aelst, Willem van (tr)
Aertson, Pieter (tr)
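To illustrate why a hierarchy helps retrieval, here is a minimal Python sketch of filtering from a broad heading down to narrower terms. It is not how TMS stores its thesaurus: the parent-child links, the object identifiers and the function names are all hypothetical, chosen only to show the principle.

```python
# Hypothetical broader -> narrower links; the terms are illustrative only.
CHILDREN = {
    "Subject matter": ["Animals"],
    "Animals": ["Cheetahs", "Snakes"],
}

# Hypothetical links from terms to objects (placeholder identifiers).
TAGGED = {
    "Animals": {"painting-C"},
    "Cheetahs": {"painting-A"},
    "Snakes": {"painting-A", "painting-B"},
}


def descendants(term, children):
    """Return the term itself plus every narrower term beneath it."""
    found = [term]
    for child in children.get(term, []):
        found.extend(descendants(child, children))
    return found


def objects_under(term, children, tagged):
    """Collect objects tagged with the term or with any narrower term."""
    result = set()
    for narrower in descendants(term, children):
        result.update(tagged.get(narrower, ()))
    return result
```

With a flat list, a search for ‘Animals’ would find only the objects tagged with exactly that word; with the hierarchy, `objects_under("Animals", CHILDREN, TAGGED)` also picks up anything tagged with the narrower terms.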

Step 1 – identify broad types of terms

So what to do? We first took a copy of all the terms in the keyword list, and their links to objects, and put it in a new area of the thesaurus module. That gave us our raw material.

We then mapped these terms to an initial set of top-level headings, which reflect the way that TMS is organised:

  • Events
  • Keywords
  • Materials and techniques
  • People/organisations
    • Former owners
    • Subjects
    • Others
  • Places
    • Of production
    • Represented
  • Timespans

Step 2 – move keywords that aren’t keywords

The next step was to put things where they belong in TMS. First we had to identify which terms fell into which categories, and then by hand move them into holding areas in the thesaurus module. We also had to split our ‘place-and-time’ keywords into separate places and timespans:

Original ‘places-and-times’:

├Bergamo 1500-25
├Bologna 1350-1400
├Bologna 1450-1500
├Bologna 1500-25
├Bologna 1525-50
├Bologna 1575-1600
├Bologna 1600-25
├Bologna 1625-50
├Bologna 1650-75
├Bologna 1700-25
├Borgo Sansepolcro 1400-50
├Borgo Sansepolcro 1450-1500
├Brescia 1525-50
├Brescia 1700-25
├Brescia and Bergamo 1500-25
├Brescia and Bergamo 1525-50

Tidied timespans:

Tidied places:
├Ascoli Piceno
├Bassano del Grappa
├Borgo Sansepolcro

Then we had to move everything around within TMS. This sometimes simply meant reconfiguring how the thesaurus data connected to the main database; but it also meant moving data around between tables in the underlying database. Here, I confess, we’ve so far only done the comparatively easy stuff; but it still took some time and trouble with SQL scripts. But we should end up with:

  • Events in the Events module
  • Keywords in their own hierarchy in the Thesaurus module
  • Materials and techniques in their own section of the Thesaurus module, correctly connected to the Objects module
  • People and organisations in the Constituents module
  • Places in their own section of the Thesaurus module, correctly connected to the Objects module; and
  • Timespans in their own hierarchy in the Thesaurus module

We’ll put Events and People-and-Organisations to one side now, as they have their own data-entry requirements and standards. Looking at the terms we have left in our thesaurus module, we now have:

  • Keywords
  • Materials
  • Places
  • Timespans

Step 3 – prioritise

So which of these to tackle first? We used two criteria to decide: which sets of terms would be easiest to tidy, and which would be the most useful for people trying to find paintings using the filter mechanisms we are building into our new website. In our earlier digital offerings – the Micro Gallery, launched in 1991, and ArtStart, launched in 2005 – users could search by time and place, a combination we’d lost on our more recent websites.

A See Also screen from the National Gallery’s Micro Gallery collection information kiosk, showing buttons linking to entries for ‘Velázquez’; ‘Madrid 1625-1650’; ‘Nudes’ and ‘Ancient Myth and History’; and ‘Ground’, ‘Erotic art’, ‘Nude’ and ‘Venus’, as well as a ‘Put Away’ button. Photo: © The National Gallery London. Courtesy of Cogapp.
The Browse By Date screen from the National Gallery’s ArtStart collection information kiosk, showing van Gogh’s ‘A Wheatfield, with Cypresses’ (NG3861) and the date ‘1889’ above a timeline with a slider and ‘Earlier’ and ‘Later’ buttons. Photo: © The National Gallery London. Courtesy of Annetta Berry.

We therefore decided that we would tackle our sets of terms in the following order:

  1. Timespans
  2. Places made
  3. Materials
  4. Keywords

It was a fairly quick job to set up a proper hierarchy for timespans, and it’s been easy enough to update the timespan entries for individual objects: our data in TMS was good enough to let us automate the remapping of this old data to the current dates for the objects. (I had hoped to be able to show you a timespans filter using this data on our current website, but – as is the way of these things – it’s not been built yet.)
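As a rough illustration of this kind of automated remapping, the Python sketch below derives timespan terms from an object’s start and end years. The shape of the hierarchy (centuries divided into non-overlapping quarter-centuries), the label format and the function names are all assumptions made for the example; they are not a description of the hierarchy we built in TMS.

```python
def timespan_path(year):
    """Return broad-to-narrow labels for one year: century, then quarter-century."""
    century_start = (year // 100) * 100
    quarter_start = century_start + ((year - century_start) // 25) * 25
    return [
        f"{century_start}-{century_start + 99}",
        f"{quarter_start}-{quarter_start + 24}",
    ]


def timespans_for(start_year, end_year):
    """Return every distinct timespan path touched by an object's date range."""
    paths = []
    for year in range(start_year, end_year + 1):
        path = timespan_path(year)
        if path not in paths:
            paths.append(path)
    return paths
```

A painting dated 1633-4 falls within a single quarter-century, so `timespans_for(1633, 1634)` yields one path; a date range that straddles a boundary, such as 1649-50, yields two.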

Places are more complicated: the old CIC data is all we have in TMS. We have allocated all our paintings to an artistic school, but that is not the same as a place of production: the obvious example would be the work of Poussin, which belongs to the French school, but was mostly produced in Rome.

Nicolas Poussin (1594-1665), The Adoration of the Golden Calf, 1633-4, oil on canvas, 153.4 x 211.8 cm; London, The National Gallery, NG5597, bought with a contribution from The Art Fund, 1945. Photo © The National Gallery, London. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

We’ve had to ask the curators to check the data where there are gaps or inconsistencies – but this has been put on hold, because of other pressures on them.

We’re also reviewing all our materials data. This is more complex. In TMS, we have only used free text to describe the materials our paintings are made of. ‘Oil on canvas’ is easy enough to deal with; but there are more arcane media like glue sizes to describe, and we try and identify the individual species of wood that our panels are made of – never mind the different varieties of board, which are proving quite complicated to pin down. We’ve convened a small working group, drawn from our Curatorial, Conservation, and Scientific departments to try and standardise our vocabularies and the structures of our entries. Once we’ve finished, we will be able to attach our medium entries in TMS to the thesaurus, and export the data to our website where users will, for the first time, be able to browse for pictures painted in particular media, and on specific supports.

Step 4 – identify subject headings

But what about our subject headings? These actually described many different aspects of our paintings; we ended up with the following list:

  • Functions – how the painting was used: altarpieces, cartoons, furniture
  • Genres – the painting’s artistic genre: allegories, landscapes, portraits
  • Periods and styles – the painting’s artistic style: renaissance, baroque, impressionist
  • Physical forms – the painting’s physical shape: diptychs, polyptychs, tondi
  • Subject matter – what the painting depicts …

So far, we’ve allocated all our keywords to one of these headings. All of these, I think, will make useful filter categories.

Step 5 – review and reorganise

It’s at this point that things became a bit more complicated. It transpired that we already had further sources of keywords:

The first is the glossary on our website, another flat file, with many of the same problems as the CIC keywords. It, too, has evolved ad hoc, with no clear standards. Many of the terms are the same as (or mean the same as) existing keywords – but others, such as long lists of scientific techniques or technical art historical terms, are not necessarily terms that we would want to attach directly to paintings as keywords. But, because we want to generate more of our website dynamically, we needed to incorporate them into TMS and our new workflow for editing and publishing object-related texts and data.

We also already had a hierarchically-arranged, browsable list of keywords at the Gallery:

  • Abstract concepts/qualities
  • Architecture and Building
  • Costume and Dress
  • History and Events
  • Leisure and Pastimes
  • Literature and Fiction
  • Nature
  • Objects
  • People
  • Places
  • Religion and Belief
  • Society
  • Symbols and Personifications
  • Work and Occupations

It had been developed to index our digital asset management system (DAMS) for non-collection images. If I were drawing up my initial headings for subjects now, I would certainly have referred to this, although I do have some reservations about how useful it would be for the general public, rather than subject specialists.

In any case, the bulk of our keywords are going to need an ongoing, iterative process of review and reorganisation to ensure that they are consistently and sensibly organised.

Step 6 – add further keywords

Once that’s done (no small task), we will need to make sure that all our paintings are properly indexed using our newly-arranged keywords – this will be a project in its own right. We might be able to exploit some additional resources: in particular, further keywords that have already been allocated to our paintings by our commercial wing, the National Gallery Company. We will need to map those subject headings to our own. This is complicated by the way that one painting is usually tagged with multiple synonyms of the same term:

Probably by Jean-Baptiste Perronneau (1715/16 – 1783), A Girl with a Kitten, 1743?, pastel on paper, 59.1 x 49.8 cm; London, The National Gallery, NG3588, presented by Sir Joseph Duveen, 1921. Photo © The National Gallery, London. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

NGC keywords (the synonyms in italics) for NG3588: 1 Person, Animal, Animals, Blond, Blonde, Blue, bow, Cat, Child, Clothes, Clothing, Color, Colors, Colour, Colours, Costumes, Cute, Feline, Female, Girl, Half-length, Happy, Hold, Holding, Holds, Innocence, Innocent, Kid, kitten, kittens, One person, People, Person, Pet, Pets, Portrait, Portraits, Portraiture, Pretty, rosy cheeks, Smile, Smiles, Smiling, Vertical, Wearing, Young
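One way to handle such synonyms is a lookup table from each variant to a single preferred term, with anything unmapped set aside for review. The Python sketch below shows the idea using a few of the NGC keywords above; the preferred terms and the choice of which variants belong together are hypothetical, not decisions the Gallery has made.

```python
# Hypothetical variant -> preferred term table (keys are lower-case).
PREFERRED = {
    "animal": "Animals",
    "animals": "Animals",
    "cat": "Cats",
    "feline": "Cats",
    "kitten": "Cats",
    "kittens": "Cats",
    "color": "Colours",
    "colors": "Colours",
    "colour": "Colours",
    "colours": "Colours",
}


def map_keywords(keywords, preferred):
    """Collapse variants to preferred terms; return (mapped, unmapped)."""
    mapped = set()
    unmapped = []
    for keyword in keywords:
        target = preferred.get(keyword.strip().lower())
        if target is None:
            # No mapping yet: keep the original keyword for manual review.
            unmapped.append(keyword)
        else:
            mapped.add(target)
    return sorted(mapped), unmapped
```

Run against part of the list for NG3588, the repeated variants collapse to one term each, and a keyword such as ‘Vertical’, which has no entry in the table, is returned separately so that someone can decide what to do with it.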

Step 7 – map to external vocabularies

And reconciliation brings me to the question of other vocabularies. We want to deliver our collection as Linked Data; and to do this, we need to be able to link the terms that we are using to those in other authorities and terminology lists, so that other people can understand what we mean by them. We’re already doing this by hand for new entries for people and organisations, which we link to the Getty’s ULAN, Wikidata or, if they’re not in either of those, to VIAF. I plan to experiment with OpenRefine to see if it can help us reconcile our current terms with the Getty’s AAT (the Getty have just launched an OpenRefine reconciliation service) and with Wikidata. For places, I’ll see if we can reconcile against GeoNames, the Getty TGN, and, again, Wikidata.
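The principle of reconciliation, proposing candidate matches from an authority for a person to confirm, can be sketched offline in Python. This is a stand-in using simple string similarity from the standard library, not the OpenRefine or Getty service; the authority entries and their identifiers are placeholders, not real AAT or Wikidata identifiers, and the 0.8 cutoff is an arbitrary assumption.

```python
from difflib import SequenceMatcher

# Placeholder authority: label -> identifier (NOT real external identifiers).
AUTHORITY = {
    "altarpieces": "example:0001",
    "diptychs": "example:0002",
    "landscapes": "example:0003",
}


def candidates(term, authority, cutoff=0.8):
    """Return (score, label, identifier) tuples at or above the cutoff, best first."""
    scored = []
    for label, identifier in authority.items():
        score = SequenceMatcher(None, term.lower(), label.lower()).ratio()
        if score >= cutoff:
            scored.append((round(score, 2), label, identifier))
    return sorted(scored, reverse=True)
```

A local term such as ‘Altarpiece’ would be offered the authority’s ‘altarpieces’ as a candidate, while a term with no close counterpart returns an empty list. Either way, the match is only a suggestion: a human still has to confirm that the two terms mean the same thing.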


But that’s all in the future. Looking back, what would I have done differently? Well, the key thing would have been to look at our existing subject headings in other systems – the DAMS and online glossary – and incorporate them into our work much sooner. This would have influenced our initial choice of subject headings, and saved us from a significant amount of future shuffling around.

But I’d like to end by summarising the steps which we followed, or are planning, as we tidy our flat files into thesauruses at the National Gallery:

  1. Identify broad types of terms
  2. Move keywords that aren’t keywords into suitable places in your collections management systems
  3. Prioritise the next steps
  4. Identify subject headings
  5. Review and reorganise your keywords, and keep on reviewing them (this is the stage we’ve reached)
  6. Incorporate further keywords and link keywords to objects if necessary
  7. Map your terms to external vocabularies