At the same time as announcing 200 Paintings for 200 Years, the National Gallery has expanded the online information it provides about its paintings to include bibliographies, exhibition histories, and provenances. This post is about the first two; I’ll write more about provenances soon. Like the 200 online catalogue entries, these are projects that my team has been working on for some time – building on work done at the Gallery years ago.
What have we published?
Bibliographies
Over the last few months we have uploaded 21,542 bibliographic references relating to 1,164 of the Gallery’s paintings into our TMS collections management system. This entailed adding records for 4,739 new articles and 5,955 new monographs, which between them are linked to 3,528 new author records.
You can find these references listed on each painting’s web page: scroll down to below the image, to the ‘Key facts’ section, and click on ‘Bibliography’.
My standard example, Titian’s Bacchus and Ariadne (NG35), is as good a painting as any to try this. There you’ll see that the references run from Lomazzo’s Trattato dell’arte della pittura, Milan 1584 to A. Brookes, ‘Richard Symonds’s Account of His Visit to Rome in 1649-1651’, The Walpole Society, LXIX, 2007, pp. 1-183. As the note ‘About this record’ says:
Bibliographies may not be complete; more comprehensive information is available in the National Gallery Library.
There’s a reason for this. What we’ve put online are the bibliographies assembled during the Gallery’s Dossier Project, which began around 1993. The project aimed to rehouse and assemble lists of the correspondence, offprints and photographs in the Gallery’s curatorial dossiers (what other museums would call ‘object’ or ‘history’ files); and add core information about every painting – bibliographic references, exhibition histories, references in the Gallery’s archive, and related works.
The Dossier Project ran until 2004, by which time half the collection had been documented to this higher standard. (The other half simply had the dossier contents rehoused.) At this point, the project was placed in a ‘maintenance’ mode: assistants were employed to add new information to the dossiers as it came to the Gallery’s attention, when new acquisitions were made, or as cataloguing projects were completed and curators’ working notes rehoused, but no further active research was done to augment the existing bibliographies, etc. Finally, as austerity hit in the early 2010s, the assistant post was frozen in 2011 and abolished in 2013.
In parallel, the Gallery’s Library was also noting references to Gallery paintings as it catalogued newly-acquired books and catalogues. These are included in the Library’s online catalogue: try searching for NG35 (Bacchus and Ariadne’s accession number). Looking at the detailed record for a publication – for example, this one for Alessandro Ballarin (ed.), Il camerino delle pitture di Alfonso I, 6 vols, Cittadella 2002-7 – you can see the information recorded under ‘NG Inventory Number’ (MARC field 690 $a $b, for any librarians reading this).
Like the TMS-derived bibliographies which we have published, the library catalogue references have been imported into our CIIM middleware and linked to the relevant paintings. However, as these simply record every book acquired by the Library that mentions the painting in question, without any assessment of the publication’s significance, it was decided not to include them in the paintings’ web pages. This work, too, was abandoned in about 2013 for lack of resources. Since then, there has been no systematic recording of publications relating to Gallery paintings.
Exhibitions
We have also put online 3,789 links between 1,295 objects and 1,116 exhibitions. There may be more than one link between a given object and its exhibition, because we link the object to each venue at which it appeared – we don’t always lend to every leg of a multi-venue show. Have a look at Artemisia Gentileschi’s Self Portrait as Saint Catherine of Alexandria (NG6671) for some unexpected venues – again, scroll down to below the image, to the ‘Key facts’ section, and click on ‘Exhibition history’.
These records detail all the exhibitions in which Gallery paintings have appeared between 2009 and the present day. The reason that the coverage differs from the bibliographies is that the information has been derived from a different source: the data that we use to manage Gallery exhibitions and loans out to other shows, and that we hold in our collections management system, TMS. We started using TMS to manage loans and exhibitions in 2009, so this is the point from which we have reliable data in the system.
But as I mentioned above, the dossiers also include exhibition histories for our objects. We are currently working on those lists, so that we can add them to the records in TMS and publish them online in due course.
How did we do it?
Dossier data
The dossier lists were first compiled as WordPerfect documents (remember that the project started in 1993). They were soon converted to MS Word; but each painting’s list exists as a separate word-processor document, not as structured data in a spreadsheet or database – and we need structured data if we are to reuse it as easily and widely as possible, for example on our website, where we can embed it in our painting pages rather than have it hidden in a file which has to be downloaded and opened.
The documents were also compiled by different people, each of whom had their own way of implementing the layout conventions adopted by the project (listed by date, with date followed by a hanging indent and the citation itself; additional data such as shelf-marks, or whether the Gallery has a copy or offprint, indented below the full reference). For example, some people used Word’s paragraph formatting to add the indent; others added a hard return and indented the second and subsequent lines using paragraph settings; others added a tab at every line break to do the same.
Equally, Word is remarkably profligate when applying formatting tags in the underlying document (which is stored as XML), applying multiple tags of the same kind in a span of text which should be enclosed within a single pair. For example, we find “Robert, K., Traité pratique de la peinture à l’huile (paysage), Paris 1878” encoded as
Robert, K., <i>Trait</i><i>é</i><i> pratique de la peinture </i><i>à</i><i> l’huile (paysage)</i>, Paris 1878
rather than
Robert, K., <i>Traité pratique de la peinture à l’huile (paysage)</i>, Paris 1878
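To give a flavour of the clean-up this requires, here is a minimal Python sketch that collapses adjacent tags of the same kind into a single pair. It is an illustration rather than the script we actually ran: it works on the simplified <i>…</i> markup shown above (a real Word file wraps each fragment in its own run element with formatting properties), and the function name merge_adjacent_tags is invented for this example.

```python
import re

# A closing tag immediately followed by an opening tag of the same kind,
# optionally separated by whitespace: "</i><i>" or "</i> <i>".
# Assumption: only simple, attribute-free inline tags (i, b, u) occur.
ADJACENT_TAGS = re.compile(r"</(i|b|u)>(\s*)<\1>")


def merge_adjacent_tags(markup: str) -> str:
    """Collapse runs of identical adjacent inline tags into one pair."""
    # Keep any whitespace that sat between the two tags, drop the tags.
    return ADJACENT_TAGS.sub(r"\2", markup)


fragmented = (
    "Robert, K., <i>Trait</i><i>é</i><i> pratique de la peinture </i>"
    "<i>à</i><i> l’huile (paysage)</i>, Paris 1878"
)
merged = merge_adjacent_tags(fragmented)
print(merged)
```

Applied to the fragmented citation above, this yields the single-pair form that we would have expected Word to store in the first place.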
All of this makes it surprisingly difficult to extract structured data from the Word files. We used Python scripts to parse the data into discrete units of information, which were output as CSV files.
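The following sketch shows the general shape of such a script, not our production code. It assumes the text has already been extracted from Word as plain lines, and that an entry begins with a four-digit year followed by whitespace and the citation; anything else is attached to the preceding entry as unclassified ‘extra’ text. The function names, the column names and the note in the sample are all hypothetical.

```python
import csv
import io
import re

# Assumed convention: year, then a tab or spaces, then the citation.
ENTRY_START = re.compile(r"^(\d{4})\s+(\S.*)$")


def parse_dossier_text(text: str) -> list[dict]:
    """Split plain dossier text into one record per dated entry."""
    entries: list[dict] = []
    for raw_line in text.splitlines():
        line = raw_line.rstrip()
        if not line.strip():
            continue  # skip blank lines
        match = ENTRY_START.match(line)
        if match:
            entries.append(
                {"year": match.group(1), "citation": match.group(2), "extra": []}
            )
        elif entries:
            # Continuation line or indented note: the layout alone cannot
            # tell us which, so keep it for later review.
            entries[-1]["extra"].append(line.strip())
    return entries


def to_csv(entries: list[dict]) -> str:
    """Serialise the parsed entries as CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["year", "citation", "extra"])
    writer.writeheader()
    for entry in entries:
        writer.writerow({**entry, "extra": " | ".join(entry["extra"])})
    return buffer.getvalue()


sample = (
    "1584\tLomazzo, Trattato dell’arte della pittura, Milan 1584\n"
    "\t(hypothetical note: offprint in dossier)\n"
    "2007\tA. Brookes, ‘Richard Symonds’s Account of His Visit to Rome in 1649-1651’,\n"
    "The Walpole Society, LXIX, 2007, pp. 1-183\n"
)
entries = parse_dossier_text(sample)
print(to_csv(entries))
```

A rule this simple would misfire on a continuation line that happens to begin with a year, which is one reason why the real files needed more careful handling.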
Because the same publication could be referred to many times, for different paintings, there were often minor differences in the way the same publication was cited in different Word files: for example, using an author’s initials or full names; the order in which multiple authors appeared; or precisely what punctuation was used to separate titles from sub-titles. We therefore uploaded the CSV tables into OpenRefine and used its clustering algorithms to reduce multiple forms of a citation to a single one wherever possible.
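For readers unfamiliar with this kind of clustering, the sketch below is loosely modelled on the ‘fingerprint’ key-collision method that OpenRefine offers; it is not OpenRefine’s own code, nor a record of the settings we used. Each citation is reduced to a key by lower-casing it, folding accents, dropping punctuation, and sorting the unique words; citations that share a key are proposed as variants of one another. The second citation in the sample is an invented variant of the first.

```python
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Reduce a citation to an order- and punctuation-insensitive key."""
    # Fold accented letters to ASCII (é becomes e); characters with no
    # ASCII equivalent, such as curly quotes, are dropped.
    decomposed = unicodedata.normalize("NFKD", value.strip().lower())
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Remove remaining punctuation, keeping letters, digits and spaces.
    cleaned = "".join(ch for ch in ascii_only if ch.isalnum() or ch.isspace())
    # Sort the unique tokens so that word order no longer matters.
    return " ".join(sorted(set(cleaned.split())))


def cluster(values: list[str]) -> list[list[str]]:
    """Group values by fingerprint; return groups with more than one form."""
    groups: dict[str, list[str]] = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


citations = [
    "Robert, K., Traité pratique de la peinture à l’huile (paysage), Paris 1878",
    # Invented variant: different author order, accents and punctuation.
    "K. Robert, Traite pratique de la peinture a l'huile: paysage, Paris 1878",
    "Lomazzo, Trattato dell’arte della pittura, Milan 1584",
]
clusters = cluster(citations)
print(clusters)
```

A key of this kind copes with differences in punctuation and in the order of authors, but not with initials versus full names; those need a fuzzier nearest-neighbour comparison or a human eye.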
Once this was complete, we output the files into MS Excel, and used these spreadsheets as source files for uploading the data into TMS (which runs on SQL Server) using SQL statements.
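The principle of the upload can be sketched as follows. This is only an illustration under several assumptions: it uses Python’s built-in SQLite as a stand-in for SQL Server, and the three tables (authors, refs, painting_refs) are hypothetical and far simpler than the real TMS schema, which is not reproduced here. The point is that each author and each publication is created once, and then linked to every painting that cites it.

```python
import sqlite3

# Hypothetical, much-simplified schema (NOT the TMS data model).
SCHEMA = """
CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE refs (
    id INTEGER PRIMARY KEY,
    author_id INTEGER NOT NULL REFERENCES authors(id),
    citation TEXT UNIQUE NOT NULL
);
CREATE TABLE painting_refs (
    accession_no TEXT NOT NULL,
    ref_id INTEGER NOT NULL REFERENCES refs(id),
    PRIMARY KEY (accession_no, ref_id)
);
"""

# Rows as they might come out of the tidied spreadsheet.
rows = [
    ("NG35", "Lomazzo", "Trattato dell’arte della pittura, Milan 1584"),
    ("NG35", "A. Brookes", "‘Richard Symonds’s Account of His Visit to Rome "
     "in 1649-1651’, The Walpole Society, LXIX, 2007, pp. 1-183"),
]


def load(conn: sqlite3.Connection, rows: list[tuple[str, str, str]]) -> None:
    """Insert authors, references and links, skipping existing ones."""
    for accession_no, author, citation in rows:
        # INSERT OR IGNORE is SQLite syntax; SQL Server would need a
        # different construct to skip existing rows.
        conn.execute("INSERT OR IGNORE INTO authors (name) VALUES (?)", (author,))
        author_id = conn.execute(
            "SELECT id FROM authors WHERE name = ?", (author,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO refs (author_id, citation) VALUES (?, ?)",
            (author_id, citation),
        )
        ref_id = conn.execute(
            "SELECT id FROM refs WHERE citation = ?", (citation,)
        ).fetchone()[0]
        conn.execute(
            "INSERT OR IGNORE INTO painting_refs (accession_no, ref_id) "
            "VALUES (?, ?)",
            (accession_no, ref_id),
        )
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load(conn, rows)
load(conn, rows)  # loading twice must not create duplicates
counts = {
    table: conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    for table in ("authors", "refs", "painting_refs")
}
print(counts)
```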
We have used the same techniques to process the dossier data for exhibitions, and are currently uploading the data into TMS; but there are many more discrepancies in how individual exhibitions and loans have been recorded, and we will need to do a significant amount of manual data tidying before the data is correct and ready for publication.
TMS data
The exhibitions data held in TMS was more straightforward. However, because it had been assembled to help us manage our collections, rather than for online publication, it contained some inconsistencies that didn’t affect Gallery operations, but would be apparent when delivered online.
We therefore created a set of SQL queries to identify as many of these inconsistencies as possible, and then went through the exhibition records by hand, tidying them and augmenting the data where necessary.
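The checks below give an idea of what such queries can look for. They are a sketch under stated assumptions rather than our actual queries: the table exhibition_venues and its columns are hypothetical, the sample rows are invented placeholders, and SQLite again stands in for SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per venue of an exhibition.
conn.execute(
    """CREATE TABLE exhibition_venues (
           id INTEGER PRIMARY KEY,
           title TEXT,
           venue TEXT,
           start_date TEXT,  -- ISO dates, so text comparison is safe
           end_date TEXT
       )"""
)
conn.executemany(
    "INSERT INTO exhibition_venues VALUES (?, ?, ?, ?, ?)",
    [
        # Invented placeholder rows, each illustrating one kind of problem.
        (1, "Example Show", "Venue A", "2015-01-10", "2015-04-10"),
        (2, "example show ", "Venue B", "2015-05-01", "2015-08-01"),
        (3, "Another Show", None, "2016-02-01", "2016-05-01"),
        (4, "Third Show", "Venue C", "2017-09-01", "2017-06-01"),
    ],
)

# Check 1: rows that are incomplete or internally contradictory.
problems = sorted(
    conn.execute(
        """SELECT id, 'end date before start date'
             FROM exhibition_venues
            WHERE end_date < start_date
           UNION ALL
           SELECT id, 'missing venue'
             FROM exhibition_venues
            WHERE venue IS NULL OR TRIM(venue) = ''"""
    ).fetchall()
)

# Check 2: titles that differ only in case or surrounding spaces.
variants = [
    row[0]
    for row in conn.execute(
        """SELECT LOWER(TRIM(title))
             FROM exhibition_venues
            GROUP BY LOWER(TRIM(title))
           HAVING COUNT(DISTINCT title) > 1"""
    )
]

print(problems)
print(variants)
```

Queries like these only produce a worklist; deciding which form of a title or date is the correct one is still a job for a person.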
Final pipeline
Once we have tidy data in TMS, it is ingested into our CIIM middleware, collated with the object records, and made available to the Gallery’s Umbraco web content management system.
Acknowledgements
As outlined above, putting the bibliographies and exhibition histories for Gallery paintings online has been the culmination of decades of work, to which numerous people have contributed.
The original data was assembled by the team working on the Dossier Project, and subsequent Dossier Assistants: Jacqui McComish, Susanna Avery-Quash, Sarah Herring, Angeliki Lymberopoulou, Barbara Pezzini, Flavia Dietrich-England, Joanne Anderson, Matthew Storey, Mercedes Cerón, Tania Adams and Thomas de Wesselow. Dossier data was initially wrangled by Kate Byrne, and further tidied by Tania Adams and Cynthia So.
TMS exhibition data was originally created by members of the Exhibitions and Registrars’ Departments as part of their day-to-day work; it was tidied by Allison Sharpe, assisted by Katie Holyoak; and advised by Hannah Woodley in the Library; Richard Dark, Alice Calloway, Naomi Lewis, Grace Hailstone, Sam Dorman and Louise Marlborough in the Registrars’ Department, and Sunnifa Hope in Exhibitions.
The CIIM was configured to accept the new data by James Huish at Knowledge Integration, in liaison with Jude Dicken at the Gallery. The entries have been incorporated into our new website collections pages: the initial interface designs were made by Numiko, in collaboration with the Gallery’s internal UX/UI team. Over the last year, Caroline Kha and Antonio Sauro worked with Lucinda Blaser, Nejra Hadzimejlic and Jim Gettrup to iterate the designs for integration; Nejra and Jim also built and implemented the final designs and the data integration.