THE FORGOTTEN IMPORTANCE OF METADATA TRACKING IN DIGITIZATION AND PORTAL USE OF CULTURAL HERITAGE.

By Larissa Tijsterman

22/12/2017

Introduction

Digitization of cultural heritage is still a recent development. Metadata digitization, however, has roots as early as the 1960s, when MARC (Machine Readable Cataloguing) revolutionised cataloguing practices in the library field. Computers have made standardization and sharing of metadata within the library, archive and museum fields a lot easier, but also a lot more complex. The library field was already familiar with a long tradition of sharing records; for many other institutions, sharing records across the field has not been common practice (Skinner 53-4). Since the last century many efforts have gone into the standardization of metadata across the field, resulting in various schemes and models: descriptions differ not only with the type of collection or object but also with the purpose of the record, such as retrieval, record keeping or description. These standardization efforts, together with the introduction of automated or digital management systems that each require yet another format, have led many institutions to hold not just collections of heritage objects but also collections of metadata, each in a different schema. Users have always relied on metadata records for their research, as these help them identify whether an object or item holds the information they seek. Searching the records can sometimes feel like a daunting task, but as users have to come into the institution for their quest, they can always rely on the help and guidance of expert archivists, librarians or curators to find the right collection to move them forward.

When institutions started to use the internet to display their collections, portals and platforms gave users the opportunity to engage with cultural heritage from the comfort of their home. For the first time, large-scale collaboration across the cultural heritage field became possible by integrating all these online collections on one website. All the different metadata schemas and the diversity in formats had to come together, and the messiness of the data was made painfully clear. As users come to the institutions less and rely more and more on the online availability of records, user studies have been crucial for evaluating the quality of online collections in portals and platforms and for discovering what is expected and how people engage with cultural heritage online. Each portal and each project attempting to unify the chaos promises to be the ultimate solution, yet formal evaluation methods are lacking, and when funding runs out many projects are abandoned, leaving behind even more different formats. User studies have always been very good at pointing out that metadata should be improved, yet they offer institutions no guidance on how to identify which collections are most at risk of being unusable due to bad data, nor recommendations on how to improve data in a way that suits both institutions and users.

This paper sets out to describe how users use online collections on platforms and portals and what they expect, in order to better understand what users’ needs are. This is followed by an institutional perspective on what is being done to improve data quality, and a look at crowdsourcing as a possible cost-effective solution for data creation and improvement. Finally, this paper proposes a framework to evaluate the data quality of a collection while comparing the situation with other institutions for possible data sharing opportunities.

Users and Online Collections: The Situation

As researchers come into heritage institutions less often to search their collections, the face-to-face help that the institutions offer recedes into the background. Instead, ‘[p]eople expect to find archives and special collections on the open Web using the same techniques they use to find other things, and they expect comprehensive results. The invisibility of archives, manuscripts and special collections may well have more to do with the metadata we create than with the interfaces we build.’ (Schaffner 86). Metadata, the data about the data, has been crucial for the management of information: whether stored on punch cards, written on paper, organized in Excel sheets or contained in a management system, metadata records have been the go-to source for discovering where or what items are held in a collection. Yet their creation is often driven by different philosophies and approaches, which can differ widely per field, per institution and even per person. Digitization of metadata records has exposed the discrepancies in the records and has fired up the debate about what the focus should be when creating metadata. Back in 1988 Dowler already advocated adjusting practices to suit not the theories but the users of the archive (74-80). In his paper he describes the different approaches of librarians and archivists in their role as gatekeepers of their collections. Although both set out to make their collections more accessible, their ideas about how this should be done are quite different: librarians seek to make users self-sufficient in their search, whereas archivists like to guide ‘a grateful scholar by the hand through the uncharted forest of the record to precisely the right material’ (Dowler 82). That Dowler advocated more user-focused studies at the end of the eighties seems logical to us now, as users reach institutions through different channels and their purpose is not always easy to identify.

As long as all users still come physically through the door, observing their searching behaviour and aiding them in their quests is relatively straightforward. A knowledgeable member of staff who is present in person is approachable for all sorts of questions from researchers. Staff also have plenty of opportunities to talk to researchers, to ask them more about their research and their needs, and to provide guidance towards collections and items that the researchers themselves would not initially have thought of. This holds for librarians as well: even though they initially take a more hands-off approach, they are still physically there for any inquiries. Both archivists and librarians work as gatekeepers of information. Their in-depth knowledge of the collections and their decisions on how collections are formed and described influence researchers hugely, yet at the same time, if anything is unclear, those from the institution aiding researchers are the first to know.

Dowler's plea, unbeknown to himself, stems from the introduction of the electronic records systems that made their way into the reading rooms and started to replace the more traditional card catalogues and microfiches. Even though he states that archivists had little knowledge of who their users were even before the introduction of electronic records management systems, finding out who the users are and how they search the archives is relatively easy in that setting, as described above.

Our approach to searching for information has changed drastically since the introduction of computers and the internet, which confirmed Dowler's view that more attention should be given to the actual users. The field has listened: user studies are the go-to tool for evaluating both systems and collections, and over the years they have provided valuable insights into who the users are, how they search, what their expectations are and what needs to improve. Unfortunately, they often lack the how. Published research on how to improve is often linked to funded projects that develop tools or host platforms or portals, or to commercial companies that offer solutions for a price; all come with mild or non-existent critical evaluation. Those working with collections in the heritage field are often confronted with limited funds, high workloads, little time and an enormous backlog. Creating and improving records often comes down to routine procedures that were in place long before the current employees started working for the institution and that often serve one single goal: reducing the backlog as fast as possible (O'Hara Conway and Proffitt 19-20).

The pull for digitization of cultural heritage comes from users as well as from the institutions themselves; even governments are a driving factor behind digitization. Records, catalogues and metadata were among the first items to be digitized, because they are flat text and because of their usefulness for internal institutional use. The introduction of computers may have made user studies more complex, but it can also increase access to heritage collections and open them up to a wider audience. Digital records also enabled institutions to share their data with each other and the world, which in essence should make it easier to supplement each other and share best practices, but it has also revealed the huge number of discrepancies and the messiness in the data. Issues concerning records, and especially the metadata of individual items, take up almost half of the time allocated to digitization projects, making metadata an important and costly investment (Primary Research Group 28).

The 4 Ws of Users

Digitization of records and metadata has made users obscure, hidden behind screens and IP addresses, and more difficult to research. User studies have made a very important contribution to the understanding of how digital collections are found and used, which helps to better allocate resources for digitization and to evaluate past projects. “The more users do not need to consult archivists and librarians for searching, the more successful initiatives to improve description and discovery have been” (Schaffner 87). Interestingly, although most user studies are great at pointing out where digital collections fail to meet users’ expectations, they often do not provide the strategy or method for improving the digital collections. This, however, does not take away from the immensely valuable insights we have gained about users’ online activity over the past twenty to thirty years, especially for institutions that are just embarking on the digitization of their collections. Metadata forms a critical part of the discovery and understanding of heritage items and collections; badly formatted or missing metadata results in collections that are hidden online, as they can hardly be found. Knowing what is expected can help with the initial design and set-up of digitization projects. In an extensive literature review of user studies on special collections in both libraries and archives, Schaffner identifies several important characteristics of researchers and their interactions with digital collections:

Where are Users Searching?

The first big misconception about how researchers go about searching concerns where they start: most begin their search on “Google or Wikipedia, and usually only come to libraries and archives for known items” (Schaffner 86). This holds for museums too: a 2008 study of museum object discovery noted that “[s]everal researchers said that they would start with Google, even though it is unlikely at present that they would pick up museum collections from a Google search” (Discovering Physical Objects 16). In 2008 it may indeed have seemed unlikely to find collections, let alone individual items, through Google; ten years on, ‘[p]eople expect to find archives and special collections on the open Web using the same techniques they use to find other things’ (Schaffner 86). Portals, platforms and institutions’ websites are no longer the first places of interaction for users, which makes well-formatted metadata even more important, as it can easily be picked up and indexed by search engines.

What are Users Searching?

As discovery is mediated through search engines, researchers tend to use keyword and subject searches to find their information. Schaffner identifies that researchers prefer to learn what collections are about, while archivists and librarians tend to focus on what the collections are made of (87). These two different approaches make it necessary for descriptions and metadata to contain information and keywords about the objects and collections themselves, so that they can be identified by search engines. In addition, as researchers do not arrive through the institution’s homepage, information about the coverage of the online collections becomes unclear, resulting in the assumption that what can be found online is everything that is known about a subject. For example, during a usability study of the Archives Hub website, a large portal through which archival data in the UK can be accessed, participants believed that their search results were a comprehensive overview of everything that was known on that subject (Stevenson). Considering that only around 10% of cultural heritage in Europe has been digitised and that hidden collections hardly come through in search results, the digitized objects researchers come into contact with online are only a very small percentage of the physical collections that are available (Europeana).

What are Users Expecting to Find?

At the final level, the metadata of individual objects, opinions differ on how extensive the data should be. Both short and long descriptions find favour; long descriptions cause less nuisance than short ones, as they can be skimmed for keywords and easily scrolled through (Schaffner 90). Nonetheless, “Users support concise minimum-level descriptions, which can also be effective for discovery when it is done well” (ibid.).

When it comes to non-expert users, there is a distinct preference for certain metadata fields over others. Using a digital image collection, Fear had two panels of non-expert users set out on a search and discovery quest while ranking the usefulness of the Dublin Core elements (33). Fear concluded that description, subject and format were selected by both panels as most useful for understanding an object, making these three fields the most valuable for minimum-level descriptions (44).

What do Users Expect From Institutions?

Now that it is clearer how users go about searching for heritage objects online, one question is left: what do users expect from institutions when it comes to the data? In 2015 Europeana, Europe's biggest cultural heritage platform, held a two-day conference, the EuropeanaTech Conference. Its proceedings voiced what researchers need from institutions quite clearly in the following headline: "We want better data quality: NOW!". Through several presentations and discussions, researchers and representatives of institutions and Europeana debunked long-held assumptions and voiced clear recommendations and expectations on how data quality can be improved. Although each institution works within a certain framework to standardize its data and present it at its best, researchers experience it quite differently. Due to their search behaviour they encounter many different formats, often from different portals and platforms, although it is not unusual to find different formats within the same platform. Messy data is one of the key struggles of researchers, and they feel that the main issue lies with the perception of who is responsible for data maintenance after publication. “The quality of the data structure is usually determined by the standards in place in cultural heritage”; however, “[o]n the web data quality is a moving target”, as data is continually being used by different applications, algorithms and services, making it nearly impossible for institutions to track it and keep up its quality (‘We Want Better Data Quality’).

Both researchers and representatives of institutions came to a better understanding of needs and possibilities concerning data quality. Most importantly: “Participants agreed on the fact that we should accept any data and define it as good quality as long as it is consistent: ‘weird is okay if it’s consistently weird’” (‘We Want Better Data Quality’). Standardization seems to be the way forward for metadata.

Standardization in Practice

“Data is not necessarily meant to be understood in a context other than the knowledge domain it was created in and the purpose for which it has been published. Lots of information is therefore implicit for the professionals who created the data, but becomes less obvious once published within end-users’ applications or interfaces” (‘We Want Better Data Quality’).

The use of digital information systems has both helped and complicated standardization. As many institutions are not in a position to build and maintain their own information systems, the commercial sector often jumps at the chance to provide this service. These contracts are long term with big price tags, and multiple stakeholders, each with their own wish list, have the final say in the matter. It wasn’t until the introduction of publicly accessible computers within institutions that serious thought was given to what data in these systems should be available to the public and how it should be presented visually. Archives and libraries had previous experience with publicly visible data through their card catalogues, whose quality and organization depended solely on whoever was in charge. For many institutions, having a public catalogue often meant having a (semi-)separate system containing data on each item, ideally tailored for the public but often the same as that used by professionals, as translating or rewriting is time-consuming and expensive. Within the museum field, publicly accessible catalogues are still a very recent development, as there was no need for such data to be available outside the professional environment (Pekel). Even the reasons for keeping such data, and what was actually recorded, largely depended on the purposes the records needed to serve. Records kept by galleries and museums are often for insurance purposes, whereas those of libraries and archives are often for retrieval and discovery purposes.

With the coming of the internet, real efforts in digitization brought the importance of metadata to the fore. Previously, when members of the public were unable to find the right information, they always had the option of asking someone, as they were still in the institution itself. With the internet, people are becoming more and more disconnected from the institutions: “goals to disclose descriptions online and to digitize primary resources have made special collections more visible and roles of archivists and librarians less visible” (Schaffner 87).

Presenting objects from cultural heritage institutions online involves yet another system. Early adopters replicated for the online environment the frameworks they already offered to the public within the institution. However, the coming of Web 2.0 rendered such practices outdated and unfashionable. The continuing digitization also caused problems with materials that were not strictly text, as a full-text search on objects that do not contain text is simply not possible with current technology. To make such objects findable, metadata is crucial, as the object is represented through its metadata; having this data visible and clickable not only provides context for the object but also leads to further discovery for those who interact with it (Fear 42-5).

Digitization, the internet and the call for open access brought another big change on a scale that wasn’t possible before. Portals and platforms make collaboration between institutions across the cultural heritage field possible; however, they also laid bare the massive number of discrepancies in both metadata content and metadata format. Getting all these (largely commercial) systems, which were not initially designed for that purpose, to talk to each other has been a real challenge, especially as portals are becoming more and more like platforms. Portals started out as webpages with bookmarked links to the different institutions; currently they are being developed much like platforms, where all the objects of different institutions come together for the world to engage with.

Portals come in many shapes and sizes through all sorts of collaboration efforts: some are the result of connecting different collections on the same topic, others of regional collaborations, national collaborations, collaborations of institutions with the same information management system, cross-border collaborations, collaborations arising from research projects, collaborations based on a certain medium, etc. Each collaboration and portal faces the same challenge: after all this time each institution (and sometimes even each collection) has its own metadata standards and formats, which have to be rewritten to fit into the portal yet have to remain the same to be operable in existing systems. Each effort to bring it all together promises to be the answer to the problem, often with lengthy published papers and documentation, but the reality is that it creates even more independent systems, formats and standards, much like a network without links between its nodes, and with no formal way of evaluating the success of streamlining and linking all these styles.

How to Move Forward; Crowdsourcing as a Solution Everyone Can Agree On

Over the years many collaborations between institutions have led to a multitude of different formats that were supposed to solve system integration problems as well as provide consistently formatted metadata for researchers across institutions. However, the large variety of metadata schemata with mind-boggling acronyms has contributed more to the problem of inconsistent data than to a solution (Park and Tosaka 109). Although calls for a more common metadata format remain, the enormous backlog and the push to have collections out on the web make a radical change in the creation of metadata seem necessary. “Cultural institutions tend to think about data quality in isolation, however, the users of the data may be in the best position to comment on the data and help improving it” (‘We Want Better Data Quality’). Crowdsourcing may be the way forward for creating and improving metadata. There might be a fear among cataloguers and record creators that crowdsourcing will leave them without a job, and that the content created by the crowd will be of such poor quality that collecting it costs more time and money than it saves. Crowdsourcing is sometimes seen as a useless waste of an institution’s money, but these negative views are largely based on myths that have been debunked through thorough research and experiments. Crowdsourcing has been in place for quite some time and its quality has been closely monitored through experiments and evaluations (Aroyo and Welty 16).

Metadata enrichment and correction are nothing new. OCR correction was one of the first and easiest kinds of crowdsourcing project, as it simply involves reading handwritten or old printed text, comparing it with the output and correcting the output where necessary. One of the biggest yet most covert OCR correction projects is Google’s reCAPTCHA, which used sections from Google Books that were difficult or impossible to transcribe automatically for the ‘no robot’ validations on websites. Unknowingly, millions of people contributed better transcriptions for Google Books while trying to sign in to websites, order items or sign up for newsletters. As more people fill in the same word for a reCAPTCHA, that word is adopted as the correct transcription. Although some words could not be deciphered even by human eyes, “reCaptcha nevertheless achieves an accuracy rate above 99 percent, which compares favourably with professional human transcribers” (Gugliotta).
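
The idea of adopting the word that most contributors agree on can be illustrated with a small majority-vote sketch. This is not reCAPTCHA's actual implementation, which is not described in the sources cited here; the function name and the thresholds min_votes and min_share are assumptions chosen only for illustration.

```python
from collections import Counter
from typing import List, Optional


def consensus(answers: List[str], min_votes: int = 3,
              min_share: float = 0.6) -> Optional[str]:
    """Return the agreed transcription, or None if there is no clear winner.

    min_votes and min_share are illustrative thresholds, not values
    taken from any real crowdsourcing system.
    """
    # Ignore empty answers and differences in case or surrounding spaces.
    cleaned = [a.strip().lower() for a in answers if a.strip()]
    if len(cleaned) < min_votes:
        return None  # too few contributions to decide
    word, votes = Counter(cleaned).most_common(1)[0]
    # Accept the most frequent answer only if enough contributors agree.
    return word if votes / len(cleaned) >= min_share else None
```

A word that never reaches the agreement threshold stays undecided, which corresponds to the words that even human eyes could not decipher; in a heritage setting such cases could be passed on to a professional.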

It must be noted, however, that reCAPTCHA received much criticism because people were not aware that they were working on transcriptions and received no compensation. This argument led in 2015 to a lawsuit against Google for stealing people’s time, as the transcriptions were used for books that Google sold. In the following year, however, the judges dismissed the case, as the transcriptions were also used for other, free Google services and the amount of time consumed was trivial (Harris; Kravets).

Bringing Data And People Together On a New Platform

Successful crowdsourcing can still be done with volunteers who genuinely want to spend some of their free time tagging, transcribing, identifying, indexing, linking and more. A good example is the Dutch initiative VeleHanden.nl, whose name derives from an old Dutch saying, ‘many hands make light work’. VeleHanden.nl is a crowdsourcing tool from the commercial company Picturae, which specialises in the digitization, storage, management and enrichment of collections; the website was created in collaboration with Stadsarchief Amsterdam and is used by almost all municipal archives in the Netherlands. Volunteers can sign up for free on the website, pick an available project and contribute as much or as little as they like. VeleHanden.nl has been praised for the high quality that volunteers deliver as well as for their enthusiasm. Volunteers do not work completely for free: because of the significance of their contributions they are rewarded with points for each entry, which can be traded for printed archival scans, coffee or tea in the heritage institutions, tours of the institutions, tickets for exhibitions and, from time to time, products from sponsors of the archives.

Bringing The People To The Collections

VeleHanden.nl is a great initiative; however, people have to know about it, find the site and sign up before they can contribute. This involves deliberate action from the volunteers and a whole separate environment for the enrichment of content. Although some of the volunteers are professionals, users of platforms and online collections indicate a lack of opportunity to report, correct or enrich the metadata they encounter (‘We Want Better Data Quality’). In the current Web 2.0 environment VeleHanden.nl seems somewhat outdated. To get back to the intimate relationship between researchers and archivists and librarians, feedback loops should be in place in both platforms and online collections. As researchers already tag, annotate and enrich for their own notes, these functions could be enabled within the platform in the form of integrated or external tools or APIs. Data from these inputs can be fed back to institutions for evaluation or shared in a commons for the use of others. CLARIAH’s Media Suite is one such example, although it is only accessible to researchers affiliated with certain institutions and not to the general public, due to the copyright restrictions on the collections to which the Media Suite provides access. Both a valuable research tool and a heritage platform, the Media Suite offers several tools for enrichment and annotation of the data in the suite; however, because the Media Suite is in constant development, documentation is at the moment very limited (CLARIAH-Media Studies). To bridge the physical gap between institutions and users digitally, the Media Suite maintains an active community where researchers can bring ideas, comment on the APIs and share practices with other researchers (CLARIAH-Media-Studies - Gitter).

Bringing The Collections To The People

If institutions seek to engage with the wider public rather than only with researchers, Web 2.0 provides ample opportunity to bring collections to social platforms. Bringing collections to places where people already are saves a lot of time in promotion and in raising public awareness (Kalfatovic, Kapsalis, Spiess, van Camp and Edson 268). After seeing the success of the Library of Congress experiment of releasing copyright-free material on the Flickr Commons, the Smithsonian Institution chose to “embrace the social networking reality of the Web 2.0 – choosing to go where visitors are and not requiring them to come to us” (ibid.). The Smithsonian had expected that the exposure on Flickr would redirect traffic towards its own pages. This did not happen, even though the exposure resulted in a huge amount of interaction with the content on Flickr. Instead, an unexpected result was measured: the images were used and reused on a great variety of blogs, increasing the diffusion of knowledge, which aligns with the institution’s motto.

The Framework To Evaluate Data Quality Of a Collection

User studies have provided insights into how users engage with online cultural heritage collections: they no longer access institutions’ websites, portals and platforms directly, but rather use the open web with keyword and subject searches to find what they need, and they believe to a certain extent that online collections are equal to those offline. Researchers want better data quality and more collections online, even if only as records. While institutions would like to meet all these expectations, they are under serious budget strain and face enormous backlogs. Crowdsourcing can offer institutions a way to tackle their own problems and meet researchers’ needs, while at the same time engaging with the wider public and increasing the institution’s exposure.

Once collections are out on the web they are, knowingly and unknowingly, picked up and reused by portals, platforms, APIs, social media and blogs. It is no surprise that heritage institutions are not fully aware of which of their data is out there and what its quality is. User studies only point out that the data needs to be improved; they do not provide any guidance or method for identifying which collections are most at risk of not being found and used due to poor data quality. The method proposed here exposes discrepancies in the data by tracing a ‘single’ object across different institutions and platforms. Although institutions in the cultural heritage field like to boast about the unique objects in their collections, they also hold many objects of which (semi-)duplicates or similar versions can be found in the collections of other institutions, and sometimes even within the same collection. These objects can be key to measuring how well the metadata online is synchronised with the metadata available in the institution itself, as they can be traced back from the platform to the institution. In an ideal situation we would know exactly which objects are present in which institutions and collections, digitize only one, and have all the others refer to the same digitized object and supplement the records. This would save immense costs, but it also requires organizational collaboration on a global scale, which at the moment is still very far away. With the urge to digitize material now, and with it being more cost-effective to digitize entire collections rather than small selections, the likelihood of the same or a similar object being digitized by different institutions is great (Primary Research Group 33). One ‘digital copy’1 representing all the different versions out there might not be the ultimate solution, but if every single object is to have its own digital counterpart, the data should be roughly the same. Right now we only discover these multiple copies when the different collections come together online, especially within portals, as these often contain collections from archives, libraries, museums and sometimes even galleries.

1 Personally I do not consider the digitized version of a real-life object to be a copy or even a 2D/digital representation. The identity of the digitized object is far more problematic, as it is stuck between two dimensions: it is both a shadow of its real-life counterpart and an entirely new object in itself. As this requires a more extensive explanation, for which this paper is not well suited and which does not help to clarify the method, I will use more generally accepted terms such as digital ‘copy’ or ‘version’.

The method is relatively simple in execution:

  1. Choose an object that is most likely to be held in multiple collections (preferably one that will also be found in different fields).
  2. Find all the different versions in the portal or platform (or across portals or platforms) and note the following for each item:
    1. Is an image present? If yes, what are its dimensions, and can it be downloaded and/or zoomed?
    2. Is the most basic metadata (title, author, date, material) present and readable? Note the contents of the basic fields.
    3. What other data is present? Note all other fields and their contents.
    4. Is the metadata clickable for further discovery?
    5. Is there a reference to the institution? Also note the type of institution.
      1. If the reference is clickable, where does it lead? (institution homepage, collection homepage, item within the institution’s website, or other)
    6. When tracing the object to the institution, follow the same format for each page you come across, including the institution’s website. If the reference to the institution is not clickable, trace the object from the institution’s homepage and make an extra note of the effort it took to trace the object without links.
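
The checklist above could be recorded in a small data structure, one record per page on which a version of the object is found. The following is only a minimal sketch in Python; the class name `TrackedRecord`, its field names and the helper `missing_basic_fields` are hypothetical and are not part of the method as described.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# The four basic fields named in step 2.2 of the method.
BASIC_FIELDS = ("title", "author", "date", "material")


@dataclass
class TrackedRecord:
    """One version of the object as found on one page (hypothetical structure)."""
    platform: str                      # portal, platform or institution website
    permalink: str                     # stable link to the record
    image_present: bool = False
    image_dimensions: Optional[Tuple[int, int]] = None  # (width, height) in pixels
    image_downloadable: bool = False
    image_zoomable: bool = False
    basic: Dict[str, str] = field(default_factory=dict)  # title, author, date, material
    other: Dict[str, str] = field(default_factory=dict)  # every other field: content
    metadata_clickable: bool = False
    institution: Optional[str] = None
    institution_type: Optional[str] = None   # e.g. archive, library, museum
    reference_target: Optional[str] = None   # homepage / collection / item / other
    tracing_notes: str = ""                  # effort needed when no link exists


def missing_basic_fields(record: TrackedRecord) -> List[str]:
    """Return the basic fields that are absent or empty in a record."""
    return [name for name in BASIC_FIELDS
            if not record.basic.get(name, "").strip()]
```

A record that lacks an author and a material would then be flagged on exactly those two fields, which makes it easy to see at a glance whether the minimal metadata is complete.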

What should good data quality aim for?

  • The most basic metadata should be the same, or at least consistent, across the platform (horizontal).
    • Minimal metadata ensures that items are discoverable.
  • Additional data should be consistent with that of the institution (vertical).
    • “weird is okay if it’s consistently weird” (‘We Want Better Data Quality’).
  • Images should always be present and be viewable at least at dimensions that make the contents clear.
  • References to the institutions should be clickable and preferably link to the item on the institution’s website or in the online collection.
    • Heritage institutions are generally trusted sources; being able to see where the object comes from increases the confidence of non-expert users to reuse the object (Fear 47-8).
  • Metadata should be clickable for ease of further discovery.
    • Although this often depends on the website where the object is hosted, some form of further discovery allows users to keep discovering and browsing collections (Fear 37).
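
The horizontal criterion (basic metadata consistent across platforms) lends itself to a simple automated check. The sketch below, in Python, assumes that the basic metadata of each version has already been noted as a dictionary per platform; the function name `horizontal_discrepancies` and the sample values are hypothetical illustrations, not results.

```python
from collections import defaultdict

BASIC_FIELDS = ("title", "author", "date", "material")


def horizontal_discrepancies(records):
    """Compare basic metadata across platforms.

    records: mapping of platform name -> dict of basic metadata.
    Returns {field: {value: [platforms]}} for every basic field that has
    more than one distinct value; a missing field counts as the empty string.
    """
    report = {}
    for name in BASIC_FIELDS:
        by_value = defaultdict(list)
        for platform, meta in records.items():
            by_value[meta.get(name, "").strip()].append(platform)
        if len(by_value) > 1:  # the platforms disagree on this field
            report[name] = dict(by_value)
    return report


# Hypothetical example: two platforms that differ only in name order.
versions = {
    "Portal A": {"title": "Example title", "author": "Jansen, Anna",
                 "date": "1900", "material": "paper"},
    "Portal B": {"title": "Example title", "author": "Anna Jansen",
                 "date": "1900", "material": "paper"},
}
discrepancies = horizontal_discrepancies(versions)
```

The same function can serve for the vertical check by passing the platform record together with the record from the institution’s own website.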

The results can be presented for example in the following table:

| | Platform Object Permalink | Platform Object Permalink | Platform Object Permalink | Platform Object Permalink |
|---|---|---|---|---|
| Image present? | Yes/No; Dimensions: | Yes/No; Dimensions: | Yes/No; Dimensions: | Yes/No; Dimensions: |
| Image downloadable? | Yes/No | Yes/No | Yes/No | Yes/No |
| Basic metadata | Title: ; Author: ; Date: ; Material: | Title: ; Author: ; Date: ; Material: | Title: ; Author: ; Date: ; Material: | Title: ; Author: ; Date: ; Material: |
| Other data | Field: content; Field: content; Field: content | Field: content; Field: content; Field: content | Field: content; Field: content; Field: content | Field: content; Field: content; Field: content |
| Metadata clickable? | Yes/No | Yes/No | Yes/No | Yes/No |
| Reference to institution? | Yes/No; Type: | Yes/No; Type: | Yes/No; Type: | Yes/No; Type: |
| Reference clickable? | Yes/No; Link: | Yes/No; Link: | Yes/No; Link: | Yes/No; Link: |
| Where does the link go? | Homepage/Collection/Item/Other; Ease of finding item: | Homepage/Collection/Item/Other; Ease of finding item: | Homepage/Collection/Item/Other; Ease of finding item: | Homepage/Collection/Item/Other; Ease of finding item: |

Table 1: Template Result Metadata Tracking

[Figure: Template Conclusion Metadata Tracking, by Larissa Tijsterman]

Figure 1: Template Conclusion Metadata Tracking

Here each link represents the reference, through links, towards each different version of the object, along with a short coded representation of the metadata. The ‘style’ refers to the contents of the metadata: for basic metadata this consists of title, author, date and material, and if the contents differ in any shape or form, it counts as a new ‘style’. For example, if two versions are identical except that one gives the author as first name, last name and the other as last name, first name, then each version has its own ‘style’. Depending on the platform, this relatively simple difference can cause simpler algorithms to identify the author of the work as two different authors, causing the object not to be found. This schema gives an indication of the state of the collection of which the object is part and also compares it with other institutions. Comparing it with other institutions helps to identify opportunities for collaboration and data sharing, as well as to measure the level of consistency.
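
The assignment of ‘styles’ can be illustrated with a short Python sketch. It gives every distinct combination of basic metadata its own code, so that any difference in content yields a new ‘style’. The codes `S1`, `S2`, the function names and the sample records are hypothetical. The helper `author_key` is an added assumption that is not part of the method: it shows one possible way in which a tool could recognise that two name orders refer to the same author, where a plain string comparison could not.

```python
BASIC_FIELDS = ("title", "author", "date", "material")


def assign_styles(records):
    """Give each distinct combination of basic metadata its own style code.

    records: list of (platform, basic_metadata_dict) pairs.
    Returns {platform: style_code}; identical contents share a code.
    """
    seen = {}    # tuple of field contents -> style code
    styles = {}
    for platform, meta in records:
        key = tuple(meta.get(name, "") for name in BASIC_FIELDS)
        if key not in seen:
            seen[key] = f"S{len(seen) + 1}"
        styles[platform] = seen[key]
    return styles


def author_key(name):
    """Order-insensitive key for a personal name (illustrative assumption).

    'Jansen, Anna' and 'Anna Jansen' produce the same key. The approach is
    crude: two different people whose names use the same words would collide.
    """
    return tuple(sorted(name.replace(",", " ").lower().split()))


# Hypothetical versions of one object on three platforms.
versions = [
    ("Portal A", {"title": "Example title", "author": "Jansen, Anna",
                  "date": "1900", "material": "paper"}),
    ("Portal B", {"title": "Example title", "author": "Anna Jansen",
                  "date": "1900", "material": "paper"}),
    ("Institution", {"title": "Example title", "author": "Jansen, Anna",
                     "date": "1900", "material": "paper"}),
]
styles = assign_styles(versions)
```

In this illustration Portal A and the institution share a style, while Portal B receives its own because of the name order alone, even though `author_key` treats the two spellings as the same person.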

This method has been written largely descriptively; however, it has the potential to be developed into a tool that uses basic retrieval principles to identify other versions of the same object and harvest the metadata connected to it. Further development would even make it possible to feed all data from one collection into the tool and have each item within the collection cross-checked against other platforms and portals. This would identify not just collections but individual items with poor data quality that need to be reviewed as soon as possible. The method also provides the opportunity to use quality data for the same item from other institutions to improve the data, or to arrive at a more selective set of items that can go straight to crowdsourcing projects or cataloguers for review.
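
What such a tool might do at its core can be sketched in a few lines of Python. The sketch assumes that records from other platforms have already been harvested into a list of dictionaries, and it uses plain title similarity from the standard library’s `difflib` as a stand-in for the ‘basic retrieval principles’ mentioned above. The function names, the `platform` and `title` keys and the threshold of 0.85 are hypothetical choices; a real tool would need more robust matching.

```python
from difflib import SequenceMatcher


def normalise(text):
    """Lower-case the text and collapse runs of whitespace."""
    return " ".join(text.lower().split())


def title_similarity(a, b):
    """Similarity of two titles between 0.0 and 1.0."""
    return SequenceMatcher(None, normalise(a), normalise(b)).ratio()


def candidate_versions(item, harvested, threshold=0.85):
    """Return harvested records whose title resembles the item's title.

    item: dict with at least a 'title' key.
    harvested: list of dicts, each with at least a 'title' key.
    Results are ordered from most to least similar.
    """
    scored = []
    for record in harvested:
        score = title_similarity(item["title"], record["title"])
        if score >= threshold:
            scored.append((score, record))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored]


# Hypothetical item from an institution and harvested platform records.
item = {"title": "Map of the City"}
harvested = [
    {"platform": "Portal Z", "title": "Map of the City."},
    {"platform": "Portal Y", "title": "Portrait of a Woman"},
    {"platform": "Portal X", "title": "map of  the city"},
]
matches = candidate_versions(item, harvested)
```

The candidates found in this way would still have to be confirmed as versions of the same object, after which their metadata can be compared field by field as described in the method.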

Conclusion and Recommendation

Cultural heritage institutions have been working on the standardization of records for a long time. Online platforms and portals have shown that metadata standardization is still a work in progress: currently the data is very messy, which makes collections and items hard to find. To better allocate resources, institutions rely on user studies to figure out how users engage with the online collections and what their expectations are. The studies have observed that researchers use the same techniques for searching heritage items as for searching general information. Researchers tend to use the open web for their search queries rather than going directly to the institution’s webpage, and they use keyword and subject searches to find information (Discovering Physical Objects 16, Schaffner 86). Even though only 10% of Europe’s heritage is available online, poor data means that even these items cannot always be found, resulting in many hidden collections (Europeana). When it comes to the amount of data for each item opinions are divided, but to make sure that collections are represented, minimal descriptions of the collection and/or item are a must (Schaffner 90, Fear 33, 44). Above all, for those collections that can be found, users and researchers want better data quality, but they understand that institutions are struggling with enormous backlogs and budget strains. They agreed that any data can be good-quality data as long as it is consistent (‘We Want Better Data Quality’).

Crowdsourcing can provide a solution for tackling backlogs and improving data quality. Institutions can choose different setups depending on the type of enrichment they require. For basic transcription, OCR checks and basic content descriptions, crowdsourcing platforms such as VeleHanden.nl can provide a solution; however, people need to be aware of the platform and sign up before they can participate. Further enrichment, such as the annotations and vocabularies that researchers use and create while conducting research, can be shared with other researchers and fed back to institutions. Developing such tools can be expensive but also fruitful, as with CLARIAH’s Media Suite; the downside, however, is that the tool mainly remains with researchers and is not always suitable for the wider public. For items that are free of copyright, bringing them to where people already are online, instead of having people come to yet another page, increases public engagement with cultural heritage, as the Smithsonian Institution did when it released a large collection of pictures to the Flickr Commons (Kalfatovic, Kapsalis, Spiess, van Camp and Edson 268). Although the items were used, reused, shared and commented on, getting the data back as useful metadata that can supplement the records is somewhat complicated, as it is largely unstructured.

Before metadata can be improved, it must be identified which collections are the messiest or most at risk of not being found online. Almost all collections contain items and objects that are also present in the collections of other institutions; these items are crucial for measuring the quality of a collection’s metadata and for comparing it with the data of other institutions. The comparison can help to find collaboration or data sharing opportunities; otherwise crowdsourcing can be used to improve these records. The framework has the potential to be developed into a tool, so that institutions can feed data from an entire collection into it and receive the improved data of other institutions as output or as recommendations for review.

Data on the web is always evolving and expectations are continually changing; institutions and users should remain in constant contact, look for opportunities to help each other out, and understand each other’s struggles. We have come a long way in the standardization and digitization of records, but there is still quite some road ahead. In research on digitization, it should not be forgotten to look not only at what is happening where and what needs to improve, but also at how institutions can change their workflows to suit users’ needs.

Bibliography

Aroyo, Lora, and Chris Welty. ‘Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation’. The AI Magazine, vol. 36, no. 1, 2015, pp. 15–24.

Bowker, Geoffrey C., and Susan Leigh Star. Sorting Things Out: Classification and Its Consequences. The MIT Press, 1999. lib.uva.nl, http://proxy.uba.uva.nl:2048/login?url=http://cognet.mit.edu/book/sorting-things-out.

Bron, Marc, et al. ‘Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems’. Making Archival and Special Collections More Accessible, Online Computer Library Center, 2009, pp. 63–84, http://library.oclc.org/cdm/ref/collection/p267701coll27/id/444.

Bulterman, D. C. A. ‘Is It Time for a Moratorium on Metadata?’ MultiMedia, IEEE, vol. 11, no. 4, 2004, pp. 10–17. lib.uva.nl, doi:10.1109/MMUL.2004.29.

Calhoun, Karen, et al. Online Catalogs: What Users and Librarians Want an OCLC Report. 311870930, Online Computer Library Center, 3 Mar. 2009, p. 68, http://www.oclc.org/en/reports/onlinecatalogs.html.

CLARIAH - Media Studies. https://www.clariah.nl/werkpakketten/focusgebieden/media-studies#dive-en-dive. Accessed 22 Dec. 2017.

CLARIAH-Media-Studies - Gitter. https://gitter.im/CLARIAH-media-studies/. Accessed 22 Dec. 2017.

Concordia, Cesare, et al. ‘Not Just Another Portal, Not Just Another Digital Library: A Portrait of Europeana as an Application Program Interface’. IFLA Journal, vol. 36, no. 1, 2010, pp. 61–69. lib.uva.nl, doi:10.1177/0340035209360764.

Cultural Heritage Digitisation, Online Accessibility and Digital Preservation. European Commission, 4 Oct. 2017, p. 74, http://ec.europa.eu/information_society/newsroom/image/document/2016-43/2013-2015_progress_report_18528.pdf.

Dangerfield, Marie-Claire, and Lisette Kalshoven. Report and Recommendations from the Task Force on Metadata Quality. Europeana, 2015, p. 54, https://pro.europeana.eu/files/Europeana_Professional/Europeana_Network/metadata-quality-report.pdf.

Discovering Physical Objects: Meeting Researchers’ Needs. A Research Information Network Report. Research Information Network, Oct. 2017, p. 56, http://www.rin.ac.uk/system/files/attachments/Discovering-objects-report.pdf.

Dowler, Lawrence. ‘The Role of Use in Defining Archival Practice and Principles: A Research Agenda for the Availability and Use of Records’. American Archivist, vol. 51, no. 1, 1988, pp. 74–86. lib.uva.nl, doi:10.17723/aarc.51.1-2.32305140q0677510.

‘Europeana: access to digital resources of European heritage’. CEF Digital, 30 Oct. 2017, https://ec.europa.eu/cefdigital/wiki/cefdigital/wiki/display/CEFDIGITAL/2017/03/13/Europeana%3A+access+to+digital+resources+of+European+heritage.

Evens, Tom, and Laurence Hauttekeete. ‘Challenges of Digital Preservation for Cultural Heritage Institutions’. Journal of Librarianship and Information Science, vol. 43, no. 3, 2011, pp. 157–165. lib.uva.nl, doi:10.1177/0961000611410585.

Fear, Kathleen. ‘User Understanding of Metadata in Digital Image Collections: Or, What Exactly Do You Mean by “Coverage”?’ The American Archivist, vol. 73, no. 1, 2010, pp. 26–60. lib.uva.nl, doi:10.17723/aarc.73.1.j00044lr77415551.

Gugliotta, Guy. ‘Deciphering Old Texts, One Woozy, Curvy Word at a Time’. The New York Times, 28 Mar. 2011, http://www.nytimes.com/2011/03/29/science/29recaptcha.html.

Harris, David L. ‘Massachusetts Woman’s Lawsuit Accuses Google of Using Free Labor to Transcribe Books, Newspapers’. Boston Business Journal, 26 Jan. 2015, https://www.bizjournals.com/boston/blog/techflash/2015/01/massachusetts-womans-lawsuit-accuses-google-of.html.

Kalfatovic, Martin, et al. ‘Smithsonian Team Flickr: A Library, Archives, and Museums Collaboration in Web 2.0 Space’. Archival Science, vol. 8, no. 4, 2008, pp. 267–277. lib.uva.nl, doi:10.1007/s10502-009-9089-y.

Kravets, David. ‘Judge Tosses Proposed Class Action Accusing Google of CAPTCHA Fraud’. Policy, 2 Sept. 2016, https://arstechnica.com/tech-policy/2016/02/judge-tosses-proposed-class-action-accusing-google-of-captcha-fraud/.

Lopatin, Laurie. ‘Library Digitization Projects, Issues and Guidelines’. Library Hi Tech, vol. 24, no. 2, 2006, pp. 273–289. lib.uva.nl, doi:10.1108/07378830610669637.

Media Suite Documentation. https://clariah.github.io/mediasuite-info/. Accessed 22 Dec. 2017.

Mitchell, Erik T. ‘Chapter 1: Metadata Developments in Libraries and Other Cultural Heritage Institutions’. Library Technology Reports, vol. 49, no. 5, Aug. 2013, pp. 5–10.

O’Hara Conway, Martha, et al. ‘Thresholds for Discovery: EAD Tag Analysis in ArchiveGrid, and Implications for Discovery Systems’. Making Archival and Special Collections More Accessible, Online Computer Library Center, 2009, pp. 17–38, http://library.oclc.org/cdm/ref/collection/p267701coll27/id/444.

Orgel, Thomas, et al. ‘A Metadata Model and Mapping Approach for Facilitating Access to Heterogeneous Cultural Heritage Assets’. International Journal on Digital Libraries, vol. 15, no. 2, 2015, pp. 189–207. lib.uva.nl, doi:10.1007/s00799-015-0138-2.

Park, Jung-Ran and Tosaka, Yuji. ‘Metadata Creation Practices in Digital Repositories and Collections: Schemata, Selection Criteria, and Interoperability’. Information Technology and Libraries, vol. 29, no. 3, 2010, pp. 104–116. lib.uva.nl, doi:10.6017/ital.v29i3.3136.

Pekel, Joris. ‘Democratising the Rijksmuseum’. Europeana, 10 Nov. 2017, https://pro.europeana.eu/post/democratising-the-rijksmuseum.

Primary Research Group. Survey of Library & Museum Digitization Projects. 2016 edition, Primary Research Group Inc, 2015.

Purday, Jonathan. ‘Europeana: Digital Access to Europe’s Cultural Heritage’. Alexandria: The Journal of National and International Library and Information Issues, vol. 23, no. 2, 2012, pp. 1–13. lib.uva.nl, doi:10.7227/ALX.23.2.2.

Schaffner, Jennifer. ‘The Metadata Is the Interface: Better Description for Better Discovery of Archives and Special Collections, Synthesized from User Studies’. Making Archival and Special Collections More Accessible, Online Computer Library Center, 2009, pp. 85–98, http://library.oclc.org/cdm/ref/collection/p267701coll27/id/444.

Skinner, Julia. ‘Metadata in Archival and Cultural Heritage Settings: A Review of the Literature’. Journal of Library Metadata, vol. 14, no. 1, 2014, pp. 52–68. lib.uva.nl, doi:10.1080/19386389.2014.891892.

Spigel, Lynn. ‘Our TV Heritage: Television, the Archive, and the Reasons for Preservation’. A Companion to Television, edited by Janet Wasko, Blackwell Publishing Ltd, 2005, pp. 67–99. Wiley Online Library, doi:10.1002/9780470997130.ch5.

Stevenson, Jane. ‘“What Happens If I Click on This?”: Experiences of the Archives Hub’. Ariadne, no. 57, Oct. 2008, http://www.ariadne.ac.uk/issue57/stevenson.

Stiller, Juliane. ‘A Framework for Classifying Interactions in Cultural Heritage Information Systems’. International Journal of Heritage in the Digital Era, vol. 1, no. 1_suppl, 2012, pp. 141–146. lib.uva.nl, doi:10.1260/2047-4970.1.0.141.

VeleHanden > Over VeleHanden Voor Instellingen. https://velehanden.nl/Inhoud/paginas/index/id/instellingenpagina. Accessed 22 Dec. 2017.

Verma, Naresh Chandra, and J. Dominic. ‘Digital Libraries: Definitions, Issues and Challenges in Modern Era’. Journal of Library Information and Communication Technology, vol. 1, no. 1, 2009, pp. 44–55.

We Want Better Data Quality. Europeana, 2015, https://pro.europeana.eu/page/data-quality-etech15-roundtables.