Digital Archives: Reading and Manipulating Large-Scale Catalogues, Curating and Creating Small-Scale Archives
Instructors:
Yael Netzer
Duration: both weeks
The purpose of this two-week workshop is to develop practical and critical skills toward the representation of knowledge in digital archives and to build a small-scale digital archive. This workshop blends theory and hands-on activities, enabling participants to engage with digital catalogues, metadata structures, and archival curation tools.
Additional Open Lab Session: An optional free exploration day will be available for participants to receive one-on-one guidance on their projects and tools.
No prior knowledge is required for this workshop. Participants are encouraged to bring their own datasets but will be provided with starter collections if needed.
Week 1 – Reading and Working with Data / Collections in OpenRefine
Digital data in various formats is at the heart of humanities research. Often, datasets are large, messy, or structured in unfamiliar ways. This week, students will learn to inspect, clean, and enrich digital catalogues using OpenRefine, as well as how to enhance datasets with Linked Open Data (LOD) from sources such as the Library of Congress, VIAF, and Wikidata.
By the end of this week, students will be proficient in:
- Understanding different file formats (CSV, TSV, spreadsheets, JSON, TEI XML)
- Using regular expressions for data manipulation (with some practice and help from ChatGPT)
- Writing expressions with GREL (OpenRefine’s scripting language)
- Fetching and reconciling data via REST APIs (e.g., GeoNames, Wikidata)
- Scraping and structuring data from the web
- Mapping textual data to geographic locations
Schedule:
- Class 1: Introduction, loading a file, faceting, and exploring data
- Class 2: Regular expressions and working with dates
- Class 3: Clustering techniques for data cleaning
- Class 4: Fetching external data using REST APIs (GeoNames example)
- Hands-On Session: Practicing administrative tasks (changing the working directory, adjusting memory allocation)
- Class 5: Reconciliation and enriching data with Wikidata
- Class 6: Handling JSON and XML file formats
- Class 7: Web scraping techniques and automation
- Class 8: From text to map – geospatial representations in OpenRefine
- Class 9: Summary and discussion
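As an illustration of the clustering idea from Class 3, the following is a minimal Python sketch of key-collision clustering with a "fingerprint" key. It only approximates the kind of keying OpenRefine offers and is not OpenRefine's own code; the function names `fingerprint` and `cluster` and the sample names are hypothetical, chosen for this example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalised key: fold accents, lowercase, strip punctuation,
    then sort and de-duplicate the whitespace-separated tokens."""
    # Decompose accented characters and drop the non-ASCII combining marks.
    folded = unicodedata.normalize("NFKD", value)
    folded = folded.encode("ascii", "ignore").decode("ascii")
    # Lowercase and remove ASCII punctuation.
    cleaned = folded.lower().translate(str.maketrans("", "", string.punctuation))
    # Token order and repeated tokens should not affect the key.
    tokens = sorted(set(cleaned.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values sharing a fingerprint; keep groups with more than one distinct spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


# Hypothetical sample values, for illustration only.
names = [
    "Agnon, Shmuel Yosef",
    "Shmuel Yosef Agnon",
    "shmuel yosef AGNON.",
    "Lea Goldberg",
    "Léa Goldberg",
    "Amos Oz",
]

for group in cluster(names):
    print(group)
```

Each printed group is a set of candidate variants of one name. Whether they really refer to the same entity remains a judgment for the curator, which is why a tool of this kind proposes merges rather than applying them automatically.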
Week 2 – Building a Digital Archive: Archives of the Present
This week focuses on the creation and structuring of small-scale digital archives, but also introduces the concept of archives of the present, a critical reflection on how contemporary events, data, and digital traces shape our archival practices. Participants will work with their own or provided collections, conceptualizing metadata structures and curatorial strategies. The workshop covers best practices in digital archive development, including metadata schema selection, linked data integration, and user-friendly design.
The discussion of archives of the present will explore:
- How digital documentation of real-time events (social media, news articles, live-streamed content) can be archived
- The ethical challenges of archiving contemporary materials
- Methods for ensuring accessibility and preservation of ephemeral data
- The evolving nature of authority files and metadata in fast-changing digital environments
By the end of this week, students will be proficient in:
- Theoretical foundations of archival studies
- Metadata structuring and best practices
- Using Omeka-S for archive implementation
- Using Tropy for organizing and annotating images
- Linking archives to external sources and ontologies
- Designing and publishing an accessible, structured digital archive
- Engaging with contemporary data collection and preservation strategies
Schedule:
- Class 1: Theory of archives – an introduction
- Class 2: Digital archives – examples and reviewing participant collections
- Class 3: Modeling the domain
- Class 4: Metadata – methods of description, challenges, and dilemmas
- Class 5: Introduction to Omeka-S – setting up and structuring an archive
- Class 6: Using Tropy – basic features and integration with Omeka
- Hands-On Session: Working on participant collections
- Class 7: Archives of the present – capturing and preserving digital traces
- Class 8: Linking and integrating with external resources and authority files
- Class 9: Publishing – designing Omeka pages for public access
- Class 10: Summary and reflections
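To make the metadata modelling classes more concrete, here is a small Python sketch that describes items with a handful of Dublin Core Terms properties, checks for a required field, and serialises the records to CSV. The choice of Dublin Core, the column set, the "title is required" rule, and the sample records are all assumptions made for illustration; the workshop's actual schema may differ.

```python
import csv
import io

# Assumed column set: a few Dublin Core Terms properties.
FIELDS = ["dcterms:title", "dcterms:creator", "dcterms:date", "dcterms:description"]
# Assumed rule for this sketch: every item needs at least a title.
REQUIRED = ["dcterms:title"]


def missing_required(record):
    """Return the required fields that are absent or blank in a record."""
    return [f for f in REQUIRED if not str(record.get(f, "")).strip()]


def to_csv(records):
    """Serialise records to CSV text with one column per metadata field."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Fill absent fields with empty strings so every row has all columns.
        writer.writerow({f: record.get(f, "") for f in FIELDS})
    return buffer.getvalue()


# Hypothetical sample records, for illustration only.
records = [
    {"dcterms:title": "Street poster, 2023", "dcterms:date": "2023-11-02"},
    {"dcterms:creator": "Unknown photographer", "dcterms:date": "2023"},
]

for record in records:
    problems = missing_required(record)
    if problems:
        print("Missing:", problems, "in", record)

print(to_csv(records))
```

Deciding which fields are required, and how to record uncertainty such as an unknown creator or an approximate date, is exactly the kind of modelling dilemma raised in Classes 3 and 4.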
To enrich the learning experience, the workshop also aims to incorporate:
- Case studies of successful digital archive projects
- Collaborative group work, where teams handle different types of archival materials
- An expanded toolset beyond OpenRefine and Omeka, including basic Python for data manipulation and SPARQL for querying LOD sources
- Introduction to IIIF (International Image Interoperability Framework) for handling digital images in archives
- Machine learning-assisted metadata extraction, including OCR (Transkribus), Google Vision API, and Named Entity Recognition (NER)
- Sustainability and long-term digital archive maintenance strategies
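As a small taste of the IIIF topic listed above, the sketch below builds an image request URL following the IIIF Image API pattern `{base}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The helper name `iiif_image_url`, the server address `https://example.org/iiif`, and the identifier are hypothetical placeholders; a real archive would use the base URL of its own image server.

```python
from urllib.parse import quote


def iiif_image_url(base, identifier, region="full", size="max",
                   rotation="0", quality="default", fmt="jpg"):
    """Assemble a IIIF Image API request URL from its path segments."""
    # The identifier must be percent-encoded, including any slashes it contains.
    encoded_id = quote(identifier, safe="")
    return f"{base.rstrip('/')}/{encoded_id}/{region}/{size}/{rotation}/{quality}.{fmt}"


# Hypothetical server and identifier, for illustration only.
base = "https://example.org/iiif"

# Whole image at the largest size the server will deliver.
print(iiif_image_url(base, "item 1/page2"))

# A 300-pixel-wide thumbnail of the same image.
print(iiif_image_url(base, "item 1/page2", size="300,"))
```

Because region, size, and rotation are expressed in the URL itself, the same stored image can serve as a thumbnail, a detail view, or a full-resolution download without keeping separate derivative files.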