Conference Agenda
Session
POSITIVE + NEGATIVE IMPACT OF AI
Presentations
11:30am - 11:52am
Why ask WAAI: A sustainable approach to exploring web archiving artificial intelligence (WAAI)
Internet Archive, United States of America
Beyond the media hype, financial bubble, and general social freakout over Artificial Intelligence (AI), the emergence of machine learning (ML) and AI technologies merits impartial consideration of the potential for these innovations to benefit many aspects of the overall web archiving endeavour. Much as digitization and the internet itself radically changed how libraries and heritage institutions approached professional practices like acquisition and access, ML/AI may help address longstanding challenges in web archiving related to harvesting, collection management, and search and discovery. Of course, ML/AI tools could also prove too immature, too unreliable, too expensive, or too unwieldy to provide a suitable return on investment for web archive collections that can measure in the hundreds of terabytes, if not petabytes. Thus, ML/AI explorations in web archives need a different methodology of research, testing, and assessment than more traditional, more narrowly focused technologies specific only to certain areas of web archiving practice or infrastructure. This talk will approach the challenge of incorporating ML/AI tools in web archives from a “why ask why” perspective, emphasizing small, low-stakes, and well-scoped experimentation across all aspects of the web archiving lifecycle instead of rigorously planned, ambitiously conceived, large-scale projects or more formal and ornate methods of research and development. The presentation will thus lay out a general framework for advancing AI-based work in web archiving, based on practical examples, use cases, and findings from pursuing such an approach within a large web archiving institution that has been conducting internal AI projects on multiple parts of its web archiving processes.
The talk will cover both managerial and practical aspects of exploring ML/AI for web archiving, such as staffing, infrastructure, tools, costs, program/product development, and engineering practices, and will link these with specific completed or in-progress work on leveraging ML/AI tools for various areas of web archiving, such as appraisal, collection, description, quality assurance, and search. By bridging practical details and results with specific areas of professional practice, and wrapping both in a framework that emphasizes experimentation and action over procedural, policy, or administrative plodding, the talk advocates for a “sustainable” approach to exploring ML/AI in web archiving that proves doable, cost-effective, and user-driven. This presentation will propose a method, detail results from implementing that method in a large web archiving organization, and share findings intended to help other web archiving institutions pursue ML/AI work that will be sustainable, productive, and successful.
11:52am - 12:14pm
Understanding and mitigating anti-bot technologies' impact on archival web crawling
1: MirrorWeb Limited, United Kingdom; 2: Library of Congress, United States of America
The proliferation of AI bot prevention technologies has created an unprecedented challenge for institutional web archiving programs. Website owners, administrators, and hosting providers, particularly those serving large organisations and government entities, have implemented increasingly aggressive safeguards to protect against AI agents harvesting training data. While well-intentioned, these measures inadvertently block legitimate preservation crawlers, threatening the completeness and quality of web archive collections. This research addresses a critical gap in understanding how anti-bot technologies affect large-scale web archiving operations. Even when appropriate crawling permissions have been secured per institutional policies, standard preservation tools like Heritrix are increasingly mistaken for malicious bots or AI scrapers, resulting in blocked access to nominated content. While quality assurance teams have documented this issue on individual seeds and domains, no comprehensive analysis of its scale and impact has been conducted. Our investigation analyses data from institutional crawling operations and aims to enable systematic identification of blocking patterns, affected content types, and the scope of collection gaps caused by anti-bot technologies. This work extends existing guidance (such as robots.txt configuration advice) to address the complex landscape of modern bot prevention technologies. By documenting the real-world impact of these systems on institutional collecting and developing evidence-based mitigation strategies, this presentation is intended to help web archiving programs maintain collection quality while minimising resource-intensive manual interventions with individual website owners.
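As a rough illustration of what systematic identification of blocking patterns could look like, the sketch below tallies suspicious responses per host from crawl records. It is not the presenters' method: the `(url, status)` record shape, the function name `summarize_blocking`, and the choice of HTTP 403, 429, and 503 as possible blocking signals are all assumptions made for this example, and real anti-bot systems may also answer with a 200 status and a challenge page.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumption: these status codes are treated as *possible* anti-bot responses.
# Real blocking can also hide behind a 200 response carrying a challenge page.
SUSPECT_STATUSES = {403, 429, 503}


def summarize_blocking(records):
    """Count suspect responses per host.

    `records` is an iterable of (url, http_status) pairs, a hypothetical
    simplification of what a crawl log would provide.
    """
    total = Counter()
    suspect = Counter()
    for url, status in records:
        host = urlsplit(url).hostname or ""
        total[host] += 1
        if status in SUSPECT_STATUSES:
            suspect[host] += 1
    return {
        host: {
            "suspect": suspect[host],
            "total": total[host],
            "ratio": suspect[host] / total[host],
        }
        for host in total
    }


# Illustrative records only; these are not results from any real crawl.
sample_records = [
    ("https://example.gov/a", 200),
    ("https://example.gov/b", 403),
    ("https://example.gov/c", 429),
    ("https://example.org/", 200),
]
summary = summarize_blocking(sample_records)
```

Hosts with a high `ratio` would then be candidates for manual review rather than confirmed cases of blocking.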
The findings aim to inform both technical approaches to crawling at scale and strategic communication with the broader web archiving community, website creators, and technology providers. Ultimately, this research aims to bridge the gap between legitimate preservation activities and necessary web security measures, ensuring cultural heritage institutions can fulfil their missions in an increasingly bot-hostile web environment.
12:14pm - 12:35pm
AI-powered search to sustain IIPC conference knowledge
1: Bibliotheca Alexandrina, Egypt; 2: Alamein International University; 3: Egypt-Japan University of Science and Technology
The IIPC Web Archiving Conference is often rated highly in community surveys as a platform for sharing knowledge and experience among web archiving practitioners and researchers. The output from this annual event is kept and made accessible via an online repository, courtesy of the University of North Texas. With today's advances in Artificial Intelligence (AI) technology, an opportunity presents itself to render the wealth of information stored within the IIPC's repository of conference materials into more accessible knowledge. The IIPC Assistant supports the sustainable preservation and accessibility of the International Internet Preservation Consortium (IIPC) conference materials through an AI-powered search frontend that enables natural-language exploration of conference contributions archived in the online repository. By integrating vector embeddings with generative AI, the system delivers contextually accurate answers grounded in content that has been through a review process and was presented at the conference, contributing to the long-term usability and enhanced accessibility of the material that periodically documents the work done in the area of web archiving. The project began with metadata harvesting via the OAI-PMH API to consolidate creators, titles, subjects, and textual content from IIPC presentations and transcripts into a unified dataset. Because the materials were not designed for interactive querying, a Retrieval-Augmented Generation (RAG) approach was adopted to enable dynamic, source-grounded responses without retraining large models, an approach that promotes computational efficiency and sustainable reuse of existing data.
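To make the harvesting step more concrete, here is a minimal sketch of building an OAI-PMH `ListRecords` request and pulling creators, titles, and subjects out of a Dublin Core response. It is an illustration under assumptions, not the project's code: the base URL `https://repository.example.org/oai` is a placeholder, the sample XML is invented, and the helper names are hypothetical. The `verb` and `metadataPrefix` parameters and the XML namespaces follow the OAI-PMH 2.0 protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL for a (hypothetical) OAI-PMH endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Extract titles, creators, and subjects from each record in a response."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iterfind(".//oai:record", NS):
        def values(tag):
            return [el.text.strip()
                    for el in record.iterfind(f".//dc:{tag}", NS) if el.text]
        rows.append({
            "titles": values("title"),
            "creators": values("creator"),
            "subjects": values("subject"),
        })
    return rows


# Invented sample response, trimmed to the elements this sketch reads.
SAMPLE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example talk title</dc:title>
          <dc:creator>Example Author</dc:creator>
          <dc:subject>web archiving</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
""".strip()

url = list_records_url("https://repository.example.org/oai")
rows = parse_records(SAMPLE)
```

A full harvester would also follow `resumptionToken` values to page through large result sets, which this sketch omits.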
Challenges in data consistency and semantic coherence were addressed by employing generative AI through the Gemini API to restructure fragmented text and enhance contextual quality. The retrieval pipeline was further refined to group and rank documents based on relevance, ensuring balanced coverage and interpretability. Built with a React + TypeScript frontend, Flask backend, and FAISS vector database, the implementation emphasizes scalability and efficiency. By advancing sustainable methods for information retrieval, the IIPC Assistant demonstrates how an AI-powered access interface can broaden the potential of a repository of valuable content accumulated over the history of the organization, thus transforming static collections into an interactive, reusable knowledge resource that supports ongoing research and global collaboration in the domain of web archiving.
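The idea of grouping and ranking retrieved documents can be sketched as follows. This is only an approximation of the concept: it swaps in a brute-force cosine similarity search over toy vectors in place of FAISS and real embeddings, and the chunk layout (`doc_id`, `text`, `vector`), the function names, and the `per_doc` cap used to keep one document from dominating the results are all assumptions for illustration.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, chunks, top_k=3, per_doc=1):
    """Score chunks, keep at most `per_doc` chunks per document,
    then rank documents by their best-scoring chunk."""
    scored = sorted(
        ((cosine(query_vec, c["vector"]), c) for c in chunks),
        key=lambda pair: pair[0],
        reverse=True,
    )
    grouped = defaultdict(list)
    for score, chunk in scored:
        if len(grouped[chunk["doc_id"]]) < per_doc:
            grouped[chunk["doc_id"]].append((score, chunk["text"]))
    # Each group's first entry is its best chunk because `scored` is sorted.
    ranked = sorted(grouped.items(), key=lambda kv: kv[1][0][0], reverse=True)
    return ranked[:top_k]


# Toy two-dimensional vectors standing in for real embeddings.
chunks = [
    {"doc_id": "A", "text": "crawl scheduling", "vector": [1.0, 0.0]},
    {"doc_id": "A", "text": "crawl politeness", "vector": [0.9, 0.1]},
    {"doc_id": "B", "text": "access interfaces", "vector": [0.0, 1.0]},
    {"doc_id": "C", "text": "crawling and access", "vector": [0.7, 0.7]},
]
results = retrieve([1.0, 0.0], chunks, top_k=2, per_doc=1)
```

With `per_doc=1`, the two closely related chunks from document A count once, so a second document can appear in the top results, which is one simple way to favour coverage across sources.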