IIPC WAC 2025 - ConfTool Pro

Session Abstract

The workshop will begin with a brief introduction to the concept of the webgraph, or hyperlink graph: a directed graph whose nodes correspond to web pages and whose edges correspond to hyperlinks from one web page to another. We will also look at aggregations of the page-level webgraph at the level of Internet hosts or pay-level domains. The host-level and domain-level graphs are at least an order of magnitude smaller than the original page-level graph, which makes them easier to study.
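The aggregation from page level to host level can be illustrated with a minimal Java sketch. It is an assumption-laden illustration, not the workshop's tooling: the class name `HostGraphSketch`, the method `toHostLevel`, and the example URLs are hypothetical, and edges are represented as plain `Map.Entry` pairs. Every page-level edge is mapped to a pair of host names, duplicate edges collapse into one, and links within the same host are dropped. Aggregation to pay-level domains would additionally need a public suffix list and is not shown.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class HostGraphSketch {

    /**
     * Collapses page-level edges (source URL, target URL) into host-level
     * edges (source host, target host). Duplicates and self-loops are dropped.
     * Assumes well-formed absolute URLs; URI.create throws on malformed input.
     */
    static Set<Map.Entry<String, String>> toHostLevel(List<Map.Entry<String, String>> pageEdges) {
        Set<Map.Entry<String, String>> hostEdges = new LinkedHashSet<>();
        for (Map.Entry<String, String> edge : pageEdges) {
            String fromHost = URI.create(edge.getKey()).getHost();
            String toHost = URI.create(edge.getValue()).getHost();
            if (fromHost == null || toHost == null) {
                continue; // skip URLs without a host component
            }
            fromHost = fromHost.toLowerCase(Locale.ROOT);
            toHost = toHost.toLowerCase(Locale.ROOT);
            if (fromHost.equals(toHost)) {
                continue; // links within one host vanish at host level
            }
            hostEdges.add(Map.entry(fromHost, toHost));
        }
        return hostEdges;
    }

    public static void main(String[] args) {
        // Hypothetical page-level edges, for illustration only.
        List<Map.Entry<String, String>> pageEdges = List.of(
            Map.entry("https://a.example.org/index.html", "https://a.example.org/about.html"),
            Map.entry("https://a.example.org/index.html", "https://b.example.com/"),
            Map.entry("https://a.example.org/about.html", "https://b.example.com/news"),
            Map.entry("https://b.example.com/", "https://a.example.org/index.html"));
        for (Map.Entry<String, String> e : toHostLevel(pageEdges)) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Because many page-level links map to the same host pair or stay within one host, the host-level edge set is typically much smaller than the page-level one.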

To represent and process webgraphs, we use the WebGraph framework, which was developed at the Laboratory for Web Algorithmics (LAW) of the University of Milan. As a "framework for graph compression aimed at studying web graphs," it allows very large webgraphs to be stored and accessed efficiently. Even on a laptop computer, it is possible to store and explore a graph with 100 million nodes and more than 1 billion edges. The WebGraph framework is also used to compress other types of graphs, such as social network graphs or software dependency graphs. In addition, the framework and related software projects include tools for the analysis of web graphs and the computation of their statistical and topological properties. The WebGraph framework implements a number of graph algorithms, including PageRank and other centrality measures. It is an open-source Java project, and a re-implementation in the Rust language has recently been released. Over the past two decades, the WebGraph format has been widely used by researchers, for example those at LAW or Web Data Commons, to distribute graph dumps. It has also been used by open data initiatives, including the Common Crawl Foundation and the Software Heritage project.
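To give a feel for what a centrality algorithm such as PageRank computes, here is a minimal power-iteration sketch in plain Java. It deliberately does not use the WebGraph framework's API and is not its implementation: the class `PageRankSketch`, the adjacency-array representation, the damping factor of 0.85, and the fixed iteration count are all assumptions made for illustration. Nodes without outgoing links spread their rank uniformly over all nodes, which is one common convention among several.

```java
import java.util.Arrays;

class PageRankSketch {

    /**
     * Power iteration for PageRank on a graph given as adjacency arrays:
     * successors[u] lists the nodes that u links to.
     */
    static double[] pageRank(int[][] successors, double damping, int iterations) {
        int n = successors.length;
        double[] rank = new double[n];
        Arrays.fill(rank, 1.0 / n); // start from the uniform distribution

        for (int it = 0; it < iterations; it++) {
            double[] next = new double[n];
            double danglingMass = 0.0;
            for (int u = 0; u < n; u++) {
                if (successors[u].length == 0) {
                    danglingMass += rank[u]; // no out-links: redistribute below
                } else {
                    double share = rank[u] / successors[u].length;
                    for (int v : successors[u]) {
                        next[v] += share;
                    }
                }
            }
            // Teleportation plus the uniformly redistributed dangling mass.
            double base = (1.0 - damping) / n + damping * danglingMass / n;
            for (int v = 0; v < n; v++) {
                next[v] = base + damping * next[v];
            }
            rank = next;
        }
        return rank;
    }

    public static void main(String[] args) {
        // Hypothetical toy graph: 0->1, 0->2, 1->2, 2->0, 3->2.
        int[][] successors = {{1, 2}, {2}, {0}, {2}};
        double[] rank = pageRank(successors, 0.85, 100);
        for (int v = 0; v < rank.length; v++) {
            System.out.printf("node %d: %.4f%n", v, rank[v]);
        }
    }
}
```

On a graph with hundreds of millions of nodes, the same iteration would run over a compressed graph representation rather than over plain Java arrays.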

The workshop focuses on interactive exploration of one of the precompiled and publicly available webgraphs. We look at graph properties and metrics, learn how to map between node identifiers (just numbers) and node labels (URLs), and compute the shortest path between two nodes. We also show how to detect "cliques", i.e. densely connected subgraphs, and how to run PageRank and related centrality algorithms to rank the nodes of our graph. We share our experiments on how these applications are used for collection curation: how cliques can be used to discover sites with content in a regional language, how link spam is detected, and how global domain ranks are used to select a representative sample of websites. Finally, we will build a small webgraph from scratch using crawl data.
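Two of these steps, mapping between identifiers and labels and finding a shortest path, can be sketched in a few lines of plain Java. This is an illustration under stated assumptions rather than the workshop's code: the class `ShortestPathSketch`, its methods, and the example labels are hypothetical, the labels are assumed to be kept in a sorted array so that a node's identifier is its position in that array, and the graph is held as simple adjacency arrays instead of a compressed WebGraph file. The shortest path is found by breadth-first search, which is sufficient because the edges are unweighted.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.LinkedList;
import java.util.List;

class ShortestPathSketch {

    /** Label to id: position in the sorted label array, or -1 if unknown. */
    static int idOf(String[] sortedLabels, String label) {
        int pos = Arrays.binarySearch(sortedLabels, label);
        return pos >= 0 ? pos : -1;
    }

    /**
     * Breadth-first search over adjacency arrays. Returns the node ids on one
     * shortest path from source to target, or an empty list if unreachable.
     */
    static List<Integer> shortestPath(int[][] successors, int source, int target) {
        int n = successors.length;
        int[] parent = new int[n];
        Arrays.fill(parent, -1);
        boolean[] seen = new boolean[n];
        ArrayDeque<Integer> queue = new ArrayDeque<>();
        seen[source] = true;
        queue.add(source);
        while (!queue.isEmpty()) {
            int u = queue.poll();
            if (u == target) {
                break;
            }
            for (int v : successors[u]) {
                if (!seen[v]) {
                    seen[v] = true;
                    parent[v] = u; // remember how v was reached
                    queue.add(v);
                }
            }
        }
        if (!seen[target]) {
            return List.of();
        }
        LinkedList<Integer> path = new LinkedList<>();
        for (int v = target; v != -1; v = parent[v]) {
            path.addFirst(v); // walk back from target to source
        }
        return path;
    }

    public static void main(String[] args) {
        // Hypothetical labels, sorted so that the array index is the node id.
        String[] labels = {"a.example.org", "b.example.com", "c.example.net", "d.example.org"};
        // Edges: 0->1, 0->2, 1->2, 2->3.
        int[][] successors = {{1, 2}, {2}, {3}, {}};

        int source = idOf(labels, "a.example.org");
        int target = idOf(labels, "d.example.org");
        for (int id : shortestPath(successors, source, target)) {
            System.out.println(id + "\t" + labels[id]); // id to label is an array lookup
        }
    }
}
```

Keeping the labels sorted makes the label-to-id lookup a binary search and the id-to-label lookup a direct array access, at the cost of holding all labels in memory.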

Participants will learn how to explore webgraphs (even large ones) interactively and how graphs can be used to curate collections. Basic programming skills and basic knowledge of the Java programming language are a plus but not required. Since this is an interactive workshop, attendees should bring their own laptops, preferably with the Java 11 (or higher) JDK and Maven installed. Nevertheless, it will be possible to follow the steps and explanations without typing along on a laptop. We will provide download and installation instructions, as well as all teaching materials, prior to the workshop.

Introduction to Web Graphs

Sebastian Nagel, Pedro Ortiz Suarez, Thom Vaughan, Greg Lindahl

Common Crawl Foundation, United States of America
