Data Availability for Research

Host: Christophe Diot (Google Research)
Scribe: Rulan Yang (Xiamen University)

Introduction

The discussion focuses on the challenges academics face in accessing realistic data, especially in AI research. Most critical datasets are owned by private companies, making them difficult to obtain and limiting the ability to verify research results. The session explores ways to provide academics with “realistic” datasets, including identifying required data types, collaborating with industry, inferring data from commercial networks, running academic workloads on production networks, calibrating synthetic data to match real-world characteristics, and leveraging large-scale platforms to approximate industry datasets. Researchers from both academia and industry share projects and ideas, aiming to develop actionable strategies for enabling academic access to privately owned datasets and workloads.

Academic perspective

The academic community believes that access to large-scale, real-world data is increasingly difficult for researchers due to cost, privacy, and competitive constraints. Unlike industry, which can operate at massive scale, researchers often rely on artificial or synthetic workloads; as one participant put it, “we could validate on an artificial workload or data set.” There is also a recognized lack of benchmarks and reproducibility in networking research. In one participant's words, “we don’t have a culture of benchmarks,” and results frequently cannot be independently verified when data and code are not shared.

Researchers emphasize the importance of building trust with data owners and gaining practical experience with data before requesting access, noting that “getting your hands dirty, really playing with data, is the key element.” They also explore alternative methods, such as sharing scripts for verification or generating synthetic workloads, while acknowledging limitations in security, scalability, and historical data needs. Overall, the community advocates for improving reproducibility, transparency, and shared infrastructure, aiming to create research environments analogous to scientific instruments in other domains.

Industry perspective

Industry believes that sharing large-scale network data is inherently costly and complex. Companies face constraints related to liability, security, and operational costs, which limit their ability to provide widespread access to raw data. As one speaker noted, “Most of the companies that run the internet today have real constraints. Part of it related to liability that makes them worry… It is not cheap to run an endeavor like this to the logging and the recording.” Consequently, industrial organizations are cautious about releasing data, even when it could be valuable for research purposes.

At the same time, industry recognizes the importance of transparency and trust-building. Initiatives such as Cloudflare Radar illustrate how companies can share aggregated, safe observability data to foster trust and improve understanding of the internet. As one speaker put it, “Cloudflare benefits with Radar, but at the same time, the internet benefits… one of the reasons to do this is so that people can better understand the internet, start to trust in it, have answers to questions.” Nonetheless, data sharing in industry remains selective and controlled, often requiring trusted relationships with researchers and careful measures to prevent data leakage.

Discussion

Q1: How can academia get access to realistic data?
A1: By using synthetic workloads that reflect production feature distributions and aligning with industry metrics.

Q2: Is industry willing to collaborate on experiments?
A2: Sometimes, but usually in controlled environments or by running academic code on company datasets, not by sharing raw data.

Q3: Can platforms like Slices help?
A3: Yes, they provide safe experimental conditions, though scaling requires funding and cross-organization effort.

Q4: Other ways to collaborate besides sharing data?
A4: Sharing aggregated metrics, performance reports, tools, and participating in benchmark design.

Q5: Future directions for collaboration?
A5: Focus on building trust and standardization. Industry needs to protect its commercial and legal interests while academia promotes reproducibility; the goal is long-term, balanced cooperation.
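
Illustrative sketch (not presented in the session)

The idea raised in A1, synthetic workloads that reflect production feature distributions, can be sketched roughly as follows. This is only a minimal illustration, assuming a data owner is willing to publish a handful of aggregate quantiles for one feature (here, flow size) rather than raw records. The quantile values, the names make_sampler and empirical_quantile, and the choice of linear interpolation between quantiles are all assumptions made for the example; none of them come from the session.

```python
import bisect
import random


def make_sampler(quantiles, values, seed=0):
    """Return a function that draws synthetic samples from the
    piecewise-linear CDF defined by matching (quantile, value) pairs."""
    if len(quantiles) != len(values) or len(quantiles) < 2:
        raise ValueError("need at least two matching (quantile, value) pairs")
    if quantiles[0] != 0.0 or quantiles[-1] != 1.0:
        raise ValueError("quantiles must start at 0.0 and end at 1.0")
    if any(b <= a for a, b in zip(quantiles, quantiles[1:])):
        raise ValueError("quantiles must be strictly increasing")
    rng = random.Random(seed)  # seeded so that runs are reproducible

    def sample():
        u = rng.random()  # uniform in [0, 1)
        # Locate the segment [quantiles[i-1], quantiles[i]] that contains u.
        i = bisect.bisect_right(quantiles, u)
        i = min(max(i, 1), len(quantiles) - 1)
        q0, q1 = quantiles[i - 1], quantiles[i]
        v0, v1 = values[i - 1], values[i]
        # Inverse-CDF sampling with linear interpolation inside the segment.
        return v0 + (v1 - v0) * (u - q0) / (q1 - q0)

    return sample


def empirical_quantile(samples, q):
    """Simple order-statistic estimate of the q-th quantile."""
    ordered = sorted(samples)
    return ordered[min(int(q * len(ordered)), len(ordered) - 1)]


# Hypothetical aggregate summary (illustrative numbers only): flow size in KB.
QUANTILES = [0.0, 0.5, 0.9, 0.99, 1.0]
FLOW_KB = [1.0, 10.0, 100.0, 1000.0, 10000.0]

draw = make_sampler(QUANTILES, FLOW_KB, seed=42)
synthetic = [draw() for _ in range(20000)]

# Compare the synthetic data against the published targets.
for q, target in zip(QUANTILES[1:-1], FLOW_KB[1:-1]):
    print(f"q={q}: target={target} KB, synthetic={empirical_quantile(synthetic, q):.1f} KB")
```

Matching a few marginal quantiles is a much weaker property than being realistic: correlations between features, temporal structure, and rare events are not captured, and for heavy-tailed features interpolating in log space would probably be a better choice than the linear interpolation used here.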