Title: Can ML models replace network algorithms? Feasibility, challenges, and how to get there
Host: Sanjay Rao (Purdue)
Panelists: Mohammad Alizadeh (MIT), Behnaz Arzani (Microsoft), Bruno Ribeiro (Purdue), Keith Winstein (Stanford)
Scribe: Wei Li (HKUST), Chengjin Zhou (Nankai University)
Introduction:
In recent years, we have seen rapid growth and impressive initial results in the use of ML models in lieu of classical network algorithms in domains such as video streaming, Internet routing, traffic engineering, video coding and data-center networking. While this offers exciting opportunities, networking environments pose many unique challenges for ML approaches. In networking, we are rarely content with just observing a system evolve: it is common to intervene (e.g., adapt bit rates, reroute traffic, switch CDNs) in order to improve it. Unfortunately, online experimentation in live systems is not always viable, and can be expensive. Unobserved confounders (e.g., intrinsic network conditions, queuing policy) may significantly influence the effect of an intervention. Moreover, the predicted output from an ML model must extrapolate to settings and actions not present in the training data (e.g., unseen failures, traffic surges). Many directions at the cutting-edge of ML research can help address these questions, yet domain-inspired insights are crucial, and adapting them to the networking context poses unique challenges.
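To make the confounding problem concrete, here is a minimal sketch (with made-up numbers, not data from any real system) of how an unobserved confounder such as congestion can mislead a purely correlational model: in the logged data, a heuristic lowers the bit rate when the network is congested, so a naive fit concludes that higher bit rates reduce rebuffering, even though intervening to raise the bit rate under fixed congestion would increase it.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    congestion = rng.uniform(0, 1, n)                            # unobserved confounder
    bitrate = 5.0 - 4.0 * congestion + rng.normal(0, 0.2, n)     # heuristic lowers rate when congested
    rebuffer = 2.0 * congestion + 0.3 * bitrate + rng.normal(0, 0.2, n)  # true causal effect is +0.3

    # Naive correlational fit: regress rebuffering on bit rate alone.
    slope = np.polyfit(bitrate, rebuffer, 1)[0]
    print(f"naive slope: {slope:+.2f}  (looks like higher bit rate reduces rebuffering)")

    # Adjusting for the confounder recovers the true (positive) interventional effect.
    X = np.column_stack([bitrate, congestion, np.ones(n)])
    coef, *_ = np.linalg.lstsq(X, rebuffer, rcond=None)
    print(f"adjusted effect of bit rate: {coef[0]:+.2f}  (close to the true +0.3)")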
Questions and opinions:
Question 1
Sanjay: Tell us about your experience with ML and networking: what excites you the most, what you see as some of the major challenges ahead, and perhaps the moment that got you started.
Opinion
Mohammad: My interest in ML for networking started around 2015. I work a lot on resource management problems, with congestion control being a classic example. These are all decision-making and control problems, and many of them are essentially combinatorial optimization problems; there doesn’t seem to be any elegant way of approaching them. Around the same time, AlphaGo was happening, and people were getting these super impressive, amazing results from an AI agent that self-learns and solves a hard task, so part of it was intellectual curiosity.
Behnaz: I first started working on ML in 2016, I think, when I was interning at Microsoft. Back then, there was a type of incident that was common, called Event 17: when a VM couldn’t connect to storage, it would panic and reboot. What we did have was a lot of TCP traces, and the hypothesis was that since TCP traffic goes through the server, through the network, and through the remote service, there might be signals in what TCP sees that let us differentiate where the incident is coming from, and route it to the correct team so they can investigate and find the root cause. As we thought about it, the signals were nuanced, so we thought, well, let’s just throw ML at it and see if it works. It turned out it did, and then we went to deploy it.
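A minimal, hypothetical sketch of the kind of classifier Behnaz describes: per-connection TCP features mapped to the team that should investigate. The feature names, labels, and numbers below are illustrative assumptions, not the actual Microsoft pipeline.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical features: [retransmit_rate, avg_rtt_ms, zero_window_events, syn_retries]
    X_train = np.array([
        [0.20, 150.0,  0, 5],   # many retransmits / SYN retries      -> "network"
        [0.01,   2.0, 12, 0],   # receiver advertising zero window    -> "host"
        [0.02,   3.0,  0, 0],   # connection fine, requests time out  -> "remote_service"
        # ... many more labeled incidents in practice
    ])
    y_train = ["network", "host", "remote_service"]

    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)

    incident = np.array([[0.15, 120.0, 0, 3]])
    print(clf.predict(incident)[0], clf.predict_proba(incident)[0])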
Bruno: From an ML perspective, what I think is fascinating about networking problems is that if you’re just learning correlations, just learning how inputs and outputs relate, you’re probably not going to do well in networking, because for many of the tasks, what you want to predict is the effect of some intervention. Once you intervene, the machine learning method needs to understand that the intervention will have consequences, and what those consequences are. The second part is that I’m very interested in the notion of neuro-algorithmic reasoning, which I believe is generally how algorithms are created in the real world: you see a problem and you make an approximation of the problem.
Keith: The fact that this community has published so many papers that turned out to be wrong is very exciting, because it suggests that ML for networking is right in the sweet spot: it’s not too easy, but it’s not network information theory either. It’s right in the middle. So I’m very optimistic that over the next 10, 20, or 30 years we’re going to do some exciting stuff. That’s my skeptical optimism.
Question 2
Sanjay: So what makes modeling difficult, or what sorts of challenges are you tackling in that space?
Opinion
Mohammad: This is not something we do much in engineering; it’s actually what social scientists have been doing for a long time. If you’re a social scientist, you can’t run experiments like that: you can’t take every hypothesis and test it with an arbitrary number of randomized controlled trials, and you can’t simulate the system. So social science is actually super advanced in thinking about how you get insights about dynamics from data, and I hope that is now trickling into more engineered systems.
Behnaz: One of the challenges in causal modeling is that the models are all different; you can’t have a universal causal model that works for any sort of problem. You need to understand how the networking protocol works and how the system behaves.
Bruno: I think the biggest hurdle to overcome is building trust. We were going to give operators access to the confidence level of the model, so they know how much confidence the model has in the prediction it is making.
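A minimal sketch of the "expose the model's confidence" idea Bruno mentions: act on a prediction only when the model is confident, and otherwise defer to a human operator. The function name, threshold, and use of a scikit-learn-style classifier are illustrative assumptions.

    import numpy as np

    def act_or_defer(model, features, threshold=0.8):
        """Return (prediction, confidence, deferred?) for one input."""
        proba = model.predict_proba(np.asarray(features).reshape(1, -1))[0]
        confidence = float(proba.max())
        prediction = model.classes_[int(proba.argmax())]
        deferred = confidence < threshold   # below threshold: hand off to an operator
        return prediction, confidence, deferred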
Keith: “Why we can’t simulate the Internet” would be a provocative paper even now, because so much of our work is based on being able to causally model what would happen if you deployed something in real life, without deploying it in real life. Unlike most settings where we do machine learning, the Internet is a multi-agent problem with partial observability.
Question 3
Sanjay: asks whether machine learning is truly a “black box” and how it can be made more reliable.
Opinion
Bruno: suggests that while machine learning can seem like a black box, there are principles to ensure robustness and transferability of models. Bruno proposes creating benchmarks to evaluate ML models in specific tasks, which can reveal the reality of what ML can achieve.
Question 4
Sanjay: asks what skills students should develop to get internships.
Opinion
Behnaz: emphasizes the importance of students being open-minded and willing to learn.
Keith: focuses on the importance of being “multilingual” in understanding and communicating across different domains, such as networking and machine learning.
Question 5
Sanjay: asks about the types of problems that can be solved by ML and what criteria would convince someone that a problem should be addressed with ML.
Opinion
Keith: thinks that if a problem can be solved by humans, it can likely be solved by a computer with the right approach.
Behnaz: prefers to use traditional methods like signal processing or information theory when they can effectively solve a problem. She turns to machine learning when there isn’t a clear theoretical understanding or solution available through traditional methods.
Mohammad: points out that problems with strict correctness requirements, like security protocols, may not be well-suited for ML due to the need for full verification. In contrast, problems in networking often lack hard correctness criteria, making ML suitable because it can handle stochastic and probabilistic elements. He suggests that engineers should adopt a probabilistic mindset, viewing ML as one of many stochastic elements in a system, particularly for performance-related problems where soft correctness criteria apply.
Question 6
Sanjay: inquires about principled approaches to integrate fundamental checks in machine learning systems.
Opinion
Keith acknowledges that establishing fundamental checks in networking problems, especially in multi-agent, partially observable settings like congestion control, is challenging. It is difficult to define acceptance criteria because of the complexity of observing all flows, even in simulation.
Behnaz approaches the problem from a performance analysis perspective. She suggests that machine learning algorithms, as mathematical models, can be reasoned about and analyzed in specific cases. By placing these models within an optimization framework, constraints can be modeled and analyzed, though this approach is not universally applicable. It depends on the specific model and system, indicating that while it is possible in certain cases, it is not a generalized solution.
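One way to read Behnaz's point is to wrap the learned policy in explicit constraints. The sketch below checks an ML-predicted traffic split against hard constraints (link capacities, conservation of demand) and falls back to a simple classical split if the check fails; the capacities, tolerance, and fallback rule are illustrative assumptions, not a specific system's design.

    import numpy as np

    def safe_split(predicted_split, demand, capacities, tol=1e-6):
        """predicted_split: fraction of `demand` the model places on each path."""
        flows = np.asarray(predicted_split, dtype=float) * demand
        caps = np.asarray(capacities, dtype=float)
        feasible = (
            np.all(flows >= -tol)                           # no negative traffic
            and abs(flows.sum() - demand) <= tol * demand   # all demand is placed
            and np.all(flows <= caps + tol)                 # no link overload
        )
        if feasible:
            return flows
        # Fallback: proportional-to-capacity split, feasible whenever sum(capacities) >= demand.
        return demand * caps / caps.sum()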
Question 7
Sanjay: asks whether machine learning models are inherently better at interpolation than extrapolation.
Opinion
Mohammad: suggests that machine learning models generally excel at interpolation because they learn associations within the training data. For a model to extrapolate successfully, it must learn fundamental concepts beyond the training data, which is a significant challenge. He considers good out-of-distribution generalization almost impossible to achieve in general, as models are not inherently designed for arbitrary extrapolation.
Bruno: adds that not all problems require a machine learning solution, especially when a model cannot provide reliable predictions outside its training scope. A valuable model is one that can acknowledge its limitations by indicating uncertainty when encountering unknown scenarios. Such a model offers useful insights by showing confidence only within its learned parameters. This approach emphasizes the need to define problems accurately and understand the strengths and limitations of machine learning models in addressing specific challenges.
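A minimal sketch of a model that "knows what it doesn't know", in the spirit of Bruno's point: flag inputs that fall outside the region covered by the training data and abstain rather than extrapolate. The range-based heuristic and class name are illustrative choices, not a recommendation from the panel.

    import numpy as np

    class RangeGuardedModel:
        def __init__(self, model, X_train):
            self.model = model
            self.lo = X_train.min(axis=0)   # per-feature training range
            self.hi = X_train.max(axis=0)

        def predict(self, x):
            x = np.asarray(x, dtype=float)
            if np.any(x < self.lo) or np.any(x > self.hi):
                return None                 # out of distribution: abstain / fall back
            return self.model.predict(x.reshape(1, -1))[0]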
Personal thoughts: ML models can indeed help us solve some problems, especially in data analysis. But sometimes they may not work so well, and the costs in time and GPU resources cannot be ignored.