> Pyro enables flexible and expressive deep probabilistic modeling, unifying the best of modern deep learning and Bayesian modeling.
Does anyone who works in this area have a sense of why PPLs haven't really "taken off"? Like, of the last several years of surprising ML successes, I can't really think of any major ones that come from this line of work. To the extent that Bayesian perspectives contribute to deep learning, I more often see e.g. some particular take on ensembling around the same models trained to find a point estimate via SGD, rather than models built up from random variables about which we update beliefs, including representation of uncertainty.
Some might disagree with me but my best guesses are:
- Probability math is confusing and difficult, and a base understanding is required to use PPLs in a way that is not true of other ML/DL. Most CS PhDs will not be required to take enough of it to find PPLs intuitive, so to be familiar they will have had to opt into those classes. This is to say nothing of BS/MS practitioners, so the user base is naturally limited to the subset of people who studied Math/Stats in a rigorous way AND opted into the right classes or taught themselves later.
- Probabilistic models are often unique to the application. They require lots of bespoke code, modeling, and understanding. Contrast this with DL, where you throw your data in a blender and receive outputs.
- Uncertainty quantification often is not the most important outcome for sexy ML use cases. That is more frequently things like "accuracy," "residual error," or "wow that picture looks really good".
- PPL package tooling and documentation are often very confusing and don't work similarly to one another. This isn't necessarily the developer's fault, this stuff is hard, and the people with the domain knowledge needed to actually understand this stuff often have spent fewer hours in the open-source trenches.
Re your comment on CS PhDs not having probability background -- do you find that's true of ML researchers? I would understand that in a bunch of CS specialties, probability may not be a requirement, but in ML I would have expected otherwise.
Not OP but I deal with this a lot. In my experience a lot of folks working in mainstream ML haven't been exposed to it unless they specifically focused on it. It might just be a course load thing... getting the most out of these probabilistic PLs requires fairly deep expertise in both probability theory/Bayesian stats as well as in CS and you have a finite amount of courses you can take in school. Plus, a lot of the work in this area pre-dates the modern focus on deep learning or machine learning in general, so a lot of the knowledge tends to be held by professors/researchers that may not be as involved with the "new" ML courses. And of course, Math/Stats/CS departments don't always play nicely with each other and like to fight turf wars, though I've noticed cross-disciplinary research among the three becoming more accepted at the universities/institutes I work with.
As a case study, I did most of my grad work on solving Bayesian inverse problems using probabilistic programming for applications in engineering, which is pretty cross-disciplinary. I now work mostly in ML, but I didn't really even touch anything in the ML domain until after I finished school. I could have, the courses were available, but they just weren't relevant to me at the time.
Edit: I wouldn't be surprised if there was a considerable userbase in industries like finance, but in my experience those folks don't share much.
One of the best ML researchers I know has a background in signal processing (and degrees in EE to go with it). Not probability per se, but a field that heavily uses probability and statistics.
Can you elaborate? The unreasonable effectiveness of approximate methods on discretized spaces doesn't change the fact that the theory underlying it is exact and continuous.
You don’t need a professional license to do math. Lots of computer scientists do harder and more interesting mathematics than their peers in the math dept. In that respect at least, the main substantive difference between the fields is about $40k/yr.
You're just stating things without justifying them.
What else would you consider a subfield of CS? Finance? Accounting? Logistics? UI design?
What is or isn't a subfield of a given science has nothing to do with the professional qualifications of those who practice it or how the tools may be implemented. We don't call pharmaceuticals "a subfield of robotics" because of how the factories are built.
The same can unfortunately be said of many "statisticians", who use statistics as a big recipe book without understanding the first thing about the mathematical underpinnings of the topic.
Don't believe me?
Go ask the first statistician you run into to give you a half decent explanation of how the Chi-squared distribution and the Chi-squared test work, and see what happens.
I work on one of these PPLs, and I personally find Bayesian inference to be useful in a few cases:
1. When your main objective is not prediction but understanding the effect of some underlying / unobserved random variable.
2. When you don't have tons of data + you have very clear ideas of the data generation process.
(1) is mainly relevant for science rather than private companies. E.g., if you're an epidemiologist, you're generally speaking interested in determining the effect of certain underlying factors, e.g. the effect of mobility patterns, rather than just predicting the number of infected people tomorrow, since the hidden variables are often something you can directly control, e.g. by imposing travel restrictions.
(2) can occur either in academic settings or in the private sector, in applications such as revenue optimization. In these scenarios, it's also very useful to have a notion of the "risk" you're taking by optimizing according to this model. Such a notion of risk is completely straightforward in the Bayesian framework, while less so in frequentist scenarios.
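To make (2) concrete, here is a minimal sketch in Pyro (since that's the library under discussion). The prior, the made-up conversion counts, and the 10% break-even threshold are all illustrative assumptions, not a real model:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

# Made-up data: 17 conversions out of 120 trials at a new price point.
conversions, trials = 17.0, 120

def model():
    # Prior encoding what we believe about the conversion rate before seeing data.
    rate = pyro.sample("rate", dist.Beta(2.0, 20.0))
    pyro.sample("obs", dist.Binomial(trials, rate), obs=torch.tensor(conversions))

mcmc = MCMC(NUTS(model), num_samples=1000, warmup_steps=500)
mcmc.run()
rate = mcmc.get_samples()["rate"]

# The "risk" of acting on this model is just a posterior probability,
# e.g. the chance the true rate sits below a hypothetical 10% break-even point.
print((rate < 0.10).float().mean())
```

The point is that the risk query is one line once you have posterior samples, whereas a frequentist analysis typically needs a purpose-built procedure for it.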
I've been involved in the above scenarios and have seen clear advantages of using Bayesian inference, both in academia and private sector.
With that being said, I don't think Bayesian inference, and thus even less so PPLs, is ever going to "take off" in a similar fashion to many other machine learning techniques. The reasons for this are fairly simple:
1. It's difficult. Applying these techniques efficiently and correctly is way more difficult than standard frequentist methods (even interpreting the results is often non-trivial).
2. The applicability of Bayesian inference (and thus PPLs) is just so much more limited due to the computational complexity + reduction in utility of the methods as data increases (which, for private companies, is more and more the case).
PPLs mainly try to address (1), and we do have very successful examples of this, e.g. PyMC3 (they also have a bunch of nice examples of applying Bayesian inference in a private-sector context), and Stan (maybe more heavily used in academia).
> It's difficult. Applying these techniques efficiently and correctly is way more difficult than standard frequentist methods
Do you have any good resources/examples for applying these methods effectively? I've read Statistical Rethinking which is a good introduction to these methods at a high level but I find when I dig into an actual problem I have a lot of gaps and wish there were more real world code examples I could learn from.
I think Bayesian Data Analysis is the natural progression step.
Not sure if there is a more recent book that's updated to use modern Stan examples, but the Stan user guide itself has developed into a very useful resource on its own. It contains a large number of example models and builds up concepts incrementally. The writing style is also generally easy to follow.
It knows nothing of the modern stuff (because MacKay died too early), but if you skip the first parts of David MacKay's Information Theory, Inference, and Learning Algorithms, you get a very accessible course in (200x-era) Bayesian inference that should cover most of what you need for diving into PPL applications.
In my case, I used it in an actual course on Bayesian inference. Looking back over the material it doesn't seem particularly complicated for anyone with a solid probability background, but maybe the concepts are hard if you aren't seeing them presented nicely in a lecture setting.
They've really taken off in niche places. If you have a complex model of something, it's dramatically easier to use one of these to build/fit your model than it is to code it by hand.
But those cases are still ones where you might have just a dozen variables (though each might be a long vector). It's more the realm of statistical inference than it is general programming or ML.
It hasn't "taken off" in ML because ML problems generally have more specific solutions based on the problem. If you have something simple and tabular, other solutions are generally better. If you have something recsys shaped, other solutions are generally better. If you have something vision/language shaped, other solutions are generally better.
It hasn't "taken off" in general programming because PPLs generally have trouble with control flow. Cutting off an entire arm of a program is trivial in a traditional language, but in PPLs you'll have to evaluate both. If the arm is a recursion step and hitting the base case is probabilistic, you might even have to evaluate arbitrarily deep (or you approximate that in a way that significantly limits the breadth of techniques available for running a program).
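To make the recursion point concrete, here is a sketch of such a program in Pyro-flavoured Python (the names `geometric` and `stop_{t}` are just illustrative):

```python
import pyro
import pyro.distributions as dist

def geometric(p, t=0):
    # Each call flips a coin; the recursion depth is itself a random variable.
    stop = pyro.sample(f"stop_{t}", dist.Bernoulli(p))
    if stop.item() == 1.0:
        return t                    # base case reached probabilistically
    return geometric(p, t + 1)      # otherwise recurse one level deeper
```

Forward sampling is trivial, but an inference engine conditioning on the return value has no bound on how many `stop_*` choices it may need to consider.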
AFAICT, a truism in PPL is that there are always programs that your language will run poorly on but a bespoke engine will do better, by an extreme margin. There just aren't general languages that perform as reliably as in deterministic languages.
It's also just really really hard. It's roughly impossible to make things that are easy in normal languages easy to work with in PPLs. Consider these examples:
`def f(x, y): return x + y + noise` where you condition on `f(3, y) == 5`. It's easy.
`def f(password, salt): return hash(password + salt)` where you condition on `f(password, 8123746) == 1293487`. It's basically not going to happen even though forward evaluation of f is straightforward in any traditional language.
Hell, even just supporting `def f(x, y): return x+y` is hard to generalize. Surprisingly it's harder to generalize than the `x+y+noise` case.
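For what it's worth, the first ("easy") case really is a few lines in something like Pyro. A hedged sketch, where the prior on `y` and the noise scale are assumptions I'm adding for illustration:

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

def model():
    y = pyro.sample("y", dist.Normal(0.0, 10.0))     # vague prior over the unknown y
    # f(3, y) = 3 + y + noise, observed to equal 5; the noise is absorbed into the likelihood.
    pyro.sample("obs", dist.Normal(3.0 + y, 1.0), obs=torch.tensor(5.0))

mcmc = MCMC(NUTS(model), num_samples=500, warmup_steps=200)
mcmc.run()
print(mcmc.get_samples()["y"].mean())   # posterior mean of y, close to 2
```

The second unknown (`noise`) isn't solved for; it's integrated out by the likelihood, which is why the problem is well-posed despite "two unknowns, one equation".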
I think you’re overgeneralizing in your control flow discussion.
I also don’t understand your f example with (x, y, noise): if you fix x and the return value, you still have two unknowns and one equation. How is that easy to solve?
Unless you’re considering using parametric inverses to represent the solution — but you didn’t mention this so I assume you didn’t mean this.
I have spent a lot of time trying to use PPLs (including Pyro, Edward, numpyro, etc.) in Real World data science use cases, and many times mixing probabilistic programming (which in these contexts means Bayesian inference on graphical models) and deep networks (lots of parameters) doesn't work simply because you don't have very strong priors. There are cases where these are considered very effective (e.g. medicine, econometrics, etc.) but I haven't worked in those areas.
NUTS-based approaches like Stan (and numpyro) have more usage, and I think Prophet is a good example of a generalizable (if limited) tool built on top of PPLs.
Pyro is a very impressive system, as is numpyro, which I think is the successor since Uber AI disbanded (it's much faster).
It's much more expensive to train models. Besides, compilers are not that smart yet. E.g. an HMM implemented in a PPL is far from the efficiency of hand-rolled code. For many use cases, they are still a leaky abstraction.
However, in areas where measuring uncertainty is important, they have taken off. Stan has become mainstream in Bayesian statistics. Pyro and PyMC are also quite used in industry (I have had recruiters contacting me for this skill). Infer.NET has its own niche on discrete and online inference. Infer.NET models ship with several Microsoft products.
Other interesting PPLs include Turing.jl, Gen.jl, and the venerable BUGS.
Sure. It's hard to do this justice in a single comment, as there are lots of applications. Basically, any scenario with small or medium-sized datasets where you would use generative models. PPLs are just a way to encode generative models and get an inference engine compiled for you, instead of needing to write one. For example:
* Landing the Apollo spacecraft on the Moon, or the tracking systems used by e.g. the Sidewinder missile, which employ Kalman filters. See Example 24.4 (p. 510) [1].
* Predicting ride demand on heavy-tailed time series. Uber does this all the time [2].
* Estimating the effect of some policy on data with hierarchical structures (State > County > Individual observations) [3]. A sketch of this case follows below.
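For the hierarchical case, here is a minimal partial-pooling sketch in Pyro; the variable names, index tensors, and noise scales are all placeholders of mine, not the model from [3]:

```python
import torch
import pyro
import pyro.distributions as dist

def model(state_of_county, county_of_obs, y=None):
    # state_of_county: long tensor mapping each county to its state
    # county_of_obs:   long tensor mapping each observation to its county
    n_states = int(state_of_county.max()) + 1
    n_counties = len(state_of_county)

    mu = pyro.sample("mu", dist.Normal(0.0, 5.0))                     # national-level mean
    with pyro.plate("states", n_states):
        state_eff = pyro.sample("state_eff", dist.Normal(0.0, 1.0))   # per-state offsets
    with pyro.plate("counties", n_counties):
        # Counties are shrunk toward their own state's effect.
        county_eff = pyro.sample("county_eff",
                                 dist.Normal(state_eff[state_of_county], 0.5))
    with pyro.plate("obs", len(county_of_obs)):
        pyro.sample("y", dist.Normal(mu + county_eff[county_of_obs], 1.0), obs=y)
```

Fit it with NUTS or SVI as usual; the partial pooling across State > County > Individual falls out of the nested sampling statements rather than needing bespoke code.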
These things go in and out of fashion. Now it's LLMs' turn to have their fifteen minutes.
I think one reason why Bayesian models have not taken off is that representing prediction uncertainty comes at the expense of accuracy, for a given model size. People prefer to devote model capacity to reducing the bias rather than modeling uncertainty.
Bayesian models make more sense in the small-data regime, where uncertainty looms large.
I don’t think the field has converged on “the right abstractions” yet.
It’s an active area of programming language research; it feels similar to where AD was for a while.
I work on this stuff for my research, so I do believe that there is a really good set of abstractions. My lab has had good success at solving problems with these abstractions, including problems you might not think are amenable to Bayesian techniques or would scale well with them, like pose or trajectory estimation and SLAM with renderers in the loop.
Other PPLs I’ve studied also have a mix of these abstractions, but make other key design distinctions in interface / type design that seem to cause issues when it comes to building modular inference layers (or exposing performance optimization, or extension).
I also often have the opinion that the design choices taken by other PPLs feel overspecialized (optimized too early, for specific inference patterns). I’m not blaming the creators! If you set out to design abstractions, you often start with existing problems.
On the other hand: if you’re just solving similar problem instances over and over again, in increasingly clever ways — what’s the point? Unless: (a) these problems are massive value drivers for some sector (b) your increasingly clever ways are driving down the cost, by reducing compute, or increasing speed.
I think PPLs which overspecialize to existing problems are useful, but have trouble inspiring new paradigms in AI (or e.g. new hardware accelerator design, etc).
Partially this is because there’s an upper bound on the inference complexity which you can express with these systems — so it is hard to reach cases where people can ask: what X application would this enable if we could run this inference approximation 1000x faster?
(Also note that inference approximations _can_ include neural networks)
I'm no authority on the subject, but FWIW I tried quite a bit to make various bayesian methods work for me. I never found them to outperform equivalent frequentist (point estimate) methods.
Modelling uncertainty sounds nice and sometimes is a goal in itself, but often at the end of the day you need a point estimate. And then IME all the priors, flexible models, parameter distributions, just don't add anything. You could imagine they do, with a more flexible model, but that is not my experience.
But then, PPLs are just so much harder. The initial premise is nice - you write a program with some unknown parameters, you have some inputs and outputs, and you get some probabilistic estimates out. But in practice it is way more complex. It can easily and silently diverge (i.e. converge to a totally wrong distribution), and even plain vanilla Bayesian estimation is a dark art.
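As a small illustration of what "checking that it didn't silently diverge" looks like in practice, here is a toy Pyro model with synthetic data, just to show the workflow (the model and data are made up):

```python
import torch
import pyro
import pyro.distributions as dist
from pyro.infer import MCMC, NUTS

def model(y):
    mu = pyro.sample("mu", dist.Normal(0.0, 10.0))
    sigma = pyro.sample("sigma", dist.HalfNormal(1.0))
    with pyro.plate("data", len(y)):
        pyro.sample("obs", dist.Normal(mu, sigma), obs=y)

y = 2.0 + 0.5 * torch.randn(50)                  # synthetic data
mcmc = MCMC(NUTS(model), num_samples=1000, warmup_steps=500)
mcmc.run(y)
mcmc.summary()   # r_hat far from 1.0 or a tiny n_eff is usually the only hint
                 # that the sampler quietly failed while still returning samples
```

And that is the easy part; diagnosing *why* it failed is where the dark art starts.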
You need to be intimately aware of your data input, the models you’re proposing, and initialisations.
Practically, this means iteratively visualising your data and making informed judgements just to get your model to run at all.
The ML promise, by contrast, is that there are robust models you can feed nearly unlimited amounts of data to get better predictions, without humans thinking much about the structure of the data and model.
Probabilistic modeling is better for people who have a fixed dataset they can visualise and fit an elegant model that incorporates lots of prior information about the problem of interest.
Though I see what you're saying, using a PPL for VAEs just seems like overkill given the simplistic nature of VAEs.
PPLs are useful when the data generation process is not easily represented by something like a simple multivariate Gaussian, etc. You find many good examples in academic research, e.g. epidemiology.
Yes, but mathematical integration (needed to solve the Bayesian equations) is difficult, and the higher the dimension, the more difficult it is. That's why differentiation is preferred. The concepts behind PPLs are firmly entrenched in probabilistic ML; the ideas were never lost.
Making a new language is not the way to do it. A new language means you wipe out all the tools you were using before, from syntax highlighting to libraries to optimizations. Even languages like java, go, julia, lua and D worked on their garbage collection for at least a decade.
Not only that, there is no reason why the math can't be done in a library and used in another language in the first place.
I wish Pyro would do a better job of hiding the implementation details. I shouldn't need to understand variational inference and such just to get the probability of a god dang hot dog. I've tried to use Pyro a few times, but every time I spend more effort trying to understand poutines and such instead of modeling my problem.
And I wish they would merge it with the beautiful explanations at https://probmods.org/. We need a practical probabilistic programming language in Python. We have PyMC, but to use that you have to pull out your old notes on Theano.
It's a remote object model, very similar in spirit to CORBA. This allows the object creator/user and the object itself to be in different fault domains - which makes it all too easy to lose track of objects and leak them, unless you've added significant management scaffolding.