Book review: Notes on a New Philosophy of Empirical Science (Draft Version), by Daniel Burfoot.
Standard views of science compare theories by finding examples where they make differing predictions, and rejecting the theory whose predictions fare worse.
Burfoot describes a better view of science, called the Compression Rate Method (CRM), which replaces the “make prediction” step with “make a compression program”, and compares theories by how much they compress a standard (large) database.
These views of science produce mostly equivalent results(!), but CRM provides a better perspective.
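The near-equivalence comes from a standard coding-theory fact: a theory that assigns probability p to the observed data can, via an ideal arithmetic coder, compress that data to about -log2(p) bits, so better predictions mean better compression. A minimal sketch (the two "theories" and their probabilities are invented for illustration):

```python
import math

def code_length_bits(probs):
    """Total code length (in bits) an ideal arithmetic coder would need
    to losslessly encode a data sequence, given the probability a theory
    assigned to each observed outcome: the sum of -log2(p)."""
    return sum(-math.log2(p) for p in probs)

# Two hypothetical theories predicting the same four coin flips,
# all of which came up heads.
theory_a = [0.9, 0.9, 0.9, 0.9]  # Theory A: P(heads) = 0.9
theory_b = [0.5, 0.5, 0.5, 0.5]  # Theory B: P(heads) = 0.5

# Theory A's code is shorter (~0.61 bits vs. exactly 4 bits), so the
# theory that predicted better also compresses better.
print(code_length_bits(theory_a), code_length_bits(theory_b))
```

The ranking of theories is the same whether we compare predictive accuracy or compressed size; CRM just makes the comparison a single number.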
Machine Learning (ML) is potentially science, and this book focuses on how ML will be improved by viewing its problems through the lens of CRM. Burfoot complains about the toolkit mentality of traditional ML research, arguing that the CRM approach will turn ML into an empirical science.
This should generate a Kuhnian paradigm shift in ML, with more objective measures of research quality than any branch of science has yet achieved.
Burfoot focuses on compression as encoding empirical knowledge of specific databases / domains. He rejects the standard goal of a general-purpose compression tool. Instead, he proposes creating compression algorithms that are specialized for each type of database, to reflect what we know about topics (such as images of cars) that are important to us. The main benefits of CRM:
- Unambiguous evaluation:
Hypotheses are evaluated by quantifying how much compression they achieve on a standard database. (The size of the software needed for decompression is included in the compression measure).
- Designed to work on unlabeled data:
ML research is constrained, in part, by the cost of producing labeled training data. CRM emphasizes the benefits of unsupervised learning of unlabeled data.
- Scientific fraud becomes much harder.
- Avoids overfitting.
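The first benefit above — scoring a hypothesis by its compression of a standard database plus the size of the decompression software — amounts to a two-part score. A minimal sketch (the decompressor sizes are made-up placeholder numbers, not real measurements):

```python
import bz2
import zlib

def crm_score(data: bytes, compressor, decompressor_size: int) -> int:
    """Two-part CRM score: compressed output size plus the size of the
    program needed to decompress it. Smaller is better."""
    return len(compressor(data)) + decompressor_size

data = b"the cat sat on the mat " * 1000

# Compare two off-the-shelf "hypotheses" about the data's structure.
# The decompressor sizes are invented for illustration only.
score_zlib = crm_score(data, zlib.compress, decompressor_size=8_000)
score_bz2 = crm_score(data, bz2.compress, decompressor_size=30_000)
print(score_zlib, score_bz2)
```

Counting the decompressor's size is what blocks the trivial cheat of hard-coding the database into the program.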
These benefits seem real, but Burfoot exaggerates them. He claims that fraud and manual overfitting “cannot occur” with CRM.
Yet I’m sure there will still be some fraud with CRM. For example, people could hide code in their software that secretly connects to an external database, smuggling information past the size measurement.
But when I tried to produce examples of overfitting while using the CRM approach, I discovered that I kept drifting back into using methods that looked somewhere in between CRM and traditional science.
That convinced me to replace my initial reaction of “CRM is good, but not very novel” with a “that’s harder and stranger than I expected” reaction.
Lossless versus lossy compression
Burfoot focuses on lossless compression. Yet it seems much more natural to me to use lossy compression.
Lossy compression discards noise and low-value information from datasets, in order to focus on the most valuable information. Traditional science does that, to produce insights that are simple enough for humans to understand. Human brains use lossy compression, both because of the resource costs of more accurate compression and because of the difficulty of evolving more accurate neurons. Machine learning research produces compression that keeps more information than brains or traditional science keep, but the most valuable approaches still use the equivalent of lossy compression.
Lossless compression seems harder to implement, but it offers a valuable improvement in how objectively we can measure the quality of our hypotheses.
I was puzzled, when I reached the end of the book, by Burfoot’s failure to comment on this tradeoff. Then it occurred to me that I could convert any good lossy compression algorithm into a good lossless compression algorithm. Now that I look, I see that the book contains hints to this effect, but it somehow diverted my attention away from this possibility.
My intuition tells me that such a conversion might be trivial to a theorist. I can imagine that it will someday be sufficiently automated that it is trivial in practice. But it seems sufficiently alien to mainstream compression goals that it deserves some comment in a book such as this.
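The conversion I have in mind is the standard residual-coding trick: run the lossy compressor, then losslessly encode the difference between the data and the lossy reconstruction. If the lossy model captured the real regularities, the residuals are small and compress well. A toy sketch (the rounding "model" and the byte packing are my own illustrative choices, not anything from the book):

```python
import zlib

def lossy_model(xs):
    # A toy "lossy compressor": round each value to the nearest multiple of 10.
    return [10 * round(x / 10) for x in xs]

def to_bytes(xs):
    # Pack small signed ints into bytes (offset by 128); assumes |x| < 128.
    return bytes((x + 128) % 256 for x in xs)

def lossless_from_lossy(xs):
    """Convert the lossy model into a lossless code: store the lossy
    approximation plus the (compressed) residuals."""
    approx = lossy_model(xs)
    residuals = [x - a for x, a in zip(xs, approx)]
    return approx, zlib.compress(to_bytes(residuals))

xs = [12, 18, 23, 27, 31, 40, 38, 52]
approx, packed = lossless_from_lossy(xs)

# Decoding: unpack the residuals and add them back to the approximation.
residuals = [b - 128 for b in zlib.decompress(packed)]
restored = [a + r for a, r in zip(approx, residuals)]
assert restored == xs  # exact reconstruction, despite the lossy model inside
```

The better the lossy model, the closer the residuals are to pure noise, and the smaller the total lossless code.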
That doesn’t fully answer my doubts about overfitting. Burfoot’s main argument seems to be that the CRM approach uses large databases, whereas traditional ML approaches use only labeled data, which can only be produced in much smaller quantities.
I’m pretty concerned about overfitting to stock market data. That’s somewhat atypical, in that there’s an obvious way to automatically label the most important data (stock prices can be labeled by how much they go up over some arbitrary time period), with the main problem being that the data only provide a few independent pieces of evidence, generated by a somewhat malicious sampling process, for the features I care about.
For example, the dot-com bubble shows up in my database as if it happened once every two decades. But based on hard-to-quantify history books, I’m guessing that similar bubbles happen more like once a century.
I’m confused about how the CRM approach is supposed to help me avoid overfitting on that data. My guess is that if I’m careful, I’ll find multiple hypotheses that provide indistinguishable (and unimpressive) amounts of compression. Or maybe I will fail to find hypotheses that produce any compression. Maybe a good hypothesis would require using a much larger database of human behavior, which would produce a very general model of human minds, and enlighten us about market bubbles as a minor side-benefit.
In spite of these doubts, I expect the CRM approach will help me if I can find a practical way to combine my intuitions with some sort of unsupervised learning.
Related ideas from other authors
This book builds on ideas that have been floating around for a while, and in many cases it isn’t obvious where the ideas originated. Here are two sources that Burfoot points to, plus two that I’ve happened to notice:
- Hinton’s generative approach has moved ML research in the direction of CRM, but lacks the focus on objective measurement. Hinton’s role in catalyzing AI progress is moderate evidence in favor of Burfoot’s thesis.
- The Hutter Prize is almost a CRM approach, but only for one medium-quality database.
- Eric Baum emphasizes compact representations of reality as the main ingredient of intelligence, but focuses on understanding evolution and intelligence, not on using compression to improve ML and science.
- Max Tegmark makes a brief comment in his book that endorses defining science as compression.
Burfoot is more ambitious than any of those authors, aiming to make ML into a rigorous science, and to make science in general more objective.
There’s a slight resemblance to MIRI’s goal of making AI research more rigorous, but a large difference in what the two approaches imply for the speed of AGI takeoff. Burfoot implies that intelligence is mostly empirical knowledge, while MIRI focuses on something closer to a general-purpose compression tool.
In sum, this is a pretty good book. It helped clarify my understanding of science, and of recent trends in ML. It is almost polished enough to be publishable. It seems a shame that it has apparently been abandoned so close to completion.
 – Yes, I’m being vague about “similar”. I have a clearer meaning in mind, but I’m too busy to turn this post into a theory of market bubbles. Yes, I’m concerned that my meaning of “similar” is the result of overfitting.