Keio University

Compression, Mining, and Analysis

Participant Profile

  • Akihito Sakurai

File Compression and Compression Algorithms

Are you familiar with file compression? It comes in handy when you want to fit a large file on a small USB drive or attach it to an email: compressing a file can shrink it to about half, or even a small fraction, of its original size. In fact, the Office file formats introduced with Office 2007 (.docx, .xlsx, and so on) are themselves compressed archives, so Microsoft Office compresses your files automatically every time you save.

The basic principle of compression (and of decompression, which restores the original) is extremely simple. Consider, for example, a string in which "01" is repeated 500 times, like 0101...01 — 1,000 characters in total. Writing "01 repeated 500 times" takes only a handful of characters. This is exactly how compression algorithms work: they find regularities within a string and describe them cleverly enough to save space.
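The idea above can be sketched in a few lines of Python. This toy encoder (the function name `describe_repetition` is my own, not from the article) looks for the shortest block whose repetition reproduces the whole string and emits a compact description of it:

```python
def describe_repetition(s: str) -> str:
    """Describe s as a repeated block when possible, e.g. "'01' x 500"."""
    # Try every block size that could repeat at least twice.
    for size in range(1, len(s) // 2 + 1):
        block = s[:size]
        if len(s) % size == 0 and block * (len(s) // size) == s:
            return f"{block!r} x {len(s) // size}"
    return repr(s)  # no repetition found: keep the string as-is

s = "01" * 500                     # the 1,000-character string from the text
print(describe_repetition(s))      # → '01' x 500  (10 characters, not 1,000)
```

Decompression is just reading the description back: repeat the block the stated number of times.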

Ultimate Data Compression

So, does an ultimate compression algorithm exist? Kolmogorov and Chaitin considered this question independently, at around the same time, in slightly (or perhaps quite) different ways. Given a string x, they considered programs y (which are themselves strings) that output x when executed. Among the infinitely many such programs, they defined the shortest one (the one with the fewest characters) as the compressed representation of x. Does this mapping from x to y actually exist? It is easy to write a program only slightly longer than x that simply prints x. Therefore, it suffices to examine every string from length 1 up to the length of this trivial program (there are only finitely many) and find the shortest one that, run as a program, outputs x. So the mapping does exist.
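The "program slightly longer than x" in the argument above is easy to exhibit concretely. A minimal sketch, assuming we take Python source code as our notion of "program" (one of many possible choices, and my own illustration rather than anything from the article):

```python
def trivial_program(x: str) -> str:
    """Return Python source code that, when executed, prints exactly x."""
    return "print(" + repr(x) + ", end='')"

src = trivial_program("0101")
print(src)        # → print('0101', end='')
exec(src)         # outputs: 0101
```

The generated program is only a constant number of characters longer than x itself, which is exactly the bound the existence argument needs.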

This ultimate compression program would be incredibly powerful. Since it achieves the ultimate compression, it must take into account every regularity contained in string x and exploit those most useful for compression. Put differently, by examining program y, the shortest compression of x, you can understand the properties of x very well. It is akin to what Kepler did when he derived his laws from the vast observational data collected by Tycho Brahe.

In other words, it can be described as extracting knowledge y from data x.

Discovering Laws

In fact, humanity has long acted in line with this idea. From the movements of the sun, the moon, and the planets, we have found regularities that explain them, predicted the changing of the seasons, and at times even inferred the will of the gods. In modern times, we have learned that by inducing the shortest and most elegant theories (the equations of Newton and Einstein, or the "divine equations" recently featured on television) from large amounts of experimental data, we can explain and predict still more data, including unknown phenomena. We have established this method as science.

However, when you think about it, something strange emerges. If this ultimate compression program existed, great scientific discoveries could be made simply by collecting data. It sounds like a dream tool — and in fact it is just a dream. The mapping is incomputable: no algorithm can ever carry it out. It's a pity.

Machine Learning

However, it can be approximated, and better approximations can be obtained by restricting the domain. This is what our research field, "machine learning," does: it studies tools for obtaining knowledge from data. The field is also called data mining, and it is closely related to the recently popular analysis of big data. In machine-learning research, the general-purpose ultimate compression program usually remains a guiding concept; in practice, we look for good methods to find regularities of a specific, restricted type (for example, those describable by a linear equation) hidden in the data.
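As a concrete instance of "regularities describable by a linear equation", here is a minimal least-squares line fit in pure Python (my own illustration; the article does not prescribe any particular method):

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b: the linear 'regularity' in the data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    b = my - a * mx
    return a, b

xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]          # data hiding the regularity y = 2x + 1
a, b = fit_line(xs, ys)
print(a, b)                   # → 2.0 1.0
```

Note that describing the five data points by just the two numbers (a, b) is itself a small act of compression.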

Of course, there is also research that uses approximations of the ultimate compression program: machine learning built on the ordinary compression programs mentioned at the beginning of this article. Compared with the ultimate one, ordinary compression programs are mere toys, exploiting only statistical biases such as skewed frequencies of characters and substrings. Even so, document classification (detecting spam email, say, or selecting only the news you want to read) can be done with an accuracy only slightly below the best classification methods. As a more unusual example, there is research that used a compression program to compute distances between musical scores and thereby distinguish the music of Chopin from that of Debussy.
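One common way to turn a compressor into a distance measure, as in the score-comparison study, is the normalized compression distance (NCD). A sketch using Python's standard zlib compressor — the article does not say which compressor or distance those studies actually used, so this is only illustrative:

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance: near 0 for similar data, near 1 for unrelated."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

a = b"la la la fa sol la " * 30   # toy stand-ins for two similar "scores"
b = b"la la la fa sol si " * 30
c = bytes(range(256)) * 2         # unrelated, hard-to-compress data
print(ncd(a, b) < ncd(a, c))      # → True: a is closer to b than to c
```

The intuition: if x and y share regularities, compressing them together costs little more than compressing either alone, so the distance is small.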

Since compression programs are designed for one-dimensional sequences, their compression performance is poor when applied to, say, two-dimensional images; in other words, they fail to extract the useful regularities. When classifying images by the objects they contain (airplanes, lions, buildings, and so on), it is difficult to extract the regularities of the target object from the image. This is due partly to the inherent difficulty of the mismatch in dimensionality, and partly to the characteristics of real-world data: the noise, distortion, similarity, and ambiguity present in images. Nevertheless, by devising ways to extract and describe image "features," such approaches have achieved more than half the performance of the best recognition methods.

One of the important tools in machine learning is the neural network. Among the learning methods for neural networks is one that forces the network to compress the data (the autoencoder). Recently, a method was discovered that, by stacking this compression step in multiple stages, automatically finds image "features" (the basic "words" used to describe an image, such as vertices or faces) suitable for image classification. This method is called deep learning, and it is being actively researched.

In Conclusion

As you can see, there is a common foundation among data compression, regularity discovery, knowledge discovery, machine learning, and data mining. In fact, there are also philosophically interesting topics, but there isn't enough space to explain them here. For those who are interested, I recommend looking up "Occam's razor," for example.

Over the last year or two, the importance of data, and the need to analyze it, has become widely recognized. Data generated by human activity, like naturally occurring data, has its own characteristics, and proper analysis can yield insights and guidelines for appropriate action: designing evacuation routes that account for instinctive human behavior, for example, or placing products so as to promote sales. Analysis methods, too, are being developed one after another and are constantly evolving.

Whether you start from data analysis, Occam's razor, or computation theory, why don't you join us in thinking about these things?

Gakumon no susume (An Encouragement of Learning) (Research Introduction)
