Keio University

Statistics: A Field That Is Broad, Multifaceted, and Deceptively Deep

Published: November 10, 2022

What image comes to mind when you think of statistics? When I tell people I study statistics, the reactions vary widely. An American accountant once told me to my face, "Statistics is a tool for deceiving people," while others have given intellectual responses like, "Then measure theory must be essential." There is even an anecdote about a statistician whose elementary school teacher, upon hearing his profession, mistakenly asked if he raised chickens (a pun in Japanese). Since he is not the type to make things up, I am sure it is a true story. The last one is an exception, but the point is that the image of statistics varies greatly, even among researchers. I also suspect that the public's perception has changed in recent years because of news coverage of COVID-19. It is hard to think of another research field where this "multifaceted nature" is so prominent.

Recently, within the field of statistics, I have been researching analytical methods that focus specifically on the geometric features of data. A long-studied example is directional statistics, which deals with data distributed on a circle or a sphere. Examples include directional data like wind direction and distribution data on the Earth's surface. In recent years, however, as the data we handle has become larger and more diverse, and with the addition of machine learning to analytical methods, research on data with more complex geometric structures has become increasingly active.
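To see why directional data needs its own methods, consider averaging wind directions. The following is a minimal sketch, not taken from the research described here; the function name `circular_mean_deg` and the sample angles are illustrative assumptions. It computes the standard circular mean by averaging unit vectors rather than the raw angles.

```python
import math


def circular_mean_deg(angles_deg):
    """Mean direction in degrees, computed by summing unit vectors.

    Hypothetical helper for illustration only.
    """
    s = sum(math.sin(math.radians(a)) for a in angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg)
    # atan2 returns an angle in (-180, 180]; map it to [0, 360).
    return math.degrees(math.atan2(s, c)) % 360.0


# Three wind directions clustered around north (made-up sample values).
winds = [350.0, 10.0, 30.0]
naive_mean = sum(winds) / len(winds)      # treats angles as plain numbers
circular_mean = circular_mean_deg(winds)  # respects the circle
```

For these sample angles the plain arithmetic mean is 130 degrees, which points roughly southeast, while the circular mean is about 10 degrees, in the middle of the cluster.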

In this context, in a joint research project with Professor Wynn of the London School of Economics, I proposed a new data analysis method focusing on the "curvature" of data. Specifically, we first construct a "proximity graph" by connecting nearby data points distributed in a space, and then define a "fundamental" distance between data points as the shortest path length on this graph. When the data is distributed on a manifold (a geometric set with "smoothness," such as a sphere or a torus), this distance approximates the shortest path length (geodesic distance) on that manifold. For this reason, it is often used in a family of machine learning methods called "manifold learning."
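The construction of the fundamental distance can be sketched as follows. This is a simplified illustration under stated assumptions, not the implementation used in the research: it assumes a fixed-radius rule for deciding which points count as "nearby" (the article does not say which rule is used), and the names `proximity_graph`, `graph_distances`, and `radius` are hypothetical.

```python
import heapq
import math


def proximity_graph(points, radius):
    """Connect every pair of points at Euclidean distance <= radius.

    Assumption: a fixed-radius neighborhood rule; other rules
    (for example k nearest neighbors) are equally possible.
    """
    n = len(points)
    graph = {i: [] for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            w = math.dist(points[i], points[j])
            if w <= radius:
                graph[i].append((j, w))
                graph[j].append((i, w))
    return graph


def graph_distances(graph, source):
    """Dijkstra's algorithm: shortest path length from source to each vertex."""
    dist = {v: math.inf for v in graph}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d_u, u = heapq.heappop(heap)
        if d_u > dist[u]:
            continue  # stale heap entry
        for v, w in graph[u]:
            if d_u + w < dist[v]:
                dist[v] = d_u + w
                heapq.heappush(heap, (dist[v], v))
    return dist


# Toy data: 100 equally spaced points on the unit circle.
n = 100
points = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
          for i in range(n)]
graph = proximity_graph(points, radius=0.1)  # links only adjacent points
d = graph_distances(graph, 0)
```

In this toy example the graph distance between antipodal points, `d[50]`, is close to pi, the geodesic distance along the circle, whereas the straight-line Euclidean distance between the same two points is 2.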

Our method transforms this fundamental distance into another distance that further improves analytical precision. In doing so, we proposed:

(1) changing the distance so that the "curvature" of the metric space where the data is distributed changes monotonically, and furthermore,

(2) enabling a relative monotonic change in curvature even for transformations to distances where curvature cannot generally be defined, by embedding them in a space called a "metric cone."
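For reference, the standard Euclidean cone over a metric space $(X, d)$ can be written as below. This is the textbook construction, given here under the assumption that it matches the "metric cone" referred to above; the article does not state the exact variant used or any rescaling of $d$ applied beforehand.

```latex
% Points of the cone are pairs (x, s) with x in X and radius s >= 0;
% all pairs with s = 0 are identified with a single apex.
\tilde{d}\bigl((x,s),(y,t)\bigr)
  = \sqrt{\, s^{2} + t^{2} - 2\,s\,t \cos\bigl(\min\{d(x,y),\ \pi\}\bigr) }
```

This is the law of cosines with the distance $d(x,y)$, capped at $\pi$, playing the role of the angle at the apex.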

The "curvature" mentioned here is what is known as CAT(k), which was proposed by Gromov and others in the 1980s to handle curvature even in general geodesic metric spaces like proximity graphs. The proposed method was applied to the analysis of regional rainfall data in the UK, and it was confirmed that the "geometric" variance of annual rainfall has been rapidly increasing in recent years. This is a phenomenon that was not detected when using conventional variance.
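A variance that depends only on distances between data points can be illustrated with the Fréchet function, which assigns to each candidate center the mean squared distance to the data. The sketch below is an assumption-laden illustration rather than the calculation used in the study: it restricts candidate centers to the data points themselves, and the names `frechet_function` and `frechet_variance` are hypothetical.

```python
def frechet_function(dist_matrix):
    """F(i) = mean squared distance from data point i to all data points."""
    n = len(dist_matrix)
    return [sum(d * d for d in row) / n for row in dist_matrix]


def frechet_variance(dist_matrix):
    """Return (index of minimizer, minimum of the Frechet function).

    Assumption: the minimum is searched over data points only.
    """
    values = frechet_function(dist_matrix)
    best = min(range(len(values)), key=values.__getitem__)
    return best, values[best]


# Toy example: four points at positions 0, 1, 2, 3 on a line.
positions = [0.0, 1.0, 2.0, 3.0]
D = [[abs(a - b) for b in positions] for a in positions]
values = frechet_function(D)
center, variance = frechet_variance(D)
```

Because only the distance matrix is required, the same functions apply unchanged when the entries are shortest path lengths on a graph or distances obtained by transforming them.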

In statistics, the process involves obtaining data, developing methods to analyze it, and theoretically evaluating the accuracy and validity of those methods, or conversely, creating analytical methods based on theoretical ideas. When you include the actual analysis of data and the consideration of its results, it becomes a very extensive process. I have recently come to feel that, for me, the greatest appeal of statistics is being able to carry out this entire process at a manageable scale while also overseeing and designing the whole picture.

Figure 1: Annual cyclical structure of 85 years of rainfall in the UK. Visualized in 3D using a method called principal component analysis. Each point on the loop represents the average data for a specific day (e.g., January 1) over 85 years, and the "whiskers" indicate the direction of data dispersion. It is clear that the data has a geometric structure.

Figure 2: Distance graph created with the proposed method based on 1986 data. The geometric variance of the 365 days of data from 1986 is calculated based on this graph. The color of the points here represents the value of what is called the Fréchet function and is unrelated to the colors in Figure 1.

Figure 3: Variation in geometric variance over 85 years. Time-series data for 85 years of geometric variance, calculated based on graphs constructed for each year's rainfall data as shown in Figure 2. With the proposed method (blue line), the variance tends to increase in recent years, and the year-to-year changes also show a tendency to expand. This trend is almost unobservable when using conventional variance (red line).

Gakumon no susume (An Encouragement of Learning) (Research Introduction)
