Participant Profile
Yuki Mitsufuji
Lead Research Scientist, Sony AI America. Keio University alumnus (Faculty of Science and Technology, 2002; Master's, Graduate School of Science and Technology, 2004). Guest Professor at New York University. Ph.D. in Information Science and Technology. Selected for Stanford/Elsevier's World's Top 2% Scientists.
Interviewer: Hideo Saito
Professor, Department of Information and Computer Science, Faculty of Science and Technology
On Receiving the Yagami Prize
──Congratulations on receiving the "Yagami Prize." First of all, how do you feel about winning the award?
It has already been over 20 years since I graduated. While Professor Shinji Ozawa's and Professor Saito's laboratories focused primarily on images, I was able to conduct research related to music. I am grateful that I was allowed to pursue research while engaging in activities slightly outside the norm.
I chose that research theme because I was absorbed in music during my university days. When it came time to find a job, I chose Sony because I wanted a company that handled music as entertainment. Even after joining, I was able to work in the direction I wanted, and I have been researching music and AI ever since.
Through trial and error with performing artists, creators, and content producers, I have been able to release several works into the world. Along the way, the technology called "sound source separation," which led to this Yagami Prize, gradually gained recognition. When I told Professor Saito about that research, he said, "You might just win it," so I applied for the Yagami Prize, and I am very happy to have achieved such a good result.
──I was also happy to see you again at the award ceremony.
With you at the ceremony and Professor Ozawa also making the trip, I felt for a moment as if I had gone back 20 years. Professor Saito is relatively close to me in age, and he used to come to my live music performances and support me, for which I am grateful.
──I knew you had joined Sony after graduation, but I remember being surprised to learn from web articles that you had earned your doctorate and were very active as a researcher. When the Keio AI Center was established in 2024, you became the group leader on the Sony side of its cooperative partnership. I am also very happy that a connection with Keio has been established in this way.
What is Sound Source Separation Technology?
──What exactly is the sound source separation technology that you have been involved with for so long? Why is it necessary? Why do sounds need to be separated?
Sound source separation technology has existed since the 1990s. The term "cocktail party effect" is often used; when someone calls out to you in the middle of a noisy party, humans can somehow notice that voice even though it is at the same volume as the surroundings. Humans can do this, but machines cannot. Sound source separation started as a technology to achieve this.
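To make the principle concrete, here is a minimal sketch of mask-based separation, the idea behind many such systems; it is illustrative only, not the specific technology discussed in this interview. With the clean sources known, an "oracle" soft mask on the mixture spectrogram recovers one source; the signals and parameters below are made up for the example.

```python
# Illustrative sketch of mask-based source separation (not the actual
# system discussed in the interview): with the "true" sources known,
# build an oracle soft mask on the mixture spectrogram and recover one.
import numpy as np
from scipy.signal import stft, istft

rate = 16000
t = np.arange(rate * 2) / rate
voice = np.sin(2 * np.pi * 220 * t)  # stand-in for a voice
noise = 0.5 * np.random.default_rng(0).standard_normal(t.size)  # party noise
mix = voice + noise

# Move to the time-frequency domain, where sources overlap less.
_, _, V = stft(voice, fs=rate, nperseg=512)
_, _, N = stft(noise, fs=rate, nperseg=512)
_, _, M = stft(mix, fs=rate, nperseg=512)

# Oracle ratio mask: the fraction of each bin's energy that belongs
# to the voice. A real system must estimate this from the mix alone.
mask = np.abs(V) ** 2 / (np.abs(V) ** 2 + np.abs(N) ** 2 + 1e-10)

# Apply the mask to the mixture and return to the time domain.
_, voice_est = istft(mask * M, fs=rate, nperseg=512)
print("reconstruction error:", np.mean((voice_est[: voice.size] - voice) ** 2))
```

The catch, of course, is that a real system never knows the clean sources; the mask has to be estimated from the mixture alone, which is where machine learning comes in.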
There were expectations that this technology could be applied to extracting only specific instruments or vocals from music where various instruments are mixed, but there were technical hurdles.
For example, technology to change a person's voice into another voice has become quite widespread now with generative AI, but until recently it could not be realized. Because human voices always contain some noise, it was difficult to cleanly learn and map the characteristics of Person A onto Person B. So everyone thought how great it would be if we could extract only the specific sound we wanted.
Besides voices, there is spatial audio. Instead of two-channel playback through headphones, what do you do when you want a more luxurious setup, with speakers behind and above you? If the original source is a CD, it exists only in stereo, so no matter how you route those left and right channels to the various speakers, you are still just listening to two sounds.
In that case, if you extract specific instruments and play the guitar from the right, the piano from the left, the vocals from slightly in front, and the drums from slightly above, it feels like you are surrounded by the different instruments. However, music sources for that did not exist, because recordings suited to spatial audio had never been made.
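As a rough illustration of how separated stems enable that kind of playback, here is a toy sketch that places each stem at its own direction. It is simplified to stereo panning (no height channel for the drums), and the stem names, angles, and placeholder audio are all invented for the example.

```python
# Toy sketch: render separated stems from different directions via
# constant-power panning. Real spatial-audio renderers (e.g. for
# Dolby Atmos) are far richer; this only shows why stems are needed.
import numpy as np

def pan(stem: np.ndarray, azimuth_deg: float) -> np.ndarray:
    """Constant-power stereo pan: -90 is hard left, +90 is hard right."""
    theta = np.deg2rad(azimuth_deg + 90.0) / 2.0  # map to [0, pi/2]
    return np.stack([np.cos(theta) * stem, np.sin(theta) * stem], axis=-1)

# Hypothetical separated stems (placeholder noise standing in for audio).
rng = np.random.default_rng(1)
stems = {name: rng.standard_normal(48000)
         for name in ("guitar", "piano", "vocals", "drums")}

# Place each stem at its own direction, then sum into one spatial mix.
placements = {"guitar": 60.0, "piano": -60.0, "vocals": 0.0, "drums": 20.0}
spatial_mix = sum(pan(stems[name], az) for name, az in placements.items())
print(spatial_mix.shape)  # (48000, 2): one column per output channel
```

A real renderer maps stems onto many speakers, or onto virtualized positions over headphones, but the prerequisite is the same: you need the stems.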
──So that's where sound source separation becomes necessary.
Yes. I thought this was a very important theme. If the challenge of sound source separation were solved, there would be many applications that could be used afterward.
So, in 2011, I went to France to study at the Institute for Research and Coordination in Acoustics/Music (IRCAM) and researched sound source separation. After that, I wrote papers as a researcher, and at the ICASSP conference in 2013 I heard a keynote talk by Geoffrey Hinton of the University of Toronto, who is called the Godfather of AI and has since won a Nobel Prize. His talk was about how performance could be vastly improved by introducing deep learning, what we now call AI, into the two fields of object recognition and speech recognition.
I felt I was witnessing the moment something new was happening, so after returning to Japan, I thought about applying this and tried introducing deep learning into sound source separation technology.
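For a flavor of what introducing deep learning means here, the sketch below trains a tiny network to predict the separation mask from mixture spectrogram frames. This is an assumed minimal architecture with dummy data, not the actual model the team built.

```python
# Minimal sketch of deep-learning-based mask estimation (assumed
# architecture, illustrative only): a small network learns to predict
# the separation mask from the mixture spectrogram.
import torch
import torch.nn as nn

n_freq = 257  # frequency bins per spectrogram frame (e.g. a 512-point FFT)

# A small network that predicts a soft mask from a mixture frame.
mask_net = nn.Sequential(
    nn.Linear(n_freq, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, n_freq), nn.Sigmoid(),  # mask values lie in [0, 1]
)
optimizer = torch.optim.Adam(mask_net.parameters(), lr=1e-3)

# One training step on dummy data: the predicted mask applied to the
# mixture magnitude should approximate the target source magnitude.
mix_mag = torch.rand(32, n_freq)     # batch of mixture spectrogram frames
target_mag = torch.rand(32, n_freq)  # matching clean-source frames

loss = nn.functional.mse_loss(mask_net(mix_mag) * mix_mag, target_mag)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"training loss: {loss.item():.4f}")
```

The design point is that the mask is no longer hand-designed: given enough paired examples of mixtures and clean sources, a network can learn it directly from data.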
Success by Introducing AI
──So that's when you started introducing AI.
That's right. As a team, we started with the feeling that it would be nice if something worked out, but when we actually tried it, we got incredible performance. Then, when we entered an international competition held in 2015, we were the only ones who had thought to use deep learning for sound source separation, and our score was far better than everyone else's. The people there were shocked, and for us, it was a moment that changed our standing and our destiny.
After that, the performance of sound source separation technology continued to improve, so we took the finished product to artists and studios. However, as of 2018, it had not reached a practical level at all, and we received feedback many times from artists and studio people saying, "This is useless."
However, as we repeated the process, the performance gradually improved. At one point, Kanji Ishimaru wanted to create a Japanese version of the narration on an old record in which narration was layered over a performance by Glenn Gould. To perform alongside the late Gould, we needed Gould's sound alone. Since it was a recording from the 1960s, there had been no equipment to record the instruments on separate tracks; everything was mixed together. Our technology had reached just the right level, and we were able to achieve it. That was in 2020.
──So it was put into practical use then.
Also, old movies are monaural, with a single channel. Dialogue, music, sound effects, and ambient sounds are all mixed together. So even if you wanted to convert this to 5.1 channels or a newer format, there was no way to do it.
Around 2020, there was a movement to remaster past masterpieces in 4K. For sound, there is a format called Dolby Atmos, in which the audio is rendered within the playback device according to the speaker arrangement so that it sounds three-dimensional; but to do that, you must have separated sound sources.
For the Academy Award-winning films "Lawrence of Arabia" and "Gandhi," I was asked to extract sounds such as motorcycles and running horses, so I performed sound source separation and handed the results to the studio. The studio then mixed them into the Atmos format and released them to the world.
──That's wonderful.
In this way, sound source separation evolved into what it should be, together with studios and artists, and it also came to be recognized in academic fields.
Furthermore, we have since succeeded in optimizing the technology so it can run even on electronic devices. We made the software compact and realized technology that allows recording while canceling surrounding noise on mobile devices and the like.
──That technology is also used in Sony products, isn't it? With so-called classical noise reduction, cleanly removing only the noise seemed like a magic trick. It means that today's sound source separation can remove only the noise while leaving the original voice or music clean.
That's right. The example I have always used to describe the difficulty of sound source separation is that it is as hard as separating mixed juice back into its ingredients. It's hard to take out only the orange juice after mixing apple juice and orange juice, right? With the power of AI, that can now be done.
What to Address in the Generative AI Era
──I believe you are considering applications in the entertainment field and various other areas, but how do you see the technology being applied and spreading in the future? And how do you personally want to be involved in that?
During my university days, I was serious about music, composing my own songs and performing at small live venues, and I wanted to make my debut eventually. However, I was the vocalist, and I realized I didn't have that special something that set me apart. While I was wondering what to do, a relative who was an engineer in the music industry advised me, "How about staying involved in music while using what you have now?" and I chose the path of becoming a researcher at a company.
Therefore, I think I understand to some extent how working artists feel. For example, when talk arises now that generative AI might replace people in music, I understand their feelings.
So I am thinking about how artists can transition smoothly to that era, even if, in the new paradigm, the things they have valued can be created just as well by generative AI.
──Specifically, what kind of things are you considering?
What I am thinking about now is a mechanism to properly compensate an artist when the output of generative AI closely resembles that artist's songs. That way, the artist won't lose their livelihood because of it.
Also, I think artists can open up new fields by actively using generative AI. I believe this will be a powerful tool, and genres that have never existed before might be created.
In music, just as electronic music using synthesizers spread in earnest in the 1970s, new genres emerge along with technology. So I think it is quite possible that generative AI will open up new fields. What I am aiming for next is to build a proper foundation for that.
An Awareness of Bridging the "Gap"
──As an engineer and researcher within a company, with what kind of awareness and what values do you work?
My current workplace is surrounded by entertainment, and it is easy to interact with customers, so I think I am in an environment where I can do what I want to do.
On the other hand, I am in New York now, and since generative AI and related fields are progressing very rapidly in the US, I feel that the situation there might not be properly communicated to Japan. There are policy differences over how open to be about copyright, and there is also a big gap in awareness between AI researchers and people doing art, so I have been quite active in trying to bridge that gap.
For example, the reason I served as an associate professor at Tokyo Tech (now Institute of Science Tokyo) was that I thought things we take for granted inside companies might not be common knowledge outside, and our work might not be clearly visible. Currently, I am also a Guest Research Professor at New York University's Steinhardt School, in a program that trains people for careers in entertainment.
The motivation is to see whether I can bridge the gap between AI research and students studying entertainment. Fortunately, I have been involved in both, so I try to bridge the gap by making the mysterious entity called AI visible and presenting it as an opportunity rather than a threat. I do this because I feel that if we don't, important things might be lost.
Recently, being in New York, I sense a great deal of enthusiasm for Japanese anime like "Demon Slayer." Americans knowing Japanese content this well would have been unthinkable 20 years ago. On Halloween, everyone dresses as Japanese anime characters. Japanese music has also been attracting attention recently; city pop gets played quite a bit.
When expanding such content, there are often cases where things don't go well because of a gap, even though the opportunities could be seized if awareness were higher. For example, a famous entertainment company and an equally famous AI company recently formed a partnership: provided a proper fee is paid, characters whose rights the entertainment company holds may appear in the AI's output.
Users are very happy because they can enjoy creating content with famous characters. And the entertainment company is not being exploited; money properly flows back to it.
On the other hand, and this is strictly a personal view, in Japan, there is a tendency to perceive what AI companies are doing as a threat to their own important content, and I think there is a sense of caution.
Working to bridge these gaps is not something I can do alone, but by doing it steadily, I always want to ensure that Japan's precious content is properly utilized in the world.
──I am very happy to realize once again that you have such a passionate vision. Finally, I would like you to talk a little bit about your student days at Keio.
During my university days, I did attend international conferences, but in terms of my attitude toward research, I felt a bit detached. My stance was to do the music I liked while also doing research on signal processing, and since I couldn't devote my time to just one side, I don't think I looked like a model student.
However, I never thought I was inferior; I believe people change depending on where they direct their enthusiasm. I suddenly became motivated once I could imagine what I created being used in the world, taking shape as a product or service. Within a university, it might be a bit difficult to imagine social implementation or releasing products.
──Receiving such a message from the outside is very stimulating. Now, the number of young researchers directly aiming for social implementation is increasing even at universities, and I think universities are gradually becoming places where such things are possible, so I would be happy if you could continue to cooperate if there is an opportunity.
I look forward to your future success. Thank you very much for today.
(Recorded online on December 15, 2025)
*Affiliations and titles are those at the time of publication.