Fourth lecture

  • Presentation of Zhongshi He

Some students of the College of Computer Science at Chongqing University, under the direction of Zhongshi He, published a very interesting paper on speech emotion recognition.


Zhongshi He received the B.S. degree in applied mathematics and the Ph.D. degree in computer science from Chongqing University, Chongqing, China, in 1987 and 1996, respectively. He was a Postdoctoral Fellow in the School of Computer Science at the University of the Witwatersrand in South Africa from September 1999 to August 2001. He is currently a Full Professor, Ph.D. Supervisor, and the Vice-Dean of the College of Computer Science and Technology at Chongqing University. He is a member of the AIPR (Artificial Intelligence and Pattern Recognition, editor’s note) Professional Committee of the China Computer Federation. His fields of research include machine learning and data mining, natural language computing, and image processing.

  • Speech emotion recognition

Presentation

Recording someone during a speech to increase the information you can extract from it, such as emotions of course, seems a promising idea. However, research in this field has not been very conclusive so far. This paper proposes a solution to greatly increase the recognition rate.

“Speech emotion recognition (SER) is to study the formation and change of speaker’s emotional state from the speech signal perspective, to make the interaction between human and computer more intelligent.”

Their purpose is to analyze a voice signal to uncover the speaker’s inner emotions and thought processes. By following how these change over time, they want to obtain “a more intelligent and natural human-computer interaction”. This would help develop new HCI systems and move closer to realizing artificial intelligence.

The research

The key to traditional machine learning methods for SER is feature selection, which directly determines recognition accuracy.
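To make “feature selection” concrete, here is the kind of hand-crafted feature pipeline such traditional systems rely on. This is an illustration of mine, not code from the paper; the file name and parameter values are placeholders.

```python
# Illustration of a classical SER feature pipeline (not from the paper):
# extract MFCCs per frame, then summarize them into one utterance-level vector.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)        # placeholder file, 16 kHz
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # 13 MFCCs per frame
features = np.concatenate([mfcc.mean(axis=1),       # utterance-level statistics,
                           mfcc.std(axis=1)])       # fed to a classical classifier
```

A classifier such as an SVM is then trained on these vectors, so the choice and quality of the features bounds the accuracy the system can reach.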

It seems that recognition accuracy is still lacking, and those studies remain far from practical application. The solution proposed in this paper is DRCNNs, which consists of two parts:

  1. Data Augmentation Algorithm Based on Retinal Imaging Principle (DAARIP), using the principle of retinal and convex-lens imaging (a sketch of the idea follows this list).
  2. Deep Convolutional Neural Networks (DCNNs), which can extract high-level features from the spectrogram and make precise predictions.
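These notes do not reproduce the DAARIP algorithm itself, but the retina analogy suggests generating copies of each spectrogram that appear nearer or farther within a fixed frame. Here is a minimal Python sketch of that reading; the scale factors are arbitrary choices of mine, not values from the paper.

```python
# Minimal sketch of the retina-imaging augmentation idea (my reading of DAARIP):
# the same spectrogram is made to look nearer (larger) or farther (smaller)
# on a fixed-size canvas, imitating images of one object seen at different distances.
import numpy as np
from scipy.ndimage import zoom

def retina_augment(spec, scales=(0.7, 0.85, 1.15, 1.3)):
    """Yield copies of `spec` whose content appears nearer or farther on a fixed canvas."""
    h, w = spec.shape
    for s in scales:
        scaled = zoom(spec, s)                  # rescale, as if the object moved
        if s < 1:                               # farther away: smaller image, zero padding
            canvas = np.zeros_like(spec)
            y = (h - scaled.shape[0]) // 2
            x = (w - scaled.shape[1]) // 2
            canvas[y:y + scaled.shape[0], x:x + scaled.shape[1]] = scaled
        else:                                   # closer: larger image, center crop
            y = (scaled.shape[0] - h) // 2
            x = (scaled.shape[1] - w) // 2
            canvas = scaled[y:y + h, x:x + w]
        yield canvas
```

Each augmented copy keeps the label of the original utterance, which multiplies the amount of training data at no recording cost.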

They reach 99% accuracy with this method, which is very promising for real-life applications.
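For the second component, the network itself is not described in these notes, so the following is only a minimal PyTorch sketch of a DCNN classifier over fixed-size spectrograms. The layer sizes, input size, and six-emotion output are my assumptions, not the paper’s architecture.

```python
# Minimal spectrogram CNN sketch (illustrative only, not the paper's DCNNs).
import torch
import torch.nn as nn

class SpectrogramCNN(nn.Module):
    def __init__(self, n_emotions: int = 6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # 128x128 -> 64x64
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                          # 64x64 -> 32x32
        )
        self.classifier = nn.Linear(32 * 32 * 32, n_emotions)

    def forward(self, x):                             # x: (batch, 1, 128, 128)
        h = self.features(x)
        return self.classifier(h.flatten(1))          # logits per emotion class

logits = SpectrogramCNN()(torch.randn(4, 1, 128, 128))  # smoke test
```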

Criticism

“As we all know, the closer we get to the object, the bigger we see it. In other words, what we see in our retina is different because of the different distance. But it does not affect our recognition. Since our brains have learned high-level features of the object, we can accurately identify the same thing of different sizes.”


This whole explanation makes little sense in terms of optics and photography, since only the depth of field really changes here. Even though this paper is not about photography, never mentioning the aperture seems very surprising to me: millimetres measure focal length, and depth of field depends far more on the aperture. Even if they kept a fixed aperture throughout, that precision was worth writing down. I therefore wonder why they picked this solution instead of another.
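For reference, the standard depth-of-field approximation from textbook optics (not a formula quoted from the paper), which shows the dependence on aperture this criticism points at:

```latex
% Approximate depth of field for an object at distance u >> focal length f,
% with N the f-number (aperture) and c the circle-of-confusion diameter:
\mathrm{DoF} \approx \frac{2 N c\, u^{2}}{f^{2}}
% DoF scales linearly with the aperture's f-number N, while f (in mm)
% is the focal length: two different quantities.
```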

Opening

They plan to extend the proposed method and evaluate its performance on multilingual speech emotion databases in the near future, in order to bring it to real-life use across a wider range of emotions and speakers.

  • Source

https://arxiv.org/pdf/1707.09917.pdf
