Human Evaluation

We conduct subjective evaluations by collecting annotations on the quality of the translated songs from music school students. Following the mean opinion score (MOS) used in speech synthesis, we use five-point Likert scales (1 for bad, 5 for excellent). We evaluate the songs on four dimensions:

  • sense, fidelity to the meaning of the source lyric;
  • style, whether the translated lyric resembles song-text style;
  • listenability, whether the translated lyric sounds melodious with the given melody;
  • intelligibility, whether the audience can easily comprehend the translated lyric when it is sung with the provided melody.

The last two dimensions require the annotators to sing the song.
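
As an illustration of how such ratings can be aggregated, the following is a minimal sketch that averages five-point ratings per dimension in the usual MOS fashion. The function name, the data layout, and the ratings themselves are assumptions made for this example only; they are placeholders, not data or code from our study.

```python
from statistics import mean

# The four dimensions listed above.
DIMENSIONS = ("sense", "style", "listenability", "intelligibility")


def mean_opinion_scores(annotations):
    """Average the 1-5 ratings of each dimension over all annotations.

    `annotations` is a hypothetical layout: one dict per annotator
    judgment, mapping each dimension name to an integer rating.
    """
    scores = {}
    for dim in DIMENSIONS:
        ratings = [a[dim] for a in annotations]
        # Reject anything outside the five-point scale.
        if any(not 1 <= r <= 5 for r in ratings):
            raise ValueError(f"rating out of range for {dim!r}")
        scores[dim] = mean(ratings)
    return scores


# Placeholder ratings, invented purely to exercise the function.
example = [
    {"sense": 4, "style": 3, "listenability": 5, "intelligibility": 4},
    {"sense": 2, "style": 3, "listenability": 3, "intelligibility": 4},
]
mos = mean_opinion_scores(example)
```

Reporting one score per dimension, rather than a single pooled average, keeps the four criteria separately interpretable.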


The instructions we sent to each annotator are shown below:

Music Sheets

《Song 1》

Sheet 1
Sheet 2

《Song 2》

Sheet 1
Sheet 2

《Song 3》

Sheet 1
Sheet 2

《Song 4》

Sheet 1
Sheet 2

《Song 5》

Sheet 1
Sheet 2