Recognition of music information from people’s voice
Background: Perfect pitch and relative pitch, pitch detection
To some extent, non-musicians are “deaf” compared with well-trained musicians, who can write down the notes of a melody just by listening to it. However, the information in a melody, including pitch, rhythm and amplitude, does exist in a non-musician’s mind, because they can sing back what they hear. What they lack is a way to name it: “perfect pitch” takes a great deal of training and talent, and “relative pitch” takes years of training. To help “deaf” non-musicians express the musical ideas in their minds, I chose this topic: designing a “translator” that analyzes a person’s humming, extracts its musical information, and turns it into music notation.
By recording a person humming an arbitrary tune, the program is supposed to extract the melody, including pitch and rhythm, turn it into music notation like the picture above, and play it back with MIDI sound.
Preliminary result: Successful pitch detection
There are two kinds of information to gain from the recorded voice:
First, the time-amplitude waveform of the sound:
I downloaded a Java sound recorder, from which I extract the array of sound amplitudes. I plot this array with the Java MathPlot library, which produced the graph below, and write the sound to outputfile.wav in the directory with the Jmusic library, which makes it easier to write WAV files.
Figure 1: Amplitude-Time graph of recorded sound
Second, the frequency-time graph:
I downloaded a Java sound frequency analysis program, which applies the YIN pitch detection algorithm (PDA) to output the fundamental frequency of the sound in real time. Below is the output graph.
Figure 2: Frequency-Time graph of recorded sound
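As a rough illustration of what a YIN-style detector does, the sketch below implements only the core steps described in the YIN paper (difference function, cumulative mean normalized difference, absolute threshold) on one buffer of samples. It is not the downloaded program used here. The class name YinSketch, the method estimatePitch and the threshold parameter are assumptions made for this example, and the refinements of the full algorithm, such as parabolic interpolation, are left out.

```java
/** Simplified YIN-style pitch estimator; a sketch, not the downloaded analysis program. */
class YinSketch {
    /**
     * Returns the estimated fundamental frequency in Hz, or -1 if no pitch is found.
     * threshold is the absolute threshold applied to the normalized difference.
     */
    static double estimatePitch(double[] samples, double sampleRate, double threshold) {
        int maxLag = samples.length / 2;
        double[] diff = new double[maxLag];

        // Step 1: squared difference between the signal and a copy shifted by each lag.
        for (int lag = 1; lag < maxLag; lag++) {
            double sum = 0.0;
            for (int i = 0; i < maxLag; i++) {
                double delta = samples[i] - samples[i + lag];
                sum += delta * delta;
            }
            diff[lag] = sum;
        }

        // Step 2: cumulative mean normalized difference,
        // d'(lag) = d(lag) * lag / (d(1) + ... + d(lag)).
        double[] normalized = new double[maxLag];
        normalized[0] = 1.0;
        double runningSum = 0.0;
        for (int lag = 1; lag < maxLag; lag++) {
            runningSum += diff[lag];
            normalized[lag] = runningSum == 0.0 ? 1.0 : diff[lag] * lag / runningSum;
        }

        // Step 3: take the first lag below the threshold, then walk down to the local minimum.
        for (int lag = 2; lag < maxLag; lag++) {
            if (normalized[lag] < threshold) {
                while (lag + 1 < maxLag && normalized[lag + 1] < normalized[lag]) {
                    lag++;
                }
                return sampleRate / lag;
            }
        }
        return -1.0; // no clear period: treat the buffer as unpitched
    }
}
```

Because this sketch only considers whole-sample lags, its frequency resolution is coarser than that of a full YIN implementation, and a silent buffer returns -1.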
Pitch detection method:
1, Click a button to separate the notes for analysis:
Using the frequency-time analysis above, I added a button to this program. While recording, every time I finish singing one note and start singing another, I click the button once to tell the computer that I am moving to a new note. Each click cuts the graph in Figure 2 into pieces, each representing one note.
2, Within each separated note, set a threshold to remove noise:
Figure 3: The analysis of one note
The resulting note information must consist of integers, each representing the index of a key on the piano. But the graph in Figure 3 shows floating-point values that fluctuate and are mixed with background noise.
Within each note, I set a threshold to discard noise frequencies that fall outside the range of the human voice. I then analyze the distribution of the remaining values: first I round the floating-point note values to integers, and then I find the integer that appears most frequently in this array. That integer is stored in a new array, which is then converted into music format by string manipulation.
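The rounding and most-frequent-value step can be sketched as follows. This is only an illustration under assumptions: the voice range bounds (80 Hz to 1100 Hz) are placeholder values, since the actual threshold is not stated, and the key index is taken to be the MIDI note number (A4 = 440 Hz = 69), which may differ from the piano key indexing used in the program. The names NoteVoting, frequencyToMidi and dominantNote are made up for this example.

```java
import java.util.HashMap;
import java.util.Map;

class NoteVoting {
    // Assumed bounds for a sung voice; the original threshold values are not stated.
    static final double MIN_VOICE_HZ = 80.0;
    static final double MAX_VOICE_HZ = 1100.0;

    /** Rounds a frequency to the nearest MIDI note number (A4 = 440 Hz = 69). */
    static int frequencyToMidi(double hz) {
        return (int) Math.round(69 + 12 * Math.log(hz / 440.0) / Math.log(2));
    }

    /** Returns the most frequent rounded note in one segment, or -1 if every frame was rejected. */
    static int dominantNote(double[] frameFrequencies) {
        Map<Integer, Integer> votes = new HashMap<>();
        int best = -1;
        int bestCount = 0;
        for (double hz : frameFrequencies) {
            // Frames outside the assumed voice range (including NaN or -1) count as noise.
            if (!(hz >= MIN_VOICE_HZ && hz <= MAX_VOICE_HZ)) {
                continue;
            }
            int note = frequencyToMidi(hz);
            int count = votes.merge(note, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = note;
            }
        }
        return best;
    }
}
```

Taking the most frequent value rather than the average means that a few stray frames, for example at the start or end of a note, do not pull the result toward a neighboring key.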
Finally, I use JFugue, another Java music library that lets me play formatted music notes, to play back the detected note sequence.
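The string manipulation that turns detected notes into playable text might look like the sketch below. It assumes that the notes are MIDI note numbers and that the target format is a note name followed by an octave number and a duration letter, with the octave computed as the note number divided by 12 (so that 60 is written as C5). That convention should be checked against the JFugue version in use. The class MusicStringBuilder and its methods are hypothetical names.

```java
class MusicStringBuilder {
    private static final String[] NAMES =
            {"C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"};

    /** Formats one note as name + octave + duration letter, e.g. 60 with 'q' gives "C5q". */
    static String token(int midiNote, char duration) {
        // Assumption: octave = note / 12, so MIDI note 60 is written as C5.
        return NAMES[midiNote % 12] + (midiNote / 12) + duration;
    }

    /** Joins all notes into one space-separated string, using the same duration for each. */
    static String join(int[] midiNotes, char duration) {
        StringBuilder sb = new StringBuilder();
        for (int note : midiNotes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(token(note, duration));
        }
        return sb.toString();
    }
}
```

The resulting string could then be handed to the music library's player. Here every note gets the same duration, because at this stage only the pitch has been detected.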
So far, it works with almost 100% accuracy, as long as I don’t sing so fast that the frequency analysis in Figure 2 gets messy.
In the pitch detection above, the program can only get rhythm information from the times of my clicks, but:
1, People may not click at exactly the right time, in sync with their voice.
2, Clicking is not convenient. If a person wants to detect the sound of an instrument, say a guitar, both hands will be on the guitar instead of the keyboard.
3, For music notation, the floating-point time values need to be converted into standard musical durations, like below:
Durations that can be specified for a note:
whole, half, quarter, eighth, sixteenth, thirty-second, sixty-fourth, one-twenty-eighth
Goal for this algorithm:
Since everyone sings at a different and inexact speed, it is necessary to design an algorithm to:
1, Automatically decide when a note starts and ends, based on the shape of the amplitude-time and frequency-time graphs, without a click button.
2, Translate the floating-point time length of every note into one of the standard note durations, so that the computer can read it.
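One possible way to approach the second goal is sketched below: each measured length is compared with the length of a whole note at a given tempo and snapped to the nearest duration in the list of standard durations. The tempo is an assumption here, passed in as beats per minute with one beat equal to a quarter note; how the tempo would be found from the humming itself is not covered. DurationQuantizer and quantize are hypothetical names.

```java
class DurationQuantizer {
    static final String[] NAMES = {"whole", "half", "quarter", "eighth", "sixteenth",
            "thirty-second", "sixty-fourth", "one-twenty-eighth"};

    /** Maps a measured length in seconds to the nearest standard duration at the given tempo. */
    static String quantize(double seconds, double beatsPerMinute) {
        if (seconds <= 0 || beatsPerMinute <= 0) {
            throw new IllegalArgumentException("seconds and tempo must be positive");
        }
        double wholeNoteSeconds = 4 * 60.0 / beatsPerMinute; // four quarter-note beats
        double fraction = seconds / wholeNoteSeconds;
        // NAMES[k] lasts 1 / 2^k of a whole note, so pick the k nearest to -log2(fraction).
        long k = Math.round(-Math.log(fraction) / Math.log(2));
        // Lengths outside the list are clamped to the longest or shortest duration.
        int index = (int) Math.max(0, Math.min(NAMES.length - 1, k));
        return NAMES[index];
    }
}
```

At an assumed tempo of 120 beats per minute a whole note lasts 2 seconds, so a note held for 0.5 seconds maps to "quarter". Rounding on a logarithmic scale treats "twice as long" and "half as long" symmetrically, which suits durations that differ by factors of two.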
Every time a person starts humming a new note, there tends to be a sharp rise in the curve of Figure 1, the amplitude-time graph. The frequency curve in Figure 2 also tends to fluctuate together with the amplitude. By considering the fluctuations in both graphs, the probability that the humming is moving to a different note can be estimated.
Noise affects the shape of both the amplitude-time graph and the frequency-time graph, which can make it hard to get an accurate beat. For example, in Figure 1, the first big rise is the sound of a keyboard click, and there are many small ups and downs that might be caused by background noise. In Figure 2, there are also many noise frequency points scattered around the graph.
Above is an experiment that puts the amplitude-time information and the frequency-time information together in one picture.
The sample rate is 44100 Hz.
The buffer size is 512, which means that each point in this picture represents 512 samples, or about 11.6 ms of audio at 44100 Hz.
The white points represent the amplitude while I'm singing. A threshold is set to detect the gaps between notes.
The blue points represent the calculated frequency, rounded to integers.
The green points are the detected beats. Basically, the program tries to find the points where different frequencies meet and the points between two amplitude peaks.
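The idea behind the green points can be sketched roughly as follows, assuming one amplitude value and one rounded note value per 512-sample frame. A new note is marked where the amplitude rises above the threshold after a quiet gap, or where the rounded note changes while the voice is still sounding. The class BoundaryDetector, its method names and the threshold parameter are assumptions for this example, not the program's actual code.

```java
import java.util.ArrayList;
import java.util.List;

class BoundaryDetector {
    /**
     * Returns the frame indices where a new note appears to start.
     * amplitude[i] and note[i] describe the same 512-sample frame.
     */
    static List<Integer> noteStarts(double[] amplitude, int[] note, double silenceThreshold) {
        List<Integer> starts = new ArrayList<>();
        boolean previousVoiced = false;
        for (int i = 0; i < amplitude.length; i++) {
            boolean voiced = amplitude[i] >= silenceThreshold;
            if (voiced) {
                // The amplitude has just risen out of a quiet gap.
                boolean afterGap = !previousVoiced;
                // The rounded frequency stepped to another value without a gap.
                boolean pitchChanged = previousVoiced && note[i] != note[i - 1];
                if (afterGap || pitchChanged) {
                    starts.add(i);
                }
            }
            previousVoiced = voiced;
        }
        return starts;
    }

    /** Converts a frame index to seconds for a 512-sample buffer at 44100 Hz. */
    static double frameToSeconds(int frame) {
        return frame * 512 / 44100.0;
    }
}
```

A more robust version would probably require a pitch change to persist for several frames before accepting it, since a single noisy frequency point would otherwise be counted as a new note.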
The result is very good when I sing with more than just vowels. For example, syllables such as "da", "ba" and "bang" make the beat much easier to detect, giving quite clear lines like those above. But if I sing with only "a", "o" or "i", the lines become hard to identify, like below.
YIN, a fundamental frequency estimator for speech and music
Alain de Cheveigné
Ircam-CNRS, 1 place Igor Stravinsky, 75004 Paris, France
Received 7 June 2001; revised 10 October 2001; accepted 9 January 2002