Thursday, April 24, 2008

References

http://www.scena.org/lsm/sm4-1/sm4-1VoiceDoc_en.htm - Comfortable Vocal Ranges

http://cnx.org/content/m11716/latest/ - HPS

http://cnx.org/content/m11714/latest/ - HPS

http://web.media.mit.edu/~moo/thesis/YEK_thesis.pdf - Singing Voice Analysis/Synthesis (Thesis)

http://www.phys.unsw.edu.au/jw/notes.html - Musical Scale

Wednesday, April 23, 2008

Matlab Code

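%Load the sample singing voice and play it back
%(audioread replaces wavread, which newer Matlab releases no longer provide)
[song,fs] = audioread('stevethirdscale.wav');
soundsc(song,fs)
song_length = length(song);

%Number of 0.5 second frames (8000 samples each at a 16 kHz sampling rate)
samples = round(song_length/round(0.5*fs));


%The song we are testing against (scale from C5 to B5, in Hz), find its length
the_song = [523.2 587.3 659.2 698.4 783.9 880 987.7];
test_length = length(the_song);


%Counters used by the comparison
correct = 0;
deviation = 0;

%matrices to store results (one entry per frame)
dev_matrix = zeros(1,samples);
pitch_matrix = zeros(1,samples);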

Analysis Of Results

Average Error - Best Two Results

Trained Female - 244.5
Untrained Female - 369
Untrained Male - 1108

From this experiment I was able to determine that pitch is a reasonable measure for singing quality. The major problem I came across is that using pitch alone does not account for octave changes, which human listeners may not easily perceive. On several occasions a sample jumped from, for example, a D5 to a D6. For this system that is a significant error, but someone listening might not notice it.
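To make the octave problem concrete, here is a small worked example. The octave-folded error at the end is only one possible way to discount octave jumps; it is an assumption for illustration and was not part of the system described here.

```matlab
% Octave jump from D5 to D6: large error in Hz, same note name.
target = 587.33;                 % D5 in Hz
sung = 2 * target;               % D6, exactly one octave higher

raw_error_hz = abs(sung - target);        % what a plain Hz comparison sees
semitones = 12 * log2(sung / target);     % interval in semitones

% Hypothetical alternative: ignore whole octaves before measuring the error.
folded_error = abs(semitones - 12 * round(semitones / 12));

fprintf('Raw error: %.2f Hz, interval: %.1f semitones, folded error: %.1f\n', ...
        raw_error_hz, semitones, folded_error);
```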

The result is that this is most useful as a tool to allow singers to train their vocal mimicry. It forces the singer to stay on pitch and in time, which, if practiced, would lead to an overall improvement of that singer's vocal control.

Another limitation of this system is that it is not song independent. In order for it to work, you need to be testing against a specific song. Moreover, the correct frequency and timing need to be known a priori. This makes the system less robust than an implementation that could be used on an arbitrary song.

Compared to other works in the field, this implementation was the only one I ran across that used pitch as the only metric for singing quality. Other implementations used more subtle variations in vocal quality, HMMs with training data, and other classification methods to differentiate between good and bad singers.

Given the limitations of the system, I am extremely happy with the results.

Results (Files)

Simulations



Matlab Simulations

Audio Samples



Trained Female





Untrained Female




Untrained Male



Solving The Problem

Collecting Samples



My solution to the problem required three distinct tasks: segmenting the singing, determining the pitch, and comparing the data.

Segmentation began with collecting samples of singing from three volunteers. First, samples were collected for a simple scale: the singers were asked to sing one octave above middle C (in the typical vocal range of a soprano). Three samples were taken from each singer. Afterwards, given sheet music for the song "Mary Had A Little Lamb," each of the singers was asked to follow the sheet music as closely as possible. After looking at all of the data produced, I discovered that, unlike in normal speech, pitch changes in singing were substantially more regular.

This makes intuitive sense. A musical score is broken down into quarter, half and whole notes. As a result, "correct" singing means agreeing with the score not only in pitch but also in time. After additional experimentation, I determined that 0.5 second frames produced the most accurate data.
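A minimal sketch of the fixed-length framing described above. The 16 kHz sampling rate, the synthetic stand-in signal, and the variable names are assumptions for illustration, not the original project code.

```matlab
% Split a mono signal into non-overlapping 0.5 s frames.
fs = 16000;                          % assumed sampling rate in Hz
t = (0:(3.25*fs - 1))' / fs;         % 3.25 s of synthetic audio
song = sin(2*pi*523.25*t);           % stand-in for a recorded voice

frame_len = round(0.5 * fs);         % samples per 0.5 s frame
num_frames = floor(length(song) / frame_len);   % drop the incomplete tail

% Each column of frames holds one 0.5 s segment.
frames = reshape(song(1:num_frames*frame_len), frame_len, num_frames);
```

Using floor here discards a trailing partial frame instead of reading past the end of the recording.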


(Sample of a scale - simple inspection shows a very regular note distribution)

Initial Constraints



While the software could easily be modified to measure any frequency range, I constrained this experiment to testing the vocal quality of soprano singers. There were several reasons why this was an obvious choice, considering the scope of the project. The first was available samples. In order to test that the algorithm could actually separate "good" and "bad" singers, I needed a sample from a trained singer. The trained singer that I was able to find was a soprano.

This choice also simplified the problem mathematically. In the lower frequency ranges, the frequency difference between adjacent notes is much smaller. As a result, some notes could be misclassified because of little more than an inconsistency in the algorithm. Restricting the test to sopranos and, to a lesser degree, tenors (the male sample) gave a much wider spacing between note frequencies and reduced the chance of resolution-based note misclassifications.
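The spacing argument can be checked with a little arithmetic. This sketch assumes equal temperament with A4 = 440 Hz and uses MIDI note numbers (60 = middle C) purely as a convenient indexing convention.

```matlab
% Equal-tempered pitch from a MIDI note number (A4 = 69 = 440 Hz).
note_freq = @(m) 440 * 2.^((m - 69) / 12);

c3_gap = note_freq(49) - note_freq(48);   % C3 to C#3, about 7.8 Hz
c5_gap = note_freq(73) - note_freq(72);   % C5 to C#5, about 31.1 Hz

fprintf('C3 to C#3: %.1f Hz\n', c3_gap);
fprintf('C5 to C#5: %.1f Hz\n', c5_gap);
```

Two octaves up, neighbouring notes are four times farther apart in Hz, so the same frequency resolution leaves much more margin.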

Harmonic Product Spectrum




(Harmonic Product Spectrum - http://cnx.org/content/m11714/latest/)

After segmenting the speech, the next biggest concern was determining pitch.

For that task, I chose to use Harmonic Product Spectrum. Harmonic Product Spectrum is a pitch detection algorithm best suited for the detection of musical notes. It works by windowing the input signal, taking its spectrum, and downsampling that spectrum several times (as illustrated in the figure above). The motivation is that the spectrum should consist of peaks at integer multiples of the fundamental frequency. After downsampling, the strongest peaks line up at the fundamental. When we multiply the downsampled spectra together, the largest peak of the product falls at the fundamental frequency of the signal.
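The steps above can be sketched on a single frame as follows. This is a generic illustration, not the modified HPS function used in the project; the sampling rate, frame size, synthetic test tone, and the choice of four harmonics are all assumptions.

```matlab
% Harmonic Product Spectrum on one frame (sketch; parameters are assumptions).
fs = 16000;                          % assumed sampling rate in Hz
f0 = 587.33;                         % test pitch: D5
t = (0:8191)' / fs;                  % one 8192-sample analysis frame

% Synthetic "voice": fundamental plus three weaker harmonics.
x = sin(2*pi*f0*t) + 0.6*sin(2*pi*2*f0*t) ...
    + 0.4*sin(2*pi*3*f0*t) + 0.3*sin(2*pi*4*f0*t);

n = length(x);
w = 0.5 - 0.5*cos(2*pi*(0:n-1)'/(n-1));   % Hann window, written out by hand
spec = abs(fft(x .* w));
spec = spec(1:floor(n/2));                % keep the non-negative frequencies

num_harmonics = 4;                        % assumed number of spectra multiplied
hps_len = floor(length(spec) / num_harmonics);
hps = spec(1:hps_len);
for h = 2:num_harmonics
    decimated = spec(1:h:end);            % spectrum compressed by factor h
    hps = hps .* decimated(1:hps_len);
end

[~, peak_bin] = max(hps);
pitch_hz = (peak_bin - 1) * fs / n;       % bin index to frequency in Hz
```

The estimate is quantized to the FFT bin spacing fs/n, which is just under 2 Hz for these values.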

I ran this pitch detection algorithm over every frame of the segmented audio. The algorithm further windowed each frame, to test for variations from a pure tone. Since there were some minor outliers in every segment, I removed them and took the most common detected pitch as the correct one. The HPS algorithm I used is a slightly modified version of a standard HPS function found in Matlab.
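One way to pick a single pitch per segment is sketched below. The pitch values are made up, and the outlier rule (discard anything more than 10% from the median) is an assumption; the rule actually used in the project is not stated.

```matlab
% Per-window pitch estimates (Hz) inside one 0.5 s segment; made-up values.
window_pitches = [586 586 588 586 1172 586 293 586 588 586];

% Assumed outlier rule: drop estimates more than 10% away from the median.
med = median(window_pitches);
kept = window_pitches(abs(window_pitches - med) <= 0.1 * med);

% The most common remaining estimate becomes the pitch for the segment.
segment_pitch = mode(kept);
```

Here the octave errors at 1172 Hz and 293 Hz are discarded before the vote.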

Comparing Data



The comparison will be explained in much greater detail in the results section of this report.

The process involved creating a mapping of a sample song. Using the sheet music for the major scale that I chose, I mapped the frequency and the "time" to a matrix. For example, if someone was supposed to hold a D5 for two time steps, then the matrix entry would be [. . . 587 587 . . .]. Using that, I was able to come up with a composite of what the song was supposed to look like pitch-wise. By finding the absolute value of the error at every time step, I generated one interpretation of vocal quality. The metric specifically measures how well the singer was able to match the pitch and timing of a particular piece of sheet music.

On top of that, I produced a transcript of the notes that the singer produced and how close those notes were to perfect pitch. For a singer training with this system, that information would be crucial in determining not only how well they could mimic a song, note for note, but also whether they were accurately singing the pitches they were trying to produce.
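A minimal sketch of that comparison, with one entry per 0.5 s time step. The sung values are invented, and how the per-step errors were combined into the reported scores is an assumption; both a sum and a mean are shown.

```matlab
% Target pitch per time step (Hz); D5 is held for two steps.
target = [523.2 587.3 587.3 659.2 698.4];

% Detected pitch per time step; made-up values for illustration only.
sung = [520.0 590.0 587.3 650.0 698.4];

dev = abs(sung - target);      % absolute error at every time step, in Hz
total_error = sum(dev);
mean_error = mean(dev);
```

Because the target is indexed by time step, a note sung at the right pitch but at the wrong time still counts as an error.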
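A transcript of this kind could be generated roughly as follows. The sketch assumes equal temperament with A4 = 440 Hz and reports the deviation in cents (100 cents = one semitone); the detected pitches and the output format are invented for illustration.

```matlab
% Map detected frequencies to the nearest note name and a cents deviation.
names = {'C','C#','D','D#','E','F','F#','G','G#','A','A#','B'};
sung = [520.0 590.0 659.26 880.0];        % made-up detected pitches in Hz

note_labels = cell(size(sung));
cents_off = zeros(size(sung));
for k = 1:length(sung)
    midi_exact = 69 + 12*log2(sung(k)/440);   % fractional MIDI note number
    midi_near = round(midi_exact);            % nearest equal-tempered note
    cents_off(k) = 100 * (midi_exact - midi_near);
    octave = floor(midi_near/12) - 1;         % MIDI 60 = C4 (middle C)
    note_labels{k} = sprintf('%s%d', names{mod(midi_near,12)+1}, octave);
    fprintf('%6.1f Hz -> %s (%+.0f cents)\n', ...
            sung(k), note_labels{k}, cents_off(k));
end
```

A negative cents value means the singer was flat relative to the nearest note, and a positive value means sharp.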

Other Research

For this project, the most important audio analysis tool that I used was Harmonic Product Spectrum as my pitch detection algorithm. For musical notes and similar input signals, HPS is an extremely robust method of pitch detection. As applied to this research, it was tested against a pure tone "middle C" and it correctly identified its pitch to within 2%.



Another method of pitch detection, described in Youngmoo Edmund Kim's thesis "Singing Voice Analysis/Synthesis," is pitch detection using the autocorrelation function.

Principally, this algorithm exploits the fact that periodic signals are similar from one period to the next. As a result, the algorithm only requires that you window the signal and take the autocorrelation of the signal. By differentiating this data set and searching for the minima, you can find the fundamental period (and thus the frequency) of the windowed signal.
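A generic sketch of autocorrelation pitch detection on one frame. It is not Kim's implementation: for simplicity it picks the lag with the largest autocorrelation directly instead of differentiating, and the sampling rate, test tone, and 200 to 1100 Hz search range are assumptions.

```matlab
% Autocorrelation pitch estimate on one windowed frame (generic sketch).
fs = 16000;                       % assumed sampling rate in Hz
f0 = 500;                         % test tone; period is exactly 32 samples
x = sin(2*pi*f0*(0:1023)'/fs);    % one 1024-sample frame

% Search only lags that correspond to plausible pitches (assumed 200-1100 Hz).
min_lag = floor(fs/1100);
max_lag = ceil(fs/200);
lags = min_lag:max_lag;
r = zeros(size(lags));
for k = 1:length(lags)
    L = lags(k);
    r(k) = sum(x(1:end-L) .* x(1+L:end));   % autocorrelation at lag L
end

[~, best] = max(r);               % lag where the frame best matches itself
period = lags(best);              % fundamental period in samples
pitch_hz = fs / period;
```

The estimate is limited to whole-sample periods, so its resolution gets coarser as the pitch rises.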

Kim chose this method because it was computationally inexpensive. In addition, since his actual task involved speech coding, he needed the autocorrelation values for each frame for his LPC calculations.

Two other pitch detection techniques, common in speech analysis applications but not used in musical applications, are zero-crossing rate and cepstral coefficients.

(Image credit - http://cnx.org/content/m11714/latest/)

Problem Statement

My project tackles the problem of creating an objective measure for singing quality. Within the scope of this project, quality will be approximated by pitch. More specifically, a singer will be determined to be "good" if the pitch of the notes that they produce is close to the target pitch of the song they are trying to mimic. My software will produce a transcript of the notes they produced and a pitch comparison between what they sang and what they "should have" sung.