Collecting Samples
My solution to the problem required three distinct tasks: segmenting the singing, determining the pitch, and comparing the data.
To work out how to segment the singing, I collected samples from three volunteers. First, samples were collected for a simple scale: the singers were asked to sing one octave above middle C (in the typical vocal range of a soprano), and three samples were taken from each singer. Afterwards, each singer was given the sheet music for "Mary Had A Little Lamb" and asked to follow it as closely as possible. After looking at all of the data produced, I discovered that, unlike in normal speech, pitch changes in singing were substantially more regular.
This makes intuitive sense. A musical score is broken down into quarter, half, and whole notes, so pitch changes fall on a regular time grid. As a result, "correct" singing means agreeing with the score not only in pitch but also in timing. After additional experimentation, I found that 0.5-second frames produced the most accurate data.
(Sample of a scale - simple inspection shows a very regular note distribution)
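As a rough illustration of the fixed-length framing described above, here is a minimal Python sketch. It is not the code used in the project (which was written in Matlab); the function name, the 8 kHz sample rate, and the choice to drop a trailing partial frame are all assumptions made for the example.

```python
def split_into_frames(samples, sample_rate, frame_seconds=0.5):
    """Split a sequence of audio samples into consecutive fixed-length frames."""
    frame_len = int(round(sample_rate * frame_seconds))
    if frame_len <= 0:
        raise ValueError("frame length must be at least one sample")
    # Assumption: drop a trailing partial frame so every frame has equal duration.
    n_frames = len(samples) // frame_len
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]


# Placeholder audio: 17600 silent samples (2.2 s at a hypothetical 8 kHz rate).
sample_rate = 8000
samples = [0.0] * 17600
frames = split_into_frames(samples, sample_rate)
print(len(frames), "frames of", len(frames[0]), "samples each")
```

Each frame can then be passed to the pitch detector on its own, which yields one pitch estimate per half second.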
Initial Constraints
While the software could easily be modified to measure any frequency range, I constrained this experiment to testing the vocal quality of soprano singers. Given the scope of the project, there were several reasons why this was an obvious choice. The first was the availability of samples. To test that the algorithm could actually separate "good" and "bad" singers, I needed a sample from a trained singer, and the trained singer I was able to find was a soprano.
This choice also simplified the problem mathematically. In the lower frequency ranges, the frequency difference between adjacent notes is much smaller. As a result, some notes could be misclassified because of little more than a small inconsistency in the algorithm. Restricting the test to sopranos, and to a lesser degree tenors (the male sample), gave a much wider spacing between note frequencies and reduced the chance of resolution-based note misclassifications.
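To put numbers on the note-spacing argument, the sketch below computes the gap in Hz between a note and the semitone above it. It assumes twelve-tone equal temperament with A4 = 440 Hz and the usual MIDI numbering (A4 = 69); the post does not state a tuning, although its D5 = 587 Hz figure is consistent with this one. The function names are made up for the example.

```python
def note_frequency(midi_number, a4_hz=440.0):
    """Equal-temperament frequency for a MIDI note number (assumes A4 = 69)."""
    return a4_hz * 2.0 ** ((midi_number - 69) / 12.0)


def semitone_gap(midi_number):
    """Distance in Hz from a note to the semitone directly above it."""
    return note_frequency(midi_number + 1) - note_frequency(midi_number)


for name, midi in [("C3", 48), ("C4", 60), ("C5", 72)]:
    print(name, round(note_frequency(midi), 1), "Hz, next semitone is",
          round(semitone_gap(midi), 1), "Hz higher")
```

Under these assumptions the gap is a little under 8 Hz at C3 and about 31 Hz at C5, so a detector with a fixed frequency resolution has roughly four times more room for error two octaves up.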
Harmonic Product Spectrum
(Harmonic Product Spectrum - http://cnx.org/content/m11714/latest/)
After segmenting the speech, the next biggest concern was determining pitch.
For that task, I chose the Harmonic Product Spectrum (HPS), a pitch detection algorithm well suited to the detection of musical notes. It works by segmenting the input signal, computing the spectrum of each segment, and downsampling that spectrum several times (as illustrated in the figure above). The motivation is that the spectrum of a sung note should consist of peaks at integer multiples of the fundamental frequency. After downsampling, the strongest peaks line up at the fundamental. When we multiply the downsampled spectra together, the largest peak in the product marks the fundamental frequency of the signal.
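The idea can be sketched in a few lines of Python. This is a minimal illustration and not the modified Matlab function used in the project: the small recursive FFT helper, the Hann window, the use of three harmonics, and the synthetic test tone are all assumptions chosen to keep the example self-contained.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def hps_pitch(frame, sample_rate, n_harmonics=3):
    """Estimate the fundamental of one frame with the Harmonic Product Spectrum."""
    n = len(frame)
    # Hann window to limit spectral leakage (an assumption of this sketch).
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    magnitude = [abs(c) for c in fft(windowed)[: n // 2]]
    # Downsampling the spectrum by h keeps every h-th bin, so bin k of the
    # downsampled copy holds the original bin k * h. Multiply the copies.
    limit = len(magnitude) // n_harmonics
    product = magnitude[:limit]
    for h in range(2, n_harmonics + 1):
        product = [product[k] * magnitude[k * h] for k in range(limit)]
    # The largest product (ignoring the DC bin) marks the fundamental.
    peak_bin = max(range(1, limit), key=lambda k: product[k])
    return peak_bin * sample_rate / n


# Synthetic test tone: a 588 Hz fundamental plus two weaker harmonics.
sample_rate = 8192
n = 2048  # bin spacing is sample_rate / n = 4 Hz
f0 = 588.0
frame = [
    math.sin(2 * math.pi * f0 * i / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 2 * f0 * i / sample_rate)
    + 0.25 * math.sin(2 * math.pi * 3 * f0 * i / sample_rate)
    for i in range(n)
]
detected = hps_pitch(frame, sample_rate)
print("detected fundamental:", detected, "Hz")
```

The estimate can only land on a bin centre, so its resolution is the sample rate divided by the frame length. That is the resolution limit behind the note misclassification concern for lower voices.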
I ran this pitch detection algorithm over every segment of the singing. The algorithm further windowed each segment to test for variations from a pure tone. Since there were some minor outliers in every segment, I removed them and took the most commonly detected pitch as the correct one. The HPS algorithm I used is a slightly modified version of a standard HPS function found in Matlab.
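One way to pick the most common pitch within a segment is sketched below. The post does not say how estimates were grouped, so this example makes its own assumptions: each window's estimate is rounded to the nearest equal-temperament semitone (A4 = 440 Hz), the most frequent semitone wins, and the estimates that agree with it are averaged. The function names and sample values are hypothetical.

```python
from collections import Counter
import math


def nearest_midi(freq_hz, a4_hz=440.0):
    """Round a frequency to the nearest equal-temperament MIDI note number."""
    return int(round(69 + 12 * math.log2(freq_hz / a4_hz)))


def segment_pitch(window_estimates_hz):
    """Return one pitch for a segment, ignoring windows that disagree."""
    valid = [f for f in window_estimates_hz if f > 0]
    notes = [nearest_midi(f) for f in valid]
    most_common_note, _ = Counter(notes).most_common(1)[0]
    # Assumption: average only the estimates that round to the winning note.
    agreeing = [f for f in valid if nearest_midi(f) == most_common_note]
    return sum(agreeing) / len(agreeing)


# Made-up per-window estimates for one segment, with two octave-error outliers.
estimates = [586.0, 588.0, 1176.0, 587.0, 294.0, 588.0]
print(segment_pitch(estimates))
```

Here the 1176 Hz and 294 Hz values, typical octave errors, are outvoted by the four estimates near D5.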
Comparing Data
The comparison will be explained in much greater detail in the results section of this report.
The process involved creating a mapping of a sample song. Using the sheet music for the major scale that I chose, I mapped the frequency and the "time" of each note to a matrix. For example, if someone was supposed to hold a D5 for two time steps, the matrix would contain [. . . 587 587 . . .]. Using that, I was able to build a composite of what the song was supposed to look like pitch-wise. By finding the absolute value of the error at every time step, I generated one interpretation of vocal quality. The metric specifically measures how well the singer was able to match the pitch and timing of a particular piece of sheet music.
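A minimal sketch of this mapping and error calculation follows. The score fragment, the "sung" values, and the function names are invented for illustration, and summarising the per-step errors with a mean is an assumption of the example, not something the post specifies.

```python
def expand_score(score):
    """Turn (frequency_hz, n_steps) pairs into one target frequency per time step."""
    target = []
    for freq_hz, n_steps in score:
        target.extend([freq_hz] * n_steps)
    return target


def absolute_errors(target, sung):
    """Absolute pitch error in Hz at every time step."""
    if len(target) != len(sung):
        raise ValueError("target and sung must cover the same number of time steps")
    return [abs(t - s) for t, s in zip(target, sung)]


# Hypothetical fragment: C5 for one step, D5 held for two steps, E5 for one step.
score = [(523, 1), (587, 2), (659, 1)]
target = expand_score(score)
sung = [520, 590, 587, 640]  # made-up detected pitches, one per time step
errors = absolute_errors(target, sung)
mean_error = sum(errors) / len(errors)
print(target)
print(errors, mean_error)
```

Because the target is laid out per time step, a singer who changes note late is penalised at the steps where the old note lingers, so the same number captures both pitch and timing.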
On top of that, I produced a transcript of the notes that the singer produced and how close those notes were to perfect pitch. For a singer training with this system, that information would be crucial in determining not only how well they could mimic a song, note for note, but also whether they were accurately hitting the pitches they were trying to produce.
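A transcript of this kind could be produced along the following lines. The sketch assumes equal temperament with A4 = 440 Hz and reports the deviation in cents (hundredths of a semitone); the post does not say which unit was used, and the function name and input frequencies are hypothetical.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def describe_pitch(freq_hz, a4_hz=440.0):
    """Return the nearest note name and the deviation from it in cents."""
    midi_float = 69 + 12 * math.log2(freq_hz / a4_hz)
    midi = int(round(midi_float))
    # MIDI note 60 is C4 (middle C), so the octave number is midi // 12 - 1.
    name = NOTE_NAMES[midi % 12] + str(midi // 12 - 1)
    cents = 100 * (midi_float - midi)
    return name, cents


# Made-up detected pitches for three consecutive notes.
for freq in [520.0, 590.0, 440.0]:
    name, cents = describe_pitch(freq)
    print(f"{freq:.0f} Hz -> {name} ({cents:+.0f} cents)")
```

A negative value means the note was sung flat and a positive value means it was sung sharp, which separates "hit the wrong note" from "hit the right note slightly off".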