moDernisT


moDernisT began with a simple question: what sounds are deleted during MP3 compression? If MP3s use a lossy compression algorithm to reduce the file size of uncompressed audio recordings by as much as 90%, what does the lost data sound like?

Taking this as our motivation, several practical problems arise. How do we determine exactly what the lost material is? This problem suggests several possible solutions. One option is to create a negative-image MP3 encoder, which would perform the same analysis as the usual MP3 encoding but compute the inverse: deleting bits that would usually be left unchanged and proportionally maintaining resolution where it would usually be compromised. This would be very labor-intensive.

Another possible solution is to deal with the output of the MP3 encoding directly. With this approach, we could analyze the original audio file and compare it to the MP3 file. This has the advantage of allowing us to choose among a variety of perceptually motivated approaches for analysis, and it avoids writing a new codec from scratch. The challenge then becomes reconstructing the difference satisfactorily.

For moDernisT, we combined two techniques to generate a variety of material. The first and simplest technique was to sample-align the two audio files and invert the phase of one. Phase inversion is a technique used in other areas of audio production; here, summing the inverted MP3 with the original cancels everything the two files share and leaves a difference signal.
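A minimal sketch of this differencing step, assuming both files have already been decoded to PCM at the same sample rate, share the same channel layout, and are sample-aligned; the file names and the use of numpy and soundfile are stand-ins rather than the project's actual script:

```python
# Minimal sketch of phase-inversion differencing (assumed file names).
import numpy as np
import soundfile as sf

original, sr = sf.read("original.wav")     # uncompressed source
decoded, sr2 = sf.read("mp3_decoded.wav")  # the MP3, decoded back to PCM
assert sr == sr2

# Trim to a common length, invert the phase of one signal, and sum.
# This is equivalent to subtraction: shared content cancels out.
n = min(len(original), len(decoded))
difference = original[:n] + (-1.0 * decoded[:n])

sf.write("difference.wav", difference, sr)
```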

Phase inversion takes place in what is called the time domain, though I think it should have been called the amplitude domain. A time-domain representation places amplitude on the y-axis and time on the x-axis. Any time-domain signal can be converted to a frequency-domain representation by applying a transform such as the FFT or DCT. This kind of representation allows us to see frequency on the x-axis, time on the y-axis, and amplitude on the z-axis. A major advantage of this representation is that it allows us to isolate particular frequency components by their amplitude at a given time.
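As a rough illustration, the conversion from the time domain to a time-frequency representation can be sketched with the short-time Fourier transform; the window size and threshold below are arbitrary demonstration values, not the project's settings:

```python
# Sketch: time-domain signal to time-frequency representation via STFT.
import numpy as np
import soundfile as sf
from scipy.signal import stft

signal, sr = sf.read("difference.wav")  # placeholder file name
if signal.ndim > 1:
    signal = signal.mean(axis=1)        # fold to mono for analysis

# freqs indexes frequency, times indexes time, and |Z| gives the amplitude
# of each (frequency, time) bin.
freqs, times, Z = stft(signal, fs=sr, nperseg=2048)
magnitude = np.abs(Z)

# Isolate components by amplitude at a given time: zero out weak bins.
threshold = 0.1 * magnitude.max()
masked = np.where(magnitude > threshold, Z, 0)
```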

To implement this, I wrote a script in Python that utilized several functions from Michael Casey's Bregman Python Toolkit to compare the uncompressed audio with the MP3 and return the difference. Details of this process can be found in my article in the Proceedings of the International Computer Music Conference, titled The Ghost in the MP3. For an example of this technique, you can view my GitHub repository on Masking Processes.
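The Bregman-based script itself is not reproduced here; the following is only an assumption-laden sketch of the underlying idea, using scipy in place of the Bregman Toolkit, with an arbitrary margin deciding which time-frequency bins count as removed by the encoder:

```python
# Sketch of comparing the uncompressed audio with the decoded MP3 and
# keeping only the spectral content the encoder removed. Not project code.
import numpy as np
import soundfile as sf
from scipy.signal import stft, istft

def spectral_residual(original_path, mp3_decoded_path, nperseg=2048):
    x, sr = sf.read(original_path)
    y, _ = sf.read(mp3_decoded_path)
    if x.ndim > 1:
        x = x.mean(axis=1)
    if y.ndim > 1:
        y = y.mean(axis=1)
    n = min(len(x), len(y))

    _, _, X = stft(x[:n], fs=sr, nperseg=nperseg)
    _, _, Y = stft(y[:n], fs=sr, nperseg=nperseg)

    # Keep bins where the original is noticeably louder than the MP3
    # version; 1.1 is an arbitrary margin chosen for illustration.
    keep = np.abs(X) > np.abs(Y) * 1.1
    residual = np.where(keep, X - Y, 0)

    _, ghost = istft(residual, fs=sr, nperseg=nperseg)
    return ghost, sr
```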

After isolating a wide variety of timbres and frequency components by running the masking procedure with slight variations and selectively filtering the phase-inversion extractions, I separated the material based on an analysis of the musical form of the original song. Included in this analysis was information about the lyrical content, mood, timbre, and notable musical features of each subsection. This data would inform composition decisions later in the process.

Having generated dozens of audio files for each verse, each containing different isolated components of the sounds that had been deleted during MP3 compression, I listened. I auditioned each audio file, considered its sound and its relation to the musical features of that section of the song, and saved the audio files that sounded most compelling to my ear. In this sense, I was enacting a kind of personal lossy compression algorithm on the extracted material. To be sure, I discarded at least 90% of the audio files I had created during the previous stage. Had I become an MP3 myself?

The material was all very suggestive: traces of the song, hints at the melody. I wanted to bring this interesting perceptual feature out more. To do so, I wrote a Python script to create varying degrees of rhythmic deviation from the original timing of the song. In this way, the time element would come across as similarly obscured, doing to the song's timing what the MP3 algorithm had done to its frequency content. We could have rhythms that suggest the familiar rhythm of the song but aren't quite right, leaving important signposts and strong beats in place along the way, yet deviating enough that the mind would fill in the blanks.

I again turned to Python to implement what I considered a rhythmic analogue to lossy compression. I wrote a script to take an input audio file, segment it, and scramble it to varying degrees. Again drawing on my musical analysis, I varied these parameters for each section of the song. You can view some examples of the Python source code for these lossy rhythm functions on my GitHub page.
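The functions on GitHub are not reproduced here; the sketch below shows one hypothetical way such a segment-and-scramble step could look, where the segment length and the deviation parameter are illustrative guesses rather than the project's actual values:

```python
# Hypothetical sketch of a "lossy rhythm" function: swap neighbouring
# segments with some probability so fine timing drifts while strong
# beats mostly stay in place. Not the code from the project's GitHub page.
import numpy as np
import soundfile as sf

def scramble_rhythm(path, out_path, segment_seconds=0.25, deviation=0.3, seed=0):
    audio, sr = sf.read(path)
    seg_len = int(segment_seconds * sr)
    segments = [audio[i:i + seg_len] for i in range(0, len(audio), seg_len)]

    rng = np.random.default_rng(seed)
    for i in range(0, len(segments) - 1, 2):
        if rng.random() < deviation:
            segments[i], segments[i + 1] = segments[i + 1], segments[i]

    sf.write(out_path, np.concatenate(segments), sr)

# Higher deviation values obscure the original timing more heavily.
scramble_rhythm("verse1_layer.wav", "verse1_layer_scrambled.wav", deviation=0.5)
```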

After generating multiple layers for each verse from which to construct a rich, polyphonic texture, I began to consider imaging. Most live music and traditional composition is not concerned with the perception of direction or space as a compositional parameter. One of the most dramatic changes with the advent of recorded music is the degree to which space and directionality play a crucial role in the composition of recorded music. The music industry has settled on stereo playback as the de facto standard for music imaging, which affords an artist a single axis along which to compose direction, from far left, moving across the front of the virtual soundstage, to far right. Through reverberation one can also compose a sense of depth around this directional axis, allowing one to dynamically control something akin to room size for each sound source.
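For a concrete sense of that single left-to-right axis, here is a minimal equal-power panning sketch; it is a generic illustration of stereo placement, not part of the project's code:

```python
# Equal-power panning: pan runs from -1.0 (far left) to +1.0 (far right).
import numpy as np

def equal_power_pan(mono, pan):
    theta = (pan + 1.0) * np.pi / 4.0        # map [-1, 1] to [0, pi/2]
    left = mono * np.cos(theta)
    right = mono * np.sin(theta)
    return np.stack([left, right], axis=-1)  # (samples, 2) stereo array
```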

Given the limitations of stereo playback, and considering further that most music listening is now done on earbuds or headphones in the United States and many other industrialized nations, composers wishing to expand the sense of space to three dimensions, rather than the simple left-to-right axis usually afforded, are forced to consider virtual 3D audio techniques. The most commonly used technique for creating such binaural recordings is to apply a head-related transfer function (HRTF) to the source audio. Rather than simply choosing a panning position along the left-right axis, a composer is now given control over azimuth, elevation, and radius. This can be imagined as varying the position of a sound source on the surface of a variable-radius dome surrounding the listener's head.

To create a sense of immersion in the MP3 detritus, I wrote a Python script drawing on an open-source HRTF Python library called headspace. The script I wrote left the radius fixed but varied the azimuth and elevation dynamically. What I sought to create was a sense of dynamic movement in 3D space around the listener. I composed this layer as a counterpoint to the varying timbres and textures of each section of the song. You can view an example of the functions I created on my GitHub page, under the Dynamic HRTF folder.
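The headspace library's API is not reproduced here; the sketch below only illustrates the trajectory idea of a fixed radius with moving azimuth and elevation, with a hypothetical render_hrtf_block placeholder standing in for whatever binaural renderer is actually used:

```python
# Sketch of dynamically varying azimuth and elevation at a fixed radius.
# render_hrtf_block is a hypothetical placeholder, not the headspace API.
import numpy as np
import soundfile as sf

def render_hrtf_block(block, azimuth_deg, elevation_deg):
    """Placeholder: a real renderer would convolve `block` with the HRIR
    pair for this direction; here it simply duplicates the block to stereo."""
    return np.stack([block, block], axis=-1)

def binaural_orbit(path, out_path, block_seconds=0.05, orbit_seconds=8.0):
    mono, sr = sf.read(path)
    if mono.ndim > 1:
        mono = mono.mean(axis=1)
    block = int(block_seconds * sr)

    out = []
    for i, start in enumerate(range(0, len(mono), block)):
        t = (i * block_seconds) % orbit_seconds / orbit_seconds
        azimuth = 360.0 * t                       # circle the listener
        elevation = 30.0 * np.sin(2 * np.pi * t)  # bob above and below ear level
        out.append(render_hrtf_block(mono[start:start + block], azimuth, elevation))

    sf.write(out_path, np.concatenate(out), sr)
```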

With all of this material in tow, I transitioned to working in a traditional digital audio workstation. For this project I worked in Logic Pro, but other DAWs, such as Reaper, Rosegarden, or Ardour, would work just as well. This software allowed me to stitch together the numerous layers and sections of the song, blend them, apply post-production effects as needed, and make final dynamic adjustments to the amplitude and imaging.

Working in Logic, a composer can control a wide range of parameters: dynamically varying the virtual room acoustics of each sound source, micro-editing dynamic changes and panning, applying timbral transformations via signal-processing algorithms such as ring modulation, and making micro-rhythmic adjustments. After spending several weeks making fine adjustments to these parameters, I ended up with a meticulously labored-over composition ready for mastering and digital release.