Introducing Thrush: A Web-Based Collaborative Programmable Synthesizer
- Eldan Ben Haim
- Oct 27, 2022
- 14 min read
Once in a while I get an itch to start a project that combines technology and music. In most cases I ignore that itch and get on with my life, but recently I decided to take on such a project. The idea is to create an application that allows composing music by combining novel musical pieces and phrases with parts from existing compositions. Any new composition then becomes the basis for such “plagiarism” by subsequent composers. The musical pieces that may be reused include synthesizer voices (e.g. samples or FM synthesis parameters) as well as complete patterns or sequences. Oh, and code; the synthesizer supports procedural compositions expressed in code. Users may use any of these bits as they are, or modify them. The system maintains the lineage of these reusable bits, so that when a user finds a piece they like, they can follow this lineage to discover variations made by others until they find something they want to base their own work on. Due to the collaborative nature of this application, I’m developing it as a web app. I’m using Angular and TypeScript for the front-end, and will probably base the backend on NestJS.
I’m calling the application Thrush because I think it’s a pretentious name that expresses enigmatic sophistication. Besides, that was the first songbird that came up on Google. You can experiment with whatever is there now by visiting https://thrush.benhaim.net. Be advised, there’s not much there yet. The entire collaboration bit is still far, far away, and for now this is mostly a showcase of the synthesizer and sequencer capabilities. Code for the project is available here. This blog post is the first in a series that will follow the project and shed some light on its implementation.
Sequencers and Tone-Generators
Like many other digital systems for composing music, Thrush is composed of several tone generators that are controlled by a sequencer. Tone generators are systems that receive as input requests to generate specific sounds, and err.. generate tones… based on this input. These input requests to tone generators are typically expressed in terms of notes, musical instruments, note intensity, effects, etc. One example of where this input could come from is a keyboard (a real one, like the one pianos have, not the ones you computer people call keyboards): when the player hits a note on the keyboard, the tone generator starts to generate a tone based on the played note, selected instrument, note velocity, etc. When the player releases the key, the tone generator may start a decay phase for the tone (or just abruptly terminate it).
Tone generators vary in many parameters; probably the most fundamental of them is the core method they use to render tones. There is a variety of approaches to generating tones; two very common ones are wavetable synthesis and FM synthesis. We’ll dive into these later, but for now suffice it to say that FM synthesis generates tones based on parameterized mathematical formulas — and as such the tones are often “mechanical” and “unnatural” (but not necessarily in a bad way), whereas wavetable synthesis can generate arbitrary tones, including tones based on recordings of natural sounds — hence it is capable of generating more complex and natural sounds.
A common source of input requests to tone generators in a digital music system is the sequencer. Sequencers are systems that transmit a sequence of events to tone generators. A sequencer may send inputs to multiple tone generators, and as such it “orchestrates” them by routing the right event at the right time to the right tone generator. The events that the sequencer transmits may have been recorded a priori, possibly by someone playing on a keyboard. Alternatively, they may have been edited, or even composed from scratch, using appropriate software. Sequencers may also generate events based on an algorithm, and/or based on real-time commands received from controllers and keyboards (granted, this stretches the definition of sequencers somewhat).
Wave-table Synthesis
The two tone generators currently implemented by Thrush generate sound by employing a tone generation method called wave-table synthesis (additional types of tone generators are planned for the future).
Sounds we hear are formed by changes in air pressure over time. The intensity of these changes affects how “loud” the sound is, and their frequency affects the pitch. Digital recordings of sounds are in fact a series of samples — each sample measuring the air pressure at a point in time. Recordings are characterized by how often the sound pressure is sampled (this is the sampling rate; for example, 44.1KHz means air pressure is sampled 44,100 times per second) and by the resolution of the measurement of air pressure at each point in time (for example, 16 bits per sample means that 2^16 = 65,536 different levels of pressure can be distinguished). There’s a lot to be said about how to choose an optimal sampling rate and resolution, coping with multiple channels (stereo), how to compress recordings, etc., but for our current discussion it suffices to deal with “raw” recordings, which we’ll call waveforms.
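To make this concrete, here’s a tiny illustration (not Thrush code) of what a raw waveform looks like as data: just an array of air-pressure samples.

```typescript
// Illustration only: compute one second of a 440Hz sine wave sampled 44,100
// times per second, with sample values normalized to the range -1..1.
const SAMPLE_RATE = 44100;

function makeSineWaveform(frequency: number, durationSeconds: number): Float32Array {
  const samples = new Float32Array(Math.floor(SAMPLE_RATE * durationSeconds));
  for (let i = 0; i < samples.length; i++) {
    samples[i] = Math.sin(2 * Math.PI * frequency * (i / SAMPLE_RATE));
  }
  return samples;
}

const waveform = makeSineWaveform(440, 1); // one second of concert-pitch A
```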
With wave-table synthesis, each musical instrument is assigned one or more waveforms that represent it (that’s a table of waveforms, or a wave-table ;) ). Since the waveform is a representation of air-pressure over time, it can accurately represent a recording of any sound. Waveforms don’t have to originate from recordings of "real" sounds, however. They may be generated by any other means (for example, computed using a mathematical formula). A wave-table "instrument" rendered by a tone-generator may have different waveforms assigned to different notes, and different versions for each note based on additional parameters such as note intensity. When the tone generator is requested to play a note using an instrument and with specific parameters, it looks up the assigned waveform and plays it. Multiple notes may be played in parallel (this is called polyphony) in which case their waveforms will be mixed.
However, having a separate waveform recorded for each combination of note and parameters is not very practical. Hence, typically an instrument includes a relatively small number of waveforms assigned to specific notes and parameter values. When other notes and parameter values are requested, for which the tone generator does not have a waveform, it will synthesize one based on the waveform for a close note-and-parameters combination, applying transforms to “adapt” the waveform to the requested note. For example, an instrument may include a single waveform of a recorded piano assigned to note C2, intensity 80%. A request to play note C2 at intensity 90% can be satisfied by playing an amplified version of that waveform — so it is louder than the original. Similarly, a request to play note C3 at intensity 80% can be satisfied by playing the original waveform after shifting its pitch x2 (given a note, shifting it one octave up means that its pitch is doubled; each of the 12 semitones in an octave multiplies the pitch by 2^(1/12) ≈ 1.059).
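As a small illustration (not Thrush code), the pitch-shift factor between two notes falls out of this directly:

```typescript
// The pitch-shift factor between two notes that are `semitones` apart is
// 2^(semitones/12): 2 for a full octave, ~1.059 for a single semitone.
function pitchShiftFactor(semitones: number): number {
  return Math.pow(2, semitones / 12);
}

pitchShiftFactor(12); // C2 -> C3 (one octave up): 2
pitchShiftFactor(1);  // C2 -> C#2 (one semitone up): ~1.0595
```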
Of course, if you consider a piano recording this is not completely ideal: a note played on the piano has a component generated by the resonating strings — which, granted, has its pitch changed between notes — but also a component generated by the hammer hitting the string. This component doesn’t change in pitch with the note (well, actually it does, but the change is more subtle than the string’s resonant frequency). So if we take a recording of a piano playing C2 and double its pitch to render a C3 note, the result will sound somewhat unnatural. This “non-linearity” of notes of the same instrument is often addressed in two ways. First, observe that we could get away with the pitch-shift method if we use it to render C#2, or even E2, based on C2. But as we move farther away from the original recorded note, the fidelity of the rendered sound degrades. An obvious possible solution is, then, to include denser wavetable recordings and make sure that the transformations we apply to waveforms are kept subtle by always basing them on a close note (or parameters; note that the same discussion applies to note parameters other than pitch).
Another approach to solving the fidelity issue is to use “layering”. In this approach, we render each note of an instrument by mixing multiple waveforms — often called layers. The transformations we apply to these layers may differ from one layer to another. Back to our piano example: we could think about a layer that contains the string-resonance part of the piano voice and a separate layer recording the hammer-hit component of the sound. The hammer-hit layer can be defined as “fixed pitch”, whereas the string-resonance layer will change with pitch. Note that both layers could still have their amplification vary with intensity. Of course, the two approaches may be combined to obtain optimal results.
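A hypothetical data model (not Thrush’s current one, which we’ll get to below) helps picture what layering means:

```typescript
// Hypothetical sketch: each layer carries its own waveform and its own rule for
// responding to pitch; all layers are mixed when a note is played.
interface InstrumentLayer {
  waveform: Float32Array; // recorded or generated samples for this layer
  baseNote: string;       // note the waveform represents, e.g. "C2"
  fixedPitch: boolean;    // true for e.g. the hammer hit: never pitch-shifted
}

interface LayeredInstrument {
  layers: InstrumentLayer[];
}
```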
Some wavetable synthesizers allow defining effects or filters to apply to the mixed instrument sound, to further characterize it. The available effects differ but usual suspects include sustain / reverb, chorus, high/low-pass filters, resonance, etc.
Soft Wavetable Synthesis in Thrush
Thrush implements both a software-based and an API-native wavetable tone generator. The software-based synthesizer serves two purposes: one, it is potentially more flexible than the API-native wavetable tone generator. Two, it seemed like a really fun piece of code to write :). The soft wavetable synthesizer is implemented as a WebAudio worker node. Without diving too much into the details of WebAudio (which we’ll cover in more depth in our discussion of the API-native wavetable tone generator), we’ll state that the WebAudio worker node API allows us to create a piece of JavaScript running in the background of a web application (much like a Web Worker) that gets requests from WebAudio for a frame of upcoming samples to play, and responds with the contents of the requested frame.
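The Web Audio construct matching this description is the AudioWorklet; here’s a minimal sketch of what such a processor looks like, under my assumption that the soft synthesizer follows this general shape (the class and processor names are hypothetical):

```typescript
// Runs inside the AudioWorklet global scope. Web Audio repeatedly calls process()
// asking for the next frame of output samples.
class SoftSynthProcessor extends AudioWorkletProcessor {
  process(
    _inputs: Float32Array[][],
    outputs: Float32Array[][],
    _parameters: Record<string, Float32Array>
  ): boolean {
    const [left, right] = outputs[0]; // stereo output channels for this frame
    for (let i = 0; i < left.length; i++) {
      // ...mix all currently playing notes into left[i] and right[i] here...
      left[i] = 0;
      right[i] = 0;
    }
    return true; // keep the processor alive
  }
}

registerProcessor('soft-synth', SoftSynthProcessor);
```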
For our soft wavetable synthesizer, the Web Worker maintains state for all currently playing notes. Each note is associated with a “channel”. The state for each channel includes information such as when the note started to play, which instrument it is using, and additional parameters, such as intensity, that it was requested with. On each invocation, the Web Worker generates the waveform for each playing note based on this state and mixes the waveforms into the returned output frame.
A Thrush soft wavetable instrument currently comprises a single waveform (that’s a one-entry wavetable ;)) along with an optional definition for sample looping. Sample looping allows creating “infinite length” samples by looping back to some point in the sample when we reach its end and the note should still be playing.
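A rough sketch of the data involved, based on the description above (the field names are my own, not necessarily Thrush’s):

```typescript
// Hypothetical shapes for an instrument and for per-channel playback state.
interface WavetableInstrument {
  samples: Float32Array;                 // the single waveform
  sampleRate: number;
  baseNote: string;                      // note at which the waveform plays unshifted
  loop?: { start: number; end: number }; // optional loop region for "infinite" notes
}

interface ChannelState {
  instrument: WavetableInstrument;
  sampleCursor: number;          // current read position within the waveform
  effectivePitchFactor: number;  // playback-rate factor (includes vibrato, etc.)
  effectiveVolume: number;       // 0..1
  effectivePanning: number;      // 0 = left, 1 = right
}
```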
In our current implementation, the parameters that may be included with each note play request (other than the channel on which to play the note) are as follows (a sketch of the request shape appears after this list):
Instrument. This is the instrument that plays the note.
Pitch. Expressed in terms of what note we’re playing (e.g. E3).
Volume. Expressed as a floating point value in the range 0-1.
Panning. Our tone generator renders audio in stereo. A note may have a panning value representing whether it sounds more on the left channel or the right channel. This is expressed as a value in the range 0-1, where 0 means the note only plays on the left channel, 1 means the note only plays on the right channel and 0.5 means it plays equally on both channels.
Vibrato. The speed, shape and depth of vibrato to apply to the note.
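Putting the list above into a type, a note play request could look roughly like this (illustrative names, not Thrush’s actual API):

```typescript
// Hypothetical shape of a note play request, mirroring the parameter list above.
interface NoteOnRequest {
  channel: number;       // channel on which to play the note
  instrumentId: number;  // the instrument that plays the note
  note: string;          // pitch, e.g. "E3"
  volume: number;        // 0..1
  panning: number;       // 0 = left only, 1 = right only, 0.5 = both equally
  vibrato?: {
    shape: 'sine' | 'square' | 'sawtooth';
    frequency: number;   // vibrato speed in Hz
    depth: number;       // pitch deviation, e.g. in semitones
  };
}
```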
The state maintained by the tone generator for each channel includes, in addition to these parameters, a sample cursor that advances over time and indicates where we are in each played note, and state pertaining to parameters and effects that change over time (for example, for vibrato we maintain the current pitch shift, which changes over time to generate the vibrato effect).
The Thrush software synthesizer is a stereo synthesizer, so each frame consists of two sets of samples — one for the left channel, and one for the right channel. To calculate the content of each sample in each channel of the frame, we perform the following (a sketch of this loop follows the steps):
Calculate the sample for each output channel: sampleL/R[i] := ∑ channel.instrumentL/R[channel.sampleCursor] · channel.effectiveVolume · panningFactorL/R(channel.effectivePanning), where panningFactorL(panning) := 1 − panning, panningFactorR(panning) := panning, and the summation is over all channels channel.
Increment the channel.sampleCursor for each channel based on the effective pitch. If we’re past the last sample, and there’s a loop defined for the instrument — move the sampleCursor back to the loop’s beginning.
Update the effective pitch, volume and panning for the channel. The effective pitch is determined based on the played note plus any modifications by effects such as vibrato. The effective volume and panning are presently determined solely by the volume and panning requested for the note.
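Here is a sketch of that per-frame mixing loop (simplified; it reuses the hypothetical ChannelState interface from earlier, not Thrush’s actual code):

```typescript
// Mix all channels into one stereo frame, sample by sample.
function renderFrame(channels: ChannelState[], left: Float32Array, right: Float32Array) {
  for (let i = 0; i < left.length; i++) {
    let sumL = 0, sumR = 0;
    for (const ch of channels) {
      const s = ch.instrument.samples[Math.floor(ch.sampleCursor)] ?? 0;
      sumL += s * ch.effectiveVolume * (1 - ch.effectivePanning); // left panning factor
      sumR += s * ch.effectiveVolume * ch.effectivePanning;       // right panning factor

      // Advance the cursor by the effective pitch factor; wrap if a loop is defined.
      ch.sampleCursor += ch.effectivePitchFactor;
      const loop = ch.instrument.loop;
      if (loop && ch.sampleCursor >= ch.instrument.samples.length) {
        ch.sampleCursor = loop.start;
      }
      // ...update effective pitch/volume/panning here (e.g. advance vibrato state)...
    }
    left[i] = sumL;
    right[i] = sumR;
  }
}
```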
The soft wavetable synthesizer maintains a buffer of upcoming requests; these may include requests to play a new note, silence a note, or otherwise modify note parameters. Each request in this buffer, called the soft wavetable synthesizer’s “event buffer”, is accompanied by the time at which it should be applied. As the wavetable synthesizer iterates to fill samples in the output buffer as portrayed above, it also tracks whether there are new note requests that need to be served, and updates channel state based on these events.
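A minimal sketch of what consuming due events might look like (my illustration, with hypothetical names):

```typescript
// Each buffered event carries the absolute time at which it should take effect.
interface SynthEvent {
  time: number;                           // seconds, on the audio clock
  apply(channels: ChannelState[]): void;  // e.g. note-on, note-off, parameter change
}

// Called as rendering progresses: apply every event whose time has arrived.
function consumeDueEvents(eventBuffer: SynthEvent[], currentTime: number, channels: ChannelState[]) {
  while (eventBuffer.length > 0 && eventBuffer[0].time <= currentTime) {
    eventBuffer.shift()!.apply(channels);
  }
}
```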
Events are submitted to the soft wavetable synthesizer’s event buffer by sending a message to the web worker (through the postMessage API). In Thrush I’ve implemented a simple IPC layer that sends these messages using JavaScript Proxies (see here for another discussion of other uses for JavaScript proxies).
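The general idea of such a Proxy-based layer looks something like this (a sketch of the pattern, not Thrush’s actual IPC code):

```typescript
// Turn method calls on a proxy object into postMessage calls; the receiving side
// dispatches the { method, args } messages to its own handlers.
function createWorkerProxy<T extends object>(target: { postMessage(msg: unknown): void }): T {
  return new Proxy({} as T, {
    get(_obj, method) {
      return (...args: unknown[]) => target.postMessage({ method, args });
    },
  });
}

// Hypothetical usage, assuming a SynthCommands interface and a worker/worklet port:
// const synth = createWorkerProxy<SynthCommands>(workletNode.port);
// synth.enqueueEvent(noteOnRequest);
```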
We’ll see that event buffering is a recurring theme throughout the Thrush implementation. It allows us to offload timing to small efficient ‘realtime’ routines once we generate a predicted sequence of events, and removes realtime timing constraints from most code that deals with generating event sequences. For example, by pre-submitting events to the tone generator we don’t really care about the latency of the postMessage call (assuming it’s not ridiculously high of course) — even if the latency for postMessage is 50ms, if we buffer events 3 seconds in advance then we won’t really experience timing jitter. The downside of using buffering is obviously that if the sequence of events isn’t known in advance things become more complex. This may occur, for example, if we want to include real-time keyboard-input notes in our music… we only know that the sequence of events includes a request to play C1 at time 3sec once the note is played on the keyboard, at time 3sec….
Native Wavetable Synthesis in Thrush
The native wavetable synthesizer in Thrush is in many ways simpler to implement than the soft wavetable synthesizer. Much of the heavy lifting of rendering the sounds is done here by Web Audio API constructs. In addition to being simpler to implement (though a little more boring), this also leads to better performance, as the computation work is done natively by the browser.
The Web Audio API models audio generation using a node-graph abstraction. A node in the graph may receive zero or more inputs (of sample data), and passes its output (again, sample data) to one or more subsequent nodes. Typically the “last” node in the graph is the actual audio rendering node (this is given by AudioContext.destination) — whatever input it is fed is played by the audio hardware. Nodes preceding this node may play waveforms, transform waveforms, mix them, etc. We’ve already mentioned one type of node earlier: the soft synthesizer is implemented as a worker node which generates audio frames by running JavaScript code.
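A minimal node-graph example (an illustration, not Thrush code): an oscillator feeding a gain node, which feeds the destination.

```typescript
const ctx = new AudioContext();
const osc = ctx.createOscillator(); // a source node producing a waveform
const gain = ctx.createGain();      // transforms its input by scaling its volume

osc.connect(gain);
gain.connect(ctx.destination);      // the "last" node: whatever it receives is played
osc.start();
```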
Web Audio graph nodes may have parameters that govern their function. For example, a Gain node — which essentially changes the volume of its input — has a parameter that controls how the volume is changed (by how much it’s increased or decreased). When creating nodes it’s typically possible to provide values for their various parameters over time, in advance; so one can request that a Gain node change its gain from 1 to 0 over 3 seconds to create a fade-out effect. This means that a lot (but not all) of the “event buffering” logic we had for the soft synthesizer can be offloaded to the WebAudio implementation as well.
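Continuing the small example above, scheduling that fade-out uses the standard AudioParam automation calls:

```typescript
// Schedule a 3-second fade-out on the gain node, entirely in advance.
const now = ctx.currentTime;
gain.gain.setValueAtTime(1, now);              // start at full volume
gain.gain.linearRampToValueAtTime(0, now + 3); // ramp to silence over 3 seconds
```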
And so, when the native wavetable synthesizer in Thrush receives a note request, instead of storing it in its own queue it immediately creates an AudioBufferSourceNode, sets its parameters to make it play the relevant sample at the relevant pitch (AudioBufferSourceNode takes care of rendering at the right pitch) and at the relevant time, and connects it to a GainNode and a StereoPannerNode previously allocated for the channel — setting these nodes’ parameters at the time of the note play request to match the requested volume and panning. In case you’re wondering, AudioBufferSourceNode’s implementation is engineered to make creating these nodes a lightweight operation, and this approach sits well with the API designers’ intended use.
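A sketch of playing one note this way (my reconstruction of the approach described above; channelGain and channelPanner stand for the per-channel nodes and are assumed to already be connected toward the destination):

```typescript
function playNote(ctx: AudioContext, buffer: AudioBuffer, pitchFactor: number,
                  volume: number, panning: number, startTime: number,
                  channelGain: GainNode, channelPanner: StereoPannerNode): AudioBufferSourceNode {
  const source = ctx.createBufferSource();
  source.buffer = buffer;
  source.playbackRate.value = pitchFactor;  // e.g. 2 to play one octave up

  channelGain.gain.setValueAtTime(volume, startTime);
  channelPanner.pan.setValueAtTime(panning * 2 - 1, startTime); // Web Audio pan is -1..1

  source.connect(channelGain);
  source.start(startTime);                  // schedule playback at the requested time
  return source;
}
```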
With this mechanism in place, most of the tone rendering process — along with its timing — is indeed offloaded to Web Audio. The remaining bit is rendering note parameters that change over time, e.g. pitch shifts due to vibrato. This is done by pre-programming parameter values over time; in the case of vibrato applied to a note, this means queueing changes to the pitch of the AudioBufferSourceNode over time. But there’s a caveat here: consider a note play request for a note with a vibrato effect. We can’t know in advance for how long the note will play (an event for silencing the note may not have been buffered yet!) — so how far in advance do we set vibrato-controlled parameters?
The solution here comes, again, in the form of event buffering. We pre-submit parameter changes for a fixed time into the future, and we periodically check whether the current time is getting too close for comfort to the time of the last buffered parameter change. If it is, we buffer an additional batch of parameter changes. Now, at some point in time the note for which we’re buffering these events will be silenced — or perhaps some of its parameters will be changed. This will be expressed by a new note request being submitted to the tone generator. When we get such a request affecting a currently playing note, we need to “clear” all future parameter changes that were submitted for times past the event’s time. Luckily, Web Audio allows us to do this — instructing it to discard future parameter change requests starting from a specific time — using the cancelAndHoldAtTime method.
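Here is a sketch of the chunked vibrato scheduling described above (illustrative constants and names, not Thrush’s actual code):

```typescript
const VIBRATO_CHUNK_SECONDS = 2;   // how far ahead each batch of changes reaches
const VIBRATO_STEP_SECONDS = 0.02; // resolution of the scheduled pitch changes

// Schedule one chunk of sine-shaped vibrato on a pitch parameter (e.g. the
// AudioBufferSourceNode's playbackRate); returns the time of the last buffered change.
function scheduleVibratoChunk(param: AudioParam, basePitchFactor: number,
                              vibratoHz: number, depthSemitones: number,
                              fromTime: number): number {
  for (let t = 0; t < VIBRATO_CHUNK_SECONDS; t += VIBRATO_STEP_SECONDS) {
    const wobble = Math.pow(2, (depthSemitones * Math.sin(2 * Math.PI * vibratoHz * t)) / 12);
    param.setValueAtTime(basePitchFactor * wobble, fromTime + t);
  }
  return fromTime + VIBRATO_CHUNK_SECONDS;
}

// When a later event (note-off or a parameter change) affects this note, discard
// the already-scheduled changes from that point on:
// param.cancelAndHoldAtTime(eventTime);
```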
Future Work for Thrush Wavetable Tone Generators
Thrush is at a point where it makes sense to make some changes to how tone generators are abstracted. As described above, tone generators are currently modeled as having multiple channels, with notes submitted to channels that play concurrently. This means that the tone generator’s client is responsible for allocating channels to notes when doing polyphony. However, the client doesn’t necessarily have enough information to make the optimal allocation of channels. For example, for notes that have a decay period (that is, they continue to play a sound even after they’re ‘released’), it is unreasonable to expect the client to track that decay period to know when it can reuse a channel. As such, my intention is to omit channels from requests. Instead of channels, a tone generator will receive a request to play a note using an instrument and will allocate the channel itself. A note play request may include an ID that will allow subsequent requests to refer to the playing note to alter its parameters (e.g. pitch bend).
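A rough sketch of what such a channel-less interface could look like (speculative, since this is future work; all names here are hypothetical):

```typescript
interface ToneGenerator {
  // Play a note; the generator allocates a voice/channel internally. The caller
  // supplies a noteId so that later requests can refer to this playing note.
  playNote(request: {
    noteId: string;
    instrumentId: number;
    note: string;
    volume: number;
    panning: number;
    time: number;
  }): void;

  // Alter a playing note (e.g. pitch bend) or release it, identified by noteId.
  updateNote(noteId: string, changes: { pitchBend?: number; volume?: number }, time: number): void;
  releaseNote(noteId: string, time: number): void;
}
```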
I’m also going to make some changes to the implementation of the two tone generators. Right now all channels are modeled in a flat, symmetrical manner; the soft synthesizer simply iterates through channels and mixes them, and the native generator creates per-channel gain and pan nodes and connects them all to the output node. However, to add effects such as reverb to instruments, we’ll need to apply these effects to all notes played concurrently by a specific instrument. For the native synthesizer this means that the audio node topology will include a processing node per instrument, and the gain/pan/source nodes will be connected to that instead of directly to the destination node. For the soft synthesizer this means that in addition to channel state we’ll also maintain instrument-level state in order to apply effects.
Of course, all of these changes are only needed in the first place to support note and instrument parameters that are not currently available and will be added in subsequent versions, such as:
A volume envelope for wavetable instruments, including a decay phase
A panning envelope for wavetable instruments, including a decay phase
Instrument-level effects
Epilogue
Tone generators are just one part of Thrush. The project already has the foundations for sequencers as well (which I’ll cover in a following post in the series), and there’s still work to be done on creating instrument editors, collaboration, sequence editors, and much more. This project is clearly going to be a large one, with many complexities and interesting challenges in both design and implementation. I think there’s easily 3-4 man-months of work in the initial set of goals I’ve defined, and considering the fact that I’m working on a couple more projects in parallel, this is easily going to take more than a year to complete. If you’re reading this, are interested in the subject matter, are familiar with the technologies (TypeScript, NodeJS, Angular), and feel up to the challenge of lending a hand in the development, do drop me a note!