I need to take an array of bytes that represents an audio file (a WAV file's data) and produce a new array of bytes that represents the same WAV file, but with an increased pitch. It sounds simple, but it is apparently anything but. I tried doing this on my own, but I lack the math skills to follow any of the guides for doing this.
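To make the difficulty concrete, here is a minimal sketch of the naive approach: resampling the PCM samples with linear interpolation. It does raise the pitch, but it also shortens the clip, which is why a real pitch shifter needs more math than this. The helper names `resampleLinear` and `semitonesToRate` are hypothetical and are not part of Tone.js or the Web Audio API.

```javascript
// Naive pitch change by resampling (hypothetical helper, not from Tone.js).
// Reading the samples `rate` times faster raises the pitch by that ratio,
// but the result is also `rate` times shorter.
function resampleLinear(samples, rate) {
  const outLength = Math.floor(samples.length / rate);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = i * rate;                              // fractional read position
    const i0 = Math.floor(pos);                        // sample before the position
    const i1 = Math.min(i0 + 1, samples.length - 1);   // sample after, clamped
    const frac = pos - i0;
    out[i] = samples[i0] * (1 - frac) + samples[i1] * frac;
  }
  return out;
}

// Converts a shift in semitones to a rate ratio (12 semitones doubles the rate).
function semitonesToRate(semitones) {
  return Math.pow(2, semitones / 12);
}
```

For example, `resampleLinear(pcm, semitonesToRate(12))` would return a clip one octave higher and half as long, so this is not equivalent to what the PitchShift effect does.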
Using the library Tone.js, I am able to increase the pitch of my audio with a few lines of code. The issue is that the audio just plays from my speakers; I need the actual data of that audio.
I know that Tone.js uses the Web Audio API, and when I use the Web Audio API directly, playing audio looks something like this.
let source = this.audioCtx.createBufferSource();
// this.audioBuffer is an AudioBuffer.
source.buffer = this.audioBuffer;
// Logs the first float32 PCM sample of the source's buffer, which for my data is 0.012939848005771637.
// This data was decoded from an actual .wav file using audioContext.decodeAudioData().
console.log(this.audioBuffer.getChannelData(0)[0]);
// Connects the source to the context's output (the speakers). Without this, nothing is audible.
source.connect(this.audioCtx.destination);
// Calling the start() method will play the audio.
source.start();
The main takeaways from that code are three things.
- There is a source, an AudioBufferSourceNode, that is created with the audio context's createBufferSource() method.
- If you call getChannelData() on that source's buffer property, you retrieve the actual PCM data. Again, this is what I am trying to do: get the actual PCM data.
- source.start() is the last thing called before the audio plays from the speakers.
My idea was that if I dug through the Tone.js source code and tracked down the code that calls the start() method, I could inspect the audio there with getChannelData() and see the data that is playing from my speakers, which, when using the PitchShift effect from Tone.js, is the pitched-up version of my audio.
It took some time, but I eventually found what I was looking for in the ToneBufferSource.ts file. There is a start() function defined towards the top of the ToneBufferSource class, and in that function there is code that looks like this, among lots of other code.
// These logs are mine
console.log(this._source);
console.log(this._source.buffer.getChannelData(0)[0]);
// This plays the audio.
this._source.start(computedTime, computedOffset);
My first log tells me that this._source is indeed an AudioBufferSourceNode. The issue is that when I log the first float32 value from getChannelData(), it is the same as my unpitched data, 0.012939848005771637. If I log the 10th or 100th value in the array, it is the same as the 10th or 100th value of the unpitched data. Basically, even though what I hear through my speakers is my audio with an increased pitch, the data sitting in the source's buffer is the vanilla data. So what gives? How is Tone.js increasing the pitch when the source data is unchanged?
Then I found that Tone.js uses the standardized-audio-context library, and it is that library that actually handles the AudioContext, the AudioBufferSourceNodes, etc. Even though I said in the paragraph above that this._source was an AudioBufferSourceNode, it actually isn't.
You can see in the image that there is a _context property, whereas a normal AudioBufferSourceNode has no _context property, simply a context property. There is another property called _nativeAudioBufferSourceNode; this is the actual AudioBufferSourceNode.
Anyway, all that to say that I then dug through the standardized-audio-context code looking for its source.start() call, to analyze the data coming through the speakers the same way I did above. Here is what I found in factories\audio-buffer-source-node-constructor.js:
// Again, I put this log here. It's not in the source code...
console.log(this._nativeAudioBufferSourceNode.buffer.getChannelData(0)[0]);
this._nativeAudioBufferSourceNode.start(when, offset, duration);
And that log, which prints the first float32 value from the source's buffer, is again 0.012939848005771637, exactly the same as the unpitched data. It is the same regardless of the pitch shift too, whether I do const pitch = new PitchShift(8).toDestination();
or const pitch = new PitchShift(-1).toDestination();
it's the same thing. The data isn't changing to reflect the different sound actually coming from my speakers.
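One likely reading of these logs is that, in the Web Audio model, the buffer is only the input to the audio graph: an effect placed between the source and the destination computes its own output and never writes back into source.buffer. The toy model below illustrates that idea in plain JavaScript. The class names `ToySource` and `ToyEffect` are hypothetical, a simple gain stands in for the pitch shifter, and none of this is the actual Web Audio or Tone.js implementation.

```javascript
// Toy model of an audio graph (hypothetical classes, for illustration only).
class ToySource {
  constructor(buffer) {
    this.buffer = buffer;   // the input data, like AudioBufferSourceNode.buffer
    this.output = null;
  }
  connect(node) {
    this.output = node;
    return node;
  }
  start() {
    // The downstream node reads the buffer and produces its own output.
    return this.output.process(this.buffer);
  }
}

class ToyEffect {
  constructor(gain) {
    this.gain = gain;       // a gain stands in for the real pitch-shift processing
  }
  process(input) {
    // Float32Array.prototype.map returns a new array; the input is untouched.
    return input.map((sample) => sample * this.gain);
  }
}

const toySource = new ToySource(new Float32Array([0.5, 0.25, -0.5]));
toySource.connect(new ToyEffect(0.5));
const heard = toySource.start();
console.log(toySource.buffer[0], heard[0]); // the source buffer still holds its original value
```

If that reading is right, logging source.buffer at any point in Tone.js or standardized-audio-context would always show the unpitched data, because the processed samples only exist downstream of the source.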
Is there a way to get the data representing the audio I am hearing through my speakers?
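One direction that may be worth exploring is rendering the graph with an OfflineAudioContext, which returns the processed audio as a new AudioBuffer instead of sending it to the speakers. The sketch below assumes a browser environment; `renderOffline`, `buildGraph`, and `encodeWavMono16` are hypothetical names, and the encoder only handles mono 16-bit PCM. Tone.js nodes generally need a Tone.js context, so for PitchShift the library's Tone.Offline helper is probably the closer fit; whether it renders PitchShift as expected is something to verify.

```javascript
// Browser-only sketch: renders the graph offline and resolves to a new AudioBuffer.
// `buildGraph` is a hypothetical callback that wires source -> effects -> ctx.destination.
async function renderOffline(audioBuffer, buildGraph) {
  const ctx = new OfflineAudioContext(
    audioBuffer.numberOfChannels,
    audioBuffer.length,        // assumption: output is as long as the input
    audioBuffer.sampleRate
  );
  const source = ctx.createBufferSource();
  source.buffer = audioBuffer;
  buildGraph(ctx, source);
  source.start();
  return ctx.startRendering();
}

// Encodes mono float32 samples as the bytes of a 16-bit PCM WAV file.
function encodeWavMono16(samples, sampleRate) {
  const dataSize = samples.length * 2;
  const bytes = new Uint8Array(44 + dataSize);
  const view = new DataView(bytes.buffer);
  const writeText = (offset, text) => {
    for (let i = 0; i < text.length; i++) {
      view.setUint8(offset + i, text.charCodeAt(i));
    }
  };
  writeText(0, "RIFF");
  view.setUint32(4, 36 + dataSize, true);   // file size minus the first 8 bytes
  writeText(8, "WAVE");
  writeText(12, "fmt ");
  view.setUint32(16, 16, true);             // size of the fmt chunk
  view.setUint16(20, 1, true);              // audio format 1 = PCM
  view.setUint16(22, 1, true);              // one channel
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate = sampleRate * blockAlign
  view.setUint16(32, 2, true);              // block align = 2 bytes per frame
  view.setUint16(34, 16, true);             // bits per sample
  writeText(36, "data");
  view.setUint32(40, dataSize, true);
  for (let i = 0; i < samples.length; i++) {
    const s = Math.max(-1, Math.min(1, samples[i])); // clamp to [-1, 1]
    view.setInt16(44 + i * 2, Math.round(s < 0 ? s * 0x8000 : s * 0x7fff), true);
  }
  return bytes;
}
```

A possible usage, under the same assumptions, would be `const rendered = await renderOffline(audioBuffer, (ctx, source) => source.connect(ctx.destination));` followed by `encodeWavMono16(rendered.getChannelData(0), rendered.sampleRate)`, with the effect nodes inserted between the source and ctx.destination.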