Problems decoding streamed mp3 data using JLayer

Question

I'm trying to use the JLayer Java library to decode an MP3 data stream. I have a callback which is called asynchronously when the next chunk of MP3 data has arrived from the network. Each chunk that arrives contains 4 MP3 frames in byte[] format. This data is passed to the method short[] decode(byte[] mp3_data) to be decoded, and the output is a short[] PCM audio buffer. The buffer is appended to inside the while loop using the concatArrays() method until all the MP3 frames are exhausted. The problem I am having is that the first 2 or sometimes 3 frames of data return a PCM buffer filled with zeros, whereas the last 2 or 1 return valid 16-bit audio values.

    public short[] decode(byte[] mp3_data) throws IOException {

        SampleBuffer output = null;
        InputStream inputStream = new ByteArrayInputStream(mp3_data);
        short[] pcmOut = {};
        try {
            Bitstream bitstream = new Bitstream(inputStream);
            Decoder decoder = new Decoder();
            boolean done = false;
            int i = 0;
            while (! done) {
                Header frameHeader = bitstream.readFrame();
                if (frameHeader == null) {
                    done = true;
                } else {
                    output = (SampleBuffer) decoder.decodeFrame(frameHeader, bitstream);
                    short[] next = output.getBuffer();
                    pcmOut = concatArrays(pcmOut, next);
                }

                bitstream.closeFrame();
                i++;
            }
            return pcmOut;

        } catch (BitstreamException e) {
            throw new IOException("Bitstream error: " + e);
        } catch (DecoderException e) {
            Log.w(LOG_TAG, "Decoder error", e);
        }
        return null;
    }


    short[] concatArrays(short[] A, short[] B) {

        int aLen = A.length;
        int bLen = B.length;
        short[] C= new short[aLen+bLen];

        System.arraycopy(A, 0, C, 0, aLen);
        System.arraycopy(B, 0, C, aLen, bLen);

        return C;
    }

LOG OUTPUT

Frame 0 len: 2304, First 10 samples: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Frame 1 len: 2304, First 10 samples: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Frame 2 len: 2304, First 10 samples: [-4128, -4158, -4252, -3934, -4452, -3775, -4799, -3762, -5430, -4092]
Frame 3 len: 2304, First 10 samples: [-18050, -19711, -18184, -19753, -18143, -19595, -17046, -18362, -14773, -15933]

Frame 0 len: 2304, First 10 samples: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Frame 1 len: 2304, First 10 samples: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Frame 2 len: 2304, First 10 samples: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Frame 3 len: 2304, First 10 samples: [2455, 2345, 5253, 5129, 6716, 6442, 7475, 6866, 8461, 7444]

Frame 0 len: 2304, First 10 samples: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Frame 1 len: 2304, First 10 samples: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Frame 2 len: 2304, First 10 samples: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Frame 3 len: 2304, First 10 samples: [951, 1322, 1497, 1929, 1615, 2198, 1320, 2134, 1040, 2114]

Frame 0 len: 2304, First 10 samples: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Frame 1 len: 2304, First 10 samples: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Frame 2 len: 2304, First 10 samples: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Frame 3 len: 2304, First 10 samples: [-10213, -9578, -11691, -10867, -13686, -12770, -14837, -13874, -15619, -14574]

As you can see from the PCM buffers printed for each 4-frame MP3 chunk, the first 2 to 3 buffers are filled with zeros. Does anyone with JLayer experience see an obvious problem with my method?

Durandal · Answer 1 · 2013-05-24T17:36:31.517

What is the problem? First, many MP3s will obviously start with silence. Second, due to the nature of PCM synthesis, it takes a while to fill the polyphase synthesis filter bank, so the very first samples will very likely be zeros: the synthesis filter starts out with all zeros in its 16 banks.

Look at the entire frame to decide if it's silent, not at 10 samples.
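
A whole-frame check could look roughly like the sketch below. It is plain Java with no JLayer dependency; the class and method names (FrameSilence, isSilent, firstNonZero) are made up for illustration. The 2304-sample frame length is taken from the log output in the question.

```java
final class FrameSilence {
    /** Returns true only if every one of the first `length` samples is zero. */
    static boolean isSilent(short[] pcm, int length) {
        for (int i = 0; i < length; i++) {
            if (pcm[i] != 0) {
                return false;
            }
        }
        return true;
    }

    /** Returns the index of the first non-zero sample, or -1 if there is none. */
    static int firstNonZero(short[] pcm, int length) {
        for (int i = 0; i < length; i++) {
            if (pcm[i] != 0) {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // A frame whose first 10 samples are zero but which is not silent.
        short[] frame = new short[2304];
        frame[2000] = 123;
        System.out.println("silent: " + isSilent(frame, frame.length));
        System.out.println("first non-zero index: " + firstNonZero(frame, frame.length));
    }
}
```

Logging the index of the first non-zero sample per frame, instead of the first 10 samples, would show whether a frame is entirely zero or merely starts with zeros.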

EDIT: You apparently are not familiar with how MP3 works internally, so I'll elaborate a bit on the basics.

An MP3 frame contains the header word (which tells about bit rate, sample rate and stereo type) and some control information. The majority of the frame consists just of packed data. Contrary to what is mostly implied when people talk about MP3, the packed data does not belong entirely to that single frame. A frame can "borrow" packed data space from its predecessors, and it can also carry data belonging to the following frame(s). CBR (constant bit rate) just tells that all the frames are of equal size, but particularly complicated frames may still be allocated more bits by borrowing space from preceding frames (this decision is made by the encoder when it creates the stream). VBR just adds the possibility to also vary the frame size; technically, CBR streams are already able to allocate a variable amount of bits per frame, just within tighter limits than VBR.

To decouple the decoding from the unevenly allocated frame data, the decoder feeds the packed data it receives with each frame into a FIFO buffer called the "Bit Reserve" (more commonly known as the bit reservoir), which ensures that all data borrowed from previous frames is remembered until it is requested by the decoding pipeline.
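
To put rough numbers on the borrowing, here is a small arithmetic sketch. It rests on assumptions that are not stated in the question: an MPEG-1 Layer III stream at a 44.1 kHz sample rate with no CRC. The constants (144 in the frame size formula, a 4-byte header, 32 or 17 bytes of side information, and a 9-bit back-pointer of at most 511 bytes) are quoted from memory of the format, and the class and method names are hypothetical.

```java
final class Mp3FrameMath {
    /** MPEG-1 Layer III frame size in bytes: 144 * bitrate / sampleRate, plus one padding byte if set. */
    static int frameBytes(int bitrateBps, int sampleRateHz, boolean padded) {
        return 144 * bitrateBps / sampleRateHz + (padded ? 1 : 0);
    }

    /** Bytes left for packed audio data after the header and side information (assumes no CRC). */
    static int mainDataBytes(int frameBytes, boolean mono) {
        int headerBytes = 4;
        int sideInfoBytes = mono ? 17 : 32;
        return frameBytes - headerBytes - sideInfoBytes;
    }

    /** Rough count of preceding frames a back-pointer can reach into, assuming equal-sized frames. */
    static int framesSpanned(int backPointerBytes, int mainDataBytesPerFrame) {
        return (backPointerBytes + mainDataBytesPerFrame - 1) / mainDataBytesPerFrame;
    }

    public static void main(String[] args) {
        int frame = frameBytes(128_000, 44_100, false);   // 128 kbps as in the question; 44.1 kHz assumed
        int mainData = mainDataBytes(frame, false);       // stereo assumed
        System.out.println("frame bytes: " + frame);
        System.out.println("main data bytes per frame: " + mainData);
        System.out.println("frames a 511-byte back-pointer can span: " + framesSpanned(511, mainData));
    }
}
```

Under those assumptions a frame's packed data can begin more than one frame earlier, which would be consistent with the first frames of an isolated chunk lacking the data they need.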

Data from the bit reserve is then Huffman-decoded and processed through some complex math to produce time-frequency samples. To transform those into PCM, they are fed into the synthesis filter. The synthesis filter remembers each time-frequency sample for a fixed period of time into the past in its "banks" (technically a fixed number of steps; the wall-clock time varies with the sample rate). Each time-frequency sample influences multiple PCM samples, and the oldest is pushed out by the newest.

This entire decoding pipeline introduces quite some latency. Seeking inside an MP3 properly is non-trivial due to the latency of the pipeline, and is further complicated by the bit reserve borrowing mechanism.
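
The effect of zero-initialised pipeline state can be illustrated with a toy model. The sketch below is not MP3 decoding and does not use JLayer; DelayLine is a hypothetical stand-in for any stage that remembers past input and starts out filled with zeros. It compares reusing one instance across chunks with creating a fresh instance per chunk, which is what the decode() method in the question does with its Decoder and Bitstream.

```java
final class DelayLineDemo {
    /** Toy stateful stage: emits each input sample `delay` samples late. */
    static final class DelayLine {
        private final short[] history; // starts out as all zeros
        private int pos = 0;

        DelayLine(int delay) {
            history = new short[delay];
        }

        short[] process(short[] in) {
            short[] out = new short[in.length];
            for (int i = 0; i < in.length; i++) {
                out[i] = history[pos];   // oldest remembered sample leaves
                history[pos] = in[i];    // newest sample takes its place
                pos = (pos + 1) % history.length;
            }
            return out;
        }
    }

    /** Counts the zeros at the start of a buffer. */
    static int leadingZeros(short[] a) {
        int n = 0;
        while (n < a.length && a[n] == 0) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        short[] chunk = {5, 6, 7, 8, 9, 10};

        DelayLine shared = new DelayLine(3);
        shared.process(chunk);                                              // first chunk primes the state
        System.out.println(leadingZeros(shared.process(chunk)));            // prints 0
        System.out.println(leadingZeros(new DelayLine(3).process(chunk)));  // prints 3
    }
}
```

By the same reasoning, keeping a single decoder instance alive for the whole stream should confine the run of zeros to the very first chunk, although that has not been verified against JLayer here.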

I'm trying to play a continuous MP3 live stream. I know for certain there is no silence in the stream because I can listen to it from another source. The problem is that it's almost as if the decoder is only decoding 1 or 2 frames, but the first couple are always a string of zeros, which is wrong since I know there is no silence. — Sabobin, May 24 '13 at 16:00
@Sabobin You cannot properly decode starting at a random frame if the bitrate is variable (the BitReserve is not properly filled in that case, as it depends on the previous frame, which you didn't decode). It's completely normal that it takes a few frames until everything is properly in sync. — Durandal, May 24 '13 at 16:03
The stream is not variable bitrate. It's a static 128 kbps bit rate. — Sabobin, May 24 '13 at 16:07
@Sabobin You are still missing the data that is expected in the synthesis filter (and the previous block data used in the IMDCT) when you start in the middle of a stream. MP3 isn't as simple as WAV. Each frame depends partially on the previous frame. And depending on how the bit reserve is handled, you can still lose bits even on CBR. — Durandal, May 24 '13 at 16:14
How can this be possible if each frame passed in has a header which describes the data being passed to the decoder? I don't understand your reasoning. — Sabobin, May 24 '13 at 16:38
@Sabobin Sigh... a *frame* is just a packet of data describing how the state of the decoder is to be altered. It *doesn't* map directly to the PCM output by the decoder for that call to decodeFrame(). MP3 decoding is basically a pipeline, and in your case the pipeline stages aren't properly filled initially. — Durandal, May 24 '13 at 16:50
I realise it doesn't map directly to the PCM data. Surely if the decoder requires more frames to be read before it can output the next PCM buffer, it wouldn't output anything until the next call, but in my case it's outputting a bunch of zeros. — Sabobin, May 24 '13 at 17:00
The decoder can *not* output nothing; its output rate is dictated by the sample rate of the stream (it's an architectural choice). Since all its internal state is initially at zeros, it will compute zeros until that state is replaced. — Durandal, May 24 '13 at 17:32

Yozek · Answer 2 · 2013-06-05T14:57:15.893

I've been playing a little with MP3 decoding using JLayer and I'm facing the same issue: for each frame I get lots of zeros and then several non-zero PCM samples.

I assumed the decodeFrame() method returns the real decoded PCM samples, because it has already requantized, Huffman-decoded and polyphase-resynthesized the encoded data for me.

This way there are more PCM samples in total than there should be, so I decided to strip off all the zero PCM samples and write out the rest in WAV format. I know it's a bit 'weird', but now it sounds as it should!

The song I decoded is CBR and mono, just to keep things simpler.

I thought that maybe all those zeros have something to do with the bit reservoir, so that if the song and the psychoacoustic model used don't really need them, they're set to zero. Then I made other tests.

What I concluded is that if each Layer 3 frame is decoded into 2304 PCM samples, in a mono song only the first half is non-zero, while the second half is all zeros. But if I use a stereo MP3, almost all samples are non-zero, except obviously at the very beginning of the song.

So it seems that this 'issue' only arises with mono-encoded MP3s. With a stereo MP3 I can get all the correct PCM samples; with a mono MP3 I just need to take the first half of the decoded PCM samples per frame.

But isn't this a waste of space for an audio compression algorithm? Maybe I'm still missing something...

Hope this could help a bit...

EDIT

From what I can see, the channels are interleaved in the frame: for a 2-channel MP3, the 2304 decoded PCM samples are:

L[0],R[0],L[1],R[1],L[2],R[2],.......,L[1151],R[1151]

The output WAV file generated now sounds much better than before.
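
For reference, splitting such an interleaved buffer into channels might look like the sketch below. It is plain Java; PcmChannels and deinterleave are hypothetical names, and validLength stands for the number of meaningful samples in the buffer (2304 for stereo and 1152 for mono, going by the observations above). JLayer's SampleBuffer appears to offer getBufferLength() and getChannelCount() for these two values, but treat that as an assumption to check against your JLayer version.

```java
import java.util.Arrays;

final class PcmChannels {
    /**
     * Splits interleaved PCM (L, R, L, R, ...) into one array per channel,
     * reading only the first validLength samples of the buffer.
     */
    static short[][] deinterleave(short[] interleaved, int validLength, int channels) {
        int perChannel = validLength / channels;
        short[][] out = new short[channels][perChannel];
        for (int i = 0; i < perChannel; i++) {
            for (int c = 0; c < channels; c++) {
                out[c][i] = interleaved[i * channels + c];
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Tiny stereo example: left samples are positive, right samples negative.
        short[] stereo = {1, -1, 2, -2, 3, -3};
        short[][] split = deinterleave(stereo, stereo.length, 2);
        System.out.println(Arrays.toString(split[0])); // prints [1, 2, 3]
        System.out.println(Arrays.toString(split[1])); // prints [-1, -2, -3]
    }
}
```

With channels set to 1 and validLength set to half the buffer size, the same method simply drops the unused zero-filled second half of a mono frame.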
