
I recently started programming in Swift as I'm trying to work out an idea I've had for an iOS camera app. The main goal of the project is to save the 10 seconds of video that precede the moment the record button is tapped. So the app is actually always capturing and storing frames, while discarding any frames that are more than 10 seconds old whenever the app is not 'recording'.

My approach is to output video and audio data from the AVCaptureSession using AVCaptureVideoDataOutput() and AVCaptureAudioDataOutput() respectively. Through captureOutput() I receive a CMSampleBuffer for both video and audio, which I store in separate arrays. I would like those arrays to later serve as input for the AVAssetWriter.
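To make that concrete, here is roughly the setup I have in mind (a minimal sketch: the class and queue names are placeholders, device selection and error handling are stripped down, and I'm aware that holding on to raw CMSampleBuffers may itself be a problem):

```swift
import AVFoundation

final class RollingCaptureBuffer: NSObject,
    AVCaptureVideoDataOutputSampleBufferDelegate,
    AVCaptureAudioDataOutputSampleBufferDelegate {

    let session = AVCaptureSession()
    private let videoOutput = AVCaptureVideoDataOutput()
    private let audioOutput = AVCaptureAudioDataOutput()
    private let queue = DispatchQueue(label: "capture.buffering")

    // Rolling storage for the most recent sample buffers.
    private var videoBuffers: [CMSampleBuffer] = []
    private var audioBuffers: [CMSampleBuffer] = []

    func configure() throws {
        session.beginConfiguration()
        defer { session.commitConfiguration() }

        if let camera = AVCaptureDevice.default(for: .video),
           let microphone = AVCaptureDevice.default(for: .audio) {
            let cameraInput = try AVCaptureDeviceInput(device: camera)
            let microphoneInput = try AVCaptureDeviceInput(device: microphone)
            if session.canAddInput(cameraInput) { session.addInput(cameraInput) }
            if session.canAddInput(microphoneInput) { session.addInput(microphoneInput) }
        }

        videoOutput.setSampleBufferDelegate(self, queue: queue)
        audioOutput.setSampleBufferDelegate(self, queue: queue)
        if session.canAddOutput(videoOutput) { session.addOutput(videoOutput) }
        if session.canAddOutput(audioOutput) { session.addOutput(audioOutput) }
    }

    // Both data outputs call this same delegate method; I split by output.
    // Note: retaining many CMSampleBuffers can starve the capture pipeline's
    // internal buffer pool, which is part of what I'm unsure about.
    func captureOutput(_ output: AVCaptureOutput,
                       didOutput sampleBuffer: CMSampleBuffer,
                       from connection: AVCaptureConnection) {
        if output === videoOutput {
            videoBuffers.append(sampleBuffer)
        } else if output === audioOutput {
            audioBuffers.append(sampleBuffer)
        }
    }
}
```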

This is the point where I'm unsure about the role of time and timing with regard to the sample buffers and the capture session in general, because (I believe) in order to present the sample buffers to the AVAssetWriter as input, I need to make sure my video and audio data are the same length (duration-wise) and synchronized.
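For the writer side, this is roughly what I picture (just a sketch: output settings and error handling are omitted, writeBuffers is a placeholder name, and my understanding is that the writer reads each appended buffer's own presentation timestamp rather than requiring one audio buffer per video buffer):

```swift
import AVFoundation

func writeBuffers(video: [CMSampleBuffer],
                  audio: [CMSampleBuffer],
                  to url: URL) throws {
    let writer = try AVAssetWriter(outputURL: url, fileType: .mov)

    // nil output settings means pass-through; real code would supply
    // encoding settings (e.g. from recommendedVideoSettingsForAssetWriter).
    let videoInput = AVAssetWriterInput(mediaType: .video, outputSettings: nil)
    let audioInput = AVAssetWriterInput(mediaType: .audio, outputSettings: nil)
    if writer.canAdd(videoInput) { writer.add(videoInput) }
    if writer.canAdd(audioInput) { writer.add(audioInput) }

    writer.startWriting()
    // Start the session at the oldest buffered video timestamp so the file
    // begins where my rolling buffer begins.
    if let first = video.first {
        writer.startSession(atSourceTime: CMSampleBufferGetPresentationTimeStamp(first))
    }

    // Naive append loops; a real implementation would drive this with
    // requestMediaDataWhenReady(on:using:) instead of skipping buffers.
    for buffer in video where videoInput.isReadyForMoreMediaData {
        videoInput.append(buffer)
    }
    for buffer in audio where audioInput.isReadyForMoreMediaData {
        audioInput.append(buffer)
    }

    videoInput.markAsFinished()
    audioInput.markAsFinished()
    writer.finishWriting { /* check writer.status / writer.error here */ }
}
```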

I currently need to figure out at what rate the capture session is running, or how I can set that rate. Ideally I would have one audioSampleBuffer for each videoSampleBuffer, both representing exactly the same duration. I don't know what realistic values are, but my end goal is to output 60 fps, so it would be perfect if each videoSampleBuffer contained 1 frame and each audioSampleBuffer represented 1/60th of a second. I could then easily append the newest sample buffers to the arrays and drop the oldest.
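If there turns out to be no way to force that one-to-one pairing, my fallback idea is to trim by timestamp instead of by count, something like this (a sketch with a hypothetical helper; it assumes all the buffers' presentation timestamps come from the same clock):

```swift
import CoreMedia

// Keep only the buffers whose presentation timestamp falls within the last
// `window` seconds, measured against the newest buffer in the array.
func trim(_ buffers: inout [CMSampleBuffer], toLast window: Double) {
    guard let newest = buffers.last else { return }
    let latest = CMSampleBufferGetPresentationTimeStamp(newest)
    buffers.removeAll { buffer in
        let pts = CMSampleBufferGetPresentationTimeStamp(buffer)
        return CMTimeGetSeconds(CMTimeSubtract(latest, pts)) > window
    }
}

// Usage after appending a new buffer in captureOutput:
// trim(&videoBuffers, toLast: 10)
// trim(&audioBuffers, toLast: 10)
```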

I've of course done some research regarding my problem, but wasn't able to find what I was looking for.

My initial thought was that I had to make the capture session run at some sort of fixed timescale, but I didn't see such an option in the AVFoundation documentation. I then looked into Core Media to see whether there was some way to set the clock the capture session uses, but I couldn't find a way to tell the session to use a different CMClock (with properties I know), so I gave up on this route. I still wasn't sure about the internal mechanics and timing of the capture session, so I tried to find more information about it, but without much luck. I've also stumbled upon the synchronizationClock property of AVCaptureSession, but I couldn't figure out how to use it or find an example.
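The closest I got was reading the time off that clock so I could compare it with the buffers' timestamps, roughly like this (a sketch; synchronizationClock is only available on iOS 15.4+, and masterClock is the older equivalent as far as I can tell):

```swift
import AVFoundation

func logSessionClock(of session: AVCaptureSession) {
    if #available(iOS 15.4, *), let clock = session.synchronizationClock {
        // Sample buffer timestamps output by the session are expressed on this clock.
        let now = CMClockGetTime(clock)
        print("session clock time:", CMTimeGetSeconds(now))
    }
}
```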

Up to this point, my best guess is that with every step in time (represented by a timestamp) a new sample buffer is created for both video and audio, which would be a good thing. But I have a feeling this is just wishful thinking, and even then I still wouldn't know what duration the buffers represent.
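To check that guess, this is what I'm planning to log for every buffer that arrives in captureOutput(_:didOutput:from:) (a small sketch; the function name is just a placeholder):

```swift
import CoreMedia

func logTiming(of sampleBuffer: CMSampleBuffer, label: String) {
    let pts = CMSampleBufferGetPresentationTimeStamp(sampleBuffer)
    let duration = CMSampleBufferGetDuration(sampleBuffer)
    let sampleCount = CMSampleBufferGetNumSamples(sampleBuffer)
    print(label,
          "pts:", CMTimeGetSeconds(pts),
          "duration:", CMTimeGetSeconds(duration),
          "samples:", sampleCount)
}
```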

Could anyone point me in the right direction and help me understand how time works in a capture session, and how to get or set the duration of sample buffers?

  • I don't think you will get anywhere trying to make video from single frames. I have to handle frames for one of my apps, and it's very heavy from a performance and memory perspective (especially on better devices). Explore the opposite avenue instead: what if you start saving when the user navigates to the video view, and just cut the last 10 seconds when the user taps "record"? You could also cut and delete the video every, say, 5 minutes if it comes to it. – timbre timbre Jan 31 '23 at 23:15
  • Thanks for your reply. Could you please elaborate on why handling frames is so much more intensive performance-wise compared to a regular video file output? It seems to me, as a beginner, that just storing data is more lightweight than continuous video writing. Because real-time video file output is also data > store > write, right? It seems to me like taking away the write part makes it more efficient. And about your suggested solution: I haven't come across a way to cut/trim the video while it is being written. This was my initial approach when I first started looking into AVFoundation. – casvandergun Feb 01 '23 at 09:01
  • Too long for a comment, but the basic reasons are here: https://developer.apple.com/library/archive/technotes/tn2445/_index.html. You will be getting frame by frame. You need to process each frame in a very short time, or frames will start dropping. Now what do you do with the frames? Say you stitch them into a video. This is not trivial, so you may not be able to do it after every frame (or frames will start dropping). So you have to save them to disk first anyway, and then start stitching when the user presses record, causing a huge memory spike (forget showing a live view at that moment) and a lag. – timbre timbre Feb 03 '23 at 15:27
  • There are of course other scenarios you can try. But the bottom line is: frame-by-frame processing is hard. Take a shortcut by letting the video get recorded to a file for you, and then you can do what you want with that video: chop it, cut it, etc. (see the sketch below these comments). – timbre timbre Feb 03 '23 at 15:28
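A minimal sketch of the record-to-a-file-then-trim approach described in the comments, assuming a continuously recorded movie file already exists at sourceURL (for example written by AVCaptureMovieFileOutput); the function and URL names are hypothetical:

```swift
import AVFoundation

// Export only the last 10 seconds of an already-recorded movie file.
func exportLastTenSeconds(from sourceURL: URL, to outputURL: URL) {
    let asset = AVAsset(url: sourceURL)
    guard let export = AVAssetExportSession(asset: asset,
                                            presetName: AVAssetExportPresetPassthrough) else { return }
    let end = asset.duration
    let tenSeconds = CMTime(seconds: 10, preferredTimescale: 600)
    let start = CMTimeMaximum(.zero, CMTimeSubtract(end, tenSeconds))
    export.timeRange = CMTimeRange(start: start, end: end)
    export.outputURL = outputURL
    export.outputFileType = .mov
    export.exportAsynchronously {
        // Inspect export.status / export.error here.
    }
}
```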
