GOAL: Detect people in real time, with minimal latency, from a remote video stream.
Setup:
- Raspberry Pi 2 with a USB webcam, serving the video stream using Flask.
- Local machine (MacBook Pro) receives the stream and processes frames with OpenCV, Darknet/DarkFlow/YOLO, and TensorFlow.
- The processed stream is displayed with a rectangle drawn around each detected person.
- Python 3
I currently have the base functionality working, but it is rather slow: a frame is processed only every few seconds, when I need it processed in under a second. The result is a choppy video that lags well behind the live stream. From searching around, this seems to be a common problem, but I have not found a straightforward answer.
I have implemented the stream grab as its own thread, as some forums suggest, but I believe the bottleneck is now the time it takes to process each grabbed frame.
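For what it's worth, one latency-reducing variation I've been experimenting with is keeping only the newest frame instead of queueing every frame, so the detector never works through a backlog. A minimal, runnable sketch of that hand-off (plain Python, no OpenCV, so the logic is visible in isolation; `put_latest` is just an illustrative helper name):

```python
import queue

# A size-1 queue: the producer always replaces the stale frame,
# so the consumer only ever sees the newest one.
latest = queue.Queue(maxsize=1)

def put_latest(q, item):
    """Put item into q, discarding any older item already waiting."""
    try:
        q.put_nowait(item)
    except queue.Full:
        try:
            q.get_nowait()   # drop the stale frame
        except queue.Empty:
            pass
        q.put_nowait(item)

# Simulate a fast producer / slow consumer: only frame 3 survives.
for frame_id in (1, 2, 3):
    put_latest(latest, frame_id)

print(latest.get())  # prints 3
```

In the grab thread this would replace `incoming_frames.put(frame_dict)`, trading dropped frames for lower display latency.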
Is it possible to improve performance? Do I need to run the processing in the cloud on a machine with a good GPU to take advantage of that speedup? Am I using the wrong YOLO weights and cfg? I know YOLOv3 is out, but I had issues getting it to work in my environment.
import queue
import threading
import time

import cv2
from darkflow.net.build import TFNet

incoming_frames = queue.Queue()

class Stream(threading.Thread):
    def __init__(self, ID):
        threading.Thread.__init__(self)
        self._stop_event = threading.Event()  # needed by stop(); was never created before
        self.cam = cv2.VideoCapture('http://raspberrypi.local:5000/')

    def run(self):
        frame_id = 0
        while not self._stop_event.is_set():
            ret, frame = self.cam.read()
            if ret:
                frame_id += 1
                incoming_frames.put({'frame': frame, 'id': frame_id})
                print("ACQUIRED FRAME " + str(frame_id))
            time.sleep(0.1)

    def stop(self):
        self._stop_event.set()
print("[INFO] Starting Process......")
print("[INFO] Load Model / Weights")
options = {"model": "cfg/yolo.cfg", "load": "bin/yolo.weights", "threshold": 0.1}
tfnet = TFNet(options)

print("[INFO] Start Video Grab Thread")
stream = Stream(0)
stream.start()

while True:
    if not incoming_frames.empty():
        frame = incoming_frames.get()
        result = tfnet.return_predict(frame['frame'])
        print("Processing Frame " + str(frame['id']))
        coordinates = []
        for detection in result:
            if detection['label'] == 'person' and detection['confidence'] >= 0.4:
                # Draw a box around each detected person.
                cv2.rectangle(frame['frame'],
                              (detection['topleft']['x'], detection['topleft']['y']),
                              (detection['bottomright']['x'], detection['bottomright']['y']),
                              (0, 255, 0), 2)
                body = {'x': detection['topleft']['x'], 'y': detection['topleft']['y'],
                        'width': detection['bottomright']['x'] - detection['topleft']['x'],
                        'height': detection['bottomright']['y'] - detection['topleft']['y']}
                coordinates.append(body)
        cv2.imshow('Video', frame['frame'])
        # Press 'q' to quit; without this the loop never reaches stream.stop().
        if cv2.waitKey(1) & 0xFF == ord('q'):
            break

stream.stop()
cv2.destroyAllWindows()
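To confirm where the time actually goes, I've been meaning to time the model call itself rather than guessing. A minimal sketch of how I'd measure it (`timed` is just an illustrative helper; here `time.sleep` stands in for `tfnet.return_predict` so the snippet runs anywhere):

```python
import time

def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start

# In the main loop this would wrap the inference call, e.g.:
#   result, dt = timed(tfnet.return_predict, frame['frame'])
# A sleep stands in for inference in this sketch.
_, dt = timed(time.sleep, 0.05)
print(f"inference took {dt * 1000:.1f} ms")
```

If the per-frame inference time alone exceeds my latency budget, no amount of threading on the grab side will help, and a GPU (or a smaller model) is the real fix.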