The user interface is how humans interact with computers. Traditionally, we use a keyboard and mouse to browse websites, use apps, or play video games. For specific applications, such as an infographics display in a mall or airport, the content is often limited and simple. In such cases a keyboard and mouse would be overkill, and hand gesture recognition is an alternative. Moreover, after Covid-19, contactless systems have become more important than ever, since they can help reduce the spread of communicable diseases such as the flu in public places.
In this article we will see how to integrate a pretrained gesture recognition model to control a simple web app. The app can be deployed on a device such as a Raspberry Pi 4 along with some LED indicators. This is one of the case-study articles from our blog.
To achieve this we will use the MediaPipe Gesture Recognizer model from Google, which is pretrained and lightweight. The main component of the project is a Python script that loads the pretrained model and runs predictions on the streaming video input. The other part is the web app, which navigates through infographic pages according to the user's gesture input.
The Web App
For the purpose of demonstration, we will use a simple web app that moves pages forward or backward when one of the valid hand gestures is detected. The MediaPipe gesture recognition model can detect the following hand gestures: “Closed_Fist”, “Open_Palm”, “Pointing_Up”, “Thumb_Down”, “Thumb_Up”, “Victory”, and “ILoveYou”. I have chosen “Thumb_Up” for moving forward and “Thumb_Down” for moving backward.
Python Libraries
The Python libraries we need are OpenCV for image capture, GPIO helpers (imported from the project's utils module) for the LEDs, mediapipe for gesture recognition, and streamlit for easily creating the web app and UI. Now let us take a look at the main script.
import cv2
import streamlit as st
import mediapipe as mp
from os.path import join
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
from utils import setup_gpio, gpio_action, gpio_clear
base_options = python.BaseOptions(model_asset_path="./models/gesture_recognizer.task")
options = vision.GestureRecognizerOptions(base_options=base_options)
recognizer = vision.GestureRecognizer.create_from_options(options)
DELAY_COUNT = 10
NUM_PAGES = 9
SELECTED_CLASSES = ["Thumb_Up", "Thumb_Down"]
CLASSES = [
    "None",
    "Closed_Fist",
    "Open_Palm",
    "Pointing_Up",
    "Thumb_Down",
    "Thumb_Up",
    "Victory",
    "ILoveYou",
]
pages = [
    "video",
    "cpu.jpeg",
    "network_card.jpeg",
    "smps.jpeg",
    "motherboard.jpeg",
    "gpu.jpeg",
    "fan.jpeg",
    "storage.jpeg",
    "ram.jpeg",
]
@st.cache_data
def load_image(file):
    return cv2.imread(file)
def play_video(frame_holder, html_holder, class_holder):
    video_html = """<video width="720" controls autoplay="true" loop="true">
<source src="https://github.com/cksajil/hand_gesture_recognition/raw/video/static/war.mp4" type="video/mp4" />
</video>"""
    frame_holder.markdown(video_html, unsafe_allow_html=True)
    html_holder.write("")
    class_holder.write("")
def main_page(frame_holder, html_holder, idx):
    current_page = pages[idx]
    frame_holder.write("")
    html_holder.image(load_image(join("static", current_page)), channels="BGR")
    return idx
def predict_frame(raw_frame):
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=raw_frame)
    recognition_result = recognizer.recognize(mp_image)
    top_gesture = recognition_result.gestures
    gesture_detected = "None"
    if top_gesture:
        gesture_detected = top_gesture[0][0].category_name
    return gesture_detected
def main():
    setup_gpio()
    cap = cv2.VideoCapture(0)
    width = 16
    height = 16
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
    stop_button_pressed = st.button("Stop")
    frame_holder = st.empty()
    html_holder = st.empty()
    class_holder = st.empty()
    idx = 0
    gesture_buffer = []
    while cap.isOpened() and not stop_button_pressed:
        success, raw_frame = cap.read()
        if not success:
            st.write("Video Capture Ended")
            break
        raw_frame = cv2.cvtColor(raw_frame, cv2.COLOR_BGR2RGB)
        gesture_detected = predict_frame(raw_frame)
        if gesture_detected not in SELECTED_CLASSES:
            continue
        else:
            gesture_buffer.append(gesture_detected)
            gesture_buffer = gesture_buffer[-DELAY_COUNT:]
            up_count = gesture_buffer.count("Thumb_Up")
            down_count = gesture_buffer.count("Thumb_Down")
            if up_count == DELAY_COUNT:
                idx += 1
                gesture_buffer.clear()
            elif down_count == DELAY_COUNT:
                idx -= 1
                gesture_buffer.clear()
            idx = idx % NUM_PAGES
            if idx == 0:
                play_video(frame_holder, html_holder, class_holder)
                gpio_clear()
            else:
                idx = main_page(frame_holder, html_holder, idx)
                gpio_action(idx)
        if stop_button_pressed:
            break
    cap.release()
    cv2.destroyAllWindows()
if __name__ == "__main__":
    main()
The app has nine pages. The first page is a promo video, and the remaining pages are images with infographic content. The user can move forward or backward through the pages using “Thumb_Up” and “Thumb_Down”, and the page index wraps around at both ends.
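The wrap-around navigation can be sketched in isolation. This is a minimal illustration and not part of the project script: `GESTURE_STEP` and `next_page` are hypothetical names introduced here, while `NUM_PAGES` and the gesture labels are taken from the script.

```python
NUM_PAGES = 9  # same value as in the main script

# Hypothetical mapping from gesture label to page step (illustration only).
GESTURE_STEP = {"Thumb_Up": 1, "Thumb_Down": -1}


def next_page(idx, gesture, num_pages=NUM_PAGES):
    """Return the page index after applying a gesture, wrapping at both ends."""
    step = GESTURE_STEP.get(gesture, 0)  # other gestures leave the page unchanged
    return (idx + step) % num_pages


print(next_page(0, "Thumb_Up"))    # 1
print(next_page(8, "Thumb_Up"))    # 0
print(next_page(0, "Thumb_Down"))  # 8
print(next_page(3, "Victory"))     # 3
```

Because Python's modulo operator always returns a non-negative result for a positive divisor, going backward from the first page lands on the last page without any extra branching.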
Now let us take a look at each part of the code and what it does. The first part is the import statements where we import all the necessary libraries.
import cv2
import streamlit as st
import mediapipe as mp
from os.path import join
from mediapipe.tasks import python
from mediapipe.tasks.python import vision
from utils import setup_gpio, gpio_action, gpio_clear
The libraries must be installed on the machine for the imports to succeed. If any library is missing, we can install it using the pip command
pip install <library_name>
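Before running the script, it can help to check which of the imported modules are actually available. The sketch below uses only the standard library. `missing_modules` is a hypothetical helper name, and the module names are the ones imported by the script.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


# Module names as they appear in the script's import statements.
required = ["cv2", "streamlit", "mediapipe"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```

Note that the import name and the package name can differ: on PyPI the OpenCV bindings that provide `cv2` are usually published as `opencv-python`.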
The next part defines the global variables, the path to the gesture recognition model, and so on.
base_options = python.BaseOptions(model_asset_path="./models/gesture_recognizer.task")
options = vision.GestureRecognizerOptions(base_options=base_options)
recognizer = vision.GestureRecognizer.create_from_options(options)
DELAY_COUNT = 10
NUM_PAGES = 9
SELECTED_CLASSES = ["Thumb_Up", "Thumb_Down"]
CLASSES = [
    "None",
    "Closed_Fist",
    "Open_Palm",
    "Pointing_Up",
    "Thumb_Down",
    "Thumb_Up",
    "Victory",
    "ILoveYou",
]
pages = [
    "video",
    "cpu.jpeg",
    "network_card.jpeg",
    "smps.jpeg",
    "motherboard.jpeg",
    "gpu.jpeg",
    "fan.jpeg",
    "storage.jpeg",
    "ram.jpeg",
]
Next, we have a couple of functions for specific purposes.
@st.cache_data
def load_image(file):
    return cv2.imread(file)
The load_image function reads an image from disk so that it can be displayed in the streamlit app. Note that we use the st.cache_data decorator to cache the result, so each image is read only once and later loads are faster.
Next is the function that plays the video by rendering an HTML video tag.
def play_video(frame_holder, html_holder, class_holder):
    video_html = """<video width="720" controls autoplay="true" loop="true">
<source src="https://github.com/cksajil/hand_gesture_recognition/raw/video/static/war.mp4" type="video/mp4" />
</video>"""
    frame_holder.markdown(video_html, unsafe_allow_html=True)
    html_holder.write("")
    class_holder.write("")
Here the video is hosted online and its URL is given as the source, so that the HTML video player can stream it from the internet. We use empty streamlit placeholders: frame_holder renders the video tag as markdown with HTML enabled, while the other two placeholders are cleared.
Next is another function that loads and displays a particular page/image.
def main_page(frame_holder, html_holder, idx):
    current_page = pages[idx]
    frame_holder.write("")
    html_holder.image(load_image(join("static", current_page)), channels="BGR")
    return idx
Here html_holder loads the image that corresponds to the index we pass and displays it in the web app, while frame_holder is cleared.
Next is a function that predicts the gesture on any frame/image passed.
def predict_frame(raw_frame):
    mp_image = mp.Image(image_format=mp.ImageFormat.SRGB, data=raw_frame)
    recognition_result = recognizer.recognize(mp_image)
    top_gesture = recognition_result.gestures
    gesture_detected = "None"
    if top_gesture:
        gesture_detected = top_gesture[0][0].category_name
    return gesture_detected
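In the function above, recognition_result.gestures holds one entry per detected hand, and each entry is a list of candidate categories with a category_name and a score. The following sketch shows one way to read that structure and ignore weak predictions. It is not part of the project script: `top_gesture_name` and the `min_score` threshold are assumptions made for illustration, and `Category` is a stand-in so that the example runs without MediaPipe.

```python
from collections import namedtuple

# Stand-in for MediaPipe's category objects, so the sketch runs without the
# library; only the two fields used below are modeled.
Category = namedtuple("Category", ["category_name", "score"])


def top_gesture_name(gestures, min_score=0.5):
    """Return the best label for the first hand, or "None" if absent or weak.

    `gestures` has one entry per detected hand; each entry is a list of
    candidate categories. `min_score` is an assumed threshold.
    """
    if not gestures or not gestures[0]:
        return "None"
    best = max(gestures[0], key=lambda c: c.score)
    return best.category_name if best.score >= min_score else "None"


print(top_gesture_name([[Category("Thumb_Up", 0.91)]]))  # Thumb_Up
print(top_gesture_name([[Category("Victory", 0.30)]]))   # None
print(top_gesture_name([]))                              # None
```

Filtering on the score is a design choice that trades responsiveness for fewer accidental page changes; the original function simply takes the first category of the first hand.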
The crucial part of the code is the main function which controls the overall flow of the program.
def main():
    setup_gpio()
    cap = cv2.VideoCapture(0)
    width = 16
    height = 16
    cap.set(cv2.CAP_PROP_FRAME_WIDTH, width)
    cap.set(cv2.CAP_PROP_FRAME_HEIGHT, height)
    stop_button_pressed = st.button("Stop")
    frame_holder = st.empty()
    html_holder = st.empty()
    class_holder = st.empty()
    idx = 0
    gesture_buffer = []
    while cap.isOpened() and not stop_button_pressed:
        success, raw_frame = cap.read()
        if not success:
            st.write("Video Capture Ended")
            break
        raw_frame = cv2.cvtColor(raw_frame, cv2.COLOR_BGR2RGB)
        gesture_detected = predict_frame(raw_frame)
        if gesture_detected not in SELECTED_CLASSES:
            continue
        else:
            gesture_buffer.append(gesture_detected)
            gesture_buffer = gesture_buffer[-DELAY_COUNT:]
            up_count = gesture_buffer.count("Thumb_Up")
            down_count = gesture_buffer.count("Thumb_Down")
            if up_count == DELAY_COUNT:
                idx += 1
                gesture_buffer.clear()
            elif down_count == DELAY_COUNT:
                idx -= 1
                gesture_buffer.clear()
            idx = idx % NUM_PAGES
            if idx == 0:
                play_video(frame_holder, html_holder, class_holder)
                gpio_clear()
            else:
                idx = main_page(frame_holder, html_holder, idx)
                gpio_action(idx)
        if stop_button_pressed:
            break
    cap.release()
    cv2.destroyAllWindows()
The general flow of the program is as follows. OpenCV streams images from the camera frame by frame. The predict_frame function checks whether a gesture is detected in the frame and, if so, returns its name. The page index starts at 0, which corresponds to the promo video. As the user shows “Thumb_Up” or “Thumb_Down”, the index moves forward or backward and the respective page is displayed. A page change happens only when the last DELAY_COUNT (10) recognized thumb gestures are all the same, after which the buffer is cleared. Since each iteration already takes a fraction of a second, I simply used this counter instead of calling a delay function.
The complete source code of the project, including the pretrained model and the web app code, is available in the project's GitHub repository.
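The counter-based confirmation of a gesture can be isolated into a small helper. This is a minimal sketch under stated assumptions, not the project's code: `GestureDebouncer` and its parameter names are hypothetical, and it mirrors the buffer logic of the main loop, where frames without an accepted gesture are skipped.

```python
from collections import deque


class GestureDebouncer:
    """Confirm a gesture only after it fills the whole buffer."""

    def __init__(self, required=10, accepted=("Thumb_Up", "Thumb_Down")):
        self.required = required
        self.accepted = accepted
        self.buffer = deque(maxlen=required)  # keeps only the latest labels

    def update(self, gesture):
        """Feed one per-frame label; return the confirmed gesture or None."""
        if gesture not in self.accepted:
            return None  # skipped, like the `continue` in the main loop
        self.buffer.append(gesture)
        if len(self.buffer) == self.required and len(set(self.buffer)) == 1:
            self.buffer.clear()  # start counting afresh for the next page change
            return gesture
        return None


debouncer = GestureDebouncer(required=3)
frames = ["Thumb_Up", "None", "Thumb_Up", "Thumb_Up", "Thumb_Down"]
confirmed = [debouncer.update(g) for g in frames]
print(confirmed)  # [None, None, None, 'Thumb_Up', None]
```

With `required=10`, as in the script, a page changes only after ten matching thumb detections, which acts as a delay without blocking the loop.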
