Introduction
Imagine controlling your desktop, apps, and tasks without touching a keyboard or mouse—just using your voice and hand gestures. With advancements in computer vision, NLP, and AI automation, this is now possible!
In this blog, we’ll build an AI-powered agent that lets you open apps, switch windows, and control tasks hands-free using Python, OpenCV, and MediaPipe, with TensorFlow as an option for custom gesture models later.
How It Works
- Hand Gesture Recognition: Detect gestures using OpenCV & MediaPipe.
- Voice Commands: Transcribe user speech and match it against known commands.
- Automate Tasks: Open apps, close windows, switch tabs using automation scripts.
Step 1: Install Dependencies
```bash
pip install opencv-python mediapipe pyttsx3 speechrecognition pyautogui
```
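Note: the `SpeechRecognition` library's `Microphone` class relies on PyAudio, which isn't always pulled in automatically. If Step 3 complains about a missing PyAudio, install it separately:

```bash
pip install pyaudio
```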
Step 2: Implement Hand Gesture Control
We’ll use MediaPipe for real-time hand tracking and map gestures to actions.
```python
import time

import cv2
import mediapipe as mp
import pyautogui

mp_hands = mp.solutions.hands
hands = mp_hands.Hands()
mp_draw = mp.solutions.drawing_utils

cap = cv2.VideoCapture(0)
last_trigger = 0  # timestamp of the last gesture-triggered action

while cap.isOpened():
    success, frame = cap.read()
    if not success:
        break

    # MediaPipe expects RGB input; OpenCV captures BGR
    frame_rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    results = hands.process(frame_rgb)

    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            mp_draw.draw_landmarks(frame, hand_landmarks, mp_hands.HAND_CONNECTIONS)

            # Landmark 4 is the thumb tip, landmark 8 is the index fingertip.
            # Image y grows downward, so a smaller y means higher in the frame.
            thumb_tip = hand_landmarks.landmark[4].y
            index_tip = hand_landmarks.landmark[8].y

            # Index finger raised above the thumb -> open a new browser tab.
            # The 2-second cooldown stops the hotkey from firing on every frame.
            if index_tip < thumb_tip and time.time() - last_trigger > 2:
                pyautogui.hotkey('ctrl', 't')
                last_trigger = time.time()

    cv2.imshow("Hand Gesture Control", frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break

cap.release()
cv2.destroyAllWindows()
```
This loop tracks the hand in real time and opens a new browser tab whenever the index fingertip is raised above the thumb tip.
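Mapping more gestures follows the same pattern: compare landmark positions, then fire a hotkey. As a rough sketch (the landmark indices come from MediaPipe's hand model; the finger-counting heuristic and the hotkey mappings are my own illustrative choices), you could count extended fingers and dispatch different actions:

```python
def count_extended_fingers(hand_landmarks):
    """Rough count: a finger is 'extended' if its tip sits above its middle (PIP) joint."""
    tips = [8, 12, 16, 20]  # index, middle, ring, pinky fingertips
    pips = [6, 10, 14, 18]  # the corresponding PIP joints
    return sum(
        hand_landmarks.landmark[tip].y < hand_landmarks.landmark[pip].y
        for tip, pip in zip(tips, pips)
    )

# Inside the detection loop (example mappings):
# fingers = count_extended_fingers(hand_landmarks)
# if fingers == 2:
#     pyautogui.hotkey('alt', 'tab')  # two fingers -> switch window
# elif fingers == 4:
#     pyautogui.hotkey('ctrl', 'w')   # open palm -> close tab
```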
Step 3: Add Voice Command Recognition
Now, let’s integrate speech commands to open apps and control the system.
```python
import os

import pyttsx3
import speech_recognition as sr

recognizer = sr.Recognizer()
engine = pyttsx3.init()  # text-to-speech engine for spoken feedback

def speak(text):
    engine.say(text)
    engine.runAndWait()

def listen_and_execute():
    with sr.Microphone() as source:
        print("Listening...")
        audio = recognizer.listen(source)

    try:
        command = recognizer.recognize_google(audio).lower()
        print(f"Command: {command}")

        if "open notepad" in command:
            os.system("notepad")           # Windows
        elif "open browser" in command:
            os.system("start chrome")      # Windows
        elif "shutdown" in command:
            os.system("shutdown /s /t 1")  # Windows
    except sr.UnknownValueError:
        speak("Sorry, I didn't catch that.")
    except sr.RequestError:
        print("Error with the speech recognition service.")

listen_and_execute()
```
This assistant listens for a command, transcribes it with Google's speech recognition API, and executes the matching system action hands-free. Note that the `os.system` calls above are Windows-specific; swap in the equivalents for macOS or Linux.
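The two pieces above run as separate scripts. To combine them into one agent, a simple approach (my own sketch, not from the original post, and it assumes you've wrapped the Step 2 loop in a `run_gesture_loop()` function) is to run the voice listener on a background thread while the gesture loop keeps the main thread, since OpenCV's windowing calls behave best there:

```python
import threading

def voice_loop():
    # Reuse listen_and_execute() from Step 3, forever
    while True:
        listen_and_execute()

if __name__ == "__main__":
    # daemon=True so the listener dies when the gesture loop exits
    threading.Thread(target=voice_loop, daemon=True).start()
    run_gesture_loop()  # hypothetical wrapper around the Step 2 while-loop
```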
Future Enhancements
- Train a custom ML model for gesture classification using TensorFlow (see the sketch after this list).
- Create an AI-powered voice assistant with GPT-3 for natural interactions.
- Deploy as a cross-platform desktop app using Electron.js + Python.
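For the first item, a minimal starting point is to flatten the 21 MediaPipe landmarks (x, y, z each) into a 63-value feature vector and train a small dense classifier on gesture recordings you label yourself. This is a sketch under that assumption; `NUM_GESTURES`, the layer sizes, and the training data `X`, `y` are placeholders:

```python
import tensorflow as tf

NUM_GESTURES = 4  # placeholder: however many gestures you record

# 21 hand landmarks x (x, y, z) = 63 input features per frame
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(63,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(NUM_GESTURES, activation='softmax'),
])

model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# X: (n_samples, 63) landmark vectors; y: (n_samples,) integer gesture labels
# model.fit(X, y, epochs=20, validation_split=0.2)
```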
Why This Matters
- Innovative AI Interaction: Hands-free control is the future of computing.
- Improved Accessibility: Helps users with mobility challenges.
- Real-World Applications: Can be used in smart homes, AR/VR, and robotics.
Conclusion
This AI-powered assistant combines Computer Vision + NLP + Automation to create a seamless, hands-free desktop experience. With further improvements, it could revolutionize human-computer interaction.
Want to take it further? Try integrating it with LLMs for a conversational AI assistant!
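For example, one way to do that (a sketch, assuming an OpenAI API key in the `OPENAI_API_KEY` environment variable; the model name is illustrative) is to route anything that isn't a known command to an LLM and speak the reply back with the `speak()` helper from Step 3:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_llm(command):
    # Forward unrecognized speech to the model and return its reply
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name
        messages=[{"role": "user", "content": command}],
    )
    return response.choices[0].message.content

# In listen_and_execute(), after the known-command checks:
# else:
#     speak(ask_llm(command))
```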
Original article: Building an AI Agent for Hands-Free Software Control Using Python and OpenCV