Unleashing the Power of Whisper and Falcon in Voice AI

Introduction

This tutorial combines speech transcription with speaker diarization. We’ll use OpenAI’s Whisper to transcribe a recording and Picovoice’s Falcon to work out who spoke when, then merge the two into a speaker-labeled transcript.

Background

We’ll run Whisper from the terminal to produce the transcript, then add Falcon’s diarization to label each speaker. Whisper on its own does not tell you who is speaking; pairing it with Falcon adds who spoke and when to the text, which makes recorded conversations much easier to analyze. Let’s combine the two!

Installation & Setup

1. Audio Recording

Install FFmpeg, an all-in-one tool for audio and video. We’ll use it to record audio, and Whisper also relies on it to read audio files.

Homebrew

brew install ffmpeg

Chocolatey

choco install ffmpeg

Once installed, use FFmpeg to list all inputs on your machine. Select an input device for later use.

macOS
ffmpeg -f avfoundation -list_devices true -i ""

Linux (FFmpeg’s ALSA input has no -list_devices option; arecord from alsa-utils lists capture devices instead)
arecord -l

Windows
ffmpeg -f dshow -list_devices true -i dummy

Note the exact name of the audio input you’ll use.
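The device name you noted plugs into FFmpeg slightly differently on each platform. Here is a minimal sketch, assuming the three input formats mentioned above (avfoundation, alsa, dshow). `build_record_command` is a hypothetical helper introduced for illustration, and the device names are placeholders.

```python
import platform


def build_record_command(device, out_path, seconds=15, system=None):
    """Return an FFmpeg argument list that records mono 16 kHz audio.

    `device` is the input name noted in the previous step (hypothetical values below).
    """
    system = system or platform.system()
    if system == "Darwin":
        # avfoundation expects "video:audio"; a leading colon selects audio only
        source = ["-f", "avfoundation", "-i", f":{device}"]
    elif system == "Linux":
        # ALSA device names look like "default" or "hw:0"
        source = ["-f", "alsa", "-i", device]
    elif system == "Windows":
        # dshow expects the device prefixed with "audio="
        source = ["-f", "dshow", "-i", f"audio={device}"]
    else:
        raise ValueError(f"Unsupported platform: {system}")
    return ["ffmpeg", *source, "-ar", "16000", "-ac", "1", "-t", str(seconds), out_path]


# Placeholder device name; substitute the one FFmpeg listed on your machine
print(build_record_command("MacBook Pro Microphone", "sample.wav", system="Darwin"))
```

The returned list can be passed straight to `subprocess.run`, which avoids quoting problems with device names that contain spaces.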

2. Whisper Transcription

To use OpenAI’s Whisper for transcription, follow these steps:

Install Python 3.8–3.11: Check your Python version with python3 -V. If it’s not within 3.8–3.11, download the latest 3.11 version from python.org.
Install PIP: Ensure Python’s package manager is installed with python3 -m pip --version. Install or upgrade it with python3 -m pip install --upgrade pip.
Install Whisper: Install Whisper and its dependencies via pip install -U openai-whisper.
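If you would rather check the interpreter version from Python itself, a small sketch could look like this. The 3.8–3.11 range is taken from the step above; `python_version_supported` is a hypothetical helper name.

```python
import sys

# Supported range as stated in the installation step above
SUPPORTED_MIN = (3, 8)
SUPPORTED_MAX = (3, 11)


def python_version_supported(version=None):
    """Return True if the (major, minor) version falls within the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return SUPPORTED_MIN <= major_minor <= SUPPORTED_MAX


if python_version_supported():
    print("Python version is in the supported range.")
else:
    print("Python version is outside 3.8-3.11; consider installing 3.11.")
```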

3. Falcon Speaker Diarization

To install Picovoice’s Falcon for speaker diarization:

Create an account and get your AccessKey from Picovoice’s Dashboard.
Install the pvfalcon Python package using pip3 install pvfalcon.
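Rather than pasting the AccessKey into source code, you may prefer to read it from the environment. A minimal sketch follows; the variable name `PICOVOICE_ACCESS_KEY` and the helper `load_access_key` are assumptions made for this example, not names defined by Picovoice.

```python
import os


def load_access_key(env=None, name="PICOVOICE_ACCESS_KEY"):
    """Read the AccessKey from an environment mapping (variable name is an assumption)."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable to your Picovoice AccessKey.")
    return key


# Example with an explicit mapping so the snippet runs without any setup
print(load_access_key({"PICOVOICE_ACCESS_KEY": "demo-key"}))
```

Calling `load_access_key()` with no arguments reads from the real environment and fails early with a clear message if the key is missing.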

Python Script for Audio Recording, Transcription, and Diarization

This Python script integrates Whisper for transcription with Falcon for speaker diarization. It records audio by calling the FFmpeg CLI, transcribes the recording, performs speaker diarization, and prints the final transcript with speaker and timestamp labels.
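Before reading the script, it may help to see the shape of the data it passes around. The sketch below uses made-up sample values: the Whisper segment is a dict with the `id`, `start`, `end`, and `text` fields the script reads, and `FalconSegment` is a stand-in namedtuple with the three attributes the script uses, not the real pvfalcon class.

```python
from collections import namedtuple

# Stand-in for Falcon's segment objects; only the attributes the script reads
FalconSegment = namedtuple("FalconSegment", ["start_sec", "end_sec", "speaker_tag"])


def format_line(speaker_tag, start, end, text):
    """Format one transcript line the same way the script prints it."""
    return f"Speaker {speaker_tag} [{start:.2f}-{end:.2f}] {text.strip()}"


# Made-up sample data for illustration only
whisper_segment = {"id": 0, "start": 0.0, "end": 3.2, "text": " Hello there."}
falcon_segment = FalconSegment(start_sec=0.0, end_sec=3.5, speaker_tag=1)

print(format_line(falcon_segment.speaker_tag,
                  whisper_segment["start"],
                  whisper_segment["end"],
                  whisper_segment["text"]))
# prints: Speaker 1 [0.00-3.20] Hello there.
```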

Code Explanation

import os
import subprocess
import datetime
import pvfalcon
import json

def record_audio():
    # Records audio using FFmpeg and saves it as a WAV file
    today = datetime.datetime.now().strftime('%Y%m%d')
    audio_file = f"./{today}.wav"
    subprocess.run([
        "ffmpeg", "-f", "avfoundation", "-i", ":YOUR_INPUT_SOURCE",
        "-ar", "16000",  # Set sample rate to 16 kHz
        "-ac", "1",      # Set audio to mono
        "-t", "15",      # Record for 15 seconds
        audio_file
    ])
    return audio_file

def transcribe_audio(audio_file):
    # Transcribes the audio using Whisper
    subprocess.run(["whisper", audio_file, "--model", "medium", "--language", "English"], check=True)
    json_output = f"{audio_file.rsplit('.', 1)[0]}.json"
    # Display macOS notification when transcription is complete
    subprocess.run(["osascript", "-e", 'display notification "Whisper Transcription Complete!" with title "Whisper AI"'])
    if not os.path.exists(json_output):
        raise FileNotFoundError(f"The file {json_output} was not created by Whisper.")
    with open(json_output, 'r') as f:
        transcription = json.load(f)
    return transcription

def perform_diarization(audio_file, access_key):
    # Applies Falcon's speaker diarization on the audio file
    falcon = pvfalcon.create(access_key=access_key)
    segments = falcon.process_file(audio_file)
    falcon.delete()  # Clean up Falcon instance after processing
    # Display macOS notification when diarization is complete
    subprocess.run(["osascript", "-e", 'display notification "Falcon Diarization Complete!" with title "Falcon AI"'])
    # Return the segments so main() can merge them with the transcript
    return segments

def merge_transcripts(transcription, diarization, overlap_threshold=0.2):
    # Merges transcripts from Whisper and diarization data from Falcon
    merged_output = []
    used_transcript_segments = set()

    for seg in diarization:
        speaker_tag = f"Speaker {seg.speaker_tag}"

        for part in transcription['segments']:
            if part['id'] in used_transcript_segments:
                continue  # Skip segments already used 
            overlap_start = max(seg.start_sec, part['start'])
            overlap_end = min(seg.end_sec, part['end'])
            overlap_duration = max(0, overlap_end - overlap_start)

            diarization_duration = seg.end_sec - seg.start_sec
            transcription_duration = part['end'] - part['start']
            min_duration = min(diarization_duration, transcription_duration)

            if overlap_duration >= overlap_threshold * min_duration:
                merged_output.append({
                    'speaker': speaker_tag,
                    'start': overlap_start,
                    'end': overlap_end,
                    'text': part['text']
                })
                used_transcript_segments.add(part['id'])

    # Sort and merge close segments
    merged_output.sort(key=lambda x: (x['speaker'], x['start']))
    final_output = []
    for seg in merged_output:
        if final_output and seg['speaker'] == final_output[-1]['speaker'] and seg['start'] - final_output[-1]['end'] < 1:
            final_output[-1]['end'] = seg['end']  # Extend the previous segment
            final_output[-1]['text'] += ' ' + seg['text']
        else:
            final_output.append(seg)

    return final_output

def main():
    access_key = "YOUR_FALCON_ACCESS_KEY"  # Replace with your actual Falcon access key
    audio_file = record_audio()
    transcription = transcribe_audio(audio_file)
    diarization = perform_diarization(audio_file, access_key)
    merged_output = merge_transcripts(transcription, diarization)
    # Print the final output with speaker tags and timestamps
    for m in merged_output:
        print(f"{m['speaker']} [{m['start']:.2f}-{m['end']:.2f}] {m['text'].strip()}")
    os.remove(audio_file)  # Clean up the audio file

if __name__ == "__main__":
    main()


This integration of Whisper and Falcon gives developers a practical tool for audio analysis. The script not only transcribes the recording but also assigns each piece of text to a specific speaker, with timestamps.
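The speaker assignment rests on a simple overlap rule: a Whisper segment is matched to a Falcon segment when their shared time span covers at least 20% of the shorter of the two. The sketch below isolates that calculation with made-up timings; `overlap_ratio` is a hypothetical helper that mirrors the arithmetic in `merge_transcripts`, not a function from either library.

```python
def overlap_ratio(a_start, a_end, b_start, b_end):
    """Shared duration of two intervals divided by the shorter interval's duration."""
    overlap = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    shorter = min(a_end - a_start, b_end - b_start)
    return overlap / shorter if shorter > 0 else 0.0


OVERLAP_THRESHOLD = 0.2  # same default as merge_transcripts

# Made-up example: Falcon segment 0.0-4.0 s, Whisper segments starting at 3.0 s and 3.9 s
for whisper_start in (3.0, 3.9):
    ratio = overlap_ratio(0.0, 4.0, whisper_start, 6.0)
    verdict = "match" if ratio >= OVERLAP_THRESHOLD else "no match"
    print(f"Whisper segment {whisper_start:.1f}-6.0: {verdict}")
```

In the first case the segments share 1.0 s of a 3.0 s Whisper segment, about a third, so the text is assigned to that speaker. In the second they share only about 0.1 s of 2.1 s, which falls below the threshold.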

It’s a perfect starting point for further customization. Dive into this project on GitHub, tweak it, and adapt it to your needs 🤝!
