Unleashing the Power of Whisper and Falcon in Voice AI

Introduction

This tutorial combines speech recognition with speaker diarization. We’ll use OpenAI’s Whisper for transcription and Picovoice’s Falcon to label who is speaking, producing a conversation transcript annotated by speaker.

Background

Whisper can transcribe audio straight from the terminal, but its output does not say who spoke. We’ll pair it with Falcon’s diarization to tell speakers apart. The combined result adds the context of who is speaking and when, which makes the transcript far more useful for downstream audio analysis. Let’s combine the two!

Installation & Setup

1. Audio Recording

Install FFmpeg, a command-line tool for audio and video. We’ll use it to record audio, and Whisper relies on it to read audio files.

Homebrew

brew install ffmpeg

Chocolatey

choco install ffmpeg

Once installed, use FFmpeg to list all inputs on your machine. Select an input device for later use.

macOS
ffmpeg -f avfoundation -list_devices true -i ""

Linux (FFmpeg’s ALSA input has no -list_devices option, so list capture devices with arecord instead)
arecord -l

Windows
ffmpeg -f dshow -list_devices true -i dummy

Note the exact name of the audio input you’ll use.
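Each platform uses a different FFmpeg input format and device syntax, so the recording command changes from system to system. The following is a minimal sketch, not part of the original script, of a helper that builds the recording command for a given platform. The name `build_record_command` and its parameters are hypothetical; `device` stands for whichever input name or index you noted above.

```python
import platform

# FFmpeg input format per operating system.
INPUT_FORMATS = {"Darwin": "avfoundation", "Linux": "alsa", "Windows": "dshow"}


def build_record_command(device, out_path, seconds=15, system=None):
    """Return an FFmpeg argument list that records mono 16 kHz audio.

    `device` is the input name or index noted earlier (hypothetical parameter).
    """
    system = system or platform.system()
    if system not in INPUT_FORMATS:
        raise ValueError(f"Unsupported platform: {system}")
    fmt = INPUT_FORMATS[system]
    # avfoundation expects "video:audio", so ":<device>" selects audio only;
    # dshow expects "audio=<device name>"; alsa takes a device such as "default".
    if system == "Darwin":
        source = f":{device}"
    elif system == "Windows":
        source = f"audio={device}"
    else:
        source = device
    return ["ffmpeg", "-f", fmt, "-i", source,
            "-ar", "16000", "-ac", "1", "-t", str(seconds), out_path]


print(build_record_command("0", "test.wav", system="Darwin"))
```

The returned list can be passed directly to `subprocess.run`, which is how the script later in this tutorial invokes FFmpeg.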

2. Whisper Speech Recognition

To use OpenAI’s Whisper for speech recognition, follow these steps:

  • Install Python 3.8–3.11: Check your Python version with python3 -V. If it’s not within 3.8–3.11, download the latest 3.11 version from python.org.

  • Install pip: Check that Python’s package manager is installed with python3 -m pip --version. Install or upgrade it with python3 -m pip install --upgrade pip.

  • Install Whisper: Install Whisper and its dependencies via pip install -U openai-whisper.
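The version requirement above can also be checked from Python itself before installing anything. This is a small sketch under the assumption that the 3.8–3.11 range stated in this tutorial is the one you want to enforce; the helper name `python_supported` is made up for illustration.

```python
import sys


def python_supported(version=None, lowest=(3, 8), highest=(3, 11)):
    """Return True if the major.minor version lies in the supported range."""
    major_minor = tuple((version or sys.version_info)[:2])
    return lowest <= major_minor <= highest


if not python_supported():
    print(f"Python {sys.version_info.major}.{sys.version_info.minor} "
          "is outside the 3.8-3.11 range stated above.")
```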

3. Falcon Speaker Diarization

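To install Picovoice’s Falcon for speaker diarization:

  • Install Falcon: Install the Python SDK via pip install pvfalcon.

  • Get an access key: Falcon needs a Picovoice AccessKey, which you can obtain from the Picovoice Console. The script below passes it to pvfalcon.create.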
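The script in the next section passes an access key to Falcon as a hard-coded string. As one possible alternative, the key could be read from an environment variable so it stays out of source control. This is only a sketch: the variable name `FALCON_ACCESS_KEY` and the helper `load_access_key` are assumptions, not something Falcon defines.

```python
import os


def load_access_key(env_var="FALCON_ACCESS_KEY"):
    """Read the access key from an environment variable (name is hypothetical)."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable first.")
    return key
```

With this in place, `access_key = load_access_key()` could stand in for the placeholder string in `main()`.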

Python Script for Audio Recording, Transcription, and Diarization

This Python script demonstrates the integration of Whisper for transcription and Falcon for speaker diarization. It records audio through the FFmpeg CLI, transcribes it, performs speaker diarization, and prints the final transcript with speaker and timestamp labels. As written it targets macOS: it records through FFmpeg’s avfoundation input and uses osascript for notifications.

Code Explanation

import os
import subprocess
import datetime
import pvfalcon
import json

def record_audio():
    # Records audio using FFmpeg and saves it as a WAV file
    today = datetime.datetime.now().strftime('%Y%m%d')
    audio_file = f"./{today}.wav"
    subprocess.run([
        "ffmpeg", "-f", "avfoundation", "-i", ":YOUR_INPUT_SOURCE",
        "-ar", "16000",  # Set sample rate to 16 kHz
        "-ac", "1",      # Set audio to mono
        "-t", "15",      # Record for 15 seconds
        audio_file
    ])
    return audio_file

def transcribe_audio(audio_file):
    # Transcribes the audio using Whisper
    subprocess.run(["whisper", audio_file, "--model", "medium", "--language", "English"], check=True)
    json_output = f"{audio_file.rsplit('.', 1)[0]}.json"
    # Display macOS notification when transcription is complete
    subprocess.run(["osascript", "-e", 'display notification "Whisper Transcription Complete!" with title "Whisper AI"'])
    if not os.path.exists(json_output):
        raise FileNotFoundError(f"The file {json_output} was not created by Whisper.")
    with open(json_output, 'r') as f:
        transcription = json.load(f)
    return transcription

def perform_diarization(audio_file, access_key):
    # Applies Falcon's speaker diarization on the audio file
    falcon = pvfalcon.create(access_key=access_key)
    segments = falcon.process_file(audio_file)
    falcon.delete()  # Clean up Falcon instance after processing
    # Display macOS notification when diarization is complete
    subprocess.run(["osascript", "-e", 'display notification "Falcon Diarization Complete!" with title "Falcon AI"'])
    return segments  # Each segment has start_sec, end_sec and speaker_tag

def merge_transcripts(transcription, diarization, overlap_threshold=0.2):
    # Merges transcripts from Whisper and diarization data from Falcon
    merged_output = []
    used_transcript_segments = set()

    for seg in diarization:
        speaker_tag = f"Speaker {seg.speaker_tag}"

        for part in transcription['segments']:
            if part['id'] in used_transcript_segments:
                continue  # Skip segments already used 
            overlap_start = max(seg.start_sec, part['start'])
            overlap_end = min(seg.end_sec, part['end'])
            overlap_duration = max(0, overlap_end - overlap_start)

            diarization_duration = seg.end_sec - seg.start_sec
            transcription_duration = part['end'] - part['start']
            min_duration = min(diarization_duration, transcription_duration)

            if overlap_duration >= overlap_threshold * min_duration:
                merged_output.append({
                    'speaker': speaker_tag,
                    'start': overlap_start,
                    'end': overlap_end,
                    'text': part['text']
                })
                used_transcript_segments.add(part['id'])

    # Sort and merge close segments
    merged_output.sort(key=lambda x: (x['speaker'], x['start']))
    final_output = []
    for seg in merged_output:
        if final_output and seg['speaker'] == final_output[-1]['speaker'] and seg['start'] - final_output[-1]['end'] < 1:
            final_output[-1]['end'] = seg['end']  # Extend the previous segment
            final_output[-1]['text'] += ' ' + seg['text']
        else:
            final_output.append(seg)

    return final_output


def main():
    access_key = "YOUR_FALCON_ACCESS_KEY"  # Replace with your actual Falcon access key
    audio_file = record_audio()
    transcription = transcribe_audio(audio_file)
    diarization = perform_diarization(audio_file, access_key)
    merged_output = merge_transcripts(transcription, diarization)
    # Print the final output with speaker tags and timestamps
    for m in merged_output:
        print(f"{m['speaker']} [{m['start']:.2f}-{m['end']:.2f}] {m['text'].strip()}")
    os.remove(audio_file)  # Clean up the audio file


if __name__ == "__main__":
    main()
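The overlap test inside merge_transcripts is easier to follow with concrete numbers. The sketch below restates it as a ratio: the overlap between a speaker turn and a transcript segment, divided by the shorter of the two durations. For spans of non-zero length this is equivalent to the script’s comparison against overlap_threshold (0.2 by default). The helper name `overlap_ratio` and the sample times are made up for illustration.

```python
def overlap_ratio(seg_start, seg_end, part_start, part_end):
    """Overlap between two time spans, relative to the shorter span."""
    overlap = max(0.0, min(seg_end, part_end) - max(seg_start, part_start))
    shortest = min(seg_end - seg_start, part_end - part_start)
    return overlap / shortest if shortest > 0 else 0.0


# Speaker turn 0-4 s versus a transcript segment at 3-5 s:
# 1 s of overlap over a 2 s segment gives 0.5, above the 0.2 threshold.
print(overlap_ratio(0.0, 4.0, 3.0, 5.0))

# A segment at 3.9-5.9 s overlaps the same turn by only about 0.1 s,
# roughly 0.05 of its length, so it would not be assigned to that speaker.
print(overlap_ratio(0.0, 4.0, 3.9, 5.9))
```

Dividing by the shorter span means a brief remark that sits entirely inside a long speaker turn still counts as a full match.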


This integration of Whisper and Falcon gives developers a practical tool for audio analysis. The script not only transcribes speech but also assigns each piece of text to a specific speaker, with timestamps.
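Because the script sorts by speaker before start time, its printed output is grouped per speaker rather than interleaved like a conversation. If a chronological transcript is preferred, one possible variation is to sort by start time and merge only consecutive entries from the same speaker. This is a sketch of that idea, not the original author’s method; `merge_chronologically`, `max_gap`, and the sample data are all hypothetical.

```python
def merge_chronologically(entries, max_gap=1.0):
    """Sort entries by start time and join consecutive turns by one speaker.

    `entries` are dicts with 'speaker', 'start', 'end' and 'text' keys,
    the same shape merge_transcripts builds before its final merge.
    """
    merged = []
    for entry in sorted(entries, key=lambda e: e["start"]):
        last = merged[-1] if merged else None
        if (last and last["speaker"] == entry["speaker"]
                and entry["start"] - last["end"] < max_gap):
            last["end"] = max(last["end"], entry["end"])
            last["text"] += " " + entry["text"].strip()
        else:
            merged.append(dict(entry))  # copy so the input is not mutated
    return merged


# Made-up sample data for illustration only.
sample = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 2.0, "text": "Hi there."},
    {"speaker": "Speaker 2", "start": 2.2, "end": 4.0, "text": "Hello!"},
    {"speaker": "Speaker 1", "start": 4.1, "end": 5.0, "text": "How are you?"},
    {"speaker": "Speaker 1", "start": 5.3, "end": 6.0, "text": "All good?"},
]
for turn in merge_chronologically(sample):
    print(f"{turn['speaker']} [{turn['start']:.2f}-{turn['end']:.2f}] {turn['text']}")
```

Here the two final Speaker 1 entries are joined because they are adjacent and less than a second apart, while the first Speaker 1 entry stays separate because Speaker 2 talks in between.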

It’s a perfect starting point for further customization. Dive into this project on GitHub, tweak it, and adapt it to your needs 🤝!

