Introduction
This tutorial combines speech transcription with speaker diarization. We’ll use OpenAI’s Whisper to transcribe a recording and Picovoice’s Falcon to work out who is speaking, producing a transcript labeled by speaker and time.
Background
We’ll run Whisper from the terminal to transcribe audio, then add Falcon’s diarization to tell the speakers apart. Diarization does not change the transcribed words; it adds the context of who is speaking and when, which makes the transcript far more useful for analyzing conversations. Let’s combine the two!
Installation & Setup
1. Audio Recording
Install FFmpeg, an all-in-one tool for audio and video. We’ll use it to record audio, and Whisper relies on it to read audio files.
Homebrew
brew install ffmpeg
Chocolatey
choco install ffmpeg
Once installed, use FFmpeg to list all inputs on your machine. Select an input device for later use.
macOS
ffmpeg -f avfoundation -list_devices true -i ""
Linux (FFmpeg’s ALSA input has no -list_devices option, so list capture devices with arecord from alsa-utils)
arecord -l
Windows
ffmpeg -f dshow -list_devices true -i dummy
Note the exact name of the audio input you’ll use.
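The device name you noted ends up in FFmpeg’s -i argument, and its form differs per platform. Here is a minimal sketch of how a recording command might be assembled, assuming the 16 kHz mono settings used by the script later in this tutorial. `build_record_command` and its parameters are hypothetical names for illustration, not part of FFmpeg or Whisper:

```python
import platform


def build_record_command(device, out_path, seconds=15, system=None):
    """Return an FFmpeg argument list that records `seconds` of audio from `device`."""
    system = system or platform.system()
    if system == "Darwin":
        # avfoundation takes "video:audio"; a leading colon means audio only
        source = ["-f", "avfoundation", "-i", f":{device}"]
    elif system == "Windows":
        # dshow expects the device name prefixed with "audio="
        source = ["-f", "dshow", "-i", f"audio={device}"]
    else:
        # ALSA identifiers look like "default" or "hw:0"
        source = ["-f", "alsa", "-i", device]
    # 16 kHz, mono, fixed duration
    return ["ffmpeg", *source, "-ar", "16000", "-ac", "1", "-t", str(seconds), out_path]


print(build_record_command("0", "test.wav", system="Darwin"))
```

The returned list can be passed straight to subprocess.run, which avoids shell quoting problems with device names that contain spaces.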
2. Whisper Transcription
To use OpenAI’s Whisper for transcription, follow these steps:
- Install Python 3.8–3.11: Check your Python version with the command below. If it’s not within 3.8–3.11, download the latest 3.11 version from python.org.
python3 -V
- Install pip: Check that Python’s package manager is installed with the first command below, and install or upgrade it with the second.
python3 -m pip --version
python3 -m pip install --upgrade pip
- Install Whisper: Install Whisper and its dependencies with the command below.
pip install -U openai-whisper
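Before moving on, it can help to confirm the environment from Python itself. The sketch below checks the 3.8–3.11 range mentioned above and looks for the ffmpeg and whisper commands on PATH. The function names are hypothetical, and it only tests that the commands can be found, not that they work:

```python
import shutil
import sys


def python_version_supported(version=None):
    """True when (major, minor) falls in the 3.8-3.11 range this tutorial targets."""
    major, minor = (version or sys.version_info)[:2]
    return major == 3 and 8 <= minor <= 11


def missing_tools(tools=("ffmpeg", "whisper")):
    """Return the command-line tools that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    print("Python version OK:", python_version_supported())
    print("Missing tools:", missing_tools() or "none")
```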
3. Falcon Speaker Diarization
To install Picovoice’s Falcon for speaker diarization:
- Create an account and get your AccessKey from Picovoice’s Dashboard.
- Install the pvfalcon Python package using the command below.
pip3 install pvfalcon
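The AccessKey is a secret, so you may prefer not to paste it into a script. One option is to read it from an environment variable. This is a sketch under assumptions: the variable name PICOVOICE_ACCESS_KEY and the helper `load_access_key` are made up for this example and are not defined by Picovoice:

```python
import os


def load_access_key(env=None, name="PICOVOICE_ACCESS_KEY"):
    """Read the AccessKey from an environment variable (the variable name is an assumption)."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set {name} to the AccessKey from the Picovoice dashboard.")
    return key
```

With the variable exported in your shell, `load_access_key()` returns the key, which can then be passed to pvfalcon.create(access_key=...).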
Python Script for Audio Recording, Transcription, and Diarization
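This Python script demonstrates the integration of Whisper for transcription and Falcon for speaker diarization. It records audio by calling the FFmpeg CLI, transcribes the recording with the Whisper CLI, runs Falcon’s speaker diarization on the same file, and prints the final transcript with speaker and timestamp labels. As written, it targets macOS: it records through avfoundation and shows notifications with osascript.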
Code Explanation
import os
import subprocess
import datetime
import pvfalcon
import json
def record_audio():
    # Records audio using FFmpeg and saves it as a WAV file
    today = datetime.datetime.now().strftime('%Y%m%d')
    audio_file = f"./{today}.wav"
    subprocess.run([
        # avfoundation is the macOS input; replace YOUR_INPUT_SOURCE with your device
        "ffmpeg", "-f", "avfoundation", "-i", ":YOUR_INPUT_SOURCE",
        "-ar", "16000",  # Set sample rate to 16 kHz
        "-ac", "1",      # Set audio to mono
        "-t", "15",      # Record for 15 seconds
        audio_file
    ], check=True)  # Stop here if the recording fails
    return audio_file
def transcribe_audio(audio_file):
    # Transcribes the audio using Whisper
    subprocess.run(["whisper", audio_file, "--model", "medium", "--language", "English"], check=True)
    json_output = f"{audio_file.rsplit('.', 1)[0]}.json"
    # Display macOS notification when transcription is complete
    subprocess.run(["osascript", "-e", 'display notification "Whisper Transcription Complete!" with title "Whisper AI"'])
    if not os.path.exists(json_output):
        raise FileNotFoundError(f"The file {json_output} was not created by Whisper.")
    with open(json_output, 'r') as f:
        transcription = json.load(f)
    return transcription
def perform_diarization(audio_file, access_key):
    # Applies Falcon's speaker diarization on the audio file
    falcon = pvfalcon.create(access_key=access_key)
    segments = falcon.process_file(audio_file)
    falcon.delete()  # Clean up Falcon instance after processing
    # Display macOS notification when diarization is complete
    subprocess.run(["osascript", "-e", 'display notification "Falcon Diarization Complete!" with title "Falcon AI"'])
    return segments  # Without this, main() would receive None
def merge_transcripts(transcription, diarization, overlap_threshold=0.2):
    # Merges transcripts from Whisper and diarization data from Falcon
    merged_output = []
    used_transcript_segments = set()
    for seg in diarization:
        speaker_tag = f"Speaker {seg.speaker_tag}"
        for part in transcription['segments']:
            if part['id'] in used_transcript_segments:
                continue  # Skip segments already used
            overlap_start = max(seg.start_sec, part['start'])
            overlap_end = min(seg.end_sec, part['end'])
            overlap_duration = max(0, overlap_end - overlap_start)
            diarization_duration = seg.end_sec - seg.start_sec
            transcription_duration = part['end'] - part['start']
            min_duration = min(diarization_duration, transcription_duration)
            # Match when the overlap covers enough of the shorter segment
            if overlap_duration >= overlap_threshold * min_duration:
                merged_output.append({
                    'speaker': speaker_tag,
                    'start': overlap_start,
                    'end': overlap_end,
                    'text': part['text']
                })
                used_transcript_segments.add(part['id'])
    # Sort by start time so the transcript reads in order, then merge close segments
    merged_output.sort(key=lambda x: x['start'])
    final_output = []
    for seg in merged_output:
        if final_output and seg['speaker'] == final_output[-1]['speaker'] and seg['start'] - final_output[-1]['end'] < 1:
            final_output[-1]['end'] = seg['end']  # Extend the previous segment
            final_output[-1]['text'] += ' ' + seg['text']
        else:
            final_output.append(seg)
    return final_output
def main():
    access_key = "YOUR_FALCON_ACCESS_KEY"  # Replace with your actual Falcon access key
    audio_file = record_audio()
    transcription = transcribe_audio(audio_file)
    diarization = perform_diarization(audio_file, access_key)
    merged_output = merge_transcripts(transcription, diarization)
    # Print the final output with speaker tags and timestamps
    for m in merged_output:
        print(f"{m['speaker']} [{m['start']:.2f}-{m['end']:.2f}] {m['text'].strip()}")
    os.remove(audio_file)  # Clean up the audio file
if __name__ == "__main__":
    main()
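The matching rule inside merge_transcripts is easier to follow with numbers. The sketch below isolates the overlap test: a Whisper segment is assigned to a Falcon segment when their overlap is at least overlap_threshold times the shorter of the two durations. `overlap_fraction` is a hypothetical helper written for illustration, and the timings are made up:

```python
def overlap_fraction(a_start, a_end, b_start, b_end):
    """Overlap between two intervals, as a fraction of the shorter interval."""
    overlap = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    shortest = min(a_end - a_start, b_end - b_start)
    # Assumption for this sketch: zero-length intervals count as no overlap
    return overlap / shortest if shortest > 0 else 0.0


# A Falcon segment from 0.0 s to 4.0 s and a Whisper segment from 3.0 s to 5.0 s
# share 1.0 s, which is half of the shorter (2.0 s) segment.
fraction = overlap_fraction(0.0, 4.0, 3.0, 5.0)
print(fraction >= 0.2)  # compared against the default overlap_threshold of 0.2
```

A low threshold such as 0.2 is forgiving when the two tools disagree on segment boundaries. Raising it makes matches stricter, at the cost of dropping more text that straddles a speaker change.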
This integration of Whisper and Falcon gives developers a practical tool for audio analysis. The script not only transcribes but also assigns text to specific speakers with timestamps.
It’s a perfect starting point for further customization. Dive into this project on GitHub, tweak it, and adapt it to your needs 🤝!
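As one example of a small customization, the sketch below renders the merged entries with MM:SS timestamps instead of raw seconds. It assumes entries shaped like the dictionaries built in merge_transcripts (speaker, start, end, text). The helper names and the sample data are made up for illustration:

```python
def format_timestamp(seconds):
    """Format a number of seconds as MM:SS (fractions of a second are dropped)."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


def render_transcript(entries):
    """Turn merged entries into one line per speaker turn."""
    return "\n".join(
        f"[{format_timestamp(e['start'])}-{format_timestamp(e['end'])}] "
        f"{e['speaker']}: {e['text'].strip()}"
        for e in entries
    )


sample = [
    {"speaker": "Speaker 1", "start": 0.0, "end": 4.2, "text": " Hello there."},
    {"speaker": "Speaker 2", "start": 65.0, "end": 70.9, "text": " Hi!"},
]
print(render_transcript(sample))
```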