Chatbot with Semantic Kernel – Part 5: Text-to-speech

Chatbot with Semantic Kernel (5 Part Series)

1 Chatbot with Semantic Kernel – Part 1: Setup and first steps
2 Chatbot with Semantic Kernel – Part 2: Plugins 🧩
3 Chatbot with Semantic Kernel – Part 3: Inspector & tokens
4 Chatbot with Semantic Kernel – Part 4: Speech-to-text with Whisper
5 Chatbot with Semantic Kernel – Part 5: Text-to-speech

In the last chapter, we added the first audio capability to our chatbot by allowing the user to interact with the model using their voice. In this chapter, we are going to add the opposite skill: giving a voice to our chatbot.

Text-to-speech

In recent years, models have vastly improved at generating audio from text input. Some model providers offer standalone Text-to-speech models, like TTS from OpenAI. Alternatively, we can use more powerful models that support both multimodal input (text, image, video and audio) and output (text, image and voice). Examples of these more powerful models are Gemini 2.0 Flash from Google and GPT-4o-realtime from OpenAI.

The possibility of generating high-quality audio with these TTS models, combined with the potential of powerful text models (like GPT-4o), has enabled many use cases that were unimaginable just a few years ago. For example, in 2024, Google released NotebookLM, an application that generates podcasts from sources uploaded by the user. If you are researching evaluation techniques for LLMs, you can upload materials such as papers or articles, and the application creates a podcast where two AI voices have a conversation summarizing and explaining your material.

Text-to-speech on Semantic Kernel

In November 2024, Microsoft added audio capabilities support to Semantic Kernel. For the Text-to-speech scenario, we will build the following workflow:

  1. Accept text or audio input from the user. You can check the previous article, where we added Audio-to-text functionality.
  2. Use a standard LLM to generate a response from the user's input.
  3. Use the TTS model from OpenAI to convert the response into audio (WAV format).
  4. Play the generated audio back to the user.

Based on our previous chatbot, the first two steps are already accomplished. Let's now focus on converting the text response into audio with the TTS model.

Generate audio

First of all, we need to inject a new service into our Kernel. In this case, we register an AzureTextToAudio service:

from semantic_kernel.connectors.ai.open_ai import AzureTextToAudio

# Inject the service into the Kernel
self.kernel.add_service(AzureTextToAudio(
    service_id='text_to_audio_service'
))

# Get the service from the Kernel
self.text_to_audio_service: AzureTextToAudio = self.kernel.get_service(type=AzureTextToAudio)

Because the service is declared as an Azure service, it uses the following environment variables:

  • AZURE_OPENAI_TEXT_TO_AUDIO_DEPLOYMENT_NAME: the name of the model deployed in Azure OpenAI.
  • AZURE_OPENAI_API_KEY: the API key associated with the Azure OpenAI instance.
  • AZURE_OPENAI_ENDPOINT: the endpoint associated with the Azure OpenAI instance.
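For reference, a minimal .env file for this setup might look like the following (the deployment name and values are placeholders, not real credentials):

# .env — placeholder values for illustration only
AZURE_OPENAI_TEXT_TO_AUDIO_DEPLOYMENT_NAME="tts"
AZURE_OPENAI_API_KEY="<your-api-key>"
AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"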

Similarly, Semantic Kernel has many other AI connectors, like the OpenAITextToAudio service. In that case, the variables would be:

  • OPENAI_TEXT_TO_AUDIO_MODEL_ID: the OpenAI text-to-audio model ID to use.
  • OPENAI_API_KEY: the API key associated with your organization.
  • OPENAI_ORG_ID: the unique identifier for your organization.
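If you prefer the non-Azure connector, the registration is analogous. A small sketch, assuming the environment variables above are set:

from semantic_kernel.connectors.ai.open_ai import OpenAITextToAudio

# Register the OpenAI (non-Azure) text-to-audio connector instead
self.kernel.add_service(OpenAITextToAudio(
    service_id='text_to_audio_service'
))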

You can check all the settings used by Semantic Kernel in the official GitHub repository.

The TextToAudio service is quite simple to use. It has two important methods:

  • get_audio_contents: returns a list of generated audio contents. Some models do not support generating multiple audio clips from a single input; in that case, the list contains only one element.
  • get_audio_content: identical to the previous method, but always returns the first element of the list.
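As a small sketch of the first method (using the same service instance we registered above, and relying on the settings argument being optional), you could fetch all generated clips and keep only the first one:

# A minimal sketch: fetch all generated clips and keep the first one
audio_contents = await self.text_to_audio_service.get_audio_contents(message)
first_clip = audio_contents[0]  # models that return a single clip yield a one-element list
audio_bytes = first_clip.data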

Both methods accept an optional OpenAITextToAudioExecutionSettings argument to customize the behavior of the service. With the current version of Semantic Kernel, you can customize the playback speed, the voice used (with Alloy being the default), and the output format. In this case, I have decided to use the echo voice in WAV format:

from semantic_kernel.connectors.ai.open_ai import OpenAITextToAudioExecutionSettings

async def generate_audio(self, message: str) -> bytes:
    audio_settings = OpenAITextToAudioExecutionSettings(voice='echo', response_format="wav")
    audio_content = await self.text_to_audio_service.get_audio_content(message, audio_settings)
    return audio_content.data

The method returns the raw bytes of the generated audio. Now we can easily feed the standard LLM response into it to generate the corresponding audio:

response = await assistant.generate_response(text)
add_message_chat('assistant', response)

if config['audio'] == 'enabled':
    audio = await assistant.generate_audio(response)

Playing the audio

Once the audio is generated, we need some code to play it back on the user's computer. For that purpose, I have created a simple AudioPlayer class using the PyAudio library:

import io
import wave

import pyaudio


class AudioPlayer:
    def play_wav_from_bytes(self, wav_bytes, chunk_size=1024):
        p = pyaudio.PyAudio()

        try:
            # Wrap the in-memory bytes so the wave module can read them like a file
            wav_io = io.BytesIO(wav_bytes)

            with wave.open(wav_io, 'rb') as wf:
                channels = wf.getnchannels()
                rate = wf.getframerate()

                # Open an output stream matching the WAV file's sample format
                stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                                channels=channels,
                                rate=rate,
                                output=True)

                # Stream the audio to the speakers chunk by chunk
                data = wf.readframes(chunk_size)
                while len(data) > 0:
                    stream.write(data)
                    data = wf.readframes(chunk_size)

                stream.stop_stream()
                stream.close()

        finally:
            p.terminate()
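To try the player in isolation, you can load any WAV file from disk and feed its bytes to the method (sample.wav here is just a hypothetical test file):

# Quick standalone test of AudioPlayer with a local WAV file
with open('sample.wav', 'rb') as f:
    wav_bytes = f.read()

AudioPlayer().play_wav_from_bytes(wav_bytes)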

Finally, we call the play_wav_from_bytes method to play back the audio generated by the model:

# Generate response with standard LLM
response = await assistant.generate_response(text)

# Add response to the user interface
add_message_chat('assistant', response)

if config['audio'] == 'enabled':
    # Generate audio from the text
    audio = await assistant.generate_audio(response)

    # Play the audio back to the user
    player = AudioPlayer()
    player.play_wav_from_bytes(audio)
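One caveat worth noting: play_wav_from_bytes blocks until playback finishes, which would freeze an async chat loop like ours. A minimal sketch of one way around this, assuming the surrounding code is a coroutine, is to offload playback to a worker thread:

import asyncio

# Offload the blocking playback so the event loop stays responsive
await asyncio.to_thread(player.play_wav_from_bytes, audio)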

Summary

In this chapter, we have given our chatbot a voice thanks to a Text-to-speech model. We have transformed our agent into a multimodal agent that supports text and audio as both input and output.

In the next chapter, we will integrate the chatbot with Ollama to enable the use of locally run models.

Remember that all the code is already available on my GitHub repository PyChatbot for Semantic Kernel.

