Chatbot with Semantic Kernel – Part 5: Text-to-speech

Chatbot with Semantic Kernel (5 Part Series)

1 Chatbot with Semantic Kernel – Part 1: Setup and first steps
2 Chatbot with Semantic Kernel – Part 2: Plugins 🧩
3 Chatbot with Semantic Kernel – Part 3: Inspector & tokens
4 Chatbot with Semantic Kernel – Part 4: Speech-to-text with Whisper
5 Chatbot with Semantic Kernel – Part 5: Text-to-speech

In the last chapter, we added the first audio capability to our chatbot by allowing the user to interact with the model using their voice. In this chapter, we are going to add the opposite skill: giving a voice to our chatbot.

Text-to-speech

In recent years, models have vastly improved at generating audio from text input. Some model providers offer standalone Text-to-speech models, like TTS from OpenAI. Alternatively, we can use more powerful models that support both multimodal input (text, image, video and audio) and output (text, image and voice). Examples of these more powerful models are Gemini 2.0 Flash from Google and GPT-4o-realtime from OpenAI.

The possibility of generating high-quality audio with these TTS models, combined with the potential of powerful text models (like GPT-4o), has enabled many use cases that were unimaginable just a few years ago. For example, in 2024, Google released NotebookLM, an application that generates podcasts from sources uploaded by the user. If you are researching evaluation techniques for LLMs, you can upload materials such as papers or articles, and the application creates a podcast where two AI voices have a conversation summarizing and explaining your material.

Text-to-speech on Semantic Kernel

In November 2024, Microsoft added audio capabilities support to Semantic Kernel. For the Text-to-speech scenario, we will build the following workflow:

  1. Accept text or audio input from the user. You can check the previous article, where we added Audio-to-text functionality.
  2. Use a standard LLM to generate a response from the user's input.
  3. Use the TTS model from OpenAI to convert the response into audio (WAV format).
  4. Play the generated audio back to the user.

Based on our previous chatbot, the first two steps are already accomplished. Let's now focus on converting the text response into audio with the TTS model.

Generate audio

First of all, we need to inject a new service into our Kernel. In this case, we register an AzureTextToAudio service:

from semantic_kernel.connectors.ai.open_ai import AzureTextToAudio

# Inject the service into the Kernel
self.kernel.add_service(AzureTextToAudio(
    service_id='text_to_audio_service'
))

# Get the service from the Kernel
self.text_to_audio_service: AzureTextToAudio = self.kernel.get_service(type=AzureTextToAudio)

Because the service is declared as an Azure service, it uses the following environment variables:

  • AZURE_OPENAI_TEXT_TO_AUDIO_DEPLOYMENT_NAME: the name of the model deployed in Azure OpenAI.
  • AZURE_OPENAI_API_KEY: the API key associated with the Azure OpenAI instance.
  • AZURE_OPENAI_ENDPOINT: the endpoint associated with the Azure OpenAI instance.
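For reference, a minimal .env file for this setup might look like the following (the deployment name and values are placeholders, not real credentials):

# .env — placeholder values for illustration only
AZURE_OPENAI_TEXT_TO_AUDIO_DEPLOYMENT_NAME="tts"
AZURE_OPENAI_API_KEY="<your-api-key>"
AZURE_OPENAI_ENDPOINT="https://<your-resource>.openai.azure.com/"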

Similarly, Semantic Kernel has many other AI connectors, like the OpenAITextToAudio service. In that case, the variables would be:

  • OPENAI_TEXT_TO_AUDIO_MODEL_ID: the OpenAI text-to-audio model ID to use.
  • OPENAI_API_KEY: the API key associated with your organization.
  • OPENAI_ORG_ID: the unique identifier for your organization.
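If you prefer the non-Azure connector, the registration is analogous. A small sketch, assuming the environment variables above are set:

from semantic_kernel.connectors.ai.open_ai import OpenAITextToAudio

# Register the OpenAI (non-Azure) text-to-audio connector instead
self.kernel.add_service(OpenAITextToAudio(
    service_id='text_to_audio_service'
))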

You can check all the settings used by Semantic Kernel in the official GitHub repository.

The TextToAudio service is quite simple to use. It has two important methods:

  • get_audio_contents: returns a list of generated audio contents. Some models do not support generating multiple audio clips from a single input; in that case, the list contains only one element.
  • get_audio_content: identical to the previous method, but always returns the first element of the list.
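As a small sketch of the first method (using the same service instance we registered above, and relying on the settings argument being optional), you could fetch all generated clips and keep only the first one:

# A minimal sketch: fetch all generated clips and keep the first one
audio_contents = await self.text_to_audio_service.get_audio_contents(message)
first_clip = audio_contents[0]  # models that return a single clip yield a one-element list
audio_bytes = first_clip.data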

Both methods accept an optional OpenAITextToAudioExecutionSettings argument to customize the behavior of the service. With the current version of Semantic Kernel, you can customize the playback speed, the voice used (with Alloy being the default), and the output format. In this case, I have decided to use the echo voice in WAV format:

from semantic_kernel.connectors.ai.open_ai import OpenAITextToAudioExecutionSettings

async def generate_audio(self, message: str) -> bytes:
    audio_settings = OpenAITextToAudioExecutionSettings(voice='echo', response_format="wav")
    audio_content = await self.text_to_audio_service.get_audio_content(message, audio_settings)
    return audio_content.data

The method returns the raw bytes of the generated audio. Now we can easily feed the standard LLM response into it to generate the corresponding audio:

response = await assistant.generate_response(text)
add_message_chat('assistant', response)

if config['audio'] == 'enabled':
    audio = await assistant.generate_audio(response)

Playing the audio

Once the audio is generated, we need some code to play it back on the user's computer. For that purpose, I have created a simple AudioPlayer class using the PyAudio library:

import io
import wave

import pyaudio


class AudioPlayer:
    def play_wav_from_bytes(self, wav_bytes, chunk_size=1024):
        p = pyaudio.PyAudio()

        try:
            # Wrap the in-memory bytes so the wave module can read them like a file
            wav_io = io.BytesIO(wav_bytes)

            with wave.open(wav_io, 'rb') as wf:
                channels = wf.getnchannels()
                rate = wf.getframerate()

                # Open an output stream matching the WAV file's sample format
                stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                                channels=channels,
                                rate=rate,
                                output=True)

                # Stream the audio to the speakers chunk by chunk
                data = wf.readframes(chunk_size)
                while len(data) > 0:
                    stream.write(data)
                    data = wf.readframes(chunk_size)

                stream.stop_stream()
                stream.close()

        finally:
            p.terminate()
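To try the player in isolation, you can load any WAV file from disk and feed its bytes to the method (sample.wav here is just a hypothetical test file):

# Quick standalone test of AudioPlayer with a local WAV file
with open('sample.wav', 'rb') as f:
    wav_bytes = f.read()

AudioPlayer().play_wav_from_bytes(wav_bytes)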

Finally, we call the play_wav_from_bytes method to play back the audio generated by the model:

# Generate response with standard LLM
response = await assistant.generate_response(text)

# Add response to the user interface
add_message_chat('assistant', response)

if config['audio'] == 'enabled':
    # Generate audio from the text
    audio = await assistant.generate_audio(response)

    # Play the audio back to the user
    player = AudioPlayer()
    player.play_wav_from_bytes(audio)
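One caveat worth noting: play_wav_from_bytes blocks until playback finishes, which would freeze an async chat loop like ours. A minimal sketch of one way around this, assuming the surrounding code is a coroutine, is to offload playback to a worker thread:

import asyncio

# Offload the blocking playback so the event loop stays responsive
await asyncio.to_thread(player.play_wav_from_bytes, audio)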

Summary

In this chapter, we have given our chatbot a voice thanks to a Text-to-speech model. We have transformed our agent into a multimodal agent that supports text and audio as both input and output.

In the next chapter, we will integrate the chatbot with Ollama to enable the use of locally run models.

Remember that all the code is already available on my GitHub repository PyChatbot for Semantic Kernel.

