openai-whisper-talk
v0.0.2
openai-whisper-talk is a sample voice conversation application powered by OpenAI technologies: Whisper, an automatic speech recognition (ASR) system; Chat Completions, an interface that simulates conversation with a model playing the role of assistant; Embeddings, which converts text to vector data for tasks such as semantic search; and the latest Text-to-Speech, which turns text into lifelike spoken audio. The application is built using Nuxt, a JavaScript framework based on Vue.js.
The application has two new features: Schedule Management and Long-Term Memory. With Schedule Management, you can command the chatbot to add, modify, delete, and retrieve scheduled events. The Long-Term Memory feature allows you to store snippets of information that the chatbot will remember for future reference. You can seamlessly integrate both functions into your conversations simply by interacting with the chatbot.
Update: Updated the openai module to the latest version and replaced the embedding model text-embedding-ada-002 with the new v3 model text-embedding-3-small.
In the future, with a few enhancements such as email and messaging features, it could become a complete personal assistant.
The App
From the main page, you can choose which chatbot to engage with. Each chatbot has a distinct personality, speaks a different language, and possesses a unique voice. You can alter the name and personality of any chatbot by clicking the Edit button adjacent to its name. Currently, the user interface does not directly support the addition of new chatbots; however, you can manually add a chatbot and customize the voice and language settings for each by modifying the assets/contacts.json file.
Additionally, you can personalize your profile by clicking on the avatar icon in the upper right corner. This allows you to enter your name and share details about yourself, enabling the chatbot to interact with you in a more personalized manner.
Audio Capture
Audio data is automatically recorded when sound is detected. A threshold setting prevents background noise from triggering the audio capture. By default, this is set to -45dB (with -10dB representing the loudest sound cutoff). You can adjust this threshold by modifying the MIN_DECIBELS variable [1] according to your needs.
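As a rough illustration of how a decibel threshold can gate capture, the sketch below computes the level of a block of samples in dBFS and compares it with the threshold. This is not the app's actual detection code, which may rely on the browser's Web Audio API instead. `levelInDecibels` and `isSoundDetected` are hypothetical helper names, and the samples are assumed to be floats between -1 and 1.

```javascript
// Hypothetical helpers; not the app's actual detection code.
const MIN_DECIBELS = -45

// Root-mean-square level of a block of samples (floats in [-1, 1]), in dBFS.
function levelInDecibels(samples) {
  if (samples.length === 0) return -Infinity
  let sumOfSquares = 0
  for (const s of samples) sumOfSquares += s * s
  const rms = Math.sqrt(sumOfSquares / samples.length)
  return rms === 0 ? -Infinity : 20 * Math.log10(rms)
}

// True when the block is loud enough to count as sound.
function isSoundDetected(samples, threshold = MIN_DECIBELS) {
  return levelInDecibels(samples) >= threshold
}
```

Raising the threshold toward 0 makes detection stricter, so quieter background noise is ignored.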
When recording is enabled and no sound is detected for 3 seconds, the audio data is uploaded and sent to the backend for transcription. In typical conversations, the average gap between turns is around 2 seconds, and the pause between sentences when speaking is about the same. I've therefore chosen a duration long enough to indicate that the speaker is waiting for a reply. You can adjust this duration by editing the MAX_COUNT variable [1].
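One way to picture the silence timeout is a counter that is reset by any detected sound and fires once enough consecutive silent checks have passed. The sketch below is an assumption about the general shape, not the app's code: `createSilenceWatcher` is a hypothetical name, and the value 30 assumes one check every 100 ms.

```javascript
// Hypothetical sketch: count consecutive silent checks and fire a callback
// once the count reaches MAX_COUNT. The check interval is an assumption.
const MAX_COUNT = 30 // e.g. 30 checks x 100 ms = 3 seconds (assumed values)

function createSilenceWatcher(onSilenceTimeout, maxCount = MAX_COUNT) {
  let silentTicks = 0
  let heardSound = false
  return function tick(soundDetected) {
    if (soundDetected) {
      heardSound = true
      silentTicks = 0 // any sound resets the countdown
      return
    }
    if (!heardSound) return // nothing recorded yet, nothing to send
    silentTicks += 1
    if (silentTicks >= maxCount) {
      heardSound = false
      silentTicks = 0
      onSilenceTimeout() // e.g. stop recording and upload the audio
    }
  }
}
```

The watcher ignores silence until some sound has been heard, so an idle microphone never triggers an upload.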
The system can continuously record audio data until a reply is received. Once the audio reply from the chatbot is received and played, audio recording is disabled to prevent inadvertent recording of the chatbot’s own response.
A text input is also provided if you want to write your messages.
Whisper
All recorded audio data is uploaded to the public/upload directory in WebM file format. Before submitting the audio file to the Whisper API, all silent segments must be removed. This step helps prevent the well-known issue of hallucinations generated by Whisper. For the same reason, it is recommended to set the MIN_DECIBELS value as high as possible, ensuring that only speech is recorded.
Remove Silent Parts From Audio
To remove silent parts from the audio, we will be using ffmpeg. Be sure to install it.
ffmpeg -i sourceFile -af silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-50dB outputFile
In this command:
- -i sourceFile specifies the input file.
- -af silenceremove applies the silenceremove filter.
- stop_periods=-1 removes all periods of silence.
- stop_duration=1 treats any period of silence longer than 1 second as silence.
- stop_threshold=-50dB treats any noise level below -50dB as silence.
- outputFile is the output file.
To invoke this shell command in our API route, we will be using exec from the child_process module.
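A minimal sketch of such an invocation follows. Note that it swaps in execFile for exec so that file names are passed as separate arguments and need no shell quoting; that is a choice made for this example, not necessarily what the app does. The helper names `buildSilenceRemoveArgs` and `removeSilence` are hypothetical.

```javascript
const { execFile } = require('node:child_process')
const { promisify } = require('node:util')

const execFileAsync = promisify(execFile)

// Build the argument list for the ffmpeg command shown earlier.
function buildSilenceRemoveArgs(sourceFile, outputFile) {
  return [
    '-i', sourceFile,
    '-af', 'silenceremove=stop_periods=-1:stop_duration=1:stop_threshold=-50dB',
    outputFile,
  ]
}

// Run ffmpeg and resolve with the output path; rejects if ffmpeg fails.
async function removeSilence(sourceFile, outputFile) {
  await execFileAsync('ffmpeg', buildSilenceRemoveArgs(sourceFile, outputFile))
  return outputFile
}
```

A caller would typically wrap `await removeSilence(input, output)` in try/catch, since the promise rejects when ffmpeg is missing or exits with an error.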
After removing the silent parts, the file size should be checked. The final file size is usually much smaller than the original. During this check, files smaller than 16 KB are ignored, assuming that a file of this size, with a 16-bit depth, equates to roughly one second of audio. Anything shorter is likely inaudible.
// simple file size formula
const audio_file_size = duration_in_seconds * bitrate
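The size check itself can be sketched as below, assuming the processed file is on disk. `isAudioViable` and `MIN_FILE_SIZE` are hypothetical names; only the 16 KB cutoff comes from the description above.

```javascript
const fs = require('node:fs')

const MIN_FILE_SIZE = 16 * 1024 // 16 KB, the cutoff described above

// Returns true when the file is large enough to be worth transcribing.
function isAudioViable(filename, minSize = MIN_FILE_SIZE) {
  const { size } = fs.statSync(filename) // size in bytes
  return size >= minSize
}
```

When the check fails, the route can return early and skip the Whisper request entirely.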
The entire process, from eliminating silent parts to checking file size, is designed to ensure that only viable audio data is sent to the Whisper API. Our aim is to avoid hallucination and the unnecessary transmission of data.
Having confirmed the viability of our audio data, we then proceed to call the Whisper API.
const transcription = await openai.audio.transcriptions.create({
    file: fs.createReadStream(filename),
    model: 'whisper-1', // required by the API
    language: lang,
    response_format: 'text',
    temperature: 0,
})
where lang is the ISO 639-1 code of the language specified for our chatbot. Please check the list of currently supported languages.
We opt for a simple text format since timestamps are not required, and we set the temperature parameter to zero to achieve deterministic output.
Chat Completions API
After receiving the transcript from Whisper, we submit it to the Chat Completions API with function calling. We use the OpenAI Node.js module (version 4) released on 2023/11/07, which includes an updated function-calling format. This version enables the invocation of multiple functions in a single request.
let messages = [{ role: 'system', content: system_prompt }]
let all_history = await mongoDb.getMessages()
if(all_history.length > 0) {
    const history_context = trim_array(all_history, 20)
    messages = messages.concat(history_context)
}
messages.push({ role: 'user', content: user_query })
let response = await openai.chat.completions.create({
    temperature: 0.3,
    messages,
    tools: [
        { type: 'function', function: add_calendar_entry },
        { type: 'function', function: get_calendar_entry },
        { type: 'function', function: edit_calendar_entry },
        { type: 'function', function: delete_calendar_entry },
        { type: 'function', function: save_new_memory },
        { type: 'function', function: get_info_from_memory }
    ]
})
Let’s dissect the various components of this call in the subsequent sections.
The System Prompt
The system prompt plays a crucial role in giving life to our chatbot. It is here where we establish its name and persona, based on the chatbot chosen by the user. We provide it with specific instructions on how to respond, along with a list of functions it can execute. We also inform it about the user's identity and some personal details. Finally, we set the current date and time, which is essential for activating calendar functions.
const today = new Date()
let system_prompt = `In this session, we will simulate a voice conversation between two friends.\n\n` +
`# Persona\n` +
`You will act as ${selPerson.name}.\n` +
`${selPerson.prompt}\n\n` +
`Please ensure that your responses are consistent with this persona.\n\n` +
`# Instructions\n` +
`The user is talking to you over voice on their phone, and your response will be read out loud with realistic text-to-speech (TTS) technology.\n` +
`Use natural, conversational language that is clear and easy to follow (short sentences, simple words).\n` +
`Keep the conversation flowing.\n` +
`Sometimes the user might just want to chat. Ask them relevant follow-up questions.\n\n` +
`# Functions\n` +
`You have the following functions that you can call depending on the situation.\n` +
`add_calendar_entry, to add a new event.\n` +
`get_calendar_entry, to get the event at a particular date.\n` +
`edit_calendar_entry, to edit or update existing event.\n` +
`delete_calendar_entry, to delete an existing event.\n` +
`save_new_memory, to save new information to memory.\n` +
`get_info_from_memory, to retrieve information from memory.\n\n` +
`When you present the result from the function, only mention the relevant details for the user query.\n` +
`Omit information that is redundant and not relevant to the query.\n` +
`Always be concise and brief in your replies.\n\n` +
`# User\n` +
`As for me, in this simulation I am known as ${user_info.name}.\n` +
`${user_info.prompt}\n\n` +
`# Today\n` +
`Today is ${today}.\n`
Context
All messages will be stored in MongoDB.
To manage tokens and avoid surpassing the model's maximum limit, we will only send the last 20 interactions. The trim_array function trims the message history if it surpasses 20 turns. This threshold can be adjusted to meet your specific requirements.
// allow less than or equal to 15 turns
const history_context = trim_array(all_history, 15)
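The implementation of trim_array is not shown here. A minimal version, assuming it simply keeps the most recent items, might look like this:

```javascript
// Hypothetical implementation: keep only the most recent maxLength items.
function trim_array(arr, maxLength = 20) {
  if (arr.length <= maxLength) return arr
  return arr.slice(arr.length - maxLength)
}
```

Because the oldest messages are dropped first, the system prompt should be added separately (as in the call shown earlier) rather than stored in the history.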
From the main screen, you have the option to erase the previous history for each chatbot.
Function Calling
Please note that we have separated the handling of function calling (function_call.js) from the main Chat Completions API call (transcribe.js). This separation handles cases where the response contains text content alongside a function call; content is normally null when a function is invoked. It lets the app display the text while the function call is being processed. I have also enclosed the process in a loop in case a second API call results in another function call. This could probably be handled more elegantly with streaming, but I have yet to learn how to use streaming with Nuxt.
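The loop can be sketched roughly as follows. This is a simplified illustration, not the contents of function_call.js: `runWithTools`, `request`, `handlers`, and `maxRounds` are hypothetical names, `openai` is assumed to be any client exposing chat.completions.create, and `request` is assumed to carry the model, temperature, and tools.

```javascript
// Hypothetical sketch of the function-calling loop.
// handlers maps each function name to its implementation.
async function runWithTools(openai, request, messages, handlers, maxRounds = 5) {
  for (let round = 0; round < maxRounds; round++) {
    const response = await openai.chat.completions.create({ ...request, messages })
    const message = response.choices[0].message
    messages.push(message)

    // No function call requested: this is the final reply.
    if (!message.tool_calls || message.tool_calls.length === 0) {
      return message.content
    }

    // Run every requested function and append its result as a tool message.
    for (const call of message.tool_calls) {
      const args = JSON.parse(call.function.arguments)
      const result = await handlers[call.function.name](args)
      messages.push({
        tool_call_id: call.id,
        role: 'tool',
        name: call.function.name,
        content: JSON.stringify(result),
      })
    }
  }
  throw new Error('Too many consecutive function calls')
}
```

The round limit is a safeguard added for this sketch so that a model that keeps requesting functions cannot loop forever.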
Our functions are categorized under two new features: Schedule Management and Long-Term Memory.
Let's first examine how we manage function calling with the new format. We must utilize the new tools parameter instead of the now-deprecated functions parameter to enable the invocation of multiple function calls.
let response = await openai.chat.completions.create({
    temperature: 0.3,
    messages,
    tools: [
        { type: 'function', function: add_calendar_entry },
        { type: 'function', function: get_calendar_entry },
        { type: 'function', function: edit_calendar_entry },
        { type: 'function', function: delete_calendar_entry },
        { type: 'function', function: save_new_memory },
        { type: 'function', function: get_info_from_memory }
    ]
})
Check the JSON Schema of each function in the lib/ directory.
Here is the schema for add_calendar_entry:
{
    "name": "add_calendar_entry",
    "description": "Adds a new entry to the calendar",
    "parameters": {
        "type": "object",
        "properties": {
            "event": {
                "type": "string",
                "description": "The name or title of the event"
            },
            "date": {
                "type": "string",
                "description": "The date of the event in 'YYYY-MM-DD' format"
            },
            "time": {
                "type": "string",
                "description": "The time of the event in 'HH:MM' format"
            },
            "additional_detail": {
                "type": "string",
                "description": "Any additional details or notes related to the event"
            }
        },
        "required": [ "event", "date", "time", "additional_detail" ]
    }
}
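Since the model returns function arguments as a JSON string, it can be useful to parse them and compare them with the schema's required list before acting on them. The helper below is a hypothetical sketch of that check and is not taken from the app.

```javascript
// Hypothetical helper: parse the arguments string of a tool call and list
// any fields the schema marks as required but the model left out.
function parseToolArguments(schema, argumentsJson) {
  const args = JSON.parse(argumentsJson)
  const required = schema.parameters.required || []
  const missing = required.filter((key) => !(key in args))
  return { args, missing }
}
```

A non-empty `missing` list could be reported back to the model as the function result so that it asks the user for the missing details.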
To discuss how all of this works, let's move on to the next section.
Schedule Management
For Schedule Management, we have the following functions:
- add_calendar_entry, adds new calendar entry
- edit_calendar_entry, edits calendar entry by event name
- get_calendar_entry, retrieves calendar entries by date
- delete_calendar_entry, deletes calendar entry either by event name or by date [2].
All calendar entries will be stored in MongoDB. Please note that all entries will be accessible to all chatbots.
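To make the behavior of the four functions concrete, here is a small in-memory stand-in. It deliberately swaps MongoDB for a plain array, and `createCalendar` and its method names are hypothetical; only the described behavior (edit by event name, get by date, delete by event name or by date when that date has a single entry) follows the list above.

```javascript
// Hypothetical in-memory stand-in for the MongoDB-backed calendar.
function createCalendar() {
  let entries = []
  return {
    add(entry) {
      entries.push({ ...entry })
      return { message: 'Entry added' }
    },
    getByDate(date) {
      const items = entries.filter((e) => e.date === date)
      return { message: `Found ${items.length} entries`, items }
    },
    editByEvent(event, changes) {
      const entry = entries.find((e) => e.event === event)
      if (!entry) return { error: 'Entry not found' }
      Object.assign(entry, changes)
      return { message: 'Entry updated' }
    },
    // Delete by event name, or by date when exactly one entry falls on it.
    remove({ event, date }) {
      const matches = entries.filter((e) =>
        event ? e.event === event : e.date === date)
      if (matches.length === 0) return { error: 'Entry not found' }
      if (!event && matches.length > 1) {
        return { error: 'More than one entry on that date; specify the event name' }
      }
      entries = entries.filter((e) => !matches.includes(e))
      return { message: 'Entry deleted' }
    },
  }
}
```

Returning plain objects with a message or error field keeps the results easy to serialize with JSON.stringify when they are sent back to the model.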
Sample Conversation - Schedule Management
Let’s take a look at a sample chat conversation to demonstrate how these elements interact:
user: good morning, jeeves. what's my schedule for today?
Function calling (invoke get_calendar_entry):
{ role: 'assistant',
content: null,
tool_calls:[{
id: 'call_4cCtmNlgYN5o4foVOQM9MIdA',
type: 'function',
function: { name: 'get_calendar_entry', arguments: '{"date":"2023-11-10"}' }
}]
}
Function response:
[
{
tool_call_id: 'call_4cCtmNlgYN5o4foVOQM9MIdA',
role: 'tool',
name: 'get_calendar_entry',
content: '{\n' +
' "message": "Found 3 entries",\n' +
' "items": [\n' +
' {\n' +
' "_id": "654b2805e51fcd815dea8e2d",\n' +
' "event": "Project presentation",\n' +
' "date": "2023-11-10",\n' +
' "time": "10:00",\n' +
' "additional_detail": "Important project presentation",\n' +
' "__v": 0\n' +
'
Footnotes
1. You can find these variables in the pages/talk/[id].vue file.
2. When deleting by date, there must be only one entry on that date; otherwise an error is returned.