IBM Speech to Text Solutions

IBM Speech to Text

IBM Speech to Text is a cloud solution that uses artificial intelligence (AI) and machine learning (ML) to convert speech to text.

With IBM Watson speech to text, you can transcribe speech in real-time as audio is playing, or, using batch mode, you can upload audio files to the system and wait for them to be transcribed.

Features of IBM Watson Speech to Text

IBM voice to text is a robust tool with a wide range of features.

Watson Assistant for Voice Interaction

The Watson Assistant for voice interaction is the newest feature in IBM speech to text. It allows organizations to interact with their customers quickly, accurately, and consistently across a wide range of applications, devices, and channels. Artificial intelligence (AI) is used to learn from customer interactions, so the tool learns over time. This increases its problem-solving capabilities, reduces customer wait times, and increases overall customer satisfaction. The feature integrates with a wide range of customer service SaaS platforms. According to the Forrester Total Economic Impact report, this feature saw organizations “experience benefits of $23.9 million over three years versus costs of $5.5 million, adding up to a net present value (NPV) of $18.4 million and a return on investment (ROI) of 337%.” 

This feature has a free tier that allows you to send up to 10,000 messages per month. Premium plans start from $120 per month.

IBM Speech to Text – Automatic Speech Recognition (ASR)

Automatic speech recognition refers to the process of transcribing audio as it plays back or in real-time as someone is speaking. IBM speech recognition uses powerful deep learning and neural networks to convert speech to text. 

To begin speech recognition in IBM voice to text service, you only need to provide the audio that you want to be transcribed. There are three interfaces – the WebSocket interface, the synchronous HTTP interface, and the asynchronous HTTP interface – and they all come with the same basic transcription features. 
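As a rough illustration of how the three interfaces differ, the sketch below builds a request URL for each one. The paths `/v1/recognize` and `/v1/recognitions` reflect our understanding of IBM's public API and should be checked against the current documentation; the host name and the `build_url` helper are hypothetical placeholders.

```python
from urllib.parse import urlencode

# Assumed scheme and path for each interface; verify against IBM's API reference.
ENDPOINTS = {
    "websocket": ("wss", "/v1/recognize"),      # streaming, full duplex
    "sync_http": ("https", "/v1/recognize"),    # one request, one response
    "async_http": ("https", "/v1/recognitions"),  # create a job, fetch results later
}


def build_url(interface, host, **params):
    """Return the request URL for one interface (hypothetical helper)."""
    scheme, path = ENDPOINTS[interface]
    query = "?" + urlencode(params) if params else ""
    return f"{scheme}://{host}{path}{query}"


if __name__ == "__main__":
    # "stt.example.invalid" stands in for your own service instance URL.
    for name in ENDPOINTS:
        print(name, build_url(name, "stt.example.invalid", model="en-US_BroadbandModel"))
```

Whichever interface you pick, the audio and the recognition parameters stay the same; only the transport changes.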

IBM Speech to Text – Several Audio Transmission Choices

You can stream audio in real time directly from an application or upload recorded audio. Many file compression formats are supported, and the documentation lists each format along with its supported compression. Compression reduces the audio file size and maximizes the amount of data a user can pass to the service. A maximum of 100 MB can be sent to IBM speech to text via a single synchronous HTTP or WebSocket request. The audio must be in a supported format. IBM voice recognition supports ten audio formats, and, in most cases, the format is automatically detected.
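A simple pre-flight check can catch oversized files before they are sent. This is a minimal sketch: it assumes the 100 MB limit means 100 × 1024 × 1024 bytes, which should be confirmed against IBM's documentation, and the function names are our own.

```python
import os

# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_REQUEST_BYTES = 100 * 1024 * 1024


def fits_in_single_request(num_bytes, limit=MAX_REQUEST_BYTES):
    """True if a payload of num_bytes fits in one synchronous or WebSocket request."""
    return 0 <= num_bytes <= limit


def file_fits(path, limit=MAX_REQUEST_BYTES):
    """Check an audio file on disk against the request size limit."""
    return fits_in_single_request(os.path.getsize(path), limit)
```

Files that fail the check could be compressed into a supported format or split into shorter segments before upload.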

IBM Speech to Text – Real-time Audio Diagnostics

Advanced audio metrics provide detailed information on the audio signal characteristics. These metrics are available at the end of the transcription and can provide actionable insights to technical users.

This feature also provides the user with real-time feedback on the quality of the input audio. When there is a problem with the input, the tool provides feedback, such as letting you know there is too much background noise. It also offers solutions when problems are identified, such as asking the user to move closer to the mic.

Interim Transcription Before Final Results

IBM Watson speech to text is one of the few services that offer an interim result before the final transcription is complete. These interim results are likely to change before the final output is generated. They are useful for long audio files that can take time to transcribe, real-time transcription, and interactive applications. With interim results, a user can quickly gauge the quality of the audio file and decide whether to proceed with the batch job or terminate it.
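To show how an application might tell interim results apart from final ones, here is a small sketch that parses one JSON message from the service. The field names (`results`, `final`, `alternatives`, `transcript`) follow our understanding of the response format and are assumptions to verify; the `split_results` helper is hypothetical.

```python
import json


def split_results(message):
    """Return (interim, final) transcript lists from one JSON message."""
    payload = json.loads(message)
    interim, final = [], []
    for result in payload.get("results", []):
        alternatives = result.get("alternatives", [])
        if not alternatives:
            continue
        # The first alternative is treated as the best hypothesis.
        text = alternatives[0].get("transcript", "").strip()
        (final if result.get("final") else interim).append(text)
    return interim, final


if __name__ == "__main__":
    sample = '{"results": [{"final": false, "alternatives": [{"transcript": "hello wor "}]}]}'
    print(split_results(sample))
```

An interactive application would typically display interim text right away and overwrite it once the matching final result arrives.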

Language Model Selection

You can choose from a wide range of models across several languages that support telephone speech and Voice over Internet Protocol (VoIP) frequencies. Broadband and narrowband models are supported for a large number of languages. Broadband models are used where the audio is sampled at 16 kHz or higher, while narrowband models are used where the audio is sampled at 8 kHz. Broadband models typically apply in the case of live speech or real-time applications, while narrowband models are better suited to telephone speech.
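The broadband/narrowband rule can be sketched as a small helper that picks a model from the sampling rate of the audio. The model name pattern (for example `en-US_BroadbandModel`) follows IBM's naming as we understand it and should be checked against the list of models available to your instance; both function names are our own.

```python
import wave


def pick_model(sample_rate_hz, language="en-US"):
    """Choose a model name from the sampling rate (naming pattern is an assumption)."""
    if sample_rate_hz >= 16000:
        return f"{language}_BroadbandModel"
    if sample_rate_hz >= 8000:
        return f"{language}_NarrowbandModel"
    raise ValueError(f"sampling rate {sample_rate_hz} Hz is below 8 kHz")


def pick_model_for_wav(path, language="en-US"):
    """Read the sampling rate from a WAV file header and pick a model."""
    with wave.open(path, "rb") as wav_file:
        return pick_model(wav_file.getframerate(), language)
```

For example, a 44.1 kHz studio recording maps to the broadband model, while an 8 kHz call recording maps to the narrowband one.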

Language Model Training

IBM speech recognition was developed with a broad audience in mind. The base vocabulary has thousands of words used in normal daily conversation, and the technology accurately recognizes many words. However, esoteric terms that are specific to certain domains are not included. To improve accuracy for fields such as law, medicine, and technology, users make use of language model customization. This feature allows users to expand and customize the vocabulary for a specific domain in a matter of minutes.

Acoustic Model Training

Just like the base vocabulary, IBM Watson speech to text was designed with base acoustic models that function well for several audio characteristics. However, you can also customize your acoustic model to improve speech recognition in many cases – such as when you have background noise, poor mic quality, atypical speech patterns, and pronounced accents. 

Grammar Training

In speech recognition technology, a speech recognition grammar tells the system what to listen for when a human speaks. A grammar defines:

  • Words a human may say
  • Patterns in which those words may be spoken
  • The spoken language of each word

Grammar can be added to a custom language model and then used to improve speech recognition accuracy. This feature restricts the set of phrases that can be recognized from an audio file, increasing the accuracy and speed of the transcript.

Speaker Diarization

This feature of IBM speech to text enables the recognition of multiple voices. It is optimized for two-way call center conversations but can recognize up to 6 speakers in an audio file. The transcript output is labeled to identify each speaker. This feature is ideal for meeting transcripts and call center records.
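The labeled output can be turned into a readable transcript by matching each word's start time to a speaker label. The sketch below assumes word timestamps arrive as `[word, start, end]` triples and speaker labels as objects with `from` and `speaker` fields; those shapes are our understanding of the response and should be verified. The `label_words` helper is hypothetical.

```python
def label_words(timestamps, speaker_labels):
    """Group consecutive words by speaker, returning (speaker, text) turns."""
    # Map each word's start time to the speaker assigned to that time.
    speaker_by_start = {label["from"]: label["speaker"] for label in speaker_labels}
    turns = []
    for word, start, _end in timestamps:
        speaker = speaker_by_start.get(start)  # None if no label matches
        if turns and turns[-1][0] == speaker:
            turns[-1][1].append(word)
        else:
            turns.append((speaker, [word]))
    return [(speaker, " ".join(words)) for speaker, words in turns]


if __name__ == "__main__":
    words = [["hello", 0.0, 0.4], ["hi", 0.5, 0.8], ["there", 0.8, 1.1]]
    labels = [
        {"from": 0.0, "to": 0.4, "speaker": 0},
        {"from": 0.5, "to": 0.8, "speaker": 1},
        {"from": 0.8, "to": 1.1, "speaker": 1},
    ]
    for speaker, text in label_words(words, labels):
        print(f"Speaker {speaker}: {text}")
```

Matching on the exact start time keeps the sketch short; production code might match on overlapping time ranges instead.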

Numeric Redaction

Sensitive numeric data such as credit card numbers and telephone numbers is protected through numeric redaction. This is not a default setting. The user has to enable it by setting the redaction parameter to “true,” and the redaction is applied to the final transcript before returning results to the user.
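The effect of redaction can be approximated locally with a regular expression. This is only an illustration, not IBM's implementation: it assumes that runs of three or more consecutive digits are masked with the letter X, which matches our reading of the service's behavior but should be verified.

```python
import re


def redact_numbers(text, min_digits=3):
    """Mask every run of at least min_digits digits with X characters."""
    pattern = re.compile(r"\d{%d,}" % min_digits)
    return pattern.sub(lambda match: "X" * len(match.group()), text)


if __name__ == "__main__":
    print(redact_numbers("my card is 4111 1111 1111 1111 and I am 42"))
    # prints: my card is XXXX XXXX XXXX XXXX and I am XX... no: see note below
```

Note that short numbers such as "42" are left untouched under the three-digit assumption, so the sample above would keep "42" in the output.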

Smart Formatting

With IBM Watson speech to text, you can convert text into conventional forms in your final transcript and make it more readable. Examples where this would be applicable include email addresses, telephone numbers, dates, currencies, and more. This feature is also not enabled by default and must be activated by the user. 

Word Spotting and Filtering

This feature is currently available in US English. When enabled, the system will spot unwanted words and filter them out. This is a great tool to filter out profanity, offensive slurs, and other undesired words. A maximum of 1,000 words can be spotted in a single request with 1,024 characters being the maximum length of one keyword.
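The two limits mentioned above are easy to check before a request is sent. The sketch below is a hypothetical client-side validator; only the limits themselves (1,000 keywords, 1,024 characters each) come from the text.

```python
MAX_KEYWORDS = 1000
MAX_KEYWORD_CHARS = 1024


def validate_keywords(keywords):
    """Return a list of human-readable problems; an empty list means the input is fine."""
    problems = []
    if len(keywords) > MAX_KEYWORDS:
        problems.append(f"{len(keywords)} keywords exceeds the limit of {MAX_KEYWORDS}")
    for keyword in keywords:
        if len(keyword) > MAX_KEYWORD_CHARS:
            problems.append(
                f"keyword starting {keyword[:20]!r} is longer than {MAX_KEYWORD_CHARS} characters"
            )
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a single pass.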

IBM Speech to Text – Pricing

IBM Speech to text comes with a free tier that allows a user to convert up to 500 minutes of audio monthly. Once this is exhausted, users pay on a per-minute basis. The fee charged per minute reduces with increased usage.

IBM Watson Text to Speech

In addition to speech to text, IBM also offers a text to speech service. IBM text to speech scans text and generates human-like audio. 

Features of IBM Watson Text to Speech

The tool comes with a wide range of features as indicated below.

Neural Voice Technology

IBM Text to Speech makes use of concatenative synthesis and deep neural networks that are trained on human speech to produce the most natural-sounding voice. 

Custom Voices

Using as little as an hour of recorded audio, you can create your own custom voice and use it to read text out loud to you.

Speech Synthesis Markup Language

You can control various elements of the text to speech process, such as speed, volume, pitch, and pronunciation, using the Speech Synthesis Markup Language (SSML).
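As a minimal sketch of what SSML input looks like, the helper below wraps text in the standard `<speak>` and `<prosody>` elements with `rate` and `pitch` attributes. These elements come from the W3C SSML specification; which attributes and values a given IBM voice honors is an assumption to check in the service documentation. The `ssml_prosody` helper is our own.

```python
from xml.sax.saxutils import escape, quoteattr


def ssml_prosody(text, rate="medium", pitch="medium"):
    """Wrap text in <speak><prosody ...> so speed and pitch can be adjusted."""
    return (
        f"<speak><prosody rate={quoteattr(rate)} pitch={quoteattr(pitch)}>"
        f"{escape(text)}"  # escape &, <, > so the markup stays well-formed
        "</prosody></speak>"
    )


if __name__ == "__main__":
    print(ssml_prosody("Thanks for calling. How can I help?", rate="slow"))
```

Escaping the text matters because characters such as "&" would otherwise produce invalid markup.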

Customize Word Pronunciations

Regular pronunciation works well for common everyday words but can be problematic for words specific to certain industries. Also, the default pronunciation may not work well for foreign words, personal names, names of places, and abbreviations. To overcome this, the system comes with a customization interface where you specify how the system will pronounce certain words. 


Expressiveness

In linguistics, expressiveness is the quality of conveying a feeling. In IBM Text to Speech, you can apply the expressiveness element to get the system to output audio in three different styles:

  • A positive or upbeat style
  • A regretful speaking style, for example, where an apology is being communicated in the text
  • An uncertain or interrogative style

Voice Transformation

Finally, the system allows you to control various aspects of the output audio. For example, you can give the audio a more youthful sound, make it softer, increase the pitch, and perform many other transformations.

IBM Text to Speech – Pricing

The service has three pricing plans as follows:

  • Lite: This is a free tier that offers 10,000 characters per month
  • Standard: Pricing for this plan starts at USD 0.02/thousand characters
  • Premium: Pricing is the same as the standard plan in addition to USD 5,000 per instance. This plan comes with a wide range of premium features such as high availability, custom voice, private storage of training and usage data, and much more.
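The plan figures above can be turned into a rough cost estimate. This sketch assumes the Standard rate stays flat at the quoted starting price of USD 0.02 per thousand characters and that the Premium fee of USD 5,000 applies once per instance in the billing period; neither assumption is confirmed by the text, and actual pricing may differ.

```python
from decimal import Decimal

STANDARD_RATE_PER_1000 = Decimal("0.02")  # quoted starting price, assumed flat
PREMIUM_INSTANCE_FEE = Decimal("5000")    # assumed to apply once per instance


def standard_cost(characters):
    """Estimated Standard plan cost in USD for a number of characters."""
    return Decimal(characters) / 1000 * STANDARD_RATE_PER_1000


def premium_cost(characters, instances=1):
    """Estimated Premium plan cost: Standard usage plus the per-instance fee."""
    return standard_cost(characters) + PREMIUM_INSTANCE_FEE * instances


if __name__ == "__main__":
    print(standard_cost(250_000))
    print(premium_cost(250_000))
```

Under these assumptions, 250,000 characters would cost USD 5 on Standard, so the Premium plan mainly makes sense for its additional features rather than for volume savings.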