
How To Deploy Your Own Optimized Speech-To-Text Web App with Python


A detailed guide on how you can construct your own speech-to-text web application to optimally convert the audio into text format.

Speech is one of the most effective forms of communication utilized by humans. Audiobooks and podcasts have been growing in popularity over the past few years, especially with the fast-paced lifestyle of the modern era. However, reading has always been a crucial aspect of our lives and remains the preferred method of extracting information for many. The fact that you are reading this article is a case in point.

Hence, the need to convert plain audio data into a readable text format for reaching a wide range of audiences is critical. Whether it is a subtitle in a movie or conversion of audiobooks to written material, the extraction of text data from the available speech is a significant application. With the right deep learning techniques, it is possible to obtain a highly optimized speech-to-text conversion.

In this article, we will understand how to construct one of the most optimized approaches to converting your audio data into text. We will utilize the AssemblyAI platform that will enable us to transcribe and understand audio in the most effective way. Let us now proceed to get started with the construction of this project!

Construction of the Speech-To-Text Web Application:

In this section, we will construct the speech-to-text web application for transcribing a video. We will make use of AssemblyAI's easy-to-use API to transcribe audio files with high efficiency, along with the interactive streamlit web framework for deploying our project. Firstly, let us acquire all the necessary components to run this project successfully.

Installing all the necessary dependencies:

The two primary requirements that users will need to install are the streamlit web development framework for deploying the project and the youtube-dl library, which will help us download videos from YouTube and other similar video sites. Both of these libraries can be installed with a simple pip command, as shown below.

pip install streamlit
pip install youtube_dl

These commands can be typed as shown above in a command prompt. As for the last dependency, we will require an FFmpeg installation. FFmpeg is a free and open-source software project that enables us to handle audio files, video content, streams, and other multimedia files in an effective manner. Hence, download the requirements from the following link and add them directly to your path through the environment variables on the Windows platform.

You can either have a single FFmpeg installation with the path to its bin folder or download all the requirements for 64-bit Windows from the provided link. If you want to follow along with what I did, you can download ffmpeg, ffprobe, and ffplay into a folder titled Paths and add it to your environment variables.
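Once the executables are on your path, it is worth verifying that Python can actually find them before running the project. This is a small sanity check of my own (not part of the original tutorial) using the standard library's shutil.which:

```python
import shutil

def check_tools(tools=("ffmpeg", "ffprobe", "ffplay")):
    """Return a dict mapping each tool name to its full path, or None if missing."""
    return {tool: shutil.which(tool) for tool in tools}

# print the location of each FFmpeg executable, or flag it as missing
for tool, location in check_tools().items():
    print(tool, "->", location if location else "not found on PATH")
```

If any tool prints "not found on PATH", revisit the environment-variable step above before continuing.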


Now that we have finished acquiring all the required library requirements, we can proceed to create a new Python file and import them accordingly. I am going to call this Python file final_app.py. This Python file will contain all the primary code contents for our project.

import streamlit as st
import youtube_dl
import requests
from config import API_Key

It is noticeable that we now have all the necessary imports, including the requests library, which allows us to send HTTP requests to any URL with ease. However, one might wonder what the config import is; we will figure that out in the next section.

Creating a configuration file:

Before we continue with the coding process, we will visit the AssemblyAI platform, where we can create an account for free and obtain an API key that we can utilize for easily understanding and transcribing audio. As soon as you sign in to your account, you can copy your API key on the right side of the screen.

In the next step, we will create another Python file that will store the API key generated from the AssemblyAI platform. Make sure to name this file config.py, and in it create a variable called API_Key that stores the API key we previously copied from the AssemblyAI platform. The code snippet is shown below.

API_Key = "Enter Your Key Here"
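Hard-coding the key in config.py works for local experiments, but committing a secret to version control is risky. As a sketch of a safer alternative (not from the original article), config.py could read the key from an environment variable instead; the variable name ASSEMBLYAI_API_KEY here is my own choice, not an official convention:

```python
import os

# read the key from the environment if available; the placeholder string
# is only a fallback so the app fails loudly with an invalid key
API_Key = os.environ.get("ASSEMBLYAI_API_KEY", "Enter Your Key Here")
```

With this variant, the rest of the project is unchanged, since it still imports API_Key from config.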

Adding all the essential parameters:

Once you have finished creating the configuration file, the rest of the project is coded in the final_app.py section. Firstly, we will add the essential parameters that are required for downloading a video as well as set the appropriate locations to the AssemblyAI website. Let us check out the code block below before we understand the explanation of these parameters.

ydl_opts = {
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',
    }],
    'ffmpeg_location': './',
    'outtmpl': "./%(id)s.%(ext)s",
}

In the above code block, we are just defining some of the essential options required for the YouTube downloader. We will download the best available audio stream of the video and save it to our local drive in mp3 format. In the next code snippet, we will define our parameters for the AssemblyAI platform.
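The outtmpl template above controls the name of the saved file: %(id)s and %(ext)s are replaced with the video's ID and the output extension. As a quick illustration of the naming scheme (using a made-up video ID), the same substitution can be reproduced with Python's %-formatting:

```python
# youtube-dl fills the output template with metadata fields;
# "dQw4w9WgXcQ" is just a placeholder video ID for illustration
outtmpl = "./%(id)s.%(ext)s"
filename = outtmpl % {"id": "dQw4w9WgXcQ", "ext": "mp3"}
print(filename)  # → ./dQw4w9WgXcQ.mp3
```

This is the same naming scheme the transcription function relies on later when it rebuilds the save location from the video's ID.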

transcript_endpoint = "https://api.assemblyai.com/v2/transcript"
upload_endpoint = 'https://api.assemblyai.com/v2/upload'
headers_auth_only = {'authorization': API_Key}
headers = {
    "authorization": API_Key,
    "content-type": "application/json"
}
CHUNK_SIZE = 5242880

We are describing the endpoint where we will upload the audio file and the endpoint where the transcription of this file will occur. Finally, we set up the headers carrying the API key that we obtained earlier. The CHUNK_SIZE variable will help us break bigger audio files into smaller chunks. Let us move to the next section, where we will transcribe the videos.
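To see where the value 5242880 comes from: it is exactly 5 MiB (5 * 1024 * 1024 bytes), which keeps each piece of the upload at a manageable size. Here is a minimal, self-contained demonstration of the same chunking idea, using an in-memory buffer in place of a real audio file:

```python
import io

CHUNK_SIZE = 5242880  # 5 * 1024 * 1024 bytes = 5 MiB

def read_chunks(fileobj, chunk_size=CHUNK_SIZE):
    """Yield fixed-size chunks from a binary file object until it is exhausted."""
    while True:
        data = fileobj.read(chunk_size)
        if not data:
            break
        yield data

# simulate a 12 MiB "audio file" in memory instead of on disk
fake_audio = io.BytesIO(b"\x00" * (12 * 1024 * 1024))
sizes = [len(chunk) for chunk in read_chunks(fake_audio)]
print(sizes)  # → [5242880, 5242880, 2097152]
```

Because the chunks are produced by a generator, the whole file never has to be held in memory at once, which is exactly why the project streams the upload this way.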

Using AssemblyAI to transcribe YouTube Videos:

In this section, we will create the function that will help us transcribe the entire audio data. Firstly, we will apply a cache decorator from streamlit that stores previous results and only re-runs the function if its arguments change. transcribe_from_link() is the primary function that will perform all the necessary actions. Make sure that all the other statements and functions in this section are defined under the transcribe_from_link() function.

@st.cache
def transcribe_from_link(link, categories: bool):

In the first function under transcribe_from_link(), we will make use of the youtube-dl import and download the audio content to our desired save location using the video ID. We will then create another function to read the contents of the downloaded file in chunks and upload it to the AssemblyAI platform. Below is the code snippet for these functionalities.

_id = link.strip()

def get_vid(_id):
    with youtube_dl.YoutubeDL(ydl_opts) as ydl:
        return ydl.extract_info(_id)

# download the audio of the YouTube video locally
meta = get_vid(_id)
save_location = meta['id'] + ".mp3"
print('Saved mp3 to', save_location)

def read_file(filename):
    with open(filename, 'rb') as _file:
        while True:
            data = _file.read(CHUNK_SIZE)
            if not data:
                break
            yield data

In the final code snippet of this section, we will use the requests library to upload the downloaded audio file to the AssemblyAI platform. Once the audio file is uploaded, we can start the transcription by sending another request. Finally, we construct the polling endpoint from the transcription's ID; this endpoint is where our final result will become available.

# upload audio file to AssemblyAI
upload_response = requests.post(
    upload_endpoint,
    headers=headers_auth_only, data=read_file(save_location)
)
audio_url = upload_response.json()['upload_url']
print('Uploaded to', audio_url)

# start the transcription of the audio file
transcript_request = {
    'audio_url': audio_url,
    'iab_categories': 'True' if categories else 'False',
}
transcript_response = requests.post(transcript_endpoint, json=transcript_request, headers=headers)

# this is the id of the file that is being transcribed on the AssemblyAI servers
# we will use this id to access the completed transcription
transcript_id = transcript_response.json()['id']
polling_endpoint = transcript_endpoint + "/" + transcript_id
print("Transcribing at", polling_endpoint)
return polling_endpoint

Note: Please refer to the complete code section for a better copy-paste version of the above code. All the remaining functions and statements in the above code block must be defined under the transcribe_from_link() function.
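Instead of clicking a button repeatedly, the polling endpoint can also be checked in a loop until the transcription finishes. Below is a generic polling sketch of my own (not part of the original app), where the network call is passed in as a function so the retry logic can be demonstrated without contacting the AssemblyAI servers:

```python
import time

def poll_until_done(fetch_status, interval=1.0, max_attempts=60):
    """Call fetch_status() until it returns 'completed' or 'error', or give up."""
    for _ in range(max_attempts):
        status = fetch_status()
        if status in ("completed", "error"):
            return status
        time.sleep(interval)
    return "timed out"

# a fake status source standing in for the real API call
responses = iter(["queued", "processing", "completed"])
result = poll_until_done(lambda: next(responses), interval=0.0)
print(result)  # → completed
```

In the real app, fetch_status would wrap requests.get(polling_endpoint, headers=headers).json()['status'], the same call the web interface makes when the user checks the status.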

Building the web application:

In this section, we will develop the web application using the streamlit framework. Firstly, we will define a few functions that we will utilize in the website layout. The web application might take some time to upload the audio file and receive the converted text data. To monitor the progress, we will make use of the get_status and refresh_state functions, which let us fetch the response and check whether the transcription is complete.

We can then proceed to add the title and a text box that allows the user to enter a link from which the audio of the YouTube video can be downloaded and transcribed accordingly. We will also add a button that allows users to check the status of the transcription process. The code for the web development is shown in the below code snippet.

if 'status' not in st.session_state:
    st.session_state['status'] = 'submitted'

def get_status(polling_endpoint):
    polling_response = requests.get(polling_endpoint, headers=headers)
    st.session_state['status'] = polling_response.json()['status']

def refresh_state():
    st.session_state['status'] = 'submitted'

st.title('Easily transcribe YouTube videos')
link = st.text_input('Enter your YouTube video link', 'https://youtu.be/dccdadl90vs', on_change=refresh_state)
st.video(link)
st.text("The transcription is " + st.session_state['status'])

polling_endpoint = transcribe_from_link(link, False)

st.button('check_status', on_click=get_status, args=(polling_endpoint,))

transcript = ''
if st.session_state['status'] == 'completed':
    polling_response = requests.get(polling_endpoint, headers=headers)
    transcript = polling_response.json()['text']

st.markdown(transcript)

Complete Code:


Finally, let us explore the entire code of this project and see what it looks like. The viewers can copy-paste the below code to execute the project and immediately experiment with their own variations.

import streamlit as st
import youtube_dl
import requests
from config import API_Key

ydl_opts = {
    'format': 'bestaudio/best',
    'postprocessors': [{
        'key': 'FFmpegExtractAudio',
        'preferredcodec': 'mp3',
        'preferredquality': '192',
    }],
    'ffmpeg_location': './',
    'outtmpl': "./%(id)s.%(ext)s",
}

transcript_endpoint = "https://api.assemblyai.com/v2/transcript"
upload_endpoint = 'https://api.assemblyai.com/v2/upload'
headers_auth_only = {'authorization': API_Key}
headers = {
    "authorization": API_Key,
    "content-type": "application/json"
}
CHUNK_SIZE = 5242880

@st.cache
def transcribe_from_link(link, categories: bool):
    _id = link.strip()

    def get_vid(_id):
        with youtube_dl.YoutubeDL(ydl_opts) as ydl:
            return ydl.extract_info(_id)

    # download the audio of the YouTube video locally
    meta = get_vid(_id)
    save_location = meta['id'] + ".mp3"
    print('Saved mp3 to', save_location)

    def read_file(filename):
        with open(filename, 'rb') as _file:
            while True:
                data = _file.read(CHUNK_SIZE)
                if not data:
                    break
                yield data

    # upload audio file to AssemblyAI
    upload_response = requests.post(
        upload_endpoint,
        headers=headers_auth_only, data=read_file(save_location)
    )
    audio_url = upload_response.json()['upload_url']
    print('Uploaded to', audio_url)

    # start the transcription of the audio file
    transcript_request = {
        'audio_url': audio_url,
        'iab_categories': 'True' if categories else 'False',
    }
    transcript_response = requests.post(transcript_endpoint, json=transcript_request, headers=headers)

    # this is the id of the file that is being transcribed on the AssemblyAI servers
    # we will use this id to access the completed transcription
    transcript_id = transcript_response.json()['id']
    polling_endpoint = transcript_endpoint + "/" + transcript_id
    print("Transcribing at", polling_endpoint)
    return polling_endpoint

if 'status' not in st.session_state:
    st.session_state['status'] = 'submitted'

def get_status(polling_endpoint):
    polling_response = requests.get(polling_endpoint, headers=headers)
    st.session_state['status'] = polling_response.json()['status']

def refresh_state():
    st.session_state['status'] = 'submitted'

st.title('Easily transcribe YouTube videos')
link = st.text_input('Enter your YouTube video link', 'https://youtu.be/dccdadl90vs', on_change=refresh_state)
st.video(link)
st.text("The transcription is " + st.session_state['status'])

polling_endpoint = transcribe_from_link(link, False)

st.button('check_status', on_click=get_status, args=(polling_endpoint,))

transcript = ''
if st.session_state['status'] == 'completed':
    polling_response = requests.get(polling_endpoint, headers=headers)
    transcript = polling_response.json()['text']

st.markdown(transcript)


Once you have the complete setup of the code, use the following command to run the program.

streamlit run final_app.py

If the viewers are looking for a video tutorial on this topic, I would highly recommend checking out the following link for a concise explanation of how to build this project in the form of a two-part series.

Conclusion:


With the consistent evolution of the field of natural language processing, deep learning models are becoming more efficient in handling tasks related to them. As discussed in this article, we are able to achieve highly advanced results on the task of speech-to-text conversion by making use of modern AI technologies. We will cover more such intriguing projects in future articles!

In this article, we explored in detail how you can easily build your own web application to successfully transcribe the audio data into the form of text with the help of streamlit and the AssemblyAI platform. If you have any queries related to the various points stated in this article, then feel free to let me know in the comments below. I will try to get back to you with a response as soon as possible.


Love to explore and learn new concepts. Extremely interested in AI, deep learning, robots, and the universe.