How to build your own Python Voice Assistant | thecodingpie

By thecodingpie

Published in ITNEXT · 10 min read · Nov 11, 2020

Let’s build it!

This post was originally published in my blog — https://thecodingpie.com

Are you interested in building your own virtual voice assistant like Jarvis in the movie Iron Man? If so, then you have come to the right place.

Howdy folks! In this tutorial, you will learn how to build your own personal voice assistant like Jarvis using Python.

You can download the finished project code from my Github repo — Final Version.

Now before getting started, let’s understand what we are going to build…

Understanding what we are going to build

The speech recognition program which we are going to build will be able to recognize these commands:

name — tells its name.
date — tells the date.
time — tells the current time.
how are you? — will say “I am fine…”.
search — will search using Google.
and finally, if we say “quit” or “exit”, it will terminate.

To achieve all of this functionality, we are going to use three main Python modules:

SpeechRecognition: to recognize our speech and convert it into text using Google’s Web Speech API.
PyAudio: for accessing and working with the microphone.
pyttsx3: for converting given text to speech (i.e. for generating the computer voice).

How are we going to build this?

It’s basically very simple. We only need to create three functions, and that’s it!

The first function, recognize_voice(), will be responsible for capturing our voice (which we input through the microphone), recognizing it, and returning the “text” version of it.
Then we will take that “text” version of our voice and give it to another function called reply(), which will be responsible for replying to us and doing all sorts of other crazy things (like searching Google, telling the current time, etc.).
Finally, a function called speak() will take whatever text we give it and convert it into speech.

We will repeat the above functions infinitely until the user says “quit” or “exit”.
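To make that flow concrete before touching any audio library, here is a minimal sketch of the loop. Everything in it is a stand-in: recognize_voice() pops canned phrases from a list instead of listening to the microphone, speak() records text instead of saying it, and reply() returns False to stop the loop (the real version later in this tutorial calls exit() instead). Treat it as an illustration of the control flow, not the final code.

```python
def recognize_voice(queued_phrases):
    """Stand-in for microphone capture: return the next canned phrase."""
    # When nothing is left to "hear", pretend the user said quit.
    return queued_phrases.pop(0).lower() if queued_phrases else "quit"


def speak(text, spoken):
    """Stand-in for text to speech: record what would have been said."""
    spoken.append(text)


def reply(text_version, spoken):
    """Answer one command; return False when the assistant should stop."""
    if "name" in text_version:
        speak("My name is JARVIS", spoken)
    if "quit" in text_version or "exit" in text_version:
        speak("Ok, I am going to take a nap...", spoken)
        return False
    return True


def run(queued_phrases):
    """Repeat listen -> reply until the user says quit or exit."""
    spoken = []
    keep_going = True
    while keep_going:
        keep_going = reply(recognize_voice(queued_phrases), spoken)
    return spoken


print(run(["What is your name", "Quit"]))
```

Swapping the stand-ins for the microphone and the speech engine is all that the rest of this tutorial does.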

Requirements

You should be comfortable with Python 3.
You should have Python 3.3 or a higher version installed on your computer.
You should have venv available. If you are using Python 3.3 or newer, venv is already included in the Python standard library and requires no additional installation.
You should have a microphone (your laptop’s built-in one or the one on your earphones will do the job).
You need an Internet connection.
Finally, you should have a modern code editor like Visual Studio Code.

With these things in place, let’s get started.

Initial Setups

First, create a folder named voice_assistant anywhere on your computer.
Then open it in Visual Studio Code.

Now let’s make a new virtual environment using venv and activate it. To do that:

Open Terminal > New Terminal.
Then type:

python3 -m venv venv

This command will create a virtual environment named venv for us.

To activate it, if you are on Windows, type the following:

venv\Scripts\activate.bat

If you are on Linux/Mac, then:

source venv/bin/activate

Now your terminal prompt should start with (venv). This means you have successfully activated your virtual environment.

Note: Virtual environments like venv help us to keep all the dependencies related to the current project in its own environment isolated from the main computer. That’s one of the main reasons why we are using it.
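If you are ever unsure whether the environment is really active, you can ask Python itself. The snippet below is a small sketch that relies only on the standard library: inside a venv, sys.prefix points at the environment folder, while sys.base_prefix still points at the main Python installation. The function name in_virtual_environment() is just a name I picked for this example.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # In a venv, sys.prefix is the environment folder and differs from base_prefix.
    return sys.prefix != sys.base_prefix


print("Inside a virtual environment:", in_virtual_environment())
print("Interpreter prefix:", sys.prefix)
```

Run it from the same terminal in which you activated venv. If it reports False, the activation step did not take effect in that terminal.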

Finally, create a new file named “main.py” directly inside the voice_assistant folder by clicking the New File icon in Visual Studio Code. You should now see the main.py file listed inside the voice_assistant folder.

That’s it, now let’s install those required modules.

Installing the requirements

For recognizing our voice and converting it into text, we need the SpeechRecognition module, so let’s install it. Type the following command in the terminal:

pip install SpeechRecognition

Since we are using the microphone as the input source, we also need to install the PyAudio package.

The process for installing PyAudio will vary depending on your operating system.

For Linux:

sudo apt-get install python-pyaudio python3-pyaudio
pip install pyaudio

If you are on Mac:

brew install portaudio
pip install pyaudio

If you are on Windows:

pip install pyaudio

If you get any errors installing PyAudio on Windows, then refer to this StackOverflow solution. If you are on a different operating system, then try to Google the error. If you still get those errors, then feel free to comment below.

Once you’ve got PyAudio installed, you can test the installation from the terminal by typing this:

python -m speech_recognition

Make sure your default microphone is on and unmuted. If the installation worked, the program will ask for a moment of silence and then prompt you to say something.

If you are using Ubuntu, then you may get some errors of the form “ALSA lib […] Unknown PCM”.

To suppress those errors, see this Stackoverflow answer.

Now to give the program the ability to talk, we have to install the pyttsx3 module:

pip install pyttsx3

pyttsx3 is a Text to Speech (TTS) library for Python 2 and 3. It works without an internet connection or delay. It also supports multiple TTS engines, including Sapi5, nsss, and espeak.

That’s it, we have installed and set up all the prerequisites. Now it’s time to write the program itself, so let’s do that.

recognize_voice()

First of all, let’s add all the necessary imports.

Type the following code inside the main.py file:

# all our imports
import speech_recognition as sr
from time import sleep
from datetime import datetime
import webbrowser
import pyttsx3

First, we are importing the speech_recognition module as sr.
Then we are importing the sleep() function from the time module. We will use this in a bit to make a fake delay.
Then for knowing the current date and time, we need that datetime module.
Then to open up a browser and do a google search, we need the help of the webbrowser module.
Then as I said earlier, to convert text to speech, we need pyttsx3.
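Two of these standard-library pieces are worth trying on their own before we wire them into the assistant. The sketch below formats a fixed sample date and time and builds a Google search URL. The helper names format_date(), format_time() and build_search_url() are mine, and quote_plus() from urllib.parse is an extra that the tutorial code does not use. It escapes spaces and special characters in the search keyword.

```python
from datetime import datetime
from urllib.parse import quote_plus


def format_date(moment):
    """Format a date like '9 November 2020' without a zero-padded day."""
    # moment.day avoids the %-d directive, which is not available on Windows.
    return f"{moment.day} {moment.strftime('%B %Y')}"


def format_time(moment):
    """Format a time as hours and minutes on a 24-hour clock, like '02 28'."""
    return moment.strftime("%H %M")


def build_search_url(keyword):
    """Build a Google search URL with the keyword safely escaped."""
    return "https://google.com/search?q=" + quote_plus(keyword)


sample = datetime(2020, 11, 9, 2, 28)
print(format_date(sample))
print(format_time(sample))
print(build_search_url("python voice assistant"))
```

Passing the resulting URL to webbrowser.open() opens it in your default browser, which is what the assistant will do for the “search” command.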

All of the magic in SpeechRecognition happens with the Recognizer class. So let’s instantiate it next:

# make an instance of Recognizer class
r = sr.Recognizer()

Now configure the pyttsx3:

# confs for pyttsx3
engine = pyttsx3.init()

pyttsx3 will be responsible for generating the computer voice. To see/hack the gender, age, speed, etc. of the generated computer voice, read this description.
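As a rough sketch of what that configuration can look like: pyttsx3 engines expose getProperty() and setProperty() with the property names "rate", "volume", "voice" and "voices". The helper configure_voice() below is my own wrapper around those calls, and RecordingEngine is only a stand-in so the example runs without a sound card. The default values I chose are arbitrary.

```python
class RecordingEngine:
    """Stand-in for a pyttsx3 engine so this sketch runs without audio hardware."""

    class Voice:
        def __init__(self, voice_id):
            self.id = voice_id

    def __init__(self):
        self.properties = {
            "rate": 200,
            "volume": 1.0,
            "voices": [self.Voice("voice-a"), self.Voice("voice-b")],
        }

    def getProperty(self, name):
        return self.properties[name]

    def setProperty(self, name, value):
        self.properties[name] = value


def configure_voice(engine, rate=150, volume=0.9, voice_index=0):
    """Set speaking rate, volume and voice on a pyttsx3-style engine."""
    engine.setProperty("rate", rate)        # speed in words per minute
    engine.setProperty("volume", volume)    # loudness between 0.0 and 1.0
    voices = engine.getProperty("voices")   # voices installed on this machine
    if voices:
        # Clamp the index so an out-of-range choice falls back to the last voice.
        chosen = voices[min(voice_index, len(voices) - 1)]
        engine.setProperty("voice", chosen.id)
    return engine


engine = configure_voice(RecordingEngine(), rate=170, voice_index=1)
print(engine.getProperty("rate"), engine.getProperty("voice"))
```

With the real library you would pass pyttsx3.init() instead of RecordingEngine(). Which voices are available depends on your operating system.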

Now let’s create that recognize_voice() function. This recognize_voice() function will do the following:

Listen to our microphone.
Recognize our voice with the help of the recognize_google() function.
Convert it into text format.
And then return that text version of our voice.

Create the recognize_voice() function like below:

""" fn to recognize our voice and return the text_version of it"""
def recognize_voice():
  text = ''
  # create an instance of the Microphone class
  with sr.Microphone() as source:
    # adjust for ambient noise
    r.adjust_for_ambient_noise(source)
    # capture the voice
    voice = r.listen(source)
    # let's recognize it
    try:
      text = r.recognize_google(voice)
    except sr.RequestError:
      # the API could not be reached (for example, no Internet connection)
      speak("Sorry, I can't access the Google API...")
    except sr.UnknownValueError:
      # the audio was captured but could not be understood
      speak("Sorry, unable to recognize your speech...")
  # an empty string is returned when recognition failed
  return text.lower()

If some error happens like if your Internet connection is bad, then it will just speak() the appropriate message.

Remember that the speak() function is not a builtin function. We have to create it and we will do it at the end because it is a small function.

And also remember that this speak() function will convert the given text to speech(the computer-generated voice).
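Because recognize_voice() hands back an empty string whenever recognition fails, a caller can simply try again. The helper below, listen_until_understood(), is a hypothetical addition and not part of the tutorial code. It takes the recognizing and speaking functions as arguments, so the dry run at the bottom can use plain stand-ins instead of a microphone.

```python
def listen_until_understood(recognize, speak, attempts=3):
    """Call recognize() up to `attempts` times and return the first non-empty text."""
    for attempt in range(attempts):
        text = recognize()
        if text:
            return text
        # Only ask again if another attempt is still coming.
        if attempt < attempts - 1:
            speak("Please say that again...")
    return ""


# Dry run with stand-ins: the first two attempts fail, the third succeeds.
canned = iter(["", "", "what is the time"])
prompts = []
result = listen_until_understood(lambda: next(canned), prompts.append)
print(result)
print(prompts)
```

In the assistant you could call listen_until_understood(recognize_voice, speak) wherever a single recognize_voice() call feels too unforgiving.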

Now at the very bottom of the file, type the following:

# wait a second for adjust_for_ambient_noise() to do its thing
sleep(1)

while True:
  speak("Start speaking...")
  # listen for voice and convert it into text format
  text_version = recognize_voice()
  # give "text_version" to reply() fn
  reply(text_version)

After making a delay of 1 second, we start an infinite loop.
Then speak() the message “Start speaking…”, which will be like a prompt for the end-user.
Then we listen for the voice and convert it into text format using the recognize_voice() function which we just created.
Now we have the text_version of our spoken input. We can use it to generate responses such as telling the date or the current time, or searching Google, depending on what we asked for.
That’s what the reply() function is going to do.

Now let’s create that reply() function.

reply()

This function will accept text_version as an argument and then act accordingly. Type the following code below the recognize_voice() function which we created earlier:

""" fn to respond back """
def reply(text_version):
  # name
  if "name" in text_version:
    speak("My name is JARVIS")

  # how are you?
  if "how are you" in text_version:
    speak("I am fine...")

  # date
  if "date" in text_version:
    # get today's date and format it - 9 November 2020
    # (%-d is not supported on Windows, so use the day number directly)
    now = datetime.now()
    date = str(now.day) + " " + now.strftime("%B %Y")
    speak(date)

  # time
  if "time" in text_version:
    # get current time and format it like - 02 28
    time = datetime.now().time().strftime("%H %M")
    speak("The time is " + time)

  # search google
  if "search" in text_version:
    speak("What do you want me to search for?")
    keyword = recognize_voice()
    # if "keyword" is not empty
    if keyword != '':
      url = "https://google.com/search?q=" + keyword
      # webbrowser module to work with the webbrowser
      speak("Here are the search results for " + keyword)
      webbrowser.open(url)
      sleep(3)

  # quit/exit
  if "quit" in text_version or "exit" in text_version:
    speak("Ok, I am going to take a nap...")
    exit()

See, it’s very simple. All we are doing is checking whether a certain piece of text is present in the given text_version. If we find one of the texts we are looking for, we act accordingly: speak() the current time or date, or search Google by opening the web browser.
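One thing to keep in mind is that the in operator matches substrings, so “date” is also found inside “update” and “time” inside “sometimes”. The sketch below compares that behaviour with a stricter whole-word check built on regular expressions. The COMMANDS tuple and both function names are made up for this illustration, and the stricter variant is only one possible alternative, not what the tutorial code uses.

```python
import re

COMMANDS = ("name", "how are you", "date", "time", "search", "quit", "exit")


def substring_matches(text_version):
    """Mirror the tutorial's checks: a command matches if it appears anywhere."""
    return [command for command in COMMANDS if command in text_version]


def whole_word_matches(text_version):
    """Stricter variant: a command matches only on word boundaries."""
    return [
        command
        for command in COMMANDS
        if re.search(r"\b" + re.escape(command) + r"\b", text_version)
    ]


print(substring_matches("what is the date and time"))
print(substring_matches("i sometimes update my notes"))
print(whole_word_matches("i sometimes update my notes"))
```

For a small assistant the simple substring check is usually fine. The whole-word version is there if accidental matches start to annoy you.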
Again see, we are using the speak() function, but haven’t created it yet. And that’s what we are going to do next.

speak()

Type the following code above or below the reply() function:

""" speak (text to speech) """
def speak(text):
  engine.say(text)
  engine.runAndWait()

Pretty straightforward, isn’t it? Here we use the engine we instantiated earlier to say() the text we give it, and runAndWait() blocks until the speech has finished. That’s all the speak() function does.

That’s it, you have successfully created your own Python voice assistant in no time!

Now let’s test it. Type the following code inside the terminal window at the bottom:

python main.py

Go on, ask a few questions like “What is your name?” or “What is the date today?”, or say “Search Google”.

Have fun with it…

Final Code

Here is the final version of the main.py file. If you get any errors, then cross-check your code with the following one:

# all our imports
import speech_recognition as sr
from time import sleep
from datetime import datetime
import webbrowser
import pyttsx3

# make an instance of Recognizer class
r = sr.Recognizer()

# confs for pyttsx3
engine = pyttsx3.init()

""" speak (text to speech) """
def speak(text):
  engine.say(text)
  engine.runAndWait()

""" fn to recognize our voice and return the text_version of it"""
def recognize_voice():
  text = ''
  # create an instance of the Microphone class
  with sr.Microphone() as source:
    # adjust for ambient noise
    r.adjust_for_ambient_noise(source)
    # capture the voice
    voice = r.listen(source)
    # let's recognize it
    try:
      text = r.recognize_google(voice)
    except sr.RequestError:
      # the API could not be reached (for example, no Internet connection)
      speak("Sorry, I can't access the Google API...")
    except sr.UnknownValueError:
      # the audio was captured but could not be understood
      speak("Sorry, unable to recognize your speech...")
  # an empty string is returned when recognition failed
  return text.lower()

""" fn to respond back """
def reply(text_version):
  # name
  if "name" in text_version:
    speak("My name is JARVIS")

  # how are you?
  if "how are you" in text_version:
    speak("I am fine...")

  # date
  if "date" in text_version:
    # get today's date and format it - 9 November 2020
    # (%-d is not supported on Windows, so use the day number directly)
    now = datetime.now()
    date = str(now.day) + " " + now.strftime("%B %Y")
    speak(date)

  # time
  if "time" in text_version:
    # get current time and format it like - 02 28
    time = datetime.now().time().strftime("%H %M")
    speak("The time is " + time)

  # search google
  if "search" in text_version:
    speak("What do you want me to search for?")
    keyword = recognize_voice()
    # if "keyword" is not empty
    if keyword != '':
      url = "https://google.com/search?q=" + keyword
      # webbrowser module to work with the webbrowser
      speak("Here are the search results for " + keyword)
      webbrowser.open(url)
      sleep(3)

  # quit/exit
  if "quit" in text_version or "exit" in text_version:
    speak("Ok, I am going to take a nap...")
    exit()

# wait a second for adjust_for_ambient_noise() to do its thing
sleep(1)

while True:
  speak("Start speaking...")
  # listen for voice and convert it into text format
  text_version = recognize_voice()
  # give "text_version" to reply() fn
  reply(text_version)

Wrapping Up

I hope you enjoyed this tutorial. In some places, I intentionally skipped the explanation because the code was simple and self-explanatory. I left it to you to decode it on your own.

True learning takes place when you try things on your own. Simply following a tutorial won’t make you a better programmer. You have to use your own brain.

If you still have any errors, first try to solve them on your own by googling them.

Only if you can’t find a solution should you comment below. Finding and resolving bugs on your own is a skill that every programmer should have!

And that’s it. Thank you ;)