Here are the steps to install the Speech to Text Python program:

  1. Make sure that you have Python 3 and pip installed on your system.

  2. Visit https://platform.openai.com/account/api-keys and create an API key if you do not have one. Note that you also have to add a payment method because the API is not free, although it is reasonably priced (1 hour of transcription costs $0.36 at the time of writing).

  3. Save the API key as an environment variable, following this guide: https://help.openai.com/en/articles/5112595-best-practices-for-api-key-safety. For Linux (bash) the commands are:

    $ echo "export OPENAI_API_KEY='PASTE-YOUR-KEY-HERE'" >> ~/.bash_profile
    $ source ~/.bash_profile
    

Verify that you now have the OpenAI API key as an OS environment variable:

    $ echo $OPENAI_API_KEY

If your API key is displayed in your terminal then you are good to go.
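
The same check can be done from Python, which is also how the program itself reads the key. This is only a minimal sketch: the helper names `mask_key` and `describe_key` are my own illustrations and not part of the program. It prints a masked version so the full key does not end up in your terminal history or in a screenshot.

```python
import os


def mask_key(key: str) -> str:
    """Return the key with everything but the first 3 and last 4 characters hidden."""
    if len(key) <= 8:
        return "*" * len(key)
    return key[:3] + "*" * (len(key) - 7) + key[-4:]


def describe_key(env=os.environ) -> str:
    """Report whether OPENAI_API_KEY is set without printing it in full."""
    key = env.get("OPENAI_API_KEY", "")
    if not key:
        return "OPENAI_API_KEY is not set"
    return "OPENAI_API_KEY is set: " + mask_key(key)


print(describe_key())
```

If this reports that the key is not set, open a new terminal or run the `source` command again before continuing.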

  4. Create a Python virtual environment in the folder where you want to run the program (these commands are for Linux; they should be similar on Mac, and on Windows you will need to look up the slight differences):

    $ pip install virtualenv
    $ mkdir speech2text
    $ virtualenv speech2text
    $ source speech2text/bin/activate
    
  5. Install the Python libraries inside your virtualenv:

    $ pip install gradio pyautogui openai==0.27.0
    
  6. Copy and paste the code below into a file and save it with a .py extension, e.g. speech2text.py. The Gradio and OpenAI parts of the code are from https://www.linkedin.com/pulse/create-talking-bot-new-chatgpt-whisper-api-using-python-leo-wang, and I have customized it slightly to add the PyAutoGUI functionality:

    import gradio as gr
    import openai
    import os
    import pyautogui

    # Screen region where the transcription should be typed.
    # Use a graphics program or other tool to find the coordinates
    # of the text box or document you want to type into.
    TEXT_REGION = (55, 182, 500, 200)  # (left, top, width, height)

    # Load the API key from your OS environment
    openai.api_key = os.environ["OPENAI_API_KEY"]

    # Note: You need to be using OpenAI Python v0.27.0 for the code below to work
    def transcribe(audio):
        print(audio)

        # Give the recording a .wav extension before uploading it
        wav_path = audio + ".wav"
        os.rename(audio, wav_path)
        # Open the file in a with-block so it is closed after the upload
        with open(wav_path, "rb") as audio_file:
            transcript = openai.Audio.transcribe("whisper-1", audio_file)
        # Convert the "text" value to a string
        textStr = str(transcript["text"])
        # Type the text into the target window
        process_text(textStr)
        # Return the text to gr.Interface
        return textStr

    def process_text(textStr):
        print("Response: " + textStr)
        # Move to the text box / document / whatever and click to focus it
        pyautogui.moveTo(TEXT_REGION[0] + 10, TEXT_REGION[1] + 10, duration=0.5)
        pyautogui.click()
        # Loop over the string to simulate typing
        for char in textStr:
            pyautogui.typewrite(char)

    recSend = gr.Interface(
        fn=transcribe,
        inputs=gr.Audio(source="microphone", type="filepath"),
        outputs="text",
    )
    recSend.launch()

If you just want the response in the Gradio web interface, without moving the mouse and typing the output somewhere, you can comment out the line “process_text(textStr)”. If you want the PyAutoGUI functionality, make sure to set the TEXT_REGION coordinates.
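
To find values for TEXT_REGION without a graphics program, you can read the mouse position with `pyautogui.position()` while hovering over the corners of the target text box. The sketch below assumes you note the top-left and bottom-right corners that way. The helper names `region_from_corners`, `click_point` and `read_mouse_position` are hypothetical and not part of the program above.

```python
def region_from_corners(top_left, bottom_right):
    """Convert two (x, y) corner points into (left, top, width, height)."""
    left, top = top_left
    right, bottom = bottom_right
    if right <= left or bottom <= top:
        raise ValueError("bottom_right must be below and to the right of top_left")
    return (left, top, right - left, bottom - top)


def click_point(region, inset=10):
    """Point the script clicks: the region's top-left corner plus a small inset."""
    left, top, _width, _height = region
    return (left + inset, top + inset)


def read_mouse_position():
    """Return the current mouse position as (x, y); needs a graphical session."""
    import pyautogui  # imported lazily so the helpers above work without a display

    x, y = pyautogui.position()
    return (x, y)


# Example with hand-picked corners (illustrative values matching TEXT_REGION above)
region = region_from_corners((55, 182), (555, 382))
print(region)               # (55, 182, 500, 200)
print(click_point(region))  # (65, 192)
```

In an interactive Python session inside the virtualenv, hover over each corner and call `read_mouse_position()` to get the two points, then paste the resulting tuple into TEXT_REGION.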

  7. Save the file and run it in the terminal using gradio, NOT python:

    $ gradio speech2text.py
    


You will now have a Gradio interface running on http://127.0.0.1:7860/ (localhost). Open that in your web browser, click “Record from microphone”, and then “Submit”.
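
If the program fails at startup instead, a version mismatch is a likely cause, since the code targets OpenAI Python v0.27.0 (and, as far as I know, the `source="microphone"` argument is the Gradio 3.x spelling). The following is a rough sketch of a check you could run inside the virtualenv. The function names `in_virtualenv` and `installed_version` are my own, and the check only reports what is installed; it does not fix anything.

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """True when Python runs inside a virtualenv or venv."""
    # Older virtualenv releases set sys.real_prefix; newer ones change sys.prefix
    base = getattr(sys, "base_prefix", sys.prefix)
    return hasattr(sys, "real_prefix") or sys.prefix != base


def installed_version(package: str):
    """Return the installed version string of a package, or None if missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("virtualenv active:", in_virtualenv())
for name in ("gradio", "pyautogui", "openai"):
    print(name, installed_version(name) or "not installed")
```

If `virtualenv active` is False, run the `source speech2text/bin/activate` command again before starting the program.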

This program can obviously be customized further to automate it to your needs, and I will probably make some changes myself, but the PyAutoGUI functionality already makes it quite useful in all its simplicity.
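
One small customization, as an example: instead of commenting out the `process_text(textStr)` line, the typing could be switched on and off from the environment. This is only a sketch under my own assumptions. The variable name `SPEECH2TEXT_TYPE` and the helpers `typing_enabled` and `handle_transcript` are hypothetical, and a plain list stands in for `process_text` so the example runs without PyAutoGUI.

```python
import os


def typing_enabled(env=os.environ) -> bool:
    """Read the hypothetical SPEECH2TEXT_TYPE switch; typing stays on by default."""
    value = env.get("SPEECH2TEXT_TYPE", "1").strip().lower()
    return value not in ("0", "false", "no", "off")


def handle_transcript(text, type_out, enabled=None):
    """Pass text to type_out only when typing is enabled, then return it."""
    if enabled is None:
        enabled = typing_enabled()
    if enabled:
        type_out(text)
    return text


# Stand-in for process_text so the sketch runs without PyAutoGUI
typed = []
result = handle_transcript("hello world", typed.append, enabled=True)
print(result, typed)
```

In the real program, `transcribe` would call `handle_transcript(textStr, process_text)` and you would start it with `SPEECH2TEXT_TYPE=0 gradio speech2text.py` to keep the output in the web interface only.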