Google Cloud Text-to-Speech with PowerShell

A guide for using PowerShell with the Google Cloud Text-to-Speech API.


TL;DR

In this post, I’ll walk through the basics of using PowerShell to interact with the Google Cloud Text-to-Speech API. It is partly a documentation exercise and partly the guide I would have liked to read when I started on my Elite Dangerous Google Cloud Text-to-Speech project.

We’ll start with a basic script that produces an audio file and explain how it works. Then I’ll walk through some parameters that affect the response. Lastly, I’ll show how SSML can be used to vary the response.

The bare minimum

This post assumes you’ve followed the steps in this Google article to correctly set up a project and configure the Cloud SDK on your machine.

To get started with PowerShell, let’s look at the minimum script that would produce an audio file:

#Common auth, URL and headers
$gauth = gcloud auth print-access-token
$headers = @{}
$headers.add("Authorization", "Bearer $gauth")
$target = "https://texttospeech.googleapis.com/v1/text:synthesize"

#Variables
$text = "Hey there! You're using plain text for this synthesis"
$languageCode = 'en'

$body = @{
    input = @{
        text = $text
    }
    voice = @{
        languageCode = $languageCode
    }
    audioConfig = @{
        audioEncoding = 'MP3'
    }
}

#Build JSON body for the request
$jbody = ConvertTo-Json ($body)

#Try conversion request
try{
    $response = Invoke-RestMethod -ContentType 'application/json' -headers $headers -Uri $target -Method Post -body $jbody
    #Extract the base64 encoded response
    $base64Audio = $response.audioContent

    #Produce output file
    $base64Audio | Out-File -FilePath "./google.txt" -Encoding ascii -Force
    $convertedFileName = 'GTTS-Plain-{0}.mpga' -f (Get-Date -Format 'yyyy-MM-dd-HH-mm-ss')
    certutil -decode google.txt $convertedFileName
}catch {
    Write-Host "StatusCode:" $_.Exception.Response.StatusCode.value__ 
    Write-Host "StatusDescription:" $_.Exception.Response.StatusDescription
}
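
The script above decodes the audio by writing a temporary text file and calling certutil, which is a Windows tool. As an alternative, here is a minimal sketch that decodes the Base64 string in memory using .NET. Save-TtsAudio is a hypothetical helper name of my own, not part of any Google tooling, and I have only reasoned through it rather than used it in the project.

```powershell
# Sketch: decode the Base64 audio in memory instead of via google.txt and certutil.
# Save-TtsAudio is a hypothetical helper name used for illustration only.
function Save-TtsAudio {
    param(
        [Parameter(Mandatory)] [string] $Base64Audio,
        [Parameter(Mandatory)] [string] $Path
    )

    # Convert the Base64 text back into the raw audio bytes
    $bytes = [System.Convert]::FromBase64String($Base64Audio)

    # .NET resolves relative paths against the process working directory,
    # which can differ from PowerShell's current location, so resolve it first
    $fullPath = $ExecutionContext.SessionState.Path.GetUnresolvedProviderPathFromPSPath($Path)

    [System.IO.File]::WriteAllBytes($fullPath, $bytes)
    return $fullPath
}

# Possible usage inside the try block, replacing the Out-File and certutil lines:
# Save-TtsAudio -Base64Audio $response.audioContent -Path $convertedFileName
```

This avoids the intermediate file and should also work where certutil is not available.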

Breaking it down

Most of the code is structural: it builds the authentication header, the target URL and the output file. The #Variables section is where we’ll make our changes. According to the request documentation, we need to supply only the following information:

  • Text or SSML field for synthesis input
  • Language code
  • Audio encoding
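
To make those three inputs explicit, the body construction can be wrapped in a small function. This is only a sketch: New-TtsRequestBody and its parameter names are hypothetical, while the JSON field names are the ones used in the script above.

```powershell
# Sketch: build the minimal request body from the three required inputs.
# New-TtsRequestBody is a hypothetical helper name used for illustration only.
function New-TtsRequestBody {
    param(
        [Parameter(Mandatory)] [string] $Text,
        [string] $LanguageCode = 'en',
        [string] $AudioEncoding = 'MP3'
    )

    $body = [ordered]@{
        input       = @{ text = $Text }
        voice       = @{ languageCode = $LanguageCode }
        audioConfig = @{ audioEncoding = $AudioEncoding }
    }

    # -Depth is set explicitly because ConvertTo-Json only serialises
    # two levels by default, which leaves no headroom if the body grows
    return ConvertTo-Json -InputObject $body -Depth 5
}

# Show the JSON that would be sent as the request body
New-TtsRequestBody -Text 'Hello' -LanguageCode 'en-GB'
```

The returned string can be passed to Invoke-RestMethod in place of $jbody.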

You might notice I’ve picked en as the language code. This invokes a specific behaviour:

Note that the TTS service may choose a voice with a slightly different language code than the one selected; it may substitute a different region (e.g. using en-US rather than en-CA if there isn’t a Canadian voice available), or even a different language, e.g. using “nb” (Norwegian Bokmal) instead of “no” (Norwegian).

Therefore, with en we can only say that we’ll get an English result back, without certainty about which region is selected. We can use en-GB to request a British voice, with the understanding that a voice from another region could still be selected, perhaps due to capacity within the system. Here are some plain text examples with only the language code specified:

languageCode   Audio
en-GB          [audio sample]
en-US          [audio sample]

Specifying gender

We can request a preferred gender for the voice. The documentation notes that if no voice of that gender is available, another may be picked rather than the request failing. The relevant PowerShell code changes as follows:

$body = @{
    input = @{
        text = $text
    }
    voice = @{
        languageCode = $languageCode
        ssmlGender = 'FEMALE'
    }
    audioConfig = @{
        audioEncoding = 'MP3'
    }
}

languageCode   ssmlGender   Audio
en-GB          female       [audio sample]
en-GB          male         [audio sample]
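
Because the gender is a fixed keyword, a typo is easier to catch before the request is sent. The sketch below uses ValidateSet for that. New-TtsVoiceSelection is a hypothetical helper name, and the list of accepted values (MALE, FEMALE, NEUTRAL) is my reading of the API reference, so treat it as an assumption to check against the current documentation.

```powershell
# Sketch: validate the gender keyword locally before building the voice block.
# New-TtsVoiceSelection is a hypothetical helper name used for illustration only.
# Assumption: the accepted values are MALE, FEMALE and NEUTRAL.
function New-TtsVoiceSelection {
    param(
        [Parameter(Mandatory)] [string] $LanguageCode,
        [ValidateSet('MALE', 'FEMALE', 'NEUTRAL')]
        [string] $SsmlGender
    )

    $voice = @{ languageCode = $LanguageCode }

    # ValidateSet is case-insensitive, so normalise to the upper-case form
    if ($SsmlGender) {
        $voice.ssmlGender = $SsmlGender.ToUpperInvariant()
    }
    return $voice
}

# The result can be assigned to the voice key of $body
$voice = New-TtsVoiceSelection -LanguageCode 'en-GB' -SsmlGender 'female'
```

Leaving out -SsmlGender produces the same voice block as the bare minimum script.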

Voice selection

If you do not specify a voice in the JSON request, one is picked for you based on the language code. If you want a particular voice, you can choose one from the list of supported voices. Below is an example that states a voice name preference:

#Variables
$languageCode = 'en-GB'
$voicename = 'en-GB-Standard-D'
$text = "Hey there! You're using plain text for this synthesis and selecting voice $voicename."


$body = @{
    input = @{
        text = $text
    }
    voice = @{
        languageCode = $languageCode
        name = $voicename
    }
    audioConfig = @{
        audioEncoding = 'MP3'
    }
}

languageCode   voicename          Audio
en-GB          en-GB-Standard-A   [audio sample]
en-GB          en-GB-Standard-B   [audio sample]
en-GB          en-GB-Standard-C   [audio sample]
en-GB          en-GB-Standard-D   [audio sample]
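
The list of supported voices can also be fetched from the API itself rather than read from the documentation page. This is a rough sketch using the voices endpoint with the same gcloud token approach as the main script. Get-TtsVoice is a hypothetical helper name, and the response field names (voices, name, ssmlGender, languageCodes) are as I understand them from the API reference.

```powershell
# Sketch: list the available voices, optionally filtered by language code.
# Get-TtsVoice is a hypothetical helper name used for illustration only.
# Assumes the Cloud SDK is configured as described at the start of this post.
function Get-TtsVoice {
    param([string] $LanguageCode)

    $uri = 'https://texttospeech.googleapis.com/v1/voices'
    if ($LanguageCode) {
        $uri += '?languageCode=' + [System.Uri]::EscapeDataString($LanguageCode)
    }

    $headers = @{ Authorization = "Bearer $(gcloud auth print-access-token)" }
    $response = Invoke-RestMethod -Uri $uri -Headers $headers -Method Get

    # Assumption: each entry carries name, ssmlGender and languageCodes fields
    $response.voices | Sort-Object name | Select-Object name, ssmlGender, languageCodes
}

# Usage (needs network access and a valid token):
# Get-TtsVoice -LanguageCode 'en-GB'
```

Any name returned this way can be used as $voicename in the request body.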

For completeness, here is a similar set of examples with the premium WaveNet option.

$text = "Hey there! You're using plain text for this synthesis and selecting premium voice $voicename."

languageCode   voicename         Audio
en-GB          en-AU-Wavenet-A   [audio sample]
en-GB          en-AU-Wavenet-B   [audio sample]
en-GB          en-AU-Wavenet-C   [audio sample]
en-GB          en-AU-Wavenet-D   [audio sample]

Changing the audio configuration

Several options are available for changing the resulting audio. Here are three key ones:

  • Rate of speech
  • Speaking pitch
  • Gain

We can extend our hash table to include these options. The values below are the same as the current defaults:

audioConfig = @{
    audioEncoding = 'MP3'
    speakingRate = 1
    pitch = 0
    volumeGainDb = 0
}

Here are some examples with the modification of those values with the following text:

$text = "The quick brown fox jumps over the lazy dog."

Audio config                                        Audio
speakingRate = 1, pitch = 0, volumeGainDb = 0       [audio sample]
speakingRate = 1.5, pitch = 0, volumeGainDb = 0     [audio sample]
speakingRate = 1, pitch = 0, volumeGainDb = 5       [audio sample]
speakingRate = 1, pitch = 10, volumeGainDb = 0      [audio sample]
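
Each of these values has a documented range, so it can be worth rejecting out-of-range numbers before making the request. The sketch below does that with ValidateRange. New-TtsAudioConfig is a hypothetical helper name, and the limits used (0.25 to 4.0 for rate, -20 to 20 for pitch, -96 to 16 for gain) are my reading of the AudioConfig reference, so please treat them as assumptions and check the current documentation.

```powershell
# Sketch: build the audioConfig block with local range checks.
# New-TtsAudioConfig is a hypothetical helper name used for illustration only.
# Assumption: the limits below match the API reference at the time of writing.
function New-TtsAudioConfig {
    param(
        [string] $AudioEncoding = 'MP3',
        [ValidateRange(0.25, 4.0)]   [double] $SpeakingRate = 1.0,
        [ValidateRange(-20.0, 20.0)] [double] $Pitch = 0.0,
        [ValidateRange(-96.0, 16.0)] [double] $VolumeGainDb = 0.0
    )

    return @{
        audioEncoding = $AudioEncoding
        speakingRate  = $SpeakingRate
        pitch         = $Pitch
        volumeGainDb  = $VolumeGainDb
    }
}

# Faster speech with everything else left at the defaults
$audioConfig = New-TtsAudioConfig -SpeakingRate 1.5
```

The returned hash table can be assigned to the audioConfig key of $body.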

Using Speech Synthesis Markup Language (SSML)

SSML lets you add nuance to text through a mark-up language. Google documents clearly which elements it supports. Modifying the code to use SSML is as simple as:

#Variables
$languageCode = 'en-GB'
$voicename = 'en-GB-Wavenet-A'
$text = "<speak>The <say-as interpret-as=`"characters`">quick</say-as> brown fox jumps over the lazy dog.</speak>"

$body = @{
    input = @{
        ssml = $text
    }
    voice = @{
        languageCode = $languageCode
        name = $voicename
    }
    audioConfig = @{
        audioEncoding = 'MP3'
    }
}

The modifications are to specify ssml = $text and for $text to contain a valid SSML string. Note the escape sequence `" (a backtick followed by a double quote), which lets us include a literal " inside a double-quoted PowerShell string.

SSML say-as effect           Audio
none                         [audio sample]
interpret-as="characters"    [audio sample]
interpret-as="expletive"     [audio sample]
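
When the SSML is built from variables, two things help: single-quoted format strings avoid the backtick escaping, and XML-escaping the text keeps characters such as & or < from breaking the mark-up. Here is a minimal sketch of that idea. ConvertTo-SsmlSayAs is a hypothetical helper name, and the escaping uses the .NET SecurityElement class rather than anything specific to Google.

```powershell
# Sketch: wrap text in a say-as element with XML-escaped content.
# ConvertTo-SsmlSayAs is a hypothetical helper name used for illustration only.
function ConvertTo-SsmlSayAs {
    param(
        [Parameter(Mandatory)] [string] $Text,
        [Parameter(Mandatory)] [string] $InterpretAs
    )

    # Escape &, <, >, " and ' so the text cannot break the mark-up
    $safeText = [System.Security.SecurityElement]::Escape($Text)
    $safeAttr = [System.Security.SecurityElement]::Escape($InterpretAs)

    # Single-quoted strings need no backtick before the double quotes
    return '<say-as interpret-as="{0}">{1}</say-as>' -f $safeAttr, $safeText
}

$sayAs = ConvertTo-SsmlSayAs -Text 'quick' -InterpretAs 'characters'
$ssml = '<speak>The {0} brown fox jumps over the lazy dog.</speak>' -f $sayAs
$ssml
```

The resulting $ssml string can be used as the ssml value in the request body.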

In conclusion

Part of the novelty in using PowerShell for this project was the lack of direct documentation or examples. On the flip side, there were numerous examples in other languages (Ruby/Python/PHP/Node.js/Java/Go/C# and curl). Those examples helped me find analogues in PowerShell.

The ability to select voices and nuance speech is relevant to adding variation to my Elite Dangerous project and to working on one of its documented limitations.

Acknowledgements

I have a standing acknowledgement to make whenever I use PowerShell and hash tables. Thanks, Kevin Marquette, for your ever-excellent “Everything you wanted to know about hashtables”.
