Google Cloud Text-to-Speech with PowerShell
A guide for using PowerShell with the Google Cloud Text-to-Speech API.
TL;DR
In this post, I’ll walk through the basics of using PowerShell to interact with the Google Cloud Text-to-Speech API. It is partly a documentation exercise and partly the guide I’d like to have been able to read when I started on my Elite Dangerous Google Cloud Text-to-Speech project.
As such, we’ll start with a basic script that produces an audio file and explain how it works. Later I’ll walk through some parameters of interest that affect the response. Lastly, I’ll show how we can use SSML to add variation to the response.
The bare minimum
This post assumes you’ve followed the steps in this Google article to correctly set up a project and configure the Cloud SDK on your machine.
To get started with PowerShell, let’s look at the minimum script that would produce an audio file:
#Common auth, URL and headers
$gauth = gcloud auth print-access-token
$headers = @{}
$headers.Add("Authorization", "Bearer $gauth")
$target = "https://texttospeech.googleapis.com/v1/text:synthesize"
#Variables
$text = "Hey there! You're using plain text for this synthesis"
$languageCode = 'en'
$body = @{
    input = @{
        text = $text
    }
    voice = @{
        languageCode = $languageCode
    }
    audioConfig = @{
        audioEncoding = 'MP3'
    }
}
#Build JSON body for the request
$jbody = ConvertTo-Json ($body)
#Try conversion request
try {
    $response = Invoke-RestMethod -ContentType 'application/json' -Headers $headers -Uri $target -Method Post -Body $jbody
    #Extract the base64 encoded response
    $base64Audio = $response.audioContent
    #Produce output file (HH gives a 24-hour timestamp, so afternoon runs cannot reuse a morning name)
    $base64Audio | Out-File -FilePath "./google.txt" -Encoding ascii -Force
    $convertedFileName = 'GTTS-Plain-{0}.mpga' -f (Get-Date -Format 'yyyy-MM-dd-HH-mm-ss')
    certutil -decode google.txt $convertedFileName
} catch {
    Write-Host "StatusCode:" $_.Exception.Response.StatusCode.value__
    Write-Host "StatusDescription:" $_.Exception.Response.StatusDescription
}
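As an aside, the Out-File and certutil pair in the script relies on a Windows utility and a temporary text file. Here is a minimal sketch of an alternative that decodes the response in memory with .NET. The `$base64Audio` value below is a made-up stand-in so the snippet runs without calling the API; in the real script it would be `$response.audioContent`.

```powershell
# Stand-in for $response.audioContent (an assumption for illustration only).
$base64Audio = [Convert]::ToBase64String([byte[]](0x49, 0x44, 0x33, 0x04))

# Decode the base64 text straight to bytes; no intermediate text file is needed.
$audioBytes = [Convert]::FromBase64String($base64Audio)

# Build an absolute path, because .NET methods resolve relative paths against the
# process working directory, which can differ from the current PowerShell location.
$fileName = 'GTTS-Plain-{0}.mpga' -f (Get-Date -Format 'yyyy-MM-dd-HH-mm-ss')
$outputPath = Join-Path -Path (Get-Location).Path -ChildPath $fileName

# Write the decoded audio bytes to disk.
[System.IO.File]::WriteAllBytes($outputPath, $audioBytes)
```

I have not compared the two approaches for speed; the appeal is simply that nothing here depends on certutil being present.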
Breaking it down
Most of the code is structural (building up authentication, the target URL and file output). The #Variables section of the code will be the focus of our changes. When we examine the documentation for a request, we need only supply the following information:
- Text or SSML field for synthesis input
- Language code
- Audio encoding
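To make those three requirements concrete, here is a rough sketch of a helper that builds the request body from them. `New-TtsRequestBody` and its parameter names are my own invention for illustration, not part of any Google module; the field names (`input`, `voice`, `audioConfig`) are the ones used in the script above.

```powershell
# Hypothetical helper: assembles the minimum request body for text:synthesize.
function New-TtsRequestBody {
    param(
        [Parameter(Mandatory)] [string] $Text,
        [string] $LanguageCode = 'en',
        [string] $AudioEncoding = 'MP3',
        [switch] $AsSsml
    )
    # The input object carries either a text field or an ssml field.
    # ($input is an automatic variable in PowerShell, hence the different name.)
    $inputField = if ($AsSsml) { @{ ssml = $Text } } else { @{ text = $Text } }

    @{
        input       = $inputField
        voice       = @{ languageCode = $LanguageCode }
        audioConfig = @{ audioEncoding = $AudioEncoding }
    }
}

# Build a body and serialise it, ready to pass to Invoke-RestMethod -Body.
$json = New-TtsRequestBody -Text 'Hello' | ConvertTo-Json -Depth 3
```

Setting `-Depth` explicitly is a precaution: ConvertTo-Json only serialises a couple of levels by default, which is enough for this body but easy to outgrow.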
You might notice I’ve picked en as the language code. Using this invokes a specific behaviour:
Note that the TTS service may choose a voice with a slightly different language code than the one selected; it may substitute a different region (e.g. using en-US rather than en-CA if there isn’t a Canadian voice available), or even a different language, e.g. using “nb” (Norwegian Bokmal) instead of “no” (Norwegian).
Therefore, using en we can only say that we’ll get an English result back, without certainty on which region is selected. We can use en-GB for a preferred British voice, with the understanding that a voice from an alternative region could still be selected, perhaps due to capacity within the system. Here are some plain text examples with only the region specified:
languageCode | Audio |
---|---|
en-GB | |
en-US | |
Specifying gender
We can request a preferred gender for the voice. If that gender is not available, another voice is selected rather than the request failing. The relevant PowerShell code changes as follows:
$body = @{
    input = @{
        text = $text
    }
    voice = @{
        languageCode = $languageCode
        ssmlGender = 'FEMALE'
    }
    audioConfig = @{
        audioEncoding = 'MP3'
    }
}
languageCode | ssmlGender | Audio |
---|---|---|
en-GB | FEMALE | |
en-GB | MALE | |
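If you want PowerShell to catch a mistyped gender before the request is sent, a ValidateSet attribute is one option. The sketch below is only an illustration: `New-TtsVoiceSelection` is a hypothetical name, and the allowed values (MALE, FEMALE, NEUTRAL) reflect my understanding of the API's gender enum, so check them against the current reference.

```powershell
# Hypothetical helper: builds the voice part of the request body.
function New-TtsVoiceSelection {
    param(
        [string] $LanguageCode = 'en-GB',
        # Assumed set of accepted values; ValidateSet matching is case-insensitive.
        [ValidateSet('MALE', 'FEMALE', 'NEUTRAL')]
        [string] $Gender
    )
    $voice = @{ languageCode = $LanguageCode }

    # Only add ssmlGender when the caller asked for one, so the service
    # is otherwise free to pick a voice.
    if ($PSBoundParameters.ContainsKey('Gender')) {
        $voice.ssmlGender = $Gender.ToUpperInvariant()
    }
    $voice
}

$voice = New-TtsVoiceSelection -Gender 'female'
```

The result can be dropped into the `voice` key of the `$body` hash table in place of the literal shown above.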
Voice selection
If you do not specify a voice in the JSON request, then one is picked for you based on the indicated language code. If you want to select a voice, you can choose from a supported list. Below is an example to indicate a voice preference:
#Variables
$languageCode = 'en-GB'
$voicename = 'en-GB-Standard-D'
$text = "Hey there! You're using plain text for this synthesis and selecting voice $voicename."
$body = @{
    input = @{
        text = $text
    }
    voice = @{
        languageCode = $languageCode
        name = $voicename
    }
    audioConfig = @{
        audioEncoding = 'MP3'
    }
}
languageCode | voicename | Audio |
---|---|---|
en-GB | en-GB-Standard-A | |
en-GB | en-GB-Standard-B | |
en-GB | en-GB-Standard-C | |
en-GB | en-GB-Standard-D | |
For completeness, here is a similar set of examples using the premium WaveNet voices.
$text = "Hey there! You're using plain text for this synthesis and selecting premium voice $voicename."
languageCode | voicename | Audio |
---|---|---|
en-GB | en-GB-Wavenet-A | |
en-GB | en-GB-Wavenet-B | |
en-GB | en-GB-Wavenet-C | |
en-GB | en-GB-Wavenet-D | |
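Rather than copying voice names from the supported list by hand, you can also ask the API for them. The following is a sketch, assuming the `voices` endpoint of the same v1 API and its `languageCode` query parameter; `Get-TtsVoice` is a name I made up for this post.

```powershell
# Hypothetical helper: lists the voices the service offers, optionally per language.
function Get-TtsVoice {
    param(
        [Parameter(Mandatory)] [string] $AccessToken,
        [string] $LanguageCode
    )
    $uri = 'https://texttospeech.googleapis.com/v1/voices'
    if ($LanguageCode) {
        # Restrict the listing to a single language code, e.g. en-GB.
        $uri += '?languageCode=' + [uri]::EscapeDataString($LanguageCode)
    }
    $headers = @{ Authorization = "Bearer $AccessToken" }

    # The response wraps the list in a "voices" property.
    (Invoke-RestMethod -Uri $uri -Headers $headers -Method Get).voices
}

# Example call (needs the Cloud SDK set up as described earlier):
# Get-TtsVoice -AccessToken (gcloud auth print-access-token) -LanguageCode 'en-GB' |
#     Select-Object name, ssmlGender
```

The example call is commented out so that pasting the block does not fire a request by accident.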
Changing the audio configuration
Several options are available for changing the resulting audio. Here are three key ones:
- Rate of speech
- Speaking pitch
- Gain
We can extend our hash table to include these options. The values below match the current defaults:
audioConfig = @{
    audioEncoding = 'MP3'
    speakingRate = 1
    pitch = 0
    volumeGainDb = 0
}
Here are some examples with those values modified, all using the following text:
$text = "The quick brown fox jumps over the lazy dog."
Audio config | Audio |
---|---|
speakingRate = 1, pitch = 0, volumeGainDb = 0 | |
speakingRate = 1.5, pitch = 0, volumeGainDb = 0 | |
speakingRate = 1, pitch = 0, volumeGainDb = 5 | |
speakingRate = 1, pitch = 10, volumeGainDb = 0 | |
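If you want to produce a set of samples like the table above in one go, looping over the variations saves editing the script each time. This is a sketch that only builds the JSON bodies; the `en-GB` language code is my assumption, and each body would still need sending with Invoke-RestMethod as in the first script.

```powershell
$text = 'The quick brown fox jumps over the lazy dog.'

# One hash table per row of the table above.
$variations = @(
    @{ speakingRate = 1;   pitch = 0;  volumeGainDb = 0 }
    @{ speakingRate = 1.5; pitch = 0;  volumeGainDb = 0 }
    @{ speakingRate = 1;   pitch = 0;  volumeGainDb = 5 }
    @{ speakingRate = 1;   pitch = 10; volumeGainDb = 0 }
)

$requests = foreach ($variation in $variations) {
    # Adding two hash tables merges them (and fails loudly on duplicate keys).
    $audioConfig = @{ audioEncoding = 'MP3' } + $variation

    @{
        input       = @{ text = $text }
        voice       = @{ languageCode = 'en-GB' }   # assumed for this sketch
        audioConfig = $audioConfig
    } | ConvertTo-Json -Depth 3
}
```

`$requests` ends up holding one JSON string per variation, in the same order as the table.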
Using Speech Synthesis Markup Language (SSML)
SSML allows you to nuance text with a variety of tools in the form of a mark-up language. Google document what they support in a straightforward fashion. Modifying the code to use SSML is as simple as:
#Variables
$languageCode = 'en-GB'
$voicename = 'en-GB-Wavenet-A'
$text = "<speak>The <say-as interpret-as=`"characters`">quick</say-as> brown fox jumps over the lazy dog.</speak>"
$body = @{
    input = @{
        ssml = $text
    }
    voice = @{
        languageCode = $languageCode
        name = $voicename
    }
    audioConfig = @{
        audioEncoding = 'MP3'
    }
}
The modifications are to specify ssml = $text in place of text = $text, and for $text to contain a valid SSML string. Note the backtick escaping (`") that lets us use a " character inside a double-quoted PowerShell string.
SSML say-as effect | Audio |
---|---|
none | |
interpret-as="characters" | |
interpret-as="expletive" | |
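Two small variations on the quoting may be useful here. The sketch below shows a single-quoted string, which needs no backticks, and a way to escape arbitrary text before wrapping it in a `<speak>` element. The `$userText` value is a made-up example.

```powershell
# Single-quoted strings are literal, so inner double quotes need no backtick.
# The trade-off: variables such as $voicename are not expanded inside them.
$ssmlLiteral = '<speak>The <say-as interpret-as="characters">quick</say-as> brown fox jumps over the lazy dog.</speak>'

# Hypothetical free text containing characters that are special in XML.
$userText = 'Fish & chips cost < 5 pounds'

# Escape &, <, >, " and ' so the text cannot break the SSML markup.
$escaped = [System.Security.SecurityElement]::Escape($userText)
$ssmlFromUser = "<speak>$escaped</speak>"
```

Either string can be assigned to the `ssml` key of the `input` hash table exactly as `$text` is above.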
In conclusion
Part of the novelty in using PowerShell for this project was the lack of direct documentation or examples. On the flip side, there were numerous examples in other languages (Ruby/Python/PHP/Node.js/Java/Go/C# and curl). Those examples helped me find analogues in PowerShell.
The ability to select voices and nuance speech is relevant to adding variation into my Elite Dangerous project and working on one of the documented limitations.
Acknowledgements
I’ve got a standing acknowledgement to add due to the use of PowerShell and hash tables. Thanks, Kevin Marquette, for your ever-excellent “Everything you wanted to know about hashtables”.