
How to use Azure AI Speech Services Language Translation

Welcome to today’s post.

In today’s post I will be showing you how to use the Azure AI Speech Service to translate speech from one language into another.

The mode of speech input I will be using in this post is the microphone input device. In a previous post I showed how to process speech input from a microphone and recognize the voice input to produce output as text.

It is also possible to use other forms of input for speech recognition and language translation processing, such as a pre-recorded sound file that is in the WAV format. In a previous post I showed how to use speech recognition to produce text output from input sound files.

Explaining the Language Translation Process

In this section I will explain how the language translation process works with the Azure AI Speech Service.

Below is an overview of the audio input, speech recognition, and language translation process:

The above process shows microphone input; however, you can also use a pre-recorded audio file as the input source and translate it into other languages following speech recognition.

The steps taken to translate speech from a sound file are as follows:

  1. Set the input mode. The default is microphone.
  2. Start voice recording from the microphone.
  3. Process each spoken word.
  4. Determine if the word is recognizable.
  5. If the word is recognized, output the recognized word as text.
  6. Translate the text into the target language.
  7. Output the text translation in the target language.
  8. Repeat translation for each recognized word.

As we can see, before we can apply a translation for each spoken word from the input, the actual word needs to be recognized. Following recognition, text that represents the word is output. The output text is then fed into a language translator, which outputs a text translation of the recognized text.

Setup of the Language Translation Service

In this section, I will show how to set up language translation using the Azure Cognitive Services Speech SDK with an Azure AI Speech Service subscription. In previous posts I showed how to set up speech recognition using the Azure Cognitive Services Speech SDK. I will explain the SDK classes and structures that are needed to configure and run language translation.

The first part of the speech translation configuration is similar to the speech recognition configuration: we supply the speech key and speech region from our Azure AI Speech Service subscription to the SpeechTranslationConfig SDK class, and set a source language to be recognized and translated:

var speechTranslationConfig = SpeechTranslationConfig.FromSubscription(
    speechKey, 
    speechRegion
);

speechTranslationConfig.SpeechRecognitionLanguage = sourceLanguage;

Before a translation of the source language can commence, we need to define at least one target language to translate to from the source language. This is done with the AddTargetLanguage() method of the SpeechTranslationConfig SDK class, which takes a string parameter that corresponds to a target language that we want to translate to:

speechTranslationConfig.AddTargetLanguage(targetLanguage);

If we want to add more than one target (destination) language to the list of languages to translate to, we can call the same method repeatedly. Below demonstrates the addition of German and Turkish as target languages:

speechTranslationConfig.AddTargetLanguage("de");
speechTranslationConfig.AddTargetLanguage("tr");

After setup of the source and destination languages, we can run the speech translator. This is done with the speech translation recognizer class, TranslationRecognizer, which takes a speech configuration object and an audio configuration object.

Setting up audio configuration to the default microphone is done this way:

using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();

The speech translation recognizer is configured as shown:

using var speechTranslationRecognizer = new TranslationRecognizer(
    speechTranslationConfig, 
    audioConfig
); 

To capture a single sound utterance and translate it, we can use the RecognizeOnceAsync() method of the TranslationRecognizer class, which returns a structure of type TranslationRecognitionResult.

In the code for the main method, we set up the translation configuration and target languages, then allow the user to choose between single and continuous utterance translation from the input source:

async static Task Main(string[] args)
{
    HostApplicationBuilder builder = Host.CreateApplicationBuilder(args);

    string speechKey = Environment.GetEnvironmentVariable("SPEECH_KEY");
    string speechRegion = Environment.GetEnvironmentVariable("SPEECH_REGION");

    builder.Services.AddLogging(
        l => l.AddConsole().SetMinimumLevel(LogLevel.None));
    using IHost host = builder.Build();

    var sourceLanguage = "en-US";
    var destinationLanguages = new List<string> { "it", "fr", "es", "tr" };

    var speechTranslationConfig = SpeechTranslationConfig.FromSubscription(
        speechKey, 
        speechRegion
    );

    speechTranslationConfig.SpeechRecognitionLanguage = sourceLanguage;

    Console.WriteLine("For Single Utterance Translation Press T.");
    Console.WriteLine("For Continuous Translation Press S.");
    Console.WriteLine("Press Escape to finish.");
    ConsoleKeyInfo consoleKeyInfo = Console.ReadKey(true);

    if (consoleKeyInfo.Key == ConsoleKey.T)
        RunTranslateSingleCommands(
            speechTranslationConfig, 
            sourceLanguage, 
            destinationLanguages
        );

    if (consoleKeyInfo.Key == ConsoleKey.S)
        RunTranslateContinuousCommands(
            speechTranslationConfig, 
            sourceLanguage, 
            destinationLanguages
        );

    await host.RunAsync();

    Console.WriteLine("Application terminated.");
}

In the next section, I will show how to run a single utterance translation and capture and output the result(s). Continuous utterance translation is covered in a later section.

Running a Single Utterance Translation

After declaring the translation result structure,

TranslationRecognitionResult speechTranslationRecognitionResult;

we then run the translation as shown:

speechTranslationRecognitionResult = await 
    speechTranslationRecognizer.RecognizeOnceAsync();

To determine the status of the translation result, we examine the Reason property of the TranslationRecognitionResult structure, which returns one of the following enumerated values:

ResultReason.TranslatedSpeech

ResultReason.TranslatingSpeech

ResultReason.NoMatch

ResultReason.Canceled

Where the result enumerations and their meanings are:

TranslatedSpeech: Speech translation is completed.
TranslatingSpeech: Speech translation is in progress.
NoMatch: Speech cannot be translated.
Canceled: The translation process ended due to an error.

Where there is no matching translation for a spoken word, either the word could not be recognized in the source language, or the recognized word had no translation in the target language.
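
If more detail is needed on why a result produced no match, the SDK provides the NoMatchDetails helper. Below is a minimal sketch of inspecting the no-match reason (how you act on each reason is up to the application):

if (speechTranslationRecognitionResult.Reason == ResultReason.NoMatch)
{
    // Inspect why the speech could not be matched (e.g. silence or unrecognized speech).
    var noMatchDetails = NoMatchDetails.FromResult(speechTranslationRecognitionResult);
    Console.WriteLine($"NOMATCH: Reason={noMatchDetails.Reason}");
}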

A method to run the single utterance speech translation, with input parameters for the translation configuration, the source language, and the target language list, is shown below:

static async void RunTranslateSingleCommands(
    SpeechTranslationConfig speechConfig, 
    string sourceLanguage, 
    List<string> destinationLanguages)
{
    TranslationRecognitionResult speechTranslationRecognitionResult;
    ConsoleKeyInfo consoleKeyInfo;
    bool isFinished = false;

    var languagesList = string.Join(",", destinationLanguages);

    // add languages to translate to …
    destinationLanguages.ForEach(l => speechConfig.AddTargetLanguage(l));

    using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
    using var speechTranslationRecognizer = new TranslationRecognizer(
        speechConfig, 
        audioConfig
    );

    while (!isFinished)
    {
        var stringPrompt = 
            $"Speak into your microphone with language {sourceLanguage} " +
            $"to translate to {languagesList}.";
        Console.WriteLine(stringPrompt);

        speechTranslationRecognitionResult = await speechTranslationRecognizer.RecognizeOnceAsync();

        OutputTranslationRecognitionResult(speechTranslationRecognitionResult);

        Console.WriteLine("Press Y to give another sample. Escape to finish.");
        consoleKeyInfo = Console.ReadKey(true);
        if (consoleKeyInfo.Key == ConsoleKey.Escape)
        {
            Console.WriteLine("Escape Key Pressed.");
            isFinished = true;
        }
    }
}

After calling the asynchronous method RecognizeOnceAsync(), we pass the translation result to the method OutputTranslationRecognitionResult(), which I will explain next.

A method to output the resulting translation is shown below:

static void OutputTranslationRecognitionResult(
    TranslationRecognitionResult translationRecognitionResult)
{
    switch (translationRecognitionResult.Reason)
    {
        case ResultReason.TranslatedSpeech:
            Console.WriteLine($"TRANSLATED: Text={translationRecognitionResult.Text}");
            foreach (var element in translationRecognitionResult.Translations)
            {
                Console.WriteLine($"    TRANSLATED into '{element.Key}': {element.Value}");
            }
            break;
        case ResultReason.TranslatingSpeech:
            Console.WriteLine($"TRANSLATING: Text={translationRecognitionResult.Text}");
            break;
        case ResultReason.NoMatch:
            Console.WriteLine($"NOMATCH: Speech could not be translated.");
            break;
        case ResultReason.Canceled:
            var cancellation = CancellationDetails.FromResult(
                translationRecognitionResult);
            Console.WriteLine($"CANCELED: Reason={cancellation.Reason}");

            if (cancellation.Reason == CancellationReason.Error)
            {
                Console.WriteLine($"CANCELED: ErrorCode={cancellation.ErrorCode}");
                Console.WriteLine($"CANCELED: ErrorDetails={cancellation.ErrorDetails}");
                Console.WriteLine(
                    $"CANCELED: Did you set the speech resource key and region values?");
            }
            break;
    }
}

We use a switch statement to handle each possible result reason returned from the translation and output it accordingly, including a listing of the translations.

Below I have a snapshot of an instance of the TranslationRecognitionResult structure with the uttered speech input Text property and its associated translations in the Translations property:

The translations are stored as key-value pairs, with the Key corresponding to the target language, and the Value corresponding to the text translation of the input Text property in that target language.
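
A particular translation can also be looked up directly by its language code key. Below is a minimal sketch, assuming "fr" was added as a target language:

// Look up the translation for a specific target language by its key.
if (speechTranslationRecognitionResult.Translations.TryGetValue("fr", out var frenchText))
{
    Console.WriteLine($"French translation: {frenchText}");
}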

Below is a session where we run the choices for single and continuous translation, then select single utterance translation with the English word Welcome translated into each of the four specified target languages:

One thing to note is that the console output does not display any accented characters, which would be useful for native readers of the target language(s).
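
This is most likely a console output encoding issue rather than a problem with the translations themselves. One possible fix, assuming the console font supports the characters, is to set the console output encoding to UTF-8 before writing the results:

// Allow accented and other non-ASCII characters to render in the console output.
Console.OutputEncoding = System.Text.Encoding.UTF8;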

In the next section, I will show how to run a continuous speech translation with translation outputs.

Running a Continuous Utterance Speech Translation

With continuous utterance speech translation, we speak a stream of words into the microphone, with the translation commencing after a sentence has been detected.

After the translation has completed, the final translated sentence is output to the console.

The configuration of continuous speech translation is identical to that for single utterance speech translation.

Using the same asynchronous processing model as with the speech recognition SDK, we first set up event handlers to process each state of the translation events of the TranslationRecognizer class. The events are:

Recognizing

Recognized

Canceled

SessionStopped

The Recognizing event fires with interim results as words are being recognized. With the Recognized event, the following reason enumerations are available from the Result.Reason property of the handler's event arguments parameter e:

ResultReason.RecognizedSpeech

ResultReason.NoMatch

ResultReason.TranslatedSpeech

The important reason code is TranslatedSpeech, which provides the translated key-value pairs in the Result.Translations property.

To start the continuous speech translation process we run the asynchronous StartContinuousRecognitionAsync() method of the TranslationRecognizer SDK class:

await speechTranslationRecognizer.StartContinuousRecognitionAsync();

To end continuous speech translation, we run the StopContinuousRecognitionAsync() method of the TranslationRecognizer SDK class:

await speechTranslationRecognizer.StopContinuousRecognitionAsync();

The continuous translation method with self-contained result status and translation outputs is shown below:

static async void RunTranslateContinuousCommands(
    SpeechTranslationConfig speechConfig,
    string sourceLanguage, 
    List<string> destinationLanguages)
{
    TranslationRecognitionResult speechTranslationRecognitionResult;
    ConsoleKeyInfo consoleKeyInfo;
    bool isFinished = false;

    var languagesList = string.Join(",", destinationLanguages);

    // add languages to translate to …
    destinationLanguages.ForEach(l => speechConfig.AddTargetLanguage(l));

    using var audioConfig = AudioConfig.FromDefaultMicrophoneInput();
    using var speechTranslationRecognizer = new TranslationRecognizer(
        speechConfig, 
        audioConfig);

    var stopRecognition = new TaskCompletionSource<int>();

    speechTranslationRecognizer.Recognizing += (s, e) =>
    {
        Console.WriteLine($"RECOGNIZING: Text={e.Result.Text}");

        if (e.Result.Text.ToLower().StartsWith("stop"))
            // Make the following call at some point to stop recognition:
            speechTranslationRecognizer.StopContinuousRecognitionAsync();
    };

    speechTranslationRecognizer.Recognized += (s, e) =>
    {
        if (e.Result.Reason == ResultReason.RecognizedSpeech)
        {
            Console.WriteLine($"RECOGNIZED: Text={e.Result.Text}");
        }
        else if (e.Result.Reason == ResultReason.NoMatch)
        {
            Console.WriteLine($"NOMATCH: Speech could not be recognized.");
        }
        if (e.Result.Reason == ResultReason.TranslatedSpeech)
        {
            Console.WriteLine($"TRANSLATED: Text={e.Result.Text}");
            foreach (var element in e.Result.Translations)
            {
                Console.WriteLine($"    TRANSLATED into '{element.Key}': {element.Value}");
            }
        }
    };

    speechTranslationRecognizer.Canceled += (s, e) =>
    {
        Console.WriteLine($"CANCELED: Reason={e.Reason}");

        if (e.Reason == CancellationReason.Error)
        {
            Console.WriteLine($"CANCELED: ErrorCode={e.ErrorCode}");
            Console.WriteLine($"CANCELED: ErrorDetails={e.ErrorDetails}");
            Console.WriteLine($"CANCELED: Did you set the speech resource key and region values?");
        }

        stopRecognition.TrySetResult(1);
    };

    speechTranslationRecognizer.SessionStopped += (s, e) =>
    {
        Console.WriteLine("\n    Session stopped event.");
        stopRecognition.TrySetResult(0);
    };

    while (!isFinished)
    {
        Console.WriteLine("Speak into your microphone.");

        await speechTranslationRecognizer.StartContinuousRecognitionAsync();

        Console.WriteLine("Press Y to give another continuous sample. Escape to finish.");
        consoleKeyInfo = Console.ReadKey(true);
        if (consoleKeyInfo.Key == ConsoleKey.Escape)
        {
            Console.WriteLine("Escape Key Pressed.");
            isFinished = true;
            stopRecognition.TrySetResult(0);
        }
    }

    Console.WriteLine("Sampling Session Concluded by Key Stroke.");

    // Waits for completion. Use Task.WaitAny to keep the task rooted.
    Task.WaitAny(new[] { stopRecognition.Task });

    Console.WriteLine("Sampling Session Concluded by Voice Command.");
}

When the above is executed, a sample session is shown below with the sentence:

Welcome to the office

We can see that the entire sentence has been translated for each target language.

I included a condition in the Recognizing event to stop the continuous translation when the word Stop is uttered. You can see that it has even translated the command to stop translation as well!

We have seen how to translate speech in a source language, captured from microphone input, into multiple target languages.

The above process can be used to translate live speech for a multilingual audience that needs the output text transcript in another language. Audiences could be attending a conference, or visiting a popular tourist area where translated tour-guide commentary is displayed on a screen in multiple languages.

The Speech SDK can also translate speech from other input mediums, including pre-recorded sound files or text.
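
For example, to translate from a pre-recorded WAV file instead of the microphone, the audio configuration can be created from the file. Below is a minimal sketch (the file name is hypothetical):

// Use a pre-recorded WAV file as the audio source instead of the default microphone.
using var fileAudioConfig = AudioConfig.FromWavFileInput("speech-sample.wav");
using var fileTranslationRecognizer = new TranslationRecognizer(speechTranslationConfig, fileAudioConfig);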

We will explore the AI Services Speech SDK in more detail in future posts.

That is all for today’s post.

I hope that you have found this post useful and informative.
