This guide explains how to customize word pronunciation in your video voice-overs using SSML (Speech Synthesis Markup Language) phoneme tags. Phoneme tags give you precise control over how AI voices pronounce brand names, technical terms, foreign words, or any text that might be mispronounced.

What You’ll Learn

  • Phoneme Basics: Understand SSML phoneme tag syntax and usage
  • Provider Differences: Learn how ElevenLabs, AWS Polly, and Google handle pronunciation
  • CMU Arpabet: Use the CMU Arpabet phonetic alphabet for pronunciation
  • Practical Examples: Apply pronunciation control to real video content

Before You Begin

Make sure you have:

Understanding Voice-Over Types

Pictory supports three voice-over services, each with different SSML phoneme support:
| Service | Category | Phoneme Support | Alphabet |
| --- | --- | --- | --- |
| ElevenLabs | Premium | English only | CMU Arpabet |
| AWS Polly | Standard | Full support | IPA, X-SAMPA |
| Google TTS | Standard | Full support | IPA |
You can identify the voice-over service by the service field in the Get Voiceover Tracks API response. Values are elevenlabs, aws, or google.
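
As a rough sketch, the service value can drive which phonetic alphabet you emit. The mapping below follows the table above. The helper name alphabetForService and the shape of the track object (anything beyond its service field) are assumptions made for illustration, not part of the Pictory API.

```javascript
// Default phoneme alphabet per voice-over service, taken from the table above.
const ALPHABET_BY_SERVICE = new Map([
  ["elevenlabs", "cmu-arpabet"],
  ["aws", "ipa"], // AWS Polly also accepts "x-sampa"
  ["google", "ipa"],
]);

// Hypothetical helper: choose the alphabet for a given service value.
function alphabetForService(service) {
  const alphabet = ALPHABET_BY_SERVICE.get(service);
  if (alphabet === undefined) {
    throw new Error(`Unknown voice-over service: ${service}`);
  }
  return alphabet;
}

// Example track object; only the `service` field is documented in this guide.
const track = { name: "Brian", service: "elevenlabs" };
console.log(alphabetForService(track.service)); // cmu-arpabet
```

Throwing on an unknown service, rather than falling back to IPA, makes a new or misspelled service value visible before any video is rendered.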

Enabling SSML in Your Story

To use SSML tags including phoneme tags, you must set isSSMLStory: true in your scene configuration:
{
  "scenes": [
    {
      "story": "Welcome to <phoneme alphabet=\"cmu-arpabet\" ph=\"P IH1 K T AO0 R IY0\">Pictory</phoneme>.",
      "isSSMLStory": true,
      "createSceneOnEndOfSentence": true
    }
  ]
}
The isSSMLStory property is required when using any SSML tags. Without it, SSML tags will be read as plain text instead of being processed.
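
One way to avoid forgetting the flag is to derive it from the story text. The sketch below uses a hypothetical helper, toScene (not part of the Pictory API), that sets isSSMLStory whenever the story contains something that looks like an SSML tag. The detection is a simple heuristic, not a full SSML parser.

```javascript
// Heuristic: matches an opening, closing, or self-closing XML-style tag,
// for example <phoneme ...>, </phoneme>, or <break/>.
const SSML_TAG = /<\/?[A-Za-z][^<>]*>/;

// Hypothetical helper: build a scene object with isSSMLStory set automatically.
function toScene(story) {
  return {
    story,
    isSSMLStory: SSML_TAG.test(story),
    createSceneOnEndOfSentence: true,
  };
}

const scene = toScene(
  'Welcome to <phoneme alphabet="cmu-arpabet" ph="P IH1 K T AO0 R IY0">Pictory</phoneme>.'
);
console.log(scene.isSSMLStory); // true
```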

ElevenLabs (Premium Voices)

ElevenLabs premium voices provide high-quality, natural-sounding speech with phoneme support for English language content.

Key Requirements

  1. Model Configuration: When using phoneme tags with ElevenLabs, set modelId in premiumVoiceSettings to a model that supports phonemes. If you omit modelId, eleven_flash_v2 is used by default for phoneme processing.
  2. Limited Model Support: Only three ElevenLabs models support phoneme tags. Other models will ignore phoneme markup.
  3. English Only: Phoneme pronunciation control works only with English language content in ElevenLabs.
  4. CMU Arpabet: ElevenLabs uses the CMU Arpabet phonetic alphabet.

Models with Phoneme Support

Only the following three models support SSML phoneme tags in ElevenLabs. Using phoneme tags with other models will not produce the expected pronunciation changes.
| Model ID | Description | Phoneme Support |
| --- | --- | --- |
| eleven_flash_v2 | Fast, efficient model (default for phonemes) | Yes |
| eleven_turbo_v2 | Optimized for speed | Yes |
| eleven_monolingual_v1 | English-optimized | Yes |

Models Without Phoneme Support

The following models do not support phoneme tags:
| Model ID | Description | Phoneme Support |
| --- | --- | --- |
| eleven_flash_v2_5 | Enhanced flash model | No |
| eleven_turbo_v2_5 | Enhanced turbo model | No |
| eleven_multilingual_v2 | Multi-language support | No |
| eleven_multilingual_v1 | Legacy multi-language | No |
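
Because unsupported models ignore phoneme markup rather than failing, it can help to check the model before sending a request. This is a minimal sketch based only on the two lists above; resolvePhonemeModel is a hypothetical helper name.

```javascript
// The three ElevenLabs models this guide lists as supporting phoneme tags.
const PHONEME_MODELS = new Set([
  "eleven_flash_v2",
  "eleven_turbo_v2",
  "eleven_monolingual_v1",
]);

// Hypothetical helper: resolve the modelId to send, failing early on unsupported models.
function resolvePhonemeModel(modelId) {
  // This guide states that eleven_flash_v2 is the default when modelId is omitted.
  if (modelId === undefined) return "eleven_flash_v2";
  if (!PHONEME_MODELS.has(modelId)) {
    throw new Error(`Model ${modelId} does not support phoneme tags`);
  }
  return modelId;
}

console.log(resolvePhonemeModel("eleven_turbo_v2")); // eleven_turbo_v2
```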

Complete Example

import axios from "axios";

const API_BASE_URL = "https://api.pictory.ai/pictoryapis";
const API_KEY = "YOUR_API_KEY";

async function createVideoWithPronunciation() {
  const storyWithPhonemes = `
    Welcome to <phoneme alphabet="cmu-arpabet" ph="P IH1 K T AO0 R IY0">Pictory</phoneme>, the leading <phoneme alphabet="cmu-arpabet" ph="EY1 AY1">AI</phoneme> video creation platform.
    Our <phoneme alphabet="cmu-arpabet" ph="IH2 N T UW1 IH0 T IH0 V">intuitive</phoneme> interface transforms your text into professional videos using advanced <phoneme alphabet="cmu-arpabet" ph="AE1 L G AH0 R IH2 DH AH0 M Z">algorithms</phoneme>.
    Whether you are an <phoneme alphabet="cmu-arpabet" ph="AA2 N T R AH0 P R AH0 N ER1">entrepreneur</phoneme> or content creator, <phoneme alphabet="cmu-arpabet" ph="P IH1 K T AO0 R IY0">Pictory</phoneme> makes video production effortless.
    Watch our <phoneme alphabet="cmu-arpabet" ph="T UW0 T AO1 R IY0 AH0 L">tutorial</phoneme> to learn how <phoneme alphabet="cmu-arpabet" ph="EY1 AY1">AI</phoneme> <phoneme alphabet="cmu-arpabet" ph="V IH2 ZH UW0 AH0 L AH0 Z EY1 SH AH0 N">visualization</phoneme> can revolutionize your workflow.
  `;

  const response = await axios.post(
    `${API_BASE_URL}/v2/video/storyboard/render`,
    {
      videoName: "pictory_pronunciation_demo",

      // Voice-over with ElevenLabs premium voice
      voiceOver: {
        enabled: true,
        aiVoices: [
          {
            speaker: "Brian",
            speed: 100,
            premiumVoiceSettings: {
              modelId: "eleven_flash_v2"  // Must be one of the three models that support phoneme tags
            }
          }
        ]
      },

      backgroundMusic: {
        enabled: true,
        volume: 0.1,
        autoMusic: true
      },

      scenes: [
        {
          story: storyWithPhonemes,
          isSSMLStory: true,
          createSceneOnEndOfSentence: true
        }
      ]
    },
    {
      headers: {
        "Content-Type": "application/json",
        Authorization: API_KEY
      }
    }
  );

  console.log("Job ID:", response.data.data.jobId);
  return response.data;
}

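// Log request failures instead of leaving the promise rejection unhandled
createVideoWithPronunciation().catch((error) => {
  console.error("Request failed:", error.message);
});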

ElevenLabs External Documentation

For detailed information about ElevenLabs pronunciation features:

AWS Polly (Standard Voices)

AWS Polly voices provide reliable SSML support with multiple phonetic alphabets for precise pronunciation control.

Key Features

  1. Multiple Alphabets: AWS Polly supports IPA (International Phonetic Alphabet) and X-SAMPA phonetic systems.
  2. SSML Categories: AWS Polly voices have different SSML support levels (Category A or B). Check the ssmlSupportCategory field in the Get Voiceover Tracks API response.
  3. Neural and Standard Engines: Different voices use different engines with varying SSML capabilities.

Phoneme Tag Syntax

<phoneme alphabet="ipa" ph="ˈpɪk.tɔː.ri">Pictory</phoneme>
Or using X-SAMPA:
<phoneme alphabet="x-sampa" ph="&quot;pIk.tO:.ri">Pictory</phoneme>
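
The &quot; in the X-SAMPA example is the XML escape for the double quote that X-SAMPA uses as its primary stress mark. A small builder can apply that escaping for you. This is a minimal sketch; escapeXml and phonemeTag are hypothetical helper names, not part of any provider SDK.

```javascript
// Escape the characters that are special inside XML text and attribute values.
function escapeXml(value) {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build a <phoneme> tag with an escaped pronunciation and display text.
function phonemeTag(alphabet, ph, text) {
  return `<phoneme alphabet="${escapeXml(alphabet)}" ph="${escapeXml(ph)}">${escapeXml(text)}</phoneme>`;
}

console.log(phonemeTag("x-sampa", '"pIk.tO:.ri', "Pictory"));
// <phoneme alphabet="x-sampa" ph="&quot;pIk.tO:.ri">Pictory</phoneme>
```

The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.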

Complete Example

import axios from "axios";

const API_BASE_URL = "https://api.pictory.ai/pictoryapis";
const API_KEY = "YOUR_API_KEY";

async function createVideoWithAWSPolly() {
  const response = await axios.post(
    `${API_BASE_URL}/v2/video/storyboard/render`,
    {
      videoName: "aws_polly_pronunciation_demo",

      // Voice-over with AWS Polly voice
      voiceOver: {
        enabled: true,
        aiVoices: [
          {
            speaker: "Joanna",  // AWS Polly neural voice
            speed: 100
          }
        ]
      },

      backgroundMusic: {
        enabled: true,
        volume: 0.1,
        autoMusic: true
      },

      scenes: [
        {
          story: `Welcome to <phoneme alphabet="ipa" ph="ˈpɪk.tɔː.ri">Pictory</phoneme>.
                  Transform your text into engaging videos with AI.`,
          isSSMLStory: true,
          createSceneOnEndOfSentence: true
        }
      ]
    },
    {
      headers: {
        "Content-Type": "application/json",
        Authorization: API_KEY
      }
    }
  );

  console.log("Job ID:", response.data.data.jobId);
  return response.data;
}

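// Log request failures instead of leaving the promise rejection unhandled
createVideoWithAWSPolly().catch((error) => {
  console.error("Request failed:", error.message);
});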

AWS Polly External Documentation

For detailed information about AWS Polly phoneme tags:

Google Text-to-Speech (Standard Voices)

Google TTS voices offer high-quality neural speech synthesis with comprehensive IPA phoneme support.

Key Features

  1. IPA Support: Google TTS uses the International Phonetic Alphabet (IPA) for phoneme specification.
  2. WaveNet and Neural2 Voices: Google offers advanced neural voice engines with natural-sounding output.
  3. Multi-language: Phoneme support across multiple languages with language-specific IPA symbols.

Phoneme Tag Syntax

<phoneme alphabet="ipa" ph="ˈpɪktəri">Pictory</phoneme>

Complete Example

import axios from "axios";

const API_BASE_URL = "https://api.pictory.ai/pictoryapis";
const API_KEY = "YOUR_API_KEY";

async function createVideoWithGoogleTTS() {
  const response = await axios.post(
    `${API_BASE_URL}/v2/video/storyboard/render`,
    {
      videoName: "google_tts_pronunciation_demo",

      // Voice-over with Google TTS voice
      voiceOver: {
        enabled: true,
        aiVoices: [
          {
            speaker: "Steffi",  // Google WaveNet voice
            speed: 100
          }
        ]
      },

      backgroundMusic: {
        enabled: true,
        volume: 0.1,
        autoMusic: true
      },

      scenes: [
        {
          story: `Welcome to <phoneme alphabet="ipa" ph="ˈpɪktəri">Pictory</phoneme>.
                  Create professional videos using artificial intelligence.`,
          isSSMLStory: true,
          createSceneOnEndOfSentence: true
        }
      ]
    },
    {
      headers: {
        "Content-Type": "application/json",
        Authorization: API_KEY
      }
    }
  );

  console.log("Job ID:", response.data.data.jobId);
  return response.data;
}

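// Log request failures instead of leaving the promise rejection unhandled
createVideoWithGoogleTTS().catch((error) => {
  console.error("Request failed:", error.message);
});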

Google TTS External Documentation

For detailed information about Google TTS phoneme support:

CMU Arpabet Reference

CMU Arpabet is the phonetic alphabet that ElevenLabs phoneme tags use. Here’s a quick reference for common vowel symbols and the stress markers:

Vowels

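| Arpabet | IPA | Example Word |
| --- | --- | --- |
| AA | ɑ | father |
| AE | æ | cat |
| AH | ʌ | cut |
| AO | ɔ | caught |
| EH | ɛ | bed |
| ER | ɝ | bird |
| IH | ɪ | bit |
| IY | i | beat |
| UH | ʊ | book |
| UW | u | boot |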

Stress Markers

| Marker | Meaning |
| --- | --- |
| 0 | No stress |
| 1 | Primary stress |
| 2 | Secondary stress |

Example Breakdown

For “Pictory” pronounced as P IH1 K T AO0 R IY0:
| Symbol | Sound |
| --- | --- |
| P | /p/ as in pat |
| IH1 | /ɪ/ as in bit (primary stress) |
| K | /k/ as in kit |
| T | /t/ as in top |
| AO0 | /ɔ/ as in caught (no stress) |
| R | /r/ as in run |
| IY0 | /i/ as in beat (no stress) |
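
A breakdown like this can be checked mechanically before rendering. The sketch below validates a space-separated Arpabet string against the phone set of the CMU Pronouncing Dictionary. It assumes every vowel carries a stress digit and no consonant does, as in that dictionary; checkArpabet is a hypothetical helper, and it does not tell you whether the pronunciation itself is right.

```javascript
// Arpabet phones as used in the CMU Pronouncing Dictionary.
const VOWELS = new Set([
  "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
  "EY", "IH", "IY", "OW", "OY", "UH", "UW",
]);
const CONSONANTS = new Set([
  "B", "CH", "D", "DH", "F", "G", "HH", "JH", "K", "L", "M", "N",
  "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Y", "Z", "ZH",
]);

// Returns a list of problems; an empty list means the string looks well formed.
function checkArpabet(ph) {
  const problems = [];
  for (const token of ph.trim().split(/\s+/)) {
    // A token is an uppercase phone optionally followed by a stress digit 0-2.
    const match = /^([A-Z]+)([0-2])?$/.exec(token);
    if (!match) {
      problems.push(`"${token}" is not a valid token`);
      continue;
    }
    const [, phone, stress] = match;
    if (VOWELS.has(phone)) {
      if (stress === undefined) problems.push(`vowel "${phone}" needs a stress marker`);
    } else if (CONSONANTS.has(phone)) {
      if (stress !== undefined) problems.push(`consonant "${phone}" must not carry a stress marker`);
    } else {
      problems.push(`"${phone}" is not an Arpabet symbol`);
    }
  }
  return problems;
}

console.log(checkArpabet("P IH1 K T AO0 R IY0")); // []
console.log(checkArpabet("P IH K"));
// [ 'vowel "IH" needs a stress marker' ]
```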

Common Pronunciation Examples

Here are phoneme representations for words commonly mispronounced:
| Word | CMU Arpabet | IPA |
| --- | --- | --- |
| Pictory | P IH1 K T AO0 R IY0 | ˈpɪktɔːri |
| AI | EY1 AY1 | ˌeɪˈaɪ |
| Video | V IH1 D IY0 OW0 | ˈvɪdioʊ |
| Tutorial | T UW0 T AO1 R IY0 AH0 L | tuːˈtɔːriəl |

Best Practices

Always test your phoneme tags with a short video before creating longer content. Different voices may interpret phonemes slightly differently.
Stick to one phonetic alphabet per voice provider:
  • ElevenLabs: CMU Arpabet
  • AWS Polly: IPA or X-SAMPA
  • Google TTS: IPA
Only use phoneme tags for words that are genuinely mispronounced. Overusing them can make content harder to maintain.
Keep a reference document of phoneme tags used for your brand names and technical terms for consistency across videos.
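
The reference document can also live in code, so every script is tagged the same way. The following is a minimal sketch, assuming plain-text input that contains no SSML yet and pronunciation strings with no XML special characters (true of CMU Arpabet). LEXICON and applyLexicon are hypothetical names; the two entries come from the Common Pronunciation Examples table.

```javascript
// Hypothetical brand lexicon: word -> CMU Arpabet pronunciation.
const LEXICON = {
  Pictory: "P IH1 K T AO0 R IY0",
  AI: "EY1 AY1",
};

// Escape regex metacharacters so lexicon words are matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every whole-word, case-sensitive occurrence of a lexicon entry in a phoneme tag.
function applyLexicon(text, lexicon, alphabet) {
  const words = Object.keys(lexicon);
  if (words.length === 0) return text;
  const pattern = new RegExp(`\\b(${words.map(escapeRegExp).join("|")})\\b`, "g");
  return text.replace(
    pattern,
    (word) => `<phoneme alphabet="${alphabet}" ph="${lexicon[word]}">${word}</phoneme>`
  );
}

console.log(applyLexicon("Pictory uses AI.", LEXICON, "cmu-arpabet"));
// <phoneme alphabet="cmu-arpabet" ph="P IH1 K T AO0 R IY0">Pictory</phoneme> uses <phoneme alphabet="cmu-arpabet" ph="EY1 AY1">AI</phoneme>.
```

The replacement runs in a single pass, so text that has just been wrapped is not matched again.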

Troubleshooting

Problem: The phoneme tags appear as literal text in the voice-over.
Solution: Ensure isSSMLStory: true is set in your scene configuration. This flag enables SSML processing.

Problem: The word is still mispronounced even with phoneme tags.
Solution:
  • Verify you’re using CMU Arpabet (not IPA) with ElevenLabs
  • Check that premiumVoiceSettings.modelId, if set, is one of the three models that support phoneme tags (eleven_flash_v2, eleven_turbo_v2, or eleven_monolingual_v1)
  • Ensure stress markers (0, 1, 2) are correctly placed

Problem: Some voices don’t process SSML tags correctly.
Solution: Check the ssmlSupportCategory field in the Get Voiceover Tracks API response. Some voices have limited SSML support.

Problem: Request fails when using special characters in phoneme strings.
Solution: Ensure special characters are escaped properly. In raw JSON, write \" for every double quote inside the story string, including the quotes around phoneme attribute values.
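
If you build the request body in code rather than writing raw JSON, serializing the object with JSON.stringify produces the \" escapes for you. A brief illustration, with the payload trimmed to the scenes field:

```javascript
const story =
  'Welcome to <phoneme alphabet="cmu-arpabet" ph="P IH1 K T AO0 R IY0">Pictory</phoneme>.';

// JSON.stringify escapes the double quotes inside the SSML attributes.
const body = JSON.stringify({ scenes: [{ story, isSSMLStory: true }] });

console.log(body);
// {"scenes":[{"story":"Welcome to <phoneme alphabet=\"cmu-arpabet\" ph=\"P IH1 K T AO0 R IY0\">Pictory</phoneme>.","isSSMLStory":true}]}

// Parsing the body returns the original story unchanged.
console.log(JSON.parse(body).scenes[0].story === story); // true
```

Hand-escaping is therefore only needed when the JSON is typed directly, for example in a curl command or a saved request file.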
