Pronunciation Guide with SSML Phoneme Tags

This guide explains how to customize word pronunciation in your video voice-overs using SSML (Speech Synthesis Markup Language) phoneme tags. Phoneme tags give you precise control over how AI voices pronounce brand names, technical terms, foreign words, or any text that might be mispronounced.

What You’ll Learn

Phoneme Basics: Understand SSML phoneme tag syntax and usage
Provider Differences: Learn how ElevenLabs, AWS Polly, and Google handle pronunciation
CMU Arpabet: Use the CMU Arpabet phonetic alphabet for pronunciation
Practical Examples: Apply pronunciation control to real video content

Before You Begin

Make sure you have:

A Pictory API key (get one here)
Basic understanding of the Storyboard API
Familiarity with voice-over configuration

Understanding Voice-Over Types

Pictory supports three voice-over services, each with different SSML phoneme support:

Service	Category	Phoneme Support	Alphabet
ElevenLabs	Premium	English only	CMU Arpabet
AWS Polly	Standard	Full support	IPA, X-SAMPA
Google TTS	Standard	Full support	IPA

You can identify the voice-over service by the service field in the Get Voiceover Tracks API response. Values are elevenlabs, aws, or google.
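As a minimal sketch, the table above can be encoded as a lookup keyed by the service value. The helper name alphabetsForService and the Map layout are illustrative assumptions, not part of the Pictory API; only the service values and alphabet names come from this guide.

```javascript
// Phonetic alphabets per voice-over service, copied from the table above.
// The keys match the `service` values returned by Get Voiceover Tracks.
const PHONEME_ALPHABETS = new Map([
  ["elevenlabs", ["cmu-arpabet"]],
  ["aws", ["ipa", "x-sampa"]],
  ["google", ["ipa"]],
]);

// Hypothetical helper: returns the alphabets to use for a given service,
// or an empty array when the service value is not recognized.
function alphabetsForService(service) {
  return PHONEME_ALPHABETS.get(String(service).toLowerCase()) ?? [];
}

console.log(alphabetsForService("aws")); // [ 'ipa', 'x-sampa' ]
```

Returning an empty array for unknown services lets the caller decide whether to skip phoneme tags or fail early.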

Enabling SSML in Your Story

To use SSML tags including phoneme tags, you must set isSSMLStory: true in your scene configuration:

{
  "scenes": [
    {
      "story": "Welcome to <phoneme alphabet=\"cmu-arpabet\" ph=\"P IH1 K T AO0 R IY0\">Pictory</phoneme>.",
      "isSSMLStory": true,
      "createSceneOnEndOfSentence": true
    }
  ]
}

The isSSMLStory property is required when using any SSML tags. Without it, SSML tags will be read as plain text instead of being processed.
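One way to avoid forgetting the flag is to derive it from the story text. The sketch below uses a hypothetical helper, buildScene, and a deliberately simple regular expression to detect markup; the scene field names are taken from the example above, and the detection rule is an assumption rather than how Pictory itself decides.

```javascript
// Matches an opening or closing tag such as <phoneme ...> or </phoneme>.
// This is a simple heuristic, not a full SSML parser.
const SSML_TAG = /<\/?[a-zA-Z][^>]*>/;

// Hypothetical helper: builds a scene object and sets isSSMLStory
// whenever the story appears to contain markup.
function buildScene(story) {
  return {
    story,
    isSSMLStory: SSML_TAG.test(story),
    createSceneOnEndOfSentence: true,
  };
}

const scene = buildScene(
  'Welcome to <phoneme alphabet="cmu-arpabet" ph="P IH1 K T AO0 R IY0">Pictory</phoneme>.'
);
console.log(scene.isSSMLStory); // true
```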

ElevenLabs (Premium Voices)

ElevenLabs premium voices provide high-quality, natural-sounding speech with phoneme support for English language content.

Key Requirements

Model Configuration: When using phoneme tags with ElevenLabs, set modelId in premiumVoiceSettings to one of the supported models listed below. If modelId is omitted, eleven_flash_v2 is used by default for phoneme processing.
Limited Model Support: Only three ElevenLabs models support phoneme tags. Other models will ignore phoneme markup.
English Only: Phoneme pronunciation control works only with English language content in ElevenLabs.
CMU Arpabet: ElevenLabs uses the CMU Arpabet phonetic alphabet.

Models with Phoneme Support

Only the following three models support SSML phoneme tags in ElevenLabs. Using phoneme tags with other models will not produce the expected pronunciation changes.

Model ID	Description	Phoneme Support
`eleven_flash_v2`	Ultra-fast model (~75ms latency), English only (default for phonemes)	Yes
`eleven_turbo_v2`	High quality with low latency (~250-300ms), English only	Yes
`eleven_monolingual_v1`	First generation model, English only (deprecated)	Yes

Models Without Phoneme Support

The following models do not support phoneme tags:

Model ID	Description	Phoneme Support
`eleven_v3`	Latest and most advanced model, 70+ languages	No
`eleven_multilingual_v2`	Most lifelike model with rich emotional expression, 29 languages	No
`eleven_flash_v2_5`	Ultra-fast model (~75ms latency), 32 languages, 50% lower cost	No
`eleven_turbo_v2_5`	Balanced quality and speed (~250-300ms), 32 languages, 50% lower cost	No
`eleven_multilingual_v1`	First generation multilingual model, 8 languages (deprecated)	No
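Because unsupported models ignore phoneme markup rather than reporting a problem, it can help to check the model ID before sending a request. The following is a sketch only: resolvePhonemeModel is a hypothetical helper, and throwing an error is a design choice made here, not behavior of the Pictory API.

```javascript
// Model IDs from the "Models with Phoneme Support" table.
const PHONEME_MODELS = new Set([
  "eleven_flash_v2",
  "eleven_turbo_v2",
  "eleven_monolingual_v1",
]);

// The default this guide says is used when modelId is omitted.
const DEFAULT_PHONEME_MODEL = "eleven_flash_v2";

// Hypothetical helper: returns the model ID to send, or throws if the
// requested model would ignore phoneme tags.
function resolvePhonemeModel(modelId) {
  if (modelId === undefined || modelId === null) return DEFAULT_PHONEME_MODEL;
  if (!PHONEME_MODELS.has(modelId)) {
    throw new Error(`Model "${modelId}" does not support phoneme tags`);
  }
  return modelId;
}

const premiumVoiceSettings = { modelId: resolvePhonemeModel("eleven_turbo_v2") };
console.log(premiumVoiceSettings.modelId); // eleven_turbo_v2
```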

Complete Example

import axios from "axios";

const API_BASE_URL = "https://api.pictory.ai/pictoryapis";
const API_KEY = "YOUR_API_KEY";

async function createVideoWithPronunciation() {
  const storyWithPhonemes = `
    Welcome to <phoneme alphabet="cmu-arpabet" ph="P IH1 K T AO0 R IY0">Pictory</phoneme>, the leading <phoneme alphabet="cmu-arpabet" ph="EY1 AY1">AI</phoneme> video creation platform.
    Our <phoneme alphabet="cmu-arpabet" ph="IH2 N T UW1 IH0 T IH0 V">intuitive</phoneme> interface transforms your text into professional videos using advanced <phoneme alphabet="cmu-arpabet" ph="AE1 L G AH0 R IH2 DH AH0 M Z">algorithms</phoneme>.
    Whether you are an <phoneme alphabet="cmu-arpabet" ph="AA2 N T R AH0 P R AH0 N ER1">entrepreneur</phoneme> or content creator, <phoneme alphabet="cmu-arpabet" ph="P IH1 K T AO0 R IY0">Pictory</phoneme> makes video production effortless.
    Watch our <phoneme alphabet="cmu-arpabet" ph="T UW0 T AO1 R IY0 AH0 L">tutorial</phoneme> to learn how <phoneme alphabet="cmu-arpabet" ph="EY1 AY1">AI</phoneme> <phoneme alphabet="cmu-arpabet" ph="V IH2 ZH UW0 AH0 L AH0 Z EY1 SH AH0 N">visualization</phoneme> can revolutionize your workflow.
  `;

  const response = await axios.post(
    `${API_BASE_URL}/v2/video/storyboard/render`,
    {
      videoName: "pictory_pronunciation_demo",

      // Voice-over with ElevenLabs premium voice
      voiceOver: {
        enabled: true,
        aiVoices: [
          {
            speaker: "Brian",
            speed: 100,
            premiumVoiceSettings: {
              modelId: "eleven_flash_v2"  // Must be one of the three phoneme-capable models
            }
          }
        ]
      },

      backgroundMusic: {
        enabled: true,
        volume: 0.1,
        autoMusic: true
      },

      scenes: [
        {
          story: storyWithPhonemes,
          isSSMLStory: true,
          createSceneOnEndOfSentence: true
        }
      ]
    },
    {
      headers: {
        "Content-Type": "application/json",
        Authorization: API_KEY
      }
    }
  );

  console.log("Job ID:", response.data.data.jobId);
  return response.data;
}

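// Log failures instead of leaving the promise rejection unhandled
createVideoWithPronunciation().catch((error) => {
  console.error("Render request failed:", error.response?.data ?? error.message);
});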

ElevenLabs External Documentation

For detailed information about ElevenLabs pronunciation features:

ElevenLabs Pronunciation Best Practices

AWS Polly (Standard Voices)

AWS Polly voices provide reliable SSML support with multiple phonetic alphabets for precise pronunciation control.

Key Features

Multiple Alphabets: AWS Polly supports IPA (International Phonetic Alphabet) and X-SAMPA phonetic systems.
SSML Categories: AWS Polly voices have different SSML support levels (Category A or B). Check the ssmlSupportCategory field in the Get Voiceover Tracks API response.
Neural and Standard Engines: Different voices use different engines with varying SSML capabilities.

Phoneme Tag Syntax

<phoneme alphabet="ipa" ph="ˈpɪk.tɔː.ri">Pictory</phoneme>

Or using X-SAMPA:

<phoneme alphabet="x-sampa" ph="&quot;pIk.tO:.ri">Pictory</phoneme>
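X-SAMPA marks primary stress with a double quote, which collides with the quote that delimits the ph attribute; that is why the example above writes it as &quot;. A small helper can apply this escaping consistently. The names escapeXml and phonemeTag below are hypothetical and shown only as a sketch.

```javascript
// Escapes characters that cannot appear verbatim in XML text or in a
// double-quoted XML attribute value.
function escapeXml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Hypothetical helper: wraps `text` in a phoneme tag with escaped attributes.
function phonemeTag(alphabet, ph, text) {
  return `<phoneme alphabet="${escapeXml(alphabet)}" ph="${escapeXml(ph)}">${escapeXml(text)}</phoneme>`;
}

console.log(phonemeTag("x-sampa", '"pIk.tO:.ri', "Pictory"));
// <phoneme alphabet="x-sampa" ph="&quot;pIk.tO:.ri">Pictory</phoneme>
```

The ampersand is replaced first so that the entities produced by the later replacements are not escaped a second time.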

Complete Example

import axios from "axios";

const API_BASE_URL = "https://api.pictory.ai/pictoryapis";
const API_KEY = "YOUR_API_KEY";

async function createVideoWithAWSPolly() {
  const response = await axios.post(
    `${API_BASE_URL}/v2/video/storyboard/render`,
    {
      videoName: "aws_polly_pronunciation_demo",

      // Voice-over with AWS Polly voice
      voiceOver: {
        enabled: true,
        aiVoices: [
          {
            speaker: "Joanna",  // AWS Polly neural voice
            speed: 100
          }
        ]
      },

      backgroundMusic: {
        enabled: true,
        volume: 0.1,
        autoMusic: true
      },

      scenes: [
        {
          story: `Welcome to <phoneme alphabet="ipa" ph="ˈpɪk.tɔː.ri">Pictory</phoneme>.
                  Transform your text into engaging videos with AI.`,
          isSSMLStory: true,
          createSceneOnEndOfSentence: true
        }
      ]
    },
    {
      headers: {
        "Content-Type": "application/json",
        Authorization: API_KEY
      }
    }
  );

  console.log("Job ID:", response.data.data.jobId);
  return response.data;
}

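// Log failures instead of leaving the promise rejection unhandled
createVideoWithAWSPolly().catch((error) => {
  console.error("Render request failed:", error.response?.data ?? error.message);
});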

AWS Polly External Documentation

For detailed information about AWS Polly phoneme tags:

AWS Polly Phoneme Tag Documentation

Google Text-to-Speech (Standard Voices)

Google TTS voices offer high-quality neural speech synthesis with comprehensive IPA phoneme support.

Key Features

IPA Support: Google TTS uses the International Phonetic Alphabet (IPA) for phoneme specification.
WaveNet and Neural2 Voices: Google offers advanced neural voice engines with natural-sounding output.
Multi-language: Phoneme support across multiple languages with language-specific IPA symbols.

Phoneme Tag Syntax

<phoneme alphabet="ipa" ph="ˈpɪktəri">Pictory</phoneme>

Complete Example

import axios from "axios";

const API_BASE_URL = "https://api.pictory.ai/pictoryapis";
const API_KEY = "YOUR_API_KEY";

async function createVideoWithGoogleTTS() {
  const response = await axios.post(
    `${API_BASE_URL}/v2/video/storyboard/render`,
    {
      videoName: "google_tts_pronunciation_demo",

      // Voice-over with Google TTS voice
      voiceOver: {
        enabled: true,
        aiVoices: [
          {
            speaker: "Steffi",  // Google WaveNet voice
            speed: 100
          }
        ]
      },

      backgroundMusic: {
        enabled: true,
        volume: 0.1,
        autoMusic: true
      },

      scenes: [
        {
          story: `Welcome to <phoneme alphabet="ipa" ph="ˈpɪktəri">Pictory</phoneme>.
                  Create professional videos using artificial intelligence.`,
          isSSMLStory: true,
          createSceneOnEndOfSentence: true
        }
      ]
    },
    {
      headers: {
        "Content-Type": "application/json",
        Authorization: API_KEY
      }
    }
  );

  console.log("Job ID:", response.data.data.jobId);
  return response.data;
}

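// Log failures instead of leaving the promise rejection unhandled
createVideoWithGoogleTTS().catch((error) => {
  console.error("Render request failed:", error.response?.data ?? error.message);
});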

Google TTS External Documentation

For detailed information about Google TTS phoneme support:

Google Cloud Text-to-Speech Phonemes

CMU Arpabet Reference

The CMU Arpabet is a phonetic alphabet commonly used with ElevenLabs. Here’s a quick reference:

Vowels

Arpabet	IPA	Example Word
AA	ɑ	father
AE	æ	cat
AH	ʌ	cut
AO	ɔ	caught
EH	ɛ	bed
ER	ɝ	bird
IH	ɪ	bit
IY	i	beat
UH	ʊ	book
UW	u	boot

Stress Markers

Marker	Meaning
0	No stress
1	Primary stress
2	Secondary stress
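Misplaced or missing stress markers are easy to overlook, so a quick check before rendering can help. The sketch below assumes the conventions of the CMU Pronouncing Dictionary (15 vowels that each carry a stress digit, 24 consonants that carry none); this guide does not confirm that ElevenLabs enforces exactly these rules, and checkArpabet is a hypothetical helper.

```javascript
// The 15 vowels and 24 consonants of the CMU Pronouncing Dictionary phoneme set.
const VOWELS = new Set([
  "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
  "EY", "IH", "IY", "OW", "OY", "UH", "UW",
]);
const CONSONANTS = new Set([
  "B", "CH", "D", "DH", "F", "G", "HH", "JH", "K", "L", "M", "N",
  "NG", "P", "R", "S", "SH", "T", "TH", "V", "W", "Y", "Z", "ZH",
]);

// Returns a list of problems found in an Arpabet string; empty means it passed.
function checkArpabet(ph) {
  const problems = [];
  for (const token of ph.trim().split(/\s+/)) {
    const match = /^([A-Z]+)([012])?$/.exec(token);
    if (!match) {
      problems.push(`Unrecognized token: ${token}`);
      continue;
    }
    const [, symbol, stress] = match;
    if (VOWELS.has(symbol)) {
      if (stress === undefined) problems.push(`Vowel ${symbol} is missing a stress marker`);
    } else if (CONSONANTS.has(symbol)) {
      if (stress !== undefined) problems.push(`Consonant ${symbol} must not carry a stress marker`);
    } else {
      problems.push(`Unknown symbol: ${symbol}`);
    }
  }
  return problems;
}

console.log(checkArpabet("P IH1 K T AO0 R IY0")); // []
```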

Example Breakdown

For “Pictory” pronounced as P IH1 K T AO0 R IY0:

Symbol	Sound
P	/p/ as in pat
IH1	/ɪ/ as in bit (primary stress)
K	/k/ as in kit
T	/t/ as in top
AO0	/ɔ/ as in caught (no stress)
R	/r/ as in run
IY0	/i/ as in beat (no stress)

Common Pronunciation Examples

Here are phoneme representations for commonly mispronounced words:

Word	CMU Arpabet	IPA
Pictory	`P IH1 K T AO0 R IY0`	`ˈpɪktɔːri`
AI	`EY1 AY1`	`ˌeɪˈaɪ`
Video	`V IH1 D IY0 OW0`	`ˈvɪdioʊ`
Tutorial	`T UW0 T AO1 R IY0 AH0 L`	`tuːˈtɔːriəl`

Best Practices

Test Pronunciation Before Rendering

Always test your phoneme tags with a short video before creating longer content. Different voices may interpret phonemes slightly differently.

Use Consistent Phonetic Alphabet

Stick to one phonetic alphabet per voice provider:

ElevenLabs: CMU Arpabet
AWS Polly: IPA or X-SAMPA
Google TTS: IPA
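A story that mixes alphabets is hard to spot by eye. As a rough sketch, the hypothetical helpers below collect the alphabet attribute of every phoneme tag with a regular expression and compare it against the alphabets allowed for the chosen provider; they assume double-quoted attributes, as used throughout this guide.

```javascript
// Collects the distinct alphabet values from every <phoneme> tag in a story.
function alphabetsUsed(story) {
  const found = new Set();
  for (const match of story.matchAll(/<phoneme\b[^>]*\balphabet="([^"]*)"/g)) {
    found.add(match[1].toLowerCase());
  }
  return found;
}

// Hypothetical check: true when every tag uses one of the allowed alphabets.
function usesOnly(story, allowed) {
  return [...alphabetsUsed(story)].every((alphabet) => allowed.includes(alphabet));
}

const mixed =
  '<phoneme alphabet="ipa" ph="ˈpɪktɔːri">Pictory</phoneme> uses ' +
  '<phoneme alphabet="cmu-arpabet" ph="EY1 AY1">AI</phoneme>.';

console.log(usesOnly(mixed, ["cmu-arpabet"])); // false
```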

Keep Phoneme Tags Simple

Only use phoneme tags for words that are genuinely mispronounced. Overusing them can make content harder to maintain.

Document Your Phonemes

Keep a reference document of phoneme tags used for your brand names and technical terms for consistency across videos.
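Such a reference can also live in code. The sketch below keeps a small lexicon of the CMU Arpabet values from this guide and wraps matching words in phoneme tags; applyLexicon and the lexicon layout are hypothetical, and the helper assumes plain input text and phoneme strings without characters that need XML escaping.

```javascript
// Example lexicon using CMU Arpabet values from this guide.
const LEXICON = {
  Pictory: "P IH1 K T AO0 R IY0",
  AI: "EY1 AY1",
};

// Hypothetical helper: wraps each whole-word, case-sensitive lexicon match
// in a phoneme tag. Running it on text that already contains tags would nest them.
function applyLexicon(text, lexicon, alphabet = "cmu-arpabet") {
  // Escape regex metacharacters so lexicon keys are matched literally.
  const words = Object.keys(lexicon).map((w) => w.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  if (words.length === 0) return text;
  const pattern = new RegExp(`\\b(${words.join("|")})\\b`, "g");
  return text.replace(
    pattern,
    (word) => `<phoneme alphabet="${alphabet}" ph="${lexicon[word]}">${word}</phoneme>`
  );
}

console.log(applyLexicon("Pictory uses AI.", LEXICON));
```

Remember to set isSSMLStory: true on any scene whose story has been processed this way.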

Troubleshooting

Phoneme tags are read as text

Problem: The phoneme tags appear as literal text in the voice-over.
Solution: Ensure isSSMLStory: true is set in your scene configuration. This flag enables SSML processing.

Pronunciation sounds incorrect with ElevenLabs

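Problem: The word is still mispronounced even with phoneme tags.
Solution:

Verify you are using CMU Arpabet (not IPA) with ElevenLabs
Check that premiumVoiceSettings.modelId is one of the three phoneme-capable models (eleven_flash_v2, eleven_turbo_v2, or eleven_monolingual_v1)
Confirm the content is in English
Ensure stress markers (0, 1, 2) are correctly placed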

SSML not working with certain voices

Problem: Some voices do not process SSML tags correctly.
Solution: Check the ssmlSupportCategory field from the Get Voiceover Tracks API. Some voices have limited SSML support.

Special characters causing errors

Problem: Request fails when using special characters in phoneme strings.
Solution: Ensure proper escaping of special characters. In JSON, use \" for quotes within the phoneme attribute.
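Hand-escaping is only needed when the JSON is written by hand. As a short illustration, building the request body as a JavaScript object and serializing it with JSON.stringify produces the escaped quotes automatically; axios serializes object bodies the same way, as in the examples in this guide.

```javascript
// A story whose phoneme tag contains double quotes in its attributes.
const story =
  'Welcome to <phoneme alphabet="ipa" ph="ˈpɪk.tɔː.ri">Pictory</phoneme>.';

// JSON.stringify inserts the backslash before each inner quote.
const body = JSON.stringify({
  scenes: [{ story, isSSMLStory: true, createSceneOnEndOfSentence: true }],
});

console.log(body.includes('alphabet=\\"ipa\\"')); // true
console.log(JSON.parse(body).scenes[0].story === story); // true
```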

Next Steps

AI Voice-Over Guide: Learn the basics of adding voice-over to videos
Multi-Level Voice-Over: Use different voices for different scenes
Get Voiceover Tracks: Discover all available AI voices
Render Storyboard Video: Complete API reference for video rendering

External Resources

ElevenLabs Docs: ElevenLabs pronunciation guide
AWS Polly Docs: AWS Polly phoneme reference
Google TTS Docs: Google TTS phoneme guide
