Ideas on using A.I. to assist with subtitling

Earlier this year I made my first contribution to WordPress and joined the subtitling team for Contributor Day. It was a great way to get involved and I enjoyed working with a friendly and focussed team, aiming to submit our captions for review by the end of the day.

Bristol 2019 Contributor Day.

As someone with fairly good typing skills, I thought it would be easy to subtitle a 12-minute video, and that I could do maybe two or three videos in the day. I was surprised that this one video took the entire day. Other people had problems too.

Current challenges with the subtitling process

Now while Amara is a fantastic free resource, the following considerations need to be met:

  • The reading rate shouldn’t exceed 21 characters per second.
    • If it does, you need to lengthen the duration, reduce the text or split the subtitle.
  • The “beginner” mode in Amara plays 4 seconds, then pauses.
    • You have to type what you heard in each pause while staying aware of the subtitle limits.
  • After editing you have to line up the subtitle with the video in the timeline editor.
    • This process is generally straightforward but sometimes you need to go back and split the subtitle so it reads more naturally.
  • You have to watch out for typos and add off-camera indications such as laughter or a second person talking.
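
The reading-rate rule in the list above comes down to simple arithmetic: the number of characters divided by the number of seconds the subtitle stays on screen. Here is a minimal sketch in Python, assuming the limit is measured in characters per second and that spaces count as characters. The function names are hypothetical and this is not Amara's own code.

```python
MAX_CHARS_PER_SECOND = 21  # limit quoted in the list above


def reading_rate(text: str, start: float, end: float) -> float:
    """Return characters per second for a subtitle shown from start to end (in seconds)."""
    duration = end - start
    if duration <= 0:
        raise ValueError("subtitle must end after it starts")
    return len(text) / duration


def is_too_fast(text: str, start: float, end: float) -> bool:
    """True when the subtitle exceeds the reading-rate limit."""
    return reading_rate(text, start, end) > MAX_CHARS_PER_SECOND
```

For example, a 60-character subtitle shown for 2 seconds works out at 30 characters per second, which is too fast. Lengthening it to 3 seconds brings it down to 20, which is within the limit.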

One of the good things about Amara is that it makes it easy to add subtitles in other languages, lets multiple people work on the subtitles of the same video, and allows someone else to pick up an existing transcription if a contributor gets stuck.

Investigation into AI tools

Subtitling is important for accessibility, but also for search, user experience, and learning. WordPress TV have a campaign running on subtitling. Some subtitling work can be automated, but it still needs human involvement.

Videos hosted on YouTube already have access to an excellent auto-captioning library available in English, Dutch, French, German, Italian, Japanese, Korean, Portuguese, Russian, and Spanish. While YouTube are constantly improving their speech recognition technology, automatic captions might misrepresent the spoken content due to mispronunciations, accents, dialects, or background noise.

Therefore, letting YouTube automate 80-90% of the captioning could be a good starting point for the transcription: the time stamps would already exist, leaving the final 10-20% to be reviewed and properly transcribed. The downside is that the automated versions are unlikely to match exactly what the speaker intended, which raises questions about accuracy and about who is responsible for what gets published.

WordCamp videos from January 2018 onwards are being uploaded to YouTube.

Doing a quick search on GitHub also reveals hundreds of open source libraries for “Speech-to-text” implementations. Mozilla is actively developing a speech-to-text implementation called DeepSpeech.

DeepSpeech is an open source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu’s Deep Speech research paper. Project DeepSpeech uses Google’s TensorFlow to make the implementation easier.

I managed to install DeepSpeech locally with Docker and to my excitement was able to output some text via the terminal from a small English/American audio clip. The process is quite prone to error as you need to have all the required libraries installed but I will be investigating this further.

Ideally, DeepSpeech would be installed on some globally available server with an interface to upload audio files and download text. However, the bottleneck would still come from creating and reviewing the TTML file.
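
To give an idea of what creating the TTML file involves, here is a minimal sketch in Python that turns a list of timed text segments into a bare-bones TTML document using only the standard library. The segment format (start, end, text) and the function names are assumptions for illustration. A real file would likely need styling and layout information that this sketch leaves out, and the review step would still be manual.

```python
import xml.etree.ElementTree as ET

TTML_NS = "http://www.w3.org/ns/ttml"
XML_NS = "http://www.w3.org/XML/1998/namespace"


def to_timestamp(seconds: float) -> str:
    """Format a time in seconds as HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, ms = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def build_ttml(segments, lang="en"):
    """Build a TTML string from (start, end, text) tuples, with times in seconds."""
    # Serialise TTML elements without a prefix (default namespace).
    ET.register_namespace("", TTML_NS)
    tt = ET.Element(f"{{{TTML_NS}}}tt", {f"{{{XML_NS}}}lang": lang})
    body = ET.SubElement(tt, f"{{{TTML_NS}}}body")
    div = ET.SubElement(body, f"{{{TTML_NS}}}div")
    for start, end, text in segments:
        # One <p> element per subtitle, timed with begin/end attributes.
        p = ET.SubElement(
            div,
            f"{{{TTML_NS}}}p",
            {"begin": to_timestamp(start), "end": to_timestamp(end)},
        )
        p.text = text
    return ET.tostring(tt, encoding="unicode")
```

Calling `build_ttml([(0.0, 2.5, "Hello everyone"), (2.5, 5.0, "[laughter]")])` returns the document as a string, ready to be saved and reviewed.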

While the video file can be downloaded from WordPress TV, isolating the audio file needs to be done manually.
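
Isolating the audio could probably be scripted rather than done by hand. Here is a minimal sketch in Python, assuming the ffmpeg command-line tool is installed and on the PATH. The file names are hypothetical. The mono, 16 kHz, 16-bit WAV output is an assumption about what a speech-to-text engine will accept, so check the requirements of the model you use.

```python
import subprocess
from pathlib import Path
from typing import List


def ffmpeg_audio_command(video: Path, audio: Path) -> List[str]:
    """Build the ffmpeg arguments that drop the video and write a mono 16 kHz WAV."""
    return [
        "ffmpeg",
        "-i", str(video),         # input video file
        "-vn",                    # ignore the video stream
        "-ac", "1",               # one audio channel (mono)
        "-ar", "16000",           # 16 kHz sample rate
        "-acodec", "pcm_s16le",   # 16-bit PCM audio
        str(audio),               # output file
    ]


def extract_audio(video: Path, audio: Path) -> None:
    """Run ffmpeg; raises subprocess.CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_audio_command(video, audio), check=True)
```

With a downloaded video saved as `talk.mp4`, `extract_audio(Path("talk.mp4"), Path("talk.wav"))` would write the audio track next to it.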

Existing resources

Transcripts from WordCamps, notes provided by speakers, and some of the text versions produced by STTR and other tools also help make subtitling easier. In addition, subtitles broaden the usage of videos and make them easier to translate, or to be used by people who cannot follow the recorded language.

Dublin did a lot of testing on this to produce materials which could help the community and this is being put together. The more that people subtitle and correct automated transcripts, the better the tools will become at learning different accents, words and dialects.