Extracting DVD Subtitles

Wednesday, July 2nd, 2008

Problem

  • Our foundation has a great little video produced by an outside company and they want us to put it online, but…
  • We require that all online video includes captioning for accessibility (dfxp format) and…
  • They only have a DVD of the video with captions (not original text).

Solution

Well, we could re-transcribe the video, but that seems like a lot of redundant work since someone already took the time to make the captions for the DVD. The only problem is that DVD captions are stored as bitmap images rather than actual timed text.

Step 1: OCR the DVD subtitles

OCR stands for “Optical Character Recognition”. Your scanner software may use it for decoding paper documents. Because the subs are actually images, OCR is needed here too. There is a cool little application called SubRip that does exactly what we need.

You point it to the correct DVD file and it will attempt to transform the subtitle images into text. Since subtitles can come in all different fonts, languages and colors, the process is not completely automatic. From time to time it will ask you to identify certain character(s). Think L vs. I and m vs. rn.

Step 2: Convert the subs to dfxp

SubRip does a pretty good job, but you will still need to clean up the output and transform it into the correct format.

SubRip outputs an .srt file where the captions look like:
1
00:00:06,373 --> 00:00:08,933
Whether we realize it or not, we need
each other.

We need them to be in dfxp format like:
<p begin="00:00:06.37" end="00:00:08.93">Whether we realize it or not, we need each other.</p>

This is a job for regular expressions!

  • Replace the entry number (1, 2, 3) with a marker: \n\d+\n becomes ~
  • Then grab the srt timecode and text info:
    ~(\d\d:\d\d:\d\d),(\d\d)\d --> (\d\d:\d\d:\d\d),(\d\d)\d
    ([^~]+)
  • This becomes:
    <p begin="$1.$2" end="$3.$4"><span tts:fontStyle="italic" >Speaker</span><br/>$5</p>
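
For anyone who would rather script this than run the find-and-replace by hand, here is a minimal Python sketch of the same idea. It is an illustration under a few assumptions, not the exact steps above: the function name srt_to_dfxp_paragraphs is made up, it matches each entry in a single pass instead of inserting the ~ marker first, it leaves out the Speaker span, and it escapes XML special characters in the caption text. Like the regex above, it drops the last millisecond digit.

```python
import re
from xml.sax.saxutils import escape

# One SRT entry: index line, timecode line, then caption text running
# up to the next blank line or the end of the input.
ENTRY = re.compile(
    r"(?:^|\n)\d+\n"
    r"(\d\d:\d\d:\d\d),(\d\d)\d --> (\d\d:\d\d:\d\d),(\d\d)\d\n"
    r"(.+?)(?=\n\s*\n|\s*\Z)",
    re.DOTALL,
)


def srt_to_dfxp_paragraphs(srt_text):
    """Return a list of dfxp <p> elements, one per SRT entry (hypothetical helper)."""
    srt_text = srt_text.replace("\r\n", "\n")  # normalize Windows line endings
    paragraphs = []
    for match in ENTRY.finditer(srt_text):
        begin = f"{match.group(1)}.{match.group(2)}"
        end = f"{match.group(3)}.{match.group(4)}"
        # Join multi-line captions into one line and escape &, <, >.
        text = escape(" ".join(match.group(5).split()))
        paragraphs.append(f'<p begin="{begin}" end="{end}">{text}</p>')
    return paragraphs


if __name__ == "__main__":
    sample = (
        "1\n"
        "00:00:06,373 --> 00:00:08,933\n"
        "Whether we realize it or not, we need\n"
        "each other.\n"
    )
    print("\n".join(srt_to_dfxp_paragraphs(sample)))
```

The resulting lines still need to be wrapped in the rest of the dfxp document by hand.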

Step 3: Check your work

Now, you just need to make sure that it all worked as expected.

For this example, I found that the captions were actually off by about 4 seconds. This could have been due to editing differences between this DVD and my final FLV.

I also found some minor mistakes that the OCR had created.

Luckily, both of these were easy to fix manually.
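
If the offset is the same for every caption, it could also be applied with a short script instead of by hand. The sketch below is only an illustration: shift_timecodes is a made-up helper name, it assumes the begin/end attributes use the HH:MM:SS.cc form shown earlier, and whether the offset should be positive or negative depends on which way your captions drift.

```python
import re

# Matches begin="HH:MM:SS.cc" and end="HH:MM:SS.cc" attributes.
TIMECODE = re.compile(r'(begin|end)="(\d\d):(\d\d):(\d\d)\.(\d\d)"')


def shift_timecodes(dfxp_text, offset_seconds):
    """Shift every begin/end timecode by offset_seconds (hypothetical helper)."""
    offset_cs = round(offset_seconds * 100)  # work in hundredths of a second

    def _shift(match):
        hours, minutes, seconds, centis = (int(match.group(i)) for i in range(2, 6))
        total = ((hours * 60 + minutes) * 60 + seconds) * 100 + centis + offset_cs
        total = max(total, 0)  # clamp so a timecode never goes negative
        total_seconds, centis = divmod(total, 100)
        total_minutes, seconds = divmod(total_seconds, 60)
        hours, minutes = divmod(total_minutes, 60)
        return f'{match.group(1)}="{hours:02d}:{minutes:02d}:{seconds:02d}.{centis:02d}"'

    return TIMECODE.sub(_shift, dfxp_text)


if __name__ == "__main__":
    line = '<p begin="00:00:06.37" end="00:00:08.93">Whether we realize it or not, we need each other.</p>'
    # Move the caption 4 seconds earlier; use a positive value to move it later.
    print(shift_timecodes(line, -4))
```

The OCR mistakes still need a human read-through; a script will not catch those.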
