I’ve been watching the Chinese TV show 他来了，请闭眼 (Love Me If You Dare). It’s a good show, kinda reminiscent of the BBC series Sherlock, likewise a crime drama centered around an eccentric crime-solving protagonist and a sympathetic sidekick. You should check it out if you’re into Chinese film or are learning Chinese and want something interesting to watch.
I wanted to get a transcript of the episodes’ dialog so I could study the unfamiliar vocabulary. Unfortunately, the video files I have only have hard subtitles, i.e. the subtitles are images directly composited into the video stream. After an hour spent scouring both the English- and Chinese-language webs, I couldn’t find any soft subs (e.g. SRT format) for the show.
So I thought it’d be interesting to try to convert the hard subs in the video files to text. For example, here’s a frame of the video:
We could just try throwing Tesseract at it and see what comes out:
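One way to do that, sketched here by shelling out to the `tesseract` command-line tool. This assumes Tesseract is installed along with the Simplified Chinese language data (`chi_sim`); the helper names and the `frame.png` file name are placeholders, not necessarily what the post's own code uses.

```python
import subprocess


def tesseract_command(image_path, lang="chi_sim"):
    """Build the command line for OCR-ing one image.

    Passing "stdout" as the output base makes tesseract print the
    recognized text instead of writing it to a .txt file.
    """
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="chi_sim"):
    """Run tesseract on an image file and return the recognized text."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError if tesseract fails
    )
    return result.stdout.strip()


# Usage (needs tesseract on the PATH and a real image file):
#   print(ocr_image("frame.png"))
```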
Hmm, so that didn’t work. What’s happening?
Tesseract requires that you clean your input image before you do OCR. Our input image is full of irrelevant background features but Tesseract expects clean black text on a white background (or white on black).
To remove the background image and get just the subtitles, we turn to OpenCV. The easiest part is cropping the image. We keep a larger left/right border because some frames have more text:
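A rough sketch of the crop, assuming the subtitles sit in a horizontal band near the bottom of the frame. The pixel values below are guesses for illustration and would need tuning per video. With a frame loaded by OpenCV (a NumPy array) the crop itself is just `frame[top:bottom, left:right]`; the list-based version here does the same thing without needing OpenCV.

```python
def subtitle_crop_box(width, height, band_height=80, bottom_margin=40,
                      side_margin=100):
    """Return (top, bottom, left, right) for the subtitle strip.

    All defaults are illustrative assumptions. side_margin is kept small
    so the strip stays wide enough for frames with longer subtitles.
    """
    bottom = height - bottom_margin
    top = bottom - band_height
    left = side_margin
    right = width - side_margin
    return top, bottom, left, right


def crop(rows, box):
    """Crop a frame stored as a list of pixel rows to the given box."""
    top, bottom, left, right = box
    return [row[left:right] for row in rows[top:bottom]]


# Example: the box for a 1280x720 frame
box = subtitle_crop_box(1280, 720)
print(box)  # (600, 680, 100, 1180)
```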
Now we want to isolate the text. The text is white, so we can mask out all the areas in the image that aren’t white:
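With OpenCV this is a single call, something like `mask = cv2.inRange(img, np.array([200, 200, 200]), np.array([255, 255, 255]))`. The plain-Python sketch below spells out the same per-pixel rule on a tiny hand-made image, so it runs without OpenCV installed. The function name `white_mask` and the sample pixels are made up for illustration.

```python
def white_mask(pixels, lower=200, upper=255):
    """Mimic the per-pixel rule of an in-range threshold.

    `pixels` is a list of rows, each row a list of (b, g, r) tuples.
    A pixel becomes 255 if every channel lies in [lower, upper],
    otherwise 0.
    """
    return [
        [255 if all(lower <= c <= upper for c in px) else 0 for px in row]
        for row in pixels
    ]


image = [
    [(255, 255, 255), (10, 20, 30)],     # pure white, dark background
    [(210, 205, 250), (255, 255, 100)],  # near-white, bright but not white
]
print(white_mask(image))  # [[255, 0], [255, 0]]
```

Note that the last pixel is rejected even though two of its channels are at 255: all three channels have to be in range.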
This uses the OpenCV inRange function. inRange returns a value of 255 (pure white in an 8-bit grayscale context) for pixels where the red, blue, and green components are all between 200 and 255, and 0 (black) for pixels that are outside this range. This is called thresholding. Here’s what we get:
A lot better! Let’s run Tesseract again:
And Tesseract returns (drumroll…):
Now we’re getting somewhere! Several areas in the background are white, so when we pass those through to Tesseract it interprets them as assorted punctuation. Let’s strip out these non-Chinese characters using Python’s standard-library unicodedata module:
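A minimal sketch of that filter, assuming the OCR output is already a Python string. The function name `keep_lo` and the sample string are made up for illustration; `unicodedata.category` is the standard-library call doing the work.

```python
import unicodedata


def keep_lo(text):
    """Keep only characters whose Unicode general category is 'Lo'."""
    return "".join(ch for ch in text if unicodedata.category(ch) == "Lo")


# Punctuation, spaces and Latin letters are dropped; Chinese characters stay.
print(keep_lo("..,二 你好!abc"))  # 二你好
```

The filter can only remove noise that isn't a character of this kind. A stray Chinese character produced by the OCR step passes straight through.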
'Lo' here is one of the General Categories that Unicode assigns to characters and stands for “Letter, other”. It’s good for extracting East Asian characters. From this code we get:
There are two mistakes here: a spurious 二 character on the front, and a mismatched character in the middle (that 逯 should be 这). Still, not bad!
That’s all for now, but in Part 2 (and maybe Part 3?) of this post series I’ll discuss how we can use some more advanced techniques to perfect the above example and also handle cases where extracting the text isn’t so straightforward. If you can’t wait until then, the code is on GitHub.
If you have any comments about this post, join the discussion on Hacker News, and if you enjoyed it, please upvote on HN!