Extracting Hardsubs

(Work in progress)

Introduction

For many aspiring re-releasers, one of the biggest obstacles is working with older (mainly pre-2006) anime whose original fansubs were 100% hardsubbed. Let's face it, any newer anime likely has softsubs easily available, and is going to be covered by half a dozen Blu-Ray encoding/raw-remuxing groups. Older hardsubbed series represent the greatest percentage of potential re-release projects, in addition to being the series most in need of updated versions with better audio/video quality. And if one isn't doing a full line-by-line retiming, edit, and translation check, obtaining scripts from hardsubbed videos is the most difficult and time-consuming part of the process.

Since most of the projects I work on involve series whose original fansubs were 100% hardsubbed, I thought I'd share a few tips and tricks on the optimal ways of obtaining these scripts, and compare some advantages and disadvantages of each.

These guides assume you can obtain and learn the basic functions of Aegisub. Also, the Aegisub numerical values assume that you're putting these subs on a 480p DVD encode. Adjust them accordingly for higher-resolution sources.

Method Zero: Obtaining Scripts Directly

If the original fansub group is still active, contact them through standard channels (website, forum, e-mail, IRC), explain who you are and what you're doing, and ask nicely for the scripts for the show in question. If the group has disbanded, check the staff credits in the video (most older fansubs will have them), and track down the individuals via IRC, AnimeSuki, or MAL.

This can be the easiest method, but it also has a low probability of success. While styled softsubs first became theoretically possible in 2002 or 2003 with .ogm + external .ssa releases (as seen with some R1 DVD-rips by Anime-HQ), many groups were protective of their scripts and used hardsubbing as a way to prevent people from... well, doing exactly what we're doing here. So some fansubbers/groups may not be willing to share scripts. And even if they are, the scripts may have been lost to FTP or HDD crashes, and thus no longer exist anywhere outside of the hardsubbed videos.

Advantages:

* If original fansub scripts can be obtained, it effectively turns your hardsub -> softsub project into a softsub -> softsub project, eliminating the need to obtain scripts manually.
* Ensures that no errors will be introduced, unless you do further editing to the scripts.

Disadvantages:

* Reduced chances of success, since it relies on others having the scripts and being willing to share them.
* Since many groups used After Effects (AFX) for their typesetting and karaoke, the scripts you receive might only have the dialogue, forcing you to redo signs and songs yourself.

My projects done this way: Mamotte Shugogetten TV (future)

Method One: Optical Character Recognition (OCR)

OCR is a method that scans images, finds patterns of pixels of certain colors grouped into shapes, and interprets those shapes as text based on user input. For this method, you will need SubRip or an equivalent tool. The process of using SubRip on hardsubbed videos is complicated enough to require its own guide. I highly suggest reading this one, as it's how I learned to do it. This section will mainly be supplemental tips for that guide.

The first step is to decide, "Are these subs OCR-able?" In general, subs that use all one color, or have only a few similar colors, will be the most OCR-able. Simpler, sans-serif fonts are also better, as serifs and other ornamentation will cause problems for SubRip. Subtitles that use many different colors, like different text or outline colors for each character, will not be OCR-able. Examples of the latter include a.f.k.'s Full Moon, Static-Subs' My-Hime/My-Otome, Ryoumi's Tonagura!, and Anime-Keep's Yumeria.

Some tips to supplement the linked SubRip guide:
1) Before starting in with SubRip, open one of the TV-fansubs in Aegisub. Use Aegisub's Color Picker to determine the color hex values of the subtitles. Now when you're in SubRip, you can manually enter those values if the automatic color detection doesn't get the right colors. As the guide suggests, try to find a 2-line sub, and have SubRip scan a 2-line area. Get a sense of how wide the subtitles extend in the image, and set the scanning rectangle wide enough so that characters at the beginning and ends of sentences don't get cut off. For many fansubs, this will be close to the entire width of the image. If the original subs have few or no 2-line subs, you can try a 1-line rectangle to reduce the scanning area, thus reducing the amount of non-text marks SubRip will detect. Be prepared to use "Manual Entry" in SubRip if 2-liners do appear. Under "1-line settings", if SubRip perfectly reads the bottom line of a 2-line sub without any user input, it will miss the top line completely.

2) Use the color values to set SubRip's sensitivity settings. For instance, if you find that your outline's values are Red 4, Green 78, and Blue 121, unclick the "move all values together" option and change the outline sensitivity to reflect the differences in those numbers, i.e. Red 34, Green 108, Blue 150. Don't hesitate to pause the OCR scanning and adjust those numbers up or down with "move all values" re-enabled if the detection isn't working right. Generally, bright scenes where subtitles overlay images with colors similar to the subtitles will require less sensitivity. Un-checking the "outline" button can also be helpful. It's hard to go into here, but you have to get used to playing with the settings to suit the scene, show, and subs you're OCRing. Your goal is to have SubRip showing only the text and as few extraneous dots/marks as possible.

3) While running the OCR and inputting characters, make sure not to make typos (duh). For random non-text marks, press Enter to "ignore" them. Resist the temptation to press SpaceBar+Enter to mark them as blank spaces, even if they appear in a space between two letters. Chances are, you'll later see words broken by spaces if SubRip sees a similar mark between two letters within a word. Automatic spell-checkers like Aegisub's have a much easier time correcting missing spaces than additional ones. I recommend pausing SubRip and skipping over OP/ED songs, since karaoke subs are often placed differently from dialogue subs, use weird fonts, and have funky effects -- all of these will confuse SubRip and slow you down. Just go back later with Aegisub to manually retime and type those yourself.

4) While SubRip's automatic correction can fix some errors, you'll need to do further correction. Open the resulting .srt in Aegisub, and load the hardsubbed video. Move all the new subs so that they appear above the old ones, either with Styles Manager or Select All + Margin Override. Run spell-check, and add all common names and series-specific terms to the custom dictionary. Use "Replace All" for OCR mistakes like I'II instead of I'll. Aegisub's Find+Replace can also be useful in correcting OCR errors not noticed by spellcheck. Go through all the subs line-by-line to check punctuation and consistency with the original subs, as SubRip often omits or adds periods or other punctuation marks. Of course, if the original subs had spelling errors or other typos, fix those as well!

5) While SubRip's "sins of commission" should be fixed at this point, there are still omissions to consider. Open the original hardsubbed video in a media player, and have your script open in Aegisub. Fast-forward through the video, taking note of any relevant onscreen text, relevant insert songs you want to include, and missing dialogue. Obviously, SubRip is limited to that rectangle, so it won't catch most onscreen text, dialogue at the top of the screen, or 3-line subs, e.g. Person A's speech has 2 lines of text, Person B interrupts and adds a 3rd line. And depending on your settings and the original fansubs' timing, SubRip may miss short lines like What? / But... / Huh? / I... / Sorry. / etc. Use "Insert before/after..." in Aegisub to add the missing content in the proper places, and retime them when you do your retiming/shifting for the new video source. Once you've created karaoke files, copy+paste the song lyrics in, and timeshift them if necessary.
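
If you'd rather not run the same "Replace All" passes from tip 4 by hand on every episode, they could be scripted before the .srt ever reaches Aegisub. What follows is only a rough sketch, not part of the SubRip guide: the function name and the replacement table are my own placeholders, and the only pattern taken from this guide is the I'II -> I'll mix-up. It leaves cue numbers and timing lines untouched.

```python
import re

# Hypothetical table of OCR confusions. Only the I'II -> I'll case comes from
# this guide; extend it with whatever mistakes your own OCR runs produce.
OCR_FIXES = [
    (re.compile(r"'II\b"), "'ll"),  # I'II, we'II, you'II -> I'll, we'll, you'll
]

# Matches an .srt timing line such as "00:00:01,000 --> 00:00:03,000".
TIMING_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}")


def fix_ocr_text(srt_text: str) -> str:
    """Apply OCR_FIXES to subtitle text, skipping cue numbers and timing lines."""
    fixed = []
    for line in srt_text.splitlines():
        if line.strip().isdigit() or TIMING_LINE.match(line):
            fixed.append(line)  # cue number or timing line: keep as-is
            continue
        for pattern, replacement in OCR_FIXES:
            line = pattern.sub(replacement, line)
        fixed.append(line)
    return "\n".join(fixed)
```

This only handles the mechanical mistakes. The line-by-line punctuation check against the hardsubbed video is still a job for you and Aegisub.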

Advantages:

* For subs with high "OCR-ability," SubRip can run fairly quickly and automatically, once text and color settings are optimized and a character matrix is established.
* Only a small percentage of the hardsubbed text actually needs to be entered by you, thus allowing you to multitask. (I prefer to throw on an English-dubbed rewatch anime on my adjacent TV.)
* OCR is the best way to automatically get timings for shows where no timed scripts (in any language, see below) are available.

Disadvantages:

* Requires a LOT of work just to learn and set up.
* Does not work on all varieties of hardsubs.
* Inevitably introduces many errors and omissions, which must be manually corrected.
* Does not detect "specialty" text outside a given area, so even if it works perfectly, some manual retyping and retiming is still necessary.
* Depending on difficulty and PC speed, will likely take 2x an episode's run time to scan a single episode. May take longer.

My projects done this way: Strawberry 100% TV, Touka Gettan, To Heart ~Remember my Memories~

Method Two: Transcription

First off, some may be under the misconception that transcription involves watching hardsubbed videos, pausing every few seconds to type things into a text file, and then retiming everything from scratch. This is not necessary, as you can do everything from within Aegisub, without the need to switch between applications.

Transcription from timed, non-English subs:

Chances are, the show in question probably doesn't have English softsubs available. (And if it does, just use those and edit/[de-]localize them to your preferences.) However, there's a good chance that subbers over in Russia have subbed the show and based their scripts on the most-respected English hardsubs. So head over to Subs.RU, search for the show you want, and grab the RU subtitle archive. Now, it's time to get things set up.

1) If there are any "Readme" or "Comments" text files, feed them into Google Translate. If there are multiple English hardsubbed versions, the text files may say which version the Russian subbers used. Next, open one of the .ass or .srt files. Use Find+Replace to replace \N with a blank space. Then, copy roughly half the lines, and paste them into Google Translate. Copy the resulting translation, and use Paste Over (Shift+Ctrl+V) in Aegisub to paste the auto-translated English text over the Russian. Repeat with the second half of the script. (Trying to do the whole script at once will run over Google-TL's length limits, and cut off numerous lines. \N's will also cause "\N[some RU word]" to appear in the translated text.)

-- Also, this can work with scripts in other non-English languages, mainly European ones whose subbers also translate into their native languages from the English fansubs. Subs in Asian languages like Chinese can work as well. However, I don't recommend them, as they'll be original JP->CN translations, and their timings likely won't mesh well with the English subs.
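
The prep work in step 1 (swap \N for a space, then feed the translator a manageable batch at a time) could also be done outside Aegisub. Here's a minimal sketch, assuming an .ass file whose Dialogue lines use the standard 10-field Events layout with the text in the last field. Both function names are hypothetical, and max_chars is a placeholder you'd set to whatever length your translator actually accepts. I'm not quoting a real limit here.

```python
def dialogue_texts(ass_text: str) -> list[str]:
    """Return the text of every Dialogue line, with \\N line breaks turned into spaces."""
    texts = []
    for line in ass_text.splitlines():
        if not line.startswith("Dialogue:"):
            continue
        # The text is the 10th field and may contain commas itself,
        # so split on the first 9 commas only.
        fields = line.split(",", 9)
        if len(fields) == 10:
            texts.append(fields[9].replace("\\N", " "))
    return texts


def chunk_lines(lines: list[str], max_chars: int) -> list[list[str]]:
    """Group lines into batches whose pasted length stays within max_chars."""
    chunks, current, size = [], [], 0
    for text in lines:
        cost = len(text) + 1  # +1 for the newline added when pasting
        if current and size + cost > max_chars:
            chunks.append(current)
            current, size = [], 0
        current.append(text)
        size += cost
    if current:
        chunks.append(current)
    return chunks
```

Each batch keeps one subtitle per row, so the translated text should still line up row-for-row when you use Paste Over. Check the line count before pasting, since translators sometimes merge or drop rows.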

2) Once you have a script of semi-comprehensible English text, select all lines and do the following:
* In the Shift Times dialogue box (CTRL+I), shift all lines forward 0.30 seconds. This is to ensure that the subtitles in the script always appear slightly after the subtitles in the hardsubbed video.
* Change the vertical margin to ~400, so that the subs appear extremely high on the image, but not at the very top of the screen.
* With all lines selected, right-click and press "Duplicate." Select all the duplicate lines, and change the vertical margin to ~90. Clear all text from the duplicate lines by deleting the text from one line while all duplicate lines are selected.
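
For anyone who'd rather batch this setup across a whole series, the three bullets above boil down to a small transformation on each Dialogue line. The sketch below is an illustration only, not how Aegisub does it internally. It assumes the standard 10-field Dialogue layout (the 8th field is the vertical margin) and H:MM:SS.cc timestamps. The function names are made up, and the 0.30 s / 400 / 90 values are just the ones suggested in this guide.

```python
def parse_time(stamp: str) -> int:
    """Convert an ASS timestamp (H:MM:SS.cc) to centiseconds."""
    hours, minutes, seconds = stamp.split(":")
    whole, centis = seconds.split(".")
    return ((int(hours) * 60 + int(minutes)) * 60 + int(whole)) * 100 + int(centis)


def format_time(centis: int) -> str:
    """Convert centiseconds back to an ASS timestamp, clamping at zero."""
    seconds, cs = divmod(max(centis, 0), 100)
    minutes, s = divmod(seconds, 60)
    hours, m = divmod(minutes, 60)
    return f"{hours}:{m:02d}:{s:02d}.{cs:02d}"


def prepare_line(line: str, shift_cs: int = 30,
                 guide_margin: int = 400, entry_margin: int = 90) -> list[str]:
    """Return [shifted guide line, blank entry line] for one Dialogue line.

    Anything that is not a well-formed Dialogue line is passed through unchanged.
    """
    if not line.startswith("Dialogue:"):
        return [line]
    f = line.split(",", 9)
    if len(f) != 10:
        return [line]
    f[1] = format_time(parse_time(f[1]) + shift_cs)  # start, shifted later
    f[2] = format_time(parse_time(f[2]) + shift_cs)  # end, shifted later
    guide = f[:7] + [str(guide_margin)] + f[8:]      # auto-translated text, high on screen
    blank = f[:7] + [str(entry_margin), f[8], ""]    # same timing, empty text to type into
    return [",".join(guide), ",".join(blank)]
```

Note that this puts each blank line right after its guide line rather than grouping the duplicates together. Since both share the same timing, that makes no difference to what shows up on screen.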

3) After all that, you should have a script with ~350 auto-translated, broken-English lines appearing near the top of the screen, and an equal number of blank lines with the same start/end times. Load the hardsubbed video into Aegisub. Go to the blank lines, and begin typing in the text from the hardsubs, making whatever changes and edits you deem necessary. Be wary of "right spelling, wrong word" typos such as you're/your, out/our, not/now, is/if, and the like, as these will be harder to spot later. Aegisub's spellcheck can handle most outright errors, so if you see you've made an obvious mistake (red wavy underlined word), just move on to the next line and fix it with spellcheck later.

4) Timing issues: In the ideal scenario, the RU or other non-English subs will line up perfectly with the English subs, in terms of where lines (defined here as cells on the subtitle grid, not the number of actual lines of text appearing onscreen) begin and end -- some timers include s-stuttering, some don't -- and how long lines are broken up. The reason I recommend auto-translating the RU text is to create a "guide" to show approximately what content is covered within the lines as the RU subbers timed them. This avoids accidental skipping, combining, or other mis-entries of content while transcribing.

If the RU lines are *shorter* than the English lines, i.e. two or three short RU lines cover the same dialogue as one long ENG line, just enter everything in the ENG line into the first RU line. For the subsequent RU line or lines, enter a placeholder like //. You can then later use Ctrl+F to find all of those, and use "Join (keep first)" to sort them out.

If the RU lines are *longer* than the English lines, then you have more of an annoyance to deal with. You will need to play the video to view the next English line or lines, and then enter them into the RU line that spans their timecodes. I recommend going into Aegisub's Hotkeys options and setting "Global Video Play" to a single key like F9. By default, it's the more cumbersome Ctrl+P.

Of course, you can feel free to join or split lines later on, if you feel the original English or RU times are too fragmented (lots of short lines with almost no time on screen to read them) or too drawn-out (lengthy lines that stay up too long and give away too much information too soon). I try to strike a middle ground and keep lines between 2 and 5 seconds in length, where possible.
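
The // cleanup for the "shorter RU lines" case can be thought of as a simple fold over the transcribed lines. The sketch below is my own illustration of that idea, not Aegisub's code. It assumes the lines have already been pulled out as (start, end, text) tuples in script order, with times in any consistent numeric unit, and that "Join (keep first)" keeps the first line's text while stretching its timing over the lines joined into it. The function name is hypothetical.

```python
def join_placeholders(cues, placeholder="//"):
    """Fold placeholder lines into the line before them.

    cues: list of (start, end, text) tuples in script order.
    A placeholder extends the previous line's end time and is then dropped.
    A placeholder with nothing before it is kept, so you can spot it later.
    """
    joined = []
    for start, end, text in cues:
        if text.strip() == placeholder and joined:
            prev_start, prev_end, prev_text = joined[-1]
            joined[-1] = (prev_start, max(prev_end, end), prev_text)
        else:
            joined.append((start, end, text))
    return joined
```

For example, one long English line followed by two // lines collapses into a single line that ends when the last placeholder would have ended.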

5) Styling: I usually use "Default" as something simple to make the RU subs easily readable (often a "DVD-yellow" bold Arial), and set up another style like "Main" for the actual English dialogue. Chances are, you'll want to use a few more styles, such as an "Alternate" for overlapping dialogue, an Italic style for thoughts, and/or a top-aligned style for background-type lines. If you're more masochistic, you may also want to do different colors for different characters, or to differentiate between onscreen/offscreen dialogue. You can use Aegisub's features to combine the styling process with transcription. I do this by adding special symbols to lines where I want a different style, like @ for Alternate/Overlap, # for Thought, $ for Top, and so forth. Make sure they're symbols that are unlikely to appear in the actual dialogue. Once the transcription is complete, use Aegisub's Subtitles > Select Lines... option to select all lines with a given symbol. Set the styles for it, then use Find+Replace to change all those symbols to nothing. Repeat for all the symbols you've used.
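
The marker trick above (select lines containing a symbol, set the style, strip the symbol) is easy to picture as a per-line rule. Here's a minimal sketch of it, again assuming the standard 10-field Dialogue layout where the 4th field is the style name. The function name is hypothetical, and the marker table just mirrors the @ / # / $ examples from this guide, so swap in your own symbols and style names.

```python
# Marker symbols and the styles they stand for (example values from this guide).
MARKER_STYLES = {"@": "Alternate", "#": "Thought", "$": "Top"}


def apply_marker_style(line: str, markers: dict[str, str] = MARKER_STYLES) -> str:
    """Set the style named by the first marker found in the text, then strip that marker."""
    if not line.startswith("Dialogue:"):
        return line
    f = line.split(",", 9)
    if len(f) != 10:
        return line
    for symbol, style in markers.items():
        if symbol in f[9]:
            f[3] = style                      # 4th field is the style name
            f[9] = f[9].replace(symbol, "")   # same as Find+Replace with nothing
            break
    return ",".join(f)
```

Lines without a marker come back unchanged, so whatever style you transcribed them in (e.g. "Main") stays put.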

For more complex styling, like color-coding by character, or the numerous styles I used with Tonagura!, you'll likely have to go through line-by-line to set the styles. Setting up hotkeys like F11 and F12 for "Global Prev Line" and "Global Next Line" helps, as does Aegisub's Styling Assistant. To do onscreen/offscreen styling correctly, you will need to view the beginning and end of each line in Aegisub or a real-time watch, and shift between them if the character moves on- or off-"camera" during the line. This can be done with \t tags for gradual shifts, or by duplicating the line and adjusting start and end times for abrupt ones. Refer to my Tonagura! scripts for examples of this.

The elements that constitute *good* styling could easily fill a guide by themselves, and fortunately you can read such a guide right here.

6) Final check: Whether you do a full retiming/editing/TLC on the transcribed subs or not, it's still good to look over them and make sure you haven't introduced any errors. This is partly why I say to set the "new" subs to a vertical margin of 90 -- they'll appear above the originals, so you can scroll through line by line and easily spot any unintentional missing/extra words, punctuation errors, correctly-spelled wrong words, and so forth. This can be combined with a line-by-line restyling, as described in (5). Finally, delete the auto-translated lines, set the margin override back to 0, and reverse the time-shift you did earlier. (This would also be a time to shift times for any TV-DVD differences, like sponsor screens.) At this point, you'll now have a script that's the same as or better than the script used in the TV-rip. Depending on what the RU subbers did, you may also need to manually include some things like signs, songs, and previews.

Projects done via this method: Tonagura!, Soul Link, Rizelmine (05 onwards), Lime-Colored War Tales, Full Moon wo Sagashite, My Wife is a High School Girl, G-On Riders, and probably any other future hardsub -> softsub conversion.

*** Alternate method: Brute-Force Transcription ***

This method is a last resort, to be used when you cannot find timed scripts for your project in any language. Most steps are the same, except that you'll have to create a timed script yourself. Open the hardsubbed video, and then open the audio from that video. Knowledge of some basic Japanese and familiarity with the show in question will be useful. Go through the waveform and time anything that resembles dialogue, without paying attention to the video -- use "Audio+Subs View." It's best to err on the side of caution, and favor shorter line times. It's easier to enter in a longer line and join subsequent lines into it, than it is to start and stop the video while transcribing to enter several short lines from the hardsubs into one longer-timed line. Of course, if you have a feel for how the original fansub group did their timing, you can adjust your "pre-timing" to fit. After all lines are pre-timed, shift them forward ~0.30 seconds (or longer, if you timed with significant amounts of lead-in), and transcribe away. Set the vertical margins to ~90 if you wish to compare your transcriptions with the original hardsubs for error-checking and styling. Signs will need to be added manually by scanning the video; add karaoke from a separately-timed file.

Projects done via this method: No full series thankfully, though I have done it for a few random episodes of Saint October and Yes! Pretty Cure 5.

Advantages:
* Can work on any hardsubs, regardless of their nature, or whether non-English scripts are available or not.
* Once setup process is learned, is as easy as typing on a word processor.
* Fast with proper setup, only limited by one's typing speed.
* With proper care, can avoid introducing errors and even fix errors in original subs.
* Aegisub features can make styling nearly automatic.

Disadvantages:
* Labor-intensive, can be tedious.
* Cannot multitask, aside from maybe listening to music.
* Can introduce typos and other errors that can be hard to detect.
* Not everything may be covered in non-English scripts -- some manual addition and retiming may still be necessary.
* Timing in non-English scripts might not line up with English fansub timing, thus adding extra work.

Comparisons and Conclusions:

Obviously, getting TV-rip scripts directly from original fansub staffers or softsubbed .mkvs is the easiest route to go, doubly so if signs and/or karaoke were softsubbed. Between OCR and transcription, I have to choose transcription. While the semi-automatic nature of OCR is nice (and I burned through many backlogged rewatches), the errors it introduces are aggravating. Transcribing an episode takes more keystrokes, but no more time than an average OCR job -- no more than 40-50 minutes per ep, while OCR can run longer due to color issues and bad luck. Transcribing allows me to fix errors from the original subs in the process, or even rewrite lines on the fly. Luckily, I'm conscientious enough to avoid introducing new mistakes in transcription, or at least aware enough to catch them during retiming/editing/QC. Transcription also offers a greater (that is, non-zero) chance than OCR of catching signs and dialogue text outside the normal subtitle area.