As I pitched in an earlier post, making non-flyers feel like they're up there with you - experiencing the engaging parts of flying, combat or civil - is the low-expense, high-yield way to drive more community engagement in the world of flight sims.
The key medium for doing this is video. All the bells and whistles - from chat to "choose your own adventure" Twitch-Plays-Pokemon-style remote flight stuff - ultimately have video at their base.
And flying is 90% going in a straight line looking at instruments. Look, I'm one of those weirdos who sees that as a feature, not a bug - I get an odd mix of thrill and calm out of babysitting my engines (and I have some ideas about making that 90% more instructive and dramatic; we'll talk later) - but in general that 90% doesn't make great TV. The interesting 10% can.
Edited for interest, a 90-minute flight session boils down to 5 to 20 minutes of interesting bits strung together with some kind of narrative. Creating that narrative - and seeking, assembling, and producing the string of clips - takes forever.
I’m not exaggerating - let’s look at some math.
Let's assume a 4-minute video is made of fifteen 16-second clips, each containing a spoken track with some padding around the start and end. Spoken phrases aren't likely to run longer than 8 seconds (try talking for 8 seconds straight), so adding 4 seconds to either end gives you 16 seconds per clip.
Let’s say we need to “seek” fifteen clips. How long will that take?
At minimum, you watch the source once through - but that doesn't account for mistakes. If every seek adds two minutes of scrubbing and messing around, that's another 30 minutes on top of the source runtime.
The equation is:

SOURCE TIME + (2 MIN × NUMBER OF CLIPS)

- Seeking 15 clips from a 60 min video = 60 + 30 = 90 min
- Seeking 15 clips from a 90 min video = 90 + 30 = 120 min
- Seeking 20 clips from a 90 min video = 90 + 40 = 130 min
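The napkin math above fits in a few lines of code. This is just a sketch - `seek_time_min` and `fiddle_min` are names I made up for illustration:

```python
def seek_time_min(source_min, clips, fiddle_min=2):
    """Napkin estimate of clip-seeking time: watch the source once
    through, plus ~2 minutes of scrubbing and fiddling per clip."""
    return source_min + fiddle_min * clips

print(seek_time_min(60, 15))  # 90
print(seek_time_min(90, 15))  # 120
print(seek_time_min(90, 20))  # 130
```

The fixed per-clip cost is the lever: shrink the two minutes of fiddling and the whole estimate collapses toward the source runtime.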
This isn’t just hypothetical. Here - take a look at this supercut of a 100-minute flight session I did with my wingman and AWACS:
Cutting that down took 2 hours and 30 minutes - 150 minutes total - with no other editing or post-production. It came out to about 30 clips' worth. Pretty darn close to my napkin math up there: 100 + (2 × 30) = 160 minutes.
The majority of the time went to finding the interesting parts and lining up the start and end of relevant dialogue.
I already spent 100 minutes flying. Am I going to spend that plus half again on just cutting up my video?
There’s gotta be a better way.
Enter ML Transcription
Here's the thing. We already know there's a ton of non-interesting stuff going on while flying. Where's the interesting stuff usually at? Takeoff and landing - we know those.
But it’s also when we’re speaking.
Instructing, talking on the radio, talking to the audience, yelling in the heat of combat - talking is where the story gets told. If we look for the words and cut at those beats, we’ll get the narrative.
What if we can get at the words quicker?
Enter machine-learning-backed transcription. For this test I used Amazon Transcribe, a feature of Amazon Web Services, which I'm already using to host this site. You can read all about it; the short version is that you hand it a video or audio track, and you get back a transcript with a heap of metadata that you can ingest into whatever it is you need it for.
For my purposes, what I want is a transcript lined up with video timecodes. If I can tie a timecode to the start and end of relevant pieces of dialogue, I can scrub to those timecodes very quickly – maybe even automatically using scripting – and set my In/Out markers for clip assembly.
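As a sketch of that idea: the JSON shape below matches Amazon Transcribe's real output format (word-level `start_time`/`end_time` on `pronunciation` items), but the gap threshold, the 4-second padding, and the function name are my own assumptions, not anything from Transcribe itself:

```python
PAD = 4.0  # seconds of padding before/after each phrase (from the math above)
GAP = 3.0  # assumed: silence longer than this starts a new clip

def phrases_to_markers(transcribe_json, pad=PAD, gap=GAP):
    """Group word timestamps from an Amazon Transcribe result into
    padded (in, out) markers for clip assembly."""
    words = [
        (float(item["start_time"]), float(item["end_time"]))
        for item in transcribe_json["results"]["items"]
        if item["type"] == "pronunciation"  # punctuation items carry no times
    ]
    markers = []
    for start, end in words:
        if markers and start - markers[-1][1] <= gap:
            markers[-1][1] = end        # close enough: extend current phrase
        else:
            markers.append([start, end])  # long silence: start a new phrase
    return [(max(0.0, s - pad), e + pad) for s, e in markers]

# Tiny hand-made sample in Transcribe's output shape:
sample = {"results": {"items": [
    {"type": "pronunciation", "start_time": "12.0", "end_time": "12.4"},
    {"type": "pronunciation", "start_time": "12.5", "end_time": "13.1"},
    {"type": "punctuation"},
    {"type": "pronunciation", "start_time": "95.0", "end_time": "95.6"},
]}}
print(phrases_to_markers(sample))  # two padded phrases, one per burst of speech
```

From there, each (in, out) pair can be scrubbed to directly - or fed to an editor's scripting interface to drop In/Out markers automatically.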
I gave it a try, and timed it out as a comparison.
Here’s a different 100-minute video that I fed into the transcription engine, using the timecodes it spit back as a way to set In/Out clip markers.
This took only about 70 minutes - less than half the 150 minutes the manual pass took.
A lot of that time can still be trimmed, too: this was a first try, and I fumbled a lot. Here's what I learned; these tips might help you go faster: