It’s been a couple of weeks since I made the original pitch for machine learning transcription in flight sim video editing. Since then, I’ve had a couple of work sessions to test the hypothesis.
Here are two versions of a video that my automation workflow auto-cut from a 40-minute original.
The first takes every speech clip of two seconds or longer and strings them together with no buffer.
Here’s the second - still a 2-second minimum, but with a one-second buffer at the end of each segment.
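For the curious, the selection rule is simple enough to sketch. Here’s a minimal, illustrative version, assuming Amazon Transcribe’s JSON output layout; the one-second gap threshold and all the names here are mine, not the workflow’s actual code.

```python
import json

# Assumed knobs; the post only fixes the 2-second minimum and 1-second buffer.
MIN_CLIP_SECONDS = 2.0
END_BUFFER_SECONDS = 1.0     # 0.0 reproduces the first, buffer-less version
MAX_WORD_GAP_SECONDS = 1.0   # silence long enough to split two segments

def speech_segments(transcript_path, buffered=True):
    """Turn an Amazon Transcribe JSON file into (start, end) clip times."""
    with open(transcript_path) as f:
        items = json.load(f)["results"]["items"]

    # Keep only spoken words; punctuation items carry no timestamps.
    words = [(float(i["start_time"]), float(i["end_time"]))
             for i in items if i["type"] == "pronunciation"]
    if not words:
        return []

    # Group words separated by short gaps into continuous speech segments.
    segments = []
    seg_start, seg_end = words[0]
    for start, end in words[1:]:
        if start - seg_end > MAX_WORD_GAP_SECONDS:
            segments.append((seg_start, seg_end))
            seg_start = start
        seg_end = end
    segments.append((seg_start, seg_end))

    # Keep 2-second-or-longer segments, optionally padding the ends.
    pad = END_BUFFER_SECONDS if buffered else 0.0
    return [(s, e + pad) for s, e in segments if e - s >= MIN_CLIP_SECONDS]
```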
Clearly, we’ve got some issues.
Speech is interesting - but it’s not the only interesting thing. The algorithm fails to capture carrier launches and landings, missile shots, and dogfights.
Buffering the ends is a trade-off. Without a buffer, cuts are jarring and some clips end early. With one, you’ll sometimes hear the same speech twice.
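That duplication has a mechanical cause: a one-second buffer can push a clip’s end past the start of the next clip, so the same words get cut twice. One possible fix, which the workflow doesn’t do yet, would be to merge overlapping clips before cutting:

```python
def merge_overlapping(segments):
    """Collapse clips whose buffered end runs into the next clip's start."""
    if not segments:
        return []
    merged = [segments[0]]
    for start, end in segments[1:]:
        last_start, last_end = merged[-1]
        if start <= last_end:  # the buffer reached into the next segment
            merged[-1] = (last_start, max(last_end, end))
        else:
            merged.append((start, end))
    return merged
```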
How it Works:
The ML Transcription Workflow - Bash, Python, S3, Amazon Transcribe, Lightsail and Lambda
The overall workflow is a series of small scripts, each handling a single step in the flow and operating on a collection of files in a particular S3 bucket or folder. Processed files land in the next bucket, where the following step picks them up.
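The handoff between steps is just S3 object copies. A rough sketch with boto3, using placeholder bucket names since the post doesn’t name the real ones:

```python
import boto3

s3 = boto3.client("s3")

def promote(key, bucket_in="flightsim-step-1", bucket_out="flightsim-step-2"):
    """Move a processed file to the bucket the next script watches."""
    # copy_object tops out at 5 GB; large videos would need a multipart copy.
    s3.copy_object(Bucket=bucket_out, Key=key,
                   CopySource={"Bucket": bucket_in, "Key": key})
    s3.delete_object(Bucket=bucket_in, Key=key)
```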
Most of the compute runs on a Lightsail instance (a simplified EC2 VM), loaded with Python 3.7, the AWS CLI, and ffmpeg, a command-line audio and video processor.
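ffmpeg does the heavy lifting on that box: ripping the audio track out for Transcribe, then lifting clips back out of the original video. A sketch of both calls through Python’s subprocess; paths and quality settings are illustrative:

```python
import subprocess

def extract_audio(video_path, mp3_path):
    """Rip the audio track so Transcribe has an MP3 to work on."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,
         "-vn",                             # drop the video stream
         "-codec:a", "libmp3lame", "-q:a", "4",
         mp3_path],
        check=True)

def cut_clip(video_path, start, end, clip_path):
    """Lift one speech segment out of the original recording."""
    # -c copy is fast but snaps to keyframes, one likely source of
    # jarring cut points; re-encoding would be frame-accurate but slow.
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start), "-i", video_path,
         "-t", str(end - start), "-c", "copy", clip_path],
        check=True)
```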
There’s one Lambda function (serverless compute) that watches one of the S3 buckets and hands off MP3 files to Amazon Transcribe.
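A minimal sketch of what that Lambda might look like, assuming an S3 “object created” trigger; the output bucket and job-naming scheme are placeholders:

```python
import boto3
from urllib.parse import unquote_plus

transcribe = boto3.client("transcribe")

def handler(event, context):
    """Start a transcription job for each MP3 dropped in the watched bucket."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        transcribe.start_transcription_job(
            # Job names must be unique and limited to [0-9a-zA-Z._-].
            TranscriptionJobName=key.replace("/", "-"),
            Media={"MediaFileUri": f"s3://{bucket}/{key}"},
            MediaFormat="mp3",
            LanguageCode="en-US",
            OutputBucketName="flightsim-transcripts",  # placeholder
        )
```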
It probably makes sense to convert the rest of the compute steps into Lambdas as well - the Lightsail instance also runs my website, and it shouldn’t be bogged down with video processing and AWS I/O.