Imagine yourself running away from a horde of bugs, knowing you have a beautiful Eagle cluster bomb in your pocket, but you’re fumbling. Your fingers just can’t pull off the input, and with every second that passes your mental starts to slip more and more.

If you’re a newbie helldiver like me, look no further. Now without missing a beat you can keep running and simply command your destroyer with voice-activated commands!

Here’s a demo:

Intro

Helldivers 2 is a third-person co-op shooter. You have your standard primaries, secondaries, grenades, and your Super Destroyer perfectly equipped to further spread Managed Democracy across the galaxy. Each game, you can choose certain stratagems that can be called down by inputting a certain sequence of arrows (left, right, up, down). I thought it would be fun to make it so that I could ask for these stratagems with my voice.

Voice Detection

I primarily used this repo: https://github.com/KoljaB/RealtimeSTT. It basically has everything you want for voice transcription.

Initially, I saw in the repo that there was support for certain wake words, like ‘hey siri’, ‘alexa’, or even ‘terminator’. I of course chose terminator. This worked well enough: I could log when the wake word was detected along with all the subsequent transcribed audio. The main problem was that the latency was slightly too high to be comfortable. I had to give a small pause before talking so it wouldn’t miss the start, and there was just enough of a delay after I finished speaking to make me wonder whether something had gone wrong before the command actually registered.
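For reference, the wake-word setup is only a few lines. I’m going from memory on the exact parameter names (`wake_words`, `on_wakeword_detected`), so treat this as a sketch rather than gospel:

    from RealtimeSTT import AudioToTextRecorder

    def on_wakeword():
        print("Wake word detected, listening for a command...")

    def handle_text(text):
        print(f"Heard: {text}")

    if __name__ == '__main__':
        # Wake-word mode: the recorder idles until it hears "terminator",
        # then transcribes whatever follows until it detects silence.
        recorder = AudioToTextRecorder(
            model="tiny.en",
            wake_words="terminator",
            on_wakeword_detected=on_wakeword,
        )
        while True:
            recorder.text(handle_text)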

Additionally, it felt constraining to use a preset list of mostly corporate wake words. I wanted this to be more on theme. Another small issue is that since I was running purely on CPU with the tiny.en model, the transcription wasn’t amazingly precise, especially at the beginning of sentences when there was less context. This meant that sometimes ‘terminator’ would be interpreted as:

  • Time and later
  • It’s time for me later
  • Time and anger
  • Time vinegar
  • etc. if I was sloppy with the enunciation.

The desire to have my custom wake words plus some method to make the detection a little more robust led me to the second method: constant real time transcription + custom callback function.

With crtt+ccf (rolls off the tongue for sure), I set my program to constantly transcribe what I was saying. The library provides an “on_update” argument that accepts a function to call whenever the transcription is updated/stabilized. I wrote a function that 1. checked whether the transcription contained any of my wake words and, if it did, 2. mapped the subsequent command to game inputs.
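Roughly, the setup looks like the sketch below. The parameter names `enable_realtime_transcription` and `on_realtime_transcription_stabilized` are how I remember the library exposing the callback, and `handle_transcription` plus the wake-word list are my own placeholders:

    from RealtimeSTT import AudioToTextRecorder

    WAKE_WORDS = ["foundation", "destroyer"]

    def handle_transcription(text):
        # Fires every time the realtime transcription updates/stabilizes.
        cleaned = text.lower().strip(" .,!?")
        for wake_word in WAKE_WORDS:
            if f"{wake_word} " in cleaned:
                command = cleaned.split(f"{wake_word} ", 1)[1]
                # ...fuzzy-match `command` and send the key presses (below)...
                break

    if __name__ == '__main__':
        recorder = AudioToTextRecorder(
            model="tiny.en",
            enable_realtime_transcription=True,
            on_realtime_transcription_stabilized=handle_transcription,
        )
        while True:
            recorder.text()  # keeps the recorder loop alive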

This method has empirically lower latency already (I’m guessing because it doesn’t have to run the wake word detection and then start up the real-time transcription), and it doesn’t have to wait for silence to register a command, which just feels more natural and shaves off a bit more latency.

I could also simply supply a list of words as my designated wake words. I first chose Foundation and Foundations, on the reasoning that they are crisp enough to pronounce that the model wouldn’t confuse them with other words. Later, I changed it to Destroyer for the flavor and immersion.

Small optimization hacks

A small optimization I did was to forcibly shorten the amount of audio stored so the transcriptions would be faster. By default, the library seems to keep the chunk of audio from when you started speaking until it detects a long enough period of silence. I’ve heard that Whisper (the underlying transcription model) will use the last 30 seconds of audio, and while that’s great for real transcription work with context, I just needed ~2 seconds of audio for it to recognize a command.

The AudioToTextRecorder class I was using didn’t seem to natively support this, but a quick perusal of the code suggested that self.frames was a list that held the raw audio data. So it seemed reasonable enough that I could trim this list during the callback and reduce the computational load of the transcription.

What follows here is an absolute novice’s understanding of “audio processing” that only approximately worked, so take it with a grain of salt.

    def snip_frames(self):
        # How many raw frames correspond to the window we want to keep
        # (sample_rate * frame_buffer_seconds, e.g. 16000 * 2 = 32000 frames).
        total_frames = self.sample_rate * self.frame_buffer_seconds
        # recorder.frames actually holds buffers of buffer_size frames each,
        # so convert the frame count into a buffer count.
        total_buffers = total_frames / self.buffer_size

        frame_limit = round(total_buffers)
        # Keep only the most recent buffers and drop everything older.
        self.recorder.frames = self.recorder.frames[-frame_limit:]

My understanding is that if I want 2 seconds of audio, I need to find how many frames that is. Normally that is just sample_rate * seconds (16,000 * 2 = 32,000 frames). However, recorder.frames here internally seems to actually hold audio buffers, each of which contains a certain number of frames. The class defaults buffer_size to 512, and the docstring suggests changing it is a bad idea.

So we have the simple enough total_frames / buffer_size = 32,000 / 512 = 62.5, which gets rounded down to 62.

Below is a small snippet of debugging output, the top number is the number of audio buffers going into the callback function, and the bottom number is the amount after snipping the audio.

77
Raw  : Okay, just wanted to test it.
Clean: okay just wanted to test it
62

We can use this to approximate how quick a transcription cycle is: the callback came in with 77 buffers while the previous snip had left 62, so roughly 15 buffers of new audio (15 * 512 / 16,000 ≈ 0.5 seconds) accumulated over one cycle.

To be fair, even when recording longer amounts of audio, the median transcription speed usually stays the same. However, I do observe longer tails when I store more audio. For example, every once in a while there will be a particularly long transcription period - up to multiple seconds - that only seems to occur when storing more audio. I can usually trigger this by counting 1-10 at a moderate pace, but it will trigger during normal speech as well. Similarly, sometimes there’s a stretch where transcription is only slightly slower, maybe 25 buffers’ worth, which I think is the result of a particularly onerous bit of audio, but I’m not sure.

Finally, the last little improvement was to give the transcription model a prompt. I just gave it the wake words, and this gave it much more context. Before it would often transcribe “destroyer” to “this right here” and after giving it context the model transcribed the right thing every time. I initially prompted it with instructions like “Listen for the word destroyer” but the extra context seemed to add some latency, and just giving it “destroyer” worked beautifully.
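Assuming the library passes this straight through to Whisper as `initial_prompt` (that parameter name is my best guess from skimming the code), it’s a one-argument change:

    recorder = AudioToTextRecorder(
        model="tiny.en",
        enable_realtime_transcription=True,
        on_realtime_transcription_stabilized=handle_transcription,
        initial_prompt="destroyer",  # bias the model toward the wake word
    )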

Voice to inputs

The AudioToTextRecorder class gives easy access to the transcription. A quick pass removes annoying punctuation and capitalization. It’s easy to check whether any of the wake words are in the string - the only caveat is to check for f"{wake_word} " with that trailing space, so that “alexa” doesn’t match “alexander”. Then we can grab the rest of the sequence after the wake word and check whether it matches any of our commands.

I try to include a reasonable amount of leniency for approximate matches, once again because the transcription is not 100% accurate. The Python standard library includes a package named difflib, which includes a class called SequenceMatcher. You can pass two strings to this class and call .ratio(), which returns a measure of the strings’ similarity. To avoid everything matching, I use a threshold of 0.75 to ensure that any approximate matches are at least kind of close.
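Here’s a minimal sketch of the matching; the COMMANDS dictionary is a made-up subset of the real one:

    from difflib import SequenceMatcher

    # Hypothetical subset of the spoken-phrase -> internal-command mapping.
    COMMANDS = {
        "orbital gas strike": "orbital_gas_strike",
        "orbital ems strike": "orbital_ems_strike",
        "orbital airburst strike": "orbital_airburst_strike",
    }

    def match_command(spoken, threshold=0.75):
        # Score the spoken phrase against every known command phrase.
        matches = [
            (SequenceMatcher(None, spoken, phrase).ratio(), phrase, command)
            for phrase, command in COMMANDS.items()
        ]
        matches = [m for m in matches if m[0] >= threshold]
        if not matches:
            return None
        # Take the best ratio; ties are arbitrary.
        return max(matches, key=lambda m: m[0])[2]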

Below is example output of the program picking up my wake word (foundation), grabbing the rest of the sequence (orbital strike), and displaying the matches. The elements of each tuple are (the ratio, the phrase matched against, the command it maps to). Matches are sorted by ratio and the largest is chosen - ties are arbitrary.

Raw  : Foundation orbital strike.
Clean: foundation orbital strike
orbital strike
Raw  : orbital strike
Clean: orbital strike
[
	(0.7567567567567568, 'orbital airburst strike', 'orbital_airburst_strike'), 
	(0.8235294117647058, 'orbital laser strike', 'orbital_laser_strike'), 
	(0.875, 'orbital gas strike', 'orbital_gas_strike'), 
	(0.875, 'orbital ems strike', 'orbital_ems_strike'), 
	(0.8235294117647058, 'orbital smoke strike', 'orbital_smoke_strike')
]
['right', 'right', 'down', 'right']

Here is another example. Airburst is misinterpreted as airbrush but because of the approximate matching it registers the command anyways.

Raw  : Foundation Airbrush.
Clean: foundation airbrush
airbrush
Raw  : airbrush
Clean: airbrush
[(0.75, 'airburst', 'orbital_airburst_strike')]
['right', 'right', 'right']

Finally, once a command is registered, I have a dictionary that contains the arrow keys to press. I use pyautogui to send the key presses, so in-game this translates to calling down the stratagem. I had to play around with the timing; the default was too fast and the game would barely register the key presses. After some experimenting, a 0.05 s delay between presses proved to work well and still be quite quick!
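The key-press side is barely more than a dictionary and one pyautogui call; the stratagem entries below are just illustrative examples:

    import pyautogui

    # Hypothetical command -> arrow-key sequence mapping.
    STRATAGEMS = {
        "orbital_gas_strike": ["right", "right", "down", "right"],
        "orbital_airburst_strike": ["right", "right", "right"],
    }

    def call_stratagem(command):
        keys = STRATAGEMS[command]
        # 0.05 s between presses; any faster and the game drops inputs.
        pyautogui.press(keys, interval=0.05)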

Lastly, I compiled a list of shortcuts as well, so instead of saying “Orbital 380mm HE barrage” (which I’m not sure could be transcribed that well), I have a shortcut that maps “380” to that command. Additionally, I had to map “nuke” to “hellbomb” because hellbomb is not a very common phrase, and “cf” or “artillery” to “seaf artillery” because SEAF is a fictional organization the model has never heard of.
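These shortcuts are just extra aliases pointing into the same phrase-to-command mapping, along the lines of (command names here are hypothetical):

    # Hypothetical aliases pointing at the same internal commands.
    SHORTCUTS = {
        "380": "orbital_380mm_he_barrage",
        "nuke": "hellbomb",
        "artillery": "seaf_artillery",
    }
    COMMANDS.update(SHORTCUTS)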

Conclusion

This was definitely a quick and dirty project. I didn’t take that much time to understand and optimize the audio processing - I think there is a lot of room to improve the latency even further. Also, in an ideal world, I would somehow be able to send commands to the game directly (although that might trigger the anti-cheat!). Right now, you have to make a mental note not to interrupt your character while the commands are being inputted. Proning, shooting your gun, pulling up the minimap/radar, etc. will interrupt the process, but you can still run around (which is great when you’re retreating; the default PC keybinds force you to use the same keys for moving and calling down stratagems).

There is also a moment of lag in-game (noticeable in the demo) when you’re running in any direction other than straight ahead while the command is being input. I don’t know why this occurs, especially because running around and mashing the arrow keys manually does not trigger the lag.

Strangely, when you haven’t said anything into the mic at all, the transcription somehow defaults to nonsense like “Thanks for watching.” and “Thank you for watching”. This behavior has occurred from the beginning and I don’t think I did anything strange to cause it. My only guess is that the model was trained on a lot of grateful YouTube videos…

Lastly, it’s fun to play with when you’re not on comms, but it gets old pretty quick when you announce every single stratagem to the team…

You can find the code at https://github.com/Chewybanana/HD2Voice