Building a speech recognition and speech-to-text tool

In 2010, I was sitting in my room watching TV when I saw an Xbox commercial. Kinect, Microsoft's line of motion-sensing devices, had just been launched and, with it, a new way to play. I was fascinated by the capabilities of this new technology, and an idea came to mind: what if we took advantage of Kinect's features outside of the console?

There were so many options and so much to create. Kinect could be connected to robots to perform medical surgeries or deliver packages on the battlefield. It could be integrated into your home to make it what we now call "smart". It could also be paired with hardware to accept voice commands, so that people with disabilities could work with computers and take jobs that require operating specific machinery, or even do away with the keyboard entirely. So many images came to mind, as if I were watching a futuristic movie. These ideas would become reality later; the people reading this article probably already have an Alexa at home, use Augmented Reality on their cell phones, or have used image recognition to simplify daily tasks. Well, I managed to materialize these concepts and ideas with Kinect.

The process

I have a methodical design thinking process that is personal and works for me, and I have divided it into four steps. I like working this way because it allows me to imagine and consider every scenario before starting, so I can set goals without any restrictions. At this stage, my imagination is the only limit.

I then take the time to read and investigate as much as possible, so that the implications of my ideas become clear in terms of cost, scope, time, and effort. In the end, much like defining an MVP under the Scrum framework, this process lets me define my Minimum Viable Product. Only after that do I start building the MVP; until then, I refuse to act on vague ideas.

Once the MVP is finished, I can add more features to it. I am always mindful of setting clear and achievable short-term goals.

1. Dream Stage

When something catches my attention, it doesn't leave my mind, so I decided to make my ideas come true. But what did I need to materialize an idea with such a huge challenge? In 2010 the technology was brand new, and Microsoft would not be delivering an SDK (software development kit) to developers anytime soon. At that point, how a new technology works is known only to the engineers who created it; there are no manuals or other sources of information.

This led me to create things from scratch and trust the personal process I regularly follow before starting something like this. It usually takes me a few days or a week to organize my thoughts, but I don't take notes. I just wander from one idea to another, trying to imagine as many scenarios as possible to determine what I need and what could go wrong.

2. Research Stage

Microsoft had not released any SDK for the Kinect at that time, so I dove deep into the web to find people who had already disassembled the hardware and extracted the DLLs (dynamic link libraries) that make the magic happen. I finally found them on a Russian forum. The next steps were relatively easy from there: once you have the libraries to work with, it's just a matter of reading their contents. While I was at it, I purchased three Kinect sensors to disassemble, to understand the hardware's capabilities.
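To give a sense of what working against those raw libraries looks like, here is a minimal Python sketch using ctypes. The DLL file name and the exported function are hypothetical placeholders, not the actual symbols from the reverse-engineered drivers.

```python
# Hypothetical sketch: loading an unofficial Kinect DLL with ctypes.
# "KinectAudio.dll" and StartAudioCapture are placeholder names, not
# the real symbols from the reverse-engineered libraries.
import ctypes

# Load the dynamic link library extracted from the Kinect drivers.
kinect = ctypes.WinDLL("KinectAudio.dll")

# Declare the return type, then call the export like a Python function.
kinect.StartAudioCapture.restype = ctypes.c_int
result = kinect.StartAudioCapture()
if result != 0:
    raise RuntimeError(f"Audio capture failed with code {result}")
```

Once the exports are mapped out this way, the rest of the work is reading what each function expects and returns.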

3. Creation Stage

At this point, I had everything I needed to get started. This is my favorite part, because you just dive deep into it; you start a long-term relationship with the project. This is the moment when you become a creator. I was coding when I realized that the DLLs only interacted with the hardware; the code was still missing whatever would make the Kinect listen to and understand the user. That was when I discovered I could probably use the dictionary that ships with Windows to translate spoken words into text.

This step was necessary because the Kinect DLLs only contained functions to capture audio. It was impossible to determine whether the speaker was using English or another language, or to identify the words uttered. By adding the Windows dictionary, just as the operating system itself does, you can instruct the engine which language to work with and, most importantly, provide a set of words to compare against the received audio. Thus, my project began to "understand" me as I spoke.
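The original work used the Windows speech engine, but the same idea, restricting recognition to a known word list, can be sketched in Python with the third-party SpeechRecognition and pocketsphinx packages. The command words below are illustrative.

```python
# A modern analogue of dictionary-constrained recognition, using the
# SpeechRecognition + pocketsphinx packages (not the original Windows
# speech engine). Each entry pairs a keyword with a sensitivity.
import speech_recognition as sr

recognizer = sr.Recognizer()

# Only these words are matched against the incoming audio, much like
# the word set fed to the Windows engine back then.
commands = [("open", 0.8), ("close", 0.8), ("write", 0.8), ("stop", 0.8)]

with sr.Microphone() as source:
    recognizer.adjust_for_ambient_noise(source)
    audio = recognizer.listen(source)

try:
    # keyword_entries restricts the engine to the given (word, sensitivity) pairs.
    text = recognizer.recognize_sphinx(audio, keyword_entries=commands)
    print("Heard:", text)
except sr.UnknownValueError:
    print("Audio did not match any known command")
```

Constraining the vocabulary is what makes this approach workable without any training: the engine only has to decide which of a handful of known words it heard.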

I integrated several pieces of third-party software and hardware using the Kinect sensors and their libraries. For example, I made it possible to navigate any Windows program or write inside text boxes to fill out a form, all without using the mouse or keyboard. In Microsoft Word, I could move the cursor by waving my hands, without touching the mouse, and write on the page by dictating, without using any keyboard. I also built a Lego electric car and moved it without physical interaction, just by moving my hands in front of the camera sensors to indicate which direction it should go. The dream had finally come true.
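The glue between recognition and those integrations is essentially a dispatch table: each recognized phrase is bound to one action. A minimal sketch follows; the handler functions are placeholders for the real Word and Lego-car integrations.

```python
# Hypothetical sketch of the dispatch layer: recognized phrases map to
# actions on third-party software or hardware. The handlers below are
# placeholders for the real integrations.
def open_word():
    print("Launching Microsoft Word...")

def move_car_forward():
    print("Sending 'forward' to the Lego car controller...")

# Each spoken command resolves to exactly one action.
COMMAND_TABLE = {
    "open word": open_word,
    "forward": move_car_forward,
}

def dispatch(recognized_text: str) -> None:
    """Run the action bound to a recognized phrase, if any."""
    action = COMMAND_TABLE.get(recognized_text.strip().lower())
    if action is not None:
        action()
    else:
        print(f"No action bound to: {recognized_text!r}")

dispatch("Open Word")
```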

4. Refinement Stage

Finally, it was time to improve my project by adding features. While analyzing the Kinect hardware, I discovered a branch of engineering I hadn't known about: it worked with images and was called digital image analysis.

I discovered that we could use the Kinect's two types of cameras to detect the depth of the body and even of a hand. Depth sensing lets you measure proximity to the sensor, so you can play with more variables than just the x, y, and z axes; it can also detect facial gestures and hand positions, which can be used to interact with and integrate into multiple systems in different ways.
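As an illustration of what depth data gives you, here is a sketch that reads one depth frame and derives a crude proximity signal, assuming the open-source libfreenect Python bindings (the community driver for the original Kinect). The threshold value is an assumption, tuned per setup.

```python
# Sketch of reading a depth frame via the open-source libfreenect
# Python bindings for the original Kinect.
import freenect
import numpy as np

# sync_get_depth() returns an 11-bit depth image as a numpy array.
depth, _timestamp = freenect.sync_get_depth()

# The closest point in the frame gives a crude proximity signal:
# x, y locate it on the image plane, and the depth value acts as z.
y, x = np.unravel_index(np.argmin(depth), depth.shape)
z = depth[y, x]
print(f"Closest point at x={x}, y={y}, raw depth={z}")

# A simple threshold turns proximity into a trigger, e.g. "hand raised".
if z < 600:  # raw depth units; the cutoff is tuned empirically
    print("Hand detected near the sensor")
```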

Soon after, I was able to perform basic sentiment analysis without any AI training, focusing on facial gestures. Of course, my sentiment analysis from back then looks quite simple compared with the current state of the art; today there is an entire specialization within Artificial Intelligence dedicated exclusively to improving and updating sentiment analysis algorithms. As for other features, I was able to control the mouse, open and close applications, and add dictation and automatic writing to Microsoft Word.
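Sentiment analysis without training boils down to hand-tuned rules over facial measurements. The sketch below shows the shape of such a classifier; the landmark ratios and thresholds are invented for illustration, not taken from the original project.

```python
# Hypothetical rule-based expression classifier: no AI training, just
# fixed thresholds over facial measurements. The ratios are relative to
# the face's bounding box and the cutoffs are illustrative.
def classify_expression(mouth_width: float, mouth_height: float,
                        brow_to_eye: float) -> str:
    """Crude, threshold-based reading of a face."""
    if mouth_width > 0.45 and mouth_height < 0.10:
        return "smiling"
    if mouth_height > 0.20:
        return "surprised"
    if brow_to_eye < 0.05:
        return "frowning"
    return "neutral"

print(classify_expression(mouth_width=0.50, mouth_height=0.08, brow_to_eye=0.12))
```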

Conclusion

Today we have smaller sensors that allow us to perform the same integrations I did almost a decade ago. What surprises me every time I look back on this period is that, despite so many years having passed, the technology still works the same way. Sensors have become smaller and hardware upgrades have improved the quality of environmental stimulus detection, but the underlying logic and algorithms remain the same.

And for anyone willing to attempt something that seems unattainable right now, I recommend following my path. Let your imagination run wild and you will find at least one viable idea. Start your journey, and once you have your MVP, take another look at the seemingly unfeasible ideas. You will probably be able to materialize them by then.
