The challenges of natural language processing and social media

Natural Language Processing is a powerful tool for exploring opinions on social media, but the process has its own problems.


Natural Language Processing is a field of computer science, more specifically a field of Artificial Intelligence, which is concerned with developing computers with the ability to perceive, understand and produce human language.

Language analysis has largely been a qualitative field that relies on human interpreters to find meaning in speech. As powerful as human interpretation is, it has its limitations, the first being that humans carry unconscious biases that distort how they understand information.

The other issue, and the more relevant one for us, is the limited capacity of humans to consume data: most adults can only read about 200 to 250 words per minute, and college graduates average about 300 words per minute.

To put these numbers into perspective, the average book runs between 90,000 and 100,000 words. At those reading speeds, it would take a typical reader roughly seven hours to finish a normal-sized book. 100,000 words may sound like a lot, but it's actually a tiny fraction of the amount of language produced every day on social media.

Twitter, a social media platform built on 280-character messages, averages 500 million tweets per day. Assuming about 20 words per tweet, that's roughly 10 billion words, the equivalent of about 100,000 books of information, every day. And that's just one social media platform.
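As a quick sanity check, here is the back-of-the-envelope arithmetic behind those figures in Python; the reading speed, book length, and words-per-tweet values are the assumptions stated above, not measured data:

```python
READING_SPEED_WPM = 250       # typical adult reading speed (words per minute)
BOOK_LENGTH_WORDS = 100_000   # upper end of an average book
TWEETS_PER_DAY = 500_000_000  # Twitter's reported daily volume
WORDS_PER_TWEET = 20          # rough assumption from the text

# Hours a typical reader needs to finish one book
hours_per_book = BOOK_LENGTH_WORDS / READING_SPEED_WPM / 60
print(f"Hours per book: {hours_per_book:.1f}")  # ~6.7 hours

# Daily tweet volume expressed in book-equivalents
words_per_day = TWEETS_PER_DAY * WORDS_PER_TWEET
books_per_day = words_per_day / BOOK_LENGTH_WORDS
print(f"Book-equivalents per day: {books_per_day:,.0f}")  # ~100,000 books
```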

Collecting Big Data

Any researcher focusing on social networks has to deal with large amounts of data. Manually collecting and analyzing data is at best inefficient and at worst a complete waste of time. So what is the solution?

Collecting data programmatically. Most social media platforms have APIs that allow researchers to access their feeds and sample data. And even without an API, web scraping is a practice as old as the internet itself, right?

Web scraping refers to the practice of locating and extracting information from web pages, either manually or through automated processes (in practice, the automated kind is far more common).
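To make the idea concrete, here is a minimal scraping sketch using the requests and BeautifulSoup libraries. The URL and the CSS selector are placeholders, and, given the legal gray area discussed below, any real scraper should first check the site's terms of service and robots.txt:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical target page -- replace with a site you are allowed to scrape,
# and check its robots.txt and terms of service first.
URL = "https://example.com/public-posts"

response = requests.get(URL, timeout=10)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")

# "post-text" is an assumed CSS class; inspect the real page for the right selector.
posts = [element.get_text(strip=True) for element in soup.select(".post-text")]

for post in posts:
    print(post)
```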

Unfortunately, web scraping falls into a legal gray area. Facebook v Power Ventures Inc is one of the best-known examples of a big technology company trying to combat the practice. In that case, Power Ventures had built an aggregator site that let users pull together their own data from different services, including LinkedIn, Twitter, Myspace, and AOL.

One of the biggest challenges when working with social media is having to manage several APIs at the same time, in addition to understanding the legal limitations of each country. For example, Australia is pretty lax about web scraping, as long as it's not used to collect email addresses.

Another challenge is understanding and navigating developer account levels and APIs. Most services offer free tiers with some pretty major limitations, like the size of a query or the amount of information you can collect each month.

For example, in the case of Twitter, the Search API sandbox allows up to 25,000 tweets per month, while a premium account offers up to 5 million. The former is more suitable for small-scale or proof-of-concept projects, the latter for larger projects.
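As an illustration, a minimal request against Twitter's v2 recent-search endpoint might look like the sketch below. The bearer token is a placeholder, and the exact endpoints, quotas, and account tiers have changed over time, so check the current developer documentation before relying on any of them:

```python
import requests

# Placeholder credential -- obtained from a Twitter developer account.
BEARER_TOKEN = "YOUR_BEARER_TOKEN"

# v2 recent-search endpoint (covers roughly the last 7 days of tweets).
url = "https://api.twitter.com/2/tweets/search/recent"
params = {
    "query": "natural language processing -is:retweet lang:en",
    "max_results": 100,  # per-request cap; monthly caps depend on your tier
}
headers = {"Authorization": f"Bearer {BEARER_TOKEN}"}

response = requests.get(url, headers=headers, params=params, timeout=10)
response.raise_for_status()

for tweet in response.json().get("data", []):
    print(tweet["id"], tweet["text"][:80])
```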

In other words, anyone interested in collecting information on social media must:

  1. Understand the laws governing data collection
  2. Understand how developer accounts and APIs work on each platform
  3. Estimate the required investment based on the project's scope

Understanding your audience

Human nature pushes like-minded individuals towards each other. We prefer to share with people who have the same interests as us. Social networking sites appeal to different demographic groups, and interactions in these virtual spaces are shaped both by their behaviors and emerging culture.

Natural Language Processing excels at understanding syntax, but semiotics and pragmatics are still challenging, to say the least. In other words, a computer can parse a sentence and even generate sentences that make sense, but it struggles with the meaning behind words and with how language changes depending on context.

This is why computers have such a hard time detecting sarcasm and irony. For the most part, this is not a problem: on the one hand, the amount of data containing sarcasm is minuscule, and on the other, purpose-built sarcasm-detection tools can help.
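A quick way to see the problem is to run a sarcastic sentence through an off-the-shelf sentiment analyzer such as NLTK's VADER. A lexicon-based scorer keys on the surface-positive words and tends to miss the irony; this is a minimal sketch assuming NLTK is installed:

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

# One-time download of VADER's sentiment lexicon.
nltk.download("vader_lexicon", quiet=True)

analyzer = SentimentIntensityAnalyzer()

sentences = [
    "I love this product, it works perfectly.",         # sincere praise
    "Oh great, it crashed again. Just what I needed.",  # sarcasm
]

for sentence in sentences:
    # polarity_scores returns neg/neu/pos plus a normalized 'compound' score.
    scores = analyzer.polarity_scores(sentence)
    print(f"{scores['compound']:+.2f}  {sentence}")

# A lexicon-based model often scores the sarcastic line as positive,
# because words like "great" read as positive in isolation.
```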

When training machine learning models to interpret the language of social media platforms, it is very important to understand these cultural differences. Twitter, for example, has a fairly toxic reputation, and for good reason: it ranks alongside Facebook as one of the most toxic platforms in the perception of its own users.

It should come as no surprise, then, that you're more likely to encounter differences of opinion depending on the platform you work with. And, in fact, these differences are very important data.

As a quick example, market researchers need to understand which social media platform attracts their target audience. It doesn't make much sense to invest time and resources in tracking trends on networks that will produce little or no valuable information.

More than words

The exponential growth of platforms like Instagram and TikTok represents a new challenge for Natural Language Processing. Video- and image-based user-generated content is quickly becoming the norm, which in turn means our technology needs to adapt.

Facial and voice recognition will soon change the game as more and more content creators share their opinions through video. Although challenging, this is also a great opportunity for emotion analysis: traditional approaches rely on written language, and it has always been difficult to assess the emotion behind the words alone.
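One plausible pipeline, sketched below under stated assumptions, is to transcribe a video's audio track with OpenAI's open-source Whisper model and then score the transcript with a text sentiment analyzer. The file name is a placeholder, and both Whisper and NLTK (with the VADER lexicon from the earlier example) are assumed to be installed:

```python
import whisper
from nltk.sentiment import SentimentIntensityAnalyzer

# "base" is one of Whisper's smaller pretrained checkpoints.
model = whisper.load_model("base")

# Placeholder file -- audio extracted from a creator's video.
result = model.transcribe("creator_clip.mp3")
transcript = result["text"]

# Score the transcript with a text-based sentiment model.
# (This still ignores tone of voice -- true emotion analysis would need
# audio features as well, which is exactly the challenge described above.)
analyzer = SentimentIntensityAnalyzer()
print(transcript)
print(analyzer.polarity_scores(transcript))
```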

While it is still too early to make an educated guess, if the big tech companies continue to push for a "metaverse," social networks will likely change and adapt to become something akin to an MMORPG or a game like Club Penguin or Second Life: a social space where people freely exchange information through their virtual reality microphones and headsets.

Will Meta allow researchers to access these interactions? If the past is any indication, the answer is no, but once again, it's still too early to tell and the Metaverse is a long way off.

NLP and data science

Faster, more powerful computers have led to a revolution in natural language processing algorithms, but NLP is just one tool in a bigger box. Data scientists need to rely on data collection, sociological understanding, and just a little intuition to make the most of this technology.

It's an exciting time for Natural Language Processing, and you can bet that in the coming years the field will continue to grow, providing better, more refined tools for understanding how humans communicate.

Source: BairesDev
