Dive into the Big Data giant! Explore Hadoop, the open source framework that processes vast data sets across computer clusters, reshaping modern data management.
Does your company operate in the number-crunching business? Do you need to perform calculations on large amounts of data? If so, you likely need distributed computing for Big Data workloads. How do you achieve such a feat? With the help of Hadoop.
Apache Hadoop is a collection of open source tools that make it possible to bring together multiple computers to solve big problems. Hadoop provides massive storage for virtually any type of data, along with enormous processing power for a nearly unlimited number of concurrent jobs.
Hadoop was started by Doug Cutting and Mike Cafarella in 2002 while they were working on the Apache Nutch project. Shortly after the project began, Cutting and Cafarella concluded that building the system they envisioned would cost nearly half a million dollars in hardware alone, with a monthly operating cost of about $30,000.
To reduce costs, the team turned to Google's papers on the Google File System and MapReduce, which led Cutting to realize that Nutch would still be limited to clusters of 20 to 40 nodes, and that with only two people working on the project they would never reach their goal. It was shortly after this realization (and Cutting's move to Yahoo!) that a new project, called Hadoop, was formed to expand Nutch's ability to scale to thousands of nodes.
In 2007, Yahoo! successfully deployed Hadoop on a 1,000-node cluster. Hadoop was then released as an open source project to the Apache Software Foundation, and Apache went on to successfully test Hadoop on a 4,000-node cluster.
Cutting and Cafarella's goal had been achieved: Hadoop could scale far enough to handle massive data computation. In December 2011, Apache released version 1.0 of Hadoop.
What are the pieces that make up Hadoop?
Hadoop is a framework made up of three core components:
- Hadoop HDFS – the Hadoop Distributed File System (HDFS) serves as the storage unit, providing not only distributed storage but also data security and fault tolerance.
- Hadoop MapReduce – MapReduce serves as the processing unit; processing runs on the cluster's worker nodes, and the results are sent back to the cluster master (see the sketch after this list).
- Hadoop YARN – YARN (Yet Another Resource Negotiator) serves as the resource management unit and performs task scheduling.
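To make the MapReduce part more concrete, here is a minimal word-count sketch written against Hadoop's Java MapReduce API. It is only an illustration: the class names and input/output paths are placeholders, and a real job would need the Hadoop client libraries on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map step: runs on the worker nodes and emits (word, 1) for every word it sees.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reduce step: receives all the counts for a given word and sums them.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        // args[0] = input path in HDFS, args[1] = output path (must not exist yet).
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

You would package this as a JAR and submit it with something like `hadoop jar wordcount.jar WordCount /input /output`; YARN schedules the map and reduce tasks across the cluster, and HDFS stores the input and output files.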
Each of these components comes together to make distributed storage and processing much more efficient. To help you understand this, let's use an analogy.
A small business sells coffee beans. At first, they only sell one type of coffee beans. They store their coffee beans in a single warehouse connected to the building and everything runs smoothly. Eventually, though, customers start asking for different types of coffee beans, so the company decides it's in its best interest to expand.
To save money, the company stores all the beans in the same room but hires more employees to meet the new demand. As demand continues to grow, supply must keep up, and storage space becomes problematic. To compensate, the company hires even more workers, but soon realizes that the real problem is the bottleneck in the warehouse.
Finally, the company realizes that it needs separate storage rooms for each type of bean, and assigns separate employees to manage the different rooms. With this new pipeline in place, the business not only runs better but can also handle continued growth in demand.
This is similar to how distributed data storage works: instead of separate storage rooms, we have multiple cluster nodes storing the data.
This is how Hadoop helps businesses overcome the ever-increasing demands of Big Data:
- Volume – the amount of data
- Velocity – the speed at which data is generated
- Variety – the different types of data
- Veracity – the ability to trust data
Why Hadoop instead of a traditional database?
You’re probably asking yourself, “Why use Hadoop when a traditional database has served my business perfectly?” This is a good question. The answer comes down to how far you want your business to grow.
You can have the best Java, JavaScript, .NET, Python, PHP, and cloud developers money can buy. But if your database solution cannot handle large amounts of data, there is no way for those developers (no matter how talented they are) to get around the limitations of standard databases. Hadoop addresses those limitations directly:
- Data storage size – Hadoop makes it possible to store large amounts of data of any type (see the HDFS sketch after this list).
- Computing power – Hadoop's distributed computing model means your company can manage increasingly larger amounts of data by simply adding nodes to the cluster.
- Fault tolerance – Hadoop protects your cluster against hardware failures. If a node goes down, jobs are automatically redirected to other nodes.
- Flexibility – unlike a traditional database, Hadoop does not require you to pre-process data before storing it; you can store raw data now and decide how to use it later.
- Cost-effectiveness – Hadoop is not only free and open source, it also lets you scale far beyond traditional databases without spending on expensive specialized hardware.
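As a rough sketch of the storage side, the example below uses Hadoop's Java FileSystem API to write a small file into HDFS and read it back. The NameNode address and paths are placeholders assumed for illustration; on a real cluster the configuration usually comes from core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRoundTrip {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; in practice this comes from core-site.xml.
        conf.set("fs.defaultFS", "hdfs://namenode:9000");

        try (FileSystem fs = FileSystem.get(conf)) {
            Path path = new Path("/data/example/hello.txt");

            // Write: HDFS splits the file into blocks and replicates them
            // across DataNodes for fault tolerance.
            try (FSDataOutputStream out = fs.create(path, true)) {
                out.write("Hello from HDFS".getBytes(StandardCharsets.UTF_8));
            }

            // Read the file back, regardless of which nodes hold its blocks.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
                System.out.println(reader.readLine());
            }
        }
    }
}
```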
The caveat of using Hadoop
The biggest challenge your company will face is having the in-house skills to deploy and manage a Hadoop system. Although Hadoop uses Java as its main language (especially for MapReduce), the skills required go far beyond the basics. Fortunately, there are many nearshore and offshore development companies that offer the talent you need to implement Hadoop in your organization.
Conclusion
So, if you want to compete with the biggest companies in your market, seriously consider a platform like Hadoop to help you meet and exceed the data-processing demands of both current and future growth.