BIG DATA COMPUTING

QUESTION BANK

QUESTION 1: Explain the Hadoop Distributed File System (HDFS) with a suitable diagram.

Answer: Hadoop File System was developed using distributed file system design. It runs on commodity hardware. Unlike other distributed systems, HDFS is highly fault tolerant and designed using low-cost hardware. HDFS holds a very large amount of data and provides easier access. To store such huge data, the files are stored across multiple machines. These files are stored in a redundant fashion to rescue the system from possible data losses in case of failure. HDFS also makes applications available for parallel processing.
HDFS Architecture
Given below is the architecture of a Hadoop File System.

HDFS follows the master-slave architecture and it has the following elements.
Namenode
The namenode is the commodity hardware that contains the GNU/Linux operating system and the namenode software. It is software that can be run on commodity hardware. The system having the namenode acts as the master server and it does the following tasks:
● Manages the file system namespace.
● Regulates clients' access to files.
● It also executes file system operations such as renaming, closing, and opening files and directories.
Datanode
The datanode is commodity hardware having the GNU/Linux operating system and datanode software. For every node (commodity hardware/system) in a cluster, there will be a datanode. These nodes manage the data storage of their system.
● Datanodes perform read-write operations on the file systems, as per client request.
● They also perform operations such as block creation, deletion, and replication according to the instructions of the namenode.
Block
Generally the user data is stored in the files of HDFS. The file in a file system will be divided into one or more segments and/or stored in individual data nodes. These file segments are called blocks.
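To make the namenode/datanode/block relationship concrete, here is a minimal sketch (not part of the original question bank) that uses the Hadoop Java FileSystem API to ask the namenode where the blocks of a file live; the path /user/demo/sample.txt is a hypothetical example.

// Minimal sketch: query the block locations of an HDFS file via the Java API.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListBlockLocations {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // connects to the namenode
        Path file = new Path("/user/demo/sample.txt");   // hypothetical file path

        FileStatus status = fs.getFileStatus(file);
        // The namenode answers with one BlockLocation per block of the file.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            // Each block reports the datanodes that hold a replica of it.
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(",", block.getHosts()));
        }
        fs.close();
    }
}

Run against a cluster, this prints one line per block, listing the datanodes that hold a replica of that block.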

● Highly Scalable - HDFS is highly scalable as it can scale to hundreds of nodes in a single cluster.
● Replication - Due to some unfavorable conditions, the node containing the data may be lost. To overcome such problems, HDFS always maintains a copy of the data on a different machine.
● Fault tolerance - In HDFS, fault tolerance signifies the robustness of the system in the event of failure. HDFS is highly fault-tolerant: if any machine fails, another machine containing a copy of that data automatically becomes active.
● Distributed data storage - This is one of the most important features of HDFS that makes Hadoop very powerful. Here, data is divided into multiple blocks and stored across nodes.
● Portable - HDFS is designed in such a way that it can easily be ported from one platform to another.

QUESTION 3: Describe HDFS CLI (COMMAND LINE INTERFACE).

Answer: A command line interface (CLI) is a text-based user interface used to run programs, manage computer files and interact with the computer. Command line interfaces are also called command line user interfaces, console user interfaces and character user interfaces. CLIs accept as input commands that are entered by keyboard; the commands invoked at the command prompt are then run by the computer.

The HDFS CLI is an interactive command line shell that makes interacting with HDFS simpler and more intuitive than the standard command line tools that come with Hadoop. To use the HDFS commands, we just need to start the Hadoop services using the following command:
sbin/start-all.sh
To check that the Hadoop services are up and running, we use the following command:
jps
Commands:
● ls: this command is used to list all the files.
bin/hdfs dfs -ls <path>
● mkdir: to create a directory.
bin/hdfs dfs -mkdir <directory path>
Creating a home directory:
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/username   (write the username of your computer)
● touchz: it creates an empty file.
bin/hdfs dfs -touchz <file path>
● copyFromLocal (or put): copies a file from the local file system to HDFS; copyToLocal (or get) copies a file from HDFS to the local file system.
● cat: displays the contents of a file.
bin/hdfs dfs -cat <file path>

Anatomy of File Read in HDFS:
Step 1: The client opens the file it wishes to read by calling open() on the File System Object (which for HDFS is an instance of DistributedFileSystem).
Step 2: DistributedFileSystem (DFS) calls the name node, using remote procedure calls (RPCs), to determine the locations of the first few blocks in the file. For each block, the name node returns the addresses of the data nodes that have a copy of that block. The DFS returns an FSDataInputStream to the client for it to read data from. FSDataInputStream in turn wraps a DFSInputStream, which manages the data node and name node I/O.
Step 3: The client then calls read() on the stream. DFSInputStream, which has stored the data node addresses for the first few blocks in the file, then connects to the first (closest) data node for the first block in the file.
Step 4: Data is streamed from the data node back to the client, which calls read() repeatedly on the stream.
Step 5: When the end of the block is reached, DFSInputStream will close the connection to the data node, then find the best data node for the next block. This happens transparently to the client, which from its point of view is simply reading a continuous stream. Blocks are read in order, with the DFSInputStream opening new connections to data nodes as the client reads through the stream. It will also call the name node to retrieve the data node locations for the next batch of blocks as needed.
Step 6: When the client has finished reading the file, it calls close() on the FSDataInputStream.
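As a minimal illustration of the read path above (this sketch is not part of the original question bank, and the HDFS path it reads is a hypothetical example), the client-side code only touches open(), read() and close(); the block lookups and data node connections described in Steps 2-5 happen inside the stream:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsFileRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();           // picks up cluster settings
        FileSystem fs = FileSystem.get(conf);               // DistributedFileSystem instance for HDFS
        InputStream in = null;
        try {
            // Step 1: open() returns an FSDataInputStream wrapping a DFSInputStream.
            in = fs.open(new Path("/user/demo/input.txt")); // hypothetical path
            // Steps 3-5: read() streams each block from the closest data node in turn.
            IOUtils.copyBytes(in, System.out, 4096, false);
        } finally {
            // Step 6: close() releases the data node connections.
            IOUtils.closeStream(in);
        }
        fs.close();
    }
}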

Anatomy of File Write in HDFS: we'll check out how files are written to HDFS. Consider the figure below to get a better understanding of the concept.
Note: HDFS follows the Write Once Read Many model. In HDFS we cannot edit files which are already stored in HDFS, but we can append data by reopening the files.
Step 1: The client creates the file by calling create() on DistributedFileSystem (DFS).
Step 2: DFS makes an RPC call to the name node to create a new file in the file system's namespace, with no blocks associated with it. The name node performs various checks to make sure the file doesn't already exist and that the client has the right permissions to create the file. If these checks pass, the name node makes a record of the new file; otherwise, the file can't be created and the client is thrown an error, i.e., an IOException.
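For comparison with the read path, here is a minimal write sketch (again not from the original question bank; the path and the data written are hypothetical). The name node interaction described in Steps 1-2 happens behind the create() call; the client only writes to the returned stream and closes it:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsFileWrite {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Step 1: create() asks the name node to add the new file to the namespace.
        FSDataOutputStream out = fs.create(new Path("/user/demo/output.txt")); // hypothetical path
        // Data written to the stream is packaged into packets and pipelined to the data nodes.
        out.writeBytes("hello, hdfs\n");
        // close() flushes the remaining packets and tells the name node the file is complete.
        out.close();
        fs.close();
    }
}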

QUESTION 5: Describe MapReduce working with a suitable diagram.

Answer: MapReduce is a software framework and programming model used for processing huge amounts of data. A MapReduce program works in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping of data, while Reduce tasks shuffle and reduce the data. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++. MapReduce programs in cloud computing are parallel in nature and are thus very useful for performing large-scale data analysis using multiple machines in the cluster. The input to each phase is key-value pairs. In addition, every programmer needs to specify two functions: a map function and a reduce function (a minimal WordCount sketch of both is included after the task list below).
MapReduce Architecture

Hadoop divides the job into tasks. There are two types of tasks:

  1. Map tasks (Splits & Mapping)
  2. Reduce tasks (Shuffling, Reducing)
The complete execution process (execution of both Map and Reduce tasks) is controlled by two types of entities:
  1. Jobtracker: acts like a master (responsible for complete execution of the submitted job)
  2. Multiple Task Trackers: act like slaves, each of them performing the job
For every job submitted for execution in the system, there is one Jobtracker that resides on the Namenode and there are multiple Task Trackers which reside on the Datanode.
How Hadoop MapReduce Works
● A job is divided into multiple tasks which are then run on multiple data nodes in a cluster.
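As referenced above, the following WordCount program is a minimal sketch of the two user-supplied functions (it is not part of the original question bank; input and output paths are passed as command line arguments). The map function emits (word, 1) pairs and the reduce function sums the counts for each word:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Map phase: split each input line into words and emit (word, 1) pairs.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reduce phase: after shuffling, sum the counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}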

These blocks are distributed across the nodes on various machines in the cluster. However, RDBMS is a structured database approach, in which data gets stored in tables in the form of rows and columns. RDBMS uses SQL, or Structured Query Language, which can help update and access the data present in different tables. Unlike Hadoop, a traditional RDBMS is not competent to be used for storage of a larger amount of data, or simply big data. Further, let's go through some of the major real-time working differences between the Hadoop database architecture and traditional relational database management practices.
● In Terms of Data Volume
Volume means the quantity of data which could be comfortably stored and effectively processed. Relational databases surely work better when the load is low, probably gigabytes of data. This was the case for so long in information technology applications, but when the data size grows to terabytes or petabytes, RDBMS isn't competent to ensure the desired results. On the other hand, Hadoop is the right approach when the need is to handle a bigger data size. Hadoop can be used to process a huge volume of data effectively compared to traditional relational database management systems.
● Database Architecture
Considering the database architecture, as we have seen above, Hadoop works on the following components:

● HDFS, which is the distributed file system of the Hadoop ecosystem.
● MapReduce, which is a programming model that helps process huge data sets.
● Hadoop YARN, which helps in managing the computing resources in multiple clusters.
However, a traditional RDBMS will possess data based on the ACID properties, i.e., Atomicity, Consistency, Isolation, and Durability, which are used to maintain integrity and accuracy in data transactions. Such transactions could be from any sector, like banking systems, telecommunication, e-commerce, manufacturing, or education.
● Throughput
It is the total data volume processed over a specific time period so that the output can be optimized. Relational database management systems are found to be a failure in terms of achieving a higher throughput if the data volume is high, whereas the Apache Hadoop Framework does an appreciable job in this regard. This is one major reason why there is increasing usage of Hadoop in modern-day data applications compared to RDBMS.
● Data Diversity
The diversity of data refers to the various types of data processed. There are structured, unstructured, and semi-structured data available now. Hadoop possesses a significant ability to store and process data of all the above-mentioned types and prepare it for

  1. Performance
Hadoop, with its distributed processing and distributed storage architecture, processes huge amounts of data with high speed. Hadoop even defeated supercomputers as the fastest machine in 2008. It divides the input data file into a number of blocks and stores data in these blocks over several nodes. It also divides the task that the user submits into various sub-tasks, which are assigned to worker nodes containing the required data, and these sub-tasks run in parallel, thereby improving performance.
  2. Fault-Tolerant
In Hadoop 3.0, fault tolerance is provided by erasure coding. For example, 6 data blocks produce 3 parity blocks by using the erasure coding technique, so HDFS stores a total of these 9 blocks. In the event of failure of any node, the affected data block can be recovered using these parity blocks and the remaining data blocks.
  3. Highly Available
In Hadoop 2.x, the HDFS architecture has a single active NameNode and a single standby NameNode, so if a NameNode goes down then we have a standby NameNode to count on. But Hadoop 3.0 supports multiple standby NameNodes, making the system even more highly available, as it can continue functioning even if two or more NameNodes crash.
  4. Low Network Traffic
In Hadoop, each job submitted by the user is split into a number of independent sub-tasks and these sub-tasks are assigned to the data nodes, thereby moving a small amount of code to the data rather than moving huge data to the code, which leads to low network traffic.
  5. High Throughput
Throughput means the work done per unit time. Hadoop stores data in a distributed fashion, which allows distributed processing to be used with ease. A given job gets divided into small jobs which work on chunks of data in parallel, thereby giving high throughput.
  6. Open Source
Hadoop is an open source technology, i.e. its source code is freely available. We can modify the source code to suit a specific requirement.
  7. Scalable
Hadoop works on the principle of horizontal scalability, i.e. we add entire machines to the cluster of nodes rather than changing the configuration of a machine by adding RAM, disks and so on, which is known as vertical scalability. Nodes can be added to Hadoop clusters on the fly, making it a scalable framework.
  8. Ease of use
The Hadoop framework takes care of parallel processing; MapReduce programmers do not need to care about achieving distributed processing, as it is done at the backend automatically.
  9. Compatibility
Most of the emerging Big Data technologies, like Spark, Flink etc., are compatible with Hadoop. They have got processing engines

providing the same level of fault tolerance and data durability as traditional replication-based HDFS deployments.
How is HDFS Fault Tolerance achieved?
Prior to Hadoop 3, the Hadoop Distributed File System achieved fault tolerance through the replication mechanism. Hadoop 3 came up with erasure coding to achieve fault tolerance with less storage overhead. Let us see both ways of achieving fault tolerance in Hadoop HDFS.

  1. Replication Mechanism: Before Hadoop 3, fault tolerance in Hadoop HDFS was achieved by creating replicas. HDFS creates replicas of each data block and stores them on multiple machines (DataNodes). The number of replicas created depends on the replication factor (by default 3).

If any of the machines fails, the data block is accessible from the other machine containing the same copy of data. Hence there is no data loss due to replicas stored on different machines.

  2. Erasure Coding: Erasure coding is a method used for fault tolerance that durably stores data with significant space savings compared to replication. RAID (Redundant Array of Independent Disks) uses erasure coding. Erasure coding works by striping the file into small units and storing them on various disks. For each stripe of the original dataset, a certain number of parity cells are calculated and stored. If any of the machines fails, the block can be recovered from the parity cells. Erasure coding reduces the storage overhead to 50%.
Example of HDFS Fault Tolerance
Suppose the user stores a file XYZ. HDFS breaks this file into blocks, say A, B, and C. Let's assume there are four DataNodes, say D1, D2, D3, and D4. HDFS creates replicas of each block and stores them on different nodes to achieve fault tolerance. For each original block, there will be two replicas stored on different nodes (replication factor 3). Let block A be stored on DataNodes D1, D2, and D4, block B stored on DataNodes D2, D3, and D4, and block C stored on DataNodes D1, D2, and D3. If DataNode D1 fails, the blocks A and C present in D1 are still available to the user from DataNodes (D2, D4 for A) and (D2, D3 for C). Hence even in unfavorable conditions, there is no data loss.
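As a rough illustration of the two mechanisms above (this sketch is not part of the original question bank; the file path and the new replication factor are hypothetical), the snippet below reads and changes a file's replication factor through the Hadoop Java FileSystem API and works out the storage overhead of 3x replication versus a 6-data/3-parity erasure coding scheme:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FaultToleranceDemo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/sample.txt");   // hypothetical path

        // Replication mechanism: each block is stored replication-factor times.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("current replication factor = " + status.getReplication());
        fs.setReplication(file, (short) 2);              // request fewer replicas for this file

        // Storage overhead: replication factor 3 keeps 3 copies of every block (200% extra),
        // while a 6-data/3-parity erasure coding scheme keeps 9 blocks for 6 blocks of data (50% extra).
        int dataBlocks = 6, parityBlocks = 3, replicationFactor = 3;
        double replicationOverhead = (replicationFactor - 1) * 100.0;
        double ecOverhead = parityBlocks * 100.0 / dataBlocks;
        System.out.println("replication overhead = " + replicationOverhead + "%");
        System.out.println("erasure coding overhead = " + ecOverhead + "%");

        fs.close();
    }
}

With these numbers, replication stores 200% extra data while the erasure-coded layout stores only 50% extra, which matches the space saving claimed above.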