500+ Data Engineering Interview Questions & Answers


Interview Questions

1. What is Hadoop MapReduce? A.) Hadoop MapReduce is the framework used for processing large datasets in parallel across a Hadoop cluster. (A minimal word-count mapper and reducer sketch follows the comparison below.)

2. What are the differences between a relational database and HDFS? A.) RDBMS and HDFS can be compared across six major categories: data types, processing, schema on read vs. schema on write, read/write speed, cost, and best-fit use case.

  1. Data types: an RDBMS relies on structured data, and the schema is always known; any kind of data can be stored in Hadoop, i.e. structured, semi-structured, or unstructured.
  2. Processing: an RDBMS provides limited or no processing capability; Hadoop allows us to process data distributed across the cluster in a parallel fashion.
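
To make the MapReduce description above concrete, here is a minimal word-count sketch against the standard org.apache.hadoop.mapreduce API. It is illustrative only and not part of the original question set; the class names and tokenization are placeholders.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map phase: runs in parallel on each input split and emits (word, 1) pairs.
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce phase: receives all counts emitted for a word and sums them.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}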

3. What are HDFS and YARN? A.) HDFS (Hadoop Distributed File System) is the storage unit of Hadoop. It is responsible for storing different kinds of data as blocks in a distributed environment, and it follows a master-and-slave topology. NameNode: the NameNode is the master node in the distributed environment; it maintains the metadata for the blocks of data stored in HDFS, such as block locations and replication factors. DataNode: DataNodes are the slave nodes responsible for storing data in HDFS; the NameNode manages all the DataNodes. YARN (Yet Another Resource Negotiator) is the processing framework in Hadoop; it manages resources and provides an execution environment for processes. ResourceManager: it receives processing requests, passes the parts of each request to the corresponding NodeManagers where the actual processing takes place, and allocates resources to applications based on their needs. NodeManager: the NodeManager is installed on every DataNode and is responsible for executing tasks on that node.
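
To make the NameNode/DataNode roles concrete, the following hypothetical snippet uses the org.apache.hadoop.fs.FileSystem client API to print the replication factor and block locations that the NameNode tracks for a file; the path is a placeholder.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/sample.txt");   // placeholder path

        FileStatus status = fs.getFileStatus(file);
        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size: " + status.getBlockSize());

        // Each BlockLocation lists the DataNodes holding a replica of that block.
        for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("Block at offset " + loc.getOffset()
                    + " hosted on " + String.join(", ", loc.getHosts()));
        }
        fs.close();
    }
}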

  1. Tell me about the various Hadoop daemons and their roles in a Hadoop cluster? A.) Generally, approach this question by first explaining the HDFS daemons, i.e. NameNode, DataNode and Secondary NameNode, then moving on to the YARN daemons, i.e. ResourceManager and NodeManager, and lastly explaining the JobHistoryServer. JobHistoryServer: it maintains information about MapReduce jobs after the ApplicationMaster terminates.
  2. Compare HDFS with Network Attached Storage (NAS)? A.) Network-attached storage (NAS) is a file-level computer data storage server connected to a computer network, providing data access to a heterogeneous group of clients. NAS can be either hardware or software that provides services for storing and accessing files, whereas the Hadoop Distributed File System (HDFS) is a distributed filesystem that stores data on commodity hardware. In HDFS, data blocks are distributed across all the machines in a cluster, whereas in NAS data is stored on dedicated hardware.
  3. List the differences between Hadoop 1.x and Hadoop 2.x?

A.) To answer this, highlight two important features: a. passive NameNode, b. processing. In Hadoop 1.x, the “NameNode” is a single point of failure. In Hadoop 2.x, we have Active and Passive “NameNodes”: if the active “NameNode” fails, the passive “NameNode” takes charge, so high availability can be achieved in Hadoop 2.x. Also, in Hadoop 2.x, YARN provides a central resource manager. With YARN, you can run multiple applications in Hadoop, all sharing a common pool of resources. MRv2 is a particular type of distributed application that runs the MapReduce framework on top of YARN, and other tools can also perform data processing via YARN, which was a problem in Hadoop 1.x.

  1. What are active and passive NameNodes? A.) In a high-availability architecture, there are two NameNodes: a. the active “NameNode” is the one that works and runs in the cluster; b. the passive “NameNode” is a standby “NameNode” that holds the same data as the active one. When the active “NameNode” fails, the passive “NameNode” replaces it in the cluster.
  2. Why does one remove or add DataNodes frequently? A.) One of the most attractive features of the Hadoop framework is its use of commodity hardware. However, this leads to frequent “DataNode” crashes in a Hadoop cluster. Another striking feature of the Hadoop framework is the ease of scaling in line with rapid growth in data volume. This is why a Hadoop admin's routine work includes commissioning and decommissioning nodes in the cluster.
  3. What happens when two clients try to access the same file in HDFS? A.) When the first client requests the file, HDFS grants it write access; when a second client requests the same file, HDFS rejects the request because another client is already accessing it.
  4. How does the NameNode tackle DataNode failures? A.) The NameNode periodically receives a heartbeat (signal) from each DataNode in the cluster, which implies that the DataNode is functioning properly.
  5. What will you do when the NameNode is down? A.) Use the file system metadata replica (FsImage) to start a new NameNode.

What is Rack Awareness? A.) Rack Awareness is the algorithm by which the “NameNode” decides how blocks and their replicas are placed, based on rack definitions, so that traffic between “DataNodes” stays within the same rack where possible and cross-rack network traffic is minimized.

  1. What is the difference between an HDFS block and an input split? A.) The “HDFS block” is the physical division of the data, while the “input split” is the logical division. HDFS divides data into blocks for storage, whereas the input split divides the data and assigns each split to a mapper function for processing.
  2. Name the three modes in which Hadoop can run? A.) Standalone mode, pseudo-distributed mode, and fully distributed mode.
  3. What do you know about SequenceFileInputFormat? A.) “SequenceFileInputFormat” is an input format for reading sequence files. It is a specific compressed binary file format optimized for passing data from the output of one “MapReduce” job to the input of another “MapReduce” job.
  4. What is Hive? A.) Apache Hive is a data warehouse system built on top of Hadoop, developed by Facebook, and used for analyzing structured and semi-structured data. Hive abstracts away the complexity of Hadoop MapReduce.
  5. What is a SerDe in Hive? A.) The “SerDe” interface allows you to instruct “Hive” how a record should be processed. A “SerDe” is a combination of a “Serializer” and a “Deserializer”. “Hive” uses the “SerDe” (and “FileFormat”) to read and write a table’s rows.
  6. Can the default Hive metastore be used by multiple users at the same time? A.) The “Derby database” is the default “Hive metastore”. Multiple users (processes) cannot access it at the same time.

It is mainly used to perform unit tests.
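
Tying the Hive questions above together, here is a hedged sketch that connects to HiveServer2 over JDBC and creates a table with an explicit SerDe. The host, port, credentials, and table name are assumptions, not from the original text.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class HiveExample {
    public static void main(String[] args) throws Exception {
        // Assumes a HiveServer2 instance listening on the default port 10000.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "hive", "");
             Statement stmt = conn.createStatement()) {

            // ROW FORMAT SERDE tells Hive how to serialize/deserialize each row.
            stmt.execute("CREATE TABLE IF NOT EXISTS employees (id INT, name STRING) "
                    + "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' "
                    + "STORED AS TEXTFILE");

            // With no LOCATION clause, the table data lands under the default
            // warehouse directory, e.g. /user/hive/warehouse/employees.
        }
    }
}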

  1. What is the default location where Hive stores table data? A.) The default location where Hive stores table data is inside HDFS, in /user/hive/warehouse.
  2. What is Apache HBase? A.) HBase is an open-source, multidimensional, distributed, scalable NoSQL database written in Java. HBase runs on top of HDFS (the Hadoop Distributed File System) and provides BigTable-like (Google) capabilities to Hadoop. It is designed to provide a fault-tolerant way of storing large collections of sparse data sets, and it achieves high throughput and low latency by providing faster read/write access to huge datasets.
  3. What are the components of Apache HBase? A.) HBase has three major components: the HMaster server, the HBase RegionServer, and ZooKeeper. RegionServer: a table can be divided into several regions, and a group of regions is served to clients by a RegionServer. HMaster: it coordinates and manages the RegionServers (much as the NameNode manages DataNodes in HDFS). ZooKeeper: ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps maintain server state inside the cluster by communicating through sessions.
  4. What are the components of a RegionServer? A.) WAL: the Write Ahead Log (WAL) is a file attached to every RegionServer inside the distributed environment. The WAL stores new data that hasn't yet been persisted or committed to permanent storage. BlockCache: the BlockCache resides at the top of the RegionServer and keeps frequently read data in memory. MemStore: it is the write cache. It stores all incoming data before committing it to disk or permanent memory; there is one MemStore for each column family in a region. HFile: HFiles are stored in HDFS and hold the actual cells on disk.
  1. Why is Hadoop used in big data analytics? A.) Hadoop allows running many exploratory data analysis tasks on full datasets, without sampling. Features that make Hadoop essential for big data are: data collection, storage, processing, and the ability to run independently.
  2. Name some of the important tools used for data analytics? A.) Important big data analytics tools include NodeXL, KNIME, Tableau, Solver, OpenRefine, Rattle GUI, and QlikView.
  3. What is FSCK? A.) FSCK (File System Check) is a command used by HDFS. It checks whether any file is corrupt or has missing blocks. FSCK generates a summary report describing the overall health of the file system.
  4. What are the different core methods of a Reducer? A.) There are three core methods of a reducer: setup() – it configures parameters such as heap size, distributed cache, and input data size. reduce() – the heart of the reducer, called once per key with the associated reduce task.

cleanup() – It is a process to clean up all the temporary files at the end of a reducer task.
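
A minimal skeleton of the three core reducer methods described above; the summing logic is only a placeholder.

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SkeletonReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

    @Override
    protected void setup(Context context) {
        // Runs once per task: read configuration, open side resources, etc.
    }

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // Called once per key with all values grouped for that key.
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }

    @Override
    protected void cleanup(Context context) {
        // Runs once at the end of the task: close resources, remove temporary files.
    }
}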

  1. What are the most common input file formats in Hadoop? A.) The most common input formats in Hadoop are the key-value input format, the sequence file input format, and the text input format.
  2. What are the different file formats that can be used in Hadoop? A.) File formats used with Hadoop include CSV, JSON, columnar formats, sequence files, Avro, and Parquet files.
  3. What is commodity hardware? A.) Commodity hardware is the basic hardware resource required to run the Apache Hadoop framework. It is a common term for affordable devices that are usually compatible with one another.
  4. What do you mean by logistic regression? A.) Also known as the logit model, logistic regression is a technique for predicting a binary outcome from a linear combination of predictor variables.
  5. Name the port numbers for the NameNode, Task Tracker and Job Tracker? A.) NameNode – port 50070; Task Tracker – port 50060; Job Tracker – port 50030.

How much data is needed to get a valid result? A.) The amount of data required depends on the methods you use; it must be enough to give an excellent chance of obtaining valid results.

  1. Is Hadoop different from other parallel computing systems? How? A.) Yes, it is. Hadoop is a distributed file system. It allows us to store and manage large amounts of data on a cluster of machines while handling data redundancy. The main benefit is that, since the data is stored on multiple nodes, it is better to process it in a distributed way: each node can process the data stored on it instead of wasting time moving the data across the network. By contrast, a relational database computing system lets us query data in real time, but it is not efficient to store data in tables, records, and columns when the data is huge. Hadoop also provides a scheme for building a column-oriented database with Hadoop HBase for run-time queries on rows.
  2. What is a Backup Node? A.) The Backup Node is an extended checkpoint node that performs checkpointing and also supports online streaming of file system edits. Its functionality is similar to the Checkpoint node, and it stays synchronized with the NameNode.
  3. What are the common data challenges? A.) The most common data challenges are: ensuring data integrity; achieving a 360-degree view; safeguarding user privacy; and taking the right business action in real time.
  4. How do you overcome the above-mentioned data challenges? A.) Data challenges can be overcome by: adopting data management tools that provide a clear view of data assessment; using tools to remove any low-quality data; auditing data from time to time to ensure user privacy is safeguarded; and

using AI-powered tools or software-as-a-service (SaaS) products to combine datasets and make them usable.

  1. What is the hierarchical clustering algorithm? A.) The hierarchical clustering algorithm is one that combines and divides the groups that already exist, building up a hierarchy of clusters.
  2. What is K-means clustering? A.) K-means clustering is a method of vector quantization that partitions observations into k clusters, assigning each observation to the cluster with the nearest mean.
  3. Can you mention the criteria for a good data model? A.) A good data model: should be easily consumed; should scale well when large data changes occur; should offer predictable performance; and should adapt to changes in requirements.
  4. Name the different commands for starting up and shutting down the Hadoop daemons? A.) To start all the daemons: ./sbin/start-all.sh; to shut down all the daemons: ./sbin/stop-all.sh
  5. Talk about the different tombstone markers used for deletion purposes in HBase? A.) There are three main tombstone markers used for deletion in HBase: Family Delete Marker – marks all the columns of a column family; Version Delete Marker – marks a single version of a single column; Column Delete Marker – marks all the versions of a single column.
  1. What is feature selection? A.) Feature selection refers to the process of extracting only the required features from a specific dataset, which matters especially when data is extracted from disparate sources. Feature selection can be done via three techniques: a. the filters method, b. the wrappers method, c. the embedded method.
  2. Define outliers? A.) Outliers are values that are far removed from the group; they do not belong to any specific cluster or group in the dataset. The presence of outliers usually affects the behavior of the model. Common outlier detection methods include:
  a. extreme value analysis
  b. probabilistic analysis
  c. linear models
  d. information-theoretic models
  e. high-dimensional outlier detection
  3. How can you handle missing values in Hadoop? A.) There are different ways to estimate missing values. These include regression, multiple data imputation, listwise/pairwise deletion, maximum likelihood estimation, and the approximate Bayesian bootstrap.

MapReduce Interview Questions:
  4. Compare MapReduce and Spark? A.) There are four criteria on which MapReduce and Spark can be compared: processing speed, standalone mode, ease of use, and versatility.

On these criteria, MapReduce and Spark compare as follows:

  a. Processing speed: MapReduce is good; Spark is exceptional.
  b. Standalone mode: MapReduce needs Hadoop; Spark can work independently.
  c. Ease of use: MapReduce needs extensive Java programs; Spark offers APIs for Python, Scala and Java.
  d. Versatility: MapReduce is not optimized for real-time and machine-learning applications; Spark is optimized for real-time and ML applications.
  5. What is MapReduce? A.) It is a framework / programming model used for processing large data sets over a cluster of computers using parallel programming.
  6. State the reason why we can't perform aggregation in the mapper. Why do we need the reducer for this? A.) We cannot perform “aggregation” (addition) in the mapper because sorting does not occur in the “mapper” function; sorting occurs only on the reducer side. During “aggregation”, we need the output of all the mapper functions, which may not be possible to collect in the map phase because mappers may be running on different machines.
  7. What is the RecordReader in Hadoop? A.) The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task.
  8. Explain Distributed cache in MapReduce Framework? A.) Distributed Cache is a dedicated service of the Hadoop MapReduce framework, which is used to cache the files whenever required by the applications. This can cache read-only text files, archives, jar files, among others, which can be accessed and read later on each data node where map/reduce tasks are running.
  9. How do reducers communicate with each other? A.) The “MapReduce” programming model does not allow “reducers” to communicate with each other.
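
As a sketch of the Distributed Cache behavior described above: in the org.apache.hadoop.mapreduce API, files are added with Job.addCacheFile() and read back in the mapper's setup(). The file path and class names below are placeholders.

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class CacheExample {

    public static class LookupMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void setup(Context context) throws IOException {
            // Cached files are localized on every node running a map/reduce task.
            URI[] cached = context.getCacheFiles();
            if (cached != null && cached.length > 0) {
                Path lookup = new Path(cached[0].getPath());
                // ... open 'lookup' and load it into an in-memory map for joins ...
            }
        }
    }

    public static void configure(Job job) throws Exception {
        // Placeholder HDFS path for a small read-only lookup file.
        job.addCacheFile(new URI("/data/lookup.txt"));
        job.setMapperClass(LookupMapper.class);
    }
}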

What are the Identity Mapper and Chain Mapper? A.) Identity Mapper: it is the default mapper; it only writes the input data to the output and does not perform any computations or calculations on the input data. Chain Mapper: Chain Mapper is the implementation of a simple Mapper class through chained operations across a set of Mapper classes within a single map task. In this arrangement, the output of the first mapper becomes the input of the second mapper.
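
The following is a hedged sketch of chaining mappers with the org.apache.hadoop.mapreduce.lib.chain.ChainMapper API, where the output of the first mapper feeds the second within a single map task; the two mapper classes are invented for illustration.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.chain.ChainMapper;

public class ChainExample {

    // First mapper in the chain: lower-cases each input line.
    public static class LowercaseMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(value.toString().toLowerCase()));
        }
    }

    // Second mapper: consumes the first mapper's output within the same map task.
    public static class TrimMapper extends Mapper<LongWritable, Text, LongWritable, Text> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(key, new Text(value.toString().trim()));
        }
    }

    public static void wire(Job job) throws IOException {
        Configuration emptyConf = new Configuration(false);
        ChainMapper.addMapper(job, LowercaseMapper.class,
                LongWritable.class, Text.class, LongWritable.class, Text.class, emptyConf);
        ChainMapper.addMapper(job, TrimMapper.class,
                LongWritable.class, Text.class, LongWritable.class, Text.class, emptyConf);
    }
}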

  1. What main configuration parameters are specified in MapReduce? A.) The following configuration parameters are needed to perform the map and reduce jobs: the input location of the job in HDFS; the output location of the job in HDFS; the input and output formats; the classes containing the map and reduce functions, respectively; and the .jar file containing the mapper, reducer and driver classes. (A driver sketch pulling these together follows this list.)
  2. Name the job control options specified by MapReduce? A.) Since this framework supports chained operations, wherein the output of one map job serves as the input to another, the main job control options are: Job.submit() – submits the job to the cluster and returns immediately; Job.waitForCompletion(boolean) – submits the job to the cluster and waits for its completion.
  3. What is InputFormat in Hadoop? A.) InputFormat defines the input specification for a job. It performs the following:
  a. validates the input specification of the job;
  b. splits the input file(s) into logical instances called InputSplits;
  c. provides an implementation of RecordReader to extract input records from the above instances for further processing by the Mapper.
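
As referenced in the list above, here is a hypothetical driver that sets the configuration parameters and job-control options just described. The input/output paths are placeholders, and the mapper/reducer classes are the ones from the earlier word-count sketch.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word-count");

        // The .jar containing the mapper, reducer and driver classes.
        job.setJarByClass(WordCountDriver.class);

        // Classes containing the map and reduce functions (from the word-count sketch earlier).
        job.setMapperClass(WordCount.TokenMapper.class);
        job.setReducerClass(WordCount.SumReducer.class);

        // Input and output formats, plus the output key/value types.
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output locations of the job in HDFS (placeholder paths).
        FileInputFormat.addInputPath(job, new Path("/data/input"));
        FileOutputFormat.setOutputPath(job, new Path("/data/output"));

        // Job control: job.submit() would return immediately;
        // waitForCompletion(true) blocks and reports progress until the job finishes.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}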
  1. What is the difference between an HDFS block and an InputSplit? A.) An HDFS block splits data into physical divisions, while an InputSplit in MapReduce splits input files logically.
  2. What is TextInputFormat? A.) In TextInputFormat, files are broken into lines, where the key is the position (byte offset) in the file and the value is the line of text. Programmers can also write their own InputFormat.
  3. What is the role of the JobTracker? A.) The primary function of the JobTracker is resource management, which essentially means managing the TaskTrackers. Apart from this, the JobTracker also tracks resource availability and handles task life-cycle management.
  4. Explain JobConf in MapReduce? A.) JobConf is the primary interface for describing a map-reduce job to Hadoop for execution. JobConf specifies the Mapper, Combiner, Partitioner, Reducer, InputFormat and OutputFormat implementations.
  5. What is an OutputCommitter? A.) OutputCommitter describes the committing of MapReduce task output. FileOutputCommitter is the default OutputCommitter class available in MapReduce.
  6. What is a map in Hadoop? A.) In Hadoop, the map is the first phase of MapReduce processing. A map task reads data from an input location and outputs key-value pairs according to the input type.
  7. What is a reducer in Hadoop? A.) In Hadoop, a reducer collects the output generated by the mappers, processes it, and creates a final output of its own.
  8. What are the parameters of mappers and reducers?
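
The final question above is cut off in this preview; as general background (not from the original text), the generic parameters of Mapper and Reducer are their input and output key/value types, as in this sketch.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: input key/value types and output key/value types.
// For a plain text file, KEYIN is the byte offset of the line and VALUEIN is the line itself.
class ParamMapper extends Mapper<LongWritable, Text, Text, IntWritable> { }

// Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>: its input types must match the mapper's output types.
class ParamReducer extends Reducer<Text, IntWritable, Text, IntWritable> { }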