









Chapter 14 Big Data Analytics and NoSQL: Comprehensive Exam Study Guide (Updated 2024/2025)
Start by explaining that Big Data is a nebulous term. Its definition, and the set of techniques and technologies covered under the umbrella term, are constantly changing and being redefined. There is no standardizing body for Big Data or NoSQL, so no one is in charge of making a definitive statement about exactly what qualifies as Big Data. This is made worse by the fact that most technologies for Big Data problems and the NoSQL movement are open source, so even the developers working in the arena form a loose community without hierarchy or structure.
As a generic definition, Big Data is data of such volume, velocity, and/or variety that it is difficult for traditional relational database technologies to store and process it. Students need to understand that the definition of Big Data is relative, not absolute. We cannot look at a collection of data and state categorically that it is Big Data now and for all time. We may categorize a set of data or a data storage and processing requirement as a Big Data problem today. In three years, or even in one year, relational database technologies may have advanced to the point where that same problem is no longer a Big Data problem.
NoSQL has the same problem in terms of its definition. Since Big Data and NoSQL are both defined in terms of a negative statement that says what they are not instead of a positive statement that says what they are, they both suffer from being ill-defined and overly broad.
Discuss the many V's of Big Data. The basic V's (volume, velocity, and variety) are key to Big Data. Again, because there is no authority to define what Big Data is, other V's are added by writers and thinkers who like to extend the alliteration of the 3 V's. Beyond the 3 V's, the other V's proposed by various sources are often not really unique to Big Data. For example, all data have volume; Big Data problems involve volume that is too large for relational database technologies to support. Veracity is the trustworthiness of the data. All data need to be trustworthy, and Big Data problems do not require support for a higher level of trustworthiness than relational database technologies can provide. Therefore, the argument can be made that veracity is a characteristic of all data, not just Big Data. Students should understand that critical thinking about Big Data is necessary when assessing claims and technologies in this fast-changing arena.
Discuss that Hadoop has been the beneficiary of great marketing and widespread buy-in from pundits. Hadoop has become synonymous with Big Data in the minds of many people who are only passingly familiar with data management. However, Hadoop is a very specialized technology aimed at very specific tasks associated with storing and processing very large data sets in non-integrative ways. This makes the Hadoop ecosystem very important, because the ecosystem can expand the basic HDFS and MapReduce capabilities to support a wider range of needs and allow greater integration of the data.
Stress to students that the NoSQL landscape is constantly changing, with on the order of 100 products that could be considered NoSQL databases. Four categories of NoSQL databases appear in the literature, as shown below, but many products do not fit neatly into only one category:
Key-value
Document
Column family
Graph
Each category attempts to deal with non-relational data in different ways.
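To make the four categories more concrete, the hedged Python sketch below shows how the same hypothetical customer record might be modeled in each style. The structures, keys, and field names are illustrative assumptions only; they are not the API or data model of any particular NoSQL product.

```python
# Conceptual sketch only: plain Python structures standing in for the four
# NoSQL storage models. No real database driver or product API is used.

# Key-value: the value is an opaque blob; the DBMS only understands the key.
key_value_store = {
    "customer:1001": b'{"name": "Ada Lopez", "city": "Austin", "orders": 3}'
}

# Document: the value is a structured document (e.g., JSON) whose tags the
# DBMS can inspect and query.
document_store = {
    "customer:1001": {"name": "Ada Lopez", "city": "Austin", "orders": 3}
}

# Column family: columns are grouped into families; "address" here plays the
# role of a super column grouping logically related columns.
column_family_store = {
    "customer:1001": {
        "profile": {"name": "Ada Lopez", "orders": 3},
        "address": {"city": "Austin", "state": "TX"},
    }
}

# Graph: data is modeled as nodes plus explicit relationships (edges).
graph_store = {
    "nodes": {"c1001": {"label": "Customer", "name": "Ada Lopez"},
              "p42":   {"label": "Product",  "name": "Router"}},
    "edges": [("c1001", "PURCHASED", "p42")],
}
```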
Data analysis focuses on attempting to generate knowledge to expand and inform the organization's decision-making processes. These topics were covered to a great extent in Chapter 13 when analyzing data from transactional databases integrated into data warehouses. In this chapter, explanatory and predictive analytics are applied to non-relational databases.
1. What is Big Data? Give a brief definition.
Big Data is data of such volume, velocity, and/or variety that it is difficult for traditional relational database technologies to store and process it.
2. What are the traditional 3 Vs of Big Data? Briefly define each.
Volume, velocity, and variety are the traditional 3 Vs of Big Data. Volume refers to the quantity of data that must be stored. Velocity refers to the speed with which new data is being generated and entering the system. Variety refers to the variations in the structure, or the lack of structure, in the data being captured.
3. Explain why companies like Google and Amazon were among the first to address the Big Data problem.
In the 1990s, the use of the Internet exploded, and commercial websites helped attract millions of new consumers to online transactions. When the dot-com bubble burst at the end of the 1990s, the millions of new consumers remained, but the number of companies providing them services dropped dramatically. As a result, the surviving companies, like Google and Amazon, experienced exponential growth in a very short time. This led to these companies being among the first to experience the volume, velocity, and variety of data that is associated with Big Data.
4. Explain the difference between scaling up and scaling out.
Scaling up involves improving storage and processing capabilities through the use of improved hardware, software, and techniques without changing the quantity of servers. Scaling out involves improving storage and processing capabilities through the use of more servers (see the sketch below).
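As a rough illustration of scaling out, the hedged sketch below spreads keys across a growing pool of servers by hashing. The server names and the simple modulo scheme are assumptions chosen for clarity; a production system would typically use consistent hashing or range partitioning to limit data movement when servers are added.

```python
# Conceptual sketch: scaling out by adding servers and distributing keys
# across them with a hash function.
import hashlib

def assign_server(key: str, servers: list) -> str:
    """Map a key to one of the available servers."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[digest % len(servers)]

servers = ["node-a", "node-b"]                 # initial server pool
print(assign_server("customer:1001", servers))

servers.append("node-c")                       # scaling out: add a server
print(assign_server("customer:1001", servers)) # note: the key may now move
```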
streamed in a sequential fashion. HDFS does not work well when only small parts of a file are needed. Finally, HDFS assumes that failures in the servers will be frequent. As the number of
servers increases, the probability of a failure increases significantly. HDFS assumes that servers will fail so the data must be redundant to avoid loss of data when servers fail.
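The hedged sketch below mimics, in plain Python, how a file might be split into blocks and each block replicated on several data nodes so that a single server failure does not lose data. The block size, replication factor, and node names are toy assumptions for illustration; this is not the Hadoop API.

```python
# Conceptual sketch only: HDFS-style block splitting and replication.
import itertools

BLOCK_SIZE = 4        # bytes, tiny for illustration (real HDFS blocks are huge)
REPLICATION = 3       # each block is stored on 3 different data nodes

data_nodes = ["dn1", "dn2", "dn3", "dn4", "dn5"]

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE):
    """Break a byte stream into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(blocks, nodes, replication=REPLICATION):
    """Name-node-style metadata: block id -> list of data nodes holding it."""
    placement = {}
    node_cycle = itertools.cycle(nodes)
    for block_id, _ in enumerate(blocks):
        placement[block_id] = [next(node_cycle) for _ in range(replication)]
    return placement

blocks = split_into_blocks(b"large file streamed sequentially")
print(place_blocks(blocks, data_nodes))
# Losing any single data node still leaves two copies of every block.
```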
10. What is the difference between a name node and a data node in HDFS?
The name node stores the metadata that tracks where all of the actual data blocks reside in the system. The name node is responsible for coordinating tasks across multiple data nodes to ensure sufficient redundancy of the data. The name node does not store any of the actual user data. The data nodes store the actual user data. A data node does not store metadata about the contents of any data node other than itself.
11. Explain the basic steps of MapReduce processing.
A client node submits a job to the Job Tracker. The Job Tracker determines where the data to be processed resides. The Job Tracker contacts the Task Trackers on the nodes as close as possible to the data. Each Task Tracker creates mappers and reducers as needed to complete the processing of each block of data and consolidate that data into a result. The Task Trackers report results back to the Job Tracker when the mappers and reducers are finished. The Job Tracker updates the status of the job to indicate when it is complete. (A minimal sketch of the map-and-reduce pattern appears after question 14 below.)
12. Briefly explain how HDFS and MapReduce are complementary to each other.
Both HDFS and MapReduce rely on the concept of massive, relatively independent, distributions. HDFS decomposes data into large, independent chunks that are then distributed across a number of independent servers. MapReduce decomposes processing into independent tasks that are distributed across a number of independent servers. The distribution of data in HDFS is coordinated by a name node server that collects data from each server about the state of the data that it holds. The distribution of processing in MapReduce is coordinated by a job tracker that collects data from each server about the state of the processing it is performing.
13. What are the four basic categories of NoSQL databases?
Key-value databases, document databases, column family databases, and graph databases.
14. How are the value components of a key-value database and a document database different?
In a key-value database, the value component is unintelligible to the DBMS; in other words, any processing or interpretation of the value component must be accomplished by the application logic. In a document database, the value component is partially interpretable by the DBMS, which can identify and search for specific tags, or subdivisions, within the value component.
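To illustrate the mapper/reducer pattern described in questions 11 and 12, here is a hedged, single-process Python sketch of the classic word count job. In a real Hadoop job these functions would run on many task-tracker nodes, and the framework, not this script, would handle the shuffle between them; the sample blocks are invented.

```python
# Conceptual word-count sketch of MapReduce, run in a single process.
from collections import defaultdict

def mapper(block: str):
    """Map step: emit (word, 1) pairs for one block of input text."""
    for word in block.split():
        yield word.lower(), 1

def reducer(word: str, counts):
    """Reduce step: consolidate all counts emitted for one key."""
    return word, sum(counts)

blocks = ["Big Data is big", "NoSQL handles big data"]  # stand-ins for HDFS blocks

# Shuffle: group intermediate pairs by key (done by the framework in Hadoop).
grouped = defaultdict(list)
for block in blocks:
    for word, count in mapper(block):
        grouped[word].append(count)

results = dict(reducer(word, counts) for word, counts in grouped.items())
print(results)   # e.g., {'big': 3, 'data': 2, ...}
```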
15. Briefly explain the difference between row-centric and column-centric data storage.
Row-centric storage treats a row as the smallest data storage unit. All of the column values associated with a particular row of data are stored together in physical storage. This is the optimal storage approach for operations that manipulate and retrieve all columns in a row, but only a small number of rows in a table. Column-centric storage treats a row as a divisible collection of values that are stored separately, with the values of a single column across many rows being physically stored together. This is optimal when operations manipulate and retrieve a small number of columns for all rows in the table. (A sketch contrasting the two layouts appears after question 18 below.)
16. What is the difference between a column and a super column in a column family database?
Columns in a column family database are relatively independent of each other. A super column is a group of columns that are logically related. This relationship can be based on the nature of the data in the columns, such as a group of columns that comprise an address, or it can be based on application processing requirements.
17. Explain why graph databases tend to struggle with scaling out.
Graph databases are designed to address problems with highly related data. The data in a graph database are tightly integrated, and queries that traverse a graph focus on the relationships among the data. Scaling out requires moving data to a number of different servers. As a general rule, scaling out is recommended when the data on each server is relatively independent of the data on other servers. Due to the dependencies among the data on different servers in a graph database, the inter-server communication overhead is very high, which has a significant negative impact on the performance of graph databases in a scaled-out environment.
18. What is data analytics? Briefly define explanatory and predictive analytics. Give some examples.
Data analytics is a subset of BI functionality that encompasses a wide range of mathematical, statistical, and modeling techniques with the purpose of extracting knowledge from data. Data analytics is used at all levels within the BI framework, including queries and reporting, monitoring and alerting, and data visualization. Hence, data analytics is a "shared" service that is crucial to what BI adds to an organization. Data analytics represents what business managers really want from BI: the ability to extract actionable business insight from current events and foresee future problems or opportunities.
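The short Python sketch below contrasts row-centric and column-centric layouts for the same small table, as discussed in question 15. The table contents and access patterns are invented purely for illustration.

```python
# Conceptual sketch: one table stored row-centrically vs. column-centrically.

# Row-centric: all column values for one row are stored together.
rows = [
    {"id": 1, "name": "Ada", "city": "Austin", "balance": 120.0},
    {"id": 2, "name": "Raj", "city": "Dallas", "balance": 75.5},
    {"id": 3, "name": "Mei", "city": "Austin", "balance": 210.0},
]

# Column-centric: each column's values across all rows are stored together.
columns = {
    "id":      [1, 2, 3],
    "name":    ["Ada", "Raj", "Mei"],
    "city":    ["Austin", "Dallas", "Austin"],
    "balance": [120.0, 75.5, 210.0],
}

# Retrieving one whole row favors the row layout...
print(rows[1])

# ...while aggregating one column over all rows favors the column layout.
print(sum(columns["balance"]) / len(columns["balance"]))
```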
profile customers and predict customer buying patterns in these industries was a critical driving force for the evolution of many modeling methodologies used in BI data analytics today. For
example, based on your demographic information and purchasing history, a credit card company can use data-mining models to determine what credit limit to offer, what offers you are more likely to accept, and when to send those offers. As another example, a data mining tool could be used to analyze customer purchase history data. The data mining tool will find many interesting purchasing patterns and correlations involving customer demographics, the timing of purchases, and the types of items purchased together. The predictive analytics tool will use those findings to build a model that will predict, with a high degree of accuracy, when a certain type of customer will purchase certain items and what items are likely to be purchased on certain nights and at certain times.
20. How does data mining work? Discuss the different phases in the data mining process.
Data mining proceeds through four phases:
In the data preparation phase, the main data sets to be used by the data mining operation are identified and cleansed of any data impurities. Because the data in the data warehouse are already integrated and filtered, the data warehouse usually is the target set for data mining operations.
The objective of the data analysis and classification phase is to study the data to identify common data characteristics or patterns. During this phase, the data mining tool applies specific algorithms to find:
data groupings, classifications, clusters, or sequences;
data dependencies, links, or relationships;
data patterns, trends, and deviations.
The knowledge acquisition phase uses the results of the data analysis and classification phase. During this phase, the data mining tool (with possible intervention by the end user) selects the appropriate modeling or knowledge acquisition algorithms. The most typical algorithms used in data mining are based on neural networks, decision trees, rules induction, genetic algorithms, classification and regression trees, memory-based reasoning or nearest neighbor, and data visualization. A data mining tool may use many of these algorithms in any combination to generate a computer model that reflects the behavior of the target data set (a minimal decision-tree sketch follows the example findings below).
Although some data mining tools stop at the knowledge acquisition phase, others continue to the prognosis phase. In this phase, the data mining findings are used to predict future behavior and forecast business outcomes. Examples of data mining findings include:
65% of customers who did not use the credit card in six months are 88% likely to cancel their account.
82% of customers who bought a new TV 27" or bigger are 90% likely to buy an entertainment center within the next 4 weeks.
If age < 30 and income <= 25,000 and credit rating < 3 and credit amount > 25,000, the minimum term is 10 years.
The complete set of findings can be represented in a decision tree, a neural net, a forecasting model, or a visual presentation interface, which is then used to project future events or results. For example, the prognosis phase may project the likely outcome of a new product rollout or a new marketing promotion.
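As a hedged illustration of the knowledge acquisition phase, the sketch below fits a classification tree to a tiny invented data set using scikit-learn's DecisionTreeClassifier and prints the learned rules. The features, thresholds, and resulting rules are toy assumptions and are not meant to reproduce the example findings above.

```python
# Conceptual sketch: rule discovery with a classification tree (scikit-learn).
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy training data: [age, income, credit_rating] and whether a long loan
# term was required (1) or not (0). Entirely invented for illustration.
X = [
    [25, 20000, 2], [27, 24000, 1], [29, 22000, 2],
    [45, 60000, 4], [38, 55000, 5], [50, 80000, 4],
]
y = [1, 1, 1, 0, 0, 0]

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# The learned tree can be read as if/then rules, similar in spirit to the
# example findings listed above.
print(export_text(model, feature_names=["age", "income", "credit_rating"]))
print(model.predict([[26, 23000, 2]]))   # classify a new applicant
```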
21. Describe the characteristics of predictive analytics. What is the impact of Big Data on predictive analytics?
Predictive analytics employs mathematical and statistical algorithms, neural networks, artificial intelligence, and other advanced modeling tools to create actionable predictive models based on available data. The algorithms used to build a predictive model are specific to certain types of problems and work with certain types of data. Therefore, it is important that the end user, who typically is trained in statistics and understands the business, applies the proper algorithms to the problem at hand. However, thanks to constant technology advances, modern BI tools can automatically apply multiple algorithms to find the optimum model (see the sketch at the end of this answer). Most predictive analytics models are used in areas such as customer relationships, customer service, customer retention, fraud detection, targeted marketing, and optimized pricing. Predictive analytics can add value to an organization in many different ways; for example, it can help optimize existing processes, identify hidden problems, and anticipate future problems or opportunities. However, predictive analytics is not the "secret sauce" to fix all business problems. Managers should carefully monitor and evaluate the value of predictive analytics models to determine their return on investment.
Predictive analytics received a big stimulus with the advent of social media. Companies turned to data mining and predictive analytics as a way to harvest the mountains of data stored on social media sites. Google was one of the first companies to offer targeted ads as a way to enhance and personalize search experiences. Similar initiatives were used by all types of organizations to increase customer loyalty and drive up sales. Take the example of the airline and credit card industries and their frequent flyer and affinity card programs. Nowadays, many organizations use predictive analytics to profile customers in an attempt to get and keep the right ones, which in turn will increase loyalty and sales.
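To ground the point in question 21 about applying multiple algorithms to find the optimum model, here is a hedged scikit-learn sketch that compares two candidate models with cross-validation on a synthetic data set. The data, the candidate models, and the scoring choice are assumptions for illustration, not a description of how any particular BI tool works.

```python
# Conceptual sketch: try more than one algorithm and keep the one that
# cross-validates best on a small synthetic data set.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Average accuracy across 5 cross-validation folds for each candidate model.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}

best = max(scores, key=scores.get)
print(scores)
print("selected model:", best)
```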