These notes provide an introduction to the concept of big data and analytics. They cover the classification of digital data, structured and unstructured data, the characteristics and evolution of big data, the challenges associated with big data, the differences between traditional business intelligence and big data, and the importance of big data analytics. The notes also discuss data science, data scientists, and the terminologies used in big data environments, such as BASE (Basically Available, Soft state, Eventual consistency) and top analytics tools. Additionally, they introduce the technology landscape, including NoSQL, the comparison of SQL and NoSQL, and Hadoop. This overview lays the foundation for understanding the key aspects of big data and analytics, which are crucial for businesses and organizations in the modern data-driven landscape.
Recognized under 2(f) and 12 (B) of UGC ACT 1956 (Affiliated to JNTUH, Hyderabad, Approved by AICTE - Accredited by NBA & NAAC – ‘A’ Grade - ISO 9001:2015 Certified) Maisammaguda, Dhulapally (Post Via. Hakimpet), Secunderabad – 500100, Telangana State, India
UNIT I: Classification of Digital Data, Structured and Unstructured Data – Introduction to Big Data: Characteristics – Evolution – Definition – Challenges with Big Data

UNIT II: NoSQL, Comparison of SQL and NoSQL, Hadoop – RDBMS Versus Hadoop – Distributed Computing Challenges – Hadoop Overview – Hadoop Distributed File System – Processing Data with Hadoop – Managing Resources and Applications with Hadoop YARN – Interacting with Hadoop Ecosystem

UNIT III: MongoDB: Why MongoDB – Terms used in RDBMS and MongoDB – Data Types – MongoDB Query Language. MapReduce: Mapper – Reducer – Combiner – Partitioner – Searching – Sorting – Compression

UNIT IV: Hive: Introduction – Architecture – Data Types – File Formats – Hive Query Language Statements – Partitions – Bucketing – Views – Sub-Query – Joins – Aggregations – Group by and Having – RCFile Implementation – Hive User Defined Function – Serialization and Deserialization. Pig: Introduction – Anatomy – Features – Philosophy – Use Case for Pig – Pig Latin Overview – Pig Primitive Data Types – Running Pig – Execution Modes of Pig – HDFS Commands – Relational Operators – Eval Function – Complex Data Types – Piggy Bank – User-Defined Functions – Parameter Substitution – Diagnostic Operator – Word Count Example using Pig – Pig at Yahoo! – Pig Versus Hive

UNIT V: Machine Learning: Introduction, Supervised Learning, Unsupervised Learning, Machine Learning Algorithms: Regression Model, Clustering, Collaborative Filtering, Associate Rule Making, Decision Tree, Big Data Analytics with BigR
CONTENTS

Unit I
- Classification of Digital Data, Structured and Unstructured Data – Introduction to Big Data
- Why Big Data – Traditional Business Intelligence versus Big Data – Data Warehouse and Hadoop Environment
- Big Data Analytics: Classification of Analytics – Challenges
- Data Science – Data Scientist – Terminologies used in Big Data Environments
- Basically Available, Soft State, Eventual Consistency – Top Analytics Tools

Unit II
- NoSQL, Comparison of SQL and NoSQL, Hadoop – RDBMS Versus Hadoop – Distributed Computing Challenges
- Hadoop Overview – Hadoop Distributed File System – Processing Data with Hadoop
- Managing Resources and Applications with Hadoop YARN – Interacting with Hadoop Ecosystem

Unit III
- MongoDB: Why MongoDB – Terms used in RDBMS and MongoDB – Data Types – MongoDB Query Language
- MapReduce: Mapper – Reducer – Combiner – Partitioner – Searching – Sorting – Compression

Unit IV
- Hive: Introduction – Architecture – Data Types – File Formats – Hive Query Language Statements
- Partitions – Bucketing – Views – Sub-Query – Joins – Aggregations – Group by and Having – RCFile Implementation – Hive User Defined Function – Serialization and Deserialization
- Pig: Introduction – Anatomy – Features – Philosophy – Use Case for Pig – Pig Latin Overview – Pig Primitive Data Types
- Running Pig – Execution Modes of Pig – HDFS Commands – Relational Operators – Eval Function
- Complex Data Types – Piggy Bank – User-Defined Functions – Parameter Substitution – Diagnostic Operator – Word Count Example using Pig – Pig at Yahoo! – Pig Versus Hive

Unit V
- Machine Learning: Introduction, Supervised Learning, Unsupervised Learning
- Machine Learning Algorithms: Regression Model, Clustering, Collaborative Filtering, Associate Rule Making, Decision Tree, Big Data Analytics with BigR
UNIT I
What is Big Data?

According to Gartner, "Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making."
This definition clearly answers the “What is Big Data?” question – Big Data refers to complex and large data sets that have to be processed and analyzed to uncover valuable information that can benefit businesses and organizations.
However, a few basic tenets of Big Data make it even simpler to answer what Big Data is:
- It refers to a massive amount of data that keeps growing exponentially with time.
- It is so voluminous that it cannot be processed or analyzed using conventional data processing techniques.
- It includes data mining, data storage, data analysis, data sharing, and data visualization.
- The term is an all-comprehensive one, including data and data frameworks, along with the tools and techniques used to process and analyze the data.
The History of Big Data
Although the concept of big data itself is relatively new, the origins of large data sets go back to the 1960s and '70s when the world of data was just getting started with the first data centers and the development of the relational database.
Around 2005, people began to realize just how much data users generated through Facebook, YouTube, and other online services. Hadoop (an open-source framework created specifically to store and analyze big data sets) was developed that same year. NoSQL also began to gain popularity during this time.
The development of open-source frameworks, such as Hadoop (and more recently, Spark) was essential for the growth of big data because they make big data easier to work with and cheaper to store. In the years since then, the volume of big data has skyrocketed. Users are still generating huge amounts of data—but it’s not just humans who are doing it.
With the advent of the Internet of Things (IoT), more objects and devices are connected to the internet, gathering data on customer usage patterns and product performance. The emergence of machine learning has produced still more data.
While big data has come far, its usefulness is only just beginning. Cloud computing has expanded big data possibilities even further. The cloud offers truly elastic scalability, where developers can simply spin up ad hoc clusters to test a subset of data.
Characteristics of Big Data

a) Variety
Variety refers to the many forms that data now takes: structured, semi-structured, and unstructured, coming from sources such as text, images, audio, video, and machine logs.

b) Velocity
Velocity essentially refers to the speed at which data is being created, often in real time. In a broader perspective, it comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and bursts of activity.

c) Volume
Volume is one of the defining characteristics of big data. Big Data implies huge volumes of data being generated on a daily basis from sources such as social media platforms, business processes, machines, networks, and human interactions. Such large amounts of data are stored in data warehouses.
Why is Big Data Important?
The importance of big data does not revolve around how much data a company has but around how the company utilizes the collected data. Every company uses data in its own way; the more efficiently a company uses its data, the more potential it has to grow. A company can take data from any source and analyze it to find answers that enable better and faster decisions.

Big data analytics can help improve nearly every business operation, including the ability to meet customer expectations, adjust the company's product line, and ensure that marketing campaigns are effective.
Business Intelligence vs Big Data
Although Big Data and Business Intelligence are two technologies used to analyze data to help companies in the decision-making process, there are differences between both of them. They differ in the way they work as much as in the type of data they analyze.
Traditional BI methodology is based on the principle of grouping all business data into a central server. Typically, this data is analyzed in offline mode, after storing the information in an environment called Data Warehouse. The data is structured in a conventional relational database with an additional set of indexes and forms of access to the tables (multidimensional cubes).
A Big Data solution differs from a BI solution in several respects. These are the main differences between Big Data and Business Intelligence:

Store – Big Data needs to be collected in a seamless repository, but it does not have to be stored in a single physical database.
Process – Processing becomes more demanding than in the traditional approach in terms of cleansing, enriching, calculating, transforming, and running algorithms.
Access – The data has no business value at all if it cannot be searched and retrieved easily and presented along the business lines.
Classification of analytics
Descriptive analytics

Descriptive analytics is a statistical method that is used to search and summarize historical data in order to identify patterns or meaning.
Data aggregation and data mining are two techniques used in descriptive analytics to discover historical data. Data is first gathered and sorted by data aggregation in order to make the datasets more manageable by analysts.
Data mining describes the next step of the analysis and involves a search of the data to identify patterns and meaning. Identified patterns are analyzed to discover the specific ways that learners interacted with the learning content and within the learning environment.
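To make the aggregation step concrete, here is a minimal pandas sketch; the course names, learners, and scores are invented purely for illustration and are not taken from the notes above.

import pandas as pd

# Hypothetical learner records (invented for illustration)
records = pd.DataFrame({
    "course":  ["Hadoop", "Hadoop", "Hive", "Hive", "Pig"],
    "learner": ["A", "B", "C", "D", "E"],
    "score":   [72, 90, 65, 88, 79],
})

# Data aggregation: summarize historical data per course
summary = records.groupby("course")["score"].agg(["count", "mean", "min", "max"])
print(summary)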
Advantages:
Quickly and easily report on the Return on Investment (ROI) by showing how performance met business or target goals. Identify gaps and performance issues early - before they become problems.
Identify specific learners who require additional support, regardless of how many students or employees there are.
Identify successful learners in order to offer positive feedback or additional resources.
Analyze the value and impact of course design and learning resources.
Predictive analytics
Predictive analytics is a statistical method that utilizes algorithms and machine learning to identify trends in data and predict future behaviors.
The software for predictive analytics has moved beyond the realm of statisticians and is becoming more affordable and accessible for different markets and industries, including the field of learning & development.
For online learning specifically, predictive analytics is often found incorporated in the Learning Management System (LMS), but can also be purchased separately as specialized software.
For the learner, predictive forecasting could be as simple as a dashboard located on the main screen after logging in to access a course. Analyzing data from past and current progress, visual indicators in the dashboard could be provided to signal whether the employee was on track with training requirements.
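As a rough illustration of the idea (not any particular LMS's implementation), the sketch below trains a simple scikit-learn model on made-up learner data to estimate whether an employee will finish a course; the features and labels are entirely hypothetical.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented historical data: [hours of study per week, prior assessment score]
X_train = np.array([[1, 40], [2, 55], [5, 70], [6, 80], [8, 90], [3, 60]])
y_train = np.array([0, 0, 1, 1, 1, 0])  # 1 = completed the course

model = LogisticRegression().fit(X_train, y_train)

# Predict the completion probability for a current learner
current = np.array([[4, 65]])
print("Probability of completing:", model.predict_proba(current)[0][1])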
Advantages:
Personalize the training needs of employees by identifying their gaps, strengths, and weaknesses; specific learning resources and training can be offered to support individual needs.
Retain Talent by tracking and understanding employee career progression and forecasting what skills and learning resources would best benefit their career paths. Knowing what skills employees need also benefits the design of future training.
Support employees who may be falling behind or not reaching their potential by offering intervention support before their performance puts them at risk.
Simplified reporting and visuals that keep everyone updated when predictive forecasting is required.
Prescriptive analytics

Prescriptive analytics is a statistical method used to generate recommendations and make decisions based on the computational findings of algorithmic models.
Generating automated decisions or recommendations requires specific and unique algorithmic models and clear direction from those utilizing the analytical technique. A recommendation cannot be generated without knowing what to look for or what problem is desired to be solved. In this way, prescriptive analytics begins with a problem.
Example: A Training Manager uses predictive analytics to discover that most learners without a particular skill will not complete the newly launched course. What could be done? Prescriptive analytics can now assist on the matter and help determine options for action. Perhaps an algorithm can detect the learners who require that new course but lack that particular skill, and send an automated recommendation that they take an additional training resource to acquire the missing skill.
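A minimal sketch of such a recommendation rule in plain Python is given below; the learner records, the required skill, and the risk threshold are all hypothetical.

# Hypothetical learner records: predicted completion probability and known skills
learners = [
    {"name": "Asha", "p_complete": 0.35, "skills": {"sql"}},
    {"name": "Ravi", "p_complete": 0.80, "skills": {"sql", "statistics"}},
    {"name": "Meena", "p_complete": 0.40, "skills": set()},
]

REQUIRED_SKILL = "statistics"   # skill the new course depends on (assumed)
RISK_THRESHOLD = 0.5            # below this, the learner is unlikely to finish

# Prescriptive step: recommend remedial training to at-risk learners missing the skill
for learner in learners:
    if learner["p_complete"] < RISK_THRESHOLD and REQUIRED_SKILL not in learner["skills"]:
        print(f"Recommend '{REQUIRED_SKILL}' primer to {learner['name']}")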
4. Getting Voluminous Data Into The Big Data Platform
It is hardly surprising that data is growing with every passing day. This simply indicates that business organizations need to handle large amounts of data on a daily basis. The amount and variety of data available these days can overwhelm any data engineer, which is why it is considered vital to make data accessibility easy and convenient for brand owners and managers.
5. Uncertainty Of Data Management Landscape
With the rise of Big Data, new technologies and companies are being developed every day. However, a big challenge faced by companies in Big Data analytics is to find out which technology will be best suited to them without introducing new problems and potential risks.
6. Data Storage And Quality
Business organizations are growing at a rapid pace. As companies and large business organizations grow, the amount of data they produce increases, and storing this massive amount of data is becoming a real challenge for everyone. Popular storage options such as data lakes and data warehouses are commonly used to gather and store large quantities of unstructured and structured data in its native format. The real problem arises when a data lake or warehouse tries to combine unstructured and inconsistent data from diverse sources: missing data, inconsistent data, logic conflicts, and duplicate data all result in data-quality challenges.
7. Security And Privacy Of Data
Once business enterprises discover how to use Big Data, it brings them a wide range of possibilities and opportunities. However, it also involves potential risks when it comes to the privacy and security of the data. The Big Data tools used for analysis and storage utilize data from disparate sources, which eventually leads to a high risk of exposure and makes the data vulnerable. Thus, the rise of voluminous amounts of data increases privacy and security concerns.
Terminologies Used In Big Data Environments
As-a-service infrastructure
Data-as-a-service, software-as-a-service, platform-as-a-service – all refer to the idea that rather than selling data, licences to use data, or platforms for running Big Data technology as products, they can be provided "as a service". This reduces the upfront capital investment
necessary for customers to begin putting their data, or platforms, to work for them, as the provider bears all of the costs of setting up and hosting the infrastructure. As a customer, as-a-service infrastructure can greatly reduce the initial cost and setup time of getting Big Data initiatives up and running.
Data science
Data science is the professional field that deals with turning data into value such as new insights or predictive models. It brings together expertise from fields including statistics, mathematics, computer science, communication as well as domain expertise such as business knowledge. Data scientist has recently been voted the No 1 job in the U.S., based on current demand and salary and career opportunities.
Data mining
Data mining is the process of discovering insights from data. In terms of Big Data, because it is so large, this is generally done by computational methods in an automated way using methods such as decision trees, clustering analysis and, most recently, machine learning. This can be thought of as using the brute mathematical power of computers to spot patterns in data which would not be visible to the human eye due to the complexity of the dataset.
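As one concrete example, clustering analysis (one of the methods named above) can be sketched with scikit-learn; the two-dimensional points are invented purely to show the mechanics of letting the computer spot groups in data.

import numpy as np
from sklearn.cluster import KMeans

# Invented data points, e.g. (purchases per month, average basket value)
X = np.array([[1, 10], [2, 12], [1, 9],      # low-activity group
              [9, 80], [10, 95], [8, 85]])   # high-activity group

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)           # cluster assignment for each point
print(kmeans.cluster_centers_)  # the "patterns" the algorithm has found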
Hadoop
Hadoop is a framework for Big Data computing which has been released into the public domain as open source software, and so can freely be used by anyone. It consists of a number of modules, each tailored for a different vital step of the Big Data process – from file storage (the Hadoop Distributed File System, HDFS) to data processing (MapReduce) and resource management (YARN).
Predictive modelling
At its simplest, this is predicting what will happen next based on data about what has happened previously. In the Big Data age, because there is more data around than ever before, predictions are becoming more and more accurate. Predictive modelling is a core component of most Big Data initiatives, which are formulated to help us choose the course of action that will lead to the most desirable outcome. The speed of modern computers and the volume of data available mean that predictions can be made based on a huge number of variables, with each variable assessed for the probability that it will lead to success.
MapReduce
MapReduce is a computing procedure for working with large datasets, devised because of the difficulty of reading and analysing really Big Data using conventional computing methodologies. As its name suggests, it consists of two procedures – mapping (sorting information into the format needed for analysis, e.g. sorting a list of people according to their age) and reducing (performing an operation, such as checking the age of everyone in the dataset to see who is over 21).
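As a toy illustration of those two steps, the single-machine Python sketch below mimics the map and reduce phases on an invented list of people; it is not how Hadoop itself is programmed or invoked.

from collections import defaultdict

# Invented records: (name, age)
people = [("Asha", 29), ("Ravi", 19), ("Meena", 34), ("Kiran", 17)]

# Map: emit a (key, value) pair per record - here, whether the person is over 21
mapped = [("over_21" if age > 21 else "under_21", 1) for _, age in people]

# Shuffle/Reduce: group the pairs by key and sum the values
counts = defaultdict(int)
for key, value in mapped:
    counts[key] += value

print(dict(counts))  # e.g. {'over_21': 2, 'under_21': 2}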
Spark

Spark is another open source framework like Hadoop, but more recently developed and more suited to handling cutting-edge Big Data tasks involving real-time analytics and machine learning. Unlike Hadoop it does not include its own filesystem, though it is designed to work with Hadoop's HDFS or a number of other options. However, for certain data-related processes it is able to calculate at over 100 times the speed of Hadoop, thanks to its in-memory processing capability. This means it is becoming an increasingly popular choice for projects involving deep learning, neural networks and other compute-intensive tasks.
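A minimal PySpark sketch is shown below; it assumes the pyspark package is installed and uses a tiny in-memory dataset rather than HDFS, so it only hints at how Spark jobs are expressed.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tiny-demo").getOrCreate()
sc = spark.sparkContext

# Tiny in-memory dataset; in practice this would be read from HDFS or another store
lines = sc.parallelize(["big data tools", "big data analytics", "spark analytics"])

word_counts = (lines.flatMap(lambda line: line.split())
                    .map(lambda word: (word, 1))
                    .reduceByKey(lambda a, b: a + b))

print(word_counts.collect())
spark.stop()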
Structured Data
Structured data is simply data that can be arranged neatly into charts and tables consisting of rows, columns or multi-dimensioned matrixes. This is traditionally the way that computers have stored data, and information in this format can easily and simply be processed and mined for insights. Data gathered from machines is often a good example of structured data, where various data points (for example, timestamps or sensor readings) are recorded in a fixed, predefined format.
Unstructured Data
Unstructured data is any data which cannot easily be put into conventional charts and tables. This can include video data, pictures, recorded sounds, text written in human languages and a great deal more. This data has traditionally been far harder to draw insight from using computers which were generally designed to read and analyze structured information. However, since it has become apparent that a huge amount of value can be locked away in this unstructured data, great efforts have been made to create applications which are capable of understanding unstructured data – for example visual recognition and natural language processing.
Visualization
Humans find it very hard to understand and draw insights from large amounts of text or numerical data – we can do it, but it takes time, and our concentration and attention are limited. For this reason, efforts have been made to develop computer applications capable of rendering information in a visual form – charts and graphics which highlight the most important insights resulting from our Big Data projects. A subfield of reporting (see above), visualization is now often an automated process, with visualizations customized by algorithm so that they are understandable to the people who need to act or take decisions based on them.
Basic availability, Soft state and Eventual consistency
Basic availability implies continuous system availability despite network failures, together with tolerance of temporary inconsistency.

Soft state means that the state of the system may change over time even without new input, as a consequence of the eventual consistency model.

Eventual consistency means that if no further updates are made to a given data item for a long enough period of time, all users will eventually see the same value for that item.
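To illustrate these three properties, here is a toy, single-process Python sketch; the two replica dictionaries and the "price" item are invented and only mimic what a distributed store does across nodes.

# Two replicas of the same data item (a toy, in-process model of a distributed store)
replica_a = {"price": 100}
replica_b = {"price": 100}

# A write is applied to replica A only (basic availability: both replicas still serve reads)
replica_a["price"] = 120

# Soft state: until the replicas sync, readers may see different values
print(replica_a["price"], replica_b["price"])  # 120 100  (temporarily inconsistent)

# A background anti-entropy sync propagates the update
replica_b.update(replica_a)

# Eventual consistency: with no further updates, all readers now see the same value
print(replica_a["price"], replica_b["price"])  # 120 120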
Top Analytics Tools
R is a language for statistical computing and graphics. It is also used for big data analysis. It provides a wide variety of statistical tests.
Features:
- Effective data handling and storage facility
- A suite of operators for calculations on arrays, in particular matrices
- A coherent, integrated collection of big data tools for data analysis
- Graphical facilities for data analysis which display either on-screen or on hardcopy
Apache Spark is a powerful open-source big data analytics tool. It offers over 80 high-level operators that make it easy to build parallel apps. It is used at a wide range of organizations to process large datasets.
Features:
- Helps run an application in a Hadoop cluster, up to 100 times faster in memory and ten times faster on disk
- Offers lightning-fast processing
- Support for sophisticated analytics
- Ability to integrate with Hadoop and existing Hadoop data
Plotly is an analytics tool that lets users create charts and dashboards to share online; a short usage sketch follows the feature list below.
Features:
- Easily turn any data into eye-catching and informative graphics
- Provides audited industries with fine-grained information on data provenance
- Offers unlimited public file hosting through its free community plan
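A minimal Plotly sketch is shown below; it assumes the plotly Python package is installed, and the course names and scores are invented for illustration.

import plotly.express as px

# Invented data: average assessment score per course
fig = px.bar(
    x=["Hadoop", "Hive", "Pig"],
    y=[82, 74, 91],
    labels={"x": "Course", "y": "Average score"},
    title="Average assessment score by course",
)
fig.write_html("scores.html")  # open the HTML file, or call fig.show() interactively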
Lumify is a big data fusion, analysis, and visualization platform. It helps users discover connections and explore relationships in their data via a suite of analytic options.
Features:
It provides both 2D and 3D graph visualizations with a variety of automatic layouts
UNIT II
NoSQL
NoSQL is a non-relational DBMS that does not require a fixed schema, avoids joins, and is easy to scale. NoSQL databases are used for distributed data stores with humongous data storage needs. NoSQL is used for Big Data and real-time web applications; for example, companies like Twitter, Facebook, and Google collect terabytes of user data every single day.
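To show what "no fixed schema" looks like in practice, here is a short sketch using the pymongo driver; it assumes a MongoDB server running on localhost, and the database, collection, and documents are invented purely for illustration (MongoDB itself is covered in Unit III).

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed local MongoDB instance
db = client["demo_db"]                             # hypothetical database name

# Documents in the same collection need not share a fixed schema
db.users.insert_one({"name": "Asha", "age": 29, "interests": ["cricket", "ml"]})
db.users.insert_one({"name": "Ravi", "city": "Hyderabad"})  # different fields, no ALTER TABLE

print(db.users.find_one({"age": {"$gt": 21}}))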
SQL
Structured Query Language (SQL), pronounced "S-Q-L" or sometimes "See-Quel", is the standard language for dealing with relational databases. A relational database defines relationships in the form of tables.

SQL programming can be effectively used to insert, search, update, and delete database records.
Comparison of SQL and NoSQL
Definition
  SQL: SQL databases are primarily called RDBMS or relational databases.
  NoSQL: NoSQL databases are primarily called non-relational or distributed databases.

Designed for
  SQL: Traditional RDBMS uses SQL syntax and queries to analyze and get the data for further insights; they are used for OLAP systems.
  NoSQL: NoSQL database systems consist of various kinds of database technologies, developed in response to the demands of modern applications.

Query language
  SQL: Structured Query Language (SQL).
  NoSQL: No declarative query language.

Type
  SQL: Table-based databases.
  NoSQL: Document-based, key-value pair, or graph databases.

Schema
  SQL: Predefined schema.
  NoSQL: Dynamic schema for unstructured data.

Ability to scale
  SQL: Vertically scalable.
  NoSQL: Horizontally scalable.

Examples
  SQL: Oracle, Postgres, MS-SQL.
  NoSQL: MongoDB, Redis, Neo4j, Cassandra, HBase.

Best suited for
  SQL: An ideal choice for complex, query-intensive environments.
  NoSQL: Not a good fit for complex queries.

Hierarchical data storage
  SQL: Not suitable for hierarchical data storage.
  NoSQL: More suitable for hierarchical data, as it supports the key-value pair method.

Variations
  SQL: One type with minor variations.
  NoSQL: Many different types, including key-value stores, document databases, and graph databases.

Development year
  SQL: Developed in the 1970s to deal with the issues of flat-file storage.
  NoSQL: Developed in the late 2000s to overcome the issues and limitations of SQL databases.

Open source
  SQL: A mix of open-source (e.g. Postgres, MySQL) and commercial (e.g. Oracle Database) systems.
  NoSQL: Open source.

Consistency
  SQL: Should be configured for strong consistency.
  NoSQL: Depends on the DBMS; some offer strong consistency (e.g. MongoDB), whereas others offer only eventual consistency (e.g. Cassandra).

Best used for
  SQL: An RDBMS is the right option for solving ACID problems.
  NoSQL: Best used for solving data-availability problems.

Importance
  SQL: Use when data validity is paramount.
  NoSQL: Use when it is more important to have fast data than correct data.

Best option
  SQL: When you need to support dynamic queries.
  NoSQL: When you need to scale based on changing requirements.

Hardware
  SQL: Specialized database hardware (Oracle Exadata, etc.).
  NoSQL: Commodity hardware.

Network
  SQL: Highly available network (InfiniBand, FabricPath, etc.).
  NoSQL: Commodity network (Ethernet, etc.).

Storage type
  SQL: Highly available storage (SAN, RAID, etc.).
  NoSQL: Commodity drive storage (standard HDDs, JBOD).

Best features
  SQL: Cross-platform support, secure and free.
  NoSQL: Easy to use, high performance, and flexible.

Top companies using
  SQL: Hootsuite, CircleCI, Gauges.
  NoSQL: Airbnb, Uber, Kickstarter.

Average salary
  SQL: The average salary for a professional SQL developer is about $84,328 per year in the U.S.A.
  NoSQL: The average salary for a NoSQL developer is approximately $72,174 per year.
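To make the query-language row above concrete, the sketch below runs the same question against an in-memory SQLite table and shows, as a comment, how the equivalent MongoDB filter document would look; the table, fields, and values are invented for illustration.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, age INTEGER)")   # predefined schema
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [("Asha", 29), ("Ravi", 19)])

# SQL: a declarative query over the fixed schema
rows = conn.execute("SELECT name FROM users WHERE age > 21").fetchall()
print(rows)  # [('Asha',)]

# The equivalent MongoDB (NoSQL) query is expressed as a filter document:
#   db.users.find({"age": {"$gt": 21}}, {"name": 1})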
ACID vs. BASE Model
ACID (Atomicity, Consistency, Isolation, and Durability) is a standard for RDBMS.

BASE (Basically Available, Soft state, Eventually consistent) is a model followed by many NoSQL systems.