Data mining and data warehouse, Assignments of Data Mining

This is a complete assignment related to data mining and data warehouse


Q. Define data warehouse. Explain the architecture of data warehouse with diagram. (10 marks)

According to the definition of Bill Inmon, "A data warehouse is a subject-oriented, integrated, non-volatile and time-variant collection of data in support of management's decisions."
  i. Subject-Oriented Data: In a data warehouse, data is stored by subject, not by application.
  ii. Integrated Data: Data in the data warehouse comes from several operational systems; inconsistencies are removed and the source data is transformed and integrated.
  iii. Time-Variant: The data warehouse has to contain historical data, not just current values.
  iv. Non-volatile Data: Data is not updated or deleted from the data warehouse in real time.

Components of a data warehouse:

  1. Operational source:
    • An operational source is a data source consisting of operational data and external data.
    • Data can come from a relational DBMS such as Oracle.
  2. Load manager:
    • The load manager performs all operations associated with extracting data and loading it into the data warehouse.
    • These tasks include simple transformations that prepare the data for entry into the warehouse.
  3. Query manager:
    • It performs all the tasks associated with the management of user queries.
    • The complexity of the query manager is determined by the end-user access tools and the features provided by the database.
  4. OLAP server (Online Analytical Processing):
    • Tools that allow advanced, multi-dimensional analysis of the data.
  5. Metadata:
    • Information about the data in the warehouse, such as where it came from, when it was last updated and how it is structured.
    • It acts like a catalog for the data.
  6. Data models:
    • Star schema: uses a central table (fact table) surrounded by dimension tables.
    • Snowflake schema: similar to the star schema but with normalized dimension tables.
  7. End-user tools:
    • End-user tools consist of analysis, reporting and mining tools.
    • Through end-user tools, users can interact with the warehouse.

Architecture of a data warehouse:
  1. Bottom Tier − The bottom tier of the architecture is the data warehouse database server, which is a relational database system. We use back-end tools and utilities to feed data into the bottom tier; these back-end tools and utilities perform the extract, clean, load and refresh functions (a minimal sketch of this step follows this list).
  2. Middle Tier − In the middle tier we have the OLAP server, which can be implemented in either of the following ways:
  • By Relational OLAP (ROLAP), which is an extended relational database management system; ROLAP maps operations on multidimensional data to standard relational operations.
  • By Multidimensional OLAP (MOLAP), which directly implements multidimensional data and operations.
  3. Top Tier − This tier is the front-end client layer. It holds the query tools, reporting tools, analysis tools and data mining tools.
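The extract, clean, load and refresh step performed by the back-end tools in the bottom tier can be pictured with a small, hedged sketch. The file name, table name and columns below (daily_sales.csv, sales_fact, date/product/amount) are invented for illustration and are not part of the original answer.

```python
# Minimal ETL sketch for the bottom tier: extract from an operational source,
# clean/transform the batch, and load it into a relational warehouse table.
# All names (file, table, columns) are illustrative assumptions.
import sqlite3
import pandas as pd

warehouse = sqlite3.connect("warehouse.db")   # relational warehouse database server

# Extract: read raw operational data (a CSV file stands in for an OLTP source)
raw = pd.read_csv("daily_sales.csv")          # assumed columns: date, product, amount

# Clean/transform: drop rows with missing amounts, normalise the date format
clean = raw.dropna(subset=["amount"]).copy()
clean["date"] = pd.to_datetime(clean["date"]).dt.date

# Load/refresh: append the prepared batch into the warehouse fact table
clean.to_sql("sales_fact", warehouse, if_exists="append", index=False)
```

In a production warehouse this step would usually be handled by dedicated ETL tooling rather than a script, but the operations (extract, clean, load, refresh) are the same ones named above.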

Q. Define multi-dimensional data model. Explain different types of schemas used in data warehouse with example. (10 marks)

Dimensional modelling is a design technique used in data warehousing where data is organized into fact and dimension tables to optimize query performance and ease of use and to support business processes. It often employs star or snowflake schema designs in the database architecture. A multi-dimensional data model is a representation of data that allows information to be viewed and analysed from multiple perspectives, or dimensions, simultaneously.

Different types of schemas used in a data warehouse: a schema is a logical description of the entire database. It includes the name and description of all record types, including all associated data items and aggregates.

  1. Star schema
    • The star schema is a type of multidimensional model used for data warehouses.
    • In a star schema, a central fact table is connected to the dimension tables.
    • This schema uses fewer foreign-key joins.
    • It forms a star shape, with the fact table at the center and the dimension tables around it (a small worked example follows this list).
  2. Snowflake schema
    • The snowflake schema is also a type of multidimensional model used for data warehouses.
    • In a snowflake schema, the fact table is connected to dimension tables, which are further normalized into sub-dimension tables.
    • It forms a snowflake shape, with fact, dimension and sub-dimension tables.
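As a rough illustration of the star schema described above, the sketch below builds one fact table and two dimension tables with pandas and answers a query by joining them. The table names, keys and values are invented for this example.

```python
# Tiny star-schema sketch: a central fact table keyed to two dimension tables.
# All table contents and names are invented for illustration.
import pandas as pd

dim_product = pd.DataFrame({"product_id": [1, 2],
                            "product_name": ["Pen", "Notebook"]})
dim_time = pd.DataFrame({"time_id": [10, 11],
                         "month": ["Jan", "Feb"]})
fact_sales = pd.DataFrame({"product_id": [1, 2, 1],
                           "time_id": [10, 10, 11],
                           "sales_amount": [100, 250, 80]})

# A typical star-schema query: join the fact table to its dimensions, then aggregate.
report = (fact_sales
          .merge(dim_product, on="product_id")
          .merge(dim_time, on="time_id")
          .groupby(["month", "product_name"])["sales_amount"]
          .sum())
print(report)
```

A snowflake schema would look the same except that a dimension table such as dim_product would itself be split further, for example into product and product-category tables joined by another key.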

Q. Define data mining. Explain data mining issues and applications. (10 marks)

Data mining is the process of analyzing large datasets to discover patterns, trends, and relationships that can provide valuable insights or predictions.

Issues in data mining:

  1. Data Quality: Poor data quality, including missing values, inconsistencies, and errors, can lead to misleading results. Ensuring clean and accurate data is essential for effective data mining.
  2. Privacy Concerns: As data mining can extract hidden information and patterns, it can potentially invade the privacy of individuals, especially when handling personal or sensitive data.
  3. Data Security: Safeguarding the data being mined is critical. Unauthorized access or breaches can lead to misuse of sensitive information.
  4. Overfitting: This occurs when a model is too closely tailored to the training dataset, making it perform poorly on new, unseen data. Overfit models capture noise rather than the underlying pattern.
  5. Complexity of Algorithms: Some data mining algorithms can be inherently complex and hard to understand. This black-box nature can lead to challenges in the interpretability and trustworthiness of the results.

Applications of data mining:
  1. Healthcare: Predicting disease outbreaks based on health data trends, or identifying risk factors for diseases from large patient datasets.
  2. Sports: Analyzing player performance data to make decisions about team strategy, player health, and recruitment.
  3. E-Learning and Education: Analyzing student performance to identify learning patterns, predict student outcomes, or provide personalized learning paths.
  4. Agriculture: Identifying patterns related to crop yields based on factors like weather conditions, soil quality, and farming practices.
  5. Social Media: Analyzing user behaviour to deliver targeted advertisements, or understanding trending topics and performing sentiment analysis.

Q. What is data mining? Explain data mining tasks/functionalities. (10 marks)

Data mining is the process of discovering patterns, trends, and useful knowledge from large datasets. The main data mining tasks/functionalities are:

  1. Classification: Aim: Categorize data into predefined classes or groups. Example: Diagnosing diseases based on patient symptoms. If a patient has a set of symptoms, classify the patient as having one of several possible diseases.
  2. Regression: Aim: Predict a continuous or ordered value based on input data. Example: Predicting a student's final grade based on their attendance, homework scores, and midterm exam scores.
  3. Clustering: Aim: Group data points that are similar to each other, without prior knowledge of classes. Example: Segmenting customers for a retail business based on their buying behaviors, even if we don’t know the segments in advance.
  4. Association Rule Learning (or Market Basket Analysis): Aim: Discover interesting relationships or associations between variables in large datasets. Example: Finding out that people who buy diapers often also buy baby wipes from a supermarket transaction database.
  5. Anomaly Detection (or Outlier Detection): Aim: Identify data points that deviate significantly from the norm or expected pattern. Such data points are termed anomalies or outliers. Example: Detecting potentially fraudulent activities in a set of credit card transactions.
Steps in the Knowledge Discovery in Databases (KDD) process:
  • Data Cleaning: Removing noise and irrelevant data.
  • Data Integration: Combining data from different sources.
  • Data Selection: Choosing the relevant data for the analysis.
  • Data Transformation: Converting data into a format suitable for mining, which may involve normalization, aggregation, or other operations.
  • Data Mining: Applying statistical and machine learning algorithms to extract patterns or knowledge from the prepared data.
  • Pattern Evaluation: Identifying the truly interesting patterns representing knowledge based on some measures.
  • Knowledge Presentation: Representing this knowledge in a comprehensible manner using visualization techniques, reports, etc. (A minimal end-to-end sketch of these steps is given after the comparison below.)

KDD vs. data mining:
  • Scope: KDD is a broader process that encompasses all the steps from raw data collection to knowledge presentation; data mining is a particular step within the KDD process where specific algorithms are applied to extract patterns.
  • Objective: The primary objective of KDD is knowledge discovery; the primary objective of data mining is pattern discovery.
  • Components: KDD involves data cleaning, integration, selection, transformation, mining, pattern evaluation, and knowledge presentation; data mining focuses mainly on clustering, classification, association, and prediction.
  • Output: KDD's end result is actionable knowledge; data mining's output is patterns or models.
  • Example: KDD − data analysis to find patterns and links; data mining − clustering groups of data elements based on how similar they are.
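To make the KDD steps above concrete, here is a small, hedged end-to-end sketch in Python. The dataset, column names and the choice of K-means for the mining step are assumptions made for illustration, not part of the original answer.

```python
# Minimal sketch of the KDD steps: cleaning, selection, transformation,
# mining (clustering here), and a simple pattern evaluation.
# The data, columns and chosen algorithm are illustrative assumptions.
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

raw = pd.DataFrame({"age":   [23, 45, 31, None, 52, 36],
                    "spend": [200, 900, 350, 400, 1100, 500],
                    "name":  ["a", "b", "c", "d", "e", "f"]})

clean = raw.dropna()                                  # data cleaning: drop missing values
selected = clean[["age", "spend"]]                    # data selection: keep relevant attributes
scaled = StandardScaler().fit_transform(selected)     # data transformation: normalization

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)   # data mining
print("silhouette score:", silhouette_score(scaled, model.labels_))   # pattern evaluation
```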

Q. Explain multidimensional data model with example.

The multi-dimensional data model is a method used for organizing data in the database, with a good arrangement and assembly of the database contents. The main concepts of the multidimensional data model are:

  • Cube Visualization: Data is represented like a 3D cube.
  • Dimensions: Categories like Time, Products, or Regions.
  • Facts: Numeric values such as Sales or Profit.
  • Quick Queries: Allows fast and efficient data analysis.
  • Decision Support: Used for business insights and trend spotting.

Key terms:
  • Dimensions: Fundamental categories or perspectives through which data can be analyzed (e.g., Time, Product, Customer, Location).
  • Facts: Measurable data points or metrics that represent a business's performance or activities (e.g., Sales, Profit, Quantity).
  • Cubes: A data structure that organizes facts and dimensions in a way that allows multidimensional analysis; it is like a 3D representation of data (a small sketch follows this list).
  • Star Schema: Centers around a fact table linked to dimension tables.
  • Snowflake Schema: Further normalizes the dimension tables.
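A cube of this kind can be approximated with a pandas pivot table. The months, products, regions and sales figures below are invented purely to illustrate slicing a fact (Sales) along dimensions (Time and Product).

```python
# Sketch of a tiny "cube": the Sales fact viewed along Time and Product dimensions.
# All values are invented for illustration.
import pandas as pd

data = pd.DataFrame({"month":   ["Jan", "Jan", "Feb", "Feb"],
                     "product": ["Pen", "Notebook", "Pen", "Notebook"],
                     "region":  ["East", "East", "West", "West"],
                     "sales":   [100, 250, 80, 300]})

# Slice along month x product, summing the fact; adding "region" to the index
# would expose a third dimension of the cube.
cube = data.pivot_table(index="month", columns="product",
                        values="sales", aggfunc="sum")
print(cube)
```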

Q. What is data mining? Explain various techniques of data mining. (10 marks)

Data mining is the process of discovering patterns, correlations, trends, and useful information from large sets of data using techniques from fields such as statistics, machine learning, and database systems.

Various techniques used for data mining:

  1. Decision Trees: Graphical representations that use a tree-like model to make decisions. They split the data based on the values of input variables and are relatively easy to interpret.
  2. Clustering (e.g., K-means): Grouping data points based on similarity. K-means is a popular and simple clustering technique where 'K' clusters are formed based on data attributes.
  3. Association Rule Learning (e.g., Apriori algorithm): Often used in market basket analysis, it finds items that tend to be bought together.
  4. Naive Bayes Classifier: A simple probabilistic classifier based on applying Bayes' theorem. Especially popular for text classification.
  5. Basic Time Series Analysis: Analyzing time-ordered data points. Techniques might involve moving averages or identifying seasonality.

Q. Explain Apriori algorithm with example. (10 marks)

1. Purpose: The Apriori algorithm is used to find frequent itemsets in a dataset and then derive association rules from them.
2. Principle: The algorithm is based on the idea that every subset of a frequent itemset must also be frequent.
3. Initialization: Start by counting the occurrence (support) of each item in the dataset and collecting all items that meet a minimum support threshold.
4. Iteration: For each subsequent step, generate new candidate itemsets by combining the previous frequent itemsets.
5. Pruning (trimming): Before counting the frequency of these new candidate itemsets, prune those with any subset that is not frequent, using the Apriori property.
6. Counting: Count the support of the remaining candidate itemsets.
7. Threshold Check: Keep those itemsets that meet the minimum support threshold.
8. Repetition: Repeat steps 4-7 until no more frequent itemsets can be generated.
9. Association Rules: From the list of frequent itemsets, derive association rules that have a confidence level above a given threshold.
10. Example: Suppose that in a supermarket dataset, bread (B) is bought 100 times, and bread and butter together (B&B) are bought 50 times. The support of B is 100/total transactions and of B&B is 50/total transactions. If our threshold is 0.01 (1% of total transactions) and both support values are above it, both itemsets are considered frequent. The confidence of the rule bread → butter is 50/100 = 0.5, or 50%. If this is above our confidence threshold, the association rule is accepted.
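A simplified sketch of the first two Apriori passes (frequent 1-itemsets, then candidate 2-itemsets pruned by the Apriori property) is given below. The toy transactions and the 0.4 support threshold are assumptions made for illustration; a full implementation would keep iterating to larger itemsets.

```python
# Simplified sketch of the Apriori idea on a toy transaction list:
# count 1-itemsets, keep the frequent ones, then form and count 2-itemsets.
# Transactions and thresholds are invented for illustration.
from itertools import combinations

transactions = [{"bread", "butter"}, {"bread", "milk"},
                {"bread", "butter", "milk"}, {"milk"}, {"bread"}]
min_support = 0.4
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item of the itemset."""
    return sum(itemset <= t for t in transactions) / n

# Pass 1: frequent single items
items = {i for t in transactions for i in t}
freq1 = {frozenset([i]) for i in items if support({i}) >= min_support}

# Pass 2: candidate pairs built only from frequent items (the Apriori property)
candidates = {a | b for a, b in combinations(freq1, 2)}
freq2 = {c for c in candidates if support(c) >= min_support}

# Derive one rule per frequent pair (simplified: only one direction is shown)
for pair in freq2:
    a, b = sorted(pair)
    conf = support(pair) / support({a})
    print(f"{a} -> {b}: support={support(pair):.2f}, confidence={conf:.2f}")
```

With the supermarket example above (bread bought 100 times, bread and butter together 50 times), the same arithmetic gives confidence 50/100 = 0.5 for the rule bread → butter.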

Q. Classification by decision tree.

  • Tree Structure: Uses a tree-like model of decisions and their possible consequences.
  • Feature Splits: At each internal node, the dataset is split based on a feature value.
  • Leaf Nodes: Represent final class labels or outcomes.
  • Traversal: New data is classified by traversing the tree from the root to a leaf based on feature tests.
  • Pruning: Reduces tree complexity to avoid overfitting and improve generalization.
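The points above can be tried out with a minimal scikit-learn decision tree. The symptom features (fever, cough) and the flu labels below are invented for illustration.

```python
# Minimal decision-tree classification sketch with scikit-learn.
# Features and labels are invented for illustration.
from sklearn.tree import DecisionTreeClassifier, export_text

# Each row is a patient: [fever, cough]; label 1 = flu, 0 = not flu
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0]]
y = [1, 0, 0, 0, 1, 0]

# max_depth acts as a simple stand-in for pruning, limiting overfitting
tree = DecisionTreeClassifier(max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["fever", "cough"]))  # the learned feature splits
print(tree.predict([[1, 1]]))   # traverse the tree for a new patient with fever and cough
```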

Q. Explain Bayesian classification with example.

Example:
Situation: You receive an email in your inbox.
Goal: Determine whether the email is spam or not using Bayesian classification.
Prior knowledge:

  • 80% of the emails you receive are genuine, and 20% are spam.
  • 90% of spam emails contain the word "lottery", but only 5% of genuine emails mention "lottery".

New email: "Congratulations! You've won the lottery."

Using Bayesian classification:
  • Probability it's spam given the word "lottery": Using Bayes' theorem with the figures above, the posterior probability that it is spam works out to about 82%.
  • Probability it's genuine given the word "lottery": By the same approach, the probability that it is genuine is about 18%, so the email is classified as spam.
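The arithmetic behind those two numbers is just Bayes' theorem applied to the prior knowledge listed above; a short check in Python:

```python
# Bayes' theorem with the figures from the example above:
# P(spam) = 0.20, P(genuine) = 0.80,
# P("lottery" | spam) = 0.90, P("lottery" | genuine) = 0.05.
p_spam, p_genuine = 0.20, 0.80
p_lottery_given_spam, p_lottery_given_genuine = 0.90, 0.05

# Total probability of seeing the word "lottery" in any email
p_lottery = p_lottery_given_spam * p_spam + p_lottery_given_genuine * p_genuine

# Posterior: P(spam | "lottery")
p_spam_given_lottery = p_lottery_given_spam * p_spam / p_lottery
print(round(p_spam_given_lottery, 3))   # 0.818 -> about 82% spam, about 18% genuine
```

Since the spam posterior is far above the genuine posterior, the email is classified as spam.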

Q. Clustering methods.

  1. K-Means:
    • How it works: It partitions data into K distinct, non-overlapping clusters based on the distance of each point to the cluster centroids.
    • Usage: Suitable for data where clusters are roughly spherical and equally sized. (A minimal K-means sketch follows this list.)
  2. Hierarchical Clustering:
    • How it works: It builds a tree (dendrogram) of clusters by successively merging or splitting groups of data points.
    • Usage: Great for understanding hierarchical relationships; can be used with small datasets or when a tree structure is desired.
  3. DBSCAN (Density-Based Spatial Clustering of Applications with Noise):
    • How it works: It groups together data points that are closely packed based on a distance measure and a minimum number of points, marking more isolated points as outliers.
    • Usage: Suitable for data with clusters of similar density and when there may be noise/outliers.
  4. Gaussian Mixture Models (GMM):
    • How it works: Assumes that the data is generated from a mixture of several Gaussian distributions. It uses the Expectation-Maximization technique to optimize the likelihood of the data fitting these distributions.
    • Usage: Useful when the clusters have different sizes and correlations within them.
  5. Agglomerative Clustering:
    • How it works: A type of hierarchical clustering where each data point starts as an individual cluster. It merges the closest pairs of clusters in each successive step until only one cluster remains.
    • Usage: Effective for smaller datasets and provides a tree overview of the hierarchical clustering structure.
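The K-means sketch referenced in the list above, using scikit-learn; the two customer features (annual spend, visits) and the data points are invented for illustration.

```python
# Minimal K-means sketch with scikit-learn (K = 2 clusters).
# The customer features and values are invented for illustration.
from sklearn.cluster import KMeans

# Each row is a customer: [annual_spend, visits_per_month]
X = [[200, 4], [220, 5], [800, 20], [850, 22], [210, 3], [790, 19]]

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(kmeans.labels_)            # cluster index assigned to each customer
print(kmeans.cluster_centers_)   # centroid of each cluster
```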

Q. Hierarchical clustering.

  • Type of Clustering: Creates a tree-like hierarchy of clusters.
  • Two Approaches:
    1. Agglomerative: Starts with each data point as its own cluster and merges them step by step.
    2. Divisive: Starts with one big cluster and splits it step by step.
  • Dendrogram: A tree diagram that shows the sequence of merges or splits.
  • No Need for Cluster Number: Unlike K-means, you don't need to specify the number of clusters beforehand.
  • Best for Smaller Datasets: More computationally intensive than other methods, making it ideal for smaller sets of data.
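A minimal agglomerative (bottom-up) hierarchical clustering sketch with SciPy; the points and the choice of Ward linkage are illustrative assumptions.

```python
# Bottom-up hierarchical clustering sketch with SciPy.
# The points and linkage method are illustrative assumptions.
from scipy.cluster.hierarchy import linkage, fcluster

points = [[1, 2], [2, 2], [8, 8], [9, 8], [1, 1]]

# linkage() records the sequence of merges, i.e. the data behind a dendrogram
merges = linkage(points, method="ward")

# Cut the tree so that at most 2 clusters remain
print(fcluster(merges, t=2, criterion="maxclust"))
# scipy.cluster.hierarchy.dendrogram(merges) would draw the tree with matplotlib
```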