MSDA 665 EXAM QUESTIONS AND CORRECT ANSWERS, Exams of Data Mining

Typology: Exams

2024/2025

Available from 06/21/2025

joyce-williams
If there are 6,000 tweets per second, and 500 million tweets per day, this explicitly illustrates the ____ aspects of Big Data. - answer velocity / volume
Which of the following is NOT a key characteristic of Big Data? - answer Relational
Which of the following is NOT a challenge presented for Data Scientists given the current analytic architecture? - answer Analytic project data is centrally-managed
If you're the Analyst, which one person (Role) is most important for you to proceed? - answer Project Sponsor
You learn in the Discovery Phase that your project will likely fail. You proceed. Your costs to that point are $500. When you communicate results, senior managers trash and cancel your project. At that point, from a 'rule of thumb' standpoint, all told you could have wasted... - answer 5,000,000
In Phase 2, Data Preparation, what is one common and effective way to conduct data exploration? - answer Visualization
Which phase would R most likely be used in? - answer Phase 4: Model Execution
Which phase would Hadoop most likely be used in? - answer Phase 2: Data Preparation
Who makes and provides R? - answer Open Source Community
How is the mean different from the median? - answer The mean is the average; the median is the number separating the higher half of a sample from the lower half.
What will the 1st Quartile be for summary(x)? - answer 1.5
Which method listed is NOT the BEST way to get help in R? - answer Consult Dr. Downing or the TA
Using the NOIR classification, what type of data is a person's height? - answer Ratio
What does correlation tell us? - answer Correlation is the degree to which two or more measurements show a tendency to vary together.
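To make the correlation answer above concrete, here is a minimal sketch in Python (not the R used in the course) that computes a Pearson correlation coefficient by hand. The `hours` and `scores` values and the `pearson` helper are hypothetical and chosen only for illustration.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    # Co-variation: positive when x and y tend to sit on the same side of their means.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical paired measurements that rise together.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 70]
r = pearson(hours, scores)

# Reversing one series flips the sign: the two now vary in opposite directions.
r_reversed = pearson(hours, list(reversed(scores)))
```

A value of `r` near 1 means the two measurements tend to rise together, a value near -1 means one tends to fall as the other rises, and a value near 0 means there is little linear tendency either way.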

About what do you expect the correlation of Sales and Month to be? - answer 1
What does it tell us if the mean is BIGGER than the median? - answer The data distribution is skewed to the right.
How many rows will cbind(x,y) produce? - answer 3
For Lab 2, our most recent lab, we.... - answer used R to open and examine data, including some summary statistics and correlation.
You are provided four different datasets. Initial analysis on these datasets shows that they have identical mean, variance, and correlation values. What is the next step in the analysis? - answer Visualize the data to further explore the characteristics of each dataset
What would be your recommendation to enhance the graph to help you detect structure that you might otherwise miss? - answer Plot log (Maturity Balance)
So if the scatter plot just looks like a dark mess, what should I do? - answer Try a hexbin plot
How would you characterize the relationship between petal length and petal width? - answer Strong linear relationship between petal length and width
What do you expect the p value to be for t.test(AbsentOffer,StupidOffer)? - answer Larger, greater than.
If the Null hypothesis (H0) is true and you failed to accept the null hypothesis, you have - answer Type 1 error
If the NIU spam filter uses analytics on words in an email, and decides an email from the Dean is spam, and I therefore miss an important email from the Dean, what has happened? - answer Type 1 error
Suppose everyone who visits our retail website either gets one of two promotional offers, or no promotion at all. We want to see if making the promotional offers makes a difference. What statistical method would you recommend for this analysis? - answer ANOVA
Which offers will have "no difference"? Why? - answer Medium and Low; Because the p-value is 0.
If I run Tukey's test on the ANOVA on "AbsentOffer", "LowOffer", "MediumOffer", "TremendousOffer", what results do I expect? - answer All p values less than .05 except
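The ANOVA answer above rests on comparing variation between groups with variation within groups. As a rough sketch in Python (the labs themselves use R), the F statistic for a one-way ANOVA can be computed by hand. The three samples and the `one_way_anova_f` helper are hypothetical, and the sketch stops at the F statistic rather than computing a p-value.

```python
from statistics import mean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of samples."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations
    # Between-group sum of squares: distance of each group mean from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of observations around their own group mean.
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical purchase amounts for visitors with no offer and two different offers.
no_offer = [10, 12, 11, 13]
offer_a = [14, 15, 16, 15]
offer_b = [20, 19, 21, 20]
f_stat = one_way_anova_f([no_offer, offer_a, offer_b])
```

A large F statistic says the group means differ by more than the within-group noise would suggest. A follow-up such as Tukey's test is then what identifies which pairs of offers differ.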

In Step 6 of Lab 5, why is the Support for the rule .57? - answer Because "R-basics" and "Stat-intro" appear together in 4 out of 7 transactions.
What is the Lift of "PSQL-basics ==> R-basics"? - answer 0.
Consider a database with the following four transactions: Transaction 1: {cheese, bread, milk}, Transaction 2: {soda, bread, milk}, Transaction 3: {cheese, bread}, Transaction 4: {cheese, soda, juice}. Assuming a minimum support of 50%, which itemsets would be generated in the second step of the Apriori algorithm? - answer {cheese, bread}, {bread, milk}
A data scientist wants to add a new categorical variable, X2, into a Linear Regression model Y=b0+b1*X1. How many terms should be added to the right-hand side of the equation if X2 has four possible values? - answer 3
Does this p-value stuff work just like in Module 3? - answer You bet!
Which word or phrase completes the statement; "Excessive emphasis color is to Bar chart as __________________."? - answer Multicollinearity is to OLS
Based on this regression model, what would you recommend to someone who wanted a good grade in OMIS 665? - answer Study many hours per week
How is Logistic regression used as a classifier? - answer Assigning class labels based on a pre-established threshold for the class probability returned
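The four-transaction Apriori question above can be checked with a short Python sketch (the lab itself uses R). It counts support for single items, keeps those at or above the 50% minimum, and then tests only pairs built from the surviving items. The helper name `support` and the variable names are illustrative.

```python
from itertools import combinations

# The four transactions from the question above.
transactions = [
    {"cheese", "bread", "milk"},
    {"soda", "bread", "milk"},
    {"cheese", "bread"},
    {"cheese", "soda", "juice"},
]
min_support = 0.5

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if set(itemset) <= t)
    return hits / len(transactions)

# Step 1: frequent single items ("juice" appears once and is dropped).
items = sorted(set().union(*transactions))
frequent_1 = [i for i in items if support({i}, transactions) >= min_support]

# Step 2: candidate pairs are built only from frequent single items,
# because any subset of a frequent itemset must itself be frequent.
candidates_2 = [frozenset(pair) for pair in combinations(frequent_1, 2)]
frequent_2 = [c for c in candidates_2 if support(c, transactions) >= min_support]
```

Under these inputs `frequent_2` holds the two pairs named in the answer, {cheese, bread} and {bread, milk}, each appearing in 2 of the 4 transactions.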

In the following logistic regression model what does the coefficient 'b' in bcreditscore mean?: default = f(creditscore,income,loanAmount,Existingdebt), Given that bcreditscore=-0.69, exp(bcreditScore) = ½ and log(bcreditScore) = -0.161? - answer For the same income, loan, and existing debt, the odds-ratio of default is halved for every point increase in credit score
You are using a Logistic Regression model to determine if an applicant's gender is a factor in determining whether or not they receive a bank loan. When you plot the results, you notice that the regression coefficient for gender is zero. What can be determined? - answer Applicant's gender does not influence the loan decision
In a Logistic Regression, the coefficient for "age" equals -3. What is the correct interpretation of the Logistic Regression coefficient, holding all other variables constant? - answer When age increases by 1 unit, the odds of response are multiplied by e^(-3) or 0.
How is False Positive Rate defined? - answer The fraction of negative instances that were misclassified
You fit a Logistic Regression model to your training data and notice that the variable X has an infinite magnitude coefficient. What does this indicate? - answer X is strongly correlated with the outcome for a subset of the data
What is a distinct property of Logistic Regression compared with Linear Regression? - answer Logistic Regression returns probability estimates of an event
Which is not considered a defining characteristic of Big Data? - answer Data Quality
What is a motivation for using a data analytics lifecycle? - answer Creates a repeatable process
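The coefficient interpretations above follow from the fact that a one-unit increase in a predictor multiplies the odds by exp(coefficient). A small Python sketch (not the course's R code) illustrates this. The function names, the intercept of 0.0, and the input values 10 and 11 are hypothetical.

```python
import math

def odds_multiplier(coefficient):
    """Factor by which the odds change for a one-unit increase in the predictor."""
    return math.exp(coefficient)

def predict_probability(intercept, coefficient, x):
    """Logistic model with a single predictor: P(event) = 1 / (1 + exp(-z))."""
    z = intercept + coefficient * x
    return 1 / (1 + math.exp(-z))

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

# Coefficients quoted in the questions above.
credit_effect = odds_multiplier(-0.69)   # roughly one half
age_effect = odds_multiplier(-3)         # a much stronger reduction in odds
no_effect = odds_multiplier(0)           # a zero coefficient leaves the odds unchanged

# Check on a hypothetical model: the odds ratio between x + 1 and x equals exp(b).
p_low = predict_probability(0.0, -0.69, 10)
p_high = predict_probability(0.0, -0.69, 11)
ratio = odds(p_high) / odds(p_low)
```

The last lines show why a coefficient of zero means the variable has no influence in the fitted model: the multiplier is exactly 1, so the odds do not move.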

Which of the following is not one of the project sponsor's needs that must be addressed in the final presentation? - answer How other related reports and systems will change
To succeed in MSDA 665, you must obtain which of the following from the Get Stuff link in Module 0 within Blackboard? - answer The Dell/EMC Participant Guide and Lab Guide.
Which of the following is NOT a challenge presented for Data Scientists given most current analytic architectures? - answer Analytic project data is centrally-managed
Which phase of the Analytic Lifecycle would Hadoop most likely be used in? - answer Phase 2: Data Preparation
While completing Participation Quizzes for MSDA 665, you are allowed to use: a) The parentheticals to guide you to the approximate slide within the Participant Guide that the question content is coming from. b) The videos. c) Your notes and/or Participant Guide. d) All of these. - answer All of these.
You have just completed the Discovery phase of a project and finished interviewing the main stakeholders. You have identified the necessary data feeds and are now beginning to set up the analytic sandbox. What is the next step? - answer Perform ETLT
You have been assigned to perform a study of the daily revenue effect of a pricing model of online transactions. All the data currently available has been loaded into an analytics database. This data includes revenue data, pricing data, and online transaction data. You have completed a thorough univariate analysis of all data and have decided that there are three different models you want to test. Preliminary results show that all models have equally effective results. What is the next step? - answer Prioritize models by complexity and feasibility, and proceed with the most feasible
What is a characteristic of an analytic sandbox? - answer Leverages in-database processing to enable high-performance analytics
What is an appropriate assignment for a data scientist? - answer Develop predictive models
Which activity is performed in the Operationalize phase of the data analytics lifecycle? - answer Assess the benefits
In addition to the business question and descriptions of available data sets, what else would an analytic plan include? - answer Initial hypotheses
In which lifecycle stage are test and training data sets created? - answer Model building
Big Data can be - answer Structured, Semi-Structured, Unstructured
Which phase of the Analytic Lifecycle would R or RStudio most likely be used in? - answer Phase 4: Model Execution / Building
The summary() function provides ___________ statistics. - answer Mean, Median, 1st quartile, 3rd quartile, Min and Max values
Attributes can be categorized into 4 types and they are called NOIR in short. What are they? - answer Nominal, Ordinal, Interval and Ratio
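The six statistics listed in the summary() answer above can be reproduced outside R. The following is a minimal Python sketch, not R's own implementation. The `summary` helper and the sample `x` are hypothetical, and the "inclusive" quantile method is assumed here to be the closest match to R's default, which is worth verifying before relying on it.

```python
from statistics import mean, median, quantiles

def summary(values):
    """Return min, 1st quartile, median, mean, 3rd quartile and max as a dict."""
    # quantiles(..., n=4) returns the three cut points between the four quarters.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "q1": q1,
        "median": median(values),
        "mean": mean(values),
        "q3": q3,
        "max": max(values),
    }

x = [1, 2, 3, 4, 5]   # hypothetical sample
stats = summary(x)
```

Different quantile conventions can give different quartiles on small samples, so a quartile computed by hand may not match a tool's output exactly.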

What are the levels for object fdata? - answer 3
Review the following R code:
v1 <- 1:5
v2 <- 5:1
v2
[1] 5 4 3 2 1
cbind(v1,v2)
What is the expected output generated by the cbind function? - answer
     v1 v2
[1,]  1  5
[2,]  2  4
[3,]  3  3
[4,]  4  2
[5,]  5  1
Which is NOT an R data structure? - answer Ratio
What is the utility of a rug plot? - answer They are used to emphasize distribution
Consider the following example: Suppose everyone who visits our retail website either gets one of two promotional offers, or no promotion at all. We want to see if making the promotional offers makes a difference. What statistical method would you recommend for this analysis? - answer ANOVA
What is the assumption in a t-test? - answer The populations are normally distributed

You are analyzing two normally distributed populations and your null hypothesis is that the mean m1 of the first population is equal to the mean m2 of the second. You observed a "p" value of 4.33e-05. What will be your decision with the null hypothesis? - answer p-value is small and reject the null hypothesis and accept the alternate hypothesis
In Lab 2, to see the first ten lines of the file "lab1_01.txt", you used the R command: - answer head
What does it mean if the mean is less than the median? - answer The distribution is skewed to the left (that is, bunched up toward the right and with a "tail" stretching on the left).
Consider a scale that has five values that range from "not important" to "very important". Which data classification best describes this data? - answer Ordinal
If R factors are categorical variables, which data classification level are they most closely related? - answer Nominal
Your risk analysis team has access to new customer financial data. You want to use this data to improve your prediction of credit default. Previously, the team was using only credit bureau scores, loan size, and customer income to assess risk of default. What is the null hypothesis that should be used to evaluate the model? - answer Model using the new financial data predicts the outcome just as well as the previous model
In Lab 3, when using the Tukey test to determine if offer1 was different from offer2, what was our conclusion? - answer There was no statistical difference between the offers due to the high p-value.
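The mean-versus-median rule in the skew answer above can be illustrated with a small Python sketch (the labs use R). The `skew_direction` helper and both samples are hypothetical, and comparing the mean with the median is only a rule of thumb, not a formal skewness measure.

```python
from statistics import mean, median

def skew_direction(values):
    """Rule-of-thumb skew direction from comparing the mean with the median."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"   # a long tail of large values pulls the mean up
    if m < med:
        return "left"    # a long tail of small values pulls the mean down
    return "none"

# Hypothetical samples: one with a low outlier, one with a high outlier.
left_tailed = [1, 40, 42, 44, 45, 46, 48]
right_tailed = [20, 22, 25, 27, 30, 200]

left_result = skew_direction(left_tailed)
right_result = skew_direction(right_tailed)
```

In the first sample the single low value drags the mean below the median, and in the second the single high value drags it above. That is the pattern the exam answers describe.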

A time series of monthly sales indicates a periodic pattern, and the value reaches the peak every December. Which statistic needs to be subtracted from the actual value to adjust for this seasonal effect? - answer Average value for each month over a few years
Which word or phrase completes the statement; "Discovering relationships is to Association Rules as generating forecasts is to __________."? - answer Time Series Analysis
Lab 4 was our K-Means Clustering lab. In that lab, the type of data you worked with was... - answer Income
You want to group items in your dataset by similarity and assign labels (unknown) to each group. What is the preferred analytic method to use for this task? - answer K-means clustering
Refer to the Graphics that plots the Value of Within Sum of Squares for the number of clusters (k) chosen for a particular data set in a k-means clustering solution: (big L Shape) - answer 4
In Lab 5, using the Grocery data set, what order of magnitude of total Grocery Purchase Rules were there (in both Steps 8 and 9)? - answer More than two digits (over 100)
How is Apriori property defined? - answer Any subset of a frequent itemset is also frequent
Consider the following transaction list: T1 : {item A, item B, item C} T2 : {item A, item B} T3 : {item B, Item C} T4 : {item A, item C} T5 : {item A, Item C} Which of the following itemsets have a support of 50%? - answer ({item A}, {item B}, {item C}, {item A, item C})
What is the measure used with frequent itemsets to find rules X->Y that are interesting and useful in a Market Basket Analysis? - answer Confidence - a measure of the % of transactions that contain X, which also contain Y
Consider the following itemsets: (hat, scarf, coat) (hat, scarf, coat, gloves) (hat, scarf, gloves) (hat, gloves) (scarf, coat, gloves) If the minimum support is 50%, what represents the complete Apriori list of frequent 2-itemsets? - answer (hat, scarf), (hat, gloves), (scarf, gloves), (scarf, coat)
You have run a Linear Regression model on the data shown in the graphic in Fig. REG. Which value is a reasonable guess for R-squared? - answer.
You have created a Linear Regression model to predict total sales based on variables M, N, P and Q as shown in Fig. REG2. You originally expected all variables to have positive

ratio of default is halved for every point increase in credit score
You are tasked with predicting if a customer will purchase a product when he visits the website and the probability of a purchase decision. You are provided with other relevant drivers associated with the problem. Which analytical method would you recommend? - answer Logistic Regression
Consider the following confusion Matrix: Bad Good Bad 262 38 Good 29 671 What is the False Negative Rate? - answer 0.
An ROC curve is ________ and a good value for AUC, the area under the curve is _________. - answer the plot of TPR and FPR; >= 0.
________________ is the minimum number of steps required to reach the node from root in a Decision Tree. - answer Depth of node
Decision stump is a kind of decision tree, where the root is immediately connected to the ______________. - answer Leaf node
Which two analytical methods can be used for dealing with categorical variables with a large number of levels? - answer Naïve Bayesian and Decision Tree
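The rate definitions behind the confusion-matrix and ROC questions above can be written out as a short Python sketch. The question does not say which axis of the matrix is actual and which is predicted, nor which class counts as positive. The call below therefore assumes rows are actual classes, columns are predicted classes, and "Bad" is the positive class, and a different layout would give different rates. The `rates` helper is illustrative.

```python
def rates(tp, fn, fp, tn):
    """Standard rates derived from the four cells of a binary confusion matrix."""
    return {
        "tpr": tp / (tp + fn),   # true positive rate: positives correctly caught
        "fnr": fn / (tp + fn),   # false negative rate: positives that were missed
        "fpr": fp / (fp + tn),   # false positive rate: negatives misclassified
        "tnr": tn / (fp + tn),   # true negative rate: negatives correctly passed
    }

# ASSUMPTION: rows = actual, columns = predicted, "Bad" = positive class.
# Under that reading: TP = 262, FN = 38, FP = 29, TN = 671.
result = rates(tp=262, fn=38, fp=29, tn=671)
```

Whatever the layout, the true positive rate and the false negative rate always sum to 1, as do the false positive rate and the true negative rate. An ROC curve plots the true positive rate against the false positive rate as the classification threshold varies.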

The graphic shows the values for the input Boolean attributes "A", "B", and "C". In addition, the graphic shows the values for the output attribute "class". Which Decision Tree is valid for the data? - answer Tree B
In an entropy function for a binary classification: H = -Σ p(c) log2 p(c). For what value of probability is the value of H maximum? - answer 0.
What is one of the cautions to choose Time Series Analysis for prediction? - answer Suitable for short term prediction only
A time series of monthly sales indicates a periodic pattern, and the value reaches the peak every December. Which statistic needs to be subtracted from the actual value to adjust for this seasonal effect? - answer Average value for each month over a few years
Which is NOT a problem solving task in a Text Analysis problem? - answer Cluster Setup
In text analysis, what is a "corpus"? - answer A large collection of text documents
What is the major challenge in the representation of a Corpus in Text Analysis? - answer Corpus metrics are dynamic
While we can use Regular Expression to search for a particular text or combination of words, what happens when "Stock Market$" is used? - answer Searches for text that ends

If you are looking for a "story" in a large set of data, and you plug the full data set into various R models and do not get understandable and/or actionable results, what might you consider doing? - answer Segmenting the data into smaller sets of variables, and using ensemble techniques.
How is HDFS defined? - answer Reliable, redundant distributed file system
Which are the functions of the Mapping Phase and Reduce Phase in MapReduce? - answer In the mapping phase, raw data is transformed into a set of <key, value> pairs, whereas the reduce phase merges the values from the previous phase
Which "daemon" (think "foreman") is NOT required to run the Hadoop Cluster? a. NameNode b. DataNode c. JobTracker d. TaskTracker e. All of these daemons are required - answer All of these daemons are required
Which is NOT one of the allowed operational modes in Hadoop? - answer BASH mode
Match the Hadoop eco system product with the functionality: Hbase, Pig, Hive. 1 - Data flow language and execution environment, 2 - Query language based on SQL for building MapReduce jobs, 3 - Column oriented database built on HDFS supporting MapReduce and point queries. - answer Hbase - 3, Pig - 1, Hive-
Which of the following statements is false about Pig? - answer Supports random reads and queries
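The map and reduce phases described above can be imitated on a single machine. This is a toy word-count sketch in plain Python, not Hadoop code: the `mapper`, `shuffle`, and `reducer` names and the two input lines are hypothetical, and a real cluster would run the mappers in parallel across data nodes.

```python
from collections import defaultdict

def mapper(line):
    """Map phase: turn one line of raw text into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reducer(key, values):
    """Reduce phase: merge the values collected for one key."""
    return key, sum(values)

# Hypothetical input split into lines, each of which could go to a different mapper.
lines = ["big data big storage", "data nodes store data"]
mapped = [pair for line in lines for pair in mapper(line)]
counts = dict(reducer(key, values) for key, values in shuffle(mapped).items())
```

Each mapper only needs its own line, which is why the map phase parallelizes well. The reducer sees every value for a key at once, which is where the merging happens.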

Which two tools support query languages that can be used with Hadoop? - answer Hive and Pig
What is the primary function of the NameNode in Hadoop? - answer Acts as a regulator/resolver among clients and DataNodes
I was given a paragraph and asked to generate Key, Value pairs. Who am I? - answer Mappers
In HDFS the Replication factor by default is 3. The replication factor helps to ________________. - answer Maintain 3 copies of same data on different data nodes
For our labs this semester (Fall 2020), what software did you mostly use? - answer RStudio
While generating (key, Value) pairs on distributed data in MapReduce _______________ happens. - answer Mappers run parallel on distributed data
________________ has all the information regarding which location has particular data. - answer Name node
From a high level, what is Hadoop? - answer Big Data Storage
With reference to Hive, which of the following is/are NOT allowed?: - answer Data Updates and transactions