

An assignment for the university-level Data Mining course (CSC 573), focusing on attribute relevance analysis using the WEKA machine learning tool. Students use three data sets (“contact-lenses.arff”, “iris.arff”, and “soybean.arff”) and perform attribute ranking with the InfoGainAttributeEval and GainRatioAttributeEval methods. The document also covers the installation of WEKA and the discretization of non-class attributes. The assignment includes evaluating the results and submitting the output and observations files.
CSC 573: Data Mining
WEKA Assignment #1: Attribute Relevance Analysis in WEKA
Instructor: Ratko Orlandic
For this assignment, your task is to familiarize yourself with the WEKA machine learning tool and its attribute ranking facilities (the “Select attributes” feature in WEKA Explorer). You will use the “contact-lenses”, “iris”, and “soybean” data sets, all of which are available in the required .arff format in the WEKA package. The “contact-lenses” data set has 24 instances with 5 nominal attributes, the last of which (“contact-lenses”) is the class dimension. The “iris” set has 150 instances with 4 continuous attributes and a nominal class, which is the last (5th) dimension. The “soybean” set has 683 instances with 36 nominal attributes, the last of which is the class dimension. Unlike the other two sets, “soybean” has missing values.
The WEKA machine learning tool is installed on the computers in the computer lab UHB 2030. You can also download a free copy of the software to your own computer as follows:
WEKA comes with certain data files and some documentation. Once you install the software, you can find these on your computer in the directory C:\Program Files\Weka-3-4. Whether you work in the lab or on your computer, you should spend some time familiarizing yourself with WEKA. For this assignment, you will be working with WEKA Explorer.
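If you want to double-check the data set properties listed above (instance counts, attribute counts, missing values) programmatically, the following is a minimal sketch using the WEKA Java API. It assumes Weka 3.x is on the classpath; the class name InspectData and the file paths are placeholders you may need to adjust for your installation.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class InspectData {
        public static void main(String[] args) throws Exception {
            // Adjust the paths to wherever the WEKA sample .arff files live on your machine.
            for (String f : new String[]{"contact-lenses.arff", "iris.arff", "soybean.arff"}) {
                Instances data = DataSource.read(f);           // load the ARFF file
                data.setClassIndex(data.numAttributes() - 1);  // the class is the last attribute
                System.out.println(f + ": " + data.numInstances() + " instances, "
                        + data.numAttributes() + " attributes, "
                        + data.attributeStats(0).missingCount + " missing values in the first attribute");
            }
        }
    }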
For each step, open the indicated file in the “Preprocess” window. Then, go to the “Attribute Selection” window and set the “Attribute selection mode” to “Use full training set”. For each case A-E below, perform attribute ranking using the following attribute selection methods with default parameters: a) InfoGainAttributeEval and b) GainRatioAttributeEval. These attribute selection methods should consider only non-class dimensions (for each set, the class attribute is indicated above the “Start” button). Record the output of each run in a text file called “output.txt”: copy the output of the run from the “Attribute selection output” window in the Explorer and paste it at the end of the “output.txt” file.
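Steps A-E below are carried out in the Explorer GUI, but for reference, the same ranking procedure can be reproduced with the WEKA Java API. The sketch below is only a rough equivalent, assuming Weka 3.x is on the classpath; the class name RankAttributes and the file name are placeholders.

    import weka.attributeSelection.ASEvaluation;
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.GainRatioAttributeEval;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RankAttributes {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");   // substitute the data set required by each step
            data.setClassIndex(data.numAttributes() - 1);    // the class is the last attribute

            for (ASEvaluation eval : new ASEvaluation[]{
                    new InfoGainAttributeEval(), new GainRatioAttributeEval()}) {
                AttributeSelection selector = new AttributeSelection();
                selector.setEvaluator(eval);        // ranking criterion (default parameters)
                selector.setSearch(new Ranker());   // rank all non-class attributes
                selector.SelectAttributes(data);    // evaluated on the full training set
                System.out.println(selector.toResultsString());
            }
        }
    }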
A. Perform attribute ranking on the “contact-lenses.arff” data set using the two attribute ranking methods with default parameters.
B. Load the “iris.arff” data set. Perform attribute ranking on the “iris.arff” data set using the two attribute ranking methods with default parameters.
C. Go back to “Preprocess” and load the “iris.arff” data set. Perform discretization of all non-class attributes into 10 equal-width bins as follows: under “Filter” in the “Preprocess” window of the Explorer, select ‘filters’ -> ‘unsupervised’ -> ‘attribute’ -> ‘Discretize’ (use the default parameters of the ‘Discretize’ filter) and hit “Apply”. Verify that all attributes are nominal by clicking on individual attributes in the “Attributes” window in “Preprocess”. Then perform attribute ranking on the discretized set using the two attribute-ranking methods with default parameters. (A programmatic sketch of the filter settings used in steps C and D appears after step E.)
D. Go back to “Preprocess” and load the original “iris.arff” data set again. Perform discretization of all non-class attributes into 5 close-to-equal-height bins by selecting the ‘Discretize’ filter. Then set the appropriate parameters by clicking on the ‘Discretize’ filter in the “Filter” window and setting ‘bins’ to 5 and ‘useEqualFrequency’ to true. After you verify that all attributes are nominal, perform attribute ranking on the new set using the two attribute-ranking methods with default parameters.
E. Load the “soybean.arff” data set. Then perform attribute ranking on the “soybean.arff” data set using the two attribute ranking methods with default parameters.
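As referenced in steps C and D, the corresponding ‘Discretize’ filter settings can also be applied through the WEKA Java API. The following is only a sketch of the equal-width and equal-frequency configurations (the class name DiscretizeIris is a placeholder); the Explorer route described above is all the assignment requires.

    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Discretize;

    public class DiscretizeIris {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("iris.arff");

            // Step C: default Discretize settings -- 10 equal-width bins.
            // The nominal class attribute is not affected, since only numeric attributes are discretized.
            Discretize equalWidth = new Discretize();
            equalWidth.setInputFormat(data);
            Instances tenEqualWidthBins = Filter.useFilter(data, equalWidth);

            // Step D: 5 close-to-equal-height (equal-frequency) bins.
            Discretize equalFreq = new Discretize();
            equalFreq.setBins(5);
            equalFreq.setUseEqualFrequency(true);
            equalFreq.setInputFormat(data);
            Instances fiveEqualFrequencyBins = Filter.useFilter(data, equalFreq);

            // Either filtered set can then be ranked exactly as in the earlier sketch.
            System.out.println(tenEqualWidthBins.toSummaryString());
            System.out.println(fiveEqualFrequencyBins.toSummaryString());
        }
    }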
Once you have performed the experiments, you should spend some time evaluating your results. In particular, try to answer at least the following questions: Why would one need attribute relevance ranking? Do these attribute-ranking methods often agree or disagree? On which data set(s), if any, do these methods disagree? Do discretization and the choice of discretization method affect the results of attribute ranking? Do missing values affect the results of attribute ranking? Record these and any other observations in a Word file called “Observations.doc”.
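When thinking about why the two rankers might disagree, it may help to recall the standard measures they are based on (C is the class attribute, A a candidate attribute, and H(·) denotes entropy):

    InfoGain(C, A) = H(C) - H(C | A)
    GainRatio(C, A) = InfoGain(C, A) / H(A)

The H(A) denominator in the gain ratio penalizes attributes that take many distinct values, which is one reason the two rankings can differ, especially after discretization changes how many values each attribute has.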
On or before the due date, please submit the “output.txt” file with the results of your runs and the “Observations.doc” file as a single zipped file through the Blackboard system. Please adhere to the following submission procedure:
Grading will be done based on the correctness of the results in your output file as well as the extensiveness, clarity, and correctness of your observations.
Good luck!