|
Instructor: Dr. Aboul Ella Hassanien
Email: Abo@cba.edu.kw
Office Hours:
Course web page: http://www.cba.edu.kw/abo/Datamining
Required Textbook:
Jiawei Han and Micheline Kamber, Data Mining: Concepts and
Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management
Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers,
Feb. 2006. ISBN 1-55860-901-6.
Web site: http://www-faculty.cs.uiuc.edu/~hanj/bk2/index.html
Course Overview & Objectives:
Data Mining studies algorithms and computational paradigms that
allow computers to find patterns and regularities in databases,
perform prediction and forecasting, and generally improve
their performance through interaction with data. It is
currently regarded as the key element of a more general
process called Knowledge Discovery, which deals with
extracting useful knowledge from raw data. The knowledge
discovery process includes data selection, cleaning, and coding;
the application of statistical, pattern recognition, and machine
learning techniques; and the reporting and visualization of the
generated structures. The course covers all of these issues
and illustrates the whole process with examples of practical
applications. Students will use recent data mining software.
Course Objectives
- To introduce students to the basic concepts and techniques
  of Data Mining.
- To develop skills in using recent data mining software for
  solving practical problems.
- To gain experience in doing independent study and research.
Required Software
Weka is a collection of software for machine learning and data
mining developed at the University of Waikato. Weka is open
source software issued under the GNU General Public License.
Download the software from:
http://www.cs.waikato.ac.nz/ml/weka/
Collections of datasets
- A jarfile containing 37 classification problems, originally
  obtained from the UCI repository
  (datasets-UCI.jar, 1,190,961 Bytes).
- A jarfile containing 37 regression problems, obtained from
  various sources (datasets-numeric.jar, 169,344 Bytes).
- A jarfile containing 6 agricultural datasets obtained from
  agricultural researchers in New Zealand
  (agridatasets.jar, 31,200 Bytes).
- A jarfile containing 30 regression datasets collected by
  Luis Torgo (regression-datasets.jar, 10,090,266 Bytes).
- A gzipped tar archive containing UCI and UCI KDD datasets
  (uci-20050214.tar.gz, 15,308,385 Bytes).
- A gzipped tar archive containing StatLib datasets
  (statlib-20050214.tar.gz, 12,785,582 Bytes).
- A gzipped tar archive containing ordinal, real-world datasets
  donated by Dr. Arie Ben David (Holon Inst. of Technology/Israel)
  (datasets-arie_ben_david.tar.gz, 11,348 Bytes).
- A zip file containing 19 multi-class (1-of-n) text datasets
  donated by George Forman/Hewlett-Packard Labs
  (19MclassTextWc.zip, 14,084,828 Bytes).
Assessment
Course assessment will be based on a combination of the
following:
| Assignment | Final Mark | Due Date |
| Term Test 1 | 25 | TBA |
| Final exam | 40 | TBA |
| Quiz, assignments, reports, project, case study, etc. | 35 | |
| TOTAL | 100 | |
Tentative Schedule
| Wk | Topics | Pages |
| | Chapter 1: Introduction [ppt file] | |
| 1,2 | 1.1 What Motivated Data Mining? Why Is It Important? | |
| | 1.2 What Is Data Mining? | |
| | 1.3 Data Mining: On What Kind of Data? | |
| | 1.4 Data Mining Functionalities: What Kinds of Patterns Can Be Mined? | |
| | 1.4.1 Concept/Class Description: Characterization and Discrimination | |
| | 1.4.2 Mining Frequent Patterns, Associations, and Correlations | |
| | 1.4.3 Classification and Prediction | |
| | 1.4.4 Cluster Analysis | |
| | 1.4.5 Outlier Analysis | |
| | 1.6 Classification of Data Mining Systems | |
| | 1.7 Data Mining Task Primitives | |
| | 1.8 Integration of a Data Mining System with a Database or Warehouse System | |
| | 1.9 Major Issues in Data Mining | |
| | Textbook pages | 1, 5, 9, 21, 23, 24, 25, 26, 29, 31, 34, 36 |
| | Chapter 2: Data Preprocessing | |
| 3,4 | 2.1 Why Preprocess the Data? | 48 |
| | 2.2 Descriptive Data Summarization | 51 |
| | 2.2.1 Measuring the Central Tendency | 51 |
| | 2.2.2 Measuring the Dispersion of Data | 53 |
| | 2.2.3 Graphic Displays of Basic Descriptive Data Summaries | 56 |
| | 2.3 Data Cleaning | 61 |
| | 2.3.1 Missing Values | 61 |
| | 2.3.2 Noisy Data | 62 |
| | 2.3.3 Data Cleaning as a Process | 65 |
| | 2.4 Data Integration and Transformation | 67 |
| | 2.4.1 Data Integration | 67 |
| | 2.4.2 Data Transformation | 70 |
| | 2.5 Data Reduction | 72 |
| | 2.5.1 Data Cube Aggregation | 73 |
| | 2.5.2 Attribute Subset Selection | 75 |
| | 2.5.3 Dimensionality Reduction | 77 |
| | Introduction to WEKA Software (lab work) | |
| | Chapter 5: Mining Frequent Patterns, Associations, and Correlations | |
| 5,6 | 5.1 Basic Concepts and a Road Map | 227 |
| | 5.1.1 Market Basket Analysis: A Motivating Example | 228 |
| | 5.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules | 230 |
| | 5.1.3 Frequent Pattern Mining: A Road Map | 232 |
| | 5.2 Efficient and Scalable Frequent Itemset Mining Methods | 234 |
| | 5.2.1 The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation | 234 |
| | 5.2.2 Generating Association Rules from Frequent Itemsets | 239 |
| | 5.2.3 Improving the Efficiency of Apriori | 240 |
| | 5.2.4 Mining Frequent Itemsets without Candidate Generation | 242 |
| | 5.2.5 Mining Frequent Itemsets Using Vertical Data Format | 245 |
| | 5.2.6 Mining Closed Frequent Itemsets | 248 |
| | 5.3 Mining Various Kinds of Association Rules | 250 |
| | 5.3.1 Mining Multilevel Association Rules | 250 |
| | 5.3.2 Mining Multidimensional Association Rules from Relational Databases and Data Warehouses | 254 |
| | 5.4 From Association Mining to Correlation Analysis | 259 |
| | 5.4.1 Strong Rules Are Not Necessarily Interesting: An Example | 260 |
| | 5.4.2 From Association Analysis to Correlation Analysis | 261 |
| | 5.5 Constraint-Based Association Mining | 265 |
| | 5.5.1 Metarule-Guided Mining of Association Rules | 266 |
| | 5.5.2 Constraint Pushing: Mining Guided by Rule Constraints | 267 |
| | Midterm exam: covers chapters 1, 2, and 5 | |
| | Chapter 6: Classification and Prediction | |
| 7,8,9 | 6.1 What Is Classification? What Is Prediction? | 285 |
| | 6.2 Issues Regarding Classification and Prediction | 289 |
| | 6.2.1 Preparing the Data for Classification and Prediction | 289 |
| | 6.2.2 Comparing Classification and Prediction Methods | 290 |
| | 6.3 Classification by Decision Tree Induction | 291 |
| | 6.3.1 Decision Tree Induction | 292 |
| | 6.3.2 Attribute Selection Measures | 296 |
| | 6.3.3 Tree Pruning | 304 |
| | 6.3.4 Scalability and Decision Tree Induction | 306 |
| | 6.4 Bayesian Classification | 310 |
| | 6.4.1 Bayes' Theorem | 310 |
| | 6.4.2 Naïve Bayesian Classification | 311 |
| | 6.4.3 Bayesian Belief Networks | 315 |
| | 6.4.4 Training Bayesian Belief Networks | 317 |
| | 6.5 Rule-Based Classification | 318 |
| | 6.5.1 Using IF-THEN Rules for Classification | 319 |
| | 6.5.2 Rule Extraction from a Decision Tree | 321 |
| | 6.5.3 Rule Induction Using a Sequential Covering Algorithm | 322 |
| | 6.11 Prediction | 354 |
| | 6.11.1 Linear Regression | 355 |
| | 6.11.2 Nonlinear Regression | 357 |
| | 6.11.3 Other Regression-Based Methods | 358 |
| | Chapter 10: Mining Object, Spatial, Multimedia, Text, and Web Data | |
| 10,11,12 | 10.3 Multimedia Data Mining | |
| | 10.3.1 Similarity Search in Multimedia Data | |
| | 10.3.2 Multidimensional Analysis of Multimedia Data | |
| | 10.3.3 Classification and Prediction Analysis of Multimedia Data | |
| | 10.3.4 Mining Associations in Multimedia Data | |
| | 10.3.5 Audio and Video Data Mining | |
| | 10.4 Text Mining | |
| | 10.4.1 Text Data Analysis and Information Retrieval | |
| | 10.4.2 Dimensionality Reduction for Text | |
| | 10.4.3 Text Mining Approaches | |
| | 10.5 Mining the World Wide Web | |
| | 10.5.1 Mining the Web Page Layout Structure | |
| | 10.5.2 Mining the Web's Link Structures to Identify Authoritative Web Pages | |
| | 10.5.3 Mining Multimedia Data on the Web | |
| | 10.5.4 Automatic Classification of Web Documents | |
| | 10.5.5 Web Usage Mining | |
| | Textbook pages | 607, 608, 609, 611, 612, 613, 614, 615, 621, 624, 628, 630, 631, 637, 638 |
| | Final exam: covers chapters 1, 6, and 10 | |
Course syllabus
---------------------------------------------------------
Project (10 Points)
Association Rules
Your task for this project is to identify and perform an
association rule mining task. This involves:
- Selecting an appropriate data set (I prefer that you use
  bank data)
- Preparing and preprocessing the data
- Finding rules, including choosing appropriate parameter settings
- Determining which of the resulting rules are interesting
- Figuring out how the interesting rules could be useful
[While you are on your own to select an appropriate data set,
I will point you to one easy source: the UCI Machine Learning
Repository.
http://www.ics.uci.edu/~mlearn/MLRepository.html
This contains many data sets, not all of which are
appropriate for association rules, so you'll need to do some
thinking. You are also welcome to identify data from other
sources, especially those that you find personally of
interest.]
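The "finding rules" and "parameter setting" steps above can be sketched in a few lines of Python. This is only an illustration of how minimum support and minimum confidence thresholds filter candidate rules; it is not the WEKA implementation. The transactions, the function names, and the threshold values are all hypothetical and chosen for the example.

```python
from itertools import combinations

# Hypothetical toy transactions; a real project would load a prepared data set.
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search: only frequent k-itemsets are extended to size k+1."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    while level:
        kept = {}
        for candidate in level:
            sup = support(candidate, transactions)
            if sup >= min_support:
                kept[candidate] = sup
        frequent.update(kept)
        # Join pairs of frequent k-itemsets that differ in exactly one item.
        level = list({a | b for a, b in combinations(kept, 2)
                      if len(a | b) == len(a) + 1})
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for rules above the threshold."""
    found = []
    for itemset, sup in frequent.items():
        for size in range(1, len(itemset)):
            for lhs_items in combinations(sorted(itemset), size):
                lhs = frozenset(lhs_items)
                # Every subset of a frequent itemset is itself frequent,
                # so its support is already in the dictionary.
                confidence = sup / frequent[lhs]
                if confidence >= min_confidence:
                    found.append((lhs, itemset - lhs, sup, confidence))
    return found


if __name__ == "__main__":
    freq = frequent_itemsets(TRANSACTIONS, min_support=0.6)
    for lhs, rhs, sup, conf in association_rules(freq, min_confidence=0.8):
        print(sorted(lhs), "=>", sorted(rhs), f"support={sup:.2f} confidence={conf:.2f}")
```

Lowering either threshold produces more rules, so part of the project is finding settings that leave a rule list small enough to inspect by hand.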
Project Report
The project report should contain the following:
- Objectives: What is the domain, and what are the potential
  benefits to be derived from association rule mining? This is
  high level: not "find patterns", but what would improve
  because of the use of the patterns.
- Data set description: What is in the data, and what
  preprocessing was done to make it amenable to association
  rule mining. Where choices were made (e.g., parameter
  settings for discretization, or decisions to ignore an
  attribute), describe your reasoning behind the choices.
- Rule mining process: Parameter settings, choice of algorithm
  (if you choose to implement something other than the
  WEKA-provided Apriori, you can earn extra credit, but I don't
  expect it), and the time required.
- Resulting rules: Summary (number of rules, general
  description), and a selection of those you would show to a
  client.
- Recommendations: What should the client do because of the
  rules discovered.
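The discretization choice mentioned under the data set description can be illustrated with a short sketch. Apriori-style miners generally expect nominal attributes, so numeric columns are usually binned first. The code below shows equal-width binning only; WEKA ships its own discretization filters, and the function name, the bin labels, and the sample ages here are assumptions made for the example.

```python
def equal_width_bins(values, n_bins):
    """Map each numeric value to a label "bin1" .. "binN" of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    labels = []
    for v in values:
        if width == 0:
            idx = 0  # all values identical: everything falls in one bin
        else:
            # Clamp so the maximum value lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
        labels.append(f"bin{idx + 1}")
    return labels


if __name__ == "__main__":
    ages = [18, 25, 33, 41, 52, 67]  # hypothetical numeric attribute
    print(equal_width_bins(ages, 3))
```

The number of bins is exactly the kind of parameter whose choice the report should justify: too few bins hide patterns, while too many leave each bin with too little support to produce rules.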
Also turn in (likely as a separate plain-text file) a
complete listing of the rules found, and instructions
(preferably machine-readable/executable) for recreating your
results. WEKA provides several ways to do this, from
command-line scripts to Explorer - your call.
Useful link:
http://maya.cs.depaul.edu/~classes/ect584/WEKA/associate.html
----------------------------------------------------------------------------------------
|