Kuwait University

College of Business Administration  

Quantitative Methods and Information Systems Department

QM#  492 Selected Topics:

 Data Mining: Concepts and Techniques, Fall 2007

 

 

Instructor:   

Dr. Aboul Ella Hassanien

Email: Abo@cba.edu.kw

Office Hours:       

Course web page: 

 http://www.cba.edu.kw/abo/Datamining                          

Required Textbook:

   Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, 2nd ed. The Morgan Kaufmann Series in Data Management Systems, Jim Gray, Series Editor. Morgan Kaufmann Publishers, Feb. 2006. ISBN 1-55860-901-6. Web site: http://www-faculty.cs.uiuc.edu/~hanj/bk2/index.html

Course Overview & Objectives:

 Data mining studies algorithms and computational paradigms that allow computers to find patterns and regularities in databases, perform prediction and forecasting, and generally improve their performance through interaction with data. It is currently regarded as the key element of a more general process called Knowledge Discovery, which deals with extracting useful knowledge from raw data. The knowledge discovery process includes data selection, cleaning, and coding; the application of statistical, pattern recognition, and machine learning techniques; and the reporting and visualization of the generated structures. The course will cover all of these issues and will illustrate the whole process with examples of practical applications. Students will use recent data mining software.

Course Objectives  

  • To introduce students to the basic concepts and techniques of data mining.
  • To develop skills in using recent data mining software to solve practical problems.
  • To gain experience in independent study and research.

Required Software

 Weka is a collection of machine learning and data mining software developed at the University of Waikato. It is open source software issued under the GNU General Public License. Download the software from: http://www.cs.waikato.ac.nz/ml/weka/

Collections of datasets

 Assessment

Course assessment will be based on the combination of the following: 

Assignment                                               Final Mark   Due Date
Term Test 1                                              25           TBA
Final exam                                               40           TBA
Quiz, assignments, reports, project, case study, etc.    35
TOTAL                                                    100

 Tentative Schedule

Chapter 1: Introduction (Weeks 1, 2)

1.1 What Motivated Data Mining? Why Is It Important? (p. 1)
1.2 What Is Data Mining? (p. 5)
1.3 Data Mining—On What Kind of Data? (p. 9)
1.4 Data Mining Functionalities—What Kinds of Patterns Can Be Mined? (p. 21)
1.4.1 Concept/Class Description: Characterization and Discrimination
1.4.2 Mining Frequent Patterns, Associations, and Correlations (p. 23)
1.4.3 Classification and Prediction (p. 24)
1.4.4 Cluster Analysis (p. 25)
1.4.5 Outlier Analysis (p. 26)
1.6 Classification of Data Mining Systems (p. 29)
1.7 Data Mining Task Primitives (p. 31)
1.8 Integration of a Data Mining System with a Database or Warehouse System (p. 34)
1.9 Major Issues in Data Mining (p. 36)

Chapter 2: Data Preprocessing (Weeks 3, 4)

2.1 Why Preprocess the Data? (p. 48)
2.2 Descriptive Data Summarization (p. 51)
2.2.1 Measuring the Central Tendency (p. 51)
2.2.2 Measuring the Dispersion of Data (p. 53)
2.2.3 Graphic Displays of Basic Descriptive Data Summaries (p. 56)
2.3 Data Cleaning (p. 61)
2.3.1 Missing Values (p. 61)
2.3.2 Noisy Data (p. 62)
2.3.3 Data Cleaning as a Process (p. 65)
2.4 Data Integration and Transformation (p. 67)
2.4.1 Data Integration (p. 67)
2.4.2 Data Transformation (p. 70)
2.5 Data Reduction (p. 72)
2.5.1 Data Cube Aggregation (p. 73)
2.5.2 Attribute Subset Selection (p. 75)
2.5.3 Dimensionality Reduction (p. 77)

Introduction to WEKA Software (lab work)

 

Chapter 5: Mining Frequent Patterns, Associations, and Correlations (Weeks 5, 6)

5.1 Basic Concepts and a Road Map (p. 227)
5.1.1 Market Basket Analysis: A Motivating Example (p. 228)
5.1.2 Frequent Itemsets, Closed Itemsets, and Association Rules (p. 230)
5.1.3 Frequent Pattern Mining: A Road Map (p. 232)
5.2 Efficient and Scalable Frequent Itemset Mining Methods (p. 234)
5.2.1 The Apriori Algorithm: Finding Frequent Itemsets Using Candidate Generation (p. 234)
5.2.2 Generating Association Rules from Frequent Itemsets (p. 239)
5.2.3 Improving the Efficiency of Apriori (p. 240)
5.2.4 Mining Frequent Itemsets without Candidate Generation (p. 242)
5.2.5 Mining Frequent Itemsets Using Vertical Data Format (p. 245)
5.2.6 Mining Closed Frequent Itemsets (p. 248)
5.3 Mining Various Kinds of Association Rules (p. 250)
5.3.1 Mining Multilevel Association Rules (p. 250)
5.3.2 Mining Multidimensional Association Rules from Relational Databases and Data Warehouses (p. 254)
5.4 From Association Mining to Correlation Analysis (p. 259)
5.4.1 Strong Rules Are Not Necessarily Interesting: An Example (p. 260)
5.4.2 From Association Analysis to Correlation Analysis (p. 261)
5.5 Constraint-Based Association Mining (p. 265)
5.5.1 Metarule-Guided Mining of Association Rules (p. 266)
5.5.2 Constraint Pushing: Mining Guided by Rule Constraints (p. 267)

The midterm exam will cover Chapters 1, 2, and 5.

 

Chapter 6: Classification and Prediction (Weeks 7, 8, 9)

6.1 What Is Classification? What Is Prediction? (p. 285)
6.2 Issues Regarding Classification and Prediction (p. 289)
6.2.1 Preparing the Data for Classification and Prediction (p. 289)
6.2.2 Comparing Classification and Prediction Methods (p. 290)
6.3 Classification by Decision Tree Induction (p. 291)
6.3.1 Decision Tree Induction (p. 292)
6.3.2 Attribute Selection Measures (p. 296)
6.3.3 Tree Pruning (p. 304)
6.3.4 Scalability and Decision Tree Induction (p. 306)
6.4 Bayesian Classification (p. 310)
6.4.1 Bayes’ Theorem (p. 310)
6.4.2 Naïve Bayesian Classification (p. 311)
6.4.3 Bayesian Belief Networks (p. 315)
6.4.4 Training Bayesian Belief Networks (p. 317)
6.5 Rule-Based Classification (p. 318)
6.5.1 Using IF-THEN Rules for Classification (p. 319)
6.5.2 Rule Extraction from a Decision Tree (p. 321)
6.5.3 Rule Induction Using a Sequential Covering Algorithm (p. 322)
6.11 Prediction (p. 354)
6.11.1 Linear Regression (p. 355)
6.11.2 Nonlinear Regression (p. 357)
6.11.3 Other Regression-Based Methods (p. 358)

 

Chapter 10: Mining Object, Spatial, Multimedia, Text, and Web Data (Weeks 10, 11, 12)

10.3 Multimedia Data Mining (p. 607)
10.3.1 Similarity Search in Multimedia Data (p. 608)
10.3.2 Multidimensional Analysis of Multimedia Data (p. 609)
10.3.3 Classification and Prediction Analysis of Multimedia Data (p. 611)
10.3.4 Mining Associations in Multimedia Data (p. 612)
10.3.5 Audio and Video Data Mining (p. 613)
10.4 Text Mining (p. 614)
10.4.1 Text Data Analysis and Information Retrieval (p. 615)
10.4.2 Dimensionality Reduction for Text (p. 621)
10.4.3 Text Mining Approaches (p. 624)
10.5 Mining the World Wide Web (p. 628)
10.5.1 Mining the Web Page Layout Structure (p. 630)
10.5.2 Mining the Web’s Link Structures to Identify Authoritative Web Pages (p. 631)
10.5.3 Mining Multimedia Data on the Web (p. 637)
10.5.4 Automatic Classification of Web Documents (p. 638)
10.5.5 Web Usage Mining

The final exam will cover Chapters 1, 6, and 10.

 

 Course syllabus

---------------------------------------------------------

Project   (10 Points)

 

  Association Rules

Your task for this project is to identify and perform an association rule mining task. This involves

  1. Selecting an appropriate data set (I prefer that you use the bank data)
  2. Preparing and preprocessing the data
  3. Finding rules, including appropriate parameter setting
  4. Determining which of the resulting rules are interesting
  5. Figuring out how the interesting rules could be useful

[While you are on your own to select an appropriate data set, I will point you to one easy source: the UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html. This contains many data sets, not all of which are appropriate for association rules, so you'll need to do some thinking. You are also welcome to identify data from other sources, especially those that you find personally of interest.]
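To make steps 3 and 4 above more concrete, here is a small Python sketch that mines a handful of made-up market-basket transactions with a level-wise, Apriori-style search and reports support, confidence, and lift for each rule. It is a teaching sketch under stated assumptions, not WEKA's implementation: the transactions, function names, and thresholds are all invented for illustration, and the project itself is expected to be done in WEKA.

```python
from itertools import combinations

# Toy transactions (hypothetical data, for illustration only).
TRANSACTIONS = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise search for itemsets whose support is >= min_support."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items]
    frequent = {}
    k = 1
    while level:
        # Keep only the candidates at this level that are frequent.
        kept = {}
        for candidate in level:
            value = support(candidate, transactions)
            if value >= min_support:
                kept[candidate] = value
        frequent.update(kept)
        # Join step: merge frequent k-itemsets into (k+1)-item candidates.
        k += 1
        candidates = {a | b for a in kept for b in kept if len(a | b) == k}
        # Prune step: every k-item subset of a candidate must be frequent.
        level = [
            c for c in candidates
            if all(frozenset(sub) in kept for sub in combinations(c, k - 1))
        ]
    return frequent


def rules(frequent, min_confidence):
    """Yield (antecedent, consequent, support, confidence, lift) tuples."""
    for itemset, sup in frequent.items():
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                consequent = itemset - antecedent
                confidence = sup / frequent[antecedent]
                if confidence >= min_confidence:
                    lift = confidence / frequent[consequent]
                    yield antecedent, consequent, sup, confidence, lift


if __name__ == "__main__":
    found = frequent_itemsets(TRANSACTIONS, min_support=0.5)
    for a, c, sup, conf, lift in rules(found, min_confidence=0.7):
        print(sorted(a), "=>", sorted(c),
              f"support={sup:.2f} confidence={conf:.2f} lift={lift:.2f}")
```

A lift close to 1 suggests the antecedent and consequent are roughly independent, which is one simple way to screen out rules that are strong but not interesting.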

Project Report

The project report should contain the following:

  1. Objectives: What is the domain, and what are the potential benefits to be derived from association rule mining? Keep this high level: the objective is not "find patterns" but what would improve because of the use of the patterns.
  2. Data set description: What is in the data, and what preprocessing was done to make it amenable for association rule mining. Where choices were made (e.g., parameter settings for discretization, or decisions to ignore an attribute), describe your reasoning behind the choices.
  3. Rule mining process: Parameter settings, choice of algorithm (if you choose to implement something other than the WEKA-provided Apriori, you can earn extra credit, but I don't expect it), and the time required.
  4. Resulting rules: Summary (number of rules, general description), and a selection of those you would show to a client.
  5. Recommendations: What should the client do because of the rules discovered?
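Item 2 above mentions parameter settings for discretization. As far as I know, WEKA's Apriori works on nominal attributes, so numeric fields such as age or income have to be binned first. The sketch below shows equal-width binning in plain Python; the helper name, the label format, and the ages are assumptions made up for illustration, and in practice WEKA's own Discretize filter would normally do this job.

```python
def equal_width_bins(values, num_bins):
    """Replace each numeric value with a nominal label such as 'bin1of3'.

    The range [min, max] is split into `num_bins` intervals of equal width;
    the maximum value is placed in the last bin.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / num_bins
    labels = []
    for v in values:
        if width == 0:
            index = 0  # all values identical: everything falls in one bin
        else:
            index = min(int((v - lo) / width), num_bins - 1)
        labels.append(f"bin{index + 1}of{num_bins}")
    return labels


if __name__ == "__main__":
    ages = [20, 25, 30, 35, 40, 50, 65, 80]  # hypothetical ages
    for age, label in zip(ages, equal_width_bins(ages, 3)):
        print(age, label)
```

The number of bins is exactly the kind of choice the report should justify: too few bins hide structure, while too many leave each bin with too little support to produce rules.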

Also turn in (likely as a separate plain-text file) a complete listing of the rules found, and instructions (preferably machine-readable/executable) for recreating your results. WEKA provides several ways to do this, from command-line scripts to Explorer - your call.
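One possible way to make the instructions machine-readable is a short script that records the exact WEKA command used. The Python sketch below only builds and prints the command line; it does not run WEKA. The file names `weka.jar` and `bank-data.arff` are placeholders, and the option letters (`-t` input file, `-N` number of rules, `-C` minimum confidence, `-M` lower bound on minimum support) reflect my understanding of the Apriori options, so check them against the documentation for your WEKA version before relying on them.

```python
import shlex


def weka_apriori_command(arff_path, num_rules=10, min_confidence=0.9,
                         min_support=0.1, weka_jar="weka.jar"):
    """Build the argument list for a command-line WEKA Apriori run.

    Assumes weka.jar is in the current directory and that the option
    letters match the installed WEKA version.
    """
    return [
        "java", "-cp", weka_jar, "weka.associations.Apriori",
        "-t", arff_path,            # input data in ARFF format
        "-N", str(num_rules),       # number of rules to output
        "-C", str(min_confidence),  # minimum confidence of a rule
        "-M", str(min_support),     # lower bound for minimum support
    ]


if __name__ == "__main__":
    # Print a copy-and-paste command to include with the project report.
    print(shlex.join(weka_apriori_command("bank-data.arff")))
```

Saving the printed command alongside the rule listing lets someone else recreate the run with the same parameter settings.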

Useful link:

http://maya.cs.depaul.edu/~classes/ect584/WEKA/associate.html

 ----------------------------------------------------------------------------------------