Data Mining, Chapter 1

What is Big Data?
“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” (Gartner)

“Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” (McKinsey & Company)

Data mining
People have been analysing and investigating data for centuries.

Statistics
Mean, Variance, Correlation, Distribution …
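
As a small illustration of the classical statistics listed above, the sketch below computes a mean, a variance and a Pearson correlation in plain Python. The function names and the toy numbers are made up for this example, and the variance shown is the population form (dividing by n), which is only one of the common conventions.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    # Population variance (divide by n); divide by n - 1 for the sample variance.
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def correlation(xs, ys):
    # Pearson correlation: covariance divided by the product of standard deviations.
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / math.sqrt(variance(xs) * variance(ys))

# Toy data, invented for illustration: ys is exactly twice xs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
print(mean(xs), variance(xs), correlation(xs, ys))
```

Because ys is a perfect linear function of xs, the correlation comes out at (approximately) 1.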

Today, data are often far beyond human comprehension.
Diversity, Volume, Dimensionality

Definition
Data Mining is the process of automatically extracting interesting and useful hidden patterns from data that are usually massive, incomplete and noisy.

Not a fully automatic process
Human intervention is often unavoidable.
Domain Knowledge
Data Collection and Pre-processing

Synonym: Knowledge Discovery

Data Integration & Analysis

Process of Data Mining

DM Techniques - Classification
“Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics (referred to as variables) and based on a training set of previously labeled items.”

Given a training set {(x1, y1), …, (xn, yn)}, produce a classifier (function) that maps any unseen object x to its class label y.

Algorithms
Decision Trees
K-Nearest Neighbours
Neural Networks
Support Vector Machines
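
To make the idea of a classifier built from a training set concrete, here is a minimal sketch of K-Nearest Neighbours, one of the algorithms listed above. It assumes numeric feature vectors, Euclidean distance and a simple majority vote; the function name `knn_predict` and the toy churn data are invented for illustration.

```python
import math
from collections import Counter

def knn_predict(training_set, x, k=3):
    """Predict the label of x by majority vote among its k nearest training points."""
    # Sort the (xi, yi) pairs by Euclidean distance from x and keep the closest k.
    neighbours = sorted(training_set, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Toy training set (hypothetical): two numeric features per customer and a label.
training_set = [
    ((1.0, 1.0), "stay"),
    ((1.5, 2.0), "stay"),
    ((2.0, 1.5), "stay"),
    ((6.0, 6.5), "churn"),
    ((7.0, 6.0), "churn"),
    ((6.5, 7.0), "churn"),
]

print(knn_predict(training_set, (1.2, 1.4)))
print(knn_predict(training_set, (6.4, 6.2)))
```

Nothing is learned ahead of time here: the training set itself is the model, and the choice of k controls how smooth the resulting decision boundary is.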

Applications
Churn Prediction
Medical Diagnosis

Classification Boundaries

Overfitting – Classification

Confusion Matrix

TPR = TP / (TP + FN)  (true positive rate)

TNR = TN / (TN + FP)  (true negative rate)

Accuracy = (TP + TN) / (P + N), where P = TP + FN and N = TN + FP
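
The three formulas above can be checked on a small example. The sketch below counts TP, FN, TN and FP from true and predicted labels and then applies the formulas, taking P and N to be the numbers of actual positives and negatives. The function names and the label vectors are made up for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false negatives, true negatives and false positives."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        else:
            if p == positive:
                fp += 1
            else:
                tn += 1
    return tp, fn, tn, fp

def rates(tp, fn, tn, fp):
    tpr = tp / (tp + fn)                        # TPR = TP / (TP + FN)
    tnr = tn / (tn + fp)                        # TNR = TN / (TN + FP)
    accuracy = (tp + tn) / (tp + fn + tn + fp)  # (TP + TN) / (P + N)
    return tpr, tnr, accuracy

# Hypothetical labels: 4 actual positives and 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, fn, tn, fp = confusion_counts(y_true, y_pred)
print(tp, fn, tn, fp)
print(rates(tp, fn, tn, fp))
```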

Receiver Operating Characteristic
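
A ROC curve plots TPR against the false positive rate (FPR = FP / (FP + TN) = 1 - TNR) as the decision threshold varies. The sketch below is one simple way to trace it, assuming the classifier outputs a numeric score per item, that higher scores mean "more likely positive", and that scores are distinct. The names `roc_points` and `auc` and the data are hypothetical.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points obtained by lowering the threshold one item at a time."""
    p = sum(1 for t in y_true if t == 1)   # number of actual positives
    n = len(y_true) - p                    # number of actual negatives
    points = [(0.0, 0.0)]
    tp = fp = 0
    # Visit items from the highest score to the lowest.
    for _, t in sorted(zip(scores, y_true), reverse=True):
        if t == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / n, tp / p))
    return points

def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

# Hypothetical labels and classifier scores.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(y_true, scores)
print(points)
print(auc(points))
```

A curve that hugs the top-left corner (area close to 1) indicates a classifier that ranks positives above negatives; the diagonal (area 0.5) corresponds to random guessing.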

DM Techniques - Clustering
“Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.”

Distance Metrics
Euclidean Distance
Manhattan Distance
Mahalanobis Distance
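
The three metrics above can be written in a few lines each. This is a sketch in plain Python; for the Mahalanobis distance it assumes the caller supplies the inverse covariance matrix as a list of rows (the parameter name `inv_cov` is made up here), rather than estimating it from data.

```python
import math

def euclidean(a, b):
    # Square root of the sum of squared coordinate differences.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    # Sum of absolute coordinate differences.
    return sum(abs(x - y) for x, y in zip(a, b))

def mahalanobis(a, b, inv_cov):
    # sqrt(d^T * S^-1 * d), where d = a - b and inv_cov is S^-1.
    d = [x - y for x, y in zip(a, b)]
    q = sum(d[i] * inv_cov[i][j] * d[j]
            for i in range(len(d)) for j in range(len(d)))
    return math.sqrt(q)

a, b = (0.0, 0.0), (3.0, 4.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[1 / 9.0, 0.0], [0.0, 1 / 16.0]]  # inverse of the covariance diag(9, 16)

print(euclidean(a, b))
print(manhattan(a, b))
print(mahalanobis(a, b, identity))
print(mahalanobis(a, b, scaled))
```

With the identity matrix the Mahalanobis distance reduces to the Euclidean distance; with a non-trivial covariance it rescales each direction by how much the data vary along it.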

Algorithms
K-Means
Sequential Leader
Affinity Propagation
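
As an example of one of the algorithms listed above, here is a minimal K-Means sketch. It alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its points. To keep the example reproducible it assumes the first k points are used as initial centroids, which is a simplification; practical implementations usually pick the initial centroids randomly or with a smarter seeding scheme.

```python
import math

def k_means(points, k, iterations=100):
    # Assumption for reproducibility: the first k points are the initial centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Toy data (invented): two well-separated groups of 2-D points.
points = [(1.0, 1.0), (8.0, 8.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
centroids, assignment = k_means(points, k=2)
print(centroids)
print(assignment)
```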

Applications
Market Research
Image Segmentation
Social Network Analysis

Hierarchical Clustering

DM Techniques - Association Rule

DM Techniques - Regression

Overfitting – Regression

Data Preprocessing
Real data are often surprisingly dirty.
A Major Challenge for Data Mining

Typical Issues
Missing Attribute Values
Different Coding/Naming Schemes
Infeasible Values
Inconsistent Data
Outliers

Data Quality
Accuracy
Completeness
Consistency
Interpretability
Credibility
Timeliness

Data Cleaning
Fill in missing values.
Correct inconsistent data.
Identify outliers and noisy data.
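
Two of the cleaning steps above can be sketched briefly. The example assumes missing values are represented as `None` and fills them with the mean of the observed values, and it flags outliers with the 1.5 x IQR rule. Both are common choices, but they are assumptions of this sketch, not methods prescribed by these notes; the function names and data are invented.

```python
import statistics

def fill_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.mean(observed)
    return [mean if v is None else v for v in values]

def flag_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

# Hypothetical measurements.
print(fill_missing([1.0, None, 3.0]))
print(flag_outliers([10, 12, 11, 13, 12, 11, 10, 95]))
```

Whether a flagged value is an error or a genuine extreme observation is a domain question, which is one reason the process cannot be fully automatic.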

Data Integration
Combine data from different sources.

Data Transformation
Normalization
Aggregation
Type Conversion
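
Normalization, the first transformation listed above, is often done in one of two ways: min-max scaling to the range [0, 1], or z-score standardization to zero mean and unit standard deviation. The sketch below shows both under the assumption that the values are not all identical (otherwise the denominators are zero); the function names and numbers are illustrative.

```python
import statistics

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Hypothetical attribute values.
incomes = [20.0, 30.0, 40.0, 60.0]
print(min_max(incomes))
print(z_score(incomes))
```

Scaling of this kind matters for distance-based methods, where an attribute with a large numeric range would otherwise dominate the distance.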

Data Reduction
Feature Selection
Sampling

Privacy Protection
Data: A Double-Edged Sword
People can benefit greatly from data analysis.
The consequence of information leakage can be catastrophic.

People may be reluctant to give sensitive information due to privacy concerns.
Drug use, taxes, sexuality …

How to find out the percentage of people with a certain attribute?
The interviewer should not know the true answer of each respondent.

Randomized Response
Used in structured survey research.
Can maintain the confidentiality of respondents.
Two questions are presented:
Q1: I have the attribute A.
Q2: I do not have the attribute A.

The respondent uses a random device to:
Answer Q1 with probability p.
Answer Q2 with probability 1-p.
The interviewer does not know which question was answered.
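
Although individual answers reveal nothing certain, the overall percentage can still be recovered. If a fraction π of people have attribute A and everyone answers truthfully, the probability of a "yes" is λ = p·π + (1 - p)·(1 - π), so π = (λ - (1 - p)) / (2p - 1), provided p ≠ 0.5. The sketch below simulates such a survey and applies this estimator. The function names, the true rate of 0.3 and the choice p = 0.7 are assumptions made for illustration.

```python
import random

def simulate_survey(true_rate, p, n, rng):
    """Return the number of 'yes' answers from n truthful respondents."""
    yes = 0
    for _ in range(n):
        has_a = rng.random() < true_rate   # does this respondent have attribute A?
        answers_q1 = rng.random() < p      # random device picks Q1 with probability p
        # Q1 is "I have A"; Q2 is "I do not have A". The reply is truthful either way.
        says_yes = has_a if answers_q1 else not has_a
        yes += says_yes
    return yes

def estimate_rate(yes, n, p):
    """Estimate the fraction with attribute A from the observed share of 'yes'."""
    if p == 0.5:
        raise ValueError("p = 0.5 makes the answers uninformative")
    lam = yes / n
    return (lam - (1 - p)) / (2 * p - 1)

rng = random.Random(0)  # fixed seed so the run is repeatable
n, p, true_rate = 100_000, 0.7, 0.3
yes = simulate_survey(true_rate, p, n, rng)
print(estimate_rate(yes, n, p))
```

The closer p is to 0.5, the better each respondent is protected, but the noisier the estimate becomes, because the denominator 2p - 1 shrinks toward zero.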