What is Big Data?
“Big data is high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” — Gartner
“Big data refers to datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” — McKinsey & Company
Data mining
People have been analysing and investigating data for centuries.
Statistics
Mean, Variance, Correlation, Distribution …
Today, data are often far beyond human comprehension.
Diversity, Volume, Dimensionality
Definition
Data Mining is the process of automatically extracting interesting and useful hidden patterns from usually massive, incomplete and noisy data.
Not a fully automatic process
Human interventions are often inevitable.
Domain Knowledge
Data Collection and Pre-processing
Synonym: Knowledge Discovery
Data Integration & Analysis
Process of Data Mining
DM Techniques - Classification
“Classification is a procedure in which individual items are placed into groups based on quantitative information on one or more characteristics (referred to as variables) and based on a training set of previously labeled items.”
Given a training set {(x1, y1), …, (xn, yn)}, produce a classifier (function) that maps any previously unseen object x to its class label y.
Algorithms
Decision Trees
K-Nearest Neighbours
Neural Networks
Support Vector Machines
Applications
Churn Prediction
Medical Diagnosis
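As an illustration of learning from a labeled training set, here is a minimal K-Nearest Neighbours sketch. The toy data, the labels "stay" and "churn", and the function name knn_predict are assumptions made for this example, not part of the original slides.

```python
import math
from collections import Counter

def knn_predict(training_set, x, k=3):
    """Predict the label of x by majority vote among its k nearest training items."""
    # Rank the labeled pairs (xi, yi) by Euclidean distance from xi to x.
    neighbours = sorted(training_set, key=lambda pair: math.dist(pair[0], x))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]

# Hypothetical toy data: two numeric features per customer, one label each.
training_set = [
    ((1.0, 1.0), "stay"),
    ((1.2, 0.8), "stay"),
    ((0.9, 1.1), "stay"),
    ((4.0, 4.2), "churn"),
    ((4.1, 3.9), "churn"),
    ((3.8, 4.0), "churn"),
]

print(knn_predict(training_set, (1.1, 1.0)))  # stay
print(knn_predict(training_set, (4.0, 4.0)))  # churn
```

K-Nearest Neighbours was chosen here only because it fits in a few lines; a small k follows the training data closely, while a larger k smooths the decision boundary.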
Classification Boundaries
Overfitting – Classification
Confusion Matrix
TPR=TP/(TP+FN)
TNR=TN/(TN+FP)
Accuracy=(TP+TN)/(P+N), where P=TP+FN and N=TN+FP
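The three ratios above can be computed directly from the confusion-matrix counts. The sketch below assumes binary labels with 1 as the positive class; the sample labels are made up for illustration, and P = TP + FN, N = TN + FP.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn

def rates(tp, fp, tn, fn):
    """Return (TPR, TNR, accuracy). Assumes both classes occur at least once."""
    tpr = tp / (tp + fn)                    # true positive rate
    tnr = tn / (tn + fp)                    # true negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return tpr, tnr, accuracy

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
tpr, tnr, accuracy = rates(tp, fp, tn, fn)
print(tp, fp, tn, fn)
print(tpr, tnr, accuracy)
```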
Receiver Operating Characteristic
DM Techniques - Clustering
“Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense.”
Distance Metrics
Euclidean Distance
Manhattan Distance
Mahalanobis Distance
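A minimal sketch of the three metrics, using only the standard library. The Mahalanobis version takes the inverse covariance matrix as an argument (an assumption made to avoid matrix inversion here); in practice it would be estimated from the data.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def mahalanobis(a, b, inv_cov):
    """sqrt(d^T * inv_cov * d), where d = a - b and inv_cov is the inverse covariance."""
    d = [x - y for x, y in zip(a, b)]
    n = len(d)
    total = sum(d[i] * inv_cov[i][j] * d[j] for i in range(n) for j in range(n))
    return math.sqrt(total)

a, b = (0.0, 0.0), (3.0, 4.0)
identity = [[1.0, 0.0], [0.0, 1.0]]
# Hypothetical case: both features independent with variance 4.
inv_cov = [[0.25, 0.0], [0.0, 0.25]]

print(euclidean(a, b))             # 5.0
print(manhattan(a, b))             # 7.0
print(mahalanobis(a, b, identity)) # 5.0, same as Euclidean
print(mahalanobis(a, b, inv_cov))  # 2.5
```

With the identity matrix the Mahalanobis distance reduces to the Euclidean distance; a non-identity covariance rescales each direction by how much the data vary along it.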
Algorithms
K-Means
Sequential Leader
Affinity Propagation
Applications
Market Research
Image Segmentation
Social Network Analysis
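As one concrete instance of the algorithms listed above, here is a minimal K-Means sketch with Euclidean distance. Using the first k points as initial centroids is a simplifying assumption for reproducibility; real implementations usually randomize the initialization. The sample points are made up.

```python
import math

def k_means(points, k, max_iterations=100):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    centroids = [points[i] for i in range(k)]  # naive init: first k points
    assignment = None
    for _ in range(max_iterations):
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing: converged
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(
                    sum(coord) / len(members) for coord in zip(*members)
                )
    return centroids, assignment

# Hypothetical 2-D observations forming two visible groups.
points = [(1.0, 1.0), (5.0, 5.0), (1.5, 2.0), (6.0, 5.5), (1.0, 0.5), (5.5, 6.0)]
centroids, assignment = k_means(points, k=2)
print(assignment)
print(centroids)
```

The result depends on the initial centroids and on k, which must be chosen in advance.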
Hierarchical Clustering
DM Techniques – Association Rule
DM Techniques – Regression
Overfitting – Regression
Data Preprocessing
Real data are often surprisingly dirty.
A Major Challenge for Data Mining
Typical Issues
Missing Attribute Values
Different Coding/Naming Schemes
Infeasible Values
Inconsistent Data
Outliers
Data Quality
Accuracy
Completeness
Consistency
Interpretability
Credibility
Timeliness
Data Cleaning
Fill in missing values.
Correct inconsistent data.
Identify outliers and noisy data.
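A minimal sketch of two of the cleaning steps above. Filling gaps with the median and flagging values more than two standard deviations from the mean are illustrative choices, not the only options; the age column is made-up data with None marking a missing value.

```python
import statistics

def fill_missing(values):
    """Replace None with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]

def flag_outliers(values, threshold=2.0):
    """Flag values whose z-score magnitude exceeds the threshold."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [False] * len(values)
    return [abs(v - mean) / stdev > threshold for v in values]

# Hypothetical age column: one missing value and one infeasible value (250).
ages = [25, 30, None, 28, 27, 26, 29, 31, 24, 250]

filled = fill_missing(ages)
flags = flag_outliers(filled)
print(filled)
print([v for v, is_outlier in zip(filled, flags) if is_outlier])
```

The median is used for filling because, unlike the mean, it is not pulled towards the infeasible value.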
Data Integration
Combine data from different sources.
Data Transformation
Normalization
Aggregation
Type Conversion
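Normalization can be sketched in two common forms: min-max scaling to [0, 1] and z-score standardization. Both functions below are illustrative; the names and sample values are assumptions.

```python
import statistics

def min_max(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0] * len(values)  # constant column: nothing to scale
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return [0.0] * len(values)
    return [(v - mean) / stdev for v in values]

print(min_max([10, 20, 30]))  # [0.0, 0.5, 1.0]
print(z_score([2, 4, 6]))
```

Min-max scaling is sensitive to outliers because a single extreme value sets the range; z-scores are less so, but do not bound the output.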
Data Reduction
Feature Selection
Sampling
Privacy Protection
Data: A Double-Edged Sword
People can benefit greatly from data analysis.
The consequence of information leakage can be catastrophic.
People may be reluctant to give sensitive information due to privacy concerns.
Drug use, tax, sexuality …
How to find out the percentage of people with a certain attribute?
The interviewer should not know the true answer of each respondent.
Randomized Response
Used in structured survey research.
Can maintain the confidentiality of respondents.
Two questions are presented:
Q1: I have the attribute A.
Q2: I do not have the attribute A.
The respondent uses a random device to:
Answer Q1 with probability p.
Answer Q2 with probability 1-p.
The interviewer does not know which question was answered.
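The population percentage can still be recovered from the pooled answers. If π is the true proportion with attribute A and λ is the observed fraction of "true" answers, then λ = pπ + (1-p)(1-π), so π = (λ + p - 1)/(2p - 1) for p ≠ 0.5. The simulation below is a sketch under these assumptions; the population size, the seed, and the function names are made up for illustration.

```python
import random

def simulate_survey(true_flags, p, rng):
    """Each respondent answers Q1 with probability p, otherwise Q2.
    Only the true/false answer is recorded, never which question was used."""
    responses = []
    for has_a in true_flags:
        if rng.random() < p:
            responses.append(has_a)        # Q1: "I have the attribute A."
        else:
            responses.append(not has_a)    # Q2: "I do not have the attribute A."
    return responses

def estimate_proportion(yes_fraction, p):
    """Invert lambda = p*pi + (1 - p)*(1 - pi) to recover pi."""
    if p == 0.5:
        raise ValueError("p = 0.5 makes the answers carry no information")
    return (yes_fraction + p - 1) / (2 * p - 1)

rng = random.Random(42)                    # fixed seed for reproducibility
population = [i < 6000 for i in range(20000)]  # hypothetical: 30% have A
responses = simulate_survey(population, p=0.7, rng=rng)
yes_fraction = sum(responses) / len(responses)
print(estimate_proportion(yes_fraction, p=0.7))  # close to 0.3
```

The closer p is to 0.5, the stronger the privacy for each respondent but the noisier the estimate, since the denominator 2p - 1 shrinks.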
Cloud Computing
Why bother with so many different algorithms?
No algorithm is always superior to others.
No parameter setting is optimal over all problems.
Look for the best match between problem and algorithm.
Experience
Trial and Error
Factors to consider:
Applicability
Computational Complexity
Interpretability
Always start with simple ones.
Grouping