Opening
Professor Alice Wong
Associate Dean of Science, HKU
- WiDS 2017 is a collaboration among Stanford University, SAP, Google, Microsoft and Walmart Labs.
- 50th anniversary of the HKU Department of Statistics and Actuarial Science
- the big data research cluster @HKU
Talk 1: Women in Data Science
Speaker:
Anita Varshney
Global Strategy Transformation Lead, SAP Hong Kong
- WiDS
- held by Stanford every February (March in Asia)
- keynote speakers from various industries that are doing data science now
- the largest attendance is actually in the Middle East
- SAP
- the world's largest provider of enterprise application software
- HQ in Germany; founded in 1972
- career suggestion: look for a good mentor
- present in 26 industries
- Real-time processes, prediction and simulation, great user experience, agility and TCO (total cost of ownership)
- SAP next-gen
- Provides a platform for college students to present their ideas directly to business customers.
- Technologies
- Machine learning
- IoT
*Amazing time management throughout the presentation
Talk 2: Big Data Decision Analysis
"Big data is something that breaks Microsoft Excel" (lol)
Research project - Machine Learning for Chinese Suicide Newspaper Articles Classification
Analyze how the media report suicide incidents, in order to figure out how to prevent suicide.
- WiseNews database: over 220K search results for the keyword "suicide", containing 84 million terms
- Big data challenges
- Noisy dataset: e.g. "suicide car bombing attack"
- Data classification
- Supervised Machine Learning (use labeled articles to train)
- Web interface for manual labeling
- Article features extraction for ML
- Text Segmentation: Sentence -> Words -> N-grams
- Tool: Jieba (结巴), with segmentation methods such as MP (maximum probability) and HMM (Hidden Markov Model)
- State Transition Matrix (B = word begin, M = middle, E = end): P(M|B) >> P(E|B)
- Document Representation
- Word to Document Matrix (not very efficient)
- Chosen approach - Word Embedding (Word2Vec)
- Each word is represented by a vector with a fixed number of dimensions (usually 30-500)
- Neural network: CBOW and Skip-Gram models, used to learn the values along each dimension of the vectors
- Cosine similarity
- Classification (Training)
- labeled dataset: 70% for training and 30% for testing
- P(Suicide = Yes) 85.9% accuracy, P(Student = No), P(HK = Yes), ...
- Future work
- Identify any pattern of misclassification
- Increase dimensions of the word vectors
- Deep learning approach for other NLP tasks with this dataset
- Predict the method used for suicide
- Predict the reasons for suicide
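
The feature-extraction and evaluation steps noted above (n-grams, cosine similarity, a 70/30 train/test split) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the speaker's pipeline: whitespace splitting stands in for Jieba segmentation, sparse bigram counts stand in for Word2Vec embeddings, and the two toy documents and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def ngrams(words, n):
    """Return the list of n-grams (as tuples) over a list of words."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (Counters)."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector has no direction
    return dot / (norm_a * norm_b)


def train_test_split(items, train_fraction=0.7, seed=0):
    """Shuffle a copy of items and split it into (train, test)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy documents; whitespace splitting stands in for Jieba.
doc_a = "police report suicide case in hong kong".split()
doc_b = "police report traffic case in hong kong".split()

# Sentence -> words -> n-grams, then compare the two count vectors.
bigrams_a = Counter(ngrams(doc_a, 2))
bigrams_b = Counter(ngrams(doc_b, 2))
similarity = cosine_similarity(bigrams_a, bigrams_b)

# 70% of the labeled items for training, 30% for testing.
train, test = train_test_split(range(10))
```

The two toy documents share 4 of their 6 bigrams, so the similarity works out to 4/6. With real word embeddings the same cosine formula applies to dense vectors instead of counts.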
Talk 3: Predictive Analytics
Vanessa Ko
Head of Presales SAP Hong Kong
- SAP HK
- Customers: I.T., Cathay Pacific, PizzaHut, etc.
- Biggest competitor: none overall, only in some sub-areas.
- Predictive Analytics
- How to make use of digitalized historical data
- Case: Obama for America 2012
- Data sources: historical voting data, census, volunteer-collected data, Facebook, etc.
- Segments of voters, fundraising prediction, who's persuadable?
- Data modeling: voting rate model, support rate model, persuasion rating, overall score
- Goal: target voters, donors and volunteers -> especially swing voters (neither too supportive nor too opposed)