Walkthrough: QBUS2820, Python, Predictive Analytics

QBUS2820 Predictive Analytics
Semester 2, 2018
Assignment 2

Key information

Required submissions: Written report (Word or PDF format, submitted through Turnitin) and Jupyter Notebook (submitted through Ed). The group leader needs to submit both the written report and the Jupyter Notebook.

Due date: Saturday 3rd November 2018, 2pm (report and Jupyter Notebook submission). The late penalty for the assignment is 10% of the assigned mark per day, starting after 2pm on the due date. The closing date, Saturday 10th November 2018, 2pm, is the last date on which an assessment will be accepted for marking.

Weight: 30 out of 100 marks in your final grade.

Groups: You can complete the assignment in groups of up to three students. There are no exceptions: if there are more than three, you need to split the group.

Length: The main text of your report (including Task 1 and Task 2) should have a maximum of 20 pages. Especially for Task 2, you should write a complete report. You may refer to Assignment 1, Task 2 as a reference for the structure of the report.

If you wish to include additional material, you can do so by creating an appendix. There is no page limit for the appendix. Keep in mind that making good use of your audience's time is an essential business skill. Every sentence, table and figure has to count. Extraneous and/or wrong material will reduce your mark no matter the quality of the assignment.

Anonymous marking: In line with the University's anonymous marking policy, please include only your student ID and group ID in the submitted report, and do NOT include your name. The file name of your report should follow the format below. Replace 123 with your group SID. Example: Group123Qbus2820Assignment2S22018.

Presentation of the assignment is part of the assignment. Markers might assign up to 10% of the mark for clarity of writing and presentation. Numbers with decimals should be reported to three decimal places.

Key rules:
Carefully read the requirements for each part of the assignment.
Please follow any further instructions announced on Canvas, particularly for submissions.
You must use Python for the assignment.
Reproducibility is fundamental in data analysis, so you are required to submit a Jupyter Notebook that generates your results. Unfortunately, Turnitin does not accept multiple files, so you will do this through Ed instead. Not submitting your code will lead to a loss of 50% of the assignment marks.
Failure to read information and follow instructions may lead to a loss of marks. Furthermore, note that it is your responsibility to be informed of the University of Sydney and Business School rules and guidelines, and to follow them.
Referencing: Harvard Referencing System. (You may find the details at: http://libguides.library.usyd.edu.au/c.php?g=508212&p=3476130)

Task 1 (35 Marks)

Part A: Logistic Regression (15 Marks)

Use logistic regression to predict the diagnosis of breast cancer patients on the Breast Cancer Wisconsin (Diagnostic) Dataset "wdbc.data". See the section "About the dataset" for a detailed data description.

(a) Write Python code to load the data. For the target feature Diagnosis, change its literal M (malignant) to 1 and B (benign) to 0.

Then define and train a logistic regression model with intercept by using scikit-learn's LogisticRegression model, using default parameter values.

Based on the estimated parameters from your model, calculate the probability of sample ID 8510426 (20th sample) having a benign diagnosis.

(b) Based on slides 26 to 31 of Lecture 9, write your own Python code to implement the gradient ascent algorithm for logistic regression with intercept.

You may use the following logistic function:

    def logistic_function(reg_input):
        return np.exp(reg_input) / (1 + np.exp(reg_input))

Using the given data, write Python code that starts from initial values β = [0, 0, …, 0] and runs the gradient ascent algorithm to maximize the log-likelihood function of logistic regression with respect to the parameters.
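
As a rough illustration of the gradient ascent update described in (b), the sketch below fits a logistic regression with intercept on synthetic data. It is not the algorithm from the Lecture 9 slides, which are not reproduced here. The names `log_likelihood` and `gradient_ascent`, the learning rate of 0.1, the iteration count, the averaging of the gradient over the sample, and the synthetic data are all assumptions made for illustration. The logistic function is written in the algebraically equivalent form 1 / (1 + exp(-z)), with clipping to reduce the risk of overflow.

```python
import numpy as np

def logistic_function(reg_input):
    # Same function as exp(z) / (1 + exp(z)); clipping guards against overflow.
    z = np.clip(reg_input, -500, 500)
    return 1.0 / (1.0 + np.exp(-z))

def log_likelihood(X, y, beta):
    # Bernoulli log-likelihood; eps keeps log() away from zero.
    p = logistic_function(X @ beta)
    eps = 1e-12
    return np.sum(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

def gradient_ascent(X, y, beta_init, learning_rate=0.1, n_iter=5000):
    # Assumed defaults; the step size and iteration count need tuning per dataset.
    beta = np.asarray(beta_init, dtype=float).copy()
    n_obs = X.shape[0]
    for _ in range(n_iter):
        p = logistic_function(X @ beta)
        gradient = X.T @ (y - p)                   # gradient of the log-likelihood
        beta += learning_rate * gradient / n_obs   # ascent step (averaged gradient)
    return beta

# Synthetic data (made up for this sketch, not the wdbc data).
rng = np.random.default_rng(0)
n = 500
features = rng.normal(size=(n, 2))
X = np.column_stack([np.ones(n), features])        # first column is the intercept
true_beta = np.array([-0.5, 2.0, -1.0])
y = (rng.uniform(size=n) < logistic_function(X @ true_beta)).astype(float)

beta_hat = gradient_ascent(X, y, beta_init=np.zeros(X.shape[1]))

# Probability that the 20th row (index 19) belongs to class 0.
p_class0 = 1.0 - logistic_function(X[19] @ beta_hat)
print(beta_hat, p_class0)
```

Dividing the gradient by the number of observations only rescales the learning rate, which tends to make a single step size usable across sample sizes. On real data with features on very different scales, a much smaller learning rate or standardised features may be needed for the iterations to remain stable.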
Find the optimal learning rate and the resulting estimate β̂. Then redo task (a): the probability of sample ID 8510426 (20th sample) having a benign diagnosis. Compare the results and explain the major reasons why you may have different answers from scikit-learn.

Now change the initial values to β = [1, 1, …, 1], redo the above tasks, and report your results and findings.

About the dataset:
https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic)
https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/wdbc.names

Part B: YouTube Comment Spam Classification (20 Marks)

Some questions in Task 2 need you to do some self-learning, e.g., exploring how to build features for the text data using bag of words. You should discuss with your group members how to deal with the problem and do the necessary self-learning, which is an important ability to have for your future study and career.

Your goal is to build a Random Forest (RF) classifier that classifies whether a YouTube comment is spam or not.

Use the ytube_spam dataset. We have already split the data into train and test sets: ytube_spam_trainset.csv and ytube_spam_testset.csv.

General instructions:
1. CLASS in the data is the target variable y.
2. Use 3-fold cross validation if needed.
3. Make sure to set your random number generator seed to 0 for this question: np.random.seed(0).

(a) Self-study and use the following Python package:

    from sklearn.feature_extraction.text import TfidfVectorizer

Build a bag of words representation of the data with:
  • Max 1000 features
  • Remove the top 1% of frequently occurring words
  • A word must occur at least twice to be included as a feature
  • Remove common English words

(b) Build a random forest classifier and use cross validation to optimise the parameters of the random forest.
You need to at least optimise the number of trees in the random forest, and you can explore and optimise other parameters as well.

Use the following Python packages:

    from sklearn import ensemble
    from sklearn.model_selection import GridSearchCV

With your CV-selected optimal parameter values, re-train the RF on the full training set and produce your best performing model.

Test your best performing model on the test set; you must achieve an average score (avg / total) of at least 0.96 for precision, recall and f1-score in sklearn's classification_report. Report the sklearn classification_report output.

(c) Based on your cross validation results from GridSearchCV, plot the mean_test_score and mean_train_score vs number of trees on the same figure. If you optimised other parameters, then fix these parameters to their optimal values.

(d) Report your random forest settings that achieve the best classification.

(e) Produce a histogram of the depths of the trees of your best performing model.

(f) Report the top 10 most important text features of your best performing model.

Task 2 (25 Marks)

1. Problem description

Rossmann is a German drug store chain that operates over 3000 stores in 7 European countries. In this assignment, you will use "Rossman_Sales.csv" data to forecast six weeks of daily sales following the last period in the dataset.

Your objective in this assignment is to develop univariate forecasting models, i.e. models using only the historical sales, to address this problem.

We focus on the sales forecasting of store 1. You can download the dataset "Rossman_Sales.csv" from Canvas.

2. Report and requirements

a. The purpose of the report is to discuss the business context, exploratory data analysis, methodology, model diagnostics and model validation, and to present forecasts and conclusions for six weeks of daily sales following the last period in the dataset.
b. Your group must identify at least 1 simple benchmark model and at least 2 different forecasting methods or models that can be used to forecast sales.
c. The report should also include an analysis of monthly sales (with the limitation that the sample size is small at this frequency).

3. Further analysis for bonus marks

The group can earn up to 2 bonus marks (in the final mark for the unit) by developing a system to automatically generate forecasts for all stores. In order to obtain the bonus marks, you should present interesting results based on this tool (use the appendix and refer to it in the main text of the report). The ability to summarise information and be concise is essential here.

Reposted from: http://ass.3daixie.com/2018110368662431.html
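
Task 2 asks for at least one simple benchmark model. One common choice for daily data is a seasonal naive forecast, which repeats the most recent week over the forecast horizon. The sketch below is only an illustration under assumptions: the brief does not prescribe this benchmark, the function name `seasonal_naive_forecast` is hypothetical, the weekly season length of 7 is assumed, and the series is synthetic rather than taken from "Rossman_Sales.csv".

```python
import numpy as np

def seasonal_naive_forecast(history, horizon=42, season_length=7):
    """Repeat the last full season of `history` to cover `horizon` steps."""
    history = np.asarray(history, dtype=float)
    if history.size < season_length:
        raise ValueError("need at least one full season of history")
    last_season = history[-season_length:]
    repeats = int(np.ceil(horizon / season_length))
    return np.tile(last_season, repeats)[:horizon]

# Synthetic daily sales with a weekly pattern (made-up numbers, not Rossmann data).
rng = np.random.default_rng(0)
weekly_pattern = np.array([5200, 4800, 4700, 4600, 5000, 5500, 0])
history = np.tile(weekly_pattern, 20) + rng.normal(0, 100, size=140)

# Six weeks of daily forecasts (42 days), each equal to the same weekday last week.
forecast = seasonal_naive_forecast(history)
print(forecast[:7])
```

A benchmark like this gives the other forecasting methods a baseline to beat: if a more elaborate model cannot improve on repeating last week's values in validation, its extra complexity is hard to justify.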
