
Homework 4, INF 552. Instructor: Mohammad Reza Rajati

Disclaimer: This homework applies SMOTE to a severely imbalanced data set with a large number of features and data points. SMOTE is an inherently time-consuming method, so start this homework early to leave enough time to run SMOTE on the full data set.

1. The LASSO and Boosting for Regression

(a) Download the Communities and Crime data from https://archive.ics.uci.edu/ml/datasets/Communities+and+Crime. (If the downloaded .data file does not open, simply rename it to .csv.) Use the first 1,495 rows of data as the training set and the rest as the test set.

(b) The data set has missing values. Use a data imputation technique to deal with them. The data description marks some features as non-predictive; ignore those features.

(c) Plot a correlation matrix for the features in the data set.

(d) Calculate the coefficient of variation CV = s/m for each feature, where s is the sample standard deviation and m is the sample mean.

(e) Pick the ⌊√128⌋ = 11 features with the highest CV, and make scatter plots and box plots for them. Can you draw conclusions about the significance of those features from the scatter plots alone?

(f) Fit a linear model to the training set using least squares and report the test error.

(g) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

(h) Fit a LASSO model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the list of variables selected by the model. Repeat with standardized features (in this data set, the features are already normalized). Report and compare the test errors for both cases.

(i) Fit a PCR model on the training set, with M (the number of principal components) chosen by cross-validation. Report the test error obtained.

(j) In this section, we would like to fit a boosting tree to the data. As in classification trees, one can use any type of regression at each node to build a multivariate regression tree.
Because the number of variables is large in this problem, one can use L1-penalized regression at each node. Such a tree is called an L1-penalized gradient boosting tree. You can use XGBoost to fit the model tree (hints on installing XGBoost on Windows: http://www.picnet.com.au/blogs/guido/2016/09/22/xgboost-windows-x64-binaries-for-download/). Determine α (the regularization term) using cross-validation.

2. Tree-Based Methods

(a) Download the APS Failure data from https://archive.ics.uci.edu/ml/datasets/APS+Failure+at+Scania+Trucks. The data set contains a training set and a test set. The training set has 60,000 rows, of which 1,000 belong to the positive class, and 171 columns, one of which is the class column. All attributes are numeric.

(b) Data Preparation

This data set has missing values. When a significant number of rows have missing values, discarding them is not a good idea. (In reality, when we have a model and want to fill in missing values, we do not have access to the training data, so we only use the statistics of the test data to fill them in. For simplicity, in this exercise you first fill in the missing values and then split your data into training and test sets.)

i. Research the techniques that are commonly used for dealing with missing values; they are called data imputation techniques. Pick at least one of them and apply it to this data in the following steps. You are welcome to test more than one method.

ii. For each of the 170 features, calculate the coefficient of variation CV = s/m, where s is the sample standard deviation and m is the sample mean.

iii. Plot a correlation matrix for your features using pandas or any other tool.

iv. Pick the ⌊√170⌋ = 13 features with the highest CV, and make scatter plots and box plots for them, similar to those on p. 129 of ISLR. Can you draw conclusions about the significance of those features from the scatter plots alone? This does not mean that you will only use those features in the following questions; we picked them only for visualization.

v. Determine the number of positive and negative data points.
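The data-preparation steps of 2(b) can be sketched as follows. This is a minimal illustration on a tiny synthetic frame, not the APS data itself; for the real set you would read the CSV with `na_values="na"` and the `class` column instead. Median imputation is just one possible choice of imputation technique.

```python
# Sketch of 2(b): median imputation, coefficient of variation,
# top-floor(sqrt(p)) feature selection, and class counts.
import numpy as np
import pandas as pd

# Toy stand-in for the APS features (NaN marks a missing value).
X = pd.DataFrame({
    "f1": [1.0, 2.0, np.nan, 4.0],
    "f2": [10.0, 10.0, 10.0, 10.0],
    "f3": [0.0, 5.0, 10.0, np.nan],
})
y = pd.Series(["neg", "neg", "neg", "pos"], name="class")

# Simple imputation: fill each feature's missing values with its median.
X_imp = X.fillna(X.median())

# Coefficient of variation CV = s/m (sample standard deviation over mean).
cv = X_imp.std() / X_imp.mean()

# Keep the floor(sqrt(p)) features with the highest CV (13 for the APS
# set's 170 features; here p = 3, so one) -- used only for visualization.
k = int(np.floor(np.sqrt(X_imp.shape[1])))
top = cv.abs().sort_values(ascending=False).head(k)

# Part (v): class counts, to judge how imbalanced the data set is.
counts = y.value_counts()
```

On the real APS training set the same `value_counts()` call shows 59,000 negatives against 1,000 positives, which answers the imbalance question directly.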
Is this data set imbalanced?

(c) Train a random forest to classify the data set. Do NOT compensate for class imbalance. Calculate the confusion matrix, ROC, AUC, and misclassification rate for the training and test sets and report them (you may use the pROC package). Calculate the Out-of-Bag error estimate for your random forest and compare it to the test error.

(d) Research how class imbalance is addressed in random forests. Compensate for class imbalance in your random forest, repeat 2(c), and compare the results with those of 2(c).

(e) Model Trees

In a univariate tree, only one input dimension is used at each split. In a multivariate tree, or model tree, all input dimensions can be used at a decision node, so it is more general. In univariate classification trees, majority voting at each node determines that node's decision rule. In model trees, a (linear) model that relies on all of the variables determines the split of a node, i.e. instead of using X_j > s as the decision rule, one uses Σ_j β_j X_j > s. Similarly, in a regression tree, instead of using the average of the region associated with each node, a linear regression model determines the value associated with that node.

One of the models that can be used at each node is logistic regression. One can use scikit-learn to call Weka (see http://fracpete.github.io/python-weka-wrapper/install.html) to train Logistic Model Trees for classification. Train a Logistic Model Tree for the APS data set without compensating for class imbalance.
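Parts 2(c) and 2(d) can be sketched with scikit-learn as below. This is a minimal sketch on synthetic imbalanced data, not the APS set; `class_weight="balanced"` is one common way to compensate for imbalance in a random forest (reweighting classes inversely to their frequency), and `oob_score=True` provides the Out-of-Bag estimate.

```python
# Sketch of 2(c)/2(d): random forest with and without class-weight
# compensation, plus the out-of-bag (OOB) error estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the APS data, ~20:1 imbalanced.
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.95, 0.05], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# (c) No compensation; oob_score_ is the OOB accuracy estimate.
rf = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_tr, y_tr)
oob_error = 1.0 - rf.oob_score_
cm = confusion_matrix(y_te, rf.predict(X_te))
auc_plain = roc_auc_score(y_te, rf.predict_proba(X_te)[:, 1])

# (d) Compensated: reweight classes inversely to their frequency.
rf_bal = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                oob_score=True, random_state=0)
rf_bal.fit(X_tr, y_tr)
auc_bal = roc_auc_score(y_te, rf_bal.predict_proba(X_te)[:, 1])
```

Comparing `oob_error` to the test misclassification rate shows how well the OOB estimate tracks held-out performance without needing a separate validation set.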
Use 5-fold, 10-fold, or leave-one-out cross-validation to estimate the error of your trained model and compare it with the test error. Report the confusion matrix, ROC, and AUC for the training and test sets.

(f) Use SMOTE (Synthetic Minority Over-sampling Technique) to pre-process your data to compensate for class imbalance. (If you did not start this homework on time, downsample the majority class to 6,000 so that you have 12,000 data points after applying SMOTE; remember, though, that the purpose of this homework is to apply SMOTE to the whole data set.) Train a Logistic Model Tree using the pre-processed data and repeat 2(e). Do not forget that there is a right and a wrong way of doing cross-validation here. Compare the uncompensated case with SMOTE.

3. ISLR 6.8.3

4. ISLR 6.8.5

5. ISLR 8.4.5

6. ISLR 9.7.3

7. Extra Practice: ISLR 5.4.2, 6.8.4, 8.4.4, 9.7.2

Appendix: Weka for Mac users

1. Download JDK 9 from http://www.oracle.com/technetwork/java/javase/downloads/index.html
2. Add environment variables in Terminal using: vi ~/.bash_profile
   (a) export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk-9.0.4.jdk/Contents/Home
   (b) export PATH=$JAVA_HOME/bin:$PATH
3. Restart Terminal
4. Get brew (the package installer for Mac, if you don't have it) and install Python (not necessary)
5. brew install pkg-config
6. brew install graphviz
7. pip install javabridge
8. pip install python-weka-wrapper

You should then be able to use Weka in your Jupyter notebooks.
