Data Mining and Predictive Analytics.

By: Larose, Daniel T.
Contributor(s): Larose, Chantal D.
Material type: Text
Series: Wiley Series on Methods and Applications in Data Mining
Publisher: New York : John Wiley & Sons, Incorporated, 2015
Copyright date: ©2015
Edition: 2nd ed.
Description: 1 online resource (827 pages)
Content type: text
Media type: computer
Carrier type: online resource
ISBN: 9781118868676
Subject(s): Data mining | Prediction theory
Genre/Form: Electronic books.
Additional physical formats: Print version: Data Mining and Predictive Analytics
DDC classification: 006.3/12
LOC classification: QA76.9.D343 .L3776 2015eb
Contents:
Cover -- Contents -- Preface -- Acknowledgments -- Part I Data Preparation -- Chapter 1 An Introduction to Data Mining and Predictive Analytics -- 1.1 What is Data Mining? What is Predictive Analytics? -- 1.2 Wanted: Data Miners -- 1.3 The Need for Human Direction of Data Mining -- 1.4 The Cross-Industry Standard Process for Data Mining: CRISP-DM -- 1.4.1 CRISP-DM: The Six Phases -- 1.5 Fallacies of Data Mining -- 1.6 What Tasks Can Data Mining Accomplish? -- 1.6.1 Description -- 1.6.2 Estimation -- 1.6.3 Prediction -- 1.6.4 Classification -- 1.6.5 Clustering -- 1.6.6 Association -- The R Zone -- R References -- Exercises -- Chapter 2 Data Preprocessing -- 2.1 Why do We Need to Preprocess the Data? -- 2.2 Data Cleaning -- 2.3 Handling Missing Data -- 2.4 Identifying Misclassifications -- 2.5 Graphical Methods for Identifying Outliers -- 2.6 Measures of Center and Spread -- 2.7 Data Transformation -- 2.8 Min-Max Normalization -- 2.9 Z-Score Standardization -- 2.10 Decimal Scaling -- 2.11 Transformations to Achieve Normality -- 2.12 Numerical Methods for Identifying Outliers -- 2.13 Flag Variables -- 2.14 Transforming Categorical Variables into Numerical Variables -- 2.15 Binning Numerical Variables -- 2.16 Reclassifying Categorical Variables -- 2.17 Adding an Index Field -- 2.18 Removing Variables that are not Useful -- 2.19 Variables that Should Probably not be Removed -- 2.20 Removal of Duplicate Records -- 2.21 A Word About ID Fields -- The R Zone -- R Reference -- Exercises -- Chapter 3 Exploratory Data Analysis -- 3.1 Hypothesis Testing Versus Exploratory Data Analysis -- 3.2 Getting to Know the Data Set -- 3.3 Exploring Categorical Variables -- 3.4 Exploring Numeric Variables -- 3.5 Exploring Multivariate Relationships -- 3.6 Selecting Interesting Subsets of the Data for Further Investigation -- 3.7 Using EDA to Uncover Anomalous Fields.
3.8 Binning Based on Predictive Value -- 3.9 Deriving New Variables: Flag Variables -- 3.10 Deriving New Variables: Numerical Variables -- 3.11 Using EDA to Investigate Correlated Predictor Variables -- 3.12 Summary of Our EDA -- The R Zone -- R References -- Exercises -- Chapter 4 Dimension-Reduction Methods -- 4.1 Need for Dimension-Reduction in Data Mining -- 4.2 Principal Components Analysis -- 4.3 Applying PCA to the Houses Data Set -- 4.4 How Many Components Should We Extract? -- 4.4.1 The Eigenvalue Criterion -- 4.4.2 The Proportion of Variance Explained Criterion -- 4.4.3 The Minimum Communality Criterion -- 4.4.4 The Scree Plot Criterion -- 4.5 Profiling the Principal Components -- 4.6 Communalities -- 4.6.1 Minimum Communality Criterion -- 4.7 Validation of the Principal Components -- 4.8 Factor Analysis -- 4.9 Applying Factor Analysis to the Adult Data Set -- 4.10 Factor Rotation -- 4.11 User-Defined Composites -- 4.12 An Example of a User-Defined Composite -- The R Zone -- R References -- Exercises -- Part II Statistical Analysis -- Chapter 5 Univariate Statistical Analysis -- 5.1 Data Mining Tasks in Discovering Knowledge in Data -- 5.2 Statistical Approaches to Estimation and Prediction -- 5.3 Statistical Inference -- 5.4 How Confident are We in Our Estimates? -- 5.5 Confidence Interval Estimation of the Mean -- 5.6 How to Reduce the Margin of Error -- 5.7 Confidence Interval Estimation of the Proportion -- 5.8 Hypothesis Testing for the Mean -- 5.9 Assessing the Strength of Evidence Against the Null Hypothesis -- 5.10 Using Confidence Intervals to Perform Hypothesis Tests -- 5.11 Hypothesis Testing for the Proportion -- Reference -- The R Zone -- R Reference -- Exercises -- Chapter 6 Multivariate Statistics -- 6.1 Two-Sample t-Test for Difference in Means -- 6.2 Two-Sample Z-Test for Difference in Proportions.
6.3 Test for the Homogeneity of Proportions -- 6.4 Chi-Square Test for Goodness of Fit of Multinomial Data -- 6.5 Analysis of Variance -- Reference -- The R Zone -- R Reference -- Exercises -- Chapter 7 Preparing to Model the Data -- 7.1 Supervised Versus Unsupervised Methods -- 7.2 Statistical Methodology and Data Mining Methodology -- 7.3 Cross-Validation -- 7.4 Overfitting -- 7.5 Bias-Variance Trade-Off -- 7.6 Balancing the Training Data Set -- 7.7 Establishing Baseline Performance -- The R Zone -- R Reference -- Exercises -- Chapter 8 Simple Linear Regression -- 8.1 An Example of Simple Linear Regression -- 8.1.1 The Least-Squares Estimates -- 8.2 Dangers of Extrapolation -- 8.3 How Useful is the Regression? The Coefficient of Determination, r² -- 8.4 Standard Error of the Estimate, s -- 8.5 Correlation Coefficient r -- 8.6 ANOVA Table for Simple Linear Regression -- 8.7 Outliers, High Leverage Points, and Influential Observations -- 8.8 Population Regression Equation -- 8.9 Verifying the Regression Assumptions -- 8.10 Inference in Regression -- 8.11 t-Test for the Relationship Between x and y -- 8.12 Confidence Interval for the Slope of the Regression Line -- 8.13 Confidence Interval for the Correlation Coefficient ρ -- 8.14 Confidence Interval for the Mean Value of y Given x -- 8.15 Prediction Interval for a Randomly Chosen Value of y Given x -- 8.16 Transformations to Achieve Linearity -- 8.17 Box-Cox Transformations -- The R Zone -- R References -- Exercises -- Chapter 9 Multiple Regression and Model Building -- 9.1 An Example of Multiple Regression -- 9.2 The Population Multiple Regression Equation -- 9.3 Inference in Multiple Regression -- 9.3.1 The t-Test for the Relationship Between y and xᵢ -- 9.3.2 t-Test for Relationship Between Nutritional Rating and Sugars -- 9.3.3 t-Test for Relationship Between Nutritional Rating and Fiber Content.
9.3.4 The F-Test for the Significance of the Overall Regression Model -- 9.3.5 F-Test for Relationship Between Nutritional Rating and {Sugar and Fiber}, Taken Together -- 9.3.6 The Confidence Interval for a Particular Coefficient, βᵢ -- 9.3.7 The Confidence Interval for the Mean Value of y, Given x₁, x₂, ..., xₘ -- 9.3.8 The Prediction Interval for a Randomly Chosen Value of y, Given x₁, x₂, ..., xₘ -- 9.4 Regression with Categorical Predictors, Using Indicator Variables -- 9.5 Adjusting R²: Penalizing Models for Including Predictors that are not Useful -- 9.6 Sequential Sums of Squares -- 9.7 Multicollinearity -- 9.8 Variable Selection Methods -- 9.8.1 The Partial F-Test -- 9.8.2 The Forward Selection Procedure -- 9.8.3 The Backward Elimination Procedure -- 9.8.4 The Stepwise Procedure -- 9.8.5 The Best Subsets Procedure -- 9.8.6 The All-Possible-Subsets Procedure -- 9.9 Gas Mileage Data Set -- 9.10 An Application of Variable Selection Methods -- 9.10.1 Forward Selection Procedure Applied to the Gas Mileage Data Set -- 9.10.2 Backward Elimination Procedure Applied to the Gas Mileage Data Set -- 9.10.3 The Stepwise Selection Procedure Applied to the Gas Mileage Data Set -- 9.10.4 Best Subsets Procedure Applied to the Gas Mileage Data Set -- 9.10.5 Mallows' Cₚ Statistic -- 9.11 Using the Principal Components as Predictors in Multiple Regression -- The R Zone -- R References -- Exercises -- Part III Classification -- Chapter 10 k-Nearest Neighbor Algorithm -- 10.1 Classification Task -- 10.2 k-Nearest Neighbor Algorithm -- 10.3 Distance Function -- 10.4 Combination Function -- 10.4.1 Simple Unweighted Voting -- 10.4.2 Weighted Voting -- 10.5 Quantifying Attribute Relevance: Stretching the Axes -- 10.6 Database Considerations -- 10.7 k-Nearest Neighbor Algorithm for Estimation and Prediction -- 10.8 Choosing k.
10.9 Application of k-Nearest Neighbor Algorithm Using IBM/SPSS Modeler -- The R Zone -- R References -- Exercises -- Chapter 11 Decision Trees -- 11.1 What is a Decision Tree? -- 11.2 Requirements for Using Decision Trees -- 11.3 Classification and Regression Trees -- 11.4 C4.5 Algorithm -- 11.5 Decision Rules -- 11.6 Comparison of the C5.0 and CART Algorithms Applied to Real Data -- The R Zone -- R References -- Exercises -- Chapter 12 Neural Networks -- 12.1 Input and Output Encoding -- 12.2 Neural Networks for Estimation and Prediction -- 12.3 Simple Example of a Neural Network -- 12.4 Sigmoid Activation Function -- 12.5 Back-Propagation -- 12.6 Gradient-Descent Method -- 12.7 Back-Propagation Rules -- 12.8 Example of Back-Propagation -- 12.9 Termination Criteria -- 12.10 Learning Rate -- 12.11 Momentum Term -- 12.12 Sensitivity Analysis -- 12.13 Application of Neural Network Modeling -- The R Zone -- R References -- Exercises -- Chapter 13 Logistic Regression -- 13.1 Simple Example of Logistic Regression -- 13.2 Maximum Likelihood Estimation -- 13.3 Interpreting Logistic Regression Output -- 13.4 Inference: are the Predictors Significant? -- 13.5 Odds Ratio and Relative Risk -- 13.6 Interpreting Logistic Regression for a Dichotomous Predictor -- 13.7 Interpreting Logistic Regression for a Polychotomous Predictor -- 13.8 Interpreting Logistic Regression for a Continuous Predictor -- 13.9 Assumption of Linearity -- 13.10 Zero-Cell Problem -- 13.11 Multiple Logistic Regression -- 13.12 Introducing Higher Order Terms to Handle Nonlinearity -- 13.13 Validating the Logistic Regression Model -- 13.14 WEKA: Hands-On Analysis Using Logistic Regression -- The R Zone -- R References -- Exercises -- Chapter 14 Naïve Bayes and Bayesian Networks -- 14.1 Bayesian Approach -- 14.2 Maximum a Posteriori (MAP) Classification -- 14.3 Posterior Odds Ratio.
14.4 Balancing the Data.
Summary: Learn methods of data analysis and their application to real-world data sets. This updated second edition serves as an introduction to data mining methods and models, including association rules, clustering, neural networks, logistic regression, and multivariate analysis. The authors apply a unified "white box" approach to data mining methods and models, walking readers through the operations and nuances of each method using small data sets, so that readers can gain insight into the inner workings of the method under review. Chapters provide hands-on analysis problems that give readers the opportunity to apply their newly acquired data mining expertise to real problems using large, real-world data sets. Data Mining and Predictive Analytics, Second Edition: offers comprehensive coverage of association rules, clustering, neural networks, logistic regression, multivariate analysis, and the R statistical programming language; features over 750 chapter exercises, allowing readers to assess their understanding of the new material; provides a detailed case study that brings together the lessons learned in the book; and includes access to the companion website, www.dataminingconsultant.com, with exclusive password-protected instructor content. Data Mining and Predictive Analytics, Second Edition will appeal to computer science and statistics students, as well as students in MBA programs and chief executives.

Description based on publisher supplied metadata and other sources.

Electronic reproduction. Ann Arbor, Michigan : ProQuest Ebook Central, 2018. Available via World Wide Web. Access may be limited to ProQuest Ebook Central affiliated libraries.
