Dimensionality reduction in data science / Max Garzon [and five others].

Format
Book
Language
English
Published/Created
  • Cham, Switzerland : Springer, [2022]
  • ©2022
Description
1 online resource (268 pages) : illustrations

Details

Summary note
This book provides a practical and fairly comprehensive review of Data Science through the lens of dimensionality reduction, together with hands-on techniques for tackling problems with data collected in the real world. State-of-the-art results and solutions from statistics, computer science, and mathematics are explained from the point of view of a practitioner in any domain science, such as biology, cybersecurity, chemistry, sports science, and many others. Quantitative and qualitative assessment methods are described for implementing and validating solutions back in the real world, where the problems originated. The ability to generate, gather, and store volumes of data on the order of tera- and exabytes daily has far outpaced our ability to derive useful information from it with the computational resources available in many domains. This book focuses on data science and problem definition; data cleansing; feature selection and extraction; statistical, geometric, information-theoretic, biomolecular, and machine learning methods for dimensionality reduction of big datasets and problem solving; and comparative assessment of solutions in real-world settings. The book targets professionals working in related fields who hold an undergraduate degree in any science area, particularly a quantitative one. Readers should be able to follow the examples that introduce each method or technique. These motivating examples are followed by precise definitions of the required technical concepts and a presentation of the results in general situations, at a level of abstraction that can be followed by re-interpreting the concepts as in the original example(s). Finally, each section closes with solutions to the original problem(s) afforded by these techniques, often in several ways, so that their advantages and disadvantages can be compared with those of other solutions.
Bibliographic references
Includes bibliographical references.
Source of description
Description based on print version record.
Contents
  • Intro
  • Preface
  • Contents
  • Acronyms
  • 1 What Is Data Science (DS)?
  • 1.1 Major Families of Data Science Problems
  • 1.1.1 Classification Problems
  • 1.1.2 Prediction Problems
  • 1.1.3 Clustering Problems
  • 1.2 Data, Big Data, and Pre-processing
  • 1.2.1 What Is Data?
  • 1.2.2 Big Data
  • 1.2.3 Data Cleansing
  • 1.2.3.1 Duplication
  • 1.2.3.2 Fixing/Removing Errors
  • 1.2.3.3 Missing Data
  • 1.2.3.4 Outliers
  • 1.2.3.5 Multicollinearity
  • 1.2.4 Data Visualization
  • 1.2.5 Data Understanding
  • 1.3 Populations and Data Sampling
  • 1.3.1 Sampling
  • 1.3.2 Training, Testing, and Validation
  • 1.4 Overview and Scope
  • 1.4.1 Prerequisites and Layout
  • 1.4.2 Data Science Methodology
  • 1.4.3 Scope of the Book
  • Reference
  • 2 Solutions to Data Science Problems
  • 2.1 Conventional Statistical Solutions
  • 2.1.1 Linear Multiple Regression Model: Continuous Response
  • 2.1.1.1 Akaike Information Criterion (AIC)
  • 2.1.1.2 Bayesian Information Criterion (BIC)
  • 2.1.1.3 Adjusted R-Squared
  • 2.1.2 Logistic Regression: Categorical Response
  • 2.1.3 Variable Selection and Model Building
  • 2.1.4 Generalized Linear Model (GLM)
  • 2.1.5 Decision Trees
  • 2.1.6 Bayesian Learning
  • 2.2 Machine Learning Solutions: Supervised
  • 2.2.1 k-Nearest Neighbors (kNN)
  • 2.2.2 Ensemble Methods
  • 2.2.3 Support Vector Machines (SVMs)
  • 2.2.4 Neural Networks (NNs)
  • 2.3 Machine Learning Solutions: Unsupervised
  • 2.3.1 Hard Clustering
  • 2.3.2 Soft Clustering
  • 2.4 Controls, Evaluation, and Assessment
  • 2.4.1 Evaluation Methods
  • 2.4.2 Metrics for Assessment
  • References
  • 3 What Is Dimensionality Reduction (DR)?
  • 3.1 Dimensionality Reduction
  • 3.2 Major Approaches to Dimensionality Reduction
  • 3.2.1 Conventional Statistical Approaches
  • 3.2.2 Geometric Approaches
  • 3.2.3 Information-Theoretic Approaches
  • 3.2.4 Molecular Computing Approaches
  • 3.3 The Blessings of Dimensionality
  • 4 Conventional Statistical Approaches
  • 4.1 Principal Component Analysis (PCA)
  • 4.1.1 Obtaining the Principal Components
  • 4.1.2 Singular Value Decomposition (SVD)
  • 4.2 Nonlinear PCA
  • 4.2.1 Kernel PCA
  • 4.2.2 Independent Component Analysis (ICA)
  • 4.3 Nonnegative Matrix Factorization (NMF)
  • 4.3.1 Approximate Solutions
  • 4.3.2 Clustering and Other Applications
  • 4.4 Discriminant Analysis
  • 4.4.1 Linear Discriminant Analysis (LDA)
  • 4.4.2 Quadratic Discriminant Analysis (QDA)
  • 4.5 Sliced Inverse Regression (SIR)
  • 5 Geometric Approaches
  • 5.1 Introduction to Manifolds
  • 5.2 Manifold Learning Methods
  • 5.2.1 Multi-Dimensional Scaling (MDS)
  • 5.2.1.1 Classical MDS: Spectral Approach
  • 5.2.1.2 Metric MDS: Optimization-Based Approach
  • 5.2.2 Isometric Mapping (ISOMAP)
  • 5.2.3 t-Stochastic Neighbor Embedding (t-SNE)
  • 5.3 Exploiting Randomness (RND)
  • 6 Information-Theoretic Approaches
  • 6.1 Shannon Entropy (H)
  • 6.2 Reduction by Conditional Entropy
  • 6.3 Reduction by Iterated Conditional Entropy
  • 6.4 Reduction by Conditional Entropy on Targets
  • 6.5 Other Variations
  • 7 Molecular Computing Approaches
  • 7.1 Encoding Abiotic Data into DNA
  • 7.2 Deep Structure of DNA Spaces
  • 7.2.1 Structural Properties of DNA Spaces
  • 7.2.2 Noncrosshybridizing (nxh) Bases
  • 7.3 Reduction by Genomic Signatures
  • 7.3.1 Background
  • 7.3.2 Genomic Signatures
  • 7.4 Reduction by Pmeric Signatures
  • 8 Statistical Learning Approaches
  • 8.1 Reduction by Multiple Regression
  • 8.2 Reduction by Ridge Regression
  • 8.3 Reduction by Lasso Regression
  • 8.4 Selection Versus Shrinkage
  • 8.5 Further Refinements
  • 9 Machine Learning Approaches
  • 9.1 Autoassociative Feature Encoders
  • 9.1.1 Undercomplete Autoencoders
  • 9.1.2 Sparse Autoencoders
  • 9.1.3 Variational Autoencoders
  • 9.1.4 Dimensionality Reduction in MNIST Images
  • 9.2 Neural Feature Selection
  • 9.2.1 Facial Features, Expressions, and Displays
  • 9.2.2 The Cohn-Kanade Dataset
  • 9.2.3 Primary and Derived Features
  • 9.3 Other Methods
  • 10 Metaheuristics of DR Methods
  • 10.1 Exploiting Feature Grouping
  • 10.2 Exploiting Domain Knowledge
  • 10.2.1 What Is Domain Knowledge?
  • 10.2.2 Domain Knowledge for Dimensionality Reduction
  • 10.3 Heuristic Rules for Feature Selection, Extraction, and Number
  • 10.4 About Explainability of Solutions
  • 10.4.1 What Is Explainability?
  • 10.4.1.1 Outcome Explanations
  • 10.4.1.2 Model Explanations
  • 10.4.2 Explainability in Dimensionality Reduction
  • 10.5 Choosing Wisely
  • 10.6 About the Curse of Dimensionality
  • 10.7 About the No-Free-Lunch Theorem (NFL)
  • 11 Appendices
  • 11.1 Statistics and Probability Background
  • 11.1.1 Commonly Used Discrete Distributions
  • 11.1.2 Commonly Used Continuous Distributions
  • 11.1.3 Major Results in Probability and Statistics
  • 11.2 Linear Algebra Background
  • 11.2.1 Fields, Vector Spaces and Subspaces
  • 11.2.2 Linear Independence, Bases and Dimension
  • 11.2.3 Linear Transformations and Matrices
  • 11.2.4 Eigenvalues and Spectral Decomposition
  • 11.3 Computer Science Background
  • 11.3.1 Computational Science and Complexity
  • 11.3.2 Machine Learning
  • 11.4 Typical Data Science Problems
  • 11.5 A Sample of Common and Big Datasets
  • 11.6 Computing Platforms
  • 11.6.1 The Environment R
  • 11.6.2 Python Environments
  • References
ISBN
3-031-05371-0