Chapter 1 Introduction

1.1 An Overview of Statistical Learning

“Statistical learning refers to a vast set of tools for understanding data.”

  • Supervised: Using statistical models to predict or estimate outputs based on inputs.
  • Unsupervised: Finding relationships between variables and structure in the data when there are inputs but no output to predict.

1.2 Data sets

Example data used in the book

  • Wages
  • Stock Market Data
  • Gene Expression Data

1.2.1 Wages

Used for regression examples, where the output is continuous or quantitative, such as predicting wage from age and education.

## Rows: 3,000
## Columns: 11
## $ year       <int> 2006, 2004, 2003, 2003, 2005, 2008, 2009, 2008, 2006, 2004,…
## $ age        <int> 18, 24, 45, 43, 50, 54, 44, 30, 41, 52, 45, 34, 35, 39, 54,…
## $ maritl     <fct> 1. Never Married, 1. Never Married, 2. Married, 2. Married,…
## $ race       <fct> 1. White, 1. White, 1. White, 3. Asian, 1. White, 1. White,…
## $ education  <fct> 1. < HS Grad, 4. College Grad, 3. Some College, 4. College …
## $ region     <fct> 2. Middle Atlantic, 2. Middle Atlantic, 2. Middle Atlantic,…
## $ jobclass   <fct> 1. Industrial, 2. Information, 1. Industrial, 2. Informatio…
## $ health     <fct> 1. <=Good, 2. >=Very Good, 1. <=Good, 2. >=Very Good, 1. <=…
## $ health_ins <fct> 2. No, 2. No, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Yes, 1. Ye…
## $ logwage    <dbl> 4.318063, 4.255273, 4.875061, 5.041393, 4.318063, 4.845098,…
## $ wage       <dbl> 75.04315, 70.47602, 130.98218, 154.68529, 75.04315, 127.115…
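
As a minimal sketch of the regression problem described above, the following fits a linear model of wage on age and education. It assumes the data shown here come from the Wage data frame in the ISLR2 package; the choice of lm() and the example worker are illustrative assumptions, not the book's analysis.

```r
# Assumes the ISLR2 package (which provides the Wage data frame) is installed.
library(ISLR2)

# Linear regression of wage on age and education.
fit <- lm(wage ~ age + education, data = Wage)

# One slope for age, plus one offset per education level
# relative to the baseline level "1. < HS Grad".
coef(fit)

# Predicted wage for a hypothetical 40-year-old college graduate.
new_worker <- data.frame(
  age = 40,
  education = factor("4. College Grad", levels = levels(Wage$education))
)
predict(fit, newdata = new_worker)
```

The output here is a number (a predicted wage), which is what makes this a regression problem rather than a classification problem.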

1.2.2 Stock Market Data

Used for classification examples, where the output is categorical or qualitative, such as predicting whether a stock index will increase or decrease on a given day.

Each row holds the daily percentage change of the S&P 500 stock index (Today) together with the percentage changes for the 5 prior days (Lag1 through Lag5).

## Rows: 1,250
## Columns: 9
## $ Year      <dbl> 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, 2001, …
## $ Lag1      <dbl> 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.…
## $ Lag2      <dbl> -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0…
## $ Lag3      <dbl> -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, 0.213, 1…
## $ Lag4      <dbl> -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, 0.614, …
## $ Lag5      <dbl> 5.010, -1.055, -2.624, -0.192, 0.381, 0.959, 1.032, -0.623, …
## $ Volume    <dbl> 1.1913, 1.2965, 1.4112, 1.2760, 1.2057, 1.3491, 1.4450, 1.40…
## $ Today     <dbl> 0.959, 1.032, -0.623, 0.614, 0.213, 1.392, -0.403, 0.027, 1.…
## $ Direction <fct> Up, Up, Down, Up, Up, Up, Down, Up, Up, Up, Down, Down, Up, …
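
A classification sketch for these data, assuming they come from the Smarket data frame in the ISLR2 package. Logistic regression and the 0.5 threshold are illustrative choices; any classifier could stand in here.

```r
# Assumes the ISLR2 package (which provides the Smarket data frame) is installed.
library(ISLR2)

# Logistic regression: model the probability that Direction is "Up"
# (the second level of the factor) from the lagged returns and volume.
fit <- glm(
  Direction ~ Lag1 + Lag2 + Lag3 + Lag4 + Lag5 + Volume,
  data = Smarket,
  family = binomial
)

# Fitted probability of "Up" for each day in the data.
prob_up <- predict(fit, type = "response")

# Turn probabilities into class labels with a 0.5 threshold.
pred <- ifelse(prob_up > 0.5, "Up", "Down")

# Confusion matrix on the training data. Because the model is evaluated
# on the same data it was fitted to, this overstates real accuracy.
table(predicted = pred, actual = Smarket$Direction)
```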

1.2.3 Gene Expression Data

Used for clustering examples, where there is no output variable, such as identifying groups of related cancer cell lines based on their gene expression measurements.

## List of 2
##  $ data: num [1:64, 1:6830] 0.3 0.68 0.94 0.28 0.485 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:64] "V1" "V2" "V3" "V4" ...
##   .. ..$ : chr [1:6830] "1" "2" "3" "4" ...
##  $ labs: chr [1:64] "CNS" "CNS" "CNS" "RENAL" ...
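
A clustering sketch for these data, assuming they come from the NCI60 list in the ISLR2 package. Hierarchical clustering, complete linkage, and the cut into 4 groups are all illustrative assumptions rather than recommended settings.

```r
# Assumes the ISLR2 package (which provides the NCI60 list) is installed.
library(ISLR2)

# Rows are the 64 cell lines; columns are the 6,830 gene expression values.
# Standardize each gene to mean 0 and standard deviation 1.
expr <- scale(NCI60$data)

# Hierarchical clustering on Euclidean distances between cell lines.
# "complete" is the default linkage in hclust(); it is spelled out for clarity.
hc <- hclust(dist(expr), method = "complete")

# Cut the tree into 4 groups (an arbitrary number for illustration).
clusters <- cutree(hc, k = 4)

# Compare the groups with the cancer-type labels,
# which the clustering itself never used.
table(cluster = clusters, type = NCI60$labs)
```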

1.3 History

A brief timeline for the development of statistical learning

  • Early 1800s Linear Regression (Method of Least Squares)
  • 1936 Linear Discriminant Analysis developed to predict qualitative values
  • 1940s Logistic Regression developed to predict qualitative values
  • 1970s Generalized Linear Models including both logistic and linear regression
  • 1980s Classification and Regression Trees
  • 1986 Generalized Additive Models
  • Present day (2001) Machine Learning

1.5 Matrix Notation

Conventions used in the book

  • \(n\) number of observations in a sample

  • \(p\) number of variables

  • \(\textbf{X}\) an \(n \times p\) matrix

    • where \(x_{ij}\) represents the element in the \(i\)th row and the \(j\)th column.
    • \(x_i\) represents a single observation (row) as a vector with length \(p\). Note that vectors are written vertically by convention in math notation.
    • \(\textbf{x}_j\) represents a single variable (column) as a vector with length \(n\). Note that the bold face font is used to distinguish columns (\(\textbf{x}_3\)) from rows (\(x_3\)).
  • The \(^T\) superscript operator denotes the transpose of a matrix or vector, in which row and column indices are swapped. The transpose of the \(n \times p\) matrix \(\textbf{X}\) is a \(p \times n\) matrix, and the transpose of a column vector is a row vector.
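
The conventions above can be sketched in base R with a small made-up matrix. The values and the variable names x_2 and x_col are hypothetical and chosen only for illustration.

```r
# A made-up data matrix with n = 3 observations and p = 2 variables.
# matrix() fills column by column, so the first variable is 1, 2, 3
# and the second variable is 10, 20, 30.
X <- matrix(c(1, 2, 3,
              10, 20, 30), nrow = 3, ncol = 2)

n <- nrow(X)  # number of observations
p <- ncol(X)  # number of variables

X[2, 1]          # the element x_{21}: row 2, column 1
x_2   <- X[2, ]  # observation 2: a vector of length p
x_col <- X[, 1]  # variable 1: a vector of length n

dim(t(X))        # the transpose has p rows and n columns
```

Note that R drops the row or column orientation when a single row or column is extracted, so x_2 and x_col are both plain vectors; the bold and non-bold distinction exists only in the written notation.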

Examples

  • A matrix of elements \[ \textbf{X} = \left( \begin{matrix} x_{11} & x_{12} & \dots & x_{1p} \\ x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{np} \end{matrix} \right) \]

  • The vector for row \(i\) (a single observation), written vertically by convention \[ x_i = \left(\begin{matrix} x_{i1} \\ x_{i2} \\ \vdots \\ x_{ip} \end{matrix} \right) \]

  • A column vector \[ \textbf{x}_j = \left(\begin{matrix} x_{1j} \\ x_{2j} \\ \vdots \\ x_{nj} \end{matrix} \right) \]

  • A matrix represented as a collection of column vectors \[ \textbf{X} = \left( \textbf{x}_1, \textbf{x}_2, \dots, \textbf{x}_p \right) \]

  • A transposed matrix. Rows become columns and columns become rows \[ \textbf{X}^{T} = \left( \begin{matrix} x_{11} & x_{21} & \dots & x_{n1} \\ x_{12} & x_{22} & \dots & x_{n2} \\ \vdots & \vdots & \ddots & \vdots \\ x_{1p} & x_{2p} & \dots & x_{np} \end{matrix} \right) \]

  • The transpose of the vector for row \(i\). Because vectors are written vertically by default, the transpose is written horizontally. \[ x_{i}^{T} = \left( x_{i1}, x_{i2}, \dots, x_{ip} \right) \]

  • A matrix represented as a collection of row vectors \[ \textbf{X} = \left( \begin{matrix} x_{1}^T \\ x_{2}^T \\ \vdots \\ x_{n}^T \end{matrix} \right) \]
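
The two views of \(\textbf{X}\) in the examples above, as a collection of columns and as a stack of transposed observation vectors, can be checked in base R. The matrix values and the helper obs() are hypothetical and exist only for this sketch.

```r
# A made-up data matrix with n = 3 observations and p = 2 variables.
X <- matrix(c(1, 2, 3,
              10, 20, 30), nrow = 3, ncol = 2)

# Column view: place the p column vectors side by side.
from_cols <- cbind(X[, 1], X[, 2])

# Hypothetical helper: observation i as a p x 1 column vector,
# matching the convention that vectors are written vertically.
obs <- function(i) matrix(X[i, ], ncol = 1)

# Row view: transposing each observation gives a 1 x p row,
# and stacking those rows rebuilds X.
from_rows <- rbind(t(obs(1)), t(obs(2)), t(obs(3)))

all(from_cols == X)
all(from_rows == X)
```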