Thinking About Data

Table of Contents

TODO Overheads

DONE Install Emacs Application Framework

This is going to be necessary to deal with not just equations but links, tables and other quirks

Install it from here

The reason for this is that generating latex preview fragments is just far too slow to be useful in any meaningful fashion.

So having done this, it works really really well, refer to The installation Notes here.

I wouldn’t mind also having some sort of way to do mathjax/katex preview of equations, maybe something like xwidget-katex, although the instant preview is pretty quick TBH so I might leave well enough alone.

TODO Install a live preview for equations in org-mode

Here is one example but there was a better one I was using

TODO Experiment with using Bookdown to Merge all RMD Files

Unit Information   ATTACH

Deriving the Normal Distribution

Power Series   Series

A function \(f\) :

\[ f\left( z \right)= \sum^{\infty}_{n= 0} \left[ c_n\left( z- a \right)^n \right], \quad z \in \mathbb{C} \]

is a power series about \(a\) and will do exactly one of the following:

  • converge only for \(z= a\),
  • converge for all \(z\), or
  • converge within the disc \(\left| z- a \right| < R\) for some radius of convergence \(R > 0\) (and diverge outside it).

Example

\(f\left( x \right) = \sum_{n=0}^{\infty} \left[ n! \cdot x^n \right]\)

Because the terms of the power series contain a factorial, the only practical test is the ratio test, so we use that to evaluate convergence. 1

let \(a_n= n!\cdot x^n\):

\[\begin{aligned} \lim_{n \rightarrow \infty}\left| \frac{a_{n+ 1}}{a_n} \right| &= \lim_{n \rightarrow \infty}\left| \frac{\left( n+ 1 \right)! \cdot x^{n} \cdot x}{n! \cdot x^n} \right| \\ &= \lim_{n \rightarrow \infty} \left( n+ 1 \right) \cdot \left| x \right| \\ &= \begin{cases} 0 & x = 0 \\ \infty & x \neq 0 \end{cases} \end{aligned}\]

\(\therefore\) The power series converges if and only if \(x= 0\).

Representing a function as a Power Series

Ordinary functions can be represented as power series, this can be useful to deal with integrals that don’t have an elementary anti-derivative.

  • Geometric Series

    First take the Series:

    \[\begin{aligned} S_n &= \sum_{k=0}^{n} r^k \\ &= 1 + r + r^2 + r^3 + \ldots + r^{n-1} + r^n \\ \implies r \cdot S_n &= r + r^2 + r^3 + r^4 + \ldots + r^n + r^{n+ 1} \\ \implies S_n - r \cdot S_n &= 1 - r^{n+ 1} \\ \implies S_n &= \frac{1- r^{n+ 1}}{1- r} \end{aligned}\]

    So now consider the geometric series, for \(\left| x \right| < 1\):

    \[\begin{aligned} \sum^{\infty}_{k= 0} \left[ x^k \right] &= \lim_{n \rightarrow \infty}\left[ \sum^{n}_{k= 0} x^k \right]\\ &= \lim_{n \rightarrow \infty}\left[ \frac{1- x^{n+ 1}}{1- x} \right]\\ &= \frac{1- \lim_{n \rightarrow \infty}\left[ x^{n+ 1} \right]}{1 - x} \\ &= \frac{1- 0}{1- x}\\ &= \frac{1}{1- x} \end{aligned}\]

  • Using The Geometric Series to Create a Power Series

    Take for example the function:

    \[\begin{aligned} g\left( x \right)&= \frac{1}{1 + x^2} \end{aligned}\]

    This could be represented as a power series by observing that:

    \[\begin{aligned} \frac{1}{1- u} = \sum_{n=0}^{\infty} \left[ u^n \right] \end{aligned}\]

    And then simply substituting \(u = \left( - x^2 \right)\), which is valid for \(\left| x \right| < 1\):

    \[\begin{aligned} \frac{1}{1- \left( - x^2 \right) } = \sum_{n=0}^{\infty} \left[ \left( - x^2 \right) ^n \right] = \sum_{n=0}^{\infty} \left[ \left( -1 \right)^n x^{2n} \right] \end{aligned}\]
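A quick numeric sketch of these two series (the value of \(x\) is an arbitrary choice inside the radius of convergence \(\left| x \right| < 1\)):

      x <- 0.3
      n <- 0:50

      ## Geometric series vs its closed form
      sum(x^n)
      1 / (1 - x)

      ## Power series for 1 / (1 + x^2)
      sum((-x^2)^n)
      1 / (1 + x^2)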

Calculus Rules and Series

The laws of differentiation allow the following relationships:

  • Differentiation

    \[\frac{\mathrm{d} }{\mathrm{d} z}\left( \sum_{n=1}^{\infty} c_n\left( z- a \right) ^n \right) = \sum_{n=1}^{\infty} \left[ \frac{\mathrm{d} }{\mathrm{d} z}\left( c_n\left( z- a \right) ^n \right) \right] \]

  • Integration

    \[\int \left( \sum_{n=1}^{\infty} c_n \left( z- a \right) ^n\right) \mathrm{d}z = \sum_{n=1}^{\infty} \left[ \int c_n \left( z- a \right) ^n \mathrm{d}z \right] \]

Taylor Series

This is the important one: the idea is that you can use it to represent any sufficiently smooth function as an infinite series.

Consider the pattern formed by taking derivatives of \(f\left( z \right)= \sum_{n=0}^{\infty} c_n \left( z- a \right)^n\):

\[\begin{aligned} f\left( z \right)&= c_0 + c_1\left( z- a \right)+ c_2\left( z- a \right)^2 + c_3\left( z- a \right)^3 + \ldots\\ & \implies f\left( a \right)= c_0\\ f'\left( z \right)&= c_1 + 2c_2\left( z- a \right) + 3c_3\left( z- a \right)^2 + 4c_4\left( z- a \right)^3 + \ldots \\ & \implies f'\left( a \right)= c_1\\ f''\left( z \right)&= 2c_2+ 3\times 2\times c_3\left( z- a \right)+ 4 \times 3 c_4\left( z- a \right)^2 + \ldots \\ & \implies f''\left( a \right)= 2 \cdot c_2 \\ f'''\left( z \right)&= 3 \times 2 \times 1 \cdot c_3 + 4 \times 3 \times 2 c_4 \left( z - a \right) + \ldots \\ & \implies f'''\left( a \right)= 3! \cdot c_3 \end{aligned}\]

Following this pattern forward:

\[\begin{aligned} f^{\left( n \right)}\left( a \right)&= n!\cdot c_n \\ \implies c_n &= \frac{f^{\left( n \right)}\left( a \right)}{n!} \end{aligned}\]

Hence, if there exists a power series to represent the function \(f\), then it must be:

\[\begin{aligned} f\left( z \right)= \sum^{\infty}_{n= 0} \left[ \frac{f^{\left( n \right)}\left( a \right)}{n!}\left( z- a \right)^n \right] \end{aligned}\]

If the power series is centred at \(a = 0\), it is called a Maclaurin series.

  • Power Series Expansion of \(e^z\)

    \[\begin{aligned} f\left( z \right)= e^z&= \sum^{\infty}_{n= 0} \left[ \frac{f^{\left( n \right)}\left( 0 \right)}{n!}\cdot z^n \right]\\ &= \sum^{\infty}_{n= 0} \left[ \frac{e^0}{n!}z^n \right] \\ &= \sum^{\infty}_{n= 0} \left[ \frac{z^n}{n!} \right] \end{aligned}\]
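    As a quick sketch (the choice of \(z\) and the number of terms here are arbitrary), the partial sums of this series converge to R’s exp():

      z <- 1.5
      n <- 0:20
      sum(z^n / factorial(n))
      exp(z)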

Modelling Normal Distribution

The Normal Distribution is a probability density function that is essentially modelled after observation.2

what is the $y$-axis in a Density curve?   ggplot2 ATTACH

Consider a histogram of some continuous normally distributed data:

      ## Arrange six panels in a 3 x 2 grid
      layout(matrix(1:6, 3, 2, byrow = TRUE))

      ## Simulate normally distributed data
      x <- rnorm(10000, mean = 0, sd = 1)
      sd(x)

      ## Histograms with progressively more bins, each with the
      ## theoretical density curve overlaid
      for (b in seq(from = 5, to = 30, by = 5)) {
        hist(x, breaks = b, freq = FALSE, col = "lightblue")
        curve(dnorm(x, 0, 1), add = TRUE, lwd = 3, col = "royalblue")
      }

HistogramDensityExample.png

Or, in ggplot2, as shown in the listing below and in figure 2:

   library(tidyverse)
   library(gridExtra)
   x <- rnorm(10000)
   x <- tibble::enframe(x)
   head(x)
   PlotList <- list()
   for (i in seq(from = 5, to = 30, by = 5)) {
     PlotList[[i/5]] <- ggplot(data = x, mapping = aes(x = value)) +
       geom_histogram(aes(y = ..density..), col = "royalblue", fill = "lightblue", bins = i) +
       stat_function(fun = dnorm, args = list(mean = 0, sd = 1))+
       theme_classic()
   }

   # arrangeGrob(grobs = PlotList, layout_matrix = matrix(1:6, nrow = 3))
   grid.arrange(grobs = PlotList, layout_matrix = matrix(1:6, nrow = 3))

HistogramDensityExample_ggplot2.png

Figure 2: Histograms Generated in ggplot2

Observe that the outline of the frequencies can be made arbitrarily close to a smooth curve by making the bin-width sufficiently small. This curve, known as the probability density function, represents the relative frequency of observations around each value; more precisely, the area beneath the curve over an interval of the $x$-axis is the probability of observing a value within that interval.

Strictly speaking, the height of the curve is the rate of change of the cumulative probability at that point.

Defining the Normal Distribution

Data are said to be normally distributed if the plot of the frequency density curve is such that:

  • The rate of change is proportional to:
    • The distance of the score from the mean
      • \(\frac{\mathrm{d}f }{\mathrm{d} x} \propto - \left( x- \mu \right)\)
    • The frequencies themselves.
      • \(\frac{\mathrm{d}f }{\mathrm{d} x} \propto f\)

If the rate of change were only proportional to the distance from the mean (i.e. \(\frac{\mathrm{d}f}{\mathrm{d}x}\propto-(x-\mu)\)), the model would be a parabola that dips below zero, as shown in the section on modelling only the distance from the mean below, so it is necessary to add the restriction that the rate of change is also proportional to the frequency itself (i.e. \(\frac{\mathrm{d}f}{\mathrm{d}x} \propto f\)).

Let \(f\) be the frequency of observation around \(x\); following these rules the plot comes to look something like the bell curve below:

Bell Curve

BellCurve.png

Modelling only distance from the mean

If we presumed the frequency (which we will call \(f\) on the $y$-axis) was proportional only to the distance from the mean the model would be a parabola:

\[\begin{aligned} \frac{\mathrm{d}f }{\mathrm{d} x} &\propto - \left( x- \mu \right)\\ \frac{\mathrm{d}f }{\mathrm{d} x}&= - k\left( x- \mu \right), \quad \exists k \in \mathbb{R}\\ \int \frac{\mathrm{d}f }{\mathrm{d} x} \mathrm{d}x &= - k \int \left( x- \mu \right) \mathrm{d}x \end{aligned}\]

Using integration by substitution:

\[\begin{aligned} \text{let:} \quad v&= x- \mu\\ \implies \frac{\mathrm{d}v }{\mathrm{d} x}&= 1\\ \implies \mathrm{d}v &= \mathrm{d}x \end{aligned}\]

and hence

\[\begin{aligned} \int \frac{\mathrm{d}f }{\mathrm{d} x} \mathrm{d}x &= - k \int \left( x- \mu \right) \mathrm{d}x \\ \implies \int \mathrm{d}f &= - k \int v \mathrm{d}v \\ f&= - \frac{k}{2}v^2 + C \\ f&= - \frac{k}{2}\left( x- \mu \right)^2 + C \end{aligned}\]

Clearly the problem with this model is that it allows for probabilities less than zero, hence the model needs to be refined to:

  • incorporate a slower rate of change for smaller values of \(f\) (approaching 0)
  • incorporate a faster rate of change for larger values of \(f\)
    • offset by the condition that \(\frac{\mathrm{d}f }{\mathrm{d} x}\propto -\left( x- \mu \right)\)

Incorporating Proportional to Frequency

In order to make the curve bevel out for smaller values of \(f\) it is sufficient to implement the condition that \(\frac{\mathrm{d}f }{\mathrm{d} x} \propto f\):

\[\begin{aligned} \frac{\mathrm{d}f }{\mathrm{d} x} &\propto f\\ \implies \frac{\mathrm{d}f }{\mathrm{d} x} &= k \cdot f \\ \int \frac{1}{f}\cdot \frac{\mathrm{d}f }{\mathrm{d} x} \mathrm{d}x&= k\cdot \int \mathrm{d}x \\ \ln{ \left| f \right| }&= k\cdot x + C_0\\ f&= C \cdot e^{k x} \\ f & \propto e^{k x} \end{aligned}\]

Putting both Conditions together

So in order to model the bell-curve we need:

\[\begin{aligned} \frac{\mathrm{d}f }{\mathrm{d} x} \propto f \ \wedge\ \frac{\mathrm{d}f }{\mathrm{d} x} &\propto - \left( x- \mu \right)\\ \implies \frac{\mathrm{d}f }{\mathrm{d} x} &\propto - f\left( x - \mu \right)\\ \int \frac{1}{f} \mathrm{d}f &= - k \cdot \int \left( x- \mu \right) \mathrm{d}x \\ \ln{ \left| f \right| }&= - k \int \left( x- \mu \right) \mathrm{d}x \end{aligned}\]

Because \(f>0\) by definition, the absolute value operators may be dispensed with:

\[\begin{aligned} \ln{ \left( f \right) }&= - k\cdot \frac{1}{2}\left( x- \mu \right)^2 + C_0 \\ f & \propto e^{- \frac{k\left( x - \mu \right)^2}{2}} \end{aligned}\]

Now that the general form has been found, it is necessary to apply the initial conditions (ICs) in order to pin down the constants.

  • IC, Probability Adds to 1

    The area bound by the curve must be 1 because it represents probability, hence:

    \[\begin{aligned} 1&= \int_{-\infty}^{\infty} f \mathrm{d}x \\ 1&= C \int_{-\infty}^{\infty} e^{- \frac{k}{2} \left( x- \mu \right)^2} \mathrm{d}x \\ \end{aligned}\]

    Using integration by substitution:

    \[\begin{aligned} \text{let:} \quad u^2&= \frac{k}{2}\left( x - \mu \right)^2\\ u&= \sqrt{\frac{k}{2}} \left( x- \mu \right) \\ \frac{\mathrm{d}u }{\mathrm{d} x}&= \sqrt{\frac{k}{2}} \end{aligned}\]

    hence:

    \[\begin{aligned} 1&= C \int_{-\infty}^{\infty} e^{- \frac{k}{2} \left( x- \mu \right)^2} \mathrm{d}x\\ 1&= \sqrt{\frac{2}{k}} \cdot C \int_{-\infty}^{\infty} e^{- u^2} \mathrm{d}u \\ 1^2&= \left( \sqrt{\frac{2}{k}} \cdot C \int_{-\infty}^{\infty} e^{- u^2} \mathrm{d}u \right)^2\\ 1^2&= \left( \sqrt{\frac{2}{k}} \cdot C \int_{-\infty}^{\infty} e^{- u^2} \mathrm{d}u \right) \times \left( \sqrt{\frac{2}{k}} \cdot C \int_{-\infty}^{\infty} e^{- u^2} \mathrm{d}u \right) \end{aligned}\]

    Because this is a definite integral, \(u\) is merely a dummy variable, so we can relabel the two copies as \(x\) and \(y\) for clarity's sake.

    \[\begin{aligned} 1^2&= \left( \sqrt{\frac{2}{k}} \cdot C \int_{-\infty}^{\infty} e^{- x^2} \mathrm{d}x \right) \times \left( \sqrt{\frac{2}{k}} \cdot C \int_{-\infty}^{\infty} e^{- y^2} \mathrm{d}y \right) \end{aligned}\]

    Now presume that the definite integral is equal to some real constant \(\beta \in \mathbb{R}\):

    \[\begin{aligned} 1&= \frac{2}{k}\cdot C^2 \int_{-\infty}^{\infty} e^{- y^2} \mathrm{d}y \times \beta \\ &= \frac{2}{k}\cdot C^2 \int_{-\infty}^{\infty} \beta\cdot e^{- y^2} \mathrm{d}y \\ &= \frac{2}{k}\cdot C^2 \cdot \int_{-\infty}^{\infty} \left( \int_{-\infty}^{\infty} e^{- x^2} \mathrm{d}x \right)e^{- y^2} \mathrm{d}y\\ &= \frac{2}{k}\cdot C^2 \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{- \left( x^2+ y^2 \right)} \mathrm{d}x \mathrm{d}y \end{aligned}\]

    This integral will be easier to evaluate in polar co-ordinates, a double integral may be evaluated in polar co-ordinates using the following relationship: 3

    \[\begin{aligned} \iint_D f\left( x,y \right) dA = \int_{\alpha}^{\beta} \int_{h_1\left( \phi \right)}^{h_2\left( \phi \right)} f\left( r \cdot \cos{\left( \phi \right)} , r \cdot \sin \left( \phi \right)\right) \mathrm{d}r \mathrm{d}\phi \end{aligned}\]

    hence this simplifies to:

    \[\begin{aligned} 1&= \frac{2}{k} C ^2 \int_{0}^{2 \pi } \int_{0}^{\infty} r \cdot e^{-\left( \left( r \cdot \cos{\theta}\right)^2+ \left( r \cdot \sin{\theta} \right)^2 \right)} \mathrm{d}r \mathrm{d}\theta \\ 1&= \frac{2}{k} C ^2 \int_{0}^{2 \pi} \int_{0}^{\infty} r \cdot e^{-r^2} \mathrm{d}r \mathrm{d}\theta \end{aligned}\]

    Because the integrand is of the form \(f'\left( x \right)\times g\left( f\left( x \right) \right)\) we may use integration by substitution:

    \[\begin{aligned} \text{let:} \quad u&= r^2\\ \frac{\mathrm{d}u }{\mathrm{d} r} &= 2 r \\ \mathrm{d} r &= \frac{1}{2r} \mathrm{d} u \end{aligned}\]

    and hence:

    \[\begin{aligned} 1&= \frac{2}{k} C ^2 \int_{0}^{2 \pi} \int_{0}^{\infty} r \cdot e^{-r^2} \mathrm{d}r \mathrm{d}\theta \\ &= \frac{2}{k} C^2 \int_{0}^{2\pi} \int_{0}^{\infty} \frac{1}{2}e^{- u} \mathrm{d}u \mathrm{d}\theta \\ &= \frac{1}{k} C^2 \int_{0}^{2\pi} \int_{0}^{\infty} e^{- u} \mathrm{d}u \mathrm{d}\theta \\ &= \frac{1}{k} C^2 \int_{0}^{2\pi} \left[ - e^{- u} \right]_{0}^{\infty}\mathrm{d}\theta \\ 1&= \frac{1}{k}C^2 \cdot 2 \pi \\ \implies C^2&= \frac{k}{2\pi} \end{aligned}\]

    So from before:

    \[\begin{aligned} f&= C \cdot e^{- k\cdot \frac{\left( x- \mu \right)^2}{2}} \\ &= \sqrt{\frac{k}{2\pi}} \cdot e^{- k\cdot \frac{\left( x- \mu \right)^2}{2}} \end{aligned}\]

    So now we simply need to apply the next initial condition.

  • IC, Mean Value and Standard Deviation
    • Definitions

      The definition of the expected value, where \(f(x)\) is a probability function is: 4

      \[\begin{aligned} \mu= E(x) = \int^{b}_{a} x \cdot f\left( x \right) \mathrm{d}x \end{aligned}\]

      That is, roughly, the sum of each value weighted by its relative frequency of occurrence.

      The definition of the variance is:

      \[\begin{aligned} V\left( x \right)= \int_{a}^{b} \left( x- \mu \right)^2 f\left( x \right) \mathrm{d}x \end{aligned}\]

      which can be roughly interpreted as the sum of the proportion of squared distance units from the mean. The standard deviation is \(\sigma = \sqrt{V(x)}\).

    • Expected Value of the Normal Distribution

      The expected value of the normal distribution is \(\mu\); this can be shown as follows:

      \[\begin{aligned} \text{let:} \quad v&= x- \mu \\ \implies \mathrm{d} v&= \mathrm{d} x \end{aligned}\]

      Observe that the limits of integration will also remain as \(\pm \infty\) following the substitution:

      \[\begin{aligned} E\left( v \right)&= \int_{-\infty}^{\infty} v\times f\left( v \right) \mathrm{d}v \\ &= \sqrt{\frac{k}{2\pi}} \int_{-\infty}^{\infty} v\cdot e^{- \frac{k}{2} v^2} \mathrm{d}v \\ &= \sqrt{\frac{k}{2\pi}} \lim_{b \rightarrow \infty} \left[ - \frac{1}{k} e^{- \frac{k}{2} v^2} \right]^{b}_{-b} \\ &= \sqrt{\frac{k}{2\pi}} \lim_{b \rightarrow \infty} \left[ - \frac{1}{k} e^{- \frac{k}{2} b^2} + \frac{1}{k} e^{- \frac{k}{2} b^2} \right]\\ &= 0 \end{aligned}\]

      Hence the expected value of \(v = x- \mu\) is \(0\), and so \(E\left( x \right)= \mu\).

    • Variance of the Normal Distribution


      Now that the expected value has been confirmed, consider the variance of the distribution:

      \[\begin{aligned} \sigma^2&= \int_{-\infty}^{\infty} \left( x- \mu \right)^2\times f\left( x \right) \mathrm{d}x \\ &= \int_{-\infty}^{\infty} \left( x- \mu \right)^2\times \left( \sqrt{\frac{k}{2\pi}}e^{- \frac{k}{2}\left( x- \mu \right)^2} \right)\mathrm{d}x \\ &= \sqrt{\frac{k}{2\pi}} \int_{-\infty}^{\infty} \left( x- \mu \right)^2\times \left( e^{- \frac{k}{2}\left( x- \mu \right)^2} \right)\mathrm{d}x \end{aligned}\]

      Now observe that \(\left( x- \mu \right)\) appears both in the exponent and as a factor. If this is redefined as \(w= x- \mu \implies \mathrm{d} x= \mathrm{d} w\), we have:

      \[\begin{aligned} \sigma^2&= \sqrt{\frac{k}{2\pi}} \int_{-\infty}^{\infty} w^2e^{-\frac{k}{2}w^2} \mathrm{d}w \end{aligned}\]

      Now the integrand is of the form \(f\left( x \right)\times g\left( x \right)\) meaning that the only strategy to potentially deal with it is integration by parts:

      \[\begin{aligned} \int u \mathrm{d}v&= u\cdot v- \int v \mathrm{d}u \end{aligned}\] where:

      • \(u\) is a function that simplifies with differentiation
      • \(\mathrm{d} v\) is something that can be integrated
      \[\begin{matrix} u= w & \enspace & \mathrm{d}v = w \cdot e^{- \frac{k}{2}w^2}\mathrm{d} w \\ \implies \mathrm{d}u = \mathrm{d}w & \enspace & \implies v= \int w\cdot e^{- \frac{k}{2} w^2} \mathrm{d}w \\ & \enspace & \implies v= - \frac{1}{k} e^{- \frac{k}{2}w^2} \end{matrix}\]

      Hence the value of the variance may be solved:

      \[\begin{aligned} \sigma^2&= \sqrt{\frac{k}{2\pi}} \int_{-\infty}^{\infty} w^2e^{-\frac{k}{2}w^2} \mathrm{d}w \\ &= \sqrt{\frac{k}{2\pi}} \left[ u\cdot v - \int v \mathrm{d}u \right]^{\infty}_{-\infty}\\ &= \sqrt{\frac{k}{2\pi}} \left( \left[ \frac{-w}{k}\cdot e^{-\frac{k}{2}w^2} \right]^{\infty}_{-\infty} + \frac{1}{k} \int^{\infty}_{-\infty} e^{-\frac{k}{2}w^2} \mathrm{d}w \right) \\ &= \sqrt{\frac{k}{2\pi}} \left[ \frac{-w}{k}\cdot e^{-\frac{k}{2}w^2} \right]^{\infty}_{-\infty} + \frac{1}{k} \left( \sqrt{\frac{k}{2\pi}} \int^{\infty}_{-\infty} e^{-\frac{k}{2}w^2} \mathrm{d}w \right) \\ \end{aligned}\]

      The left term evaluates to zero and the right term is the area beneath the bell curve with mean value 0 and so evaluates to 1:

      \[\begin{aligned} \sigma^2&= 0 + \frac{1}{k}\\ \implies k&= \frac{1}{\sigma^2} \end{aligned}\]

      So the function for the density curve can be simplified:

      \[\begin{aligned} f&= \sqrt{\frac{k}{2\pi}} \cdot e^{- k\cdot \frac{\left( x- \mu \right)^2}{2}}\\ &= \sqrt{\frac{1}{2\pi \sigma^2}} \cdot e^{- \frac{1}{2}\cdot \frac{\left( x- \mu \right)^2}{\sigma^2}} \end{aligned}\]

      Now let \(z= \frac{ x- \mu }{\sigma} \implies \mathrm{d}z= \frac{\mathrm{d} x}{\sigma}\). In terms of \(z\), the probability element simplifies to:

      \[\begin{aligned} f\left( x \right) \mathrm{d}x= \sqrt{\frac{1}{2\pi}}\cdot e^{- \frac{1}{2}z^2} \mathrm{d}z \end{aligned}\]

      Now using the power series expansion of \(e^z\) from the Taylor series section above:

      \[\begin{aligned} e^{- \frac{1}{2}z^2}= \sum^{\infty}_{n= 0} \left[\frac{ \left( - \frac{1}{2}z^2 \right)^n}{n!} \right] \end{aligned}\]

      We can express the integral of \(f\left( x \right)\) (which has no elementary antiderivative) as a series.

      \[\begin{aligned} f\left( x \right) \mathrm{d}x&= \sqrt{\frac{1}{2\pi}}\cdot \sum^{\infty}_{n= 0} \left[\frac{ \left( - \frac{1}{2}z^2 \right)^n}{n!} \right] \mathrm{d}z \\ \implies \int f\left( x \right) \mathrm{d}x &= \frac{1}{\sqrt{2\pi} }\int \sum^{\infty}_{n= 0} \left[ \frac{\left( - \frac{1}{2}z^2 \right)^n}{n!} \right] \mathrm{d}z \\ &= \frac{1}{\sqrt{2\pi} }\cdot \sum^{\infty}_{n= 0} \left[ \int \frac{\left( - 1 \right)^{n}z^{2n}}{2^n\cdot n!} \mathrm{d}z \right] \\ &= \frac{1}{\sqrt{2\pi} }\cdot \sum^{\infty}_{n= 0} \left[ \frac{\left( - 1 \right)^n \cdot z^{2n+ 1}}{2^n\left( 2n+ 1 \right)n!} \right] \end{aligned}\]

      Although this is only a power series, it still gives a practical method for computing the area beneath the density curve of the normal distribution.
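      As a sketch, this series can be checked numerically against R’s pnorm(); the area from 0 to \(z\) under the standard normal curve is \(\Phi(z) - 0.5\) (the value of \(z\) and the number of terms here are arbitrary choices):

        z <- 1
        n <- 0:20
        sum((-1)^n * z^(2 * n + 1) / (2^n * (2 * n + 1) * factorial(n))) / sqrt(2 * pi)
        pnorm(z) - 0.5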

Understanding the p-value

Let’s say that I’m given 100 vials of medication and in reality only 10 of them are actually effective.

POS POS POS POS POS POS POS POS POS POS
:-: :-: :-: :-: :-: :-: :-: :-: :-: :-:
NEG NEG NEG NEG NEG NEG NEG NEG NEG NEG
NEG NEG NEG NEG NEG NEG NEG NEG NEG NEG
NEG NEG NEG NEG NEG NEG NEG NEG NEG NEG
NEG NEG NEG NEG NEG NEG NEG NEG NEG NEG
NEG NEG NEG NEG NEG NEG NEG NEG NEG NEG
NEG NEG NEG NEG NEG NEG NEG NEG NEG NEG
NEG NEG NEG NEG NEG NEG NEG NEG NEG NEG
NEG NEG NEG NEG NEG NEG NEG NEG NEG NEG
NEG NEG NEG NEG NEG NEG NEG NEG NEG NEG

We don’t know which ones are effective so It is necessary for the effective medications to be detected by experiment. Let:

  • the significance level \(\alpha\) be 9% for detecting a significant effect
  • the statistical power be 70%

So this means that the corresponding errors are:

  1. Of the 90 ineffective (negative) drugs, \(\alpha \times 90 \approx 8\) will be identified as positive (False Positives); the remaining 82 will be correctly identified as negative (True Negatives).
  2. Of the 10 effective (positive) drugs, \(\beta \times 10 = 3\) will be labelled as negative (False Negatives); the remaining 7 will be correctly identified as positive (True Positives).

These results can be summarised as:

                        Really Negative                              Really Positive
  Predicted Negative    TN; \(\left( 1-\alpha \right) \times 90 = 82\)    FN; \(\beta \times 10 = 3\)
  Predicted Positive    FP; \(\alpha \times 90 \approx 8\)                TP; \(\left( 1-\beta \right) \times 10 = 7\)

And a table visualising the results:

TP TP TP TP TP TP TP FN FN FN
:-: :-: :-: :-: :-: :-: :-: :-: :-: :-:
FP FP FP FP FP FP FP FP TN TN
TN TN TN TN TN TN TN TN TN TN
TN TN TN TN TN TN TN TN TN TN
TN TN TN TN TN TN TN TN TN TN
TN TN TN TN TN TN TN TN TN TN
TN TN TN TN TN TN TN TN TN TN
TN TN TN TN TN TN TN TN TN TN
TN TN TN TN TN TN TN TN TN TN
TN TN TN TN TN TN TN TN TN TN

So looking at this table, it should be clear that:

  • If the null hypothesis had been true for every vial, the proportion of False Positives would indeed have been \(\frac{8}{90}\approx 0.09\)
  • The probability of incorrectly rejecting the null hypothesis for a vial flagged as positive, however, is the number of False Positives out of everything identified as positive: \(\frac{8}{8+7} \approx 0.53\)

False Positive Rate

The False Positive Rate is expected to be \(\alpha\); it is:

\[\begin{aligned} \mathrm{E}\left( \mathsf{FPR} \right) &= \alpha \\ \mathsf{FPR}&=\frac{FP}{N}\\ &= \frac{FP}{FP+TN}\\ &= \frac{8}{8+82}\\ &\approx 9\% \end{aligned}\]

False Discovery Rate

The False Discovery Rate is the proportion of observations classified as positive (or significant) that are in fact False Positives. If you took all the results you considered positive and pulled one out at random, the probability that it was a false positive (and that you were committing a type I error) would be the FDR, which can be much higher than the FPR.
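As a quick sketch using the counts from the vial example above, the two rates can be computed directly:

    ## Counts from the example: 10 effective and 90 ineffective vials
    TP <- 7; FN <- 3; FP <- 8; TN <- 82

    (FPR <- FP / (FP + TN))  # ~0.09, matches alpha
    (FDR <- FP / (FP + TP))  # ~0.53, the chance a detected "positive" is false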

Measuring Probability

In setting \(\alpha\) as 9% I’ve said that ‘if the null hypothesis were true and every vial were negative, 9% of them would be false positives’. In practice this means 9% of the truly negative vials will be detected as false positives (the positives don’t count here, because the \(\alpha\) assumption was made under the assumption that everything was negative).

So this measures the probability of rejecting the null hypothesis if it were true.

It does not measure the probability of rejecting the null hypothesis and then being mistaken. To reject the null hypothesis it is necessary to consider the observations classified as positive (whether or not they actually are); the proportion of those that are False Positives represents the probability of committing a type I error in that experiment.

So the \(p\) -value measures the probability of committing a type I error under the assumption that the null hypothesis is true.

The FDR represents the actual probability of committing a type I error when making multiple comparisons.

Comparing \(\alpha\) and the p-value

The distinction between \(\alpha\) and the \(p\) -value is essentially that the \(\alpha\) value is set as a significance standard, whereas the \(p\) -value represents the probability of getting a test statistic at least as extreme as the observed value.

The \(\alpha\) value is the probability of

Rejecting the null hypothesis under the assumption that the null hypothesis is true.

This will be the False Positive Rate:

The proportion of Negative Observations misclassified as Positive will be the False Positive Rate.

Be careful though, because this is not necessarily the probability of incorrectly rejecting the null hypothesis; there is also the \(\mathsf{FDR}=\frac{\mathsf{FP}}{\mathsf{TP+FP}}\):

The proportion of observations classified as positive that are false positives; this estimates the probability of rejecting the null hypothesis and being wrong (whereas the \(\alpha\) value is the probability of rejecting the null hypothesis under the assumption that it was true, which is different from the probability of rejecting \(H_0\) and being wrong, i.e. the FDR).

The \(p\) -value is the corresponding probability of the test statistic that was actually observed, so the two mean essentially the same thing, but the \(\alpha\) value is set beforehand and the $p$-value is computed after the fact:

The $p$-value is the probability, under the assumption that there is no true effect or no true difference, of collecting data that shows a difference equal to or more extreme than what was actually observed.

Wikipedia Links

Helpful Wikipedia Links

Calculating Power

Statistical Power is the probability of rejecting the null hypothesis assuming that the null hypothesis is false (True Positive).

Power is distinct from the probability that a rejection of the null hypothesis is correct; the latter is the probability of selecting a True Positive from all observations determined to be positive (the Positive Predictive Value, or Precision):

\[\begin{aligned} \mathsf{PPV}&=\frac{TP}{TP+FP}\\ \mathsf{FDR}&=\frac{FP}{TP+FP}\\ \alpha = \mathsf{FPR} &= \frac{FP}{FP+TN}\\ \mathsf{Power} = 1-\beta &= \frac{TP}{TP+FN} \end{aligned}\]

20200223192349275_37285077.png

Example

Problem

An ISP stated that users average 10 hours a week of internet usage; it is already known that the standard deviation of this population is 5.2 hours. A sample of \(n=100\) was taken to verify this claim, giving an average of \(\bar{x}=11\) hours.

A worldwide census determined that the average is in fact 12 hours a week not 10.

Solution

  • Hypotheses
    1. \(H_0 \quad :\) The Null Hypothesis that the average internet usage is 10 hours per week
    2. \(H_a \quad :\) The Alternative Hypothesis that the average internet usage exceeds 10 hours a week
  • Data
    Value                        Description
    \(n=100\)                    The sample size
    \(\sigma= 5.2\)              The standard deviation of internet usage in the population
    \(\mu=10\)                   The alleged average internet usage
    \(\bar{x}=11\)               The average of the sample
    \(\mu_{\text{true}}=12\)     The actual average of the population
    \(\alpha=0.05\)              The probability of a type I error at which the null hypothesis is rejected
    \(\beta=?\)                  The probability of a type II error

Step 1; Find the Critical Sample Mean (\(\bar{x}_\mathsf{crit}\))

The Central Limit Theorem provides that the mean value of a sample that is:

  • sufficiently large, or
  • drawn from a normally distributed population

will be normally distributed, so if we took various samples of a population and recorded all the sample means in a set \(\overline{X}\) we would have:

\[\overline{X} \sim \mathcal{N}\left( \mu, \left( \frac{\sigma}{\sqrt{n} } \right)^2 \right)\]

And hence we may conclude that:

\[\begin{aligned} Z&= \frac{\overline{x} - \mu}{\left( \frac{\sigma}{\sqrt{n} } \right) } \\ \implies \overline{x}_{crit}&= \mu + z_\alpha \cdot \left( \frac{\sigma}{\sqrt{n} } \right) \\ \overline{x}_{crit} &= \mu + z_{0.05} \cdot \left( \frac{\sigma}{\sqrt{n} } \right) \\ \overline{x}_{crit}&= \mu + 1.645 \cdot \left( \frac{5.2}{\sqrt{100} } \right) \\ &= 10.8554 \end{aligned}\]

Thus \(H_0\) is rejected for a sample mean exceeding 10.86 hours per week at a significance level of \(\alpha = 0.05\).

Step 2: Find the Difference between the Critical and True Means as a Z-Value (prob of Type II)

The probability of accepting the null hypothesis assuming that it is false, is the probability of getting a value less than the critical value given that the mean value is actually 12:

\[\begin{aligned} z&= \frac{\overline{x}_{crit}- \mu_{true}}{\left( \frac{\sigma}{\sqrt{n} } \right) } \\ &= \frac{10.86 - 12}{\frac{5.2}{10}} \\ &\approx -2.2 \end{aligned}\]

Step 3: State the value of \(\beta\)

\[\begin{aligned} \beta &= \mathrm{P}\left( \textsf{Type II Error} \right) \\ &= \mathrm{P}\left( H_0 \textsf{ is not rejected} \mid H_0 \textsf{ is false} \right) \\ &= \mathrm{P} \left( \overline{X} < \overline{x}_{\textsf{crit}} \mid \mu = 12 \right) \\ &\approx 0.014 \end{aligned}\]

Step 4: State the Power Value

\[\begin{aligned} \mathsf{Power} &= \mathrm{P}\left( H_0\textsf{ is rejected} \mid H_0 \textsf{ is false} \right) \\ &= \mathrm{P}\left( \overline{X} > \overline{x}_\textsf{crit} \mid \mu = 12 \right) \\ &= 1 - \beta \\ &= 1 - 0.014 \\ &= 98.6\% \end{aligned}\]
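The same calculation can be sketched in R, using the values from the problem above:

    mu_0    <- 10          # alleged mean under the null hypothesis
    mu_true <- 12          # the actual population mean
    sigma   <- 5.2
    n       <- 100
    se      <- sigma / sqrt(n)

    (xbar_crit <- mu_0 + qnorm(0.95) * se)        # ~10.86
    (beta      <- pnorm(xbar_crit, mu_true, se))  # ~0.014
    (power     <- 1 - beta)                       # ~0.986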

Confidence Intervals

Khan Academy

According to Khan Academy:

This means that for any sample drawn from the population, the true population value would be found within this interval for 0.95 of those samples

The Confidence Interval is not the probability   ATTACH

adapted from:

PDF Version

I assume that the motivation for this question is that most statistics books emphasize the fact that, once you have taken a sample and constructed the confidence interval (CI), there is no longer any “randomness” left in a CI statement (except for the Bayesian point of view which thinks of \(\mu\) as being a random variable).

That is, when reporting a CI: “I am 95% confident that the mean is between 25.1 and 32.6” is correct. “There is a 95% probability that the mean is between 25.1 and 32.6” is WRONG. Either μ is in that interval or not; there is no probability associated with it.

The Reasoning

Suppose that somewhere on the wall is an “invisible bullseye” — a special point (call it “ μ ”) which only I can see. I’m going to throw a dart at μ . Based on long observation, you know that when I throw a dart at something, 95% of the time, my dart will hit within 6 inches of what I was aiming at. (The other 5% of the time, I miss by more than 6 inches.) When you see where that dart lands, you will draw a circle around it with a radius of 6 inches.

It is correct to say:

The probability that μ will be in that circle is 95%.

The reason that is correct is, I have not yet thrown the dart, so the location of the circle is random, and in 95% of repetitions of this dart-throwing, circle-drawing routine, μ will be in the resulting circle. Now if I actually take aim, throw my dart, and it hits …

\[ \text{right here }\implies \cdot \]

It is no longer correct to talk about probabilities. You can be pretty sure that μ is within that circle. To be specific, “pretty sure” = “95% confident.” But you cannot say that the probability that μ is in that circle is 95%, because μ is not random. This throw might have been one of the 5% of throws that miss μ by more than 6 inches.

Let’s assume we want a 95% CI for μ from a normal population with a known standard deviation σ , so the margin of error is:

\[ M=1.96\frac\sigma{\sqrt n} \]

Then \(\overline{X}\) is the “dart” we are throwing at μ.

Before you take the sample and compute the mean, you have:

\[ P(\overline X-M<\mu<\overline X+M)=95\%\tag*{} \]

This is correct because \(\overline X\) is a random variable. However, once you compute the mean \(\overline x\) (lowercase x meaning it is now just a number, not a random quantity), the inequality:

\[ \overline x-M<\mu<\overline x+M \]

is either true or false; the “dart” has landed at \(\overline x\) , and we don’t know if this was one of the throws that is within M of μ.

Classical Confidence Interval

A classical confidence interval contains all values for which the data do not reject the null hypothesis that the parameter is equal to that value.

This does not necessarily tell you anything regarding the probability that the parameter is in the interval.

If you intend to take a sample and draw a 92% confidence interval, there is a 92% probability that the interval you are about to construct will contain the population mean. Once you have drawn that sample and created that interval, however, the probability of the invisible point μ being within that particular interval can’t really be known, because we just don’t know where it is relative to the dart (i.e. how well the sample reflects the population).
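This can be sketched by simulation (assuming, for illustration, a normal population with known \(\sigma\), as in the margin-of-error formula above):

    set.seed(1)
    mu <- 10; sigma <- 5.2; n <- 100
    M  <- 1.96 * sigma / sqrt(n)

    covered <- replicate(10^4, {
      xbar <- mean(rnorm(n, mean = mu, sd = sigma))
      (xbar - M < mu) & (mu < xbar + M)
    })
    mean(covered)  # ~0.95: about 95% of the intervals contain mu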

Weekly Material

(3); Comparison of Population Samples   wk3

Lecture

  • Boxplots

    The delimiting marks in box plots correspond to the median and the quartiles (the lower quartile is roughly the median of the data below the overall median, and the upper quartile the median of the data above it):

         library("ggplot2")
         ggplot(iris, aes(x = Sepal.Width, y = Sepal.Length, color = Species)) +
         geom_boxplot() +
         theme_bw()
    

    BoxPlotthinkingAboutIris.png

    Figure 5: Box Plots Generated in ggplot2

  • Lecture Announcements

    Everything is online now. We’ll be using Zoom a lot.

    • DONE Finish Quiz 1
    • TODO Finish Quiz 1

      30 minutes to finish it, Test your computer First.

    • Post pacman on the mailing list.
  • Naming Variables

    Attribute … Data Base

  • DONE Review Chi Distribution,
    • Is it in VNote?
    • Should I put it in org-mode?
  • DONE Fix YAML Headers in rmd to play ball with Notable
    • DONE Post this use-case to Reddit
    • DONE Fix YAMLTags and TagFilter and Post to Reddit   bash
      • TODO Post an easy way to use this to the mailing list

        Here is a start that I’ve made

      • Should I have all the lists and shit in /tmp
        Pros Cons
        Less Mess Harder to Directly watch what’s happening
        Easier to manage 5  
           

        should I use /tmp or /tmp/Notes or something?

      • DONE Is there an easy way to pass the md off to vnote

        Or should I just use the ln -s ~/Notes/DataSci /tmp/notes trick?

        Follow the instructions here, it has to be done manually and then symlinked SCHEDULED: <2020-03-21 Sat>


    • DONE Is there a way to fix the Text Size of Code in emacs when I zoom out?

      Yeah just disable M-x mixed-pitch-mode

  • Calculating the mean
     library(tidyverse)
     bwt <- c(3429, 3229, 3657, 3514, 3086, 3886)
     (bwt <- sort(bwt))
     mean(bwt)
     mean(c(3429, 3514))
     median(bwt)
     max(bwt)-min(bwt)
    
    
     [1] 3086 3229 3429 3514 3657 3886
    
     [1] 3466.833
    
     [1] 3471.5
    
     [1] 3471.5
    
     [1] 800
    

    The mean is nice in that it has good mathematical properties: it minimises the squared error and is smooth (differentiable), so models built around it (for example anything fit by gradient descent) behave well with respect to the derivative (see the short sketch after this list).

    The median, however, is more robust to large outliers, for example:

     library(tidyverse)
     x <- (c(rnorm(10), 9) * 10) %>% round(1)
     mean(x); median(x)
    
     ── Attaching packages ────────────────────────────────── tidyverse 1.3.0.9000 ──
     ✔ ggplot2 3.2.1     ✔ purrr   0.3.3
     ✔ tibble  2.1.3     ✔ dplyr   0.8.4
     ✔ tidyr   1.0.2     ✔ stringr 1.4.0
     ✔ readr   1.3.1     ✔ forcats 0.5.0
     ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
     ✖ dplyr::filter() masks stats::filter()
     ✖ dplyr::lag()    masks stats::lag()
    
     [1] 5.337263
     [1] 0.6294542
    
  • Calculating Range
     range(bwt)
     bwt %>% range %>% diff
    
    [1] 3086 3886
    
    [1] 800
    
  • Calculating Variance
       (var <- (bwt-mean(bwt))^2 %>%  mean)
       var(bwt)
       (sd <- (bwt-mean(bwt))^2 %>% mean %>% sqrt) # Not using n-1 !!
       (sd <- sqrt(sum((bwt-mean(bwt))^2)/(length(bwt) -1)))
       sd(bwt)
       mean(sum((bwt-mean(bwt))^2))
    
     [1] 69519.81
    
     [1] 83423.77
    
     [1] 263.6661
    
     [1] 288.8317
    
     [1] 288.8317
    
     [1] 417118.8
    
  • InterQuartile Data
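  • A quick sketch of why the mean suits squared error and the median absolute error (the loss functions and the optimize() search range here are illustrative choices, not from the lecture):

     bwt <- c(3429, 3229, 3657, 3514, 3086, 3886)
     sq_loss  <- function(m) sum((bwt - m)^2)
     abs_loss <- function(m) sum(abs(bwt - m))

     optimize(sq_loss,  range(bwt))$minimum  # ~ mean(bwt)
     optimize(abs_loss, range(bwt))$minimum  # ~ median(bwt)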

Tutorial

The tutorial work is located at ~/Notes/DataSci/ThinkingAboutData and linked here:

DONE (4); Using Student’s t-Distribution   wk4

The tutorial work is located at ~/Notes/DataSci/ThinkingAboutData and linked here:

DONE (5) Discrete Distributions (Mapping Disease)   wk5

Lecture

The Poisson Model is the Binomial Model stretched towards its limit.

  • Combinatorics

    The counting formulas, for selecting \(m\) items from \(n\) options, are:

    selection            ordered        unordered
    With Repetition      \(n^m\)        \(\binom{n+m-1}{m}\)
    Without Repetition   \(n_{(m)}\)    \(\binom{n}{m}\)

    Where:

    • \(\binom{n}{m} =\frac{n_{(m)}}{m!}=\frac{n!}{m!(n-m)!}\)
    • \(n_{(m)}=\frac{n!}{(n-m)!}\)
    • \(n! = n \times (n-1) \times (n-2) \times \ldots \times 2 \times 1\)
  • Binomial Distribution

    A Binomial experiment requires the following conditions:

    1. We have n independent events.
    2. Each event has the same probability p of success.
    3. We are interested in the number of successes from the \(n\) trials (referred to as size in R).
    4. The probability of k successes from the n trials is a Binomial distribution with

    probabilities:

    \[\begin{aligned} P\left( k \right)= \binom{n}{k} p^k \left( 1- p \right)^{n- k} \end{aligned}\]

    • Problem

      Hard drives have an annual failure rate of 10%; what is the probability of exactly 2 failures over 3 years?

      This means that:

      • The number of trials is 3 (\(n\) = size = 3, one trial per year)
      • The number of failures we are interested in is 2 (\(k\) = x = 2)

                dbinom(x = 2, size = 3, prob = 0.1)
        
                ## For Hard Drives
                    ## k/x....is the number of failures we ask about
                    ## n  ....size is the number of trials (years)
                    ## p  ....is the probability of failure in any one year
        
                ## choose(n,k)*p^k*(1-p)^(n-k)
                ## dbinom(x = 2, size = 4, prob = p)
        
        [1] 0.027
        

        So in this case there is only about a 2.7% chance.

  • Poisson

    An interesting property of the Poisson distribution is that the mean and the variance are equal.

    The expected value is the limit that the mean value would approach if the sample was made arbitrarily large, the value is denoted \(\lambda\)

    Poisson is French for fish so sometimes people call the distribution the fishribution.

  • Binomial and Poisson

    The Poisson distribution is derived from the Binomial distribution.

    If \(p\) is small then \((1-p)\) is close to 1 and \(np(1-p) \approx np\), so the mean and the variance are approximately equal, as in a Poisson distribution.

    This has to do with greatly increasing the number of trials, like say if we had an infinite number of trials with the probability of success in a given hour as 30%.

    For a binomial distribution there are a set number of trials, let’s say 8 trials with a 20% probability of Success:

    1 2 3 4 5 6 7 8
    F S F F S S F F

    In this case there are 3 successes, so let’s set \(k=3\); now suppose instead the same period was divided into smaller intervals (\(n \rightarrow \infty\)):

    1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
    X S X X S S F F X S X X S S F F

    If we kept dividing the period into more and more sub-intervals, each with a smaller probability of success, we would approach the Poisson distribution: it is the limit of the Binomial as the number of trials \(n \rightarrow \infty\) while the expected number of successes \(\lambda = np\) is held constant.

  • Confidence Intervals
    • Binomial

      Just use bootstrapping, but also you can just use an approximate standardisation:

      \[\begin{aligned} \sigma &= \sqrt{p\left( 1- p \right) } \\ \sigma_{\overline{x}}&= \frac{\sigma}{\sqrt{n} } \\ &= \frac{\sqrt{p\left( 1- p \right) }}{\sqrt{n} } \\ \implies \hat{p}-z_{0.975}\times \sigma_{\overline{x}} < &p < \hat{p}+z_{0.975}\times \sigma_{\overline{x}} \\ \implies \hat{p}-1.96\, \sigma_{\overline{x}} < &p < \hat{p}+1.96\, \sigma_{\overline{x}} \end{aligned}\]

      So an approximate 95% confidence interval can be constructed as above (see the sketch after the summary list below).

      Remember, though, that this approximates the binomial by the normal distribution, so it will only be good for large values of \(n\), because the binomial is discrete by nature.

      See the Correlation Notes, Khan Academy, and the Confidence Intervals section above.

      This means that for any sample drawn from the population, the true population value would be found within this interval for 0.95 of those samples

    • Poisson

      This can also be done with Poisson by bootstrapping or using the same trick of \(\lambda\) as the variance:

      \[\begin{aligned} \sigma &= \sqrt{\lambda} \\ \sigma_{\overline{x}}&= \frac{\sigma}{\sqrt{n} } \\ &= \frac{\sqrt{\lambda}}{\sqrt{n} } \\ \implies \hat{\lambda}-z_{0.975}\times \sigma_{\overline{x}} < &\lambda < \hat{\lambda}+z_{0.975}\times \sigma_{\overline{x}} \\ \hat{\lambda} - 1.96 \sqrt{\frac{\hat{\lambda}}{n}} < &\lambda < \hat{\lambda} + 1.96 \sqrt{\frac{\hat{\lambda}}{n}} \end{aligned}\]

  • Summary
    • Binomial for independent trials
      • mean: np
      • variance: np(1-p)
        • standard error roughly is \(\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\)
    • Poisson for number of events in a given period
      • mean λ
      • variance λ
      • Standard error roughly is \(\sqrt{\frac{\hat{\lambda}}{n}}\)
    • Choropleth maps are useful for visualising changes over area.
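As a sketch of the approximate confidence intervals above (the counts here are made up purely for illustration):

    n <- 200; successes <- 64
    p_hat <- successes / n
    p_hat + c(-1, 1) * 1.96 * sqrt(p_hat * (1 - p_hat) / n)             # Binomial CI

    counts <- rpois(50, lambda = 3)  # e.g. simulated weekly event counts
    lambda_hat <- mean(counts)
    lambda_hat + c(-1, 1) * 1.96 * sqrt(lambda_hat / length(counts))    # Poisson CI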

Tutorial

DONE (6) Paired t-test (Observation or Experiment)   wk6

TODO (7) Correlation (Do Taller People Earn More)   wk7

DONE Lecture

In the past we did categorical and categorical-continuous.

Now we’re doing purely continuous.

  • DONE How to Derive the Correlation Coefficient

    Refer to this paper. The correlation coefficient can be interpreted in one of two ways:

    • The covariance scaled relative to the \(x\) and \(y\) variance
      • \(\rho = \frac{S_{x, y}}{s_x \cdot s_y}\)
    • The rate of change of the line of best fit of the standardised data
      • This is equivalent to the rate of change of the line of best fit divided by \((\frac{s_y}{s_x})\):
        • \(\rho = b \cdot \frac{s_x}{s_y} \quad \iff \quad \hat{y_i}=bx_i + c\)

    This can be seen by performing linear regression in R:

      head(cars)
      cars_std <- as.data.frame(scale(cars))
      y <- cars$dist
      x <- cars$speed
    
      ## Correlation Coefficient
      cor(x = cars$speed, y = cars$dist)
    
      ## Covariance
      cov(x = cars$speed, y = cars$dist)/sd(cars$speed)/sd(cars$dist)
    
      ## Standardised Rate of Change
      lm(dist ~ speed, data = cars_std)$coefficients[2]
    
      ### Using Standardised Rate of change
      lm(dist ~ speed, data = cars)$coefficients[2] / (sd(y)/sd(x))
    
      speed dist
    1     4    2
    2     4   10
    3     7    4
    4     7   22
    5     8   16
    6     9   10
    
    [1] 0.8068949
    
    [1] 0.8068949
    
        speed 
    0.8068949
    
        speed 
    0.8068949
    
      library("ggplot2")
      cars_std <- as.data.frame(scale(cars))
    
      ggplot(cars_std, aes(x = speed, y = dist)) +
      geom_point() +
      geom_smooth(method = "lm") +
      theme_bw()
    

    correlation_lm.png

    This shows that the correlation coefficient is just the slope of the line of best fit through the standardised data; for the cars data that slope, and hence \(r\), is about 0.81.

  • TODO Prove the Correlation Coefficient and email Laurence
  • Bootstrapping

    The big assumption with bootstrapping is that the population can be treated as equivalent to an infinite repetition of the sample.

    So assume that a population that is an infinite repetition of the sample, then take a sample of that infinite population and you have a bootstrap.

    So we could either create a population from the sample with a size of \(\infty\), which might be difficult, or we could instead just resample the observations, with replacement, for each replication.

  • Confidence Intervals
      load("Notes/DataSci/ThinkingAboutData/TAD.rdata ")
      r = cor(crabsmolt$postsz, crabsmolt$presz)
      a <- crabsmolt
      N <- nrow(crabsmolt)
      pos <- sample(N, size = N, replace = TRUE)
      aboot <- a[pos,]
    
      cor(aboot$postsz, aboot$presz)
    
    
    #  replicate(10^4, {})
    
  • TODO Questions

    In this part would it simply be equivalent to take the mean of all observations?

TODO (8) Linear Regression (No Really do they Earn More?)   wk8

TODO Lecture

TODO Tutorial

  • Measuring \(p\) -value for a particular slope

    Say we had two variables \(x\) and \(y\) as shown below:

    x    <- 1:10
    y    <- round(1:10 * 3 + rnorm(10), 2)
    (exampleData <- data.frame(x, y))
    
        x     y
    1   1  2.66
    2   2  5.56
    3   3 10.19
    4   4 11.99
    5   5 15.85
    6   6 16.15
    7   7 19.78
    8   8 24.64
    9   9 26.98
    10 10 29.72

    Suppose that we wanted to know:

    The probability of incorrectly assuming that there was a non-zero slope for the linear regression between \(x\) and \(y\) when in fact the slope of the linear regression is zero and this observation was made by chance.

    then we could easily measure that by using summary and lm as shown below:

    summary(lm(y ~ x, exampleData))
    
    
    Call:
    lm(formula = y ~ x, data = exampleData)
    
    Residuals:
        Min      1Q  Median      3Q     Max
    -1.0919 -0.4014 -0.2601  0.2745  1.6065
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)
    (Intercept)  -1.5033     0.6205  -2.423   0.0417 *
    x             3.2917     0.1000  32.917 7.91e-10 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.9083 on 8 degrees of freedom
    Multiple R-squared:  0.9927,	Adjusted R-squared:  0.9918
    F-statistic:  1084 on 1 and 8 DF,  p-value: 7.912e-10
    

    This yields a very low \(p\)-value for a non-zero slope, at 7.91e-10, which is to be expected because we set the data up to have a slope of 3.

    Now let’s modify our question, instead of testing for a slope of 0, say we wanted to test for a slope of 3, then our question would be:

    What is the probability of incorrectly accepting that there is a slope with a value different from 3, when in fact the slope of the data is actually 3?

    There’s nothing built in that just works, so we need to perform a permutation test like this:

    ## Calculate the current slope (offset by 3)
    slope <- lm((y-3*x) ~ x, data = exampleData)[[1]][2]
    
    ## Simulate null hypothesis
    
    sim <- replicate(10^3, {
        ## Shuffle the data
        x_perm <- sample(exampleData$x)
        ## Recalculate the slope
        slope_sim <- lm((y-3*x) ~ x_perm, data = exampleData)[[1]][2]
        ## Is the slope more extreme
        abs(slope_sim) > abs(slope)
    })
    
    (pval <- mean(sim))
    
    0.967
    

    This gives quite a high p-value, indicating that an offset slope at least as extreme as the one observed would be common under the null hypothesis, so we might conclude something like this:

    There is no evidence to support rejecting the hypothesis that the slope is 3
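    An alternative sketch (not from the tutorial) is to work with the usual \(t\) statistic, shifting the null value to 3, or simply to check whether 3 falls inside the confidence interval for the slope:

    fit <- lm(y ~ x, data = exampleData)
    est <- coef(summary(fit))["x", "Estimate"]
    se  <- coef(summary(fit))["x", "Std. Error"]

    t_stat <- (est - 3) / se
    2 * pt(-abs(t_stat), df = fit$df.residual)  # two-sided p-value for H0: slope = 3

    confint(fit, "x", level = 0.95)             # does the interval contain 3?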

wk9 Break   wk9

TODO (9) ANOVA (Do redheads have a lower pain threshold?)   wk10

Lecture Slides

  • Tutorial Sheet

So this is about having multiple groups; it’s about determining whether the groups are different.

Look at the variance to see if groups are different.

We use the \(F\) statistic to compare the variability between groups with the variability within groups.

TODO Get the old ANOVA Notes and transcribe them

So we want to know if there is a difference in pain tolerance across hair colours.

  • Pooled Variance

    Pooled variance is not the same as the variance of all the data combined, because it is computed around each group’s own mean, so it will be smaller than the variance of all the data put together.

  • The sum of squares

    See the lecture slides or, better yet, This PDF Export

    • Between groups

      Add up the squared differences between each group mean and the overall mean, weighted by group size.

    • Within groups

      The difference between each observation and its own category (group) mean.

      Add up the squared differences within each group.

    • TODO Add an inkscape image

      So we take the ratio of SSB to SSW (each divided by its degrees of freedom) to measure how separate the groups are.

      Via bootstrapping (permutation) we can measure the probability of the groups appearing this distinct just by chance.

      For the Hair Colour Data we have:

               Dark Blonde   Dark Brunette   Light Blonde   Light Brunette
      ns       5             5               5              4
      means    51.2          37.4            59.2           42.5
      vars     86.2          69.3            72.7           29.67
      xbark <- c(51.2, 37.4, 59.2, 42.5)
      n <- c(5,5,5,4)
      xbar = 47.85
      N <- sum(n)
      k <- length(n)
      
      ## Between SS
      SSB <- sum(n*(xbark-xbar)^2)
      
      ## Within SS
      vark <- c(86.2, 69.3, 72.7, 29.67)
      SSW <- sum(vark*(n-1))
      
      ## F Stat
      
      (F <- ((SSB)/(k-1)) / ((SSW)/(N-k)))
      
      
      [1] 6.791345
      
    • The F Statistic

      Now the question is:

      Is the F statistic large enough to reject the Null Hypothesis?

      or rather:

      What’s the probability of incorrectly rejecting the null hypothesis, assuming that the null hypothesis is true?

      We can just do this in R:

      load("~/Notes/DataSci/ThinkingAboutData/TAD.rdata")
      oneway.test(pain ~ hair, data = hair2, var.equal = TRUE)
      names(hair2)
      

      Alternatively, a permutation approach (similar in spirit to what the coin library does):

      hair <- hair2
      names(hair2)
      
      ns = table (hair$hair)
      
      x = replicate (1000,{
          # permute the categories (to satisfy H0)
          hair.perm <- sample(hair$hair)
          fit0 = aov (pain ~ hair.perm, data=hair)
          # compute MSE and means on permuted categories
          MSE = summary(fit0)[[1]][2,3] ## Extract the residual MSE
          means = aggregate(pain ~ hair.perm, data=hair, mean)[,2]
          # compute t statistic for all pairs
          ## Plus and Minus do weird things
          Ts = outer(means, means, "-")/ sqrt(outer(1/ns,1/ns, "+"))
          Ts = Ts/ sqrt (MSE)
          # keep only the largest t statistic
          max ( abs (Ts))
      })
      
      
      c(1,2,3) %*% c(4,5,6) # inner (matrix) product
      c(1,2,3) %o% c(4,5,6) # outer product, as used for the pairwise t statistics above
      
    • R Prob Functions
      • p is the cumulative probability (the area up to a value)
      • d is the density (the height of the curve at a value)
      • q is the quantile (the inverse of p)
      • r generates random numbers
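      As a quick sketch tying these together (assuming the four-group hair-colour summary above is the full data, so \(k-1 = 3\) and \(N-k = 15\)), the p-value for the \(F\) statistic computed above comes from the cumulative \(F\) distribution:

        pf(6.791345, df1 = 3, df2 = 15, lower.tail = FALSE)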

TODO (10) What is Normal?   wk11

TODO (11) Normality as opposed to deviant etc.   wk12

TODO (12) When it all goes Wrong   wk13

TODO (13) Exam Prep   wk14

Central Limit Theorem

The central limit theorem provides the sampling distribution of \(\overline{X}\) even when we don’t know what the original population of \(X\) looks like:

  1. If the population is normal, the sample mean will be normally distributed: \(\overline{X} \sim \mathcal{N}\left( \mu, \left(\frac{\sigma}{\sqrt{n}} \right)^2 \right)\)

  2. As the sample size \(n\) increases, the distribution of sample means concentrates around the population mean \(\mu\)

     • i.e. the standard error of the mean \(\sigma_{\overline{x}}=\left(\frac{\sigma}{\sqrt{n}}\right)\) will become smaller

  3. If the sample size is large enough (\(n \geq 30\)) the sample means will be approximately normally distributed even if the original population is non-normal
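A quick simulation sketch of point 3 (the exponential population and n = 30 are arbitrary illustrative choices):

    set.seed(1)
    sample_means <- replicate(10^4, mean(rexp(30, rate = 1)))

    hist(sample_means, breaks = 40, freq = FALSE)
    curve(dnorm(x, mean = 1, sd = 1 / sqrt(30)), add = TRUE, lwd = 2, col = "royalblue")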

Assignment

  • RMD File
  • Project Specification

Information from wk10

He only wants a PDF.

How much should I explain the code?

Or should I explain the conclusions?

Make sure you’re following the marking criteria

Proper introduction and Conclusion.

Write it so that anybody can understand it, but a specialist could interpret it.

There’s a pretty good Marking Criteria

Footnotes:

5

By which I mean I’m not sure whether the 00tagmatch directory and 00taglist file will be made in the working directory of bash, or in ~/Notes

I’m also not sure how that will be influenced by looking for #tags in the ~/Notes/DataSci Directory

Author: Ryan Greenup

Created: 2020-05-12 Tue 22:15