Bias Terms in Multiple Regression

Consider a matrix multiplication as shown below:

    Y = XW

Where:

  • n is the sample size (Y is n×q, X is n×p, W is p×q)
  • q is the number of columns in the output, i.e. the number of response variables (or dependent variables) being modelled
  • p is the number of columns in the input, i.e. the number of features (or input/independent variables)

In this representation, each of the rows of X and Y is an observation, and their columns represent input and response variables respectively.
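For concreteness, this row-major form can be sketched in NumPy (the shapes n, p and q follow the definitions above; the random data is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 5, 3, 2                  # sample size, features, responses

X = rng.normal(size=(n, p))        # each row of X is one observation
W = rng.normal(size=(p, q))        # weights mapping features to responses
Y = X @ W                          # each row of Y is one observation's responses

assert Y.shape == (n, q)
```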

However, I have elected to use a column-major representation, so let's instead represent this as:

    y = Wx

where x is a p-vector of inputs, y is a q-vector of responses and W is now q×p.

Now each observation is a column vector instead of a row vector.

This corresponds to a matrix form thusly:

    [ y_1 ]   [ w_11 ... w_1p ] [ x_1 ]
    [  :  ] = [  :    .    :  ] [  :  ]
    [ y_q ]   [ w_q1 ... w_qp ] [ x_p ]

Note the following points:

  1. This is a column-major representation: the vector x represents a single observation and subsequent observations would become additional columns of X.
    • A row-major representation can be achieved by transposing the matrices.
    • R, Julia, Octave, Wolfram and Fortran all use column-major representations
    • Python, C(++), Go and Rust use a row-major representation
    • It's important to get this right: languages store values in a particular pattern in memory, and cutting against the grain will be less performant.
  2. The use of SymPy/Octave/Julia notation, whereby W[1,:] represents a vector composed of the first row of W.
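Here's a minimal NumPy sketch of this column-major view (note that NumPy is 0-indexed, so Julia's W[1,:] becomes W[0, :]; the concrete numbers are illustrative):

```python
import numpy as np

p, q = 3, 2
W = np.arange(q * p).reshape(q, p)   # q×p weights: [[0, 1, 2], [3, 4, 5]]
x = np.ones(p)                       # a single observation as a column vector
y = W @ x                            # q responses for that observation

first_row = W[0, :]                  # Julia/Octave W[1,:] in 0-indexed NumPy
assert y.tolist() == [3.0, 12.0]     # row sums of W, since x is all 1s
```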

Let's now add a bias term (i.e. an intercept) so that the model reflects y = mx + b from simple linear regression:

    y = Wx + b

where b is a q-vector of intercepts, one per response variable.

The bias term could equally be expressed inside the matrix thusly:

    y = [W b] [x; 1]

i.e. (in Octave-style block notation) b is appended as an extra column of W, and a constant 1 is appended to the observation x.

That's why we often don't include an explicit bias term when performing multiple linear regression: it can be included as a component of the weights matrix.
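The equivalence is easy to check numerically. A sketch with made-up values, where b[:, None] turns b into a column so it can be appended to W:

```python
import numpy as np

W = np.array([[1., 0., 2.],
              [0., 1., 1.]])        # q×p weights (q=2, p=3)
b = np.array([5., -1.])             # one intercept per response
x = np.array([1., 2., 3.])          # a single observation

explicit = W @ x + b                # y = Wx + b

W_aug = np.hstack([W, b[:, None]])  # b becomes the last column of W
x_aug = np.append(x, 1.0)           # a constant 1 joins the observation
absorbed = W_aug @ x_aug

assert np.allclose(explicit, absorbed)
```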

More Observations

If there were n observations, each one a column of a p×n matrix X:

    Y = [W b] [X; ones(1, n)]

The row of 1s sits beneath X, and the bias b is the last column of the weights.
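In NumPy this batch form might look like the following sketch (random data; the W_aug and X_aug names are my own):

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, n = 3, 2, 4
W = rng.normal(size=(q, p))
b = rng.normal(size=q)
X = rng.normal(size=(p, n))            # each column is one observation

W_aug = np.hstack([W, b[:, None]])     # bias as the last column of the weights
X_aug = np.vstack([X, np.ones(n)])     # row of 1s beneath the input

Y = W_aug @ X_aug                      # q×n: one column of responses per observation
assert np.allclose(Y, W @ X + b[:, None])
```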

Row Major

If we transformed this to a row-major representation, the 1s would now be a column of the input and the bias a row of the weights:

    Y' = [X' ones(n, 1)] [W'; b']

where ' denotes the transpose.
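The transposed, row-major version can be sketched the same way (observations as rows; again the *_aug names are mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 4, 3, 2
X = rng.normal(size=(n, p))               # each row is one observation
W = rng.normal(size=(p, q))
b = rng.normal(size=q)

X_aug = np.hstack([X, np.ones((n, 1))])   # the 1s are now a column of the input
W_aug = np.vstack([W, b[None, :]])        # the bias is now a row of the weights

Y = X_aug @ W_aug
assert np.allclose(Y, X @ W + b)          # broadcasting adds b to every row
```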

A Note on Memory Layout

Column-major matrices are rooted in the notation and conventions of linear algebra: in many textbooks and papers, 1D vectors are treated as column vectors where convenient. This makes column-major order a natural and intuitive choice for mathematical languages where matrix operations are common. The reader may have already noted that many column-major languages are also 1-indexed, for a similar reason.

The C language was developed for systems programming; presumably row-major representation was chosen because it is consistent with how people usually lay out data, and there was no need to cater to mathematical programming.
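NumPy, unusually, supports both layouts, which makes the distinction easy to poke at (a small sketch; the two arrays differ only in their in-memory stride pattern):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)   # NumPy defaults to row-major ("C order")
F = np.asfortranarray(A)         # same values, column-major ("Fortran order")

assert A.flags['C_CONTIGUOUS']
assert F.flags['F_CONTIGUOUS']
assert (A == F).all()            # logical contents are identical
```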