Spark Machine Learning

Linear Regression Review

Let $X$ be a set of observed data, also known as the independent or explanatory variables.
Let $Y$ be a set of dependent values that we may be able to predict.

Linear regression attempts to predict $Y$ by assuming it has a linear relationship to the explanatory variable(s).

\[Y = mX + b\]

Least Squares Method -> Measures the squared vertical distance between each observed $Y$ value and the line of prediction. The "line of best fit" is the one that minimizes the sum of these squared distances.
- Squaring the distances ensures all values are positive.
- Summing all of the squared distances yields the Sum of Squares.
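
The least squares idea can be sketched in a few lines of plain Python. This is a minimal illustration, not part of the original notebook: the data points and the helper names `fit_line` and `sum_of_squares` are hypothetical, and the slope and intercept come from the standard closed-form solution for a single explanatory variable.

```python
# Hypothetical data, chosen so the best-fit line is easy to verify: y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

def fit_line(xs, ys):
    """Return the slope m and intercept b that minimize the Sum of Squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Closed-form least squares solution for one explanatory variable.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    m = sxy / sxx
    b = y_mean - m * x_mean
    return m, b

def sum_of_squares(xs, ys, m, b):
    """Sum of squared vertical distances between each y and the line m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))

m, b = fit_line(xs, ys)
best = sum_of_squares(xs, ys, m, b)
other = sum_of_squares(xs, ys, 2.5, 0.0)  # an arbitrary competing line

print(m, b)         # slope and intercept of the fitted line
print(best, other)  # the fitted line has the smaller Sum of Squares
```

Any other choice of slope and intercept, such as the competing line above, gives a Sum of Squares at least as large as the fitted one.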

Multiple Linear Regression is when there is one dependent variable with many independent variables.

$$Y = X \beta + \epsilon$$

Here, $X$ is the matrix of features and $\beta$ holds all of the parameters to be estimated (one slope per feature, plus the y-intercept). $\epsilon$ is an error term that captures the variation in $Y$ that the linear relationship does not explain.
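
As a sketch of how least squares carries over to the matrix form, assume (this is not stated above) that $X$ includes a column of ones for the intercept and has full column rank, so that $X^\top X$ is invertible. The Sum of Squares as a function of $\beta$ is

$$S(\beta) = (Y - X\beta)^\top (Y - X\beta)$$

Setting its gradient with respect to $\beta$ to zero gives the estimate $\hat{\beta}$:

$$-2 X^\top (Y - X\hat{\beta}) = 0 \quad\Longrightarrow\quad X^\top X \hat{\beta} = X^\top Y \quad\Longrightarrow\quad \hat{\beta} = (X^\top X)^{-1} X^\top Y$$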

Advantages & Disadvantages of Linear Regression

Advantages of Linear Regression:

- The method is fairly easy to intuit and implement.
- The method is not very computationally expensive.
- It has a lot of predictive power if the relationship between X and Y is well described by a line.

Disadvantages of Linear Regression:

- If the relationship between X and Y is not well described by a line, Linear Regression performs poorly.
- It can be very sensitive to outliers.

VectorAssembler

This is a Spark ML feature transformer that combines a given list of columns into a single vector column. It can be used to combine raw features and features generated by different feature transformers into a single feature vector for ML model training.

Multiple Linear Regression

When your data has a mix of feature types (e.g., numerical and categorical), collect the categorical and numerical column names in separate lists so that each group can be handled on its own.

Example:

# Find the data type of each column:
data.dtypes

[('Miscellaneous_Expenses', 'double'),
 ('Food_Innovation_Spend', 'double'),
 ('Advertising', 'double'),
 ('City', 'string'),
 ('Profit', 'double')]

Then, split the columns up by type. Each list comprehension iterates over the (name, type) pairs in data.dtypes and keeps the name (item[0]) when the type (item[1]) matches.

categorical_cols = [item[0] for item in data.dtypes if item[1].startswith('string')]
print(categorical_cols)

# The trailing [:-1] drops the last matching column, 'Profit', so it is not treated as a feature.
numerical_cols = [item[0] for item in data.dtypes if item[1].startswith('int') or item[1].startswith('double')][:-1]
print(numerical_cols)

['City']
['Miscellaneous_Expenses', 'Food_Innovation_Spend', 'Advertising']
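
The `[:-1]` slice works only because 'Profit' happens to be the last numerical column. A slightly more explicit variant is sketched below in plain Python, using the (name, type) pairs printed by data.dtypes above as a literal list. The variable `label_col` is hypothetical, and treating 'Profit' as the prediction target is an assumption based on it being the column the slice removes.

```python
# The (column, type) pairs shown in the data.dtypes output, copied as a plain list.
dtypes = [
    ('Miscellaneous_Expenses', 'double'),
    ('Food_Innovation_Spend', 'double'),
    ('Advertising', 'double'),
    ('City', 'string'),
    ('Profit', 'double'),
]

label_col = 'Profit'  # assumption: the column to predict, excluded from the features

categorical_cols = [name for name, dtype in dtypes if dtype.startswith('string')]

# str.startswith accepts a tuple of prefixes; the label is excluded by name
# rather than by its position in the list.
numerical_cols = [
    name for name, dtype in dtypes
    if dtype.startswith(('int', 'double')) and name != label_col
]

print(categorical_cols)
print(numerical_cols)
```

Excluding the label by name keeps the result the same if the columns are reordered or new numerical columns are appended after 'Profit'.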