Spark Machine Learning¶
Linear Regression Review¶
- Let \(X\) be a set of observed data, also known as independent or explanatory variables.
- Let \(Y\) be a set of dependent values that we may be able to predict.
Linear regression attempts to predict \(Y\) by assuming it has a linear relationship to the explanatory variable(s).
- Least Squares Method -> Measures the squared vertical distance between each observed \(Y\) value and the line of prediction. The "line of best fit" is the one that minimizes the sum of these squared distances.
  - Squaring the distances ensures all values are positive.
  - Summing all of the squared distances yields the Sum of Squares.
- Multiple Linear Regression is when there is one dependent variable and several independent variables: $$Y = X \beta + \epsilon$$
Here, \(X\) is the matrix of features and \(\beta\) holds the parameters to be estimated for the line (one slope per feature, plus the intercept). \(\epsilon\) is an error term that captures the variation in \(Y\) that the linear model does not explain.
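The least squares idea above can be sketched for the single-feature case in plain Python. This is a minimal illustration, not Spark code: the helper names `fit_line` and `sum_of_squares` and the sample data are assumptions made up for the example.

```python
def fit_line(xs, ys):
    """Return the (slope, intercept) that minimize the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Closed-form least squares solution for one explanatory variable
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def sum_of_squares(xs, ys, slope, intercept):
    """Sum of squared vertical distances between observed ys and the line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

slope, intercept = fit_line(xs, ys)
ss_best = sum_of_squares(xs, ys, slope, intercept)
ss_other = sum_of_squares(xs, ys, 1.5, 2.0)  # an arbitrary competing line

print(slope, intercept)
print(ss_best, ss_other)
```

Any other line, such as the arbitrary one used for `ss_other`, yields a larger Sum of Squares than the fitted line.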
Advantages & Disadvantages of Linear Regression¶
Advantages of Linear Regression:
- The method can be fairly easy to intuit and implement.
- The method is not very computationally expensive.
- Has strong predictive power when the relationship between X and Y is approximately linear.
Disadvantages of Linear Regression:
- Performs poorly when the relationship between X and Y is not approximately linear.
- Can be very sensitive to outliers.
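The sensitivity to outliers can be seen in a small plain-Python sketch. The data and the helper `fit_line` are hypothetical, chosen only to show how a single corrupted point drags the fitted line.

```python
def fit_line(xs, ys):
    """Return the least squares (slope, intercept) for one explanatory variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
clean_ys = [2.0, 4.0, 6.0, 8.0, 10.0]    # exactly y = 2x
outlier_ys = [2.0, 4.0, 6.0, 8.0, 40.0]  # same data, last point corrupted

clean_slope, clean_intercept = fit_line(xs, clean_ys)
outlier_slope, outlier_intercept = fit_line(xs, outlier_ys)

print("clean fit:  ", clean_slope, clean_intercept)
print("outlier fit:", outlier_slope, outlier_intercept)
```

Because the distances are squared, the one corrupted point dominates the Sum of Squares: in this toy data the slope moves from 2 to 8 even though four of the five points are unchanged.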
VectorAssembler¶
- This is a Spark ML feature transformer that combines a given list of columns into a single vector column. It can be used to combine raw features and features generated by different feature transformers into a single feature vector for ML model training.
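To make the effect of the transformer concrete, here is a plain-Python sketch of what assembling columns into one vector does to each row. It does not call Spark: the function `assemble`, the column names, and the sample rows are all assumptions for illustration. In PySpark itself, the transformer is configured with the `inputCols` and `outputCol` parameters.

```python
def assemble(rows, input_cols, output_col="features"):
    """Return copies of rows with input_cols merged into one flat numeric list."""
    assembled = []
    for row in rows:
        vector = []
        for col in input_cols:
            value = row[col]
            if isinstance(value, (list, tuple)):
                # Column already holds a vector (e.g. produced by another
                # feature transformer): append its entries in order
                vector.extend(float(v) for v in value)
            else:
                # Raw numeric column: append the single value
                vector.append(float(value))
        new_row = dict(row)  # leave the original row untouched
        new_row[output_col] = vector
        assembled.append(new_row)
    return assembled


# Hypothetical rows: two raw features plus a pre-encoded categorical feature
rows = [
    {"age": 30, "income": 52000.0, "city_onehot": [0.0, 1.0], "label": 1.0},
    {"age": 45, "income": 61000.0, "city_onehot": [1.0, 0.0], "label": 0.0},
]

result = assemble(rows, ["age", "income", "city_onehot"])
for row in result:
    print(row["features"], row["label"])
```

The original columns are kept alongside the new vector column, so the label remains available for model training.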
Multiple Linear Regression¶
When your data has a mix of feature types (e.g., numerical and categorical), store the categorical and numerical variables in separate feature lists.
Example:
- Then, split the columns up. Here, list comprehensions with indexing are used to group the column names by data type.
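A minimal sketch of splitting columns by data type with list comprehensions and indexing follows. It is not the original example: the schema, column names, and `label_col` are hypothetical. The `dtypes` list mimics the shape that PySpark's `DataFrame.dtypes` returns, a list of `(column name, type string)` pairs, so the code runs without a Spark session.

```python
# Hypothetical schema in the same shape as PySpark's DataFrame.dtypes
dtypes = [
    ("age", "int"),
    ("income", "double"),
    ("city", "string"),
    ("segment", "string"),
    ("label", "double"),
]

label_col = "label"  # assumed name of the dependent variable

# col[0] is the column name, col[1] is its type string
categorical_cols = [col[0] for col in dtypes if col[1] == "string"]
numerical_cols = [
    col[0] for col in dtypes
    if col[1] != "string" and col[0] != label_col
]

print(categorical_cols)
print(numerical_cols)
```

Keeping the label out of both lists avoids feeding the dependent variable back in as a feature when the lists are later passed to feature transformers.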