Spark Machine Learning

Linear Regression Review

  • Let \(X\) be the set of observed data, also known as the independent or explanatory variables.
  • Let \(Y\) be the set of dependent values that we want to predict.

Linear regression attempts to predict \(Y\) by assuming it has a linear relationship to the explanatory variable(s).

\[Y = mX + b\]
  • Least Squares Method -> Measures the squared vertical distance (the residual) between each observed \(Y\) value and the line of prediction. The "line of best fit" is the one that minimizes the sum of these squared distances.

    • Squaring the distances ensures all values are non-negative, so positive and negative residuals cannot cancel out.
    • Summing all of the squared distances yields the Sum of Squares.
  • Multiple Linear Regression is when there is one dependent variable with many independent variables.

\[Y = X \beta + \epsilon\]

Here, \(X\) is the matrix of features and \(\beta\) holds the parameters to be estimated (one slope per feature, plus the y-intercept). \(\epsilon\) is an error term that accounts for the variation in \(Y\) that the linear model does not explain.
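As a rough illustration of the least-squares idea for the single-variable case \(Y = mX + b\), the sketch below computes the slope and intercept from the standard closed-form formulas and then evaluates the Sum of Squares. The function names `fit_line` and `sum_of_squares` and the toy data are assumptions made for this example; this is plain Python, not how Spark fits the model internally.

```python
def fit_line(xs, ys):
    """Return (m, b) minimizing the sum of squared residuals for y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums (both unnormalized).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def sum_of_squares(xs, ys, m, b):
    """Sum of squared vertical distances between each y and the line."""
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))


# Toy data (invented) lying exactly on y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

m, b = fit_line(xs, ys)
best = sum_of_squares(xs, ys, m, b)
# Any other line, e.g. y = 1.5x + 2, gives a larger Sum of Squares.
other = sum_of_squares(xs, ys, 1.5, 2.0)
print(m, b, best, other)
```

Comparing `best` against `other` shows what "line of best fit" means in practice: no other choice of slope and intercept yields a smaller Sum of Squares on the same data.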

Advantages & Disadvantages of Linear Regression

Advantages of Linear Regression:

  • The method can be fairly easy to intuit and implement.
  • The method is not very computationally expensive.
  • Has a lot of predictive power if the relationship between X and Y is approximately linear.

Disadvantages of Linear Regression:

  • If the relationship between X and Y is not approximately linear, Linear Regression performs poorly.
  • Can be very sensitive to outliers, because squaring the distances gives large errors a disproportionate weight.
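The outlier sensitivity can be seen on a small example. The sketch below fits a least-squares slope twice, once on clean data and once after replacing a single point with an extreme value. The helper name `slope` and all data values are invented for illustration.

```python
def slope(xs, ys):
    """Least-squares slope for y = m*x + b (closed-form formula)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


xs = [1.0, 2.0, 3.0, 4.0, 5.0]
clean_ys = [3.0, 5.0, 7.0, 9.0, 11.0]     # exactly y = 2x + 1
outlier_ys = [3.0, 5.0, 7.0, 9.0, 50.0]   # last point replaced by an outlier

m_clean = slope(xs, clean_ys)
m_outlier = slope(xs, outlier_ys)
print("slope without outlier:", m_clean)
print("slope with one outlier:", m_outlier)
```

A single corrupted point out of five is enough to pull the fitted slope far away from 2, which is why inspecting or filtering outliers before fitting is usually worthwhile.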

VectorAssembler

  • This is a Spark ML feature transformer that combines a given list of columns into a single vector column. It can be used to combine raw features and features generated by different feature transformers into a single feature vector for ML model training.

Multiple Linear Regression

When your data has a mix of feature types (e.g., numerical and categorical), store the names of the categorical and numerical columns in separate lists so that each type can be processed appropriately.

Example:

# Find the data type of each column:
data.dtypes
[('Miscellaneous_Expenses', 'double'),
 ('Food_Innovation_Spend', 'double'),
 ('Advertising', 'double'),
 ('City', 'string'),
 ('Profit', 'double')]

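  • Then, split the column names by type. List comprehensions filter on each column's type string, and the trailing [:-1] slice drops the last numerical column (Profit, presumably the value to predict) so that it is not treated as a feature.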
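# Columns whose Spark type is 'string' are treated as categorical.
categorical_cols = [item[0] for item in data.dtypes if item[1].startswith('string')]
print(categorical_cols)

# Numerical columns are 'int' or 'double'; [:-1] drops the last one ('Profit').
numerical_cols = [item[0] for item in data.dtypes if item[1].startswith(('int', 'double'))][:-1]
print(numerical_cols)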
['City']
['Miscellaneous_Expenses', 'Food_Innovation_Spend', 'Advertising']
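The positional `[:-1]` slice only works while the column to exclude happens to be the last numerical column. One possible alternative, sketched below, excludes it by name instead. The helper `split_columns` and its `label_col` parameter are hypothetical, and the `dtypes` list is copied from the output shown earlier so that the snippet runs without a Spark session; with a real DataFrame you would pass `data.dtypes`.

```python
# (name, type) pairs as returned by DataFrame.dtypes in the example above.
dtypes = [
    ('Miscellaneous_Expenses', 'double'),
    ('Food_Innovation_Spend', 'double'),
    ('Advertising', 'double'),
    ('City', 'string'),
    ('Profit', 'double'),
]


def split_columns(dtypes, label_col):
    """Hypothetical helper: split column names by type, skipping the label column."""
    categorical = [
        name for name, dtype in dtypes
        if dtype.startswith('string') and name != label_col
    ]
    numerical = [
        name for name, dtype in dtypes
        if dtype.startswith(('int', 'double')) and name != label_col
    ]
    return categorical, numerical


categorical_cols, numerical_cols = split_columns(dtypes, label_col='Profit')
print(categorical_cols)
print(numerical_cols)
```

Filtering by name keeps the result stable if the columns are later reordered or new numerical columns are appended after `Profit`.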