March 2021 - Journal
March 31, 2021
Learned mostly about decision trees today, through reading Geron's book Hands-On Machine Learning... and the ISLR, 7th ed. book that has been assigned for this week in the Applied ML course.
Decision trees successively split a dataset in two according to a chosen feature and a threshold on that feature. For instance, with a dataset of supermarket data, the model builder could split the hot dog products on package size: those with six or fewer per package versus those with more than six. The regression-tree algorithm I looked at today tries candidate splits and, for each one, computes the residual sum of squares (RSS) on the two resulting subsets of data. The split with the least total RSS 'wins' and becomes the location of the split.
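A minimal sketch of that split search on a single numeric feature (my own toy illustration, not Geron's code): try each midpoint between consecutive sorted values as a threshold, score it by the summed RSS of the two sides, and keep the best.

```python
import numpy as np

def rss(y):
    # RSS of a region: squared deviations from the region's mean,
    # since the mean is the region's prediction in a regression tree
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_split(x, y):
    """Return (threshold, total_rss) for the single-feature split
    that minimizes RSS(left subset) + RSS(right subset)."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_t, best_score = None, float("inf")
    # candidate thresholds: midpoints between consecutive values
    for t in (x[:-1] + x[1:]) / 2.0:
        left, right = y[x <= t], y[x > t]
        total = rss(left) + rss(right)
        if total < best_score:
            best_t, best_score = t, total
    return best_t, best_score
```

On a toy series where the response jumps between two clusters, the returned threshold falls between the clusters and the total RSS is zero, which matches the intuition that the best split isolates homogeneous groups.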
The residual sum of squares is the splitting criterion in this regression-tree context.
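Written out, in the standard form ISLR gives for a split on feature $j$ at threshold $s$ into regions $R_1$ and $R_2$:

```latex
\sum_{i:\, x_i \in R_1(j,s)} \left(y_i - \hat{y}_{R_1}\right)^2
\;+\;
\sum_{i:\, x_i \in R_2(j,s)} \left(y_i - \hat{y}_{R_2}\right)^2
```

where $\hat{y}_{R_m}$ is the mean response of the training observations falling in region $R_m$.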
I also learned that one can use a tree model to predict the probabilities of class membership. This happens in an intuitive way:
- The tree is traversed to find the leaf node for the instance;
- that node returns the ratio of training instances of class k among all training instances in the node;
- this ratio serves as the estimated probability that the instance belongs to class k;
- the predicted class is then the one with the greatest estimated probability.
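The leaf-level part of those steps can be sketched in a few lines. This is my own toy version of the ratio computation, assuming we already have the list of training labels that landed in the instance's leaf:

```python
from collections import Counter

def leaf_class_probabilities(leaf_labels):
    """Estimated class probabilities for one leaf: the ratio of
    training instances of each class k to all instances in the node."""
    counts = Counter(leaf_labels)
    n = len(leaf_labels)
    return {cls: count / n for cls, count in counts.items()}

def predict_class(leaf_labels):
    # the predicted class is the one with the greatest probability
    probs = leaf_class_probabilities(leaf_labels)
    return max(probs, key=probs.get)
```

With leaf labels `["a", "a", "b", "c"]`, the probabilities come out as 0.5, 0.25, 0.25 and the predicted class is `"a"` — the same ratios scikit-learn's `predict_proba` reports for a leaf, if I understand the docs correctly.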
Tomorrow, I will read about tree pruning, to start, and also the CART algorithm.
One thing on Python I read about today is the os module, which I have frequently seen imported in notebooks. It provides many utilities, it seems, for interfacing with the host operating system.
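A few of the os utilities I noted, as a quick reference for myself (the file names here are made up for illustration):

```python
import os

# current working directory
cwd = os.getcwd()

# build a path portably, using the OS's separator
data_path = os.path.join(cwd, "data", "notes.txt")

# list the entries in a directory
entries = os.listdir(cwd)

# read an environment variable, with a fallback if it is unset
home = os.environ.get("HOME", "(not set)")
```

Notebooks seem to use `os.path.join` and `os.listdir` most often, typically to walk a data directory before loading files.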