Domain randomization

Machine learning systems fail when the training data distribution is significantly different from the test data. For a “simple” problem like image classification, we can avoid this problem by including sufficient diversity in the training data. But for more complicated real-world problems, such as robotic AI, there are simply too many possibilities that one cannot…
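As a rough sketch of the idea (the simulator API and parameter names below are made up for illustration), domain randomization simply resamples the nuisance parameters of a simulated environment every episode, so the policy never overfits to one particular rendering of the world:

# Minimal sketch of domain randomization for a simulated robotics task.
# `SimEnv` and its attributes are hypothetical stand-ins for a real simulator API.
import random

def randomize_env(env):
    """Resample visual and physical parameters before each episode."""
    env.light_intensity = random.uniform(0.2, 2.0)            # lighting conditions
    env.object_texture = random.choice(["wood", "metal", "fabric", "noise"])
    env.object_mass = random.uniform(0.1, 2.0)                 # kg
    env.friction = random.uniform(0.3, 1.2)
    env.camera_jitter = random.uniform(-0.05, 0.05)            # meters
    return env

# Training loop (schematic): the policy never sees the same environment twice,
# which forces it to learn features that transfer to the real world.
# for episode in range(num_episodes):
#     env = randomize_env(SimEnv())
#     rollout_and_update(policy, env)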

counterfactual explanation

Counterfactual explanations have been proposed and studied in recent years. In logic, a counterfactual refers to the scenario where the condition of an if-statement is universally false. Note that the if-statement is vacuously true when its condition is universally false, so the conclusion can be false even though the if-statement always holds true. A counterfactual example in ML refers…
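One common formulation of a counterfactual example (this sketch is my own illustration, not necessarily the definition the post goes on to use) is the smallest change to an input that flips a classifier’s prediction. For a differentiable model it can be searched for by gradient descent:

# Sketch of searching for a counterfactual example: the smallest change to an
# input x that flips a differentiable classifier's prediction. All names here
# (model, target_class, lam, ...) are illustrative.
import torch

def find_counterfactual(model, x, target_class, lam=0.1, steps=200, lr=0.05):
    x_cf = x.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([x_cf], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model(x_cf)
        # push the prediction toward the desired (counterfactual) class...
        pred_loss = torch.nn.functional.cross_entropy(
            logits, torch.tensor([target_class]))
        # ...while keeping the counterfactual close to the original input
        dist_loss = torch.norm(x_cf - x, p=1)
        (pred_loss + lam * dist_loss).backward()
        opt.step()
    return x_cf.detach()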

Distributed representation

It sounds like a misnomer to me; I would probably just call it a “vector” representation. It doesn’t carry the “distributed” sense of scattering information into different places. For example, to recognize a cat with a “distributed” representation, we may distribute the features into things like “does it have a tail?”, “does it have four legs?”, and “does…
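A toy illustration of the contrast I have in mind (the feature names are made up): a local, one-hot code dedicates one dimension to each concept, while a distributed code spreads a concept over many shared feature dimensions:

# Toy contrast between a local (one-hot) code and a distributed code.
import numpy as np

# Local / one-hot: one dimension per concept, only one dimension is active.
one_hot_cat = np.array([1, 0, 0])            # [cat, dog, rabbit]

# Distributed: the concept is spread over many feature dimensions,
# e.g. [has_tail, has_four_legs, has_whiskers, barks].
distributed_cat = np.array([1.0, 1.0, 1.0, 0.0])
distributed_dog = np.array([1.0, 1.0, 0.0, 1.0])

# Similar concepts overlap in this space, unlike with one-hot codes.
print(distributed_cat @ distributed_dog)     # shared features -> nonzero overlap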

LeCun’s first lecture

LeCun has a new course on deep learning this spring. I found two things he mentioned worth jotting down. First, natural data lives on a low-dimensional manifold. I probably should have come across that before, but it didn’t register earlier. Come to think of it, this is a very important fact. Second, as it is…

Self-supervised learning

A very good lecture by Ishan Misra summarizes many self-supervised learning methods. The idea of self-supervised learning is very simple. We try to design some pretext tasks where the labels can be obtained for free. Instead of training a model on the real task (the downstream task), we pretrain the model with these pretext…
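For a concrete example of a pretext task with free labels (rotation prediction is one standard choice from the literature; the lecture covers several others), the sketch below generates rotated copies of each image and uses the rotation index itself as the label:

# Sketch of a pretext task with "free" labels: predict which of four rotations
# was applied to an image. The backbone learns useful features by solving this,
# then gets fine-tuned on the real downstream task.
import torch

def make_rotation_batch(images):
    """images: (N, C, H, W). Returns rotated images and their rotation labels."""
    rotated, labels = [], []
    for k in range(4):                                   # 0, 90, 180, 270 degrees
        rotated.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(rotated), torch.cat(labels)

# Pretraining step (schematic), assuming some backbone_classifier model:
# x_rot, y_rot = make_rotation_batch(x)
# loss = torch.nn.functional.cross_entropy(backbone_classifier(x_rot), y_rot)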

Visualizing CNN

Let’s summarize a couple of techniques for visualization. Here all networks are trained to do classification. The simplest one is to randomly black out part of the image and evaluate the classification result. For example, the dog score should drop when we black out the dog’s face. This approach is readily applicable to any classifier regardless…
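A minimal sketch of this occlusion approach (the function and parameter names are mine): slide a black patch over the image and record how much the target class score drops at each location:

# Occlusion-based visualization: black out a patch, measure the score drop.
# Works with any classifier since it only needs forward passes.
import torch

def occlusion_map(model, image, target_class, patch=16, stride=8):
    """image: (C, H, W). Returns a 2-D map of score drops."""
    model.eval()
    with torch.no_grad():
        base = torch.softmax(model(image.unsqueeze(0)), dim=1)[0, target_class]
        _, H, W = image.shape
        heat = []
        for top in range(0, H - patch + 1, stride):
            row = []
            for left in range(0, W - patch + 1, stride):
                occluded = image.clone()
                occluded[:, top:top + patch, left:left + patch] = 0.0  # black out
                score = torch.softmax(model(occluded.unsqueeze(0)),
                                      dim=1)[0, target_class]
                row.append((base - score).item())  # big drop = important region
            heat.append(row)
    return torch.tensor(heat)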

No free lunch theorem

The no-free-lunch theorem basically claims that every learning algorithm sucks, in the sense that, averaged over all possible problems, none of them is better than random guessing on a supervised problem. The argument is that there are infinitely many candidate functions out there, and since we only sample some of the points of the function, the rest of the function can literally be anything. If…
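Roughly, one standard way to state the result (for a finite input space, following Wolpert’s off-training-set formulation; the notation below is mine): for any two learning algorithms $latex A_1$ and $latex A_2$ trained on the same data $latex D$, summing the off-training-set error uniformly over all possible target functions $latex f$ gives the same value,

$latex \sum_{f} E_{\text{ots}}(A_1 \mid f, D) = \sum_{f} E_{\text{ots}}(A_2 \mid f, D),$

so once we refuse to put any prior on $latex f$, no learner beats any other, including random guessing.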

Introduction to generalized linear model

As the name suggests, the generalized linear model (GLM) is a generalization of the linear regression model. In a classic linear model with input $latex X$, the output $latex Y$ is modeled by $latex Y|X \sim \mathcal{N} (B X, \sigma^2 I)$. GLM extends the model in two ways: first, instead of having the mean as a simple…
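To make the two extensions concrete (this particular example is mine, not necessarily the one the post continues with): GLM replaces the identity mean and Gaussian noise with a link function $latex g$ and an exponential-family response, so that $latex g(\mathbb{E}[Y|X]) = B X$. For instance, Poisson regression uses the log link, $latex Y|X \sim \mathrm{Poisson}(\lambda)$ with $latex \log \lambda = B X$, so the mean $latex \lambda = e^{B X}$ stays positive as a count mean should.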

Boosting

While bagging mitigates the variance problem, boosting can reduce bias through a combination of weak learners. Boosting can be viewed as gradient descent in the functional space. With $latex F$ as the current classifier and $latex L$ the loss, by Taylor expansion we have $latex L(F + \epsilon h) \approx L(F) + \epsilon \langle \nabla L(F), h \rangle$. So out of an ensemble of weak classifiers, we want to find the best $latex h$ as follows: $latex h^* = \arg\min_h \langle \nabla L(F), h \rangle$. Denote …
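A minimal sketch of this functional-gradient view with squared loss (my own illustration, using scikit-learn stumps as the weak learners): each round fits a weak learner to the negative gradient of the loss, i.e. the current residuals, and takes a small step in function space:

# Gradient boosting with squared loss: boosting as functional gradient descent.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_rounds=50, lr=0.1, depth=1):
    f0 = y.mean()
    F = np.full(len(y), f0)                    # initial constant model
    learners = []
    for _ in range(n_rounds):
        residual = y - F                       # negative gradient of squared loss
        h = DecisionTreeRegressor(max_depth=depth).fit(X, residual)
        F += lr * h.predict(X)                 # small step in function space
        learners.append(h)
    return learners, f0, lr

def predict(X, learners, f0, lr):
    return f0 + lr * sum(h.predict(X) for h in learners)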

Decision Tree

Decision tree is a simple algorithm: we just split the data greedily over all possible features. A deeper tree will have higher variance and a shallower tree will have higher bias. Naturally, we want to find the smallest tree that fits all the data (we call a tree with zero training error consistent). However, in general this…
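A sketch of the greedy split selection underlying this (the helper names are mine): for every feature and candidate threshold, compute the information gain and keep the best split:

# Greedy split selection inside a decision tree, using information gain.
import numpy as np

def entropy(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(X, y):
    base, best = entropy(y), (None, None, 0.0)
    for j in range(X.shape[1]):                    # all possible features
        for t in np.unique(X[:, j]):               # candidate thresholds
            left = X[:, j] <= t
            if left.all() or (~left).all():
                continue                           # skip degenerate splits
            gain = base - (left.mean() * entropy(y[left])
                           + (~left).mean() * entropy(y[~left]))
            if gain > best[2]:
                best = (j, t, gain)
    return best                                    # (feature, threshold, gain)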