Machine Learning 04: DNN, CNN, RNN, and Representation Learning
 Why not Logistic Regression?
 Multilayer Networks of Sigmoid Units
 Example: Learn probalistics XOR
 A Convolutional Neural Net for Handwritten Digit recognition: LeNet
 Convolution layer, Maxpooling layer
 Dealing with Overfitting by Validation Set
 Batch Normalization
 Learning Hidden Layer Representations
 Learning Distributed Representations for Words
 What is cross entropy?
 Recurrent Neural Networks
 Long Short Term Memory (LSTM) Unit
 Gated Recurrent Units (GRUs)
 Programming Frameworks for Deep Nets
Course Notes of Professor Tom Mitchell Machine Learning Course @ CMU, 2017
Why not Logistic Regression?
We like logistic regression, but:
 how would it perform when trying to learn P(image contains Hillary Clinton  pixel values X1, X2 ... X10,000)?
 what Xi image features to use? edges? color blotches? generic face? subwindows? lighting invariant properties? position independent? SIFT features?
Multilayer Networks of Sigmoid Units
E.g. 1 Speech Recognition
This is a multilayer networks of sigmoid units to do speech recognize. 1hot vector encoding for the last layer to do classification into certain category. On the right side, we can see the decision surface is very complecated instead of linear surface of logistic regression.
E.g. 2 ALVINN selfdriving car
 4 hidden units in the hidden layer
 fully connected, output units connecting to all hidden layer units
 Onehot encoding for the output 30 units in training data
Sigmoid Unit
Rectified Linear Unit (ReLU)
Comparing with Sigmoid, the only difference is the activation function *f*. ReLU change the sigmoid function to thresholded output. Note that ReLU is still linear classifier!
Many types of parameterized units
 Sigmoid units
 ReLU
 Leaky ReLU (fixed nonzero slope for input<0)
 Parametric ReLU (trainable slope)
 Max Pool
 Inner Product
 GRU’s
 LSTM’s
 Matrix multiply
 .... no end in sight
Training Deep Nets

Choose loss function J(θ) to optimize
 sum of squared errors for y continuous: Σ (y – h(x; θ))^2
 maximize conditional likelihood: Σ log P(yx; θ)
 MAP estimate: Σ log P(yx; θ) P(θ)
 0/1 loss. Sum of classification errors: Σ δ(y = h(x; θ) — Not a good choice because not smooth
 ...

Design network architecture
 Network of layers (ReLU’s, sigmoid, convolutions, ...)
 Widths of layers
 Fully or partly interconnected
 ...

Training algorithm
 Derive gradient formulas
 Choose gradient descent method, including stopping condition
 Experiment with alternative architectures
 Drop out
Example: Learn probalistics XOR
 Given boolean Y, X1, X2 learn P(YX1,X2), where
 P(Y=0  X1 = X2) = 0.9
 P(Y=1  X1 ≠ X2) = 0.9
 Can we learn this with logistic regression?
 No, it's not a linear problem.
 Draw the axis with x1, x2 here to see.
 What can we do?
 Add a hidden layer which gets the inputs from x1 and x2 and output to the y results.
Feedforward Process
Gradient Calculation with Chain Rule
Loss function to be minimized: negative log likelihood
use chain rule:
In which:
Back Propagation Process
A Convolutional Neural Net for Handwritten Digit recognition: LeNet
Convolution layer, Maxpooling layer
A detailed guide of forward propagation and back propagation.
Dealing with Overfitting by Validation Set
Our learning algorithm involves a parameter n=number of gradient descent iterations
 Separate available data into training and validation set
 Use training to perform gradient descent
 n <— number of iterations that optimizes validation set error
This gives unbiased estimate of optimal n
(but still an optimistically biased estimate of true error)
Batch Normalization
Key idea: add batch normalization layers to network. For each minibatch, BN layer scales each feature to have mean 0, variance 1. Then adds an offset β and scaling γ parameter to train.
Impact of Batch Normalization on MNIST net:
Batch normalization potentially helps in two ways: faster learning and higher overall accuracy. The improved method also allows you to use a higher learning rate, potentially providing another boost in speed. A detailed intro of BN.
Learning Hidden Layer Representations
Note that the hidden layer is actually like 3 bit to represent numbers: 100, 001, 010, 111, 000, 011, 101, 110.
Note that hidden layer values are not boolean. So it have stronger representative ability .
Learnt Hidden Unit Weights in Face Recognition
Learning pose from face pictures: Link
Learning Distributed Representations for Words
Learning Distributed Representations for Words
 also called “word embeddings”
 word2vec is one commonly used embedding
 based on skip gram model
Key idea: given word sequence w1 w2 … wT
train network to predict surrounding words. for each word wt predict wt2, wt1, wt+1, wt+2
e.g., “the dog jumped over the fence in order to get to..” — “the cat jumped off the widow ledge in order to ...”
Word2Vec Word Embeddings
Skipgram Word Embeddings
Learning Representations for Words and Relations
NELL (Never Ending Language Learner) is learning to read the web, building large knowledge graph. Yang & Mitchell, 2017 PDF
What is cross entropy?
Our negative log likelihood loss is also called cross entropy. Why?
Recurrent Neural Networks
 Many tasks involve sequential data
 predict stock price at time
t+1
based on prices att, t1, t2, …
 translate sentences (word sequences) from Spanish to English
 transcribe speech (sound sequences) to text (word sequences)
 predict stock price at time
 Key idea: recurrent network uses (part of) its state at
t
as input fort+1
Note that in the diagram, the weight W
is shared by each iteration. (parameter sharing). When using back propagation to calculate the gradient, there will be a problem of vanishing and / or exploding gradients. (0.9 × 0.9 × 0.9 × … × 0.9 ≈ 0)
Gradient Clipping: Managing exploding gradients
Simply clip gradient when they explode.
Bidirectional Recurrent Neural Networks
Key idea: processing of word at position t can depend on following words too, not just preceding words.
 For the hidden layer units
h
, it accepts two inputs, one from inputx
and the other from the previous hidden layer unit (representing the previous time).  In the bidirectional RNN,
g
layer is added to accept inputs from "future" (the output of nextg
unit).  The output
o
is decided by bothh
andg
, containing both the "past" and "future" information.
Deep Bidirectional Recurrent Network
In [Irsoy & Cardie, 2014], Deep Bidirectional Recurrent Network shows better performance in Opinion Mining tasks than shallow RNN.
Long Short Term Memory (LSTM) Unit
In previous RNN, there is a problem that weight vanishing after long steps (e.g. 100 steps later). In this case, the network cannot make use of information happened long ago. LSTM is to solve the problem by adopting several gates for the hidden layer units.
Gated Recurrent Units (GRUs)
Programming Frameworks for Deep Nets
 TensorFlow (Google)
 TFLearn (runs on top of TensorFlow, but simpler to use)
 Theano (University of Montreal)
 Pytorch (Facebook)
 CNTK (Microsoft)
 Keras (can run on top of Theano, CNTK, TensorFlow)
 Caffe (Berkeley AI Research)
Many support use of Graphics Processing Units (GPU’s)