In this article we will look at TensorFlow (& why Swift).
This is a public alpha (getting feedback) and I am still working on it. Feel free to read it, but I expect to have it cleaned up and ready for general public consumption in the next few weeks.
Swift for TensorFlow: 2 kinds of users
Advanced ML researchers who are limited by current ML frameworks. Swift for TensorFlow’s advantages include a seamless integration with a modern general-purpose language, allowing for more dynamic and sophisticated models. Fast abstractions can be developed “in user-space” (as opposed to in C/C++ aka “framework-space”), resulting in modular APIs that can be easily customized.
ML learners* who are just getting started with machine learning. Thanks to Swift’s support for quality tooling (e.g. context-aware autocomplete), Swift for TensorFlow can be one of the most productive ways to get started learning the fundamentals of machine learning.
*Everyone at some point
Why Swift for TensorFlow?
Machine Learning with No Boundaries
“Swift for TensorFlow is the first serious effort I’ve seen to incorporate differentiable programming deep into the heart of a widely used language that is designed from the ground up for performance” — Jeremy Howard
Fast.ai’s Deep Learning from the Foundations with Swift for TensorFlow:
Jeremy and Chris Lattner co-taught two advanced sessions of Deep Learning from the Foundations. The MOOC based on the recorded lectures, a companion fastai library, and all course materials are publicly available!
This led to Swift for TensorFlow v0.4, which includes:
- Support for automatic differentiation of functions with control flow at compile time.
- A prototype new execution mode — Lazy Tensor — that has the potential to unlock higher performance on accelerators such as GPUs and TPUs.
- Many new activation functions and layers, as well as a collection of notebooks and tutorials.
- Several new models in our model garden, which includes ResNet, Transformer, and MiniGo.
~~> All this might not make much sense so let’s get some background <~~
Some Background:
What is Artificial Intelligence (AI) / Machine Learning (ML) / Deep Learning (DL) / Neural Nets (NN)? Here we list them from general to more specific.
Artificial Intelligence(AI) — The field was founded on the claim that human intelligence “can be so precisely described that a machine can be made to simulate it”. Some would say we are not building human intelligence but we are building alien intelligence.
Machine Learning(ML) — is the scientific study of algorithms and statistical models that computer systems use in order to perform a specific task effectively without using explicit instructions, relying on patterns and inference instead. ML is used in AI.
Deep Learning(DL) — (deep structured learning or hierarchical learning) is part of a broader family of machine learning methods based on artificial neural networks. Learning can be supervised, semi-supervised or unsupervised. DL is one form of ML using ANN.
Artificial Neural Nets(ANN) — based on a collection of connected units or nodes called artificial neurons, which loosely model the neurons in a biological brain.
Note: All our examples are Supervised Learning: we have a training set with features and labels and use them to train the model's parameters. Unsupervised Learning is the same, except we do not provide any labels, allowing the system to determine them on its own.
So now let’s build up from ANN to AI.
All the information in this section is found here …
Codelabs are here …
Problem: We want to identify a handwritten number using the data set below.
Broadly speaking, problems like this are solved by one of three methods:
- Solve the problem with a mathematical equation
- Solve the problem with a transform. Convert the problem to a new domain. Solve the problem and convert it back.
- Guess at the answer.
A surprisingly large number of systems use the third way: guess at the answer, see how far you are from the solution, and make another guess. For this to work we need an “Error Function” to determine the distance from our guess to the correct answer.
Let's start thinking about how we can build this “Error Function”.
Say we want to build a system (model: system of equations / training set) with an error function using images of numbers. The question is how you go from image input (pixels), through an error function, to understandable output (a number). We need to “recognize” an image of a number, so we need to somehow “quantify” this image. By flattening the pixels into an array of numbers (and one-hot encoding the digit labels) we get the image in a numerical form we can do “computations” on.
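To make this concrete, here is a tiny sketch in plain Swift (the image, its pixel values, and the label are made up for illustration): we flatten a pixel grid into a plain array and one-hot encode a digit label.

// A 3x3 grayscale "image" (a real MNIST digit is 28x28 = 784 pixels).
let image: [[Float]] = [
    [0.0, 0.9, 0.0],
    [0.0, 0.8, 0.0],
    [0.0, 0.7, 0.0],
]

// Flatten the 2-D pixel grid into a single vector of numbers.
let pixels: [Float] = image.flatMap { $0 }          // 9 values

// One-hot encode the label "1" as a 10-element vector (digits 0-9).
func oneHot(_ label: Int, classes: Int = 10) -> [Float] {
    var v = [Float](repeating: 0, count: classes)
    v[label] = 1
    return v
}

print(pixels)        // [0.0, 0.9, 0.0, 0.0, 0.8, 0.0, 0.0, 0.7, 0.0]
print(oneHot(1))     // [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]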
We will use a Layered Neural Network to build a system of equations that forms our model. By training the model we will minimize our error function. In the next sections we explain this “model”, consisting of features, nodes and labels.
Basic Layered Neural Network (LNN) — You can think of an LNN as a series of layers of nodes which have “activation energy”. The first layer is the inputs (features), the middle layers are the “compute nodes”, and the last layer is the output (labels).
The more layers the more degrees of freedom in the model.
Each “neuron” in a neural network does a weighted sum of all of its inputs, adds a constant called the “bias”, and then feeds the result through some non-linear activation function. We will be using the ReLU activation function.
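Here is a minimal sketch of a single neuron in plain Swift (the input, weight, and bias values are invented): a weighted sum plus a bias, passed through ReLU.

func relu(_ x: Float) -> Float { max(0, x) }

func neuron(inputs: [Float], weights: [Float], bias: Float) -> Float {
    // Weighted sum of the inputs, plus the bias, through the activation.
    let weightedSum = zip(inputs, weights).reduce(0) { $0 + $1.0 * $1.1 }
    return relu(weightedSum + bias)
}

let activation = neuron(inputs: [0.0, 0.9, 0.2],
                        weights: [0.5, -1.2, 2.0],
                        bias: 0.1)
print(activation)   // 0.0 (the weighted sum plus bias is negative, so ReLU clips it to zero)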
But how do we calculate the weights and bias? We need the error function for this. For the data set we have, we know the answers the model should produce, so we can “train” the model.
→ Splitting your data into training (training loss) and validation (testing loss) sets is an art and needs experience, but 80/20 is a good starting point.
We use the cross-entropy function between the actual probabilities (the one-hot encoded labels) and the computed probabilities, and use gradient descent to minimize the distance between these two vectors. Let's define two terms.
Entropy Loss Function: The error in the model. The loss is calculated on training (training loss) and validation (test loss), and its interpretation is how well the model is doing for these two sets. This is a number, not a percentage. We will use the log loss function, which penalizes heavily for being very confident and very wrong.
Accuracy: The percentage of right vs. wrong predictions. It is determined after the model parameters are learned and fixed.
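As a rough illustration of why log loss punishes confident mistakes, here is a small plain-Swift sketch (the probability values are invented): a confident correct prediction gives a tiny loss, while a confident wrong prediction is penalized heavily.

import Foundation

// Cross-entropy / log loss between a one-hot label and predicted probabilities.
func crossEntropy(label: [Float], predicted: [Float]) -> Float {
    // -sum(y * log(p)); with a one-hot label this is just -log(p[correctClass])
    return -zip(label, predicted).reduce(0) { $0 + $1.0 * log($1.1) }
}

let label: [Float] = [0, 0, 0, 1, 0, 0, 0, 0, 0, 0]   // the digit is a "3"

let confidentRight: [Float] = [0.01, 0.01, 0.01, 0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02]
let confidentWrong: [Float] = [0.90, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.01, 0.02]

print(crossEntropy(label: label, predicted: confidentRight))  // ~0.105 (small loss)
print(crossEntropy(label: label, predicted: confidentWrong))  // ~4.6 (heavy penalty)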
Let's build a system of equations (Matrix/Tensor) from these arrays, weights and biases.
This is the system of equations we need to solve.
We use the L (label) values and the Softmax predictions to train the weights and biases on known digits. This is supervised learning.
Next we need to go from this numerical array to a probability of which number it is. We do this by using “Softmax”: a function that takes a vector of real numbers (here, one value per output node, each a sum of the 784 inputs times “trained” weights plus a bias) and normalizes it into a probability distribution (here, 10 probabilities for the digits 0–9).
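A quick plain-Swift sketch of softmax itself (the raw scores are made up): it exponentiates the scores and normalizes them so they sum to 1.

import Foundation

// Normalize a vector of raw scores into a probability distribution.
func softmax(_ logits: [Float]) -> [Float] {
    let maxLogit = logits.max() ?? 0                 // subtract the max for numerical stability
    let exps = logits.map { exp($0 - maxLogit) }
    let sum = exps.reduce(0, +)
    return exps.map { $0 / sum }
}

let scores: [Float] = [1.0, 2.0, 0.1, 5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  // 10 raw outputs
let probabilities = softmax(scores)
print(probabilities)                      // the highest score gets the highest probability
print(probabilities.reduce(0, +))         // ~1.0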
Now that we have set up our system of equations (Matrix) … how do we solve it? These systems of equations are held in Tensors …
The definition of a tensor: a tensor is a multilinear map from vector spaces to a scalar field. But in this context we can simply think of a Tensor as a matrix which functions (linear operations) can act on …
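Assuming the Swift for TensorFlow toolchain (the TensorFlow module) is available, a Tensor really can be treated as a matrix that linear operations act on; the values below are arbitrary.

import TensorFlow

// A Tensor holding a 2x2 matrix, and a linear operation (matrix multiply) acting on it.
let w: Tensor<Float> = [[1, 2],
                        [3, 4]]
let identity: Tensor<Float> = [[1, 0],
                               [0, 1]]

let y = matmul(w, identity)   // same as w, since we multiplied by the identity
print(y)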
Our error function needs to know which direction to travel. Should it make the values higher or lower? The gradient vector is used to determine the direction of the optimization. This gradient is the vector of all the partial derivatives of the loss with respect to the weights and biases.
To understand the Gradient Descent Optimizer we need to understand partial derivatives at a high level.
In mathematics, a partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant. — wiki
How do you solve Implicit Differentiation by NancyPi
Evaluating the partial derivatives at the current point (the derivative along each dimension) produces a gradient vector that points toward values which make the function larger. So if we flip the gradient vector with a negative sign, it points in the direction of values that minimize the function.
Partial derivatives are used to compute the gradient vector in multi-dimensional space.
This gradient vector (negated, to minimize) points us in the direction to move. This vector, which has magnitude & direction, is our guiding star … our way home to the minimum error …
Using various strategies and gradient functions we find the minimum of a surface.
These Tensors are “solved” using a Gradient Descent Optimizer with a learning rate. We set the optimizer to minimize and we pass it the loss function (cross_entropy).
If the learning rate is too high it will overshoot the correct answer; if it is too low it will take longer to converge.
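A toy plain-Swift sketch of that trade-off (the function and the learning rates are invented): minimizing f(x) = (x - 3)^2, whose gradient is 2 * (x - 3), with different learning rates.

func gradientDescent(learningRate: Float, steps: Int) -> Float {
    var x: Float = 0
    for _ in 0..<steps {
        let gradient = 2 * (x - 3)
        x -= learningRate * gradient     // move against the gradient
    }
    return x
}

print(gradientDescent(learningRate: 0.1, steps: 50))   // ~3.0: converges to the minimum
print(gradientDescent(learningRate: 1.1, steps: 50))   // diverges: the rate is too high
print(gradientDescent(learningRate: 0.001, steps: 50)) // ~0.29: too low, still far from 3 after 50 steps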
But how do you tell the computer to compute these partial derivatives? Do you just say to the computer: “differentiate this expression”?
There are three types of differentiation:
- Numerical: it uses the definition of the derivative (lim) to approximate the result. This is what you learned in school.
- Symbolic: manipulation of mathematical expressions
- Automatic: repeatedly using the chain rule to break down the expression and use simple rules to obtain the result.
TensorFlow uses reverse-mode automatic differentiation. Reverse-mode automatic differentiation lets you calculate gradients much more efficiently than forward-mode automatic differentiation when a function has many inputs and few outputs, as a loss function does.
The Chain Rule explained by Nancy Pi.
Definition: Automatic, or algorithmic, differentiation (AD) is a chain-rule-based technique for evaluating the derivatives of functions given as computer programs.
Automatic Differentiation (AD), also known as algorithmic differentiation, is a family of techniques used to obtain the derivative of a function. Functions can be represented as a composition of elementary operators whose derivatives are well-known. While partial derivatives can be computed through different techniques, the most common is a recursive application of the chain rule in the reverse direction, called reverse-mode AD.
Auto Differentiation
There are two types of Auto Differentiation methods:
Reverse mode — derivative of a single output w.r.t. all inputs.
Forward mode — derivative of all outputs w.r.t. one input.
The basic unit of both methods is the computational graph.
- Left — Forward Pass Graph (this is what we create with TF)
- Right — Backward Pass Graph (TF automatically creates this)
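To make the forward and backward passes concrete, here is a hand-worked toy example in plain Swift (the function f(x, y) = x*y + sin(x) is invented for illustration): the forward pass evaluates and stores each elementary operation, and the backward pass applies the chain rule from the output back to the inputs.

import Foundation

let x: Float = 2, y: Float = 3

// Forward pass: evaluate each elementary operation and remember the intermediates.
let a = x * y          // a = 6
let b = sin(x)         // b ≈ 0.909
let f = a + b          // f ≈ 6.909
print("f =", f)

// Backward pass: apply the chain rule from the output back to each input.
let df_df: Float = 1
let df_da = df_df * 1                     // f = a + b, so ∂f/∂a = 1
let df_db = df_df * 1                     // and ∂f/∂b = 1
let df_dx = df_da * y + df_db * cos(x)    // a = x*y gives y; b = sin(x) gives cos(x)
let df_dy = df_da * x                     // a = x*y gives x

print(df_dx)   // 3 + cos(2) ≈ 2.584
print(df_dy)   // 2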
The Swift for TensorFlow project aims to provide best-in-class support for AD (see Swift section below)
→ produce a computational graph in memory ←
TensorFlow uses a dataflow graph to represent your computation in terms of the dependencies between individual operations. This leads to a low-level programming model in which you first define the dataflow graph, then create a TensorFlow session to run parts of the graph across a set of local and remote devices.
When we start a session, TF automatically calculates gradients for all the differentiable operations in the graph and uses them in the chain rule.
TensorFlow does not compute a numeric approximation but a formal derivation of the formula for the loss, and having the graph in memory also helps with running on distributed clusters.
Swift for TensorFlow allows you to run these graphs in eager mode, giving you the best of both worlds: eager execution with a graph flow but no need for a session, allowing you to step through the execution for debugging (see Swift section below).
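Assuming the Swift for TensorFlow toolchain, a sketch of what eager execution feels like (all values are made up): every tensor operation runs as soon as the line executes, so you can print or inspect intermediates without building a graph or starting a session.

import TensorFlow

let images: Tensor<Float> = [[0.0, 0.5], [1.0, 0.2]]
let weights: Tensor<Float> = [[0.1, -0.3], [0.8, 0.4]]
let bias: Tensor<Float> = [0.05, 0.05]

let logits = matmul(images, weights) + bias   // runs immediately, no session
print(logits)                                 // inspect the result right away
print(softmax(logits))                        // so does every following op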
In standard (graph-mode) TensorFlow, to run these graphs you need to start a TF Session.
That #train in the above slide is the gradient derived at some point in the learning cycle.
(34:12 is the timestamp in the referenced video.)
Stochastic Gradient Descent (SGD)
Stochastic gradient descent is an incredibly useful optimization method (it is also the heart of deep learning, where it is used for back-propagation).
For standard gradient descent, we evaluate the loss using all of our data which can be really slow. In stochastic gradient descent, we evaluate our loss function on just a sample of our data (sometimes called a mini-batch). We would get different loss values on different samples of the data, so this is why it is stochastic. It turns out that this is still an effective way to optimize, and it’s much more efficient!
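A minimal plain-Swift sketch of the idea (the data is synthetic and the model is a single line y = w*x + b): each update uses the gradient computed on a small random mini-batch rather than on the whole data set.

// Fit y = 2x + 1 with stochastic gradient descent on mini-batches of 4 points.
var w: Float = 0, b: Float = 0
let learningRate: Float = 0.05
let batchSize = 4

let xs: [Float] = (0..<100).map { Float($0) / 100 }
let ys: [Float] = xs.map { 2 * $0 + 1 }

for _ in 0..<2000 {
    // Sample a mini-batch instead of evaluating the loss on all 100 points.
    let batch = (0..<batchSize).map { _ in Int.random(in: 0..<xs.count) }
    var gradW: Float = 0, gradB: Float = 0
    for i in batch {
        let prediction = w * xs[i] + b
        let error = prediction - ys[i]
        gradW += 2 * error * xs[i] / Float(batchSize)   // d(mse)/dw on the mini-batch
        gradB += 2 * error / Float(batchSize)           // d(mse)/db on the mini-batch
    }
    w -= learningRate * gradW
    b -= learningRate * gradB
}

print(w, b)   // approaches 2.0 and 1.0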
- Before we can begin deriving the expression, it must be converted into a computational graph. A computational graph simply turns each operation into a node and connects them through lines, called edges. The computational graph for our example function is shown below.
- Next, we need to calculate the partial derivatives of each connection between operations, represented by the edges. These are the calculations of the partials of each edge:
Mini-batch — GPUs love large matrices, and using mini-batches helps avoid local minima.
For better results add some layers to increase the degrees of freedom.
Remember: Basic Layered Neural Network (LNN) — a series of layers of nodes which have “activation energy”. The first layer is the inputs (features), the middle layers are the “compute nodes”, and the last layer is the output (labels).
Hyperparameter tuning
For us to find the best model we need to set these parameters (a small sketch follows the list):
- Learning rate: how much we change the weights in each optimization step.
- Epochs: the number of passes through the training data.
- Batch size: the number of samples fed in per training step.
- Activation function: the function used to turn on the neuron, e.g. ReLU.
- Hidden layers: increase the number of degrees of freedom in the model.
- Weight initialization: small random numbers, uniformly distributed.
- Dropout: remove certain neurons during training to avoid overfitting.
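A small sketch of these knobs gathered in one place (the values are illustrative defaults, not recommendations):

struct Hyperparameters {
    var learningRate: Float = 0.003             // step size used by the optimizer
    var epochs = 10                             // passes over the training data
    var batchSize = 100                         // samples per training step
    var hiddenLayerSizes = [200, 100, 60, 30]   // extra layers add degrees of freedom
    var dropoutProbability: Float = 0.25        // fraction of neurons dropped to fight overfitting
}

let config = Hyperparameters()
print("training for \(config.epochs) epochs at learning rate \(config.learningRate)")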
With so many layers you might wind up with too many degrees of freedom; this causes overfitting, and you need to apply dropout.
Overfitting is observed when the training loss keeps improving while the validation loss is getting worse.
Convolutional Neural Networks (CNN)
Remember how we are using our images, all pixels flattened into a single vector ? That was a really bad idea. Handwritten digits are made of shapes and we discarded the shape information when we flattened the pixels. However, there is a type of neural network that can take advantage of shape information: convolutional networks. Let us try them. — From: TensorFlow and deep learning, without a PhD
Think of each of these dots as a node on a neural network with a weight and a bias. Each layer connects to the layer above it and forms an ANN.
Here we can see how the stride changes the shape till we get to the fully connected layer for the Softmax.
The sizing of the layers is done so that the number of neurons goes down roughly by a factor of two at each layer: 28x28x4≈3000 → 14x14x8≈1500 → 7x7x12≈500 → 200.
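A quick sanity check of that arithmetic in plain Swift (assuming SAME padding, where the output spatial size is the input size divided by the stride, rounded up):

func outputSize(inputSize: Int, stride: Int) -> Int {
    // With SAME padding, the output spatial size is ceil(input / stride).
    return (inputSize + stride - 1) / stride
}

var size = 28
for (stride, channels) in [(1, 4), (2, 8), (2, 12)] {
    size = outputSize(inputSize: size, stride: stride)
    print("\(size)x\(size)x\(channels) ≈ \(size * size * channels) neurons")
}
// 28x28x4 ≈ 3136, 14x14x8 ≈ 1568, 7x7x12 ≈ 588, then a 200-neuron fully connected layer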
To make the results even better, apply a drop rate (dropout) to the fully connected layer.
Batch Normalization:
Recurrent Neural Networks (RNN)
CNNs are for images, so how do you handle words? RNNs are good for natural language processing.
x inputs — the rest come from the next layer, fed back into the previous layer. Think of this as feedback: what the network already knows is fed back into its input. This is represented by the figure on the right. At time t, the input X(t) is combined with the previous state H to produce the softmax output Y(t). So now our neural net has state that corresponds to time t. This is a state machine with memory.
Very good for Natural Language Processing (NLP).
Long Short-Term Memory (LSTM).
The input X(t) and the previous output H(t−1) are concatenated into one vector of size p + n (a toy single-unit sketch follows this list):
- Forget Gate: a sigmoid (σ), giving values from 0 to 1, that decides how much of the old state C to keep
- Update Gate: a sigmoid that decides how much of the new candidate value to add
- Result Gate: a sigmoid that decides how much of the state to expose as output
- Input: a tanh of the input vector; on its own, this is the simplest possible RNN
- New C: the second internal state, called C: C(t) = forget · C(t−1) + update · input
- New H: H(t) = result · tanh(C(t))
- Output: Softmax(Ht * W + b)
The output vector has size m.
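To make the gates concrete, here is a toy single-unit version of one LSTM step in plain Swift (every weight and input value is made up; real cells use vectors and matrices, but the structure is the same):

import Foundation

func sigmoid(_ x: Float) -> Float { 1 / (1 + exp(-x)) }

// Previous state and current input (illustrative numbers).
var c: Float = 0.2      // internal cell state C
var h: Float = 0.5      // previous output H(t-1)
let x: Float = 1.0      // current input X(t)

// Each gate is a sigmoid of a weighted combination of [X(t), H(t-1)] plus a bias.
let forget = sigmoid(0.7 * x + 0.3 * h - 0.1)
let update = sigmoid(0.5 * x + 0.2 * h + 0.1)
let result = sigmoid(0.9 * x - 0.4 * h + 0.0)
let candidate = tanh(0.6 * x + 0.8 * h + 0.05)   // the "input" line: the simplest RNN

c = forget * c + update * candidate              // new C
h = result * tanh(c)                             // new H
print("new C = \(c), new H = \(h)")              // the output layer would then apply Softmax(H * W + b)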
The term “logits layer” is popularly used for the last neuron layer of neural networks used for classification tasks; it produces raw prediction values as real numbers ranging from −∞ to +∞.
Now that we have the necessary background we can answer the question:
Why Swift for TensorFlow?
Machine learning paradigms are so important that they deserve first-class language and compiler support. With Swift for TensorFlow, you can easily differentiate functions using differential operators like gradient(of:), or differentiate with respect to an entire model by calling the gradient(in:) method.
Major Features
- Automatic reverse-mode differentiation (forward mode not implemented yet)
- Define-by-run design (no sessions required)
- Swift optimized to include machine-learning-specific functions
- Allows access to Python APIs in a Pythonic way
- Includes a PyValue type for Python's dynamic system behavior
Let's look at each of these:
The Swift for TensorFlow project aims to provide best-in-class support for AD — including the best optimizations, best error messages in failure cases, and the most flexibility and expressivity. To achieve this, we built support for AD right into the Swift compiler. (see Swift section below)
Users can use @differentiable to give any function guaranteed differentiability. The attribute has a few associated arguments:
- the differentiation mode (currently only reverse is supported)
- the primal (optional, and should be specified if the adjoint requires checkpoints)
- the adjoint
Define-by-Run design (no sessions required)
The core graph program extraction algorithm, automatic differentiation, and Python language interoperability features of Swift for TensorFlow can be implemented for other programming languages, and we are occasionally asked why we didn't use some other language for this project.
Allows Python API access in a Pythonic way & includes the PyValue type for Python's dynamic system behavior
We quickly realized that our core static analysis-based Graph Program Extraction algorithm would not work well for Python given its highly dynamic nature.
Python will never work …
The fundamental algorithms we use to automatically identify and extract graphs out of a program are based on static analysis of the code. This means that we need to be able to look at code “statically” — i.e., without running it — and be able to reliably identify Tensor operations and the data flow and control flow dependencies between them. In order to achieve the performance of graphs, we need to be able to connect together large chunks of tensor code — preferably with the same granularity as if you were to manually use an API like the TensorFlow session API. We also need to allow users to be able to use high-level APIs like layers and estimators (and build their own abstractions!) without breaking our ability to do this analysis.
Swift optimized to include machine learning specific functions
let z1 = images ⊗ w1 + b1          // first layer: weighted sum of the pixels plus bias
let h1 = sigmoid(z1)               // first layer activation
let z2 = h1 ⊗ w2 + b2              // second layer: weighted sum of the hidden activations plus bias
let predictions = sigmoid(z2)      // network output
One thing to note here is that Swift uses the Unicode ⊗ character to represent the dot product.
func sillyExp(_ x: Float) -> Float {
    let 𝑒 = Float(M_E)
    print("Taking 𝑒(\(𝑒)) to the power of \(x)!")
    return pow(𝑒, x)
}

@differentiating(sillyExp)
func sillyDerivative(_ x: Float) -> (value: Float, pullback: (Float) -> Float) {
    let y = sillyExp(x)
    return (value: y, pullback: { v in v * y })
}

print("exp(3) =", sillyExp(3))
print("𝛁exp(3) =", gradient(of: sillyExp)(3))

// ~~~~ or ~~~~

// Custom differentiable type.
struct Model: Differentiable {
    var w: Float
    var b: Float
    func applied(to input: Float) -> Float {
        return w * input + b
    }
}

// Differentiate using `Differentiable.gradient(at:in:)`.
let model = Model(w: 4.0, b: 3.0)
let (𝛁model, 𝛁input) = model.gradient(at: 2.0) { model, input in
    model.applied(to: input)
}
Notice the Unicode 𝛁 (nabla) symbol.
Final Note:
TensorFlow 2.0 Eager Execution
TensorFlow 1.X requires users to manually stitch together an abstract syntax tree (the graph) by making tf.* API calls. It then requires users to manually compile the abstract syntax tree by passing a set of output tensors and input tensors to a session.run() call. By contrast, TensorFlow 2.0 executes eagerly (like Python normally does), and in 2.0, graphs and sessions should feel like implementation details.