Programming Frameworks:

When learning NN, we wrote a lot of our own functions for finding optimal weights. There were too many parameters to tune and too many algorithms to chase, and writing each of them from scratch for every project isn't very productive. So, the idea is to use pre-written Python modules to do the job. That's where programming frameworks come in. A programming framework provides all these functions pre-written in a compact library, with a lot of additional features for speed, efficiency, etc. The most popular AI frameworks are PyTorch, TensorFlow, Keras, etc. These are all open source.

TensorFlow:

TensorFlow (tf) is one of the programming frameworks used in AI. It was developed by Google and is now open source. The TensorFlow framework provides a collection of libraries to develop and train ML models using programming languages such as Python, JavaScript, etc. We'll concentrate on using TensorFlow in Python only, since tf is most widely used with Python. Its flexible architecture allows for easy deployment of computation across a variety of platforms (i.e CPUs, GPUs, TPUs).

Official website for tensorflow is:

https://www.tensorflow.org

This is a good place to get started with basic syntax and installation:

https://www.tensorflow.org/guide

Gotchas of TensorFlow:

Caution: If you start learning TensorFlow, there's actually no clear tutorial or simple documentation for it. So, you learn by example: you write some cryptic-looking code, and it does the job. It's very hard to see why it works, how it works and how to debug it if it fails. In raw Python, you can debug by writing your own debug code and putting in enough "print" statements to see where things went wrong. In tf, a lot of steps are combined into one cryptic function, and if it doesn't return the desired result, there's little help beyond cryptic-looking messages. There's TensorBoard, which supposedly helps in this debug process. I haven't tried that yet. A lot of AI folks hate TensorFlow for its obscure programming style. One such rant here:

http://nicodjimenez.github.io/2017/10/08/tensorflow.html

A lot of these complaints are about the initial version of TF, known as TensorFlow 1. So, Google came out with a new revision of TensorFlow, called TensorFlow 2, which is supposedly better than the earlier version. More details below.

Installation:

TensorFlow is installed as a Python module, just like any other module. tf packages are available as TensorFlow 1 and TensorFlow 2: tf 1 is the older tf, while tf 2 is the newer one.

TensorFlow 1: This is the original TensorFlow pkg (the one with lots of complaints). Version 1.0 of TensorFlow was released in 2017. The final version of TensorFlow 1 is 1.15. For TensorFlow 1.x, CPU and GPU packages are separate:

  • tensorflow==1.15 —Release for CPU-only
  • tensorflow-gpu==1.15 —Release with GPU support

TensorFlow 2: When Facebook released their own ML framework called PyTorch, it immediately started gaining ground against TF. By 2018, the popularity of TensorFlow had started declining, as PyTorch seemed more intuitive to people. So, Google made a major version release of TensorFlow, named TensorFlow 2, which was released in 2019. TensorFlow 2.0 introduced many changes, the most significant being eager execution, which changed the automatic differentiation scheme from the static computational graph to the "define by run" scheme originally made popular by Chainer and later PyTorch. Here CPU and GPU support are in one package.

Migration from TF1 to TF2:

We can write our code in TF 1, and then migrate that code to be able to run in TF 2, by applying very few changes to TF 1 code. This link shows how:

https://www.tensorflow.org/guide/migrate

Install TensorFlow 1:

TensorFlow gets installed by installing the "tensorflow" package with pip3. The documentation doesn't say which major version gets installed by running the install cmd. It looks like tf 2 gets installed by default, provided your system meets the requirements. However, tf 2 needs newer python and pip3 versions. A basic tf installation needs python 3.5 or greater and pip3. For tf 2 we need python 3.8 or greater (not sure??) and pip3 version 19.0 or greater. First check the versions, to find out if tf can even be installed or not, and if so, which major version.

$ python3 --version => returns Python 3.6.8 on my local m/c
$ pip3 --version => returns "pip 9.0.3 from /usr/lib/python3.6/site-packages (python 3.6)" on my local m/c

TensorFlow can be installed on CentOS 7 via following cmd in a terminal (assuming pip is installed).

sudo python3.6 -m pip install tensorflow => Type exactly as is. If you omit any of the options (i.e not using sudo, or not using python3.6), the cmd will give you a lot of errors, and won't be able to install tensorflow for python 3.6. After installing, check the python3.6 dir to make sure the package is there:

$ ls /usr/local/lib64/python3.6/site-packages/ => shows the following tensorflow-related new dirs. As can be seen, TensorFlow 1.14 got installed. Even though the last release of TensorFlow 1 is 1.15, we see that 1.14 got installed (and not the latest 1.15). This is probably due to some system requirement not being met (most likely the python3 and pip3 requirements for newer versions).

tensorflow-1.14.0.dist-info/

keras_applications/                
Keras_Applications-1.0.8.dist-info/   
keras_preprocessing/      
Keras_Preprocessing-1.1.2.dist-info/    

TensorFlow 1 vs TensorFlow 2: TensorFlow 2 is a radical departure from TensorFlow 1, and if you are planning to learn TensorFlow, don't even bother with tf 1. Just learn tf 2 and you will be saved from a lot of grief. However, we are going to learn tf 1, as that is what gets used in the Coursera Deep Learning courses. If we use tf 2, we may not be able to get our programming assignments to work (as they are written for tf 1), and tf is still cryptic enough that any bug is not easy to debug.

NOTE: Documentation on the google site doesn't mention exactly what version gets installed when you install tensorflow as above. Nor is there any documentation to help us understand how to install only tf 1 or tf 2. It just happened that tf 1 v1.14 got installed for me. From the tons of warnings that I receive from the installed version, it looks like not everything got installed the right way for tf 1. However, the warnings seemed benign, so I continued on. I didn't try to update python3 and pip3 to see if tf 2 would get installed. The latest python3 version may not be available on many linux distros, and updating may even break your other python applications. It's always a major risk to update python3 and pip3, so I would be careful about doing that on my system. So, let's live with tf 1 for now.

Syntax: Even though we have installed tf 1, all documentation on tensorflow.org refers to tf 2. My notes below are relevant for tf 1, but I'll highlight tf 2 wherever applicable.

TensorFlow is like any other module in Python. So, all cmds of python remain the same, except that we call functions/methods in tensorflow as "tf" followed by a dot and the function/method name (i.e tf.add(), just as we do for any other module in python). Some of these functions/methods have to be run a certain way though, which is the start of the mystic coding style of tf. We'll see details later.

After installing tf, run a quick pgm to see if everything got installed correctly. Name file as "test_tf.py".

#!/usr/bin/python3.6

import math
import numpy as np
import tensorflow as tf

print(tf.version) #print version

We see that on running the above pgm, we get tons of warnings as below: those are OK. Also, the version gets printed as the module path, which contains "v1". This indicates it's tf 1. For tf 2, we would see v2 instead of v1 (my hunch??).

/usr/local/lib/python3.6/site-packages/tensorflow/python/framework/dtypes.py:516: FutureWarning: Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'. => These are the warnings ...
  _np_qint8 = np.dtype([("qint8", np.int8, 1)])

.... and so on ....

<module 'tensorflow._api.v1.version' from '/usr/local/lib/python3.6/site-packages/tensorflow/_api/v1/version/__init__.py'> => this line shows "v1" implying it's tensorflow 1. So, our TF 1 got installed correctly !!
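To get just the version string instead of the module object, tf.__version__ should also work (a standard attribute in both tf 1 and tf 2; this is not shown in the run above):

print(tf.__version__)   # prints the version string, e.g. "1.14.0" for the install above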

Tensor Data Structure

TensorFlow is "flow of tensors". Tensors are used as the basic data structures in TensorFlow language. Tensors represent the connecting edges in any flow diagram called the Data Flow Graph. Tensors are defined as multidimensional array or list with a uniform type (called a dtype). If you're familiar with NumPy, tensors are (kind of) like np.arrays. Tensor objects are rep as tf.Tensor.

You can see all supported dtypes at tf.dtypes.DType(). Some of the dtype objects are float, int, bool and string (i.e tf.float16/32/64, tf.int16/32/64, tf.uint8/16/32/64, tf.bool, tf.string and a few more). These look the same as numpy dtypes (array1.dtype). So, there's no real difference b/w the two, except that one is for numpy, while the other one is for tf. Not sure why they had to define their own data types, when they are the same as the numpy data types.

NOTE: All tensors are immutable like Python numbers and strings: you can never update the contents of a tensor, only create a new one.

When writing a TensorFlow program, the main object that is manipulated and passed around is the tf.Tensor. A tf.Tensor object has a shape/rank (dimensions as rows, columns, etc. same as shape in numpy) and dtype (data type of tensor elements,  all of which need to be of the same data type). We operate on these tensor objects, i.e add, multiply, etc (just like what we do for any arrays).

Tensors can be of any rank (i.e dimension). See in python - numpy section for details on arrays.

Rank 0 Tensor: Rank 0 means it's a scalar and not an array. ex: 4. Higher rank tensors are arrays: rank 1 is a vector, rank 2 is a matrix, and so on.

Rank 1 Tensor: Rank 1 is an array with 1 dim. Ex: [2, 3, 4]

Rank 2 Tensor: Rank 2 is an array with 2 dim. Ex: [ [2.1, 3.4], [3.5, 4.0] ]

And so on for higher dim Tensors.

NOTE: commas are needed to separate individual elements, as in numpy arrays. numpy arrays can be used in many TensorFlow functions/methods. To carry out any computation on these tensors, such as matrix multiplication, matrix addition, etc., we use tf functions/methods instead of numpy functions/methods. Just as in numpy, where all elements of a given array have the same data type, the values present in a tensor hold an identical data type, with known dimensions of the array. So, tensors are the same as numpy arrays for all practical purposes.

ex: tensor_2d = np.array([(1,2,3,4),(4,5,6,7),(8,9,10,11),(12,13,14,15)]) => declares a 2D tensor that can be used in tensor operations. NOTE: tensor_2d is just a 2D numpy array.

Specialized Tensors: Constants, Variables, and Placeholders: We'll learn about how to create Tensors of different data types, and different ranks. There are many functions available to do this, but we'll look at 3 most important ones.

1. tf constants: A TensorFlow constant is the simplest category of Tensor. It is not trainable, and it can be of any rank. It is used to store constant values. The "constant" function is used to declare constants of any rank.

syntax: constant(value, dtype=None, shape=None, name='Const', verify_shape=False) => where value is the constant value that will be used; dtype is the data type of the value (float, int, etc.); shape defines the shape of the constant (it's optional); name defines the optional name of the tensor, and verify_shape is a Boolean that verifies the shape.

ex: L=tf.constant(10, name="length", dtype=tf.int32) => Defines constant 10, with name "length" and of type int32.

print("L=",L) => prints L= Tensor("length_1:0", shape=(), dtype=int32) => This shows that the object is a Tensor object with blank shape (since it's a scalar) and type int32. NOTE: it doesn't display the value of the constant. In tf 1, values are computed when a session is run. We'll learn about running sessions later. In TF 2, the data is printed right here, even w/o running a session (that is what the google tensorflow tutorials show).

ex: c = tf.constant([[4.0, 5.0, 1.2], [10.0, 1.0, 4.3]]) => Defines a rank 2 tensor. Since the type is not specified, it's automatically inferred from the contents of the tensor. Here the type is float32.

print("c=",c) => prints c= Tensor("Const_2:0", shape=(2,3), dtype=float32) => This shows the object is a Tensor of rank=2 with shape=(2,3). As expected, the type is inferred as float32, even though we never explicitly assigned it. The name here is "Const_2:0", since we didn't assign a specific name.

2. tf placeholders: A TensorFlow placeholder is used to feed data to the computation graph during runtime. Thus, it is used to take input parameters at runtime. We use the feed_dict argument to feed data into these tensors during the session run. How to do this is explained later. The "constant" function discussed above had a constant value assigned at the time of declaration, but here we assign the value when running the session. Declaration of a TensorFlow placeholder is done via the function "placeholder".

syntax: placeholder(dtype, shape=None, name=None) => Here, dtype is the data type of the tensor; shape defines the shape of the tensor, and name will have the optional name of the tensor.

ex: L2= tf.placeholder(tf.float32) => placeholder of type float32. Here we didn't define the shape, so a tensor of any shape can be stored into it.

print("L2=",L2) => prints L2= Tensor("Placeholder:0", dtype=float32). => Note: name here is "Placeholder:0" since we didn't specify a name.

ex: sess.run(L2, feed_dict = {L2: 3}) => This assigns a value of 3 to L2 during session run time. NOTE: We have to put L2 as 1st arg in sess.run to actually run Tensor "L2". If we don't do that, it will error out.

ex: L2= tf.placeholder(tf.float32, shape=(2,3)) => Here it's an array of rank=2. NOTE: the order of args has to be the same as defined in the syntax above, else it errors out.

print("L2=",L2) => prints L2= Tensor("Placeholder:0", shape=(2, 3), dtype=float32)

ex: sess.run(L2, feed_dict = {L2: [[2, 3, 1],[1, 2, 1]]}) => This assigns array values as shown to L2 during session run time.
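Putting these pieces together, a minimal end-to-end placeholder example might look like this (a sketch in TF 1 syntax; sessions are covered in detail below):

L2 = tf.placeholder(tf.float32, shape=(2, 3))     # placeholder for a rank-2 tensor
doubled = 2 * L2                                   # an operation on the placeholder

sess = tf.Session()
out = sess.run(doubled, feed_dict={L2: [[2, 3, 1], [1, 2, 1]]})   # values are fed at run time
print(out)     # [[4. 6. 2.]
               #  [2. 4. 2.]]
sess.close()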

3. tf Variables: These are used to store values that can change during operation. We can assign initial values, and they can store other values later. These are similar to variables in other languages. We use the function "Variable" to define a var. tf Variables act and feel like Tensors and are backed by tf.Tensor. Like tensors, they have a dtype and a shape. For most purposes, we can treat them as Tensors.

ex: Here we define a constant, and then create a var using that constant as the initial value. We don't define type and shape as they are automatically inferred. We don't define a name either.

my_tensor = tf.constant([[1.0, 2.0], [3.0, 4.0]]) => Here we define a constant
my_var = tf.Variable(my_tensor) => Here we defined a variable "my_var" which has the initial value defined by the constant above. Its shape is automatically inferred to be (2,2) and its type as float32.

print("my_var = ",my_var) => prints my_var = <tf.Variable 'Variable:0' shape=(2, 2) dtype=float32_ref> => NOTE: here it doesn't print "Tensor" but instead prints "tf.Variable", as Variables are not Tensor objects, but are backed by tf.Tensor. We have to explicitly convert them to tensors (by using the tf.convert_to_tensor func explained later), if any function requires a Tensor as i/p.

ex: my_var = tf.Variable([[1.0, 2.0], [3.0, 4.0]]) => This is exactly same as above. We just put the initial value of variable into the function itself.

ex: Var1 = tf.Variable(tf.zeros((1,2)), dtype=tf.float32, name="Var1") => Here we create a 2D tensor with shape=(1,2), named "Var1", that we init to 0. See the syntax of tf.zeros later.

print("Var1 = ", Var1) => prints Var1 =  <tf.Variable 'Var1:0' shape=(1, 2) dtype=float32_ref>

Non trainable variables: Although variables are important for differentiation, some variables will not need to be differentiated. You can turn off gradients for a variable by setting trainable to false at creation. An example of a variable that would not need gradients is a training step counter.

ex: step_counter = tf.Variable(1, trainable=False) # initial value is assigned to 1. However by declaring it as non-trainable, we prevent it from differentiation.

ex: Var=tf.Variable( tf.add(x, y), trainable=False) => Here we define variable "Var", which is the sum of tensors/arrays x and y. We don't init it to anything explicitly, because the init value is picked up from x + y.

Irrespective of whether we initialized variables or not, the actual initialization of these variables does NOT take place at the time of definition, but when we run the func "global_variables_initializer()".

ex: init= tf.global_variables_initializer() #Here we assigned this func to "init". Now, initialization takes place when we run init as "sess.run(init)"
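Putting the pieces together, here's a minimal sketch of defining, initializing and reading back a variable (TF 1 syntax):

my_var = tf.Variable([[1.0, 2.0], [3.0, 4.0]])    # variable with an initial value
init = tf.global_variables_initializer()          # op that will initialize all variables

sess = tf.Session()
sess.run(init)                 # actual initialization happens here
print(sess.run(my_var))        # [[1. 2.]
                               #  [3. 4.]]
sess.close()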

tf.get_variable() => Gets an existing variable with these parameters or creates a new one. Not sure about the diff b/w tf.Variable() and tf.get_variable(). Maybe here we get many more options for initialization, regularization, etc.

Syntax: tf.get_variable(name, shape=None, dtype=None, initializer=None, regularizer=None, ... many more options):

ex: tf.get_variable("W1", [2,3], initializer = tf.contrib.layers.xavier_initializer(seed = 1)) => This creates a var "W1" with shape=(2,3) and initializer set to Xavier initialization.

Difference b/w constants, placeholders and variables: constants are easy = their value remains fixed. Placeholders are like constants, but they allow us to change their values at run time so that we can run the pgm with many different values. Variables are like variables in any other pgm language => They allow us to store results of any computation.

shape, type, numpy: We can get the shape or type of any Tensor by using Tensor.shape, Tensor.dtype, etc (i.e my_var.shape, my_var.dtype). We can also convert Tensors to numpy by using Tensor.numpy() (i.e my_var.numpy() would print the array [[1.0, 2.0], [3.0, 4.0]]). However, in my installation, it gives an error => AttributeError: 'RefVariable' object has no attribute 'numpy'.
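That error is expected in TF 1 graph mode: .numpy() only works with eager execution, which is the default in TF 2. In TF 1, shape and dtype are still available as attributes, and sess.run() (after initialization) is the way to get the numpy array out. A small sketch:

my_var = tf.Variable([[1.0, 2.0], [3.0, 4.0]])
print(my_var.shape)    # (2, 2)
print(my_var.dtype)    # float32_ref in TF 1 (as seen in the variable repr earlier)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(my_var))    # numpy array [[1. 2.] [3. 4.]], instead of .numpy()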

TensorFlow programs: Once we have created tensors (constants, placeholders, variables), we can use these in TensorFlow programs. Making a tf pgm involves three components:

  • Graph: It is the basic building block of TensorFlow that helps in understanding the flow of operations.
  • Tensor: It represents the data that flows between the operations. Tensors are constants/variables that we created above. operations are add, multiply, etc. In the data flow graph, nodes are the mathematical operations and the edges are the data in the form of tensor, hence the name Tensor-Flow.
  • Session: A session is used to execute the operations. Session is the most important and odd concept in TF. More details later.

Writing and running programs in TensorFlow has the following steps:

  1. Create Tensors (constants, placeholders, variables, as shown above) that are not yet executed/evaluated.
  2. Write operations between those Tensors (i.e multiply, add, etc). These operations can be done via tf functions as add, mul, etc or by using plain +, * etc as these operators are overloaded in tf (since these are originally in python). We can also use numpy arrays as i/p to Tensor operators, as the arrays will automatically be converted into Tensors. When you specify the operations needed for a computation, you are telling TensorFlow how to construct a computation graph. We put them in computation graph, but we haven't run them yet. This is different than what we do in conventional programming, where computation is carried out, as soon as we write the operation. The computation graph can have some placeholders whose values you will specify only later.
  3. Initialize your Tensors. Constants are already initialized via func "constant()", but variables need to be initialized using func "global_variables_initializer()" shown above.
  4. Create a Session using function "Session()". A Session object encapsulates the environment in which Operation objects are executed, and Tensor objects are evaluated.
    • syntax: tf.Session(target= ' ', graph=None, config=None). Usually no args are provided, so, just call Session() and pass the handle to a var.
    • ex: sess=tf.Session().
  5. Run the Session, using the method "run()" on that session. By running the session we can get the values of Tensor objects and the results of operations. This will run the operations you'd written above. You have to specify the function/method inside run() to run that particular func. If you have defined placeholders, you need to assign their values here. When you run the session, you are telling TensorFlow to execute the computation graph.
    • syntax: run(fetches, feed_dict=None, options=None, run_metadata=None) => Runs operations and evaluates tensors in fetches. This method runs one "step" of TensorFlow computation, by running the necessary graph fragment to execute every Operation and evaluate every Tensor in fetches, substituting the values in feed_dict for the corresponding input values. options and run_metadata are not used for our purposes.
    • The fetches argument may be a single graph element, or an arbitrarily nested list, tuple or dict containing graph elements at its leaves. A graph element can be one of the following types (there are few more, but we list 2 that we mostly use: Operation and Tensor):
      • A tf.Operation. The corresponding fetched value will be None.
      • A tf.Tensor. The corresponding fetched value will be a numpy ndarray containing the value of that tensor. This is important to note that the fetched value is not Tensor but numpy ndarray.
    • The optional feed_dict argument allows the caller to override the value of tensors in the graph. Each key in feed_dict can be a tf.Tensor, the value of which may be a Python scalar, string, list, or numpy ndarray that can be converted to the same dtype as that tensor. Each value in feed_dict must be convertible to a numpy array of the dtype of the corresponding key.
    • The value returned by run() has the same shape as the fetches argument, where the leaves are replaced by the corresponding values returned by TensorFlow.
    • ex: a = tf.constant([10, 20]); b = tf.constant([1.0, 2.0])
      • v = sess.run(a) => Here a is evaluated. Since "fetches" arg is a single graph element of type Tensor, return value is numpy array [10, 20]
      • v = sess.run([a, b]) => Here a and b are evaluated. Since "fetches" arg is a list of 2 graph elements of type Tensor, return value is a list with 2 numpy array [10, 20] and [1.0, 2.0]
  6. Close the session, using method "close()" on that session. A session may own resources, so by closing the session, we release the resources.

So, why do we do these complicated steps of making a graph, and then running them via session? Most likely, this is to map these computations to different nodes of CPU/GPU/TPU, etc. We keep on defining various operations of the final graph (i.e add, mul, etc to calculate cost function), and then map it to various nodes of GPU/TPU. Once we've mapped these, then in the very last step, we simply provide i/p values to the i/p nodes, and processor can easily compute the node value for all nodes in the graph.

ex: Below is an example where we multiply 2 constants to get the result.

a = tf.constant(2)
b = tf.constant(10)
c = tf.multiply(a,b) # we can also write c=a*b, as operators are overloaded
print(c) # You will not see result for c=20, but instead get this Tensor => "Tensor("Mul:0", shape=(), dtype=int32)". 
#This says that the result is a tensor with an empty shape (i.e a scalar), of type "int32".

sess = tf.Session() # In order to actually multiply the two numbers, you will have to create a session and run it.
print("result = ",sess.run(c)) # Now, we run the session for computation graph c. we get the result 20.
print(c) # This will again not show 20, but show the Tensor.
#Reason is that return value of sess.run(c) has to be grabbed to get the value of c.
print("res =", sess.run(c**2)) # We could write any operation, i.e c**2, and it would compute c^2=400. This prints 400.
#Or o/p of sess.run(c) can be stored in a var and printing that var will show the value 20. i.e
tmp=sess.run(c)
print(tmp) # This will print value 20, as o/p is stored in tmp, which won't change now.
sess.close()
 

ex: Here we solve eqn y=m*x+c. Here y is computed for constant m, c and var x varying from 10 to 40.

m= tf.Variable( [2.7], dtype=tf.float32) #define m as var with initial value 2.7

C= tf.Variable( [-2.0], dtype=tf.float32) #define C as var with initial value -2.0

X=tf.placeholder(tf.float32) #X is defined as a placeholder, as its value is going to be assigned during session runtime later.

Y = m*X + C #we write the eqn directly instead of using func add, mul, as this involves single value computation

sess = tf.Session() #create session

init = tf.global_variables_initializer() #func to Initialize all the variables, as var initialization takes place only via this func

sess.run(init) #running session for init

print(sess.run( Y, feed_dict = {X :[10, 20, 30, 40]})) #Running session for computing Y. feed_dict is used to feed the X data. We see o/p as: [ 25.  52.  79. 106.]

sess.close()

In both the examples above, we see a lot of warnings related to many of these names being deprecated. This is because TensorFlow 2 (v2) is now released, so the earlier TensorFlow 1 (v1) names have been moved to tf.compat.v1.* (compatibility version v1) so as to not cause confusion. I see the warnings below (compat.v1 needs to be added to get rid of them). You can add these to get rid of the warnings. I haven't tried that yet (UPDATE: on trying that, a lot of other things broke, so not worth it to fix these warnings).

WARNING:tensorflow:From ./test_tf.py:14: The name tf.Session is deprecated. Please use tf.compat.v1.Session instead.
WARNING:tensorflow:From ./test_tf.py:25: The name tf.placeholder is deprecated. Please use tf.compat.v1.placeholder instead.
WARNING:tensorflow:From ./test_tf.py:28: The name tf.global_variables_initializer is deprecated. Please use tf.compat.v1.global_variables_initializer instead.

Running Sessions: We ran session above using one of the ways to run sessions. There are actually 2 ways to run sessions:

Method 1: This is the method we used above

sess = tf.Session()
# Run the variables initialization (if needed), run the operations
result = sess.run(..., feed_dict = {...})
sess.close() # Close the session

Method 2: This is the method that is more concise (requires less lines of code)

with tf.Session() as sess: 
    # run the variables initialization (if needed), run the operations
    result = sess.run(..., feed_dict = {...})
    # This takes care of closing the session for you :)
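As a concrete version of Method 2, the multiplication example from earlier could be written like this (a sketch; same result as before):

a = tf.constant(2)
b = tf.constant(10)
c = tf.multiply(a, b)

with tf.Session() as sess:
    result = sess.run(c)     # session is closed automatically when the block exits
print("result =", result)    # result = 20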

TensorFlow Functions: We'll look at a few important functions used in tf. Many functions take i/p as Tensors or numpy ndarrays with no issues.

tf.one_hot() => This returns a one-hot tensor (a tensor is an array or list). One hot is used very widely in AI for multi class classification. Here we have an o/p vector Y which has a classification for each i/p vector. As an example, consider a picture which can be cat, dog, mouse, or others. So, a given picture can be any of these 4 classes. We give these classes numbers as cat=0, dog=1, mouse=2, others=3. In AI, we write the o/p vector for 6 different pictures as Y=[1 3 0 2 0 2] => This implies the 1st picture is a dog (class=1), the 2nd picture is others (class=3), and so on. However, we can't use this Y vector directly in NN equations, as we need to write it in a form which says whether each picture is cat/not-cat, dog/not-dog, mouse/not-mouse, others/not-others. This is the same form as what we wrote for 2 class classification, which said whether a picture is cat or not-cat.

To write in above form, we need to have 4 cols for each picture, each of which says whether it's cat/not-cat, dog/not-dog, mouse/not-mouse, other/not-others. 

So, Y(1-hot) =

[ [ 0 1 0 0 ]     => 1st row is for picture 1: not-cat, is dog, not-mouse, not-others (i.e it's a dog picture, written in 1-hot form: 1 for dog, 0 for the rest)
  [ 0 0 0 1 ]
  [ 1 0 0 0 ]
  [ 0 0 1 0 ]
  [ 1 0 0 0 ]
  [ 0 0 1 0 ] ]   => 6th row is for picture 6: not-cat, not-dog, is mouse, not-others (i.e it's a mouse picture: 1 for mouse, 0 for the rest)

syntax: tf.one_hot(indices, depth, on_value=None, off_value=None, axis=None, dtype=None, name=None) => Of all the args, the important ones are indices and depth. indices is a tensor (or an array or list); the locations represented by the values in indices take value on_value (default=1), while all other locations take value off_value (default=0). depth is a scalar (i.e a single number) that defines the depth of the one hot dimension (i.e the number of classes).

If indices is a scalar the output shape will be a vector of length depth.

If the input indices is rank N, the output will have rank N+1. The new axis is created at dimension axis (default is -1: which means the new axis is appended at the end). If indices is 1D (i.e rank=1), then for axis=-1, the shape of output is (length_of_indices X depth), while for axis=0, the shape of output is (depth X length_of_indices)

ex:

indices = np.array([1,2,3,0,2,1]) #Here indices is a 1D array with rank=1

depth=4 #depth is needed since we don't know how many total classes we have for classification. indices may not contain all the classes.

one_hot = tf.one_hot(indices, depth) #one_hot is a 2D tensor of shape (length_of_indices, depth) => since axis is not specified, it defaults to -1.

print("one_hot = \n", one_hot) => This prints the Tensor w/o computing it. So, it prints => Tensor("one_hot:0", shape=(6, 4), dtype=float32). We need to run session in order to compute the graph.

sess = tf.Session() #create session

one_hot = sess.run(tf.one_hot(indices, depth))

sess.close() #session can be closed once computation graph has run.

print ("one_hot = \n", one_hot) => This prints the one_hot tensor vector as below. one_hot is a 2D tensor, with shape (6,4)

one_hot = 
[[0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]
 [1. 0. 0. 0.]
 [0. 0. 1. 0.]
 [0. 1. 0. 0.]]

tf.zeros() / tf.ones() => These functions initialize a tensor to zeros or ones. They take in a shape and return an array of that shape full of zeros or ones respectively. Here shape is a list of integers, a tuple of integers, or a 1-D Tensor. These are the same as numpy np.zeros / np.ones, except that in numpy the shape is given as a tuple (a,b), while in tf, the shape can also be given as a list or 1D tensor [a,b].

ex: tf.zeros((1,2)) => Here we create 2D tensor with shape=(1,2). NOTE: we specified shape in tf.zeros as (1,2) which is same syntax as numpy. However, we can specify shape as array too, i.e tf.zeros([1,2]). This is what you will see used more commonly.

ex: tf.ones([2, 3], tf.int32) => This returns 2D tensor of shape(2,3) =  [[1, 1, 1], [1, 1, 1]]

ex: tf.zeros([3]) => This returns 1D tensor of shape (3,) = [0. 0. 0.]

tf.convert_to_tensor() => This converts Python objects of various types to Tensor objects. It accepts Tensor objects, numpy arrays, Python lists, and Python scalars. "tf.Variable", which is not a Tensor object, is converted into a Tensor type by using this func.

syntax: tf.convert_to_tensor(value, dtype=None, dtype_hint=None, name=None) => This converts "value" into a tensor. dtype is the element type for the returned tensor. If missing, the type is inferred from the type of value.

ex: y1 = tf.convert_to_tensor([[1.0, 2.0], [3.0, 4.0]]) => This converts the array into a tensor object.

print("y1=", y1) => returns "y1= Tensor("Const_4:0", shape=(2, 2), dtype=float32)" => this shows that it's a Tensor now, with type float32 inferred from the i/p type.

tf.train.GradientDescentOptimizer(learning_rate = 0.005).minimize(cost) => tf.train.GradientDescentOptimizer is an Optimizer that implements the gradient descent algorithm, using the specified learning rate. It has a method "minimize" which adds operations to minimize the loss by updating var_list. The minimize() method simply combines calls to compute_gradients() and apply_gradients(). This function with its method is called in TensorFlow to do back propagation and the parameter update for 1 iteration on the "loss" equation. We iterate over it multiple times to get the optimal "weights" that give the lowest loss.

syntax of minimize: minimize(loss, var_list=None) => loss is a Tensor containing the value to minimize. var_list is an Optional list or tuple of "tf.Variable" objects to update to minimize loss. Defaults to the list of variables collected in the graph under the key GraphKeys.TRAINABLE_VARIABLES.

For Adam optimizer, we can use AdamOptimizer.

ex: optimizer = tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)
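Here's a minimal sketch of how this optimizer pattern is used (TF 1 syntax), with a hypothetical toy cost (w - 5)^2 instead of a real NN cost, just to show that each sess.run(optimizer) call performs one update step; the same pattern applies with AdamOptimizer:

w = tf.Variable(0.0, dtype=tf.float32)           # the parameter to optimize
cost = tf.square(w - 5.0)                        # toy cost, minimized at w = 5
optimizer = tf.train.GradientDescentOptimizer(learning_rate = 0.1).minimize(cost)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for i in range(100):
        sess.run(optimizer)                      # one gradient descent update per call
    print(sess.run(w))                           # close to 5.0 after enough iterations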

tf.nn.softmax_cross_entropy_with_logits => This function computes the softmax cross entropy between logits and labels. Logits are the o/p of the last nn layer, before it feeds into the exponential (softmax) function. So, Z[L] is the logit. It's a matrix of shape (c,m), where c=num of classes, m=num of examples. For a binary classifier it feeds into the sigmoid function to yield a[L]. For a multi class classifier, it feeds into the exponent function to yield a[L], which is a matrix of the same shape as Z[L]. Labels is a matrix of the same shape as Z[L]. Softmax cross entropy is the loss function defined in the AI section, i.e Loss(Y, Yhat) = - ∑ Yj * loge(Yhat(j)), where Yhat = a[L] and Y is the output labels vector. Backpropagation will happen into both logits and labels.

syntax: tf.nn.softmax_cross_entropy_with_logits(labels, logits, axis=-1, name=None) => It returns a Tensor that contains the softmax cross entropy loss. Its type is the same as logits and its shape is the same as labels except that it does not have the last dimension of labels. So, loss returned is a vector where each entry is for each example.

ex: Here logits and labels are (2,3) matrices. As per the syntax, logits and labels are transposed relative to Z[L], so the shape of logits and labels feeding into this function is (m,c). So, below we have data for 2 examples and 3 classes. Labels don't need to be one-hot; they can be probability values that add up to 1.

logits = [[4.0, 2.0, 1.0], [0.0, 5.0, 1.0]] => Here Z[L] for 1st example is [4,2,1], while for 2nd example it's [0,5,1]
labels = [[1.0, 0.0, 0.0], [0.0, 0.8, 0.2]] => Here the probabilities of the 3 classes for the 1st example are 1, 0, 0, while for the 2nd example they are 0, 0.8, 0.2.
print(tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits))
Tensor("softmax_cross_entropy_with_logits_sg/Reshape_2:0", shape=(2,), dtype=float32) => o/p shape is 1D vector with 2 entries, 1 for each example
print("y=",sess.run(tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)))
y= [0.16984604 0.82474494] => This is the loss value computed for the 2 examples.

This is how it's computed. NOTE: log is with base e (i.e it's ln and NOT log with base 10):
Loss for 1st example = -( 1.0*log(e^4/(e^4+e^2+e^1)) + 0.0*log(e^2/(e^4+e^2+e^1)) + 0.0*log(e^1/(e^4+e^2+e^1)) ) = -( 1*log(54.6/(54.6+7.4+2.7)) + 0 + 0 ) = -log(54.6/64.7) = -(-0.17) = 0.17
Loss for 2nd example = -( 0.0*log(e^0/(e^0+e^5+e^1)) + 0.8*log(e^5/(e^0+e^5+e^1)) + 0.2*log(e^1/(e^0+e^5+e^1)) ) = -( 0 + 0.8*log(148.4/(1+148.4+2.7)) + 0.2*log(2.7/(1+148.4+2.7)) )
= -( 0.8*log(148.4/152.1) + 0.2*log(2.7/152.1) ) = -( 0.8*(-0.025) + 0.2*(-4.03) ) = 0.825

This computation matches closely with what's computed by the softmax function.
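The same calculation can be reproduced with a few lines of numpy (a sketch just to verify the hand computation above; this is not a tf function):

import numpy as np

logits = np.array([[4.0, 2.0, 1.0], [0.0, 5.0, 1.0]])
labels = np.array([[1.0, 0.0, 0.0], [0.0, 0.8, 0.2]])

e = np.exp(logits)
softmax = e / e.sum(axis=1, keepdims=True)          # softmax per example (per row)
loss = -np.sum(labels * np.log(softmax), axis=1)    # cross entropy per example
print(loss)                                          # approx [0.1698 0.8247]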
 

tf.reduce_mean: This computes the mean across all entries of an array (same as numpy np.mean). This is used in conjunction with the softmax function above to calculate the final cost. The final cost is the mean of all the losses (i.e the sum of the losses for the "m" examples, divided by "m").

ex: tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=labels, logits=logits)) => This returns y= 0.4972955, which is the mean of the entries in the array returned above = (0.17+0.82)/2 ≈ 0.497

 --------

LINKS:

This section is for putting random links to websites, articles or anything else that have been useful or interesting to me:


General Websites:

www.wikipedia.org: => Number 1 website for all my learning needs. Whether it's mathematical, geographical, complex advanced science material, wikipedia has always given me the best material to start with.

www.slickdeals.net: => This is the number 1 website for all your deals. However, you may not want to buy anything by clicking the link from slickdeals, as you don't make any cashback from there. There are cashback websites that give you cash for buying things on the internet, so use those for making money. Two reliable ones that I use are topcashback.com and rakuten.com

www.doctorofcredit.com => This is another very good website for finding any deal that makes you money. It's different from slickdeals, in that it puts up all deals (financial, credit cards, cashback websites, etc) that are not necessarily sponsored. You will never find these kinds of deals on slickdeals unless they are sponsored. The comments on this site also help a lot, in helping you decide if you should pursue a deal or not.

www.stallman.org => This is the site of Richard M Stallman (rms), who started the free software movement (Free Software Foundation and GNU project). "Basic human freedom" is the cornerstone of his views.


Educational sites:

3blue1brown: I would have never known about this site, had I not run into it during an AI video search. It's a channel+website started by a Stanford guy named Grant Sanderson. Absolutely amazing !! If you can find your topic in his videos, you don't need to watch any other video for that topic, that's how good they are. Lots of topics on maths, AI, crypto, etc. I'm not even sure how 1 person can have absolute expertise in such unrelated fields. Learning a lot:

Personal site with videos: https://www.3blue1brown.com

You Tube Video channel (named 3blue1brown): https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw


Puzzles:

Maths puzzles for kids: https://www.weareteachers.com/10-magical-math-puzzles/

Google interview puzzle about finding fastest 3 horses among 25 horses in minimum number of races => https://www.youtube.com/watch?v=i-xqRDwpilM 


Random Articles on web:

https://getpocket.com/explore/item/the-feynman-technique-the-best-way-to-learn-anything => Very nice technique discussed to learn something = break complex things into simplest things

https://getpocket.com/explore/item/whoa-this-is-what-happens-to-your-body-when-you-drink-enough-water => benefits of drinking lots of water. Make sure you drink at least 2 litres of water everyday. 1 litre of water should be drunk every morning on empty stomach after you wake up, and before you use the restroom.

https://www.propublica.org/article/what-are-2020s-tax-brackets-and-will-i-get-audited => Tax details

https://getpocket.com/explore/item/mental-models-how-to-train-your-brain-to-think-in-new-ways => mental models to train your brain

https://getpocket.com/explore/item/indian-employers-are-stubbornly-obsessed-with-elite-students-and-it-s-hurting-them?utm_source=pocket-newtab => somewhat interesting take on hiring under performers from non elite colleges in India 

https://getpocket.com/explore/item/work-stress-how-the-42-rule-could-help-you-recover-from-burnout  => amount of rest your body needs is 10 hr/day.

https://getpocket.com/explore/item/why-being-lazy-and-procrastinating-could-make-you-wildly-successful => How laziness and procrastination is so awesome !! My Favorite article. Let me go back to ...


YouTube Channels:

Robert Greene: Channel => https://www.youtube.com/@RobertGreeneOfficial

Andrew Huberman: A neuroscience professor from Stanford. Tons of podcasts. Channel => https://www.youtube.com/@hubermanlab


Course 2 - week 3 - Hyperparameter tuning, Batch Normalization and Programming Frameworks

This week's course is divided into various sections. The first 2 sections are a continuation of the previous week's material. The last section is about using programming frameworks, which is totally new and will require some time to understand.

Hyperparameter tuning:

There are various hyper parameters, which we saw in previous sections, that need to be tuned for our NN. In order of their effect on NN performance, these parameters are:

1. learning rate (alpha): The most important hyper parameter to tune. Not choosing this value properly may cause large oscillations in the cost function.

2. Mini batch size, number of hidden units and momentum (beta): These are second in importance.

3. Number of layers (L), learning rate decay, Adam parameters (beta1, beta2, epsilon): These are last in importance. Adam hyperparameters (beta1=0.9, beta2=0.999, epsilon=10^-8) are usually not tuned, as these values work well in practice.

It's hard to know in advance what hyper parameter values will work, so we try random values of these hyper parameters from within a bounding box (maybe changing 2 at a time, or 3, or even more at a time). Once we find a smaller bounding box where the hyper parameters seem to perform better, we use the "coarse to fine" technique to start trying finer values, until we get close to the optimal hyperparameters.

We need to choose the scale on which to sweep the hyper parameters very carefully, so that we cover the whole range. For ex, to sweep learning rate alpha, we sweep it on a log scale from 0.0001 (10^-4) to 1 (10^0), i.e in steps of x10 (10^-4, then 10^-3, then 10^-2, then 10^-1 and finally 10^0).
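A common way to do this sampling uniformly on a log scale is the following small numpy sketch (the bounds match the 10^-4 to 10^0 range above):

import numpy as np

r = -4 * np.random.rand()    # r is uniform in [-4, 0]
alpha = 10 ** r              # alpha is log-uniformly distributed in [1e-4, 1]
print(alpha)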

 There are 2 approaches to hyper parameter tuning:

1. caviar approach: We use this if we have a lot of computing resources available. We train many NN models in parallel with different hyper parameter settings, and see which ones work. The name comes from fish that lay huge numbers of eggs (caviar): they don't look after any single one, and just let the best ones survive.

2. panda approach: Here, we just run one NN model with a set of hyperparameters. However, as time passes, we tune the hyperparameters, see if they are making the performance of the NN better or worse, and keep adjusting them every day or so. So, here we babysit just 1 model, similar to what pandas do with their babies: they don't produce many babies, but keep watching over their one baby with all their effort to make it stronger.

 

Batch Normalization:

Here, we normalize inputs to speed up our NN. We subtract the mean from the inputs, and then divide by the standard deviation (the square root of the variance). That way the inputs get more uniformly distributed around a center, which makes our cost function more symmetric, resulting in faster convergence when finding the minima.

For a deep NN, we can normalize inputs to each layer. Input to each layer is o/p of activation func, a[l]. However, instead of normalizing a[l], we normalize Z[l].

μ = 1/m * Σ Z[i]

σ² = 1/m * Σ (Z[i] - μ)²

Znorm[i] = (Z[i] - μ) / √(σ² + ε)

Now instead of using Z[i] in our previous NN eqn, we use Znorm[i] which is the normalized value.

If we want to be more flexible in how we want to use Z[i], we may define learnable parameters gamma and beta, which allows the model to choose either raw Z[i], normalized Znorm[i] or any other intermediate value of Z[i]. This can be achieved by defining new var Z˜[i] (Z tilde)

Ztilde[i] = γ*Znorm[i] + β => by changing the values of gamma and beta, we can get any Ztilde[i]. For ex, if γ=1 and β=0, then Ztilde[i] = Znorm[i]. If γ=√(σ²+ε) and β=μ, then Ztilde[i] = Z[i]

Since gamma and beta are learnable parameters (just like weights), we really don't have to worry about the most optimal values of gamma and beta. The gd algo will choose the values that give the lowest cost for our cost function. Note that each layer has its own gamma and beta, so they can be treated just like weights for each layer. gd now calculates γ[l] and β[l], on top of W[l] and b[l]. However, since we are normalizing (subtracting the mean), b[l] gets cancelled out, so we can omit b[l]. So, we have 3 parameters to optimize for each layer l: γ[l], β[l] and W[l]. We can extend this to the mini batch technique too, with all gd algorithms like momentum, adam, etc.
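A small numpy sketch of the batch norm transform for one layer's Z (an illustration of the formulas above with made-up values; in practice gamma and beta are learned per layer):

import numpy as np

def batch_norm(Z, gamma, beta, eps=1e-8):
    # Z has shape (units, m): one column per example in the (mini) batch
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + beta          # Z_tilde

Z = np.array([[1.0, 2.0, 3.0],
              [10.0, 20.0, 30.0]])
print(batch_norm(Z, gamma=1.0, beta=0.0))   # each row now has mean 0 and variance ~1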

Batch norm works because it makes NN computation more immune to covariate shifts. The i/p data and all other intermediate i/p data are always normalized. It ensures that mean and variance of all i/p will remain the same, no matter how the data moves. This makes these values more stable even if i/p shifts.

Multi Class classification:

Binary classification is what we have used so far, which classifies any picture into just 2 outcomes = cat vs non-cat. However, we can have multi class classification, where o/p layer produces multiple outputs, i.e if the picture is cat, dog, cow or horse (known as 4 class classification). It outputs the final probability of each of the classes, and the sum of these probabilities is 1.

Here the o/p layer L, instead of generating 1 o/p, generates multiple o/p values one for each class. So, the o/p layer Z[L], instead of being a 1x1 matrix in binomial classification, is a Cx1 matrix now for multi class classification, where C is the number of classes to classify. Previously activation function for o/p layer a[L] used to be sigmoid function which worked well for binomial classification. However, now with multi class classification, we need a different activation function for o/p layer. We choose activation function to be exponent function normalized by sum of exponents.

For 2 class classification, we use the sigmoid func:

Sigmoid function σ(z) = 1/(1+e^-z) = e^z/(1+e^z)

prob for being in class 0 = yhat = σ(z) and

prob for being in class 1 (not in class 0, or class=others) = 1 - yhat = 1 - σ(z) = 1/(1+e^z)

We generalize above eqn for C classes. We use exponent func in o/p layer (also called as softmax layer)

exponent func = e^zk/(e^z1 + e^z2 + ... e^zc) where C is the number of classes, k is the kth class

prob for being in class 0 = yhat[0] =  e^z1/(e^z1 + e^z2 + ... e^zc) 

prob for being in class 1 = yhat[1] =  e^z2/(e^z1 + e^z2 + ... e^zc) 

...

prob for being in class C-1 = yhat[c] =  e^zc/(e^z1 + e^z2 + ... e^zc) 

So, probabilities all add up to 1. Matrix a[L] or yhat is CX1 matrix

For C=2, multiclass reduces to binary classification. For implementation of multi class, the only difference in algo would be to compute o/p layer differently, and then do back prop.

For 2 class, if we set z2=0 (so that e^z2=1), then we get

prob for being in class 0 = yhat[0] =  e^z1/(e^z1 + e^z2) = e^z1/(e^z1 + 1) 

prob for being in class 1 = yhat[1] =  e^z2/(e^z1 + e^z2) = 1/(e^z1 + 1) 

Which is exactly what we got by using our sigmoid function earlier. so, exponent func and sigmoid func in o/p layer give the same result, implying sigmoid was just a special case of exponent func.
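A quick numpy illustration of the softmax activation for a single example (made-up values for Z[L], with C=4 classes):

import numpy as np

z = np.array([4.0, 2.0, 1.0, 0.5])       # Z[L] for one example, C=4 classes
yhat = np.exp(z) / np.sum(np.exp(z))     # softmax: probability for each class
print(yhat)                               # largest probability goes to the largest z
print(yhat.sum())                         # 1.0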

NOTE: In binary classification, we had an extra function at the o/p which converted yhat to 0 or 1 depending on whether its value was greater than 0.5 or not. That is called hard max. Here in multi class classification, we don't have that extra function; we just stop once we get the probabilities of each class. This is called softmax.

Logits: In multi class classification, computed vector Z = [Z1, Z2 ... ZC] are called logits. The shape of logits is (C,m) where C=number of classes, m=number of examples

Labels: In multi class classification, given vector Y = [Y1, Y2 ... YC] are called labels. Each Y1, Y2 is 1 hot, so it has C entries, instead of just one entry. The shape of labels is same as that of logits i.e shape = (C,m)

Cost eqn: For multiclass classification, the loss function is the same cross entropy loss as for binary classification, generalized over C classes: Loss(Y, Yhat) = - ∑ Yj * loge(Yhat(j)), where the sum runs over the C classes (this is the softmax cross entropy function described in the Python-TF section). The final cost is the mean of this loss over all m examples.

Programming Frameworks

Instead of writing all these NN functions ourselves (forward prop, backprop, adam, gd, etc), we have NN frameworks, which provide all these functions for us. TensorFlow is one such framework. We'll use TensorFlow in Python for our exercises. You can get introductory material for TensorFlow, including installation, in the "python - tensorflow" section. Once you've completed that section, come back here.

Programming Assignment 1: Here we have 2 parts. In the 1st part, we learn the basics of tensorflow (TF), while in the 2nd part, we build a NN using TF.

Here's the link to pgm assigment:

TensorFlow_Tutorial_v3b.html

This project has 3 python pgms that we need to understand.

A. tf_utils.py => this is a pgm that defines the following functions, which are used in our NN model later:

tf_utils.py

  • load_dataset() => It loads test and training data from h5 files, similar to the function used in section "1.2 - Neural Network basics - Assignment 1". The only difference is that Y is now a number from 0 to 5 (6 numbers), instead of being a binary number - either 0 or 1. This is because we are doing a multi class classification here. Each picture is a sign language picture representing number 0, 1, 2, 3, 4 or 5.
    • Training set: 1080 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (180 pictures per number). X after flattening is 2D vector with shape = (12288, 1080), while Y after flattening is 2D vector with shape = (1, 1080)
    • Test set: 120 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (20 pictures per number). X after flattening is 2D vector with shape = (12288, 120), while Y after flattening is 2D vector with shape = (1, 120)
  • random_mini_batches() => This creates a list of random mini batches from the set X,Y. These random mini batches are shuffled, and each has the size specified in the argument.
  • convert_to_one_hot(Y, C) => This returns a 1 hot matrix for given o/p vector Y, and for "C" classes. 1 hot vector is needed for multi class classification.
  • predict(X, parameters) => Given i/p picture X, and optimized weights, it returns the prediction Yhat, i.e what number from 0 to 5 is the picture representing
  • forward_propagation_for_predict(X, parameters) => Implements the forward propagation for the model. It returns Z3, which is the o/p of the last linear unit (before it feeds into the softmax function to yield a[3])

B. improv_utils.py => this pgm is not used anywhere, so you can ignore it. This is a pgm that defines all the functions that are used in our NN model later. It has all the functions that were in tf_utils.py, as well as all the funcs that are going to be defined in test_cr2_wk3.py. So, this pgm is basically the solution to the assignment, as all the functions we are going to write in our assignment later are already written here. You should not look at this pgm at all, nor should you use it (unless you want to check your work after writing your own functions).

improv_utils.py

C. test_cr2_wk3.py => Below is the whole pgm,

test_cr2_wk3.py

This pgm has 2 parts to it. In 1st part, we just explore TF library, while in 2nd part, we write the actual NN model using TF.

Part 1: This is where we explore the TF library. All i/p and o/p of these examples are Tensor data. NOTE: we don't use any of these functions below in the NN model that we build in part 2. This is just for practice.

  • compute loss eqn: a simple loss eqn value is computed by creating a TF variable for the loss.
  • multiply using constant: multiplying 2 constant numbers and printing result.
  • multiply using placeholder: Here we feed value into placeholder at runtime, and compute 2*x.
  • linear_function(): Here we compute Y=W*X+B, where W,X,B and Y are all Tensor vectors (i.e Matrices) of a pre determined shape
  • sigmoid(z): Given i/p z, compute the sigmoid of z.
  • cost(logits, labels): This computes the cost using the tf func "tf.nn.sigmoid_cross_entropy_with_logits()". This calculates cost= - ( y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z)) ). It returns a vector with 1 entry for each logits/label pair; when you have "m" examples, the summation and mean are then computed over them. However, in the NN model that we build later, we'll be using "tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits())", which works for multiclass classification. This func is explained in the Python-TF section. A small usage sketch is shown after this list.
  • one_hot_matrix(labels, C): This returns a 1 hot matrix for given labels, and for "C" classes.
  • ones(shape) => creates a Tensor matrix of given shape, and initializes it with all 1.
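Here's the small usage sketch for the sigmoid cross entropy function referenced in the cost() bullet above (TF 1 syntax; the logits/labels values are made up):

z = tf.constant([0.2, 0.4, 0.7, 0.9])    # logits (o/p of the last linear layer)
y = tf.constant([0.0, 0.0, 1.0, 1.0])    # labels
loss = tf.nn.sigmoid_cross_entropy_with_logits(logits=z, labels=y)

with tf.Session() as sess:
    print(sess.run(loss))                    # one loss value per (logit, label) pair
    print(sess.run(tf.reduce_mean(loss)))    # mean loss over all the pairs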

 

Part 2: This is where we build a neural network using tensorflow. Our job here is to identify numbers 0 to 5 from sign language pictures. We implement a 3 layer NN. The model is LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX.

Below are the functions defined in our pgm for part 2:

  • create_placeholders() => creates placeholders for i/p vector X and o/p vector Y
  • initialize_parameters() => initializes w,b arrays. W is init with random numbers, while b is init with 0.
  • forward_propagation(X, parameters) => Given X, w, b, this func calculates Z3 instead of a3 (Z3 is the output of the last NN layer, which feeds into the softmax (exponent) function)
  • compute_cost(Z3, Y) => This computes cost (which is the log function of A3,Y). A3 is computed from Z3, and cost is calculated as per loss eqn for softmax func. We use following TF func for computing cost: tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = ..., labels = ...)). logits=Z3, while labels=Y.
  • backward propagation and parameter update: This is done using the TF func "tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)". This is explained in the TF section. The notes talk about the TF func "tf.train.GradientDescentOptimizer", but we use "tf.train.AdamOptimizer" for our exercise. This func is instantiated directly within the model, as it's a built-in func (not a user defined func that we need to write ourselves)
  • predict() => Given an input picture array X, it predicts Y (i.e which number from 0 to 5 the picture represents). It uses the w,b calculated by the optimization. We can provide a set of "n" pictures here in a single array X (we don't need to provide each pic individually as an array). This is done for efficiency, as Prof Andrew explains multiple times in his courses.
  • model() => This is the NN model that will be called in our pgm. We provide both training and test pictures as 2 big arrays as i/p to this func. This model has 2 parts: first it defines the functions, and then it runs (calls) them inside a session. These are the 2 parts:
    • Define the functions (which call the funcs above) as shown below:
      • defines func create_placeholders()  for X,Y.
      • defines func initialize_parameters() to init w,b randomly
      • Then it defines forward_propagation() to compute Z3
      • Then it defines compute_cost() to compute the total cost given Z3 and the o/p labels Y.
      • then it defines the TF func "tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)" to do backward propagation and update the parameters for 1 iteration, optimizing w,b to give the lowest cost across the training set.
      • defines an init func "tf.global_variables_initializer()". This is needed to init all variables. See the TF section for details
    • Now it creates a session, forms a loop, and calls the above functions
      • start the session. Inside the session, run these steps:
        • Run the init func defined above => "tf.global_variables_initializer()".
        • Make a loop and iterate the funcs below for "num_of_epoch" times. It's set to 1500. We will change it to 10,000 too and see the impact on accuracy.
          • Form minibatches of X,Y and shuffle them
          • iterate thru each minibatch
            • call the two funcs "tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)" and "compute_cost" for each minibatch. Here, we don't strictly need to explicitly call "compute_cost", since running "minimize(cost)" will evaluate the cost anyway. The reason we do it is that we want the "cost" o/p returned by the "compute_cost" func for our plotting of cost vs iterations.
            • We add the cost from each iteration to "total_cost" and divide it by number of examples, to get avg cost
        • Now plot "total_cost" vs "number of iterations". This shows how the cost goes down as we iterate more and more.
        • Now it runs "parameters" node again to get values of parameters. NOTE: that running "parameters" again doesn't run func "initialize_parameters() " again, but instead just returns the computed values for that node
        • It then calls tf functions to calculate prediction and accuracy for all examples in test and training set. Accuracy is then reported for all pictures on what they actually were vs what our pgm predicted.

 Below is the explanation of main code (after we have defined our functions as above):

  1. We get our datset X,Y by calling load_dataset().
  2. Next we can enter index of any picture, and it will show the corresponding picture for our training and test set. This is for our own understanding. Once we have seen a few pictures, we can enter "N" and the pgm will continue.
  3. Now we flatten the returned array and normalize it. We also use the "one_hot" function to convert each label from a single entry into a one-hot vector, since our labels need to be in one-hot format for our softmax func to work (see the small sketch after this list).
  4. Now we call our function model() defined above. We provide X,Y training and testing arrays (which are not Tensors, but numpy arrays). We see that these numpy arrays are used as Tensor i/p to many functions above. It still works, as conversion from numpy to Tensors takes place automatically when needed.
  5. In the above exercise, we used a 3 layer NN, with a fixed number of hidden units for each layer. We ran it once with 1500 epochs and once with 10,000 epochs to see the impact on accuracy (results below).
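
As an illustration, here is a small sketch of the flatten / normalize / one-hot steps. The shapes and class count below are assumptions for illustration, not necessarily the exact ones from the dataset:

import numpy as np
import tensorflow as tf   # TensorFlow 1.x style API assumed

# Dummy data standing in for the real dataset: 5 RGB pictures of 64x64, labels 0..5
X_train_orig = np.random.randint(0, 256, size=(5, 64, 64, 3))
Y_train_orig = np.array([[0, 5, 3, 1, 2]])

# Flatten each picture into one column, then normalize pixel values to [0, 1]
X_train = X_train_orig.reshape(X_train_orig.shape[0], -1).T / 255.0   # shape (12288, 5)

# Convert integer labels into one-hot columns using tf.one_hot
def one_hot_matrix(labels, C):
    with tf.Session() as sess:
        return sess.run(tf.one_hot(labels, depth=C, axis=0))

Y_train = one_hot_matrix(Y_train_orig.reshape(-1), C=6)   # shape (6, 5)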

Results:

On running above pgm, we see these results:

  • On running the above model with 1500 iterations, we get a training accuracy of 70%.
    • Cost after epoch 0: 1.913693
    • Cost after epoch 100: 1.049044 .... => If you get "Cost after epoch 100: 1.698457", that means you are still using GradientDescentOptimizer". Switch to "tf.train.AdamOptimizer".
    • Cost after epoch 1400: 0.053902
  • When we increase the number of iterations to 10,000, our training accuracy goes to 89%. See how cost keeps on going down and then kind of flattens out.
    •  Cost after epoch 1400: 0.053902 ...
    • Cost after epoch 2500: 0.002372
    • Cost after epoch 5000: 0.000091
    • Cost after epoch 9900: 0.000003

 

Programming Assignment 2: This is my own programming assignment. It's not part of the lecture series. Here, I took an example from one of the earlier programming assignments, and rewrote it using TF to see if I could do that. It does work, but not sure if everything is working correctly (as the cost is different from previous assignment, and there is no way to verify the accuracy)

test_cr2_wk2_tf.py => Below is the whole pgm. It is copied from the course 2 week 2 pgm => course2/week2/test_cr2_wk2.py. We rewrote the same pgm with tensorflow functions now. We implement it for batch gd only (not the other variants).

test_cr2_wk2_tf.py

We implement a 3 layer NN. The model is LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX.

Even though this is a binary classifier, we still used a softmax implementation, as binary is a special case of softmax with number of classes=2. All the functions that we defined in Pgm assignment 1 are the same here. The only diff is in initialize_parameters() func, and model() func. Differences are explained below:

  • initialize_parameters() => here we allowed args to be passed for "number of hidden units" for each layer, so that we can keep it consistent with our "course 2 week 2 pgm". That also allows us to play around with different number of hidden units for each layer, and observe the impact. The default number of hidden units for 3 layers is: [2, 25, 12, 2] where 1st entry 2 is for input layer.
  • model() => Here, the definition of functions is same as in assignment 1. These are few differences:
    • Test set: Since we have only training set in this example, we don't have arg for "test_set" in model() func. model() func is copied from course2 week 2 and is modified wherever needed to work for TF.
    • optimizer: We call "tf.train.GradientDescentOptimizer" instead of the Adam Optimizer. We could try both; this is just to keep it consistent with the "course 2 week 2" pgm.
    • cost_avg: One other diff is that we don't compute "cost_avg" by dividing by "m", as we already average when we divide by "mini_batch_size" within the loop.
    • All other part of model() is same, except that we don't evaluate test accuracy (since there's no test set)

Now we run the main pgm code the same way as in assignment 1. These are the diff:

  • We load the red/blue dataset (by using diff func load_dataset_rb_dots()). This is needed since the dataset here is different and is created by writing python cmds.
  • We convert Y label to 1 hot. This is needed for softmax function as explained earlier. We convert Y = [ 0 1 0] into Y(one_hot) = [ [1 0] [0 1] [1 0] ] where 0=red, 1=blue
  • We now call model() with desired number of hidden units, and it gives us the prediction accuracy.

Results:

This is the result we get (with the default settings we have in our pgm):

Cost after epoch 0: 0.051880
Cost after epoch 1000: 0.038114
Cost after epoch 2000: 0.030764
Cost after epoch 3000: 0.027093
Cost after epoch 4000: 0.025386
Cost after epoch 5000: 0.022814
Cost after epoch 6000: 0.021766
Cost after epoch 7000: 0.021067
Cost after epoch 8000: 0.019954
Cost after epoch 9000: 0.019063
Parameters have been trained!
Train Accuracy: 0.9166667

 

Summary:

Here we built a 3 layer NN using TensorFlow. TF is not easy or intuitive, so I'm still lost on why some things work with tensors, some with numpy, what running a session does, and so on. But eventually it did work. The main take away is that multi class classification worked just as easily as binary classification, and got us ~90% accuracy when trained for long enough. Our optional second assignment helped us see how we can transform a regular NN pgm written using numpy into a TF NN pgm.

 

Optimization algorithms: Course 2 - Week 2

This course goes over how to optimize the algo for finding the lowest cost. We saw gradient descent, which was used to find the lowest cost by differentiating the cost function and finding the minima. However, with a NN running on lots of data to get trained, the step for finding the lowest cost may take a long time. Any improvement in the training algo helps a lot.

Several such algo are discussed below:

1. Mini batch gradient descent (mini gd):

We train our NN on m examples, where "m" may be in millions. We vectorized these m examples so that we don't have to run expensive for loops. But even then, it takes a long time to run across m examples. so, we divide our m examples in "k" mini batches with m/k examples in each mini batch. We call our original gradient descent scheme as "Batch gd".

We form a for loop for running each mini batch in one "for" loop. Then there is an outer for loop to iterate "num" times to find lowest cost. Each loop thru all examples is called one "epoch".

 With "Batch gd", our cost function would come down with each iteration. However, with mini batch, our cost function is noisy and oscillates up and down, as it comes down to a minima. One hyper parameter to choose is the size of mini batch. 3 possibilities:

  1. Size of m:  When each mini batch contains all "m" examples, then it becomes same as "batch gd". It takes too long when "m" is very large.
  2. Size of 1: When each mini batch contains only 1 example, it's called "stochastic gd". This is the other extreme of batch gd. Here we lose all the advantage of vectorization, as we have a for loop for each example. Its cost function is very noisy, as it keeps oscillating while it approaches the minima.
  3. Size of 1<size<m: When each mini batch contains only a subset of examples, it's called "mini batch gd". This is the best approach, provided we can tune the size of each mini batch. Typical mini batch sizes are powers of 2, chosen as 64, 128, 256 or 512. A mini batch size of 1024 is also employed, though it's less common. The mini batch size should be such that the mini batch fits in CPU/GPU memory, or else performance will fall off a cliff (as we'll continuously be swapping training set data in and out of memory). A minimal sketch of forming mini batches is shown below.
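
Here is a minimal numpy sketch of forming shuffled mini batches (the assignment's own helper function differs in details such as seeding and handling of the last partial batch):

import numpy as np

def random_mini_batches(X, Y, mini_batch_size=64, seed=0):
    # X: (n_x, m) inputs, Y: (n_y, m) labels; returns a list of (mini_X, mini_Y) tuples
    np.random.seed(seed)
    m = X.shape[1]
    permutation = np.random.permutation(m)              # shuffle the example indices
    X_shuf, Y_shuf = X[:, permutation], Y[:, permutation]
    mini_batches = []
    for k in range(0, m, mini_batch_size):               # last batch may be smaller than mini_batch_size
        mini_batches.append((X_shuf[:, k:k + mini_batch_size],
                             Y_shuf[:, k:k + mini_batch_size]))
    return mini_batches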

2. Gradient Descent with momentum:

We make an observation when running gradient descent with mini batches: there are oscillations, which are due to W, b getting updated with only a small number of examples in each step. When it sees the new mini batch, W, b may get corrected to a different value in the opposite direction, resulting in oscillations. These oscillations are in the Y direction (i.e values of weight/bias jumping around) as we approach an optimal value (of the cost function) in the x direction. These Y direction oscillations are the ones that we don't want. They can be reduced by a technique known as "Exponentially weighted avg". Let's see what it is:

Exponentially weighted average:

Here, we average out the new weight/bias value with previous values. So, in effect, any dramatic update to weight/bias in current mini batch, doesn't cause too much change immediately. This smoothes out the curve.

Exponentially weighted avg is defined as:

Vt = beta*V{t-1} + (1-beta)*Xt => Here Xt is sample number "t". t goes from 0, 1, ... n, where n is the total number of samples, and V0 = 0.

It can be proved that Vt is approximately avg over 1/(1-beta) samples. so, for beta=0.9, Vt is avg over last 10 samples. If beta=0.98, then Vt is avg over last 50 samples. Higher the beta, smoother will be the curve, as it takes avg over larger number of samples.

It's called exponential because if we expand Vt, we see that Vt contains exponentially decaying "weight" for previous samples, i.e Vt = (1-beta)*[ Xt + beta*X{t-1} + beta^2*X{t-2} + beta^3*X{t-3} + ... ]

It can be shown that weight decays to 1/e when we look at the last 1/(1-beta) sample. So, in effect it seems that it's taking avg of last 1/(1-beta) samples.

However, the above eqn has an issue at startup. We choose V0=0, so first few values of Vt are way off from avg value of Xt, until we have collected few samples of Xt. To fix this, we add a bias correction term as follows:

Vt (bias corrected) = 1/(1-beta^t) * [ beta*V{t-1} + (1-beta)*Xt ] => Here we multiplied the whole term by 1/(1-beta^t). For the first few values of "t", (1-beta^t) is small, so dividing by it scales the small Vt up towards the true average. As "t" gets larger, 1/(1-beta^t) goes to 1, and the correction has no impact.
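
A small numeric sketch of the exponentially weighted average with bias correction (the sample values are made up):

import numpy as np

samples = np.array([1.0, 3.0, 2.0, 4.0, 3.5])   # made-up Xt values
beta = 0.9

V = 0.0
for t, x in enumerate(samples, start=1):
    V = beta * V + (1 - beta) * x            # raw exponentially weighted avg
    V_corrected = V / (1 - beta ** t)        # bias correction, matters for small t
    print(t, round(V, 4), round(V_corrected, 4))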

gd with momentum:

Now we apply the above technique of "Exponentially weighted avg" to gd with momentum. Instead of using dW and db to update W, b, we use weighted avg of dW and db to update W, b. This results in a much smoother curve, where W, b don't oscillate too much with each iteration.

Instead of doing W=W- alpha*dW, b=b-alpha*db, as we did in our original gd algo, we use weighted avg of dW and db here.

W = W - alpha*Vdw, b = b - alpha*Vdb, where Vdw, Vdb are the exponentially weighted avg (over roughly the last 1/(1-beta1) samples) of dW and db, defined as

Vdw = beta1*Vdw+ (1-beta1)*dW, Vdb = beta1*Vdb+ (1-beta1)*db
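
A minimal sketch of the momentum update for one step (dW, db are assumed to come from backprop; all arrays are numpy arrays):

def update_with_momentum(W, b, dW, db, vW, vb, alpha=0.01, beta1=0.9):
    # exponentially weighted averages of the gradients
    vW = beta1 * vW + (1 - beta1) * dW
    vb = beta1 * vb + (1 - beta1) * db
    # parameter update uses the smoothed gradients instead of dW, db directly
    W = W - alpha * vW
    b = b - alpha * vb
    return W, b, vW, vb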

3. Gradient Descent with RMS prop:

This is a slightly different variant of gd with momentum. Here also, we use the technique of exponentially weighted avg, but instead of using dW and db, we use the square of dW and db. Also, note that in "gd with momentum", we never knew which values of dW and db were oscillating; we smoothed all of them equally. Here, we smooth out more those values which have more oscillations, and vice versa. We achieve this by dividing dW and db by the square root of their weighted avg (instead of using the weighted avg of dW and db directly in the eqn). That way, whichever dW or db oscillates the most (maybe w1, w7 and w9 oscillate the most), their derivatives are going to be the largest (dw1, dw7 and dw9 have high values). So, dividing these large derivatives by larger numbers smooths them out more than the ones with lower derivatives. Eqn are as follows:

W = W - alpha*dW/√Sdw, b = b - alpha*db/√Sdb, where Sdw, Sdb are the exponentially weighted avg (over roughly the last 1/(1-beta2) samples) of (dW)^2 and (db)^2, defined as

Sdw = beta2*Sdw+ (1-beta2)*(dW)^2, Sdb = beta2*Sdb+ (1-beta2)*(db)^2

NOTE: we used beta1 in gd with momentum, and beta2 in gd with RMS prop, to distinguish that they are different betas. Also, we add a small epsilon=10^-8 to √Sdw and √Sdb so that we don't run into the numerical issue of dividing by 0 (or by a number so small that computers effectively treat it as 0). So, the modified eqn becomes:

W=W- alpha*dW/(√Sdw + epsilon), b=b-alpha*db/(√Sdb + epsilon)
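
A minimal sketch of the RMS prop update for one step (same assumptions as the momentum sketch above):

import numpy as np

def update_with_rmsprop(W, b, dW, db, sW, sb, alpha=0.01, beta2=0.999, eps=1e-8):
    # exponentially weighted averages of the squared gradients
    sW = beta2 * sW + (1 - beta2) * dW ** 2
    sb = beta2 * sb + (1 - beta2) * db ** 2
    # large, oscillating gradients get divided by larger numbers, so they are damped more
    W = W - alpha * dW / (np.sqrt(sW) + eps)
    b = b - alpha * db / (np.sqrt(sb) + eps)
    return W, b, sW, sb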

4. Gradient Descent with Adam (Adaptive Moment Estimation):

The Adam optimization method took both "gd with momentum" and "gd with RMS prop" and put them together. It works better than both of them, and works extremely well across a wide range of applications. Here, we modify RMS prop a little bit: instead of using dW and db with alpha, we use Vdw and Vdb with alpha. That reduces oscillations even more, since we are applying 2 separate oscillation reducing techniques in one. This technique is called moment estimation, since we are using different moments: dW is called the 1st moment, (dW)^2 is called the 2nd moment, and so on.

So, eqn look like:

W=W- alpha*Vdw/(√Sdw + epsilon), b=b-alpha*Vdb/(√Sdb + epsilon),where Vdw, Vdb, Sdw and Sdb are defined above

Here there are 4 hyper parameters to choose from: alpha needs to be tuned, but beta1, beta2 and epsilon can be chosen as follows:

beta1=0.9, beta2=0.999, epsilon=10^-8. These values work well in practice and tuning these doesn't help much.
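
A minimal sketch of the Adam update for one parameter (only W is shown; b is handled identically):

import numpy as np

def update_with_adam(W, dW, vW, sW, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # 1st moment (momentum) and 2nd moment (RMS prop) of the gradient
    vW = beta1 * vW + (1 - beta1) * dW
    sW = beta2 * sW + (1 - beta2) * dW ** 2
    # bias correction, matters for the first few iterations (t starts at 1)
    vW_corr = vW / (1 - beta1 ** t)
    sW_corr = sW / (1 - beta2 ** t)
    # combined update
    W = W - alpha * vW_corr / (np.sqrt(sW_corr) + eps)
    return W, vW, sW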

5. Learning rate decay:

In all the gd techniques above, we kept learning rate "alpha" constant. However, we observe that learning rate doesn't need to be constant. It can be kept high when we start, as we need to take big steps, but can be reduced as we approach the optimal cost, since smaller steps suffice as we converge to optimal value. The larger steps cause oscillations. so reducing alpha reduces these oscillations, so that it allows us to converge smoothly. this approach is called learning rate decay and there are various techniques to achieve this.

simplest formula for implementing learning rate decay is:

alpha = 1/(1+decay_rate*epoch_num) * alpha0, where alpha0 is our initial learning rate. epoch_num is the current iteration number.

So, as we do more and more iterations, we keep on reducing the learning rate, until it gets close to 0. Now we have one more hyper parameter "decay_rate" on top of alpha0, both of which need to be tuned.

Other formula for implementing learning rate decay are:

alpha = ((decay_rate)^epoch_num) * alpha0 => This also decays learning rate

alpha = (k/√epoch_num) * alpha0 => This also decays learning rate

Some people also manually reduce alpha every couple of hours or days based on run time. No matter what formula is used, they all achieve the same result of reducing oscillations
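
A tiny sketch of the 1/(1 + decay_rate * epoch_num) style decay with made-up numbers:

def decayed_learning_rate(alpha0, decay_rate, epoch_num):
    return alpha0 / (1 + decay_rate * epoch_num)

# example: alpha0 = 0.2, decay_rate = 1.0
for epoch in range(5):
    print(epoch, round(decayed_learning_rate(0.2, 1.0, epoch), 4))
# prints 0.2, 0.1, 0.0667, 0.05, 0.04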

Conclusion:

Adam and all the other techniques discussed above speed up our NN learning. They solve the problem of plateaus in gd, where the gradient changes very slowly over a large space, resulting in very slow learning. All these gd techniques speed up the learning process by speeding up learning in the x direction. There is also the problem of getting stuck in a local minima, but it looks like this is not an issue in NN with large dimensions. This is because, instead of hitting local minima (where the shape is like the bottom of a trough or valley), we hit saddle points (where the shape is like the saddle of a horse). For a local minima, all dimensions need to have a shape like a trough, which is highly unlikely for our NN to get stuck at. At least one of the dimensions will have a slope to get us out of the saddle point, and we will keep on making progress w/o ever getting stuck at a local minima.

 

Programming Assignment 1: here we implement different optimization techniques discussed above.  We apply our different optimization techniques to 3 layer NN (of blue/red dots):

  • batch gd: This is same as what we did in last few exercises. This is just to warm up. Nothing new here.
  • mini batch gd: We implement mini batch gd by shuffling data randomly across different mini batches.
  • Momentum gd: This implements gd with momentum
  • Adam gd: This implements Adam - which combines momentum and RMS prop
  • model: Now we apply the 3 optimization techniques to our 3 layer NN = mini batch gd, momentum gd and Adam gd.

Here's the link to the pgm assignment:

Optimization_methods_v1b.html

This project has 2 python pgm.

A. testCases.py => There are a bunch of testcases here to test your functions as you write them. In my pgm, I have them turned off.

testCases.py

There is also a dataset that we use to run our model on:

data.mat

B. opt_utils_v1a.py => this pgm defines various functions similar to what we used in previous assignments

opt_utils_v1a.py

C. test_cr2_wk2.py => This pgm calls functions in above pgms. It implements all optimization techniques discussed above. It loads the dataset, and runs the model with mini batch gd, momentum gd and Adam gd. Adam gd performs the best.

test_cr2_wk2.py

 

Summary:

By finishing this exercise, we learnt  many faster techniques for implementing gradient descent. We would get the same accuracy for all of the gd methods discussed above, it's just that slower gd techniques are going to require a lot more iterations to get there. This can be verified by setting "NUM_EPOCH" to 90,000 in our program.

Derate:

We learned OCV in the previous section. However, running OCV at 2 different PVT corners may not always be practical. For ex, consider the voltage drops seen on a chip due to IR. We may not be able to get a lib at that particular voltage corner after accounting for the voltage drop due to IR. Similarly for temperature, we may not be able to get a lib for that exact temperature after accounting for on chip heating. Also, even if we are able to get these libs, ocv analysis requires running at 2 extreme corners. If we do not want to run analysis at 2 diff corners for ocv, we can run it at 1 corner only by specifying derating. Derating is an alternate approach where we speed up or slow down certain paths so that they can indirectly achieve the same results as OCV. Derating is basically applying a certain multiplying factor to each gate delay so that the delay can be scaled up (by having a multiplying factor > 1), or scaled down (by having a multiplying factor < 1). The advantage of derate is that each and every gate in the design can now be customized to have a particular delay on it. With OCV analysis, we weren't able to do this, as the flow just chose b/w WC and BC lib and applied one or the other to each gate in the design. Here, we first choose a nominal voltage, for which we have a library available, and then apply derate to achieve the effects of Voltage and Temperature variations.

There are different kind of derates:

  • Timing derates: When we run sta at a particular voltage/temperature, we assume the same voltage and same temperature on all transistors. However, based on IR drop and temperature variation, as well as aging effects, we know that not all transistors will be at the same voltage/temperature. So, we apply timing derating. We apply these timing derates as "voltage guardband derates". Even though we say voltage, we include the effect of temperature and aging, so that the "voltage derate" includes the effects of all of these. In the PT flow, these derates are specified via "set_timing_derate -pocvm guardband" or "set_timing_derate -aocvm guardband" => This is explained later. By default, set_timing_derate specifies an ocv derate, which covers derates due to local process variation only. Then we apply either the aocv or pocv voltage guardband derate, which accounts for Voltage+Temperature+reliability derates.
  • POCVM distance derates: Only applied on clocks. This is additional derate on top of voltage derate above.
  • LDE (Layout dependent Effect) derates: provided by foundry. Applied as incremental derate
  • MixedVt / MixedLg derates: Due to differences in Threshold voltage as well as in the "Length" of transistors, we experience differences in delay which don't scale the same way, i.e the process is not correlated for different VT. So LVT might be exactly at the SS -3σ corner, but ULVT, instead of being perfectly at the SS -3σ corner, may be a little bit faster or slower. e.g. in a slow-slow corner, the capture clock is LVT and might be slightly faster than expected due to Vt mistracking. This VT mistracking is not OCV related: OCV models local process variations, while MixedVt models global process corner correlation. We model this MixedVt correlation effect via derate.
  • Margining derates: Other derates used for margining

What derating factor to apply for ocv/aocv/pocv is derived by running monte carlo sims.

1. set_timing_derate => It allows us to adjust delays by a multiplying factor. It automatically sets op cond to ocv mode. The derating factor applies only to the specified objects which is a collection of cells, lib_cells or nets. If no objects specified, then derating applies to whole design. report_timing -derate shows timing with derating factor shown for each cell/net in the path. We do not derate slews or constraint arcs (as they are not supported for AOCV or POCV), but we do have options for setting these in set_timing_derate cmd.

options:

-early/-late => unlike other cmds, there is no default here. We have to specify -early to apply derating on the early (shortest delay) path, and -late for the late (longest delay) path. We need separate cmds for early and late, else derating will get applied to only one of them. The tool applies worst-case delay adjustments, both early and late, at the same time. For example, for a setup check, it applies late derating on the launch clock path and data path, and early derating on the capture clock path. For a hold check, it does the opposite. We get these derating values from simulations. First, we try to find the worst/best case voltage drop on transistor power pins (after accounting for off chip IR drop, PMU overshoot/undershoot and on chip IR drop) and then apply derating accordingly.

  • Early derating: We apply early derating corresponding to the voltage level which would be with off chip IR drop only. This is the absolute highest voltage that can be seen by any transistor on die. Then we add extra derate to account for temperature offset. We apply same early derate for both clk and data path.
  • Late derating: We apply late derating corresponding to the voltage level which would be with off chip IR drop + on chip IR/power_switch drop + reliability margin (due to VT shift seen for transistors with low activity). This is the absolute lowest voltage that can be seen by any transistor on die. Here, we apply slightly different derate for clk and data path. For clock path, we don't add the reliability margin, since clk is always switching, so there is no reliability penalty that the clk path incurs. So, clk path sees a slightly higher voltage.

NOTE: since we apply these early/late derates, we want the nominal voltage at which we are going to run STA to be around these early/late voltages. If our library's nominal voltage is too far from these early/late voltages, then we have to apply large derating, which may not produce accurate results.

-cell_delay/-net_delay => By default, derating applies to both cell and net delays, but not to cell timing check constraints. This allows derating to apply only to cell or net delays. -cell_check allows derating to be applied to cell timing check constraints (setup/hold constraints)

-data/-clock => By default, delays are derated in both data paths and clock paths. This allows derating to be applied to only data or clock

-rise/-fall => By default, both rising and falling delays are derated. This allows derating to be applied to cell/net delays that end with a rising or falling transition only

There are many more options that we'll see later (including -aocvm guardband/-pocvm guardband options). If the options -aocvm guardband/-pocvm guardband are not used, then the above derating cmd sets an OCV derate, which only accounts for the local process variation related derate. Voltage/Temperature and Reliability derates are captured via the additional derate specified with -aocvm guardband/-pocvm guardband options. Thus the OCV derate and the aocv/pocv guardband derate are all needed to account for all PVT+reliability variations.

ex: set_timing_derate -early 0.9; set_timing_derate -late 1.2 => The first command decreases all early (shortest-path) cell and net delays by 10 percent, such as those in the data path of a hold check (and clk path of setup check). The second command increases late (longest-path) cell and net delays by 20 percent, such as those in the data path of a setup check (and clk path of hold check). These two adjustments result in a more conservative analysis. We should use derating < 1 for early and >1 for late, since we are trying to simulate worst case ocv that can happen. Derating gets applied to whole design, as we did not specify any object.

ex: set_timing_derate -increment -cell_delay -data -rise -late 0.0171 [get_lib_cells { TSM_svt_ff_1v_125c_cbest/SDF_NOM_D1_SVT}] => applies derating of 1.7% only to lib cell specified for rise dirn, and long delay path. -increment adds this derating on top of any derating that is already applied globally or to this cell earlier.

ex: set_timing_derate -cell_delay -net_delay -late 1.05 [get_cells top/H1] => sets a late derating factor on all cells and nets in the hierarchical cell top/H1, including its lower-level hierarchical cells

report_timing_derate => shows derates set for all cells in design. This is very long list so better to redirect it to some o/p file

AOCV and POCV: OCV is OK, but it doesn't model advanced levels of variations for 65nm and below, which results in overdesign. OCV allows us to model different derating for diff cells (by using set_timing_derate cmd), but fails to capture other factors. To mitigate some of these OCV issues, advanced forms of OCV came into picture. AOCV (advanced OCV) was used earlier, but even more advanced POCV (parametric OCV) is used now. We will go over details of both

PT has app_var variables which allow advanced OCV and parametric OCV. To report all app_var, we can use this cmd:

pt_shell > report_app_var *ocv* => reports all aocv and pocv app_var settings

AOCV: timing_aocvm_enable_analysis => setting it to true enables AOCV => needs AOCV tables in a side file

POCV: timing_pocvm_enable_analysis => setting it to true enables POCV => needs a POCV side file or liberty file

AOCV: Advanced on chip variation

OCV doesn't handle below factors:

  1. path depth => variation reduces on long paths due to statistical canceling. So, even if each cell has a lot of variation, due to the random nature of variations, they can be +ve or -ve. The more gates in the path, the higher the chances that +ve and -ve variations will cancel out, resulting in a very low level of variation. So, path depth is a factor only for random variations on the die.
  2. path distance => variation increases as a path travels across more die area. This is based on the simple silicon observation that close by structures have less variation, but structures far away from each other have larger variation. That is why in analog circuits, matching transistors are placed as close as possible to each other to minimize variation b/w them. So, path distance is a factor only for systematic spatial variations on the die. Even if you have more gates in the design, if they are close by, spatial variations will be very low compared to when the gates are far apart. In other words, these variations are correlated more or less depending on their proximity to each other. So, total variation in any path is a function of both random variation and spatial variation. However, random variation dominates, so path distance based variation is not very critical.
  3. different cell types => variation varies depending on the transistors used in cells. Lower width transistors have more variation than larger width ones. However, cell level derating is already captured in the simple derating cmd above, as it allows us to set different derates for different kinds of cells.

AOCV was proposed to provide path depth and distance based derating factors to capture random and systematic on-die variation effects respectively. Instead of applying a constant derating factor to any cell as in ocv, we adjust the derating factor for a given cell based on path distance and depth. This is provided in the form of a 2D table in a file. AOCV provides a single number for the delay of a path based on the derating used (the derating value itself is taken from the 2D table based on path depth and path distance for that cell). It only models delay variation, but does not model transition or constraint variation. Thus AOCV is the same as OCV except for the derating added for the above 2 factors.

Both GBA and PBA AOCV can be performed. GBA AOCV reduces pessimism somewhat compared to GBA OCV, which may be sufficient for signoff. If a few paths still fail, PBA AOCV can be run on the selected failing paths, which reduces pessimism even further.

AOCV flow:

 set_app_var read_parasitics_load_locations true => this is so that read_parasitics can read location info from spef file

read_parasitics file.spef => To use the distance based derating specified in the aocv file below, we need the physical location of various gates, nets, etc. This info is contained in SPEF files, and can be read via the read_parasitics cmd. If we have a hierarchical flow, where there are separate spef files for blocks and for top level, PT can automatically determine the correct offset and orientation of blocks. However, if it fails, we can specify it manually via extra args in the read_parasitics cmd.

set_app_var timing_aocvm_enable_analysis true => sets GBA AOCV analysis

read_ocvm file.aocvm => reads derating from this aocvm file. It has 2D table of derates with depth and distance as index (It can also be 1D table with either depth or distance as index, although this will give less accurate results). aocv derate take precedence over ocv derate specified for any cell, as it's more accurate. Syntax of this file is explained under pocv section below.

set_timing_derate -aocvm_guardband -early 0.95 => this applies an additional guardband to model non process related effects (ir drop, margin, etc) in the aocv flow. For fast paths, we reduced delays by a further 5%. Final derate = aocv_derate * guardband_derate. set_timing_derate -increment adds a derate on top of this derate (instead of multiplying, it adds). So, Final derate = aocv_derate * guardband_derate + incremental_derate. Either the guardband derate or the incremental derate can be applied, as the two are mutually exclusive.

set_timing_derate -aocvm_guardband -late 1.04 => For slow paths, we increased delays by 4%.

update_timing => performs timing update

report_ocvm -type aocvm => reports number of cells/nets that aocvm deratings were applied on, in summarized table. Any cell/net not-annotated with derate is listed here.

report_ocvm -type aocvm I0/U123 => If object list specified,  derating on specific instances or arcs reported

report_timing => shows timing paths with aocv analysis. -derate shows derate factor too

POCV: Parametric on chip variation

POCV is even more advanced, and a radical departure from conventional methods. Here, timing values are stored not as one single number but rather as statistical quantities (i.e as a gaussian distribution with mean u and std dev sigma). These statistical distributions are propagated along the path, and probability/statistics theorems are applied to come up with a statistical distribution for the final delay at the end point of a path. AOCV models deratings only for delay, but the POCV statistical method is applied not only to delay, but also to the transition variation for each cell on a path. It also models constraint variation (setup/hold times on flops), as these vary too depending on variation within the cell, as well as transition variation on clk and data pins. mu and sigma values are stored in lib files for each cell for delay, transition and constraint (only if provided for flops). By default, only delay variation is enabled. Transition variation and constraint variation have to be enabled to get a better match with HSPICE monte carlo sims. Timing values can be reported at any N sigma corner (since sigma is known). Usually, we report it at 3 sigma, as that implies 99.9% of the dies for that timing path will lie within 3 sigma (i.e only 0.1% of chips will fail for that path).

POCV takes care of path depth automatically, as it propagates each distribution as independent random variable. So, statistically cancellation takes care of path depth. path distance is handled by using distance based AOCV tables. So, these tables are 1D in case of POCV (as opposed to 2D tables in AOCV).

Lower VDD, as found in low nm tech, increases sensitivity to delay, transition and constraint variation (as Vdd is close to Vth, a small change in Vdd causes large changes in current). So, POCV accounts for all this sensitivity, and prevents overdesign during PnR. POCV run with GBA provides better correlation with PBA, as it reduces pessimism in GBA. On the other hand, with AOCV, exhaustive PBA had to be run at signoff as GBA has inbuilt pessimism, increasing run time. Tight GBA-PBA correlation in POCV prevents having to run exhaustive PBA.

POCV is run in PT as regular flow. The only extra step is reading variation information from liberty files, or from sidefiles in AOCV like table. Then timing is reported at specific sigma corner.

POCV input data: 2 methods: One is providing a sidefile for distance based derating (or a single sigma value called the single coefficient), and the other is a liberty file with sigma values across i/p slew rate and o/p load. Derate in POCV can be applied to both mean and sigma values. Derate is applied to mean values available in the .lib file, and to sigma values available in the .lib file or in the sidefile as a single coeff.

1. sidefile with POCV single coefficient: This sidefile is just an extension of AOCV table format in version 4.0 (this is same synopsys file format as shown in AOCV section). It can either have distance based derate, or constant coefficient for sigma. Syntax is as below: (file1.pocvm or file1.aocvm or any name)

version: 4.0

ocvm_type: pocvm => it can be aocvm or pocvm
object_type: lib_cell => this can be design, lib_cell or hier cell
rf_type: rise fall => for both rise/fall

voltage: 0.9 => this allows voltage based derating where diff derate can be applied based on voltage the cell is at.
delay_type: cell => this can be cell or net. For net, object_spec is empty to apply it for all nets
derate_type: early => early means it's applied on shortest paths (for setup, clk paths are early, while for hold, data paths are early => to model worst case scenario)
path_type: clock => below derating is applied only for cells in clk path (applicable only for setup). For cells in data path for early (applicable only for hold), we may specify a different derating (<1).
object_spec: -quiet TSM_LIB/* => applies to all cells. For particular cells, we can do TSM_LIB/invx1*. -quiet prevents warnings from showing up
distance: 0 5000000 10000000 20000000 30000000 40000000 => this specfies distance for derating purpose
table: 1.000000 0.990442 0.986483 0.980885 0.976588 0.972967 => since type is early (fast paths), derates are < 1 to model worst scenario.

coefficient: 0.05 => this specifies single coefficient which is sigma divided by mean = sigma/mu (random variation coeff normalized with mu). This is specified if we want to do single coeff based POCV for our timing runs, instead of more accurate liberty based sigma. However, coeff and distance are mutually exclusive, you can specify only one of them. Different values can be specified for diff cells, etc. Usually more accurate lib files used to provide sigma, instead of providing coefficient here.

ocvm_type: pocvm
object_type: lib_cell
rf_type: rise fall
delay_type: cell
derate_type: late => late means it's applied on longest paths
path_type: clock => below derating only for cells on clk path (applicable only for hold). We specify derating separately for cells on data path for late (applicable only for setup)
object_spec: -quiet TSM_LIB/*
distance: 0 5000000 10000000 20000000 30000000 40000000
table: 1.000000 1.009558 1.013517 1.019115 1.023412 1.027033 => since type is late (slow paths), derates are > 1 to model worst scenario. 

coefficient: 0.02

2. LVF (liberty variation format) file: These file have additional groups which contain sigma info for delay, transition and constraint variation. They may also have distance based derating values here, instead of being in a sidefile (using ocv_derate group)

 format of this explained in Liberty section

POCV flow:

set_app_var read_parasitics_load_locations true => this is so that read_parasitics can read location info from spef file

read_parasitics file.spef =>

set_app_var timing_pocvm_enable_analysis true => sets GBA POCV analysis

set_app_var timing_pocvm_corner_sigma 3 => sets corner value for timing to be 3 sigma. It can be set to 5 sigma for more conservative analysis

set_app_var timing_enable_slew_variation true => to enable transition variation (i/p slew variation affects delay variation as well as o/p transition variation). Optional but recommended for better accuracy at < 16nm

set_app_var timing_enable_constraint_variation true => to enable constraint variation (setup/hold, rec/rem, clkgating variation). Optional but recommended for better accuracy at < 16nm

read_ocvm file.pocvm => reads single coeff or distance based derating from side file based on what's available

set_timing_derate -pocvm_guardband -early 0.95 => For fast paths, we reduced delays by a further 5%. POCV guardband is applied on both mean delay and sigma delay (AOCV guardband is only for mean delay, as there's no concept of sigma in AOCV). If we want to derate only the sigma delay, we can scale the pocvm coefficient in the sidefile or liberty file (w/o modifying the value directly in the sidefile or liberty file) by using "set_timing_derate -pocvm_coefficient_scale_factor 1.03", which scales only sigma and not mean delay. However, pocvm coeff scaling is applied on top of the guardband for sigma delay.

  • Final derate_mean = pocv_derate * guardband_derate + incremental_derate,
  • Final_derate_sigma = guardband_derate * pocvm_coefficient_scale_factor

set_timing_derate -pocvm_guardband -late 1.04 => For slow paths, we increased delays by 4%

update_timing => performs timing update

report_ocvm -type pocvm => reports summarized pocvm deratings applied on cells/nets. If object list provided, it shows coeff and distance based derating picked from sidefile or LVF

report_timing => shows timing paths with pocv analysis. -derate shows the derate factor too. However, now we may want to see both mean and sigma delays (since sigma delays are taken into account when reporting slack). Slacks are not a simple difference b/w expected arrival time and actual arrival time; the sigmas along the path combine as the square root of the sum of squares (since we are now dealing with statistical quantities). To see both mean and sigma delays, set this app var:

set_app_var variation_report_timing_increment_format delay_variation => Now report_timing -derate will show 2 columns: mean (delay w/o variation) and sensit (sigma or delay variation). The incremental time column for that arc should equal mean +/- 3*sensit (+ or - depending on slow (max) or fast (min) paths). Mean and sensit are shown with derating already applied. Apart from the incremental columns, there are also path columns, which show both mean and sensit again. Mean here is the cumulative mean up to that point (sum of all means), while sensit is the cumulative sensitivity up to that point (sqrt of the sum of squares of all sigmas). These help to verify the various path delays and how they contribute to the overall delay. There are also statistical corrections applied to get the numbers to add up. There is also statistical graph pessimism applied in timing analysis. Latch borrowing also needs to be treated differently when in POCV mode.

  • final cell_delay mean derated = cell_delay_mean * final_derate_mean = cell_delay * ( "POCVM guardband" * "POCVM distance derate" + "Incremental derate" )
  • final cell_delay sigma derated = cell_delay_sigma_adjusted * final_derate_sigma = cell_delay_sigma_adjusted * ( "POCVM guardband" * pocvm_coefficient_scale_factor ) => cell_delay_sigma here is adjusted from the original sigma number reported in the liberty file (if LVF is used) by accounting for the fact that transition variation on the i/p will affect delay variation (as well as o/p transition variation) depending on the correlation b/w transition and delay. There is a proprietary formula applied by synopsys to come up with the adjusted sigma number. A tiny made-up numeric example of these derate formulas is shown below.
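
Below is a tiny worked example of these formulas with made-up numbers (the derate values are purely illustrative, not from any real library or signoff flow):

# made-up values, purely for illustration
cell_delay_mean   = 0.100   # ns, mean delay from .lib
cell_delay_sigma  = 0.005   # ns, adjusted sigma from LVF
pocvm_guardband   = 1.04    # late guardband derate
distance_derate   = 1.01    # POCVM distance derate from sidefile
incremental       = 0.0     # no incremental derate in this example
coeff_scale       = 1.00    # pocvm_coefficient_scale_factor

final_derate_mean  = pocvm_guardband * distance_derate + incremental   # 1.0504
final_derate_sigma = pocvm_guardband * coeff_scale                     # 1.04

derated_mean  = cell_delay_mean * final_derate_mean     # 0.10504 ns
derated_sigma = cell_delay_sigma * final_derate_sigma   # 0.0052 ns

# at 3 sigma on a late (max) path, the arc's incremental delay would be roughly
arc_delay_3sigma = derated_mean + 3 * derated_sigma      # ~0.1206 ns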

Now update_timing runs pocv, and report_timing shows timing paths with pocv analysis.

report_delay_calculation -from buf1/A -to buf1/Z -derate => This shows detailed calculation of cell mean delay and cell sigma delay with derating. This is useful for debug.

POCV analysis with xtalk: POCV analysis is only applied on non-SI cell delay. However, POCV can indirectly change crosstalk delta delay due to the difference in timing window alignment.

POCV characterization: POCV data can be generated via Synopsys SiliconSmart too. It can generate sigma for use in LVF, or a coeff for use in the pocv side file. Distance based derating in the sidefile can't be generated via any tool, and is derived from silicon measurements.