LINKS:

This section is for random links to websites, articles, or anything else that has been useful or interesting to me:

 


 

General Websites:

www.wikipedia.org => The number 1 website for all my learning needs. Whether it's mathematical, geographical, or complex advanced science material, Wikipedia has always given me the best material to start with.

www.slickdeals.net => This is the number 1 website for all your deals. However, you may not want to buy anything by clicking the link from slickdeals, as you don't make any cashback from there. There are cashback websites that give you cash back for buying things on the internet, so use those for making money. Two reliable ones that I use are topcashback.com and rakuten.com.

www.doctorofcredit.com => This is another very good website for finding any deal that makes you money. It's different from slickdeals in that it lists all deals (financial, credit cards, cashback websites, etc.) that are not necessarily sponsored. You will never find these kinds of deals on slickdeals unless they are sponsored. The comments on this site also help a lot in deciding whether you should pursue a deal or not.

www.stallman.org => This is the site of Richard M. Stallman (rms), who started the free software movement (the Free Software Foundation and the GNU Project). "Basic human freedom" is the cornerstone of his views.

 


 

Educational sites:

3blue1brown: I would never have known about this site had I not run into it while searching for an AI video. It's a channel + website started by a Stanford grad named Grant Sanderson. Absolutely amazing!! If you can find your topic in one of his videos, you don't need to watch any other video for that topic, that's how good they are. Lots of topics on maths, AI, crypto, etc. I'm not even sure how one person can have such expertise in so many unrelated fields. Learning a lot:

Personal site with videos: https://www.3blue1brown.com

YouTube video channel (named 3blue1brown): https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw

 


 

Puzzles:

Maths puzzles for kids: https://www.weareteachers.com/10-magical-math-puzzles/

Google interview puzzle about finding the fastest 3 horses among 25 horses in the minimum number of races => https://www.youtube.com/watch?v=i-xqRDwpilM

 


 

Random Articles on web:

https://getpocket.com/explore/item/the-feynman-technique-the-best-way-to-learn-anything => A very nice technique for learning anything: break complex things down into the simplest things.

https://getpocket.com/explore/item/whoa-this-is-what-happens-to-your-body-when-you-drink-enough-water => benefits of drinking lots of water. Make sure you drink at least 2 litres of water every day. 1 litre should be drunk every morning on an empty stomach, after you wake up and before you use the restroom.

https://www.propublica.org/article/what-are-2020s-tax-brackets-and-will-i-get-audited => Tax details

https://getpocket.com/explore/item/mental-models-how-to-train-your-brain-to-think-in-new-ways => mental models to train your brain

https://getpocket.com/explore/item/indian-employers-are-stubbornly-obsessed-with-elite-students-and-it-s-hurting-them?utm_source=pocket-newtab => somewhat interesting take on hiring under-performers from non-elite colleges in India

https://getpocket.com/explore/item/work-stress-how-the-42-rule-could-help-you-recover-from-burnout => the amount of rest your body needs is 10 hr/day.

https://getpocket.com/explore/item/why-being-lazy-and-procrastinating-could-make-you-wildly-successful => How laziness and procrastination are so awesome!! My favorite article. Let me go back to ...

https://www.bbc.com/worklife/article/20210222-how-a-beginners-mindset-can-help-you-learn-anything => How "beginner's mindset" helps us learn anything

 


 

YouTube Channels:

Robert Greene: Channel => https://www.youtube.com/@RobertGreeneOfficial

Andrew Huberman: A neuroscience professor from Stanford. Tons of podcasts. Channel => https://www.youtube.com/@hubermanlab

 

 


 

Course 2 - week 3 - Hyperparameter tuning, Batch Normalization and Programming Frameworks

This week's course is divided into various sections. The first 2 sections are a continuation of the previous week's material. The last section is about using programming frameworks, which is totally new and will require some time to understand.

Hyperparameter tuning:

There are various hyperparameters, seen in the previous section, that need to be tuned for our NN. Ranked by their effect on NN performance, they are:

1. learning rate (alpha): The most important hyperparameter to tune. Not choosing this value properly may cause large oscillations in the cost function.

2. Mini batch size, number of hidden units and momentum (beta): These are second in importance.

3. Number of layers (L), learning rate decay, Adam parameters (beta1, beta2, epsilon): These are last in importance. Adam hyperparameters (beta1=0.9, beta2=0.999, epsilon=10^-8) are usually not tuned, as these values work well in practice.

It's hard to know in advance which hyperparameter values will work, so we try random values of these hyperparameters from within a bounding box (changing 2, 3, or even more at a time). Once we find a smaller bounding box where the hyperparameters seem to perform better, we use the "coarse to fine" technique and start trying finer values, until we get close to the optimal hyperparameters.

We need to choose the scale on which to sweep each hyperparameter carefully, so that we cover the whole range. For example, to sweep the learning rate alpha, we sweep it on a log scale from 0.0001 (10^-4) to 1 (10^0) in steps of x10 (i.e. 10^-4, then 10^-3, then 10^-2, then 10^-1, and finally 10^0). For random search, we sample the exponent uniformly rather than sampling alpha itself uniformly, as shown in the sketch below.
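A minimal sketch of sampling alpha on a log scale (plain Python/numpy; the function name is my own, not from the course code):

    import numpy as np

    # Sample the exponent uniformly in [-4, 0], then convert to alpha.
    # This covers each decade (10^-4..10^-3, ..., 10^-1..10^0) equally,
    # unlike sampling alpha itself uniformly, which would mostly land near 1.
    def sample_learning_rate(low_exp=-4, high_exp=0):
        r = np.random.uniform(low_exp, high_exp)   # exponent on the log scale
        return 10 ** r

    alphas = [sample_learning_rate() for _ in range(5)]
    print(alphas)   # e.g. values spread across 1e-4 .. 1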

 There are 2 approaches to hyper parameter tuning:

1. caviar approach: We use this when we have lots of computing resources available. We train many NNs in parallel with different hyperparameter settings and see which ones work best. The name comes from the way fish reproduce: they lay a huge number of eggs and simply let the best ones survive.

2. panda approach: Here, we run just one NN model with a set of hyperparameters. As time passes, we tune the hyperparameters, see whether they make the performance of the NN better or worse, and keep adjusting them every day or so. So here we babysit just 1 model, similar to how pandas raise their babies: they don't produce many babies, but they watch over their one baby with all their effort to make it stronger.

 

Batch Normalization:

Here, we normalize inputs to speed up our NN. We subtract the mean from the inputs, and then divide by the standard deviation (the square root of the variance). That way the inputs get more uniformly distributed around a center, which makes our cost function more symmetric, resulting in faster convergence when finding the minima.

For a deep NN, we can normalize inputs to each layer. Input to each layer is o/p of activation func, a[l]. However, instead of normalizing a[l], we normalize Z[l].

μ = 1/m * Σ Z[i]

σ^2 = 1/m * Σ (Z[i] - μ)^2

Znorm[i] = (Z[i] - μ) / √(σ^2 + ε)

Now instead of using Z[i] in our previous NN eqn, we use Znorm[i] which is the normalized value.

If we want to be more flexible in how we use Z[i], we may define learnable parameters gamma and beta, which allow the model to choose either the raw Z[i], the normalized Znorm[i], or any intermediate value. This is achieved by defining a new variable Ztilde[i] (Z tilde):

Ztilde[i] = γ*Znorm[i] + β => by changing the values of gamma and beta, we can get any Ztilde[i]. For example, if γ=1 and β=0, then Ztilde[i] = Znorm[i]. If γ=√(σ^2+ε) and β=μ, then Ztilde[i] = Z[i].

Since gamma and beta are learnable parameters (just like weights), we really don't have to worry about their most optimal values. The gd algo will choose the values that give the lowest cost for our cost function. Note that each layer has its own gamma and beta, so they can be treated just like weights for each layer. gd now computes γ[l] and β[l], on top of W[l] and b[l]. However, since we are normalizing, b[l] gets cancelled out by the mean subtraction, so we can omit b[l]. So, we have 3 parameters to optimize for each layer l: γ[l], β[l] and W[l]. We can extend this to the mini-batch technique too, with all gd algorithms like momentum, Adam, etc.
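A minimal numpy sketch of the batch-norm forward step for one layer (my own variable names, not the course's helper functions):

    import numpy as np

    def batchnorm_forward(Z, gamma, beta, eps=1e-8):
        # Z has shape (n_units, m): one column per example in the (mini-)batch
        mu = np.mean(Z, axis=1, keepdims=True)        # per-unit mean
        var = np.var(Z, axis=1, keepdims=True)        # per-unit variance
        Z_norm = (Z - mu) / np.sqrt(var + eps)        # zero mean, unit variance
        Z_tilde = gamma * Z_norm + beta               # learnable scale and shift
        return Z_tilde

    Z = np.random.randn(4, 10)                        # 4 units, 10 examples
    gamma = np.ones((4, 1)); beta = np.zeros((4, 1))  # gamma=1, beta=0 => Z_tilde = Z_norm
    print(batchnorm_forward(Z, gamma, beta).shape)    # (4, 10)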

Batch norm works because it makes the NN computation more immune to covariate shift. The i/p data and all other intermediate i/p data are always normalized, which ensures that the mean and variance of every layer's i/p stay the same no matter how the data moves. This makes these values more stable even if the i/p distribution shifts.

Multi Class classification:

Binary classification is what we have used so far, which classifies any picture into just 2 outcomes: cat vs non-cat. However, we can have multi-class classification, where the o/p layer produces multiple outputs, i.e. whether the picture is a cat, dog, cow or horse (known as 4-class classification). It outputs the final probability of each class, and the sum of these probabilities is 1.

Here the o/p layer L, instead of generating 1 o/p, generates multiple o/p values, one for each class. So the o/p layer Z[L], instead of being a 1x1 matrix as in binary classification, is now a Cx1 matrix for multi-class classification, where C is the number of classes to classify. Previously, the activation function for the o/p layer a[L] was the sigmoid function, which worked well for binary classification. However, with multi-class classification, we need a different activation function for the o/p layer. We choose the activation function to be the exponent function normalized by the sum of exponents.

For 2-class classification, we use the sigmoid func:

Sigmoid function σ(z) = 1/(1+e^-z) = e^z/(1+e^z)

prob for being in class 0 = yhat = σ(z) and

prob for being in class 1 (not in class 0, or class=others) = 1 - yhat = 1 - σ(z) = 1/(1+e^z)

We generalize the above eqn to C classes. We use the exponent func in the o/p layer (also called the softmax layer):

exponent func for class k = e^zk/(e^z1 + e^z2 + ... + e^zC), where C is the number of classes and k is the kth class

prob for being in class 0 = yhat[0] =  e^z1/(e^z1 + e^z2 + ... e^zc) 

prob for being in class 1 = yhat[1] =  e^z2/(e^z1 + e^z2 + ... e^zc) 

...

prob for being in class C-1 = yhat[c] =  e^zc/(e^z1 + e^z2 + ... e^zc) 

So, the probabilities all add up to 1. The matrix a[L], or yhat, is a Cx1 matrix.

For C=2, multiclass reduces to binary classification. For implementation of multi class, the only difference in algo would be to compute o/p layer differently, and then do back prop.

For 2 classes, if we fix z2 = 0 (so e^z2 = 1), then we get

prob for being in class 0 = yhat[0] =  e^z1/(e^z1 + e^z2) = e^z1/(e^z1 + 1) 

prob for being in class 1 = yhat[1] =  e^z2/(e^z1 + e^z2) = 1/(e^z1 + 1) 

Which is exactly what we got by using our sigmoid function earlier. So, the exponent func and the sigmoid func in the o/p layer give the same result, implying sigmoid was just a special case of the exponent func.

NOTE: In binary classification, we had an extra function at the o/p which converted yhat to 0 or 1 depending on whether its value was greater than 0.5 or not, i.e. a hard decision (hard max). Here in multi-class classification, we don't have that extra function. We just stop once we get the probabilities of each class. This is why it's called softmax.

Logits: In multi-class classification, the computed vector Z = [Z1, Z2 ... ZC] is called the logits. The shape of the logits is (C, m), where C = number of classes and m = number of examples.

Labels: In multi-class classification, the given o/p vectors Y are called labels. Each label is one-hot, so it has C entries instead of just one. The shape of the labels is the same as that of the logits, i.e. shape = (C, m).

Cost eqn: For multi-class classification, the loss function is the cross-entropy loss L(yhat, y) = -Σ y_k * log(yhat_k), summed over the C classes; for C=2 it reduces to the binary cross-entropy loss we used earlier (see the sketch below).
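A minimal numpy sketch of the softmax activation and this cross-entropy cost (my own helper names, not the assignment's functions):

    import numpy as np

    def softmax(Z):
        # Z has shape (C, m): one column of logits per example
        e = np.exp(Z - np.max(Z, axis=0, keepdims=True))  # shift for numerical stability
        return e / np.sum(e, axis=0, keepdims=True)       # each column sums to 1

    def cross_entropy_cost(Yhat, Y):
        # Y is one-hot with shape (C, m); average the loss over the m examples
        m = Y.shape[1]
        return -np.sum(Y * np.log(Yhat + 1e-12)) / m

    Z = np.array([[2.0, 0.5], [1.0, 0.5], [0.1, 0.5]])    # C=3 classes, m=2 examples
    Y = np.array([[1, 0], [0, 1], [0, 0]])                # one-hot labels
    Yhat = softmax(Z)
    print(Yhat.sum(axis=0))                               # [1. 1.]
    print(cross_entropy_cost(Yhat, Y))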

Programming Frameworks

Instead of writing all these NN functions ourselves (forward prop, backprop, Adam, gd, etc.), we can use NN frameworks, which provide all these functions for us. TensorFlow is one such framework. We'll use TensorFlow in Python for our exercises. You can get introductory material for TensorFlow, including installation, in the "python - tensorflow" section. Once you've completed that section, come back here.

Programming Assignment 1: here we have 2 parts. In the 1st part, we learn the basics of TensorFlow (TF), while in the 2nd part, we build a NN using TF.

Here's the link to the pgm assignment:

TensorFlow_Tutorial_v3b.html

This project has 3 python pgms that we need to understand.

A. tf_utils.py => this is a pgm that defines the following functions that are used in our NN model later:

tf_utils.py

  • load_dataset() => It loads test and training data from h5 files, similar to the function used in section "1.2 - Neural Network basics - Assignment 1". The only difference is that Y is now a number from 0 to 5 (6 classes), instead of being a binary number, either 0 or 1. This is because we are doing multi-class classification here. Each picture is a sign-language picture representing the number 0, 1, 2, 3, 4 or 5.
    • Training set: 1080 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (180 pictures per number). X after flattening is 2D vector with shape = (12288, 1080), while Y after flattening is 2D vector with shape = (1, 1080)
    • Test set: 120 pictures (64 by 64 pixels) of signs representing numbers from 0 to 5 (20 pictures per number). X after flattening is 2D vector with shape = (12288, 120), while Y after flattening is 2D vector with shape = (1, 120)
  • random_mini_batches() => This creates a list of random mini-batches from the set X,Y. These mini-batches are shuffled, and each has the size specified in the argument.
  • convert_to_one_hot(Y, C) => This returns a one-hot matrix for the given o/p vector Y and "C" classes. A one-hot encoding is needed for multi-class classification (see the sketch after this list).
  • predict(X, parameters) => Given i/p picture X, and optimized weights, it returns the prediction Yhat, i.e what number from 0 to 5 is the picture representing
  • forward_propagation_for_predict(X, parameters) => Implements the forward propagation for the model. It returns Z3, which is the o/p of the last linear unit (before it feeds into the softmax function to yield a[3]).
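A minimal numpy sketch of what a one-hot conversion like convert_to_one_hot(Y, C) does (the standard np.eye trick; not necessarily the assignment's exact implementation):

    import numpy as np

    def one_hot(Y, C):
        # Y has shape (1, m) with integer class labels 0..C-1;
        # the result has shape (C, m) with a single 1 per column
        return np.eye(C)[Y.reshape(-1)].T

    Y = np.array([[1, 2, 3, 0, 2, 1]])
    print(one_hot(Y, 4))    # column 0 has a 1 in row 1, column 1 in row 2, etc.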

B. improv_utils.py => this pgm is not used anywhere, so you can ignore it. It defines all the functions that are used in our NN model later: everything that was in tf_utils.py, as well as all the functions that we are going to write in test_cr2_wk3.py. So this pgm is basically the solution to the assignment, since all the functions we are going to write later are already written here. You should not look at this pgm at all, nor should you use it (unless you want to check your work after writing your own functions).

improv_utils.py

C. test_cr2_wk3.py => Below is the whole pgm,

test_cr2_wk3.py

This pgm has 2 parts to it. In 1st part, we just explore TF library, while in 2nd part, we write the actual NN model using TF.

Part 1: This is where we explore the TF library. All i/p and o/p of these examples are tensor data. NOTE: we don't use any of the functions below in the NN model that we build in part 2; this is just for practice. A minimal sketch of the constant/placeholder/session pattern follows the list below.

  • compute loss eqn: a simple loss eqn value is computed by creating a TF variable for the loss.
  • multiply using constant: multiplying 2 constant numbers and printing result.
  • multiply using placeholder: Here we feed value into placeholder at runtime, and compute 2*x.
  • linear_function(): Here we compute Y=W*X+B, where W,X,B and Y are all Tensor vectors (i.e Matrices) of a pre determined shape
  • sigmoid(z): Given i/p z, compute the sigmoid of z.
  • cost(logits, labels): This computes cost using tf func "tf.nn.sigmoid_cross_entropy_with_logits()".  This calculates cost= - ( y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z)) ). This returns a vector with 1 entry for each logits/label pair. When you have "m" examples for each logit/label, then it computes summation and mean. However in NN model that we build later, we'll be using "tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits())", which works for multiclass classification. This func is explained in Python-TF section.
  • one_hot_matrix(labels, C): This returns a 1 hot matrix for given labels, and for "C" classes.
  • ones(shape) => creates a Tensor matrix of given shape, and initializes it with all 1.
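A minimal sketch of the constant/placeholder/session pattern used in part 1 (assumes TF 1.x; on TF 2.x these calls live under tf.compat.v1 with eager execution disabled):

    import tensorflow as tf   # TF 1.x style API assumed

    # multiply using constants
    a = tf.constant(2)
    b = tf.constant(10)
    c = tf.multiply(a, b)

    # multiply using a placeholder that is fed a value at runtime
    x = tf.placeholder(tf.int64, name='x')
    two_x = tf.multiply(tf.constant(2, dtype=tf.int64), x)

    with tf.Session() as sess:
        print(sess.run(c))                          # 20
        print(sess.run(two_x, feed_dict={x: 3}))    # 6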

 

Part 2: This is where we build a neural network using tensorflow. Our job here is to identify numbers 0 to 5 from sign language pictures. We implement a 3 layer NN. The model is LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX.

Below are the functions defined in our pgm for part 2:

  • create_placeholders() => creates placeholders for i/p vector X and o/p vector Y
  • initialize_parameters() => initializes w,b arrays. W is init with random numbers, while b is init with 0.
  • forward_propagation(X, parameters) => Given X, w, b, this func calculates Z3 instead of a3 (Z3 is the output of the last NN layer, which feeds into the softmax (exponent) function)
  • compute_cost(Z3, Y) => This computes cost (which is the log function of A3,Y). A3 is computed from Z3, and cost is calculated as per loss eqn for softmax func. We use following TF func for computing cost: tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits = ..., labels = ...)). logits=Z3, while labels=Y.
  • backward propagation and parameter update: This is done using the TF func "tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)". This is explained in the TF section. The notes talk about the TF func "tf.train.GradientDescentOptimizer", but we use "tf.train.AdamOptimizer" for our exercise. This func is instantiated directly within the model, as it's a built-in func (not a user-defined func that we need to write).
  • predict() => Given an input picture array X, it predicts Y (i.e. which number from 0 to 5 the picture represents). It uses the w,b calculated by the optimization. We can provide a set of "n" pictures here in a single array X (we don't need to provide each pic individually as an array). This is done for efficiency purposes, as Prof Andrew explains multiple times in his courses.
  • model() => This is the NN model that will be called in our pgm. We provide both training and test pictures as 2 big arrays as i/p to this func. This model has 2 parts: first it defines the functions, and then it runs (calls) them inside a session (see the sketch after this list). These are the 2 parts:
    • Define the functions (i.e. call the functions above) as shown below:
      • defines func create_placeholders()  for X,Y.
      • defines func initialize_parameters() to init w,b randomly
      • Then it defines forward_propagation() to compute Z3
      • Then it defines compute_cost() to compute the total cost, given Z3 and the o/p labels Y.
      • then it defines the TF func "tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)" to do backward propagation and update the parameters for 1 iteration, optimizing w,b to give the lowest cost across the training set.
      • defines an init func "tf.global_variables_initializer()". This is needed to init all variables. See the TF section for details.
    • Now it creates a session, forms a loop, and calls the above functions
      • start the session. Inside the session, run these steps:
        • Run the init func defined above => "tf.global_variables_initializer()".
        • Make a loop and iterate the steps below "num_of_epoch" times. It's set to 1500. We will also change it to 10,000 and see the impact on accuracy.
          • Form minibatches of X,Y and shuffle them
          • iterate thru each minibatch
            • call the two funcs "tf.train.AdamOptimizer(learning_rate = learning_rate).minimize(cost)" and "compute_cost" for each minibatch. Here, we don't strictly need to call "compute_cost" explicitly, since running "minimize(cost)" evaluates the cost anyway. The reason we do it is that we want the "cost" o/p returned by "compute_cost" for our plotting of cost vs iterations.
            • We add the cost from each minibatch to "total_cost", dividing by the number of minibatches, to get the avg cost for the epoch.
        • Now plot "total_cost" vs the "number of iterations". This shows how the cost goes down as we iterate more and more.
        • Now it runs the "parameters" node again to get the values of the parameters. NOTE: running "parameters" again doesn't run the func "initialize_parameters()" again; it just returns the computed values for that node.
        • It then calls tf functions to calculate prediction and accuracy for all examples in test and training set. Accuracy is then reported for all pictures on what they actually were vs what our pgm predicted.
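A tiny self-contained stand-in for the model()/session loop described above (TF 1.x style; this is not the assignment's code, just the same pattern shown on a single LINEAR -> SOFTMAX layer trained on random data):

    import numpy as np
    import tensorflow as tf   # TF 1.x style API assumed

    n_x, C, m = 12, 3, 120
    X_train = np.random.randn(n_x, m).astype(np.float32)
    Y_train = np.eye(C)[np.random.randint(0, C, m)].T.astype(np.float32)   # one-hot (C, m)

    X = tf.placeholder(tf.float32, [n_x, None])
    Y = tf.placeholder(tf.float32, [C, None])
    W = tf.Variable(tf.random_normal([C, n_x], stddev=0.01))
    b = tf.Variable(tf.zeros([C, 1]))
    Z = tf.matmul(W, X) + b                                  # logits, shape (C, batch)
    cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(
        logits=tf.transpose(Z), labels=tf.transpose(Y)))
    optimizer = tf.train.AdamOptimizer(learning_rate=0.01).minimize(cost)
    init = tf.global_variables_initializer()

    minibatch_size = 32
    with tf.Session() as sess:
        sess.run(init)
        for epoch in range(100):
            perm = np.random.permutation(m)                  # shuffle, then slice minibatches
            epoch_cost, n_batches = 0.0, 0
            for s in range(0, m, minibatch_size):
                idx = perm[s:s + minibatch_size]
                _, mb_cost = sess.run([optimizer, cost],
                                      feed_dict={X: X_train[:, idx], Y: Y_train[:, idx]})
                epoch_cost += mb_cost; n_batches += 1
            if epoch % 20 == 0:
                print("Cost after epoch %i: %f" % (epoch, epoch_cost / n_batches))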

 Below is the explanation of main code (after we have defined our functions as above):

  1. We get our dataset X,Y by calling load_dataset().
  2. Next we can enter index of any picture, and it will show the corresponding picture for our training and test set. This is for our own understanding. Once we have seen a few pictures, we can enter "N" and the pgm will continue.
  3. Now we flatten the array returned and normalize it. We also use "one_hot" function to convert labels from one entry to a one hot entry, since our labels need to be one-hot format for our softmax func to work.
  4. Now we call our function model() defined above. We provide the X,Y training and testing arrays (which are not tensors, but numpy arrays). We see that these numpy arrays are used as tensor i/p to many functions above. I guess it still works because the conversion from numpy arrays to tensors takes place automatically when needed.
  5. In the above exercise, we used a 3-layer NN with a fixed number of hidden units for each layer. We ran it for 1500 epochs and then for 10,000 epochs; the results are below.

Results:

On running above pgm, we see these results:

  • On running the above model with 1500 iterations, we get a training accuracy of 70%.
    • Cost after epoch 0: 1.913693
    • Cost after epoch 100: 1.049044 .... => If you get "Cost after epoch 100: 1.698457", that means you are still using GradientDescentOptimizer". Switch to "tf.train.AdamOptimizer".
    • Cost after epoch 1400: 0.053902
  • When we increase the number of iterations to 10,000, our training accuracy goes to 89%. See how cost keeps on going down and then kind of flattens out.
    •  Cost after epoch 1400: 0.053902 ...
    • Cost after epoch 2500: 0.002372
    • Cost after epoch 5000: 0.000091
    • Cost after epoch 9900: 0.000003

 

Programming Assignment 2: This is my own programming assignment; it's not part of the lecture series. Here, I took an example from one of the earlier programming assignments and rewrote it using TF to see if I could. It does work, but I'm not sure everything is working correctly (the cost is different from the previous assignment, and there is no way to verify the accuracy).

test_cr2_wk2_tf.py => Below is the whole pgm. This pgm is copied from the course 2 week 2 pgm => course2/week2/test_cr2_wk2.py. We wrote the same pgm with TensorFlow functions now. We implement it for batch gd only (not the other variants).

test_cr2_wk2_tf.py

We implement a 3 layer NN. The model is LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SOFTMAX.

Even though this is a binary classifier, we still used a softmax implementation, as binary is a special case of softmax with the number of classes = 2. All the functions that we defined in pgm assignment 1 are the same here. The only diff is in the initialize_parameters() func and the model() func. The differences are explained below:

  • initialize_parameters() => here we allowed args to be passed for the "number of hidden units" in each layer, so that we can keep it consistent with our "course 2 week 2 pgm". That also allows us to play around with different numbers of hidden units per layer and observe the impact. The default layer sizes for the 3-layer NN are [2, 25, 12, 2], where the 1st entry 2 is for the input layer.
  • model() => Here, the definition of functions is same as in assignment 1. These are few differences:
    • Test set: Since we have only training set in this example, we don't have arg for "test_set" in model() func. model() func is copied from course2 week 2 and is modified wherever needed to work for TF.
    • optimizer: We call "tf.train.GradientDescentOptimizer" instead of the Adam optimizer. We could try both. This is just to keep it consistent with the "course 2 week 2" pgm.
    • cost_avg: One other diff is that we don't compute "cost_avg" by dividing by "m", as we already average when we divide by "mini_batch_size" within the loop.
    • All other part of model() is same, except that we don't evaluate test accuracy (since there's no test set)

Now we run the main pgm code the same way as in assignment 1. These are the diff:

  • We load the red/blue dataset (by using diff func load_dataset_rb_dots()). This is needed since the dataset here is different and is created by writing python cmds.
  • We convert Y label to 1 hot. This is needed for softmax function as explained earlier. We convert Y = [ 0 1 0] into Y(one_hot) = [ [1 0] [0 1] [1 0] ] where 0=red, 1=blue
  • We now call model() with desired number of hidden units, and it gives us the prediction accuracy.

Results:

This is the result we get (with the default settings we have in our pgm):

Cost after epoch 0: 0.051880
Cost after epoch 1000: 0.038114
Cost after epoch 2000: 0.030764
Cost after epoch 3000: 0.027093
Cost after epoch 4000: 0.025386
Cost after epoch 5000: 0.022814
Cost after epoch 6000: 0.021766
Cost after epoch 7000: 0.021067
Cost after epoch 8000: 0.019954
Cost after epoch 9000: 0.019063
Parameters have been trained!
Train Accuracy: 0.9166667

 

Summary:

Here we built a 3-layer NN using TensorFlow. TF is not easy or intuitive, so I'm lost too on why some things work with tensors, some with numpy, what run session does, and so on. But eventually it did work. The main takeaway is that multi-class classification worked just as easily as binary classification, and got us ~90% accuracy when trained for long enough. Our optional second assignment helped us see how to transform a regular NN pgm written using numpy into a TF NN pgm.

 

Optimization algorithms: Course 2 - Week 2

This course goes over how to optimize the algo for finding the lowest cost. We saw gradient descent, which was used to find the lowest cost by differentiating the cost function and finding the minima. However, with a NN training on lots of data, the step for finding the lowest cost may take a long time. Any improvement in the training algo helps a lot.

Several such algo are discussed below:

1. Mini batch gradient descent (mini gd):

We train our NN on m examples, where "m" may be in the millions. We vectorized these m examples so that we don't have to run expensive for loops. But even then, it takes a long time to run across m examples. So, we divide our m examples into "k" mini-batches with m/k examples in each mini-batch. We call our original gradient descent scheme "batch gd".

We form a for loop to run each mini-batch, and then there is an outer for loop to iterate "num" times to find the lowest cost. Each pass through all the examples is called one "epoch". A sketch of forming the mini-batches is shown below.
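A minimal numpy sketch of shuffling and partitioning into mini-batches (the same idea as the random_mini_batches() helper used in the assignments, but my own simplified version):

    import numpy as np

    def make_mini_batches(X, Y, batch_size=64, seed=0):
        # X: (n_x, m), Y: (n_y, m); returns a list of (mini_X, mini_Y) tuples
        np.random.seed(seed)
        m = X.shape[1]
        perm = np.random.permutation(m)           # shuffle the example indices
        X_shuf, Y_shuf = X[:, perm], Y[:, perm]
        batches = []
        for start in range(0, m, batch_size):     # the last batch may be smaller
            batches.append((X_shuf[:, start:start + batch_size],
                            Y_shuf[:, start:start + batch_size]))
        return batches

    X = np.random.randn(2, 300); Y = np.random.randint(0, 2, (1, 300))
    print(len(make_mini_batches(X, Y, 64)))       # 5 mini-batches (4 of 64, 1 of 44)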

With "batch gd", our cost function comes down with each iteration. However, with mini-batch gd, the cost function is noisy and oscillates up and down as it comes down to a minimum. One hyperparameter to choose is the size of the mini-batch. 3 possibilities:

  1. Size of m: When each mini-batch contains all "m" examples, it becomes the same as "batch gd". It takes too long when "m" is very large.
  2. Size of 1: When each mini-batch contains only 1 example, it's called "stochastic gd". This is the other extreme of batch gd. Here we lose all the advantage of vectorization, as we have a for loop over each example. Its cost function is very noisy and keeps oscillating as it approaches the minima.
  3. Size of 1<size<m: When each mini-batch contains a subset of the examples, it's called "mini batch gd". This is the best approach, provided we tune the size of each mini-batch. Typical mini-batch sizes are powers of 2, chosen as 64, 128, 256 or 512. A mini-batch size of 1024 is also employed, though it's less common. The mini-batch size should be such that a mini-batch fits in CPU/GPU memory, or else performance will fall off a cliff (as we'll continuously be swapping training set data in and out of memory).

2. Gradient Descent with momentum:

We make an observation when running gradient descent with mini-batches: there are oscillations, which are due to W, b getting updated with only a small number of examples in each step. When it sees the next mini-batch, W, b may get corrected to a different value in the opposite direction, resulting in oscillations. These oscillations are in the Y direction (i.e. values of weight/bias jumping around) as we approach the optimal value (of the cost function) in the x direction. These Y-direction oscillations are the ones that we don't want. They can be reduced by a technique known as the "exponentially weighted avg". Let's see what it is:

Exponentially weighted average:

Here, we average out the new weight/bias value with previous values. So, in effect, any dramatic update to weight/bias in current mini batch, doesn't cause too much change immediately. This smoothes out the curve.

Exponentially weighted avg is defined as:

Vt = beta*V{t-1} + (1-beta)*Xt => Here Xt is sample number "t", and V{t-1} is the previous value of the average. t goes from 0, 1, ... n, where n is the total number of samples.

It can be shown that Vt is approximately an avg over the last 1/(1-beta) samples. So, for beta=0.9, Vt is an avg over the last 10 samples; for beta=0.98, Vt is an avg over the last 50 samples. The higher the beta, the smoother the curve, as it averages over a larger number of samples.

It's called exponential because if we expand Vt, we see that Vt contains exponentially decaying "weights" for the previous samples, i.e. Vt = (1-beta)*Xt + (1-beta)*[ beta*X{t-1} + beta^2*X{t-2} + beta^3*X{t-3} + ...]

It can be shown that the weight decays to 1/e by the time we go back 1/(1-beta) samples. So, in effect, it's taking an avg of the last 1/(1-beta) samples.

However, the above eqn has an issue at startup. We choose V0=0, so the first few values of Vt are way off from the avg value of Xt, until we have collected a few samples of Xt. To fix this, we add a bias correction term as follows:

Vt (bias corrected) = 1/(1-beta^t) * [ beta*V{t-1} + (1-beta)*Xt ] => Here we divide by (1-beta^t): for the first few values of "t", (1-beta^t) is small, so the division scales Vt up to compensate for the zero initialization. As "t" gets larger, 1/(1-beta^t) goes to 1 and has no impact.
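A minimal numpy sketch of the exponentially weighted average with and without bias correction (my own variable names):

    import numpy as np

    def ewa(samples, beta=0.9, bias_correct=True):
        v, out = 0.0, []
        for t, x in enumerate(samples, start=1):
            v = beta * v + (1 - beta) * x               # Vt = beta*V{t-1} + (1-beta)*Xt
            out.append(v / (1 - beta**t) if bias_correct else v)
        return np.array(out)

    x = np.full(20, 10.0)                               # a constant signal of 10
    print(ewa(x, beta=0.9, bias_correct=False)[:3])     # [1.  1.9  2.71] -- biased low at startup
    print(ewa(x, beta=0.9, bias_correct=True)[:3])      # [10. 10. 10.]   -- bias corrected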

gd with momentum:

Now we apply the above technique of "Exponentially weighted avg" to gd with momentum. Instead of using dW and db to update W, b, we use weighted avg of dW and db to update W, b. This results in a much smoother curve, where W, b don't oscillate too much with each iteration.

Instead of doing W=W- alpha*dW, b=b-alpha*db, as we did in our original gd algo, we use weighted avg of dW and db here.

W = W - alpha*Vdw, b = b - alpha*Vdb, where Vdw, Vdb are the exponentially weighted avgs of dW and db (roughly over the last 1/(1-beta1) samples), defined as

Vdw = beta1*Vdw+ (1-beta1)*dW, Vdb = beta1*Vdb+ (1-beta1)*db

3. Gradient Descent with RMS prop:

This is a slightly different variant of gd with momentum. Here also, we use the technique of the exponentially weighted avg, but instead of averaging dW and db, we average the squares of dW and db. Also, note that in "gd with momentum", we never knew which components of dW and db were oscillating; we smoothed all of them equally. Here, we damp more strongly the components that oscillate more, and vice versa. We achieve this by dividing dW and db by the square root of their weighted avg of squares (instead of using the weighted avg of dW and db directly in the update). That way, whichever components oscillate the most (maybe w1, w7 and w9 oscillate the most), their derivatives will be the largest (dw1, dw7 and dw9 have high values), so dividing these large derivatives by larger numbers smooths them out more than the ones with lower derivatives. The eqns are as follows:

W = W - alpha*dW/√Sdw, b = b - alpha*db/√Sdb, where Sdw, Sdb are the exponentially weighted avgs of (dW)^2 and (db)^2 (roughly over the last 1/(1-beta2) samples), defined as

Sdw = beta2*Sdw+ (1-beta2)*(dW)^2, Sdb = beta2*Sdb+ (1-beta2)*(db)^2

NOTE: we used beta1 in gd with momentum and beta2 in gd with RMS prop to emphasize that they are different betas. Also, we add a small epsilon = 10^-8 to √Sdw and √Sdb so that we don't run into the numerical issue of dividing by 0 (or by a number so small that the computer effectively treats it as 0). So the modified eqns become:

W=W- alpha*dW/(√Sdw + epsilon), b=b-alpha*db/(√Sdb + epsilon)

4. Gradient Descent with Adam (Adaptive Moment Estimation):

The Adam optimization method takes both "gd with momentum" and "gd with RMS prop" and puts them together. It works better than both of them, and works extremely well across a wide range of applications. Here, we modify RMS prop a little bit: instead of using dW and db with alpha, we use VdW and Vdb with alpha. We then reduce oscillations even more, since we are applying 2 separate oscillation-reducing techniques in one. This technique is called moment estimation, since we are using different moments: dW is called the 1st moment, (dW)^2 is called the 2nd moment, and so on.

So, eqn look like:

W=W- alpha*Vdw/(√Sdw + epsilon), b=b-alpha*Vdb/(√Sdb + epsilon),where Vdw, Vdb, Sdw and Sdb are defined above

Here there are 4 hyperparameters: alpha needs to be tuned, but beta1, beta2 and epsilon can be chosen as follows:

beta1=0.9, beta2=0.999, epsilon=10^-8. These values work well in practice and tuning them doesn't help much. A sketch of a single Adam update step is below.
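A minimal numpy sketch of one Adam update step for a single weight matrix W (this version includes the bias-correction terms used in the course's assignment; the function name is my own):

    import numpy as np

    def adam_step(W, dW, VdW, SdW, t, alpha=0.001,
                  beta1=0.9, beta2=0.999, eps=1e-8):
        VdW = beta1 * VdW + (1 - beta1) * dW            # 1st moment (momentum term)
        SdW = beta2 * SdW + (1 - beta2) * (dW ** 2)     # 2nd moment (RMS prop term)
        VdW_corr = VdW / (1 - beta1 ** t)               # bias correction
        SdW_corr = SdW / (1 - beta2 ** t)
        W = W - alpha * VdW_corr / (np.sqrt(SdW_corr) + eps)
        return W, VdW, SdW

    W = np.random.randn(3, 3)
    VdW = np.zeros_like(W); SdW = np.zeros_like(W)
    for t in range(1, 11):                              # 10 dummy iterations
        dW = 2 * W                                      # gradient of sum(W^2), for illustration only
        W, VdW, SdW = adam_step(W, dW, VdW, SdW, t)
    print(np.abs(W).max())                              # entries move slowly toward 0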

5. Learning rate decay:

In all the gd techniques above, we kept the learning rate "alpha" constant. However, the learning rate doesn't need to be constant. It can be kept high when we start, as we need to take big steps, but can be reduced as we approach the optimal cost, since smaller steps suffice as we converge to the optimal value. Larger steps cause oscillations, so reducing alpha reduces these oscillations and allows us to converge smoothly. This approach is called learning rate decay, and there are various techniques to achieve it.

The simplest formula for implementing learning rate decay is:

alpha = 1/(1+decay_rate*epoch_num) * alpha0, where alpha0 is our initial learning rate and epoch_num is the current epoch number.

So, as we do more and more epochs, we keep reducing the learning rate until it gets close to 0. Now we have one more hyperparameter, "decay_rate", on top of alpha0, both of which need to be tuned.

Other formula for implementing learning rate decay are:

alpha = ((decay_rate)^epoch_num) * alpha0 => This also decays learning rate

alpha = (k/√epoch_num) * alpha0 => This also decays learning rate

Some people also manually reduce alpha every couple of hours or days based on run time. No matter which formula is used, they all achieve the same goal of reducing oscillations as training progresses; a small sketch comparing the schedules is below.
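A quick comparison of the decay schedules above (alpha0, decay_rate and k below are arbitrary illustration values, not recommendations):

    import numpy as np

    alpha0, decay_rate, k = 0.2, 1.0, 1.0

    def inverse_decay(epoch):     return alpha0 / (1 + decay_rate * epoch)
    def exponential_decay(epoch): return alpha0 * (0.95 ** epoch)
    def sqrt_decay(epoch):        return alpha0 * k / np.sqrt(epoch)

    for epoch in (1, 10, 100):
        print(epoch, inverse_decay(epoch), exponential_decay(epoch), sqrt_decay(epoch))
    # all three shrink alpha as epoch_num grows, just at different rates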

Conclusion:

Adam and all the other techniques discussed above speed up our NN learning. They solve the problem of plateaus in gd, where the gradient changes very slowly over a large region, resulting in very slow learning. All these gd techniques speed up this learning process by speeding up learning in the x direction. There is also the problem of getting stuck in a local minimum, but this turns out not to be an issue in NNs with a large number of dimensions. This is because, instead of hitting local minima (where the shape is like the bottom of a trough or valley), we hit saddle points (where the shape is like the saddle of a horse). For a local minimum, all dimensions need to have a trough-like shape, which is highly unlikely for our NN to get stuck at. At least one of the dimensions will have a slope to get us out of the saddle point, and we keep making progress w/o ever getting stuck at a local minimum.

 

Programming Assignment 1: here we implement different optimization techniques discussed above.  We apply our different optimization techniques to 3 layer NN (of blue/red dots):

  • batch gd: This is same as what we did in last few exercises. This is just to warm up. Nothing new here.
  • mini batch gd: We implement mini batch gd by shuffling data randomly across different mini batches.
  • Momentum gd: This implements gd with momentum
  • Adam gd: This implements Adam - which combines momentum and RMS prop
  • model: Now we apply the 3 optimization techniques to our 3 layer NN = mini batch gd, momentum gd and Adam gd.

Here's the link to the pgm assignment:

Optimization_methods_v1b.html

This project has 2 python pgm.

A. testCases.py => There are a bunch of testcases here to test your functions as you write them. In my pgm, I've turned them off.

testCases.py

There is also a dataset that we use to run our model on:

data.mat

B. opt_utils_v1a.py => this pgm defines various functions similar to what we used in previous assignments

opt_utils_v1a.py

C. test_cr2_wk2.py => This pgm calls functions in above pgms. It implements all optimization techniques discussed above. It loads the dataset, and runs the model with mini batch gd, momentum gd and Adam gd. Adam gd performs the best.

test_cr2_wk2.py

 

Summary:

By finishing this exercise, we learnt many faster techniques for implementing gradient descent. We would get the same accuracy for all of the gd methods discussed above; it's just that the slower gd techniques require a lot more iterations to get there. This can be verified by setting "NUM_EPOCH" to 90,000 in our program.

Derate:

We learned OCV in the previous section. However, running OCV at 2 different PVT corners may not always be practical. For example, consider the voltage drops seen on a chip due to IR. We may not be able to get a lib at that particular voltage corner after accounting for the voltage drop due to IR. Similarly for temperature, we may not be able to get a lib for the exact temperature after accounting for on-chip heating. Also, even if we are able to get these libs, OCV analysis requires running at 2 extreme corners. If we do not want to run analysis at 2 diff corners for OCV, we can run it at 1 corner only by specifying derating. Derating is an alternate approach where we speed up or slow down certain paths so that they can indirectly achieve the same results as OCV. Derating is basically applying a certain multiplying factor to each gate delay so that the delay can be scaled up (by having a multiplying factor > 1) or scaled down (by having a multiplying factor < 1). The advantage of derate is that each and every gate in the design can now be customized to have a particular delay on it. With OCV analysis, we weren't able to do this, as the flow just chose b/w the WC and BC lib and applied one or the other to each gate in the design. Here, we first choose a nominal voltage for which we have a library available, and then apply derates to achieve the effects of voltage and temperature variations.

There are different kind of derates:

  • Timing derates: When we run STA at a particular voltage/temperature, we assume the same voltage and temperature on all transistors. However, due to IR drop and temperature variation as well as aging effects, not all transistors will be at the same voltage/temperature. So, we apply timing derating. We apply these timing derates as "voltage guardband derates". Even though we say voltage, we include the effects of temperature and aging, so that the "voltage derate" covers all of these. In the PT flow, these derates are specified via "set_timing_derate -pocvm guardband" or "set_timing_derate -aocvm guardband" => This is explained later. By default, set_timing_derate specifies an ocv derate, which accounts for local process variations only. Then we apply either an aocv or pocv voltage guardband derate, which accounts for Voltage+Temperature+reliability derates.
  • POCVM distance derates: Only applied on clocks. This is additional derate on top of voltage derate above.
  • LDE (Layout dependent Effect) derates: provided by foundry. Applied as incremental derate
  • MixedVt / MixedLg derates: Due to differences in threshold voltage as well as in the "length" of transistors, we see differences in delay which don't scale the same way, i.e. the process is not correlated across different VTs, so LVT might be at the SS -3σ corner, but ULVT, instead of being perfectly at the SS -3σ corner, may be a little bit faster or slower. e.g. in a slow-slow corner the capture clock is LVT and might be slightly faster due to Vt mistracking. This VT mistracking is not OCV related: OCV models local process variations, while MixedVt models global process corner correlation. We model this MixedVt correlation effect via derates.
  • Margining derates: Other derates used for margining

What derating factor to apply for ocv/aocv/pocv is derived by running monte carlo sims.

1. set_timing_derate => It allows us to adjust delays by a multiplying factor. It automatically sets the op cond to ocv mode. The derating factor applies only to the specified objects, which is a collection of cells, lib_cells or nets. If no objects are specified, then derating applies to the whole design. report_timing -derate shows timing with the derating factor shown for each cell/net in the path. We do not derate slews or constraint arcs (as they are not supported for AOCV or POCV), but there are options for setting these in the set_timing_derate cmd.

options:

-early/-late => unlike other cmds, there is no default here. We have to specify -early to apply derating on the early (shortest-delay) paths, and -late for the late (longest-delay) paths. We need separate cmds for early and late, else derating gets applied to only one of them. The tool applies worst-case delay adjustments, both early and late, at the same time. For example, for a setup check, it applies late derating on the launch clock path and data path, and early derating on the capture clock path. For a hold check, it does the opposite. We get these derating values from simulations. First, we find the worst/best case voltage drop on the transistor power pins (after accounting for off-chip IR drop, PMU overshoot/undershoot and on-chip IR drop) and then apply derating accordingly.

  • Early derating: We apply early derating corresponding to the voltage level which would be with off chip IR drop only. This is the absolute highest voltage that can be seen by any transistor on die. Then we add extra derate to account for temperature offset. We apply same early derate for both clk and data path.
  • Late derating: We apply late derating corresponding to the voltage level which would be with off chip IR drop + on chip IR/power_switch drop + reliability margin (due to VT shift seen for transistors with low activity). This is the absolute lowest voltage that can be seen by any transistor on die. Here, we apply slightly different derate for clk and data path. For clock path, we don't add the reliability margin, since clk is always switching, so there is no reliability penalty that the clk path incurs. So, clk path sees a slightly higher voltage.

NOTE: since we apply these early/late derates, we want the nominal voltage at which we run STA to be close to these early/late voltages. If our library's nominal voltage is too far from these early/late voltages, then we have to apply a large derating, which may not produce accurate results.

-cell_delay/-net_delay => By default, derating applies to both cell and net delays, but not to cell timing check constraints. This allows derating to apply only to cell or net delays. -cell_check allows derating to be applied to cell timing check constraints (setup/hold constraints)

-data/-clock => By default, delays are derated in both data paths and clock paths. This allows derating to be applied to only data or clock

-rise/-fall => By default, both rising and falling delays are derated. This allows derating to be applied to cell/net delays that end with a rising or falling transition only

There are many more options that we'll see later (including the -aocvm guardband/-pocvm guardband options). If the options -aocvm guardband/-pocvm guardband are not used, then the above derating cmd sets an OCV derate, which only accounts for local-process-variation derates. Voltage/Temperature and reliability derates are captured via an additional derate specified with the -aocvm guardband/-pocvm guardband options. Thus the OCV derate and the aocv/pocv guardband derate are both needed to account for all PVT+reliability variations.

ex: set_timing_derate -early 0.9; set_timing_derate -late 1.2 => The first command decreases all early (shortest-path) cell and net delays by 10 percent, such as those in the data path of a hold check (and the clk path of a setup check). The second command increases late (longest-path) cell and net delays by 20 percent, such as those in the data path of a setup check (and the clk path of a hold check). These two adjustments result in a more conservative analysis. We should use derating < 1 for early and > 1 for late, since we are trying to simulate the worst-case OCV that can happen. The derating gets applied to the whole design, as we did not specify any object.

ex: set_timing_derate -increment -cell_delay -data -rise -late 0.0171 [get_lib_cells { TSM_svt_ff_1v_125c_cbest/SDF_NOM_D1_SVT}] => applies derating of 1.7% only to lib cell specified for rise dirn, and long delay path. -increment adds this derating on top of any derating that is already applied globally or to this cell earlier.

ex: set_timing_derate -cell_delay -net_delay -late 1.05 [get_cells top/H1] => sets a late derating factor on all cells and nets in the hierarchical cell top/H1, including its lower-level hierarchical cells

report_timing_derate => shows derates set for all cells in design. This is very long list so better to redirect it to some o/p file

AOCV and POCV: OCV is OK, but it doesn't model advanced levels of variation at 65nm and below, which results in overdesign. OCV allows us to model different derating for diff cells (by using the set_timing_derate cmd), but fails to capture other factors. To mitigate some of these OCV issues, advanced forms of OCV came into the picture. AOCV (advanced OCV) was used earlier, but the even more advanced POCV (parametric OCV) is used now. We will go over details of both.

PT has app_var variables which allow advanced OCV and parametric OCV. To report all app_var, we can use this cmd:

pt_shell > report_app_var *ocv* => reports all aocv and pocv app_var settings

AOCV: timing_aocvm_enable_analysis => setting it to true enables AOCV. Needs AOCV tables in a side file.

POCV: timing_pocvm_enable_analysis => setting it to true enables POCV. Needs a POCV side file or liberty file.

AOCV: Advanced on chip variation

OCV doesn't handle below factors:

  1. path depth => variation reduces on long paths due to statistical cancellation. Even if each cell has a lot of variation, due to the random nature of variations they can be +ve or -ve. The more gates in a path, the higher the chance that +ve and -ve variations cancel out, resulting in a very low level of net variation. So, path depth is a factor only for random variations on the die.
  2. path distance => variation increases as a path travels across more die area. This is based on the simple silicon observation that nearby structures have less variation, while structures far away from each other have larger variation. That is why in analog circuits, matching transistors are placed as close as possible to each other to minimize variations b/w them. So, path distance is a factor only for systematic spatial variations on the die. Even if you have more gates in a design, if they are close by, the spatial variation will be very low compared to when those gates are far apart. In other words, these variations are correlated more or less depending on their proximity to each other. So, the total variation in any path is a function of both random variation and spatial variation. However, random variation dominates, so path-distance-based variation is not very critical.
  3. different cell types => variation differs depending on the transistors used in the cells. Lower-width transistors have more variation than larger-width ones. However, cell-level derating is already captured by the simple derating cmd above, as it allows us to set different derates for different kinds of cells.

AOCV was proposed to provide path-depth and distance-based derating factors to capture random and systematic on-die variation effects respectively. Instead of applying a constant derating factor to every cell as in OCV, we adjust the derating factor for a given cell based on path distance and depth. This is provided in the form of a 2D table in a file. AOCV provides a single number for the delay of a path based on the derating used (the derating value itself is taken from the 2D table based on path depth and path distance for that cell). It only models delay variation; it does not model transition or constraint variation. Thus AOCV is the same as OCV except for the derating added for the above 2 factors.

Both GBA and PBA AOCV can be performed. GBA AOCV reduces pessimism by some compared to GBA OCV, which may be sufficient for signoff. If few paths still fail, PBA AOCV can be run on selected failing paths, which reduces pessimism even further.

AOCV flow:

 set_app_var read_parasitics_load_locations true => this is so that read_parasitics can read location info from spef file

read_parasitics file.spef => To use the distance-based derating specified in the aocv file below, we need the physical locations of the various gates, nets, etc. This info is contained in SPEF files and can be read via the read_parasitics cmd. If we have a hierarchical flow, where there are separate spef files for blocks and for the top level, PT can automatically determine the correct offset and orientation of the blocks. However, if it fails, we can specify it manually via extra args to the read_parasitics cmd.

set_app_var timing_aocvm_enable_analysis true => enables GBA AOCV analysis

read_ocvm file.aocvm => reads derating from this aocvm file. It has 2D table of derates with depth and distance as index (It can also be 1D table with either depth or distance as index, although this will give less accurate results). aocv derate take precedence over ocv derate specified for any cell, as it's more accurate. Syntax of this file is explained under pocv section below.

set_timing_derate -aocvm_guardband -early 0.95 => this applies an additional guardband to model non-process-related effects (IR drop, margin, etc.) in the aocv flow. For fast paths, we reduce delays by a further 5%. Final derate = aocv_derate * guardband_derate. set_timing_derate -increment adds a derate on top of this derate (instead of multiplying, it adds). So, Final derate = aocv_derate * guardband_derate + incremental_derate. Either a guardband derate or an incremental derate can be applied, as the two are mutually exclusive.

set_timing_derate -aocvm_guardband -late 1.04 => For slow paths, we increased delays by 4%.

update_timing => performs timing update

report_ocvm -type aocvm => reports number of cells/nets that aocvm deratings were applied on, in summarized table. Any cell/net not-annotated with derate is listed here.

report_ocvm -type aocvm I0/U123 => If object list specified,  derating on specific instances or arcs reported

report_timing => shows timing paths with aocv analysis. -derate shows derate factor too

POCV: Parametric on chip variation

POCV is even more advanced, and a radical departure from conventional methods. Here, timing values are stored not as a single number but as statistical quantities (i.e. as a gaussian distribution with mean mu and std dev sigma). These statistical distributions are propagated along the path, and probability/statistics theorems are applied to come up with a statistical distribution for the final delay at the end point of a path. AOCV models deratings only for delay, but the POCV statistical method is applied not only to delay but also to the transition variation of each cell on a path. It also models constraint variation (setup/hold times on flops), as these vary too depending on variation within the cell as well as transition variation on the clk and data pins. mu and sigma values are stored in lib files for each cell for delay, transition and constraint (the latter only if provided for flops). By default, only delay variation is enabled. Transition variation and constraint variation have to be enabled to get a better match with HSPICE monte carlo sims. Timing values can be reported at any N-sigma corner (since sigma is known). Usually, we report it at 3 sigma, as that implies 99.9% of the dies for that timing path will lie within 3 sigma (i.e. only ~0.1% of chips will fail for that path).

POCV takes care of path depth automatically, as it propagates each delay distribution as an independent random variable, so statistical cancellation takes care of path depth. Path distance is handled by using distance-based AOCV-style tables. So, these tables are 1D in the case of POCV (as opposed to the 2D tables in AOCV).

A lower VDD, as found in low-nm tech, increases sensitivity to delay, transition and constraint variation (as Vdd is close to Vth, so a small change in Vdd causes a large change in current). So POCV accounts for all this sensitivity and prevents overdesign during PnR. POCV run with GBA provides better correlation with PBA, as it reduces pessimism in GBA. On the other hand, with AOCV, exhaustive PBA had to be run at signoff since GBA has built-in pessimism, increasing run time. The tight GBA/PBA correlation in POCV avoids running exhaustive PBA.

POCV is run in PT as regular flow. The only extra step is reading variation information from liberty files, or from sidefiles in AOCV like table. Then timing is reported at specific sigma corner.

POCV input data: 2 methods: one is providing a sidefile for distance-based derating (or a single sigma value, called the single coefficient), and the other is a liberty file with sigma values across i/p slew rate and o/p load. Derates in POCV can be applied to both mean and sigma values. The derate is applied to the mean values available in the .lib file, and to the sigma values available in the .lib file or in the sidefile as a single coeff.

1. sidefile with POCV single coefficient: This sidefile is just an extension of AOCV table format in version 4.0 (this is same synopsys file format as shown in AOCV section). It can either have distance based derate, or constant coefficient for sigma. Syntax is as below: (file1.pocvm or file1.aocvm or any name)

version: 4.0

ocvm_type: pocvm => it can be aocvm or pocvm
object_type: lib_cell => this can be design, lib_cell or hier cell
rf_type: rise fall => for both rise/fall

voltage: 0.9 => this allows voltage based derating where diff derate can be applied based on voltage the cell is at.
delay_type: cell => this can be cell or net. For net, object_spec is empty to apply it for all nets
derate_type: early => early means it's applied on shortest paths (for setup, clk paths are early, while for hold, data paths are early => to model worst case scenario)
path_type: clock => below derating is applied only for cells in clk path (applicable only for setup). For cells in data path for early (applicable only for hold), we may specify a different derating (<1).
object_spec: -quiet TSM_LIB/* => applies to all cells. For particular cells, we can do TSM_LIB/invx1*. -quiet prevents warnings from showing up
distance: 0 5000000 10000000 20000000 30000000 40000000 => this specfies distance for derating purpose
table: 1.000000 0.990442 0.986483 0.980885 0.976588 0.972967 => since type is early (fast paths), derates are < 1 to model worst scenario.

coefficient: 0.05 => this specifies single coefficient which is sigma divided by mean = sigma/mu (random variation coeff normalized with mu). This is specified if we want to do single coeff based POCV for our timing runs, instead of more accurate liberty based sigma. However, coeff and distance are mutually exclusive, you can specify only one of them. Different values can be specified for diff cells, etc. Usually more accurate lib files used to provide sigma, instead of providing coefficient here.

ocvm_type: pocvm
object_type: lib_cell
rf_type: rise fall
delay_type: cell
derate_type: late => late means it's applied on longest paths
path_type: clock => below derating only for cells on clk path (applicable only for hold). We specify derating separately for cells on data path for late (applicable only for setup)
object_spec: -quiet TSM_LIB/*
distance: 0 5000000 10000000 20000000 30000000 40000000
table: 1.000000 1.009558 1.013517 1.019115 1.023412 1.027033 => since type is late (slow paths), derates are > 1 to model worst scenario. 

coefficient: 0.02

2. LVF (liberty variation format) file: These files have additional groups which contain sigma info for delay, transition and constraint variation. They may also have distance based derating values here (using the ocv_derate group), instead of keeping them in a sidefile.

The format of this is explained in the Liberty section.
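
Just to give a feel for it, below is a rough, made-up sketch of one LVF sigma group sitting inside a cell's timing() arc. The group names (ocv_sigma_cell_rise, etc.) and the sigma_type attribute follow the Liberty LVF convention, but the template name and all numbers here are purely illustrative; check your own library / the Liberty section for the real format:

ocv_sigma_cell_rise ("delay_template_2x2") {
  sigma_type : "early_and_late";
  index_1 ("0.01, 0.10");    /* i/p transition */
  index_2 ("0.001, 0.010");  /* o/p load */
  values ("0.002, 0.004", \
          "0.003, 0.006");   /* sigma of cell rise delay */
}

Similar groups hold the fall delay sigma (ocv_sigma_cell_fall), transition sigma (ocv_sigma_rise_transition / ocv_sigma_fall_transition) and constraint sigma (ocv_sigma_rise_constraint / ocv_sigma_fall_constraint).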

POCV flow:

set_app_var read_parasitics_load_locations true => this is so that read_parasitics can read location info from spef file

read_parasitics file.spef => reads the parasitics (along with the location info needed for distance based derating)

set_app_var timing_pocvm_enable_analysis true => sets GBA POCV analysis

set_app_var timing_pocvm_corner_sigma 3 => sets corner value for timing to be 3 sigma. It can be set to 5 sigma for more conservative analysis

set_app_var timing_enable_slew_variation true => to enable transition variation (i/p slew variation affects delay variation as well as o/p transition variation). Optional but recommended for better accuracy at < 16nm

set_app_var timing_enable_constraint_variation true => to enable constraint variation (setup/hold, rec/rem, clkgating variation). Optional but recommended for better accuracy at < 16nm

read_ocvm file.pocvm => reads single coeff or distance based derating from side file based on what's available

set_timing_derate -pocvm_guardband -early 0.95 => For fast paths, we reduce delays by a further 5%. The POCV guardband is applied on both mean delay and sigma delay (AOCV guardband is only for mean delay, as there's no concept of sigma in AOCV). If we want to derate only the sigma delay, we can scale the pocvm coefficient from the sidefile or liberty file (w/o modifying the value directly in the sidefile or liberty file) by using "set_timing_derate -pocvm_coefficient_scale_factor 1.03", which scales only sigma and not mean delay. Note that pocvm coeff scaling is applied on top of the guardband for sigma delay, as the formulas and the worked example below show.

  • Final derate_mean = pocv_derate * guardband_derate + incremental_derate,
  • Final_derate_sigma = guardband_derate * pocvm_coefficient_scale_factor
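
As a quick sanity check on how these combine (numbers below are just illustrative: a distance based pocv derate of 0.99 picked from the sidefile, the early guardband of 0.95 set above, a coefficient scale factor of 1.03, and no incremental derate):

  • Final_derate_mean (early) = 0.99 * 0.95 + 0 = 0.9405
  • Final_derate_sigma (early) = 0.95 * 1.03 = 0.9785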

set_timing_derate -pocvm_guardband -late 1.04 => For slow paths, we increase delays by 4%

update_timing => performs timing update

report_ocvm -type pocvm => reports summarized pocvm deratings applied on cells/nets. If object list provided, it shows coeff and distance based derating picked from sidefile or LVF

report_timing => shows timing paths with pocv analysis. -derate shows the derate factor too. However, now we may want to see both mean and sigma delays (since sigma delays are taken into account when reporting slack). The sigma of a path is not a simple sum of the arc sigmas, but the square root of the sum of their squares (since we are now dealing with statistical quantities), and the reported slack includes this path sigma at the chosen corner. To see both mean and sigma delays, set this app var:

set_app_var variation_report_timing_increment_format delay_variation => Now report_timing -derate will show 2 columns: mean (delay w/o variation) and sensit (sigma, or delay variation). The incremental time column for that arc should equal mean +/- 3*sensit (+ or - depending on slow (max) or fast (min) paths). Mean and sensit are reported with derating already applied. Apart from the incremental columns, there are also path columns, which show both mean and sensit again. Mean here is the cumulative mean up to that point (sum of all means), while sensit is the cumulative sensitivity up to that point (square root of the sum of squares of all sigmas). These help to verify various path delays and how they contribute to overall delays. There are also statistical corrections applied to get the numbers to add up, and statistical graph pessimism applied in timing analysis. Latch borrowing also needs to be treated differently when in POCV mode.
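
As an illustration of how the cumulative sensit column builds up (made-up numbers):

  • if the arc sigmas along a path are 4 ps, 3 ps and 2 ps, the cumulative sensit = sqrt(4^2 + 3^2 + 2^2) = sqrt(29) ≈ 5.4 ps, not 4 + 3 + 2 = 9 ps
  • with a cumulative mean of 500 ps, the 3-sigma corner path delay ≈ 500 + 3*5.4 ≈ 516 ps for a late (max) path, or 500 - 3*5.4 ≈ 484 ps for an early (min) path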

  • final cell_delay mean derated = Cell_delay_mean * final_derate_mean = cell_delay * ( "POCVM guardband" * "POCVM distance derate" + "Incremental derate" )
  • final cell_delay sigma derated = cell_delay_sigma_adjusted * final_derate_sigma = cell_delay_sigma_adjusted * ( "POCVM guardband" * pocvm_coefficient_scale_factor) => cell_delay_sigma here is adjusted from the original sigma number reported in the liberty file (if LVF is used) by accounting for the fact that transition variation on the i/p will affect delay variation (as well as o/p transition variation), depending on the correlation b/w transition and delay. A proprietary formula is applied by Synopsys to come up with the adjusted sigma number. See the worked example below.
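
Continuing the illustrative derate numbers from the example above, here's how a single cell's numbers would come out (the cell delay and sigma values are made up):

  • for a cell on an early (min) path with mean delay 100 ps and adjusted sigma 5 ps: final derated mean = 100 * 0.9405 = 94.05 ps, and final derated sigma = 5 * 0.9785 ≈ 4.89 ps
  • the incremental delay reported at the 3-sigma corner ≈ 94.05 - 3 * 4.89 ≈ 79.4 ps (minus since it's an early/min path; a late/max path would use + and derates > 1)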

Now update_timing runs pocv, and report_timing shows timing paths with pocv analysis.

report_delay_calculation -from buf1/A -to buf1/Z -derate => This shows detailed calculation of cell mean delay and cell sigma delay with derating. This is useful for debug.

POCV analysis with xtalk: POCV analysis is only applied on non-SI cell delay. However, POCV can indirectly change crosstalk delta delay due to the difference in timing window alignment.

POCV characterization: POCV data can be generated via Synopsys SiliconSmart too. It can generate sigma for use in LVF, or a coeff for use in the pocv sidefile. Distance based derating in the sidefile can't be generated via any tool, and is instead derived from silicon measurements.

Car Oil Change:

If you have a regular car, chances are that an oil change is one of the things that you have to do once or twice a year. Without an oil change, your car may get into serious trouble. If you skip oil changes for a couple of years, or bought a used car with neglected oil changes, then your car may not last long. The oil change is the #1 maintenance item that you have to do. If you buy a new car, for the first 10 years of ownership, you may not have to do any other maintenance except for an oil change every year. So, don't skip it, but also do not overdo it. Overdoing it doesn't harm your car, but it may be money getting flushed down the toilet.

Oil change can be done either yourself or at a mechanic's shop. I'll list both options below:

1. At a shop: This usually runs from $20 to $100 depending on the car. Older or cheaper cars use conventional oil, which is usually cheaper but needs an oil change every 6 months or so. Newer cars use "synthetic oil", which is usually expensive but needs an oil change just once a year. More about oils below. So, in the long run, all kinds of cars cost about $50-$100 in oil changes per year. Walmart is the cheapest place for an oil change. They don't look for unnecessary repairs for your car, or come back with a list of 100 things that you have to get done on your car. If you take your own oil and filter to them, they charge you just for the labor, which is $10. However, some Walmart locations will still charge you the full price irrespective of whether you bring your own oil+filter or get their oil.

Other places such as Jiffy Lube, Pep Boys, car dealers etc. have coupons for oil changes. Look on their websites. Car dealers do your oil change for almost the same price as these local chains or mom and pop stores, especially if you use the coupon that they have on their website most of the time. I would rather get the oil change from the dealer than from these smaller shops.

2. Do it yourself (DIY): Of course you are here on this website to save money, so we do everything humanly possible ourselves. An oil change is such an easy job that it can be done yourself in less than 30 minutes in your parking space. It not only saves you money but time too. Going to a repair shop, waiting there, getting the oil change and coming back is half a day lost for nothing. Not to mention the cheap oil that is used, on top of it being an expensive oil change. I've started doing my own oil changes, and I had never done any tool work in my life before. So, if I can do it, anyone can do it !! You will need to buy some tools though, but they will all pay off in one oil change.

Items needed for Oil Change:

Get these things before you start doing the oil change. They are a one-time investment, and can be reused.

1. Ramp: You need to raise the car, so that you can get under it. This is the most dangerous part, and it gets people extremely scared to do an oil change. Raising the car on jacks is one option, but jacks are not easy, and you don't know if it's done right. Ramps are the solution to this. You just drive your car over the ramps. The front 2 wheels get on the ramps, and are now raised by a couple of inches compared to the back wheels. Then you slide in from the front of the car. Only 1/4th to 1/3rd of your body needs to be under the car, as the oil drain screw is at the front of the car. These ramps are very sturdy. One thing to note is that your ramps start sliding once you drive your front wheels onto them. To prevent that, place the front of the ramps against some raised edge (like the entry of a garage, where the outside of the garage is a little bit lower than the inside, giving an edge that will prevent sliding). Or people use a rope, or some other tricks. See the slickdeals links below for various ideas.

I bought Rhino ramps from walmart for about $35. Here's a deal for  ramps for $30:

https://slickdeals.net/f/15026011-rhino-gear-rhinoramp-29-99-at-advance-auto-parts

2. Screw: Opening the screw at the bottom of the car, from where the oil drains, is another big thing. You need to have the right size "screw opener" (socket) to open it. Look at YouTube videos for your particular brand of car and find out the size you need. Any generic toolbox has the screw opener you need.

3. Oil Filter Wrench: You not only need the screw opener, but also the circular oil filter opener (called a wrench). This is the hardest part to find. The filter should be easy to take out, but it is circular, so it's hard for anything to grip it without slipping. I bought one at Autozone for about $10 (it specifically mentioned on the item that it works for Toyota cars), which works great on my Toyota Sienna. All the other styles that I bought never worked. So, choose this or a similar style:

Link: https://www.autozone.com/shop-and-garage-tools/oil-filter-wrench/p/performance-tool-oil-filter-wrench-w54105

4. Drain pan container: To drain oil and store it, you need a drain pan. It collects the used oil as you drain it. Then you can close the top opening of the container, take it to a local auto shop, and they will get rid of the oil for you - for free. These used to cost $7 or so, but as of 2022, I'm seeing prices of $15.

One such Link: https://www.ebay.com/itm/374355003098

5. Funnel: Any funnel to pour oil into the engine. I bought this super funnel from walmart for $2.50, but you may easily get smaller ones for a dollar or less.

Link: https://www.walmart.com/ip/FloTool-05034-Super-Funnel/20440553

6. Rags: These are any plastic sheets and old clothes lying around. You don't need to buy anything here. Just use old kids' clothes, socks, underwear or whatever. Something to soak up the oil if it drips on the floor, as well as to clean the floor that you are going to lie down on.

7. Oil:

First things first: oil is not the same as gasoline. In countries other than the USA, oil loosely refers to gasoline, but here oil and gasoline are 2 different things. Oil is the one that goes in your engine to lubricate it. You change it once a year or so on newer cars. Gasoline is what your car runs on (diesel or petrol). You will need to know what oil goes into your car. Assuming you are driving a regular vehicle, you will see oil with numbering such as 0W-20, 5W-30, etc. written on it. There will also be synthetic oil, conventional oil, full synthetic oil, etc. Most of the time, synthetic oil is what is used in newer cars. However, your car manual is the ultimate guide on what kind of oil can go in your car. Read it. Many times, even though the manual says conventional oil, you can still go for synthetic (search online forums to see if it's supported). Given an option, go for full synthetic oil, as the price is about the same as other synthetic oils (semi synthetic, blend, etc). Most cars get better mileage and need less frequent oil changes with synthetic oil.

Viscosity of oil: Oil viscosity is its resistance to flow. Higher number => thicker or more viscous (higher viscosity) oil. Oil with viscosity=5 flows better than oil with viscosity=20. Thicker oil generally gives lower fuel economy, and more stress on the engine. Oil gets thinner when hot and thicker when cold. So, to maintain optimal performance, you would want to use thinner oil in winter (as oil will get thicker anyway with colder temps), and thicker oil during summer (as oil will get thinner anyway with hotter temps). That way, you maintain the best fuel economy and less stress on your engine. This is what people used to do in the old days: they would change to oil with a different viscosity for summer and winter to keep optimal performance. These were called single weight oils. Single weight oils are not supposed to be used anymore. We have multi weight oils now that take care of viscosity automatically at high and low temps: they behave like a thinner oil when cold and like a thicker oil when hot. Thus one oil works for both summer and winter, and there's no need to change oil just because the outside temps change.

Oil Numbering (5W-20, etc.) refers to this viscosity at low and high temps. Numbers such as 5W-20 refer to the weight and viscosity (or thickness) of the oil. The letter 'W' stands for winter, i.e. the cold-temperature rating. The number before the letter refers to the oil's thickness at a cold temperature, and the number following the letter indicates the thickness at operating temperature.

Oil is expensive. Your car needs 5-7 quarts of oil. You can almost always find oil on sale. Look in the gasoline/oil deals section. It will cost about $1/quart when on sale.

 


 

Steps for Oil Change:

Search on YouTube for your specific model of car followed by "oil change", e.g. "Honda Civic 2012 oil change". You should find at least one video which has all the steps for an oil change. Even if your exact model or year is not there, many of the oil change steps are similar for the same brand of car with similar specs. This is because car companies don't make a lot of changes to where different components are placed, as that disrupts their assembly line process and incurs higher cost. Maybe once every 10 years you will see some changes, but that's it.

Toyota Sienna MiniVan: This is one of such videos showing how to change oil for 2010-2016 Toyota Sienna minivan => https://www.youtube.com/watch?v=8jlpDcSeUz0

Honda  CR-V: This vid shows how to change oil for 2007-2011 Honda CR-V => https://www.youtube.com/watch?v=FXxG5sZFFoA

 

I'll list the basic steps again for cars that I've worked on (I've done oil changes on both Toyota and Honda cars. The first time it took about an hour, but now it takes 30 minutes or less).

  1. Put the ramps on flat ground. It's better if you can put the front of your ramps against the edge of your garage. Since there's a small notch (a slightly higher surface separating the outside driveway from the inside of the garage), you can rest your ramps against that notch so that they don't slide. Now drive your car onto those ramps. It looks scary the first time you drive up, because the car goes really high (or at least it seems that way). At this point you will notice that the ramps start sliding. This is where the notch helps you and keeps those ramps from sliding. Make sure that the car's front wheels are on the flat surface of the ramps.
  2. Open the hood of the car, and take out the "oil dipstick". Check the oil level, and leave the dipstick on the side, or put it back. We'll use it to check the level of oil once we fill in the new oil.
  3. Once the car is resting in a stable state on the ramps, get underneath the car from the front. You will have to slide your body maybe 2 feet in to get to the drain nut.
  4. Get your oil drain pan under the nut that you are going to unscrew. Oil is going to drain fast as soon as you unscrew the nut. Unscrew it slowly, so that you have enough time to move the drain pan into the right spot to collect the oil from the car.
    • You need a 17mm socket (the most common drain nut size on vehicles) to remove the oil drain nut. This may or may not be in your std tool set. Get one before you start the job.
  5. Now you need to get out the oil filter that is sitting close to this nut. This oil filter is tricky, as sometimes you need a special tool to open it. I had to use the "oil filter wrench" for my Toyota Sienna. Some leftover oil will gush out from here too.
  6. Let the car sit for 5 min or so to let all the oil come out. Once done, you replace the old oil filter with a new oil filter, and put the filter back in place using the wrench. Now screw the nut back exactly to the point it was at before. If you don't screw it in enough, you will see drops of oil dripping when the car is standing overnight. If it's too tight, it may not unscrew at all the next time you do an oil change. That's even more painful, since it may require a mechanic shop visit to unscrew it, or worse, break that nut, which will cost you a bit. As a guide, I mark a point across the nut and the body of the car before I unscrew the nut. Then when I put it back, I screw it until those marks align again. That way I know it's tightened enough that it won't leak (as it wasn't leaking before).
  7. Now under the hood of the car, open the oil cap, from where you are going to fill in the oil. You put in a little bit of oil and check if it's leaking from the bottom, where you tightened the nut. If not, you fill more. Keep filling until you are slightly below the spec. If it says 6.4 quarts, you go to 5 quarts and then check the oil level. Of course the car is tilted, so the oil levels will not be accurate. Close the oil cap, and put the dipstick back where it came from.
  8. Now you start the car, and bring it back to ground.
    1. IMP: Do not drive the car before filling in the new oil. Since you have drained out old oil, the engine is w/o oil, and driving may damage your car seriously.
  9. Put in some more oil now, and keep checking the oil level via the dipstick. The dipstick is not always accurate, so you have to rely on the spec. Make sure you fill slightly under the spec, i.e. if it says 6.4 quarts, go up to 6.2 quarts and monitor the car for a day. Check the dipstick and put in some more oil if needed. Never overfill, else you'll need to drain some from the bottom by loosening the nut a little. That is too much work, so play it safe.
  10. You should keep monitoring oil levels in general, and if the level goes below the min, you should top it up with extra oil from the top. That shouldn't happen though, so if it does, it might indicate a leak or something else wrong with the engine. If it's over the max, then you need to see how much over. A little bit over is fine, but if it's too much, then drain out some oil from the bottom.

At this point you are done with the oil change. Collect your used oil, and take it to any car repair shop. They have to take your used oil and dispose of it in a lawful way. They can't refuse to take it. Or go to a Walmart auto shop, which might be a lot easier. Keep a lot of rags and papers handy, as oil may get on your hands, the floor, the inside of the car, etc. Congrats on a job well done !!