Matplotlib: one of the most popular Python packages for data visualization. It is a cross-platform library for making 2D plots from data in arrays. Matplotlib is written in Python and makes use of NumPy. Together, Matplotlib and NumPy can be considered the open source equivalent of MATLAB. MATLAB is a proprietary programming language developed by MathWorks, used for plotting as well as for carrying out various kinds of scientific computations. In fact, once you start using Matplotlib, you don't need MATLAB or any other open source equivalent of MATLAB (not even gnuplot) for plotting. Matplotlib can create static, animated, and interactive visualizations.

Official matplotlib website: https://matplotlib.org/

Very good tutorial here: https://www.tutorialspoint.com/matplotlib/index.htm

Source code for matplotlib here: https://matplotlib.org/_modules/matplotlib.html

As you can see, there are lots of functions defined in this library that can be called.

Installation: on python3.6 version. NOTE: All our examples are with python3 (version 3.6).

1. CentOS:

$ python3 -m pip install matplotlib => this installs it for specific version of python (here for python3, which is pointing to python3.6). We can verify that it got installed for python 3.6 by looking in dir for python 3.6 library.

$ ls /usr/local/lib64/python3.6/site-packages/matplotlib/* => We see this "matplotlib" dir in python3.6 lib dir, so it's installed correctly

Usage:

Pyplot: A very important module of matplotlib is pyplot.

matplotlib.pyplot is a collection of command style functions that make Matplotlib work like MATLAB. Matplotlib is the whole package; matplotlib.pyplot is a module in Matplotlib.

Here's the full source code for pyplot: https://matplotlib.org/_modules/matplotlib/pyplot.html

Here's introductory pyplot tutorial: https://matplotlib.org/3.1.0/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py

import pyplot: Before we can use pyplot, we need to import pyplot in python script:

import matplotlib.pyplot as plt => This imports pyplot module as plt (plt is just a convention, we can use any name)

from matplotlib import pyplot as plt => This is an alternative way of importing pyplot

Pyplot functions: Each pyplot function makes some change to a figure (e.g. xlabel, title, plot, etc). These are a few common ones:

pyplot.imread() => Reads an image from a file into an array. Similarly, pyplot.imsave() saves an array as an image file, while pyplot.imshow() displays an image on the axes.

pyplot.show() => displays a figure by invoking the plot viewer window. This is used in interactive mode if you want to see the plot. Most of the time, you will just want to save the plot. pyplot.savefig() saves the current figure, while pyplot.close() closes a figure window.

pyplot.bar() => makes a bar plot. There are similar functions for histogram (pyplot.hist), scatter plot (pyplot.scatter), line plot (pyplot.plot), etc. These create the plot in the background, but do not display it. We use pyplot.show() to display the plot.

A. scatter plot (plt.scatter): scatter plot is used widely in stats. The syntax for scatter plot is:

matplotlib.pyplot.scatter(x, y) => A scatter plot of y vs x with varying marker size and/or color that can be specified with optional args. x, y are array like. Fundamentally, scatter works with 1-D arrays; x, y may be input as 2-D arrays, but within scatter they will be flattened. Shape of x and y is (n,) where n is the number of elements in the 1D array. Important optional args are size and color.

s=area (or size). The marker size is given in points**2 (i.e. an area), so we usually write it in the form R**2, where R is roughly the marker width (diameter) in points. This can be provided as a scalar number or as an array of shape (n,) (i.e 1D array of n elements), in which case each size is mapped to the corresponding (x, y) point.

c=color. Color is a little more complicated. We can specify a single color format string or a sequence of colors. A sequence can be given as a 1D array of shape (n,) of numbers (ints or floats, in any range). When c is such an array of numbers, the values are not colors themselves: they are scaled linearly so that the smallest value maps to 0 and the largest maps to 1 (unless vmin/vmax or norm is given), and the scaled value is then looked up in a colormap. We can have an optional cmap option (cmap=my_map), where we specify a colormap instance or a registered colormap name. cmap is used only if c is an array of n numbers.

ex: Below ex draws X-Y plot with random size and random color of each point which is itself placed randomly.

import numpy as np
import matplotlib.pyplot as plt

N = 50
x = np.random.rand(N) #This returns 1D array with 50 random entries in it (all b/w 0 and 1). returns x = [ 0.46875992 .... 0.498185 ]
y = np.random.rand(N) #similar 1D random array y
colors = np.random.rand(N) #similar 1D array for color. Each of the 50 entries corresponds to one (x,y) point. Each colors value is a random floating num from 0 to 1.
area = (30 * np.random.rand(N))**2 # 0 to 15 point radii. This is again a 1D array with 50 random entries, one for each (x,y) point.
plt.scatter(x, y, s=area, c=colors, alpha=0.5) #c is color, s is area in points**2. alpha is the "alpha blending value", between 0 (transparent) and 1 (opaque). alpha=0.5 is in between transparent and opaque.
plt.show() #this shows the plot for display

ex: Below ex draws the same kind of random plot as above, but here size is fixed at 40 and we use a colormap called Spectral. A colormap is a mapping from number to color, so each number in the color array is mapped to a color accordingly.

color=np.random.randint(5,size=(400,))

plt.scatter(X[0, :], Y[0, :], c=color, s=40, cmap=plt.cm.Spectral) => This plots elements of 2D arrays X and Y (which must have 400 columns each, to match the 400 color values). Here [0,:] selects index=0 along axis=0, i.e. the first row of each array, which is a 1D array. c is color, s is area of 40 and cmap is a colormap instance (a registered colormap name such as "Spectral" also works). Here color is provided as a 1D array of 400 values, with each value being a random int from 0 to 4. The values are scaled linearly between their min and max and then looked up in the colormap: with Spectral, the lowest value (0) maps to red, the highest value (4) maps to blue/purple, and the values in between map to the intermediate colors. So, each dot on the scatter plot gets a random color from a set of 5 colors. Because of this scaling, real numbers in any range can be provided for color and it still works.
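
How numeric c values end up as colors can be checked directly. The sketch below is an illustration, not the code that scatter() runs internally: it assumes the default behaviour (no vmin, vmax or norm passed) and uses matplotlib.colors.Normalize plus the Spectral colormap to reproduce the two steps, scaling to the 0..1 range and then looking up the color. The names values, scaled and rgba are made up for this example.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

values = np.array([0, 1, 2, 3, 4])  # the kind of numbers passed as c=...

# Step 1: scale linearly so that the min maps to 0.0 and the max maps to 1.0
norm = Normalize(vmin=values.min(), vmax=values.max())
scaled = np.asarray(norm(values))

# Step 2: look up each scaled value in the colormap (one RGBA row per value)
rgba = plt.cm.Spectral(scaled)

for v, s, color in zip(values, scaled, rgba):
    print(v, s, np.round(color, 2))
```

The first row is dominated by its red component and the last row by its blue component, which matches the red-to-blue spread seen in the scatter plot.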

B. Line plot (plt.plot): We draw a line plot (most common kind of plots) using plt.plot() func.

Ex: Below is a simple plot of sine wave, where x is varied from 0 to 2*pi, while y is a sine wave constructed from these x values.

#!/usr/bin/python3.6
from matplotlib import pyplot as plt

import numpy as np
import math #needed for definition of pi

x = np.arange(0, math.pi*2, 0.05) #numpy lib has arange function that provides ndarray objects of angles between 0 and 2π with step=0.05. If we increase this step to 1, we get a very non smooth plot.

y = np.sin(x)
plt.plot(x,y) #plots line plot with x and y

plt.xlabel("angle")

plt.ylabel("sine")

plt.title('sine wave')

plt.show() #displays the plot

Ex: plot an array with no x values provided => uses index of array as X values
a = [1, 2, 3, 4, 5]
plt.plot(a) #here we provide only Y axis values to plot. Since it's an array, the array indices are plotted on X axis.
b = [0, 0.6, 0.2, 15, 10, 8, 16, 21]
plt.plot(b, "or") #Here we plot "b", but not as a line. Format "or" means o=circle marker and r=red, with no line style given. So it is drawn like a scatter plot, with X axis being the index, and dots being red circles.


C. Contour plot (plt.contour): This is one of the ways to show a 3D surface on a 2D plane. More details here: https://www.tutorialspoint.com/matplotlib/matplotlib_contour_plot.htm
meshgrid is needed to create a grid of X and Y array values on which the plot of Z is drawn. plt.contourf() fills the regions between contour lines with color, while plt.contour() just shows the contour lines.

ex: draws meshgrid with X and Y both in range of -3 to +3. Z is the function whose contour plot we draw.

import numpy as np
import matplotlib.pyplot as plt
xlist = np.linspace(-3.0, 3.0, 100)
ylist = np.linspace(-3.0, 3.0, 100)
X, Y = np.meshgrid(xlist, ylist)
Z = np.sqrt(X**2 + Y**2)
plt.contourf(X, Y, Z)
plt.show()


Figures: all plots that we saw above used a figure to draw the plot on. We never had to call any function separately to create a figure. However, we can do that too.

1. Figure (plt.figure): The matplotlib.figure module contains the Figure class. It is a top-level container for all plot elements. A Figure object is instantiated by calling the figure() function from the pyplot module (figure() creates an instance of the Figure class and returns it. See OO section for details on how instances are created)

fig=plt.figure(figsize=(3,4)) => This returns figure instance by calling figure class. Here we specified figure (width,height) in inches. If we don't specify anything, i.e. plt.figure() then default size figure is opened.

2. Axes: Axes object is the region of the image with the data space. The Axes contains two (or three in the case of 3D) Axis objects. Axes object is added to figure by calling the add_axes() method. It returns the axes object and adds an axes at position rect [left, bottom, width, height] where all quantities are in fractions of figure width and height.

ax=fig.add_axes([0.05,0.06,0.9, 0.9]) => This adds axes on the figure with left edge at 0.05*figure_width and bottom edge at 0.06*figure_height; the axes width and height are 0.9 of the figure width and height. Usually you'll see [0,0,1,1] as axes position, but that doesn't show the axis ticks and labels, as the axes are right on the edge of the figure. So, we leave some margin on the sides.

Now on the axes object, we can draw plots, put labels, legends, etc

l1 = ax.plot(xlist,ylist,'ys-') # solid line with yellow colour and square marker
l2 = ax.plot(x2list,ylist,'go--') # dash line with green colour and circle marker
ax.legend(labels = ('tv', 'Smartphone'), loc = 'lower right') # legend placed at lower right
ax.set_title("Advertisement effect on sales") #title on top of axes plot
ax.set_xlabel('medium') => label on x axis
ax.set_ylabel('sales') => label on y axis
plt.show() => this finally shows the fig with the plot on it
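
The snippet above relies on xlist, x2list and ylist being defined elsewhere. A self-contained version might look like the sketch below. The data values are invented purely for illustration, the axes rectangle is an arbitrary choice with some margin, and the figure is written to an in-memory buffer instead of a window so that it runs without a display.

```python
import io
import matplotlib
matplotlib.use("Agg")  # assumption: no display available; remove to get a window
import matplotlib.pyplot as plt

# Hypothetical data, invented for this sketch
medium = [1, 2, 3, 4, 5]
tv_sales = [109, 64, 110, 72, 126]
phone_sales = [82, 37, 94, 90, 71]

fig = plt.figure(figsize=(6, 4))             # (width, height) in inches
ax = fig.add_axes([0.1, 0.12, 0.85, 0.8])    # [left, bottom, width, height] as fractions

ax.plot(medium, tv_sales, "ys-", label="tv")              # yellow, square markers, solid line
ax.plot(medium, phone_sales, "go--", label="Smartphone")  # green, circle markers, dashed line
ax.legend(loc="lower right")
ax.set_title("Advertisement effect on sales")
ax.set_xlabel("medium")
ax.set_ylabel("sales")

buf = io.BytesIO()
fig.savefig(buf, format="png")   # render the figure as PNG bytes in memory
```

Passing label= to each plot() call and then calling legend() without labels is one common way to keep each label next to the line it describes.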

3. subplots: Apart from the plot function, we also have the subplots function, which is used to create a figure and a set of subplots. The subplots() function returns both the fig and the axes. subplots() is used when we want to create more than 1 plot on the same figure. plt.subplots(nrows, ncols) draws a subplot grid with nrows and ncols.

ex: fig, ax = plt.subplots() => Now we can use fig and axis the regular way (i.e ax.set_xlabel, etc). i.e ax.plot(x,y)

ex: Below ex creates 2 rows and 2 cols with 4 plots total. We plot only 2 of them, so the other 2 plots remain empty.

fig, axs = plt.subplots(2, 2)

axs[0, 0].plot(x, y)

axs[1, 1].scatter(x, y)

plt.show()

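We can also use plt.title, plt.xlabel, plt.ylabel etc, and bypass ax.set_title, etc. The difference is which axes they act on: the plt.* functions always apply to the "current" axes (normally the most recently created one), while the ax.set_* methods apply to the specific axes object they are called on. With a single plot both give the same result; with several subplots, the ax.set_* form makes it explicit which plot is being changed.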
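
A small experiment shows where a plt.* call lands when there is more than one subplot. This is only a sketch: it assumes a 1x2 grid, and the title strings are arbitrary.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

fig, axs = plt.subplots(1, 2)

plt.title("set via pyplot")        # applies to whatever the current axes is
axs[0].set_title("set via axes")   # applies to exactly this axes object

current = plt.gca()                # the axes that plt.* functions act on

print(axs[0].get_title())
print(axs[1].get_title())
```

Here the pyplot call ends up on the last subplot created, which is easy to overlook in a larger grid.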

GDP = Gross Domestic Product. Now you know it !!

We hear this term so much in everyday life, yet I knew nothing about it. So, I started researching it to find out what it is, and whether it has any relevance at all.

Wages and salaries: If we add up everyone's wages and salaries in a given year, that may give us some idea of how much more money people are making every year, compared to the previous year. However, even if a person makes more money this year than last year, it doesn't mean much unless, with his increased paycheck, he could buy more things.

Let's say with $100 in wages last year, a person was able to buy 20 kgs of rice (at $5/kg). This year, let's say he made $110, but could still buy only 20 kg of rice (since the price of rice this year is $5.50/kg). Then he didn't really make any more money, since he could buy the same amount of rice as last year and nothing more.
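
The rice arithmetic can be written out as a few lines of Python. This is just the example's own numbers plugged in; the variable names are made up for the sketch.

```python
# Numbers taken from the rice example in the text
nominal_wage = {"last_year": 100.0, "this_year": 110.0}   # dollars
rice_price = {"last_year": 5.0, "this_year": 5.5}         # dollars per kg

# What the wage actually buys in each year
kg_of_rice = {year: nominal_wage[year] / rice_price[year] for year in nominal_wage}

wage_growth = nominal_wage["this_year"] / nominal_wage["last_year"] - 1
price_growth = rice_price["this_year"] / rice_price["last_year"] - 1
real_growth = kg_of_rice["this_year"] / kg_of_rice["last_year"] - 1

print(kg_of_rice)     # same quantity of rice in both years
print(wage_growth, price_growth, real_growth)
```

The wage and the price both rise by about 10%, so the growth measured in kg of rice is zero.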

Historical GDP for each country can be found here: https://data.oecd.org/gdp/gross-domestic-product-gdp.htm

Total world GDP = $80T, USA = $20T, China = $12T, Japan = $5T, Germany = $4T. Next are India, UK, France and Brazil. Just the top 15 countries, with 50% of world population, account for 75% of world GDP. GDP grows at a rate of 3% per year. From $1.4T in 1960, it has grown 50 fold in the last 50 years. No matter which part of the world you live in, you need to spend about $1K/year on food and basic supplies to survive. So, 8B people would imply a minimum of $8T in GDP. Of course, the top 10% of the world spend more than that on a phone every year, so the rest of the GDP comes from those rich people. About 2.5B people in the world are very poor, do not get enough to eat and are malnourished. 70% of world population lives on less than $10/day. Only 7% or 500M people live on > $50/day. Most of these rich people are in USA, Canada, UK, Australia and western European countries.

Usually countries with high population also have large GDP. Since GDP is closely tied to increasing population (more people, more consumption, more GDP), countries with fast increase in population will have higher GDP growth every year.

USA GDP:

USA GDP data is more reliable than world GDP data. And since we'll be looking at US economy data in more detail, it's better to look at USA GDP. I'll mostly be talking about nominal GDP and not real GDP.

Nominal GDP = GDP in current US dollars

Real GDP = Nominal GDP adjusted for inflation (for this, we take some particular year as the baseline, and express GDP in that year's prices).

US nominal GDP: https://fred.stlouisfed.org/series/NGDPNSAXDCUSQ

NumPy: Numerical Python. A very popular Python library used for working with arrays. Python has native lists that work as arrays, but they are slow for numerical work. NumPy is much faster. It has a lot of functions to work with arrays too. It is the fundamental package for scientific computing with Python. NumPy is used heavily in ML/AI, so we need to have this installed. All exercises in AI use NumPy.

Official numpy website with good intro material is: https://numpy.org/doc/stable/

A good tutorial is here: https://www.geeksforgeeks.org/python-numpy

Installation:

CentOS: We install it using pip.

$ sudo python3.6 -m pip install numpy => installs numpy on any Linux OS. We can also run "sudo python3 -m pip install numpy".

Arrays:

Basics of Array: The number of square brackets [ ... ] at the beginning or end determines the dimension of the array. So, [ ... ] is 1 dimensional, while [ [ ... ] ] is 2 dimensional and so on, as you will see below.

1 Dimensional array is an array which needs only 1 index to find any element. ex: arr_1D = [ 1 2 3 4 5 ] => This is a 1D array with 5 elements. arr_1D[0]=1, arr_1D[1]=2, ...

2 Dimensional array is an array which needs 2 indices to find any element. So, we have 2 square brackets here.

ex: arr_2D = [ [1 2 3 ]  [7 8 9]  [4 5 6]  [2 4 6] ] => Here we see that the outer array has 4 elements (similar to 1D array), but now each element of this outer array is itself an array. So, if we try to print each element of this outer array, it will print an array. ex: arr_2D[0] = [1 2 3], arr_2D[1] = [7 8 9], and so on. Now if we want to print an element of an internal array too (i.e the final value stored in the array), we have to provide that index too, i.e arr_2D[1][2] = 9 => here arr_2D[1] points to array [7 8 9], and within this we can pick any index. So, if var=arr_2D[1] = [7 8 9], then var[0]=7, var[1]=8, var[2]=9. But here var happens to be arr_2D[1], so arr_2D[1][2] gives the 2nd internal array and the 3rd entry in this array. So, the valid indices run from 0 to 3 for the 1st index and from 0 to 2 for the 2nd index.

Sometimes writing 2D array in other way is more visual. Writing above array in row/col format, we now see that there are 4 rows and 3 cols. So, it's 4X3 matrix array, i.e outermost has 4 elements and each of that contains 3 elements.

[ [ 1 2 3 ] 

  [7 8 9] 

  [4 5 6] 

  [2 4 6] ]

ex: arr_2D = [ [1] [2] [3] ] => Each element is 1D array. So, it's a 3X1 matrix, i.e outermost has 3 elements and each of that contains 1 element. So, shape is 3X1, and dimension is 2.

[ [1]

 [2]

 [3] ]

ex: arr_2D = [ [1 2 3] ] =>Each element is 1D array with 1X3 matrix. So, shape is 1X3, and dimension is 2.

3 Dimensional array is an array which needs 3 indices to find any element.

ex: arr_3D = [ [ [1 2 ] [3 4] ] [ [5 6] [7 8] ]  [ [9 0] [1 4] ] ]. Here outer array has 3 elements, all 3 of which are 2D array. The 2D array is 2X2. So, full array range is arr_3D[0:2][0:1][0:1]. so, it's a 3X2X2 matrix, i.e outermost has 3 elements and each of that contains 2 elements and each of these 2 elements finally contains 2 elements. So start with innermost entries, that determines the final dimension of matrix. Then move outward.

[ [ [1 2 ] [3 4] ]

 [ [5 6] [7 8] ]

 [ [9 0] [1 4] ] ]
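
The bracket-counting rule can be checked on the same example arrays. The sketch below builds them with np.array (the NumPy function introduced in the usage section) and prints the shape and a few elements.

```python
import numpy as np

arr_2D = np.array([[1, 2, 3], [7, 8, 9], [4, 5, 6], [2, 4, 6]])
arr_3D = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 0], [1, 4]]])

print(arr_2D.shape)      # (4, 3): 4 inner arrays of 3 elements each
print(arr_2D[1])         # [7 8 9]
print(arr_2D[1][2])      # 9

print(arr_3D.shape)      # (3, 2, 2)
print(arr_3D[2][1][0])   # 1: third 2D block, its second row, first entry
```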

Usage of Numpy:

We saw array module in python section to create arrays. However, it's highly preferred to use numpy module to work on arrays, instead of using array module, that's included in python by default.

Import numpy module:

First we need to import numpy module in our python script in order to use it:

ex: import numpy => imports numpy. Now, we can call numpy functions as numpy.array, etc.

NumPy is usually imported under the np alias, so that we can use the short name np instead of longer NumPy

ex: import numpy as np

Creating numpy array:

After importing numpy module, we can use array( ) function in numpy module to create numpy array object. The class of this array object is ndarray (it will be seen as "numpy.ndarray" object in pgm). See in "python: Object Oriented" section on how classes are created.

array() function: Input to the array function can be python objects of data type list, tuple, etc. See the Python section for the defn of list and tuple. These lists, tuples, etc are converted into a numpy array object of class "ndarray" by the array() func. An array in NumPy is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. The type of array created is figured out automatically based on the type of the input contents (i.e if the list has int type, then the array created has int type). If we have mixed contents, NumPy picks a single type that can hold all of them (for example, ints mixed with floats become floats, and numbers mixed with strings become strings). We can also explicitly define a type for the ndarray object, which we'll see later.

import numpy as np
arr = np.array( [1, 2, 3, 4, 5] ) #here input is a python list, with all integers. An ndarray object is created with all integer elements, i.e arr = [1 2 3 4 5]

arr = np.array((1, 2, 3, 4, 5)) # here tuple is provided as an input to array() function.

Print: can be used to print elements of array
print(arr) => prints array elements [1 2 3 4 5]. "arr" is ndarray object. It has no commas when it's printed. We don't know what form is ndarray object stored internally, but "print" func prints it in this form. This is 1D array. arr[0] = 1, arr[4]=5, and so on

NOTE: In above ex, the input list, tuple etc, has elements which are separated by a comma (as per the syntax of list, tuple, etc), and they get printed the same way with commas. However, the output of array() func is ndarray object, which is printed with no commas. i.e arr = [1 2 3 4 5]. However, [1 2 3 4 5] is not ndarray object (it's just the printed o/p), arr is the ndarray object. If we try to apply any numpy func on [1 .. 5], we'll get an error: i.e. arr=np.array([1 2 3 4 5]) gives syntax error.

We can't create numpy array by just assigning a python "list" to a var.

arr= [ [1,2], [3,4],[5,6] ] => This assigns the "list" to var "arr". It's not a numpy array (since we didn't use numpy.array() function on this), so ndarray attributes such as arr.shape are not available on it. However, NumPy functions do work on it, i.e np.squeeze(arr) will work, even though arr is a list (and NOT an ndarray object). This is because most numpy functions accept any "array-like" input and convert lists or tuples into an ndarray object internally before operating on them. The best thing to do is still to convert the list/tuple into a numpy "ndarray" object using the np.array() func, and then work on it. Later, we'll see many other functions to create a numpy array (besides the array() func)
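
A quick check of how NumPy functions treat a plain list is sketched below. np.asarray is the standard NumPy function for converting array-like input to an ndarray; the variable names are made up for the example.

```python
import numpy as np

plain_list = [[1, 2], [3, 4], [5, 6]]        # a Python list, not an ndarray

as_array = np.asarray(plain_list)            # explicit conversion to ndarray
total = np.sum(plain_list)                   # a NumPy function called on the list itself
squeezed = np.squeeze([[1], [2], [3]])       # result is an ndarray

print(type(plain_list), type(as_array))
print(as_array.shape)                        # (3, 2)
print(total)                                 # 21
print(type(squeezed), squeezed)

# The list itself still has no ndarray attributes such as .shape
print(hasattr(plain_list, "shape"))          # False
```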

Data types (dtype): data types in NumPy are the same as those in Python, plus a few more. They are represented as int32, float64, etc, or we can specify them in short form as a single char followed by the number of bytes, i.e int8 is represented by "i1", "f4" is a 32 bit float, "?" is boolean ("b" on its own means a signed byte, i.e. int8), "S2" is a byte string with length 2, etc. Instead of S type, we use U (unicode) type string in Python 3. See details for unicode in the regular python section.

We don't have a separate type for each element of an ndarray object, as an ndarray can have elements of only one type. As we saw above, a numpy array object inherits the "type" from the type of the list/tuple. This type becomes the data type of the whole array. It's referred to as attribute "dtype" of the array object.

print(arr.dtype) => property "dtype" prints data type of an array. Here it prints "int64", since data is integer represented with 64 bits (8 bytes)

When declaring array using array() func, we may specify dtype explicitly. Then those array contents are converted to that data type and stored (if possible)

ex: arr=np.array(['2', '72', 'a'], dtype='int64') => Here it errors out since the 3rd entry 'a' can't be converted to int type. '2' and '72' are OK to be converted even though they are strings. If 'a' is replaced by 823, then arr would be [2 72 823], i.e an array with int64 elements and NOT strings.

arr=np.array(['23', 'cde', 71],dtype='S2') => Here we are creating an array of 3 elements with dtype as string of 2 bytes. So, numpy converts 71 (which is without quotes, and so an int) to a string too. However, 'cde' needs 3 bytes, but since we are forcing it to 2 bytes, 'e' is dropped and only 'cd' is stored

print(arr.dtype, arr) => It returns => |S2 [b'23' b'cd' b'71'] => S2 means its dtype is string with 2 byte length. b'23' means string "23" is stored as bytes. Here the array got printed with these b prefixes, which we don't want. To print only the string, we can convert these to utf-8 by using the decode method, i.e arr[1].decode("utf-8") returns "cd" unicode string

ex: arr=np.array(['2', '32', 7],dtype='i4'); print(arr.dtype, arr) => returns => int32 [ 2 32  7] as dtype is int32 and array elements are converted to 4 byte integer, so string '2' and '32' become integer 2 and 32.

array with multiple data types: ex: np.array( ['as', 2, "me", 4.457]  ) => here the 4 elements of the list are of different data types. By default, the dtype here is U=Unicode. This is valid. When I ran it, the type was Unicode with length=5 or U5, since 4.457 has length=5 (the reported width may differ between NumPy versions). So, all elements of this array have that one unicode type, irrespective of whether they started as string or number. Basically all array elements got converted to unicode (or string in a loose sense). Just note that operations like arr[1] + arr[3] will not behave as numeric addition, since the elements are now strings.
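
The dtype behaviour described above can be tried out with a few small arrays. This is a sketch using only string and integer inputs; the variable names are made up, and astype() is used for the conversions as one way of doing them.

```python
import numpy as np

ints = np.array([1, 2, 3], dtype="int64")
print(ints.dtype, ints.itemsize)          # int64 8  (itemsize is in bytes)

# Short codes are a type character followed by the size in bytes
print(np.dtype("i1"), np.dtype("f4"))     # int8 float32

# Numeric strings can be converted to integers
from_text = np.array(["2", "32"]).astype("i4")
print(from_text.dtype, from_text)         # int32 [ 2 32]

# Fixed-width byte strings: anything longer than 2 bytes is truncated
text = np.array(["23", "cde"], dtype="S2")
print(text)                               # [b'23' b'cd']

# Two ways to get rid of the b prefix
print(text[1].decode("utf-8"))            # cd
decoded = text.astype("U2")
print(decoded)                            # ['23' 'cd']
```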

shape: A tuple of integers giving the size of the array along each dimension is known as shape of the array, i.e the shape of an array is the number of elements in each dimension.

print(arr.shape) => returns a tuple whose entries give the number of elements along each dimension. For a 2D array such as [[1 2 3] [4 5 6]], it returns (2,3), meaning the array is 2 dimensional, with 2 rows of 3 elements each, so it's a 2X3 array.

Since shape is a tuple, we can access each element of this tuple via index, i.e shape[0] returns 2 (num of inner arrays), while shape[1] returns 3 (elements in each inner array)

Dimension (ndim): This shows dimension of an array as 1D, 2D and so on. In Numpy, number of dimensions of the array is also called rank of the array (i.e 2D array has rank of 2).

print(arr.ndim) => "ndim" attr returns the number of dimension of an array. since arr has 1 dimension, this returns 1

0D array => a_0D = np.array(2) => This is an array with just one element, i.e there is only 1 value. So, it's not really an array, but a scalar. It shows ndim=0. It shows an empty tuple for shape, i.e a_0D.shape = ( )

1D array => b_1D = np.array( [2] ) => By adding square brackets, we convert the 0D array into a 1D array. It shows ndim=1, and b_1D.shape = (1, ). We might have expected it to show (1,1) since it looks like 1 row and 1 col, but a 1D array has only one axis, so there is no separate notion of rows and columns (if it had both, it would be a 2D array). So the shape tuple has a single entry, the number of elements. This is called a rank 1 array, and because it's neither a row vector nor a col vector (explained below), it can give surprising results (for example, transposing it does nothing). So, avoid these 1D arrays where row or column vectors are intended, as they may not yield the desired results in AI computations. We usually use the reshape function (explained later) to transform it to a 2D array as a row vector.

ex: b_1D = np.array( [2, 3, 5] ) => This shows shape as (3, ) since this has 3 columns.

2D array => c_2D = np.array( [[1, 2, 3], [4, 5, 6]] ) => This is 2D array with 1st row [1 2 3] and 2nd row [4 5 6]. c_2D.ndim=2, c_2D.shape = (2,3) since there are 2 rows and 3 columns. NOTE: there are comma in between elements and in between arrays.

print( c_2D[0]) => prints 1st element of array c_2D which is "[1 2 3]", c_2D[1]=[4 5 6], c_2D[1,2] = 6

arr_2D = np.array( [[1, 2, 3]] )=> This is 2D array which has only 1st row which is a 1D array with 3 elements. So, arr_2D.ndim=2, arr_2D.shape=(1,3). NOTE: this 2D array has 1 row and 3 columns, unlike 1D array which had no rows and just 3 columns.

row vector: These are 2D array of shape (1,n) i.e they have a single row. ex: [ [ 1 2 3 ] ]

column vector: These are 2D array of shape (m,1) i.e they have a single col. ex: [ [1] [2] [3] ]
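
The difference between a rank 1 array and proper row/column vectors shows up as soon as we transpose or multiply them. The sketch below uses reshape with -1, which asks NumPy to work out that dimension from the number of elements; the variable names are made up.

```python
import numpy as np

rank1 = np.array([1, 2, 3])        # shape (3,): neither a row nor a column
row = rank1.reshape(1, -1)         # shape (1, 3): row vector
col = rank1.reshape(-1, 1)         # shape (3, 1): column vector

print(rank1.shape, rank1.T.shape)  # (3,) (3,)  transpose changes nothing
print(row.shape, row.T.shape)      # (1, 3) (3, 1)

inner = row @ col                  # (1,3) times (3,1) gives a 1x1 matrix
outer = col @ row                  # (3,1) times (1,3) gives a 3x3 matrix
print(inner)                       # [[14]]
print(outer.shape)                 # (3, 3)
```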


3D array => d_3D = np.array( [[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]] ) => 3D array can be seen as each row itself being 2D array. d_3D.ndim=3, d_3D.shape = (2,2,3) since it has 2 outermost entries, then each of these 2 entries has 2 array, and each of these 2 have 3 elements each.

Axis of Numpy array: In numpy, number of dimension of array is called as number of axis of an array, i.e 3D is called an array with 3 axis. 1st axis or axis=0 is the outermost array. Then 2nd axis or axis=1 is the next inner array and so on.

For N dim matrix as (N1, N2, ... Nn) => There are N1 data-points of shape (N2, N3 .. Nn) along axis-0. Applying a function across axis-0 means you are performing computation between these N1 data-points.. Each data-point along axis-0 will have N2 data-points of shape (N3, N4 .. Nn). These N2 data-points would be considered along axis-1. Applying a function across axis-1 means you are performing computation between these N2 data-points. N3 data points would be considered along axis-2. Similarly, it goes on. The dimension of the array is reduced as well, since 1 or more axis are gone.

As an ex: For a 2D array, Let's try computing across the 2 axis. ex: data = numpy.array([[1, 2, 3], [4, 5, 6]]);

1. axis=0: adding across 1st axis or axis=0 means adding the rows together, i.e going vertically down each column and summing its entries. The result has one entry per column.

ex: result = data.sum(axis=0); print(result) => prints [1+4 , 2+5, 3+6] = [5 7 9] => This is a 1D array now instead of 2D array.

2. axis=1: adding across 2nd axis or axis=1 means adding the columns together, i.e going horizontally across each row and summing its entries. The result has one entry per row.

ex: result = data.sum(axis=1); print(result) => prints [1+2+3 , 4+5+6] = [6 15]  => this is again a 1D array
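
The same idea extends to 3 axes. As a sketch, the code below sums a (2,2,3) array along each axis in turn and prints the resulting shapes; keepdims=True is an optional argument of sum() that keeps the summed axis with size 1 instead of dropping it.

```python
import numpy as np

d_3D = np.array([[[1, 2, 3], [4, 5, 6]],
                 [[7, 8, 9], [10, 11, 12]]])    # shape (2, 2, 3)

sum0 = d_3D.sum(axis=0)   # adds the 2 outer blocks together, shape (2, 3)
sum1 = d_3D.sum(axis=1)   # adds the 2 rows inside each block, shape (2, 3)
sum2 = d_3D.sum(axis=2)   # adds the 3 entries inside each row, shape (2, 2)

print(sum0)   # [[ 8 10 12] [14 16 18]]
print(sum1)   # [[ 5  7  9] [17 19 21]]
print(sum2)   # [[ 6 15] [24 33]]

kept = d_3D.sum(axis=0, keepdims=True)
print(kept.shape)   # (1, 2, 3)
```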

More ways to generating a new array: there are many functions in numpy to generate a new array with any given shape, and inititalize it with values.

1. arange: arange function returns an ndarray object containing evenly spaced values within a given range (i.e arange = array range). The array is 1D and its size is the number of values that fit in that range.

Syntax: numpy.arange(start, stop, step, dtype) => "stop" is required (final element value is n-1), all others are optional. By default start=0, step=1 and dtype is same type as stop, so if stop is float, then type is float too.

x = np.arange(5) => returns [0 1 2 3 4]. Here range is defined as 0 to 4 with step of 1. data type=integer since 5 is integer.

x = np.arange(10,20,2) => returns [10 12 14 16 18]. It's 1D array with 5 elements in it.

2. linspace: Similar to arange. It returns ndarray object with evenly spaced numbers over a specified interval.

Syntax: numpy.linspace(start, stop, num) => start, stop are required. num=50 by default

np.linspace(2.0, 3.0, num=5) => returns array([2.  , 2.25, 2.5 , 2.75, 3.  ]) => Here 5 samples are included b/w 2 and 3 with equally spaced values.

3. zeros/ones: These are 2 other functions that init an array with zeros or ones.

Syntax: numpy.zeros(shape, dtype) => Returns an array containing all 0 with given shape. dtype is optional and is float by default.

x = np.zeros(2) => returns [0. 0.]. 2 implies 1 dim array with shape of (2,)

y = np.ones((3,2), dtype=int) => returns a 2 dim array of shape (3,2), with type of 1 as integer, so it's 1 and NOT 1.0 or 1. (i.e NOT decimal 1, but integer 1)

[[ 1  1]
 [ 1  1]
 [ 1  1]]
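
One difference worth seeing side by side is that arange excludes the stop value while linspace includes it by default. The sketch below runs the creation functions from this list on small inputs; the variable names are made up.

```python
import numpy as np

a = np.arange(10, 20, 2)       # step given, stop value 20 is excluded
b = np.linspace(10, 20, 6)     # number of points given, stop value 20 is included

print(a)   # [10 12 14 16 18]
print(b)   # [10. 12. 14. 16. 18. 20.]

z = np.zeros((2, 3))           # float by default
o = np.ones((3, 2), dtype=int)

print(z.dtype, z.shape)        # float64 (2, 3)
print(o.dtype.kind, o.shape)   # i (3, 2)
```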

4. random: There is a random module in NumPy to generate random data. It has lots of methods which are very useful in AI and ML for generating random dataset.

from numpy import random => this is not strictly needed. When we import numpy, its random module is available as np.random, but then we have to write np.random everywhere. Python also has its own inbuilt random module, so after a plain "import random", the name "random" refers to Python's module and not NumPy's. We add the line "from numpy import random" so that the name "random" refers to NumPy's random module, since writing random is more convenient than np.random.

Seed: All random numbers are generated for a given seed. The seed provides i/p to the pseudo random num generator to generate random numbers corresponding to that seed. Different seeds cause numpy to generate a different set of random numbers.

np.random.seed(1) => this will generate pseudo random numbers for all random functions using seed=1. We could use any integer number as seed. We don't really need to provide this seed at all, since by default, numpy chooses a random seed and generates random num corresponding to that seed. But then our seq of random numbers generated will be diff for each run of pgm, which will be difficult to debug or reproduce. so, we usually assign a seed, when coding our program the 1st time. Once we have debugged the pgm with couple of seeds, we can get rid of this seed function.
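
The effect of the seed is easy to confirm: the same seed reproduces the same numbers, and a different seed gives different ones. A minimal sketch, with made-up variable names:

```python
import numpy as np

np.random.seed(1)
first = np.random.rand(3)

np.random.seed(1)              # same seed again: the sequence restarts
second = np.random.rand(3)

np.random.seed(2)              # different seed: a different sequence
third = np.random.rand(3)

print(first)
print(second)
print(third)
print(np.array_equal(first, second), np.array_equal(first, third))
```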

randint():

ex: x = random.randint(100) => randint method says to generate integer random number, and arg=100 says the range is from 0 to 100-1 (i.e 0 to 99). Note, we could have written np.random.randint(100) too, but we don't need that since "from numpy import random" imports random into current workspace.

To generate 1D or 2D random numbers, we can specify size.

ex: random.randint(50, size=(3,5)) => generates a 2D array of size=3X5, with each element being a random int from 0 to 49

rand():

ex: random.rand(3) => just "rand()" method returns random float b/w 0 to 1. Number inside it reps the size of array, i.e 3 means it's a 1D array of size 3. i.e random.rand(size=(3)), however we don't write it that way (size=3) with rand method, we directly specify the size, as rand method is different than randint

ex: x = random.rand(3, 5) => returns 2D array with matrix=3X5.

[[0.14252791 0.44691071 0.59274288 0.73873487 0.22082345]

[0.00484242 0.36294206 0.88507594 0.56948479 0.15075563]

[0.69195833 0.75111379 0.92780785 0.57986471 0.6203633 ]]

randn(): returns samples from the standard normal distribution. A normal (gaussian) distribution has mean mu and spread sigma; the standard normal is the special case with mu=0 and sigma=1. So here instead of having equal probability for different numbers, it has a probability distribution that is higher for numbers closer to the mean, and the probability keeps falling as you move away from the mean. About 99.7% of the values lie within 3 sigma of the mean. We provide shape of array as i/p.

ex: x=random.randn(3,4,5) => returns 3D array of shape=(3,4,5) with random floats in it which have mean=0, sigma=1.

To get values corresponding to other mean and sigma, multiply by sigma and add the mean:

ex: Two-by-four array of samples from N(3, 6.25): Here mean=3, sigma=√6.25 = 2.5

3 + 2.5 * np.random.randn(2, 4) => about 68% of numbers will be b/w 3-2.5 and 3+2.5, i.e in b/w 0.5 and 5.5
array([[-4.49401501,  4.00950034, -1.81814867,  7.29718677],   # random
       [ 0.39924804,  4.68456316,  4.99394529,  4.84057254]])  # random

ex: np.random.randn() => returns single random float "2.192...".  Here only float is returned since no array shape specified. 
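
The scaling of randn samples can be checked numerically. The sketch below draws many samples (the count 100000 and the seed are arbitrary choices for this illustration) and compares the measured mean, sigma and 1-sigma fraction against the expected values.

```python
import numpy as np

np.random.seed(1)                      # any seed works, fixed only for repeatability
mu, sigma = 3, 2.5
samples = mu + sigma * np.random.randn(100000)

print(samples.mean())                  # close to 3
print(samples.std())                   # close to 2.5

# fraction of samples within 1 sigma of the mean, roughly 0.68
within = np.mean(np.abs(samples - mu) < sigma)
print(within)
```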


Operations on Array:

array slicing => array[start:end:step] => The slice runs from start up to, but not including, end. If omitted, start=0, step=1, end=length of array. For nD array, we can slice each axis of the array.

IMP: The end index is exclusive, i.e the slice stops at end-1. So, arr[2:4] will have arr[2], arr[3], but not arr[4]. This is the same behaviour that we saw with lists/tuples in python. One other thing to note is that slicing across multiple dimensions at once is allowed on numpy arrays, which is not possible on python lists/tuples. This is where numpy turns out to be much more powerful in terms of operations being done on arrays. Also, in numpy, we access an element of an array via arr[2,3,0], while the same element in a nested list/tuple is accessed via arr[2][3][0] (i.e commas are used in numpy). The python list syntax works for numpy too, i.e arr[2][3][0] is equally valid in numpy, but it creates an intermediate array at each step, so we don't usually access numpy arrays that way.

np_arr = np.array([[300, 200,100,700,212], [600, 500, 400,900,516], [21, 23,45,67,45]])

ex: np_arr[0] => returns entry of index=0, which is itself an array, so returns all of that array => [300 200 100 700 212]

ex: np_arr[1,2] => returns 400 (since it's index=1 for outer array and index=2 for inner array). So, for multi dim array, we specify indices separated by commas. np_arr[1][2] also works, though as explained above, that's not the preferred way.

ex: np_arr[0:2:1] = Here we provided the outermost array index (since there are no commas for inner indices). The range is from 0 to 1 with increment of 1. So, this returns as below:

[[300 200 100 700 212]
 [600 500 400 900 516]]

ex: arr_3D[1,0,1:3] = [8 9] => Here arr_3D is a 3D array whose index=1 entry for axis=0 is [[7 8 9] [10 11 12]]. We then take index=0 for axis=1, which is [7 8 9], and then take slice 1:3 of this final one, which is [8 9]. NOTE: each integer index removes one dimension, so the result is a 1D array, not a 3D one. With 1:2 instead of 1:3, only [8] would be returned, since the end index is excluded.

x=np_arr[0:2,3:1:-1]  => Here we provide index range for both dimension of array. axis=0 goes from 0 to 1 (since range 0:2 implies 0 and 1), while axis=1 goes from index 3 to 2 in reverse direction (if we do 1:3:-1, this would return empty array, since 1:3 index can never be achieved by going in reverse dir. This is important to remember). NOTE: array entries are now reversed, i.e the array x gets assigned the values as [700 100] instead of [100 700] as in original array. Array "x" still remains a 2D array.

x =

[[700 100]
 [900 400]]

ex: arr = [1 2 3 4 5]; arr[::2] = [1 3 5] => returns every other element (since start and end are not specified, start=0 and end=length of array).

ex: arr = np.array([[[1, 2], [3, 4]], [[5, 6],[2,3]]])

ex: print(arr[:]) => prints all elements of array since no start/end specified. All 3D elements printed. Same as what would have been printed with print(arr)

ex: print(arr[0,:]) => This prints all elements of index=0 for axis=0. The same o/p is printed with arr[0][:] (i.e list/tuple format in python) prints [[1 2]   [3 4]]. NOTE: it's 2D array now.

print(arr[:,0]) => prints [[1 2] [5 6]]. This says that for axis=0, slice everything since no range specified, so the whole array is returned. Then 0 says that for axis=1 return index=0. Array for axis=1 is [ [1 2] [3 4] ] and [ [5 6] [2 3] ]. index=0 is [1 2] from 1st one and [5 6] from 2nd one.
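
The slicing examples above can be run together as below. This is a sketch that rebuilds the same np_arr and arr values as numpy arrays and prints a few of the slices discussed (tolist() is used only to print nested results on one line).

```python
import numpy as np

np_arr = np.array([[300, 200, 100, 700, 212],
                   [600, 500, 400, 900, 516],
                   [21, 23, 45, 67, 45]])

print(np_arr[1, 2])                        # 400
print(np_arr[0:2, 3:1:-1].tolist())        # [[700, 100], [900, 400]]
print(np_arr[0:2, 1:3:-1].shape)           # (2, 0): empty, 1:3 can't be walked in reverse

arr = np.array([[[1, 2], [3, 4]], [[5, 6], [2, 3]]])
print(arr[0, :].tolist())                  # [[1, 2], [3, 4]]
print(arr[:, 0].tolist())                  # [[1, 2], [5, 6]]
```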

reshape: Reshaping means changing the shape of an array. reshape(m,n) changes an array into m arrays with n elements each (i.e turns the array into a 2D array), provided the total number of elements stays the same (m*n must equal the size of the array), else it raises an error. Similarly reshape(p,q,r) changes an array into a 3D array with p arrays that each contain q arrays, each with r elements.

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

newarr = arr.reshape(4, 3) => this changes the above 1D array into 2D array with 4 arrays and each having 3 elements. So, newarr.ndim=2, newarr.shape=(4,3) since it has 4 arrays with 3 elements in each.

[[ 1  2  3]
 [ 4  5  6]
 [ 7  8  9]
 [10 11 12]]


ex: n_arr = arr.reshape((1, arr.shape[0])) => Since arr is 1D array, shape=(12, ) i.e 12 followed by blank. To convert it into 2D array, we use reshape method as shown. Since shape[0] returns 12, this becomes n_arr=arr.reshape(1,12). This becomes 2D array with 1 row and 12 elements in it. So arr=[1 2 .. 12] while n_arr = [ [ 1 2 ... 12] ]. NOTE: 2 square brackets in n_arr, as compared to single brackets in arr (since it's a 2D array now). n_arr.ndim=2, n_arr.shape=(1,12)

assert (n_arr.shape == (1,12)) => This asserts or checks for the condition that shape of array n_arr is (1,12). This is helpful to figure out bugs in code, since if the shape is not as expected, this will throw an AssertionError.

newarr.reshape(12) => When only 1 integer provided, then result is 1D array of that length. So, this returns [1 2 ... 12]. We can also provide -1 as the length of array to get same result.

newarr.reshape(-1) => flattens the array, i.e converts any array into 1D array. So, this returns [1 2 ... 12]. However, if we provide other integers for new shape along with -1 as last integer, then array is converted into required shape, with other values inferred.


newarr.reshape(d_3D.shape[0],-1) => Here 1st value is 2 (from above example). So tuple is (2,-1) meaning it's 2X6 (since 6 is inferred automatically. -1 implies flatten other dimension, so 6 is the only other value). result is: 

[[ 1  2  3  4  5  6]
 [ 7  8  9 10 11 12]]

Other way to flatten an array is by using func ravel() or method ravel. It's same as reshape(-1).

ex: np.ravel(newarr) => converts newarr array into flattened 1D array. We could also apply method ravel on newarr as newarr.ravel()
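
The reshape variants above can be compared side by side. A minimal sketch, using np.arange to build the same 1 to 12 array (two_rows and flat are names made up for this illustration):

```python
import numpy as np

arr = np.arange(1, 13)             # [1 2 ... 12], shape (12,)
newarr = arr.reshape(4, 3)         # 2D, shape (4, 3)
n_arr = arr.reshape(1, -1)         # 2D with one row, shape (1, 12)
assert n_arr.shape == (1, 12)

two_rows = newarr.reshape(2, -1)   # -1 is inferred as 6
flat = newarr.ravel()              # back to 1D, same values as newarr.reshape(-1)

print(newarr.ndim, n_arr.ndim, two_rows.shape, flat.shape)   # 2 2 (2, 6) (12,)
```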

We usually want a 2D array with one row, instead of a 1D array. It's easier to work with 2D arrays, since the shape is explicit, i.e (1,12) instead of (12,). NOTE: They hold the same values, except that there are 2 square brackets in the 2D array, while only 1 square bracket in the 1D array.

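new_arr = arr.reshape(arr.shape[0], -1).T => Here we convert an array of shape (m,n,p,q) into array of shape (n*p*q, m), i.e we convert 4D array into 2D array where each of the m outer arrays is flattened into one column. NOTE: reshaping directly via arr.reshape(arr.shape[1]*arr.shape[2]*arr.shape[3], arr.shape[0]) gives the same shape, but mixes elements of the m outer arrays within each column, which is usually not what we want.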
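
A sketch of flattening a 4D array of shape (m,n,p,q) into shape (n*p*q, m). It contrasts reshaping to (m, -1) followed by a transpose with a direct reshape to (n*p*q, m). Both give the same shape, but only the first keeps each outer array together in one column. The sizes and the names good/bad are arbitrary choices for this illustration.

```python
import numpy as np

m, n, p, q = 2, 3, 4, 5
arr = np.arange(m * n * p * q).reshape(m, n, p, q)

good = arr.reshape(arr.shape[0], -1).T   # column i is arr[i] flattened
bad = arr.reshape(n * p * q, m)          # same shape, elements of different arr[i] mixed

print(good.shape, bad.shape)                         # (60, 2) (60, 2)
print(np.array_equal(good[:, 0], arr[0].ravel()))    # True
print(np.array_equal(bad[:, 0], arr[0].ravel()))     # False
```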

squeeze: this func removes axes of length one from the shape of the given array. This is used in the opposite scenario, e.g. where a 2D array with one row is converted to a 1D array. By default, all axes of length=1 are squeezed. We can also specify the axis to be squeezed, which must be of length=1, else an error is raised.

ex: y=np.squeeze(x) => x is 3D array with shape (1,3,3) while y now becomes 2D array with shape (3,3), i.e axis0 (the only axis of length 1) is squeezed

X =
[[[0 1 2] [3 4 5] [6 7 8]]]
Y =
[[0 1 2] [3 4 5] [6 7 8]]
The shapes of X and Y array: (1, 3, 3) (3, 3)

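
A short sketch of squeeze with and without the axis argument. The array w, with two axes of length 1, is an extra example added here to show the difference.

```python
import numpy as np

x = np.arange(9).reshape(1, 3, 3)
y = np.squeeze(x)               # all length-1 axes removed
z = np.squeeze(x, axis=0)       # only axis 0 removed (it must have length 1)

print(x.shape, y.shape, z.shape)      # (1, 3, 3) (3, 3) (3, 3)

w = np.zeros((1, 3, 1))
print(np.squeeze(w).shape)            # (3,)
print(np.squeeze(w, axis=2).shape)    # (1, 3)
```
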
r_: This is a simple way to build up arrays quickly. Translates slice objects to concatenation along the first axis.
ex:
np.r_[np.array([1,2,3]), 0, 0, np.array([4,5,6])] => returns array([1, 2, 3, 0, 0, 4, 5, 6]) => This ex concatenates 1D array then 0, 0, then another 1D array. It concatenates along axis=0, still returns 1D array.
ex:
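np.r_['1,2,0', [1,2,3], [4,5,6]] => the numbers within '...' before the arrays specify how to concatenate. Here the 1st number (1) specifies concat along axis=1 (2nd axis), the 2nd number (2) forces each entry to be at least 2D, and the 3rd number (0) places the original 1D data along axis=0 (so each list becomes a 3X1 column). Result is array([[1, 4], [2, 5], [3, 6]])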
 
c_: Translates slice objects to concatenation along the second axis.
np.c_[np.array([[1,2,3]]), 0, 0, np.array([[4,5,6]])] => returns array([[1, 2, 3, 0, 0, 4, 5, 6]]) => This ex concatenates 2D array along axis=1 (2nd axis).
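
A sketch of r_ and c_ in action. The slice example (1:4) and the 1D column-stacking example are additions for illustration, beyond the ones shown above.

```python
import numpy as np

a = np.r_[np.array([1, 2, 3]), 0, 0, np.array([4, 5, 6])]
print(a)           # [1 2 3 0 0 4 5 6]

b = np.r_[1:4, 10, 11]     # the slice 1:4 expands to 1, 2, 3
print(b)           # [ 1  2  3 10 11]

c = np.c_[np.array([1, 2, 3]), np.array([4, 5, 6])]   # 1D arrays become columns
print(c.shape)     # (3, 2)
```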


Matrix Operations:

matrix transpose: This is one of the useful functions to find the transpose of a matrix. Transpose of a 2D array is easy to see: its rows and columns are swapped, so rows become columns and columns become rows (i.e 3X4 matrix becomes 4X3 matrix, w/o any change to the contents). You can transpose any n Dim matrix too, and specify how to reorder the axes. By default for an n Dim matrix, the order of axes is reversed, i.e 2X3X4 matrix becomes 4X3X2 matrix.

ex: np.transpose(newarr) => changes newarr from 4X3 to 3X4 array

Instead of using the function, we can also use the T attribute to transpose.

ex: y=newarr.T => Here we read the "T" attribute (T is the name for transpose) of the newarr object. Result is same as transpose function above.
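
A quick sketch of both ways to transpose, plus the n Dim case with the default (reversed) and an explicit axis order. The array named cube is made up for this illustration.

```python
import numpy as np

newarr = np.arange(1, 13).reshape(4, 3)
print(np.transpose(newarr).shape)     # (3, 4)
print(newarr.T.shape)                 # (3, 4)

cube = np.zeros((2, 3, 4))
print(cube.T.shape)                           # (4, 3, 2), axes reversed by default
print(np.transpose(cube, (1, 0, 2)).shape)    # (3, 2, 4), explicit axis order
```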

matrix dot operation: To find the dot product of 2 matrices, we use the dot function. NOTE: dot operation is different than the multiplication operation. Mult just multiplies each element of 1 array with the corresponding element of the other array, while dot operation multiplies rows of the 1st array with columns of the 2nd and adds up the products. You can find more details of dot operation on matrix in "high school maths" section. For 2-D arrays, it is equivalent to matrix multiplication. For 1-D arrays, it is the inner product of the vectors. For N-dimensional arrays, it is a sum product over the last axis of a and the second-last axis of b. The dimensions of the two matrices have to be compatible for the dot operation (for 2D, number of columns of a = number of rows of b), else we'll get an error. Instead of using the dot function, we can write a for loop, iterate over each element of the 2 arrays and sum them appr. However, this for loop takes a long time to run, as each iteration goes through the Python interpreter and can't use parallel instructions such as SIMD (single inst multiple data). NumPy's dot function runs in compiled code, typically an optimized BLAS library that uses these SIMD inst and multiple cores, which significantly speeds up the multiplication/addition part. (NumPy itself doesn't run on GPU; GPU array libraries offer similar dot functions.) Using dot operation instead of loops is called vectorization, and in AI related courses we'll always hear this term, where we'll be asked to vectorize our code (meaning put it in array form and then use dot functions to do the multiplication).

a = np.array([[1,2],[3,4]]) => 2D array of 2x2
b = np.array([[11,12],[13,14]]) => 2D array of 2x2
np.dot(a,b)
This produces below 2D array of 2x2 which is calculated as follows =>
[[1*11+2*13, 1*12+2*14],[3*11+4*13, 3*12+4*14]]
[[37  40] 
 [85  92]] 
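
To make the loop vs dot comparison concrete, here is a sketch with a hand-written triple loop (dot_loop is a helper written only for this illustration, not a numpy function) checked against np.dot, along with the element-wise product for contrast.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[11, 12], [13, 14]])

def dot_loop(a, b):
    """Matrix product of two 2D arrays with explicit loops (slow, for comparison only)."""
    rows, inner, cols = a.shape[0], a.shape[1], b.shape[1]
    out = np.zeros((rows, cols), dtype=a.dtype)
    for i in range(rows):
        for j in range(cols):
            for k in range(inner):
                out[i, j] += a[i, k] * b[k, j]
    return out

print(np.dot(a, b).tolist())     # [[37, 40], [85, 92]]
print((a * b).tolist())          # element-wise product: [[11, 24], [39, 56]]
print(np.array_equal(dot_loop(a, b), np.dot(a, b)))   # True
```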

matrix add/sub/mult/div operations: All other matrix operations such as add, divide, multiply, abs, log, etc can be done by using specific numpy functions, similar to the dot function shown above, instead of using a for loop. These operate element by element.

ex: c=np.add(a,b) => adds 2 matrices a and b. Each element of matrix a is added to the corresponding element of matrix b. Similarly for np.subtract(a,b)

ex: c=np.divide(a,b) => divides 2 matrices a and b. Each element of matrix a is divided by the corresponding element of matrix b. Similarly for np.multiply(a,b)

Other misc operations: Many other operations defined working on single array.

log: ex: c=np.log(a) => computes natural log of each element of array "a"

abs: ex: c=np.abs(a) => computes absolute value of each element of array "a"

sum: There is another function "sum" (NOT add) which adds up the elements along the rows or columns of an array. For a 2D array, it returns a 1D array.

ex: A = [ [300, 200,100], [600, 500, 400] ]
C=np.sum(A,axis=0) => adds each col (since axis=0) and returns 1D array with shape=(3,). result=[900 700 500]

C=np.sum(A,axis=1) => adds each row (since axis=1) and returns 1D array with shape=(2,). result=[600 1500]

C=np.sum(A) => adds all rows and cols (since no axis specified, it adds across all axis) and returns a scalar 2100.
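
The three sum variants can be checked as below. The keepdims argument on the last line is an extra not discussed above; it keeps the result 2D, which is handy when the shape matters.

```python
import numpy as np

A = np.array([[300, 200, 100], [600, 500, 400]])
print(np.sum(A, axis=0))     # [900 700 500]
print(np.sum(A, axis=1))     # [ 600 1500]
print(np.sum(A))             # 2100
print(np.sum(A, axis=1, keepdims=True).shape)   # (2, 1)
```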

Broadcasting: Array broadcasting is a concept in NumPy, where we can perform element-wise operations even when the shapes of the matrices are not identical. NumPy behaves as if it expands the required rows or columns by duplicating them (no copies are actually made). Certain rules apply as follows:

Rule 1. matrix of dim=mXn operated with matrix of dim=1Xn (1 row only) or with matrix of dim=mX1 (1 col only) => operations are +, -, *, /. The matrix 1Xn or mX1 are converted into matrix mXn by duplicating rows or col, and then operation is performed.

ex: A = [ [200, 100] , [300, 400] ] , B = [ [1, 2] ] => Here A is 2X2 matrix, while B is 1X2 matrix.

C = np.add(A,B) => Here, B is broadcast to a 2X2 matrix by duplicating its 1st row. So, result is C = [[201 102]  [301 402]]. NOTE: we use np.add here, not np.sum, since np.sum takes an axis as its 2nd argument.

Rule 2: matrix of dim 1Xn or of dim mX1 => We can do operations of +, -, *, / on these matrix with a real number. The real number will be converted into 1Xn or mX1 matrix and then operation performed.

ex: B = [ [1, 2, 3] ] => This is 1X3 matrix. If we add real number 2 to this matrix, then the number 2 is converted to [ [ 2, 2, 2 ] ] and then addition is performed.

C=np.add(B,2) => [ [1, 2, 3] ]  + 2 = [ [3 4 5] ]
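
Both broadcasting rules can be tried out as below. The sketch also spells out the duplication explicitly with np.tile to show that the result matches; the names col and B_full are made up for this illustration.

```python
import numpy as np

A = np.array([[200, 100], [300, 400]])   # 2X2
B = np.array([[1, 2]])                   # 1X2
col = np.array([[10], [20]])             # 2X1

print(np.add(A, B).tolist())     # [[201, 102], [301, 402]]
print((A + col).tolist())        # [[210, 110], [320, 420]]
print((B + 2).tolist())          # [[3, 4]]

# The same duplication that broadcasting implies, written out explicitly:
B_full = np.tile(B, (2, 1))      # [[1, 2], [1, 2]]
print(np.array_equal(A + B, A + B_full))   # True
```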

Other operations on array: iteration over elements of array, join, split, search, sort, etc are miscellaneous functions provided to work on arrays.

 

HDF5 => HDF5 stands for Hierarchical Data Format version 5. It's also called h5 in short. The h5py package is a Pythonic interface to this HDF5 binary data format.

It is an open-source file format which comes in handy to store large amounts of data. As the name suggests, it stores data in a hierarchical structure within a single file. So if we want to quickly access a particular part of the file rather than the whole file, we can easily do that using HDF5. This functionality is not seen in normal text files.

HDF5 files are commonly used in AI projects, since they can store TBs of data, and can easily be sliced as if they were NumPy arrays.

We'll need to install the h5py module in Python. To use h5py, numpy also needs to be imported. Look in numpy section for its installation.

Installation:

CentOS: We install it using pip.

sudo python3.6 -m pip install h5py => installs HDF5 for python 3.6

HDF5 Format:

Very good tutorial on HDF5 is on this link: https://twiki.cern.ch/twiki/pub/Sandbox/JaredDavidLittleSandbox/PythonandHDF5.pdf

or from local link HDF5

HDF5 includes only two basic structures: a multidimensional array of record structures, and a grouping structure. H5py uses straightforward NumPy array and python dictionary syntax. For example, you can iterate over datasets in HDF5 file, or check out the .shape or .dtype attributes of datasets.

HDF5 files are organized in a hierarchical structure, with two primary structures: groups and datasets.

  • HDF5 group: a grouping structure containing instances of zero or more groups or datasets, together with supporting metadata. A group has two parts:
    • A group header, which contains a group name and a list of group attributes.
    • A group symbol table, which is a list of the HDF5 objects that belong to the group.
  • HDF5 dataset: a multidimensional array of data elements, together with supporting metadata. A dataset is stored in a file in two parts: a header and a data array.
    • dataset header: contains information that is needed to interpret the array portion of the dataset, as well as metadata (or pointers to metadata) that describes or annotates the dataset. Header information includes the name of the object, its dimensionality, its number-type, information about how the data itself is stored on disk, and other information used by the library to speed up access to the dataset or maintain the file's integrity.
    • data array: Data array is where actual data is stored.
Ex of HDF5 file: trefer1.h5

HDF5 "trefer1.h5" {
GROUP "/" {
   DATASET "Dataset3" {
      DATATYPE { H5T_REFERENCE }
      DATASPACE { SIMPLE ( 4 ) / ( 4 ) }
      DATA {
         DATASET 0:1696, DATASET 0:2152, GROUP 0:1320, DATATYPE 0:2268
      }
   }
   GROUP "Group1" {
      DATASET "Dataset1" {
         DATATYPE { H5T_STD_U32LE }
         DATASPACE { SIMPLE ( 4 ) / ( 4 ) }
         DATA {
            0, 3, 6, 9
         }
      }
      DATASET "Dataset2" {
         DATATYPE { H5T_STD_U8LE }
         DATASPACE { SIMPLE ( 4 ) / ( 4 ) }
         DATA {
            0, 0, 0, 0
         }
      }
      DATATYPE "Datatype1" {
         H5T_STD_I32BE "a";
         H5T_STD_I32BE "b";
         H5T_IEEE_F32BE "c";
      }
   }
}
}

Usage:

Most of the time when doing an AI project, we won't be doing anything more than reading or writing HDF5. Let's look at these 2 operations.

ex: reading an h5 file

import numpy as np

import h5py
test_dataset = h5py.File('dir1/test.h5', "r") #opens the file in read mode
test_set_x = np.array(test_dataset["test_x"][:]) # get all of array from beginning to end
 
ex: writing an h5 file

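f = h5py.File("testfile.hdf5", "w") #opens the file in write mode (creates it, or overwrites an existing one)

arr = np.ones((5,2))

f["my dataset"] = arr #this stores the 5X2 array into file testfile.hdf5

f.close() #flushes the data to disk and closes the file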
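
Putting the write and read steps together, here is a minimal round-trip sketch. It assumes h5py is installed as described above; the temp directory, the group name "group1" and the dataset name "values" are made up for this illustration.

```python
import os
import tempfile

import h5py
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "testfile.hdf5")   # throwaway location

# Write: "w" creates the file (overwriting any existing one)
with h5py.File(path, "w") as f:
    f["my dataset"] = np.ones((5, 2))                  # dataset at the root group
    grp = f.create_group("group1")                     # groups behave like dictionaries
    grp.create_dataset("values", data=np.arange(10))

# Read: datasets can be sliced like NumPy arrays without loading the whole file
with h5py.File(path, "r") as f:
    names = sorted(f.keys())             # names of objects in the root group
    shape = f["my dataset"].shape        # (5, 2)
    part = f["group1/values"][2:5]       # only these 3 elements are read

print(names, shape, part)
```

Using "with" closes the file automatically, so no explicit f.close() is needed.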

 

SAT = Scholastic Aptitude Test

SAT is a standardized test used by most colleges in USA for undergraduate admission in any department. If you want to apply to any college in USA, your chances of getting accepted are greatly improved if you have a high score in SAT. However, SAT is just one component. Your GPA in school and recommendations from your high school teachers carry a lot more value than SAT scores, especially for high ranked colleges. More info here on wiki: https://en.wikipedia.org/wiki/SAT

Your kid will be taking the SAT exam during high school, if they want to attend a college afterwards. Some colleges don't require SAT at all, while most colleges accept either ACT or SAT exam (both are of similar difficulty).

SAT is a 3 hour long exam, and has four sections: Reading, Writing and Language, Math (no calculator), and Math (calculator allowed). The optional essay writing 5th section is not really required. The total score possible is 1600 (400 from Reading, 400 from Writing and Language, and 800 from Maths)

1. Reading: It has one section with 52 multiple-choice questions and a time limit of 65 minutes. There are 5 passages to read, and then answer 10-11 questions related to the passage. The passages are from 4 different fields, and do not require any prior knowledge, except the ability to read and infer correctly.

2. Writing and Language: It has one section with 44 multiple-choice questions and a time limit of 35 minutes. Not sure how many passages are supposed to be there, but I've seen 4 passages with 11 questions in each. The passages here are similar to the ones in Reading section, but they focus more on the writing side, i.e suggesting corrections, fixing punctuation, improving sentence structure, etc.

3. Maths: Maths portion is divided in 2 sections. It has total 58 questions (45 are multiple choice, while 13 require you to write an answer) and a time limit of 80 minutes.

  • The Math Test – No Calculator section has 20 questions (15 multiple choice and 5 grid-in) and lasts 25 minutes.
  • The Math Test – Calculator section has 38 questions (30 multiple choice and 8 grid-in) and lasts 55 minutes.

There are many sample papers in link below in Resources. One such sample paper is here: SAT_practise.pdf

Percentile performance:

Depending on the score, you will get a percentile score, and that decides how well you performed. The avg score for SAT is 1060 out of 1600. A score of over 1500 out of 1600 is considered very good, and will place you in the 98th to 99th percentile (i.e top 1-2%) of the kids who took the test nationally. The wiki link shows what your percentile scores are for different scores. Just over 2.2M students from the class of 2019 took the SAT. That means high school students who graduated in 2019 took the SAT test at some point in 2017, 2018 or 2019, and the total number was 2.2M. Most students take the SAT test in 11th grade (since 12th grade gets too busy with applying for colleges). Since the number of high school kids is about 18M as seen in section "USA basic facts", that means there are about 4M kids graduating every year from high school. So, out of these, more than half end up taking the SAT exam, and these are the students who are serious about going to college. A significant portion of students who take the SAT exam apply for colleges. Even then, note that only 30% of US workforce is comprised of people with a 4 year college degree, so even though about 50% of the students take the exam, only about half of these kids end up completing the 4 year college and getting a degree; the rest end up dropping out from college.

Resource:

1. college board: SAT is wholly owned by collegeboard.org, which is a non profit. It has a lot of free resources and sample papers to help you practice for the exam.

https://collegereadiness.collegeboard.org/sat/practice/full-length-practice-tests

2. Khan Academy: This is a wonderful resource for SAT, and I don't see a reason as to why you will ever want to get paid services for SAT preparation. In fact, starting 2015, college board has partnered with Khan Academy to provide free SAT preparation. Here's a link to get started:

https://www.khanacademy.org/mission/sat

3. mometrix: A guy named George sent me an email with the link to the website www.mometrix.com, and I really liked the free sample papers available on this website. Thanks George for the contribution !!  I've included the link below. It has free sample papers for all 3 subjects.

https://www.mometrix.com/academy/sat-practice-test/

I'll keep adding more links as I find them .....