Bash: Bash is the most popular shell and is the one enabled by default on most Linux distros.

Bash documentation is on the GNU website (www.gnu.org), as well as on the TLDP website (www.tldp.org)

A few good bash docs from these websites:

https://www.gnu.org/software/bash/manual/bash.pdf

http://tldp.org/LDP/Bash-Beginners-Guide/Bash-Beginners-Guide.pdf

http://tldp.org/LDP/abs/abs-guide.pdf

https://www.shellscript.sh/

Bash startup files: When bash starts as a login shell, it reads a few startup files before bringing up the shell prompt: first /etc/profile, then the first readable one of ~/.bash_profile, ~/.bash_login and ~/.profile. When bash exits the login shell, it reads ~/.bash_logout before exiting. When bash starts as a non-login shell (i.e. we invoke "bash" in a terminal after the GUI has come up), it reads and executes ~/.bashrc (if it exists).
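To check which case you're in, bash keeps a read-only shopt option called login_shell; a minimal sketch:

shopt -q login_shell && echo "login shell" || echo "non-login shell" => tells you whether ~/.bash_profile (login) or only ~/.bashrc (non-login) was read at startup.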

Simple bash script example: create a file test.sh (1st line of script specifies what interpreter to use, 2nd line is regular cmd)

#!/bin/bash

echo "hello";

Run this script by doing "chmod 755 test.sh", and then typing ./test.sh. Running ./test.sh is similar to doing "bash test.sh", as the bash interpreter is picked by default based on the 1st line of the script. Just typing "test.sh" on the cmd line won't work, as then bash wouldn't know where test.sh is, so it will start looking in its std paths (which are listed in the env var PATH), and if it doesn't find it there, it will complain that the cmd was not found. However, if we provide the full path, then ./ is not needed, i.e. typing "/home/ashish/scripts/test.sh" on the cmd line would work, as the shell doesn't need to figure out the path. "./" is needed to tell the shell to run the script in the current dir.

bash itself has many options that can be provided on the cmdline to control its behaviour.

ex: bash --dump-strings --rcfile file1.rc --version --verbose -ir -D -o option => options may be single char or multi char. Some options need -, while others need --. Options with -- need to be provided before options with -, else the -- options won't be recognized.

Bash syntax:

Characters (alphanumeric and special chars on the keyboard) are used to form words, and these words form the syntax of any language. Usually some reserved words are formed using alphabetic chars (a-z, A-Z); these are called reserved keywords (such as for, if, etc). Other words using alphabetic or alphanumeric chars (a-z, A-Z, 0-9) are used to form variables. The other special characters (;, +, &, etc) remaining on the keyboard are used by the language to do special tasks. In most languages, space is used to separate out different words (i.e. break a line into tokens). These reserved keywords or cmds (i.e. if, for, ls, echo, etc), variables (strings, etc) and special characters (expressions, etc) are the building blocks of any programming language. Bash is no different.

We'll look at reserved keywords/cmds, variables and special chars. On a single line, different tokens are separated by whitespace (tab or space). Special characters such as ;, &, #, etc are identified as tokens even if there's no whitespace around them. However, it's always safe to have spaces around special chars too. Each line in bash is a cmd followed by args for that cmd. Separate cmds are separated by newline (enter key), semicolon (;) or control operators (explained later).

A. Commands:

In bash, keywords are basically cmds (simple or compound). These cmds may be of 2 types:

1. simple cmd: A simple cmd is just a sequence of words separated by blanks, and terminated by one of the control operators explained below. The 1st word is the actual cmd, with the rest of the words being the cmd's args. Any of the control operators or a newline (enter key) ends the cmd. Simple cmds are the unix cmds that are explained in another section. These cmds are in reality programs that were written in the unix world by various users to help carry out basic tasks. However, their use became widespread, so a lot of these cmds became std with standardized options, and are supported by all linux distros. Simple cmds themselves are of 2 types, depending on whether they are part of the shell, or are external pgms being called:

A. shell built-in cmds: These are not passed to the kernel, but are rather interpreted by the shell itself. So, these are fast. ex: cd, exec, pushd, bg, fg, etc. Different shells have different built-in cmds.

 Bash in built cmds: http://www.gnu.org/software/bash/manual/html_node/Bash-Builtins.html

Any bash builtin cmd takes options preceded by -, and also accepts -- to signify the end of options for that cmd. These shell builtin cmds are explained later. A few ex:

  • alias: alias allows a string to be substituted for a word when it is used as the first word of a simple command. Aliases are usually put in initialization files like ~/.bashrc, so that the shell sees all the aliases (instead of typing them each time or sourcing the alias file). One caveat to note is that aliases are not inherited by child processes, so any bash script that you run will not see these aliases unless you redefine them in your script (see the sketch after the example below).
    • ex: alias e='emacs' => Now we can type "e" on cmd line and it will get replaced by emacs and so emacs will open.
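A quick sketch to see the "not inherited" caveat for yourself (ll is just a hypothetical alias name):

alias ll='ls -l'
ll => works in the current interactive shell
bash -c 'll' => fails with "ll: command not found", as the child bash never saw the alias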

B. external cmds or programs: These external cmds are not built in. Their binary pgm needs to be called and loaded by the kernel, just like any other pgm that we run. Here, the shell passes control to the kernel, and once the pgm is done, the kernel transfers control back to the shell. ex: grep, find, etc. Some cmds that could easily be external, i.e. echo, are actually built-in for efficiency reasons (as echo gets called very frequently).

Remember, any cmd you type on a shell (i.e. ls *, pwd, grep "my" file.txt, etc) is interpreted by that shell. So, all syntax rules that apply to shells in general also apply to shell scripts. We can write shell cmds at the prompt (known as an interactive shell session), or we can type them in a file and run that file (known as shell scripting). Depending on which shell you are typing cmds on, they will generate slightly different outputs. Unix cmds are nothing but programs written to support some functionality. These unix cmds may be built into the shell, so that the shell does not need to call a separate program for that unix cmd; that speeds up execution of the cmd. Any linux cmd that is not built into the shell will be called just like any other program (like calling the pgm "emacs"). To keep the shell interface consistent, a lot of UNIX cmds that have been in use for a long time (i.e. ls, grep, pwd, etc) are defined by the GNU standard and supported by all shells. They have a defined consistent syntax, which is honored by all shells. That makes it easier to learn these unix cmds once, and then use them anywhere (since the syntax remains the same). We'll talk about unix cmds in a separate section.

2. complex/compound cmd: These cmds are shell pgm constructs. Apart from the above simple cmds, shells support a lot of other keywords, similar to programming language statements such as if, for, etc. These keywords are called complex or compound cmds. These make shells more powerful, as simple unix cmds can be made conditional, and complicated scripts are possible. We'll look at the syntax of these cmds later.

B. Variables:

Apart from these simple/compound cmds, we can have variables, which are combinations of alphanumeric characters and some other special characters such as _ (i.e. myname is set as "jk_679"). However, which characters can be used as part of a variable name depends on shell syntax. Variables are basically names of memory locations which store a value, which may be any combination of chars. In bash, variable names may only contain letters, numbers and _. Vars which are not assigned any value have an undef or null value (bash doesn't differentiate b/w the 2). Vars are assigned using the "=" sign. There are 2 kinds of var: global and local.

1. local var: var only available in the current shell. Can contain letters, numbers and _, but can't start with a number. Usually given a lowercase name. The "set" cmd prints all vars (local and global). NOTE: no space can appear on either side of the = sign, as then bash will treat the LHS as a keyword/cmd by itself, and the = and RHS as args of that cmd. If no such keyword/cmd exists, it will error out with "cmd not found". This problem exists b/c no special cmd is used for assignment (unlike "set" in csh).

ex: libs="123_2"; => NOTE: no space around the = sign. This whole line is treated as a single cmd with no args, as there is no space in this line. The semicolon signals the end of the cmd.

ex: me_12=a/b/c?*; => all of these special chars are treated as literals and are part of the RHS, as the parser is looking for a space or ; to mark the end of the RHS assignment char stream. Single/double quoting may be preferred here to remove any ambiguity (i.e. me_12='a/b/c?*'; explained later). However, me_12 itself can't contain any of these special chars (i.e. my_?12 is invalid, as then the parser treats = as a literal, and the whole thing is seen as 1 big cmd; as soon as the parser sees ?, it doesn't consider my_?12 a var anymore).

export: Vars created above are local, as any child process or subshell won't be able to see them. To make a var global (accessible to other subshells or child processes), we use the "export" builtin cmd. Local vars created within a script are local to that script only, while exported vars can be accessed by other scripts also run from within the same shell.

export var1="12a/c" => Now, var1 can be seen by child processes. We can do the assignment and export in 2 separate lines too (i.e. var1="123"; export var1;)
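A small sketch of the difference (var names foo/bar are arbitrary): the child shell started via bash -c sees only the exported var.

foo="abc" => local, not exported
export bar="xyz" => exported
bash -c 'echo "foo=$foo bar=$bar"' => prints "foo= bar=xyz", as the child process inherits only bar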

 2. global var: These are also called environment variables (or reserved variables), and many of them are available across other shells (but some are bash specific such as BASHOPTS, BASH_ARGC, etc). Usually given all uppercase name. "env" or "printenv" cmd prints all env var. few ex of env var are PATH, HOME, SHELL, PWD, PS1, PS2, CC, CFLAGS, LD_LIBRARY_PATH, TERM, USER, USERNAME, etc. Some are readonly while some can be set. To make a local var global, we should put "export" cmd above in a file that gets sourced at startup such as ~/.bashrc.

Accessing variables: $ is used to get the value of a variable assigned previously. ex: abc=10; echo $abc; => here $abc can be used to get the value stored in the var abc. It's preferred to quote $var in double quotes ("$var") to prevent reinterpretation of special chars (except $, `, \, everything else is hidden from shell reinterpretation when inside double quotes, as we'll see later). Indirect referencing of vars (similar to pointers in C) is also available via ${!varname} (explained later; not really needed for simple scripts). $, {}, =, etc are special chars discussed later.

Just as we use $ before a user-defined var to access its value, we can use $ before the special shell reserved vars to access their values (i.e. echo $SHELL). We may also modify these values by assigning to them, i.e. for changing the PATH var in bash, we may overwrite or append to the PATH env variable.

ex: in /home/ashish/.bash_profile (or .bashrc), we may have the PATH env var. This is a very imp env var used to find the path of any cmd (when the path to that cmd is not provided). We usually want to add more paths to this env var, so that scripts or cmds that we may have in other non-std dirs may still be run w/o providing the full path. One way to modify this var is to do as follows in the .bash_profile file:

echo $PATH => shows the std paths, usually "/usr/local/bin:/usr/bin:/usr/sbin". This means any cmd or script for which the path is not provided is going to be searched for in these std paths. They are searched in the order this list is provided, i.e. /usr/local/bin is searched 1st, then /usr/bin and so on, until that script or cmd is found. If it reaches the end of the list w/o finding the cmd, it gives an error "cmd not found".

PATH=$PATH:/home/ashish/scripts => This appends the new path "/home/ashish/scripts" to the already existing list of paths in the PATH env var, so the script /home/ashish/scripts/mytest.pl can now be run just by calling "mytest.pl", instead of providing the full path "/home/ashish/scripts/mytest.pl". We have to do an "export" cmd also in this file, so that this modified PATH is available to child processes and subshells. Here, we appended to the path instead of overwriting it, as that's safer.

echo $PATH => now it shows the modified path var, which has our custom path appended. So, it shows "/usr/local/bin:/usr/bin:/usr/sbin:/home/ashish/scripts"

prompt: to set the prompt (the prompt is the dir name followed by the $ sign you see in any terminal; terminals with a bash shell usually show a "$" sign as the prompt) in bash, we overwrite the PS1 var. See the bash documentation for prompt escapes.

echo $PS1 => shows [\u@\h \W]\$ => This means that it's showing "[", then username, then @, then hostname, then current working dir followed by "]" and finally a $ sign. So, my prompt in a terminal looks like [ashish@linux-desktop ~]$

below assignment modifies PS1 (the bash prompt) to something custom.
PS1="\[\e[31m\] \w \[\e[0m\][\!]\[\e[32m\]$ \[\e[0m\]"; => \w => current working dir, \u => user name, \h => host name, \n => newline, \! => history number. \[ ... \] is used to embed a seq of non-printing chars. The \[\e[31m\] sequence sets the text to bright red, the \[\e[32m\] sequence sets the text to bright green, and \[\e[0m\] returns to the normal color.

Whenever we run any pgm in a shell, the shell stores the cmd line that was run in reserved vars. These can be accessed within the pgm by referencing these special reserved vars. These special reserved vars are not available outside of the pgm invocation. So, the pgm needs to be able to access these shell vars; if the pgm running is itself a shell script, it's very easy to use these vars.

command line arguments: are stored in $0, $1, $2 and so on. $0 refers to the script name itself, $1 refers to the first arg following the script name, $2 to the 2nd arg, and so on. $# stores the number of args (i.e. if $n is the last arg, then $# is n). (In csh, $argv[1], $argv[2] etc also store these cmd line args, $#argv stores the number of args excluding the name of the script itself, and $argv[*] passes all the args as a list.)

ex: ./test.csh name1 name2 => $0=./test.csh, $1=name1, $2=name2

ex (csh): foreach name ($argv[*]); echo $name; end => this prints all the args above.

Below some ex show how $ is followed by some other special character, or a number to access these special var.

ex: ./command -yes -no /home/username

  • $# = 3 (stores the num of cmd line args that were passed to the pgm)
  • $0 = ./command, $1 = -yes etc. (store the value of the cmd name, 1st arg, etc). These are called positional parameters.
  • $* = -yes -no /home/username => stores all the arguments that were entered on the command line ($1 $2 ...). Unquoted $* shouldn't be used, as word splitting can cause bugs.
  • $@ = array: {"-yes", "-no", "/home/username"} => stores the value of the args as an array; when written as "$@", each arg is quoted separately
  • $$ = expands to the process ID of the shell or script in which it appears
  • $? = expands to the exit status of the most recently executed foreground pipeline. This is used very often in bash scripts to check if the cmd on the previous line executed successfully, by using an if-else cond. An exit status of 0 means the cmd executed successfully, while any other number indicates an error.
  • $! = expands to the process ID (not exit status) of the most recently executed background cmd.
  • !$ = gets the last arg from the last cmd.
    • ex: less a.log
    • grep "me" !$ => this causes !$ to be replaced with a.log, as that was the arg of the last cmd above
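A minimal sketch tying a few of these together (args.sh is a hypothetical script name, run as ./args.sh -yes -no /home/username):

#!/bin/bash
echo "script=$0 num=$# first=$1" => prints script=./args.sh num=3 first=-yes
ls /nosuchdir
echo "exit status = $?" => prints a non-zero status, since ls failed on a non-existent dir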

getopts builtin: Parsing cmd line options more gracefully: Above we saw how to parse and retrieve cmd line args using $1, $2, etc. However, if we want to parse options the way linux cmds do, the above code gets complicated. We would like to specify option flags as -<flag_name>, followed by a value <flag_value>. There's a POSIX builtin shell cmd called "getopts" for this (NOT getopt, which is an older external cmd that isn't POSIX compliant).

Here's an excellent article on how to do it: https://www.howtogeek.com/778410/how-to-use-getopts-to-parse-linux-shell-script-options/
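A minimal getopts sketch, assuming two hypothetical flags -u (takes a value) and -v (boolean); the colon after u in the optstring means -u expects an argument, which getopts places in OPTARG:

#!/bin/bash
verbose=0
while getopts "u:v" opt; do
  case "$opt" in
    u) user="$OPTARG" ;;     # value following -u
    v) verbose=1 ;;          # boolean flag
    *) echo "usage: $0 [-u user] [-v]"; exit 1 ;;
  esac
done
shift $((OPTIND - 1))        # drop the parsed options; remaining args start at $1
echo "user=$user verbose=$verbose rest=$*"

Running "./script.sh -u ashish -v file1" would print "user=ashish verbose=1 rest=file1".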

Data types: Any pgm language has diff data types it supports. In bash, the primitive data type is the char string (i.e. everything is interpreted as a group of chars). However, depending on context, strings are interpreted as numbers (if they contain digits only) for arithmetic operations. Floating point numbers are not natively supported, but external pgms like bc or dc can be used. The derived data type in bash is the array, which is not that commonly used. Bash does provide a "declare" or "typeset" builtin cmd to declare a var as constant, integer or array. NOTE: bash doesn't support a lot of data types, so it's limited in doing complex data manipulations.

1. String: Everything is char string. It's internally interpreted as numbers depending on context, or if declared via cmd.

2. constant: ex: readonly TEMP=abc;

3. array: an array contains multiple values. The format is the same as the assignment operator, except that multiple values are given inside parentheses ( ). We do not define the size of an array, so any +ve index number can be used. Associative arrays are also allowed (bash 4+), which allow the subscript to be any string instead of a +ve number. To dereference the array (i.e. get the value of an item), we use $ just like with other vars. However, since an array uses [ ] (which is a special char used for other functions explained later), it may get interpreted differently, so we use { } around the array reference to remove any ambiguity.

ex: my_array=(one two three) => this assigns my_array[0]=one, my_array[1]=two and so on.

ex: echo $my_array[2] => this prints "one[2]". This is because a var name can't have [ .. ], so as soon as [ is encountered, var interpretation stops, and the value of "my_array" is printed, which is the first element (index 0) of this array, i.e. it actually prints ${my_array[0]}. Then it prints "[2]", since for echo those are just literals to print. If we want to print index 2 of this array, we need to use curly braces, i.e. echo ${my_array[2]}. Now everything inside { } is interpreted as the array reference, so it prints "three".

ex: me_arr[7]=me => this is equally valid syntax. Even though the var name has square brackets in it, which would be invalid for a plain var, the shell recognizes name[index]=value as an array assignment. "echo $me_arr[7]" doesn't work because $me_arr expands to ${me_arr[0]}, which is unset (empty), so it just prints the literal "[7]". It needs curly braces to work, i.e. "echo ${me_arr[7]}".

ex: echo ${my_array[*]} => this displays all values of the array. Instead of *, we could use @ also. $my_array[*] would interpret $my_array and [*] separately, and would print "one[*]" on screen.

ex: echo ${#my_array[@]} => prints the num of elements in the array. Note that ${#my_array} alone prints the num of chars in the first element, since in general ${#VAR} prints the num of chars in VAR.

ex: declare -A my_array; my_array[name]='ajay'; => associative array. It must be declared with "declare -A" first; otherwise "name" is treated as an arithmetic expression (evaluating to 0 for an unset var), and my_array[0] gets assigned instead.
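Putting the array pieces together, a short sketch (the array names are arbitrary):

fruits=(apple banana cherry)   # indexed array
echo "${fruits[1]}"            # prints banana
echo "${#fruits[@]}"           # prints 3, the number of elements
declare -A ages                # associative array must be declared first
ages[ajay]=30
echo "${ages[ajay]}"           # prints 30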

C. Special Characters or metacharacters:

Apart from the alphanumeric characters on the keyboard (that are used for cmds and variables), we have multiple other characters, and some of them have special meaning. These are the special characters you find on your keyboard (other than alphabet and numbers). When these special characters are not part of a var name, they can have their own special meaning. These special chars are called metacharacters.

Printable char: Following are the special chars seen on your keyboard (some of these are accessible by using the shift key). We'll study the special meaning of each of these later.

~ ` ! @ # $ % ^ & * ( ) _ - + = { } [ ] | \ : ; " ' < > , . /

Non-printable char: Apart from the above printable chars, we also have other non-printable keys such as backspace (i.e. delete), space, tab, ctrl, shift, alt, esc, enter (return), insert, function keys and a few others. All these keys have ASCII codes associated with them, but for right now we'll consider the 2 most important ones being used => the space and enter keys.

A. space key: "space" special character is used to separate out words into "tokens".

B. Enter key: The Enter key or newline is used to separate out cmds from each other (by signifying the end of the current cmd). ; may also be used to separate out cmds.

NOTE: The way the parser works is that it separates out keywords/cmds delimited by space, newline or other special chars. Other special chars may be parsed into their own tokens. Special chars can be placed right after/before each other to form even more special operators (i.e. groups of 2 or 3 special chars such as &&, <=, etc, which are separated out as their own token and considered a special character group).

ASCII:

Each of the keys on the keyboard have a ASCII code associated with them:

http://www.asciitable.com/

Each ascii code is 1 byte, so 256 ascii codes are possible (0 to 255 in decimal). Codes 0 to 127 are the normal characters, while codes 128 to 255 are special characters. These special characters correspond to the extended character set and are encoding dependent, so they are not the same on all platforms. Ascii codes from 128-255 are called extended ASCII codes, and are special chars such as pi, beta, square root, etc that are not present on the keyboard.

Std keyboards have a total of 105 keys, of which there are 53 keys in the 4 main rows; of these, caps lock and the 2 shift keys don't have equiv ascii codes, and BS, TAB, ENTER don't have 2 ascii codes each (they produce the same code with or without shift). This accounts for 47*2+3=97 normal characters. Space, Del and ESC are 3 more keys with equiv ascii codes. The remaining 28 ascii codes are from 00 (0x00) to 31 (0x1F) [4 of these: BS, TAB, CR, ESC are already accounted for in the 53 keys].
 ASCII codes (shown as decimal, hex):
 00 = 0x00 = NULL (end of string is always a null character)
 ASCII codes for CTRL-A is 0x01, CTRL-B is 0x02 and so on till CTRL-Z is 0x1A (decimal 26)
 08 = 0x08 = BS (backspace) => equiv to CTRL-H (^H)
 09 = 0x09 = TAB (horizontal tab) => equiv to CTRL-I (^I)
 10 = 0x0A = LF (line feed or NEWLINE \n) = advances the cursor to the next line (on unix terminals this also implies moving to the start of the line) => equiv to CTRL-J
 13 = 0x0D = CR (carriage return or ENTER) = moves the cursor all the way to the left but doesn't advance to a new line => equiv to CTRL-M (^M).
NOTE: Any file made in windows treats end of line as "CR LF", so it assigns the ascii values "0x0D 0x0A", but linux files treat end of line as "LF" only, so the ascii code is "0x0A". So, for files imported from windows to linux (files created in windows using notepad, and then opened in linux using vi or emacs), the linux editor sees these extra "0x0D" chars and prints ^M at the end of each line. Many newer editors (xemacs) are smarter and ignore "0x0D" when it precedes "0x0A". So, be aware when transferring text files b/w Linux and Windows. It doesn't affect functionality, but is a nuisance nonetheless.
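Two common ways to strip the trailing CR (0x0D) from such files on linux (assuming the dos2unix utility or GNU sed is available):

dos2unix file.txt => converts CRLF line endings to LF in place
sed -i 's/\r$//' file.txt => same effect with GNU sed: deletes the CR at the end of each line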
 27 = 0x1B = ESC (escape)
 32 = 0x20 = SPACE (space key)
 48 to 57 = 0x30 to 0x39 = 0 to 9
 65 to 90 = 0x41 to 0x5A = A to Z
 97 to 122 = 0x61 to 0x7A = a to z
 127 = 0x7F = DEL
special characters from 128 to 255 are encoding dependent and so differ b/w OSes. If we have the NUMLOCK key activated, press the "ALT" key, type the extended ASCII code in decimal on the numeric pad on the right, and then release the "ALT" key; we get the special character printed on screen. So, if we pressed 128, we would get the special char for 0x80 (which is Ç, a C with a cedilla, in the common code page 437).
 128 = 0x80 = special character Ç (C with a cedilla at the bottom)
 254 = 0xFE = special character filled in square
 255 = 0xFF = special character "nothing" or "blank"

Printable char (1-9, a-z, A-Z, @, ?, etc) are from ASCII code 33 to 126.

A very good doc of these special char is here: http://www.tldp.org/LDP/abs/html/special-chars.html

showkey cmd: To know what ascii code is generated for which key, we can run cmd "showkey -a" which will show ascii codes generated for any key.

a=97, esc=27, enter=13, backspace=127

Many combination of keys such as "ctrl key + C key" pressed together, etc are not printable but can be used on cmd line of shell to do various things. This is what we are going to study next.

cmd line editing: Any shell has a cmd line i/f thru which we enter cmds. Editing on this i/f is provided by the Readline library, which is used by several pgms including bash. That is why cmd editing is the same across different shells such as bash, csh, etc. cmd line editing is basically using keys to edit typed chars or move the cursor to the desired location on the line. We use simple keys like the delete key, arrow keys etc to edit the cmd line, but we also have combinations of keys available which can do more efficient editing. These are divided into control keys and meta keys. These control or meta characters are not normally used inside a script, though they can be used via their octal or hexadecimal code, i.e. ascii code 0x0a is the code for newline or the control char C-j. These control/meta chars have the same meaning irrespective of whether the caps lock key is turned ON or not, i.e. pressing control + small w behaves the same as pressing control + capital W (small w or capital W doesn't matter).

control keys: control keys are keys activated by pressing the "ctrl" key. When we press the "control" key and then the "c" key (not together, but press the ctrl key, and then while keeping it pressed, press the c key), it kills the current cmd. This is the special behaviour invoked by pressing these 2 keys together. Pressing the "control" key and the "k" key is referred to in text as "C-k" (Control k). The control key by itself doesn't have an ascii code, but when pressed in combination with an alphabet key, it generates ascii decimal codes 1 to 26, i.e. C-a = 1, while C-z = 26. However, as we can see from the ascii table above, some ascii codes within 1 to 26 are also assigned to other keys such as tab, enter, etc. In such a case, the ctrl key with that char serves the same purpose as the key. Ex: the tab key is assigned code=9, but C-i is also assigned code 9, so tab functionality can also be achieved by pressing C-i. Similarly, the enter key has code=13, but C-m is also assigned code 13, so the functionality of enter can be achieved by pressing C-m. C-0 to C-9 may also be mapped to ascii codes such as 28, 29, 30, 31, or just be the ascii code of that number itself, implying no special treatment. Control keys show up in terminal o/p as a caret (^). So, C-a shows as ^A and so on. This is also a very common way of writing control keys in books.

We have control characters from C-a to C-z, but we'll look at only few important ones only.

C-b/C-f => move back/forward 1 char

C-a/C-e => Move to start/end of line

C-d => log out of a shell, similar to exit cmd

C-k => Kill the text from the current cursor position to the end of the line. (kills text forward)

C-u => Kill the text from the current cursor position to the start of the line. (kills text backward)

C-w => Kill from the cursor to the previous whitespace. (kills text backward)

C-y => This is used to yank (copy back) the text killed via above cmds.

job control cmds: A subset of control key cmds allow us to selectively stop or continue execution of various processes. The shell associates a job with each pipeline. It keeps a table of currently executing jobs, which may be listed with the jobs command.

C-c => kills currently running process

C-z => suspends the currently running process and returns control to bash. In the MSDOS file system, C-z (0x1A) is the EOF (END OF FILE) character.

C-s => suspend. This freezes the terminal's output. Use C-q to restore it. Many times, when you see a terminal not responding on the cmd line, it's because it's suspended (due to someone accidentally pressing C-s; C-s is the save cmd in emacs, so if the cursor is not in the emacs window but instead on the terminal, it will freeze the terminal w/o the user knowing it). In such cases, use C-q to see if it restores it.

jobs => lists all jobs running in that shell (NOTE: all jobs under control of this shell only). Many options supported. This shows jobid in [ .. ].

bg <job_id> => This resumes a suspended job in the background. This is equiv to appending & at the end of a cmd (to cause any job to run in the background). If no <job_id> is provided, then the current job is used.

fg <job_id> => This resumes a suspended job in the foreground. If no <job_id> is provided, then the current job is used. Typing %<job_id> brings that job to the foreground too.

kill <job_id> => kills that job. Many options supported.
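A short job-control session sketch (the job id and pid shown are just examples):

/home/proj/ > sleep 100 & => starts the job in the background; the shell prints [1] 12345 (jobid in [ ], then the pid)
/home/proj/ > jobs => shows: [1]+ Running sleep 100 &
/home/proj/ > fg %1 => brings job 1 to the foreground; now C-z would suspend it, and bg would resume it in the background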

meta keys: Similarly, pressing the "Meta" key and the "k" key is referred to as "M-k" (Meta k). Older linux keyboards used to have a meta key, but windows-dominated PCs had the ALT key instead, so linux started treating the ALT key the same as the Meta key. The Alt key by itself doesn't have an ascii code, but when pressed with a char, it generates a sequence of 2 ascii codes. For ex, when we press M-a, there is no corresponding single ascii code. Instead, the terminal program receives a character sequence beginning with the escape character (byte 27 or 0x1B, sometimes written \e or ^[) followed by the char "a", so ascii code 27 followed by ascii code 97. Since the "meta" or "alt" key generates this "esc" character as the first char, we can achieve the same behaviour by pressing the esc key and then pressing the char. If the Alt key is not present or is inconvenient, many people prefer to use the "Esc" key. However, when using the esc key, we press Esc, release it, and then press the char. This is different from what we do with the control or alt keys, where we press the key and the char simultaneously. The reason is that we need to generate 2 ascii codes to mimic the meta key behaviour, so we press the esc key, release it, and then in quick succession press the char key to generate those 2 ascii codes. Sometimes it works by pressing the esc key, keeping it pressed and then pressing the char key, but that works by luck, as the ascii codes for the 2 keys get generated right one after the other even though the esc key was kept pressed. It may not always work, so it's better to use the approach of pressing and releasing the "Esc" key.

Esc key is printed on screen as "^[", so esc +K = ^[K as shown on screen. Alt + K also shows as ^[K. So M-k rep as ^[K. showkey cmd shows ascii decimal code 27 and 75 being generated for M-k irrespective of whether Alt key or Esc key used. Meta keys have a loose convention that they operate on words, while control keys operate on char. word is defined as seq of letters and digits, so anything separated by whitespace, /, etc is considered 1 word. Some imp meta keys below:

M-b/M-f => move back/forward 1 word (instead of 1 char as in C-b/C-f).

M-d => Kill from the cursor to the end of the current word, or, if between words, to the end of the next word. (kills text forward)
M-DEL => Kill from the cursor the start of the current word, or, if between words, to the start of the previous word. (kills text backward)

Other keys: Pressing other keys such as "home", "end", etc generates their own sequences of ascii codes. "home", "end" generate a sequence of 3 ascii codes, "page up", "page down" generate a seq of 4 ascii codes, while some function keys generate a seq of 5 ascii codes. So, their behaviour can be mimicked by pressing the keys corresponding to these ascii codes in quick succession. However, the common theme is that all these key combinations start with the "Esc" char, since that is how readline identifies that the seq of chars coming after "esc" is special and is to be treated as one.

special character usage: Some of these meta or special char take on different meaning depending on context, so it can be confusing. We'll talk about some of the imp meta char below:

1. # => comment. These are used to write comments, which can be at the beginning or end of a line. Anything after a # is ignored by the shell (until the new line). So, if we put "\" (which is continuation of line) at the end of a comment line, it doesn't include the next line as part of the comment (i.e. the \ quoting rule doesn't apply inside comments). Another shell, tclsh, differs from bash/csh in this regard, as \ in a comment line causes continuation of the comment on the next line.

2. " ' \ => These characters help hide special char from shell.

Hiding special characters from shell by using these 3 char (" ' \) : Though special char above have special meaning, we may force the shell to treat them as literals. The double quotation marks (""), the single quotation marks (''), and the backslash (\) are all used to hide special characters from the shell. Hiding means it preserves the literal value of the character, instead of interpreting it. Each of these methods hides varying degrees of special characters from the shell. These 3 are referred as quoting mechanism.

I. Double Quotes " " : weak quoting

The double quotation marks are the least powerful (weak quoting) of the three methods. When you surround characters with double quotes, all special characters are hidden from the shell, except these 3: $, ` (backtick), and \. The dollar sign and the backticks retain their special meaning within the double quotes. The backslash has special rules as to when it retains its special meaning.
The backslash retains its meaning only when followed by dollar (\$), backtick (\`), double quote (\"), backslash (\\) or newline (\newline_at_end_of_line, i.e. \ followed by the "enter" key, not the newline char \n). Within double quotes, the backslashes are removed from the input stream when followed by one of these characters. Backslashes preceding characters that don't have a special meaning are left unmodified for processing by the shell interpreter (i.e. \n is left unmodified). A double quote may be quoted within double quotes by preceding it with a \. Similarly, $ and ` may be printed as-is by preceding them with \.

This type of quoting is most useful when you are assigning strings that contain more than one word to a variable. For example, if you wanted to assign the string hello there to the variable greeting, you would type the following command:

ex: greeting="Hello there \" name ' me" => This command would store the string "Hello there" name ' me" in the variable "greeting" as one word. If you typed this command without using the quotes, then it would error out (as Hello would be assigned to greeting, but then it would find unknown cmd "there"

test="This is"; echo $test => prints "This is" on screen. Since $test is not quoted, $ expansion is done for var named test.

echo "$test" => prints "This is" as $test is substituted. However, echo '$test' prints $test on screen and not "this is" as $test is not substituted by it's value.

mkdir "test dir" => creates dir named "test dir" (with space in between test and dir, as space looses it's special meaning of separarting tokens). If we do mkdir test dir, then it creates 2 dir with names test and dir. mkdir 'test dir' works same way as mkdir "test dir".


II. Single Quotes ' ': strong quoting

Single quotes are the most powerful form of quoting. They hide all special characters from the shell (a superset of double quotes). This is useful if the command that you enter is intended for a program other than the shell. Because single quotes are the most powerful, you could have written the previous example using single quotes too. The only thing to remember is that a single quote may not occur within single quotes (even if preceded with \, as nothing is escaped within single quotes); at that point, the 2nd single quote is identified as the end of the quoting (i.e. anything within the 1st and 2nd single quote is treated as 1 string).

ex:  greeting="Hello there $LOGNAME \n"  => This would store the string "Hello there " and the value of $LOGNAME into the variable greeting (greeting would be assigned "Hello there Ashish \n". The LOGNAME variable is a shell variable that contains the username of the person who is logged in to the system.

ex: greeting='Hello there $LOGNAME \n' => the single quotes would hide the dollar sign from the shell and the shell wouldn't know that it was supposed to perform a variable substitution. So, greeting would be assigned "Hello there $LOGNAME \n"

NOTE: escape sequences such as \n (newline), \t (tab) etc exist; however, in bash quoting they are not treated as special chars with a 1-byte ASCII value, but instead as 2 literal chars. So, above, \n is stored as literals, not expanded into the newline ascii char. However, echo "\\n" would print \n (as \\ is collapsed to \ within double quotes), while echo '\\n' would print \\n (as \ is not escaped within single quotes).

\a=bell(alert), \b=backspace, \n=newline, \r=return(enter), \t=horizontal tab, \xH or \xHH = 8 bit character whose val is 1 to 2 hex digits, \uHHHH = 16 bit unicode char whose value is 1 to 4 hex digits
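These escapes can be seen in action with printf, which does interpret them (unlike plain bash quoting); a small sketch:

printf "a\tb\n" => prints a and b separated by a tab, followed by a newline
printf "\x41\n" => prints A (0x41 = decimal 65 = the ascii code for A)
printf "\u00E9\n" => prints é (16-bit unicode char; the \u escape needs bash 4.2+)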

Assigning newline to a var was not possible in bash with old syntax (as \n wouldn't get converted to newline), but now with printf, we can do that.

ex:

printf -v str "Hello \n"; => stores Hello followed by newline byte into var "str".

echo "$str"; => we need to use double quotes for echo in order to print newline, else newline are converted into space.

III. Backslash \ : NOTE: we refer to \ as backslash, as it leans backward (as if resting on a chair). Forward slash is /, as it leans forward (as if about to fall to the front). In Linux, all dir paths use forward slash (the key at the bottom of the keyboard), while in windows dir paths are separated by backslash (the key at the top of the keyboard).

Using the backslash is the third way of hiding special characters from the shell. Like the single quotation mark method, the backslash hides all special characters from the shell, but it can hide only one character at a time, as opposed to groups of characters. You could rewrite the greeting example using the backslash instead of double quotation marks by using the following command:

ex: greeting=Hello\ There => In this command, the backslash hides the space character from the shell, and the string "Hello There" is assigned to the variable "greeting". bash did not even interpret the space; it just saw the escape character and continued with the space just as with any other valid character. Then when it ultimately hit a newline, it completed the cmd and assigned the whole thing to "greeting".

Backslash quoting is used most often when you want to hide only a single character from the shell. This is usually done when you want to include a special character in a string. For example, if you wanted to store the price of a box of computer disks into a variable named disk_price, you would use the following command:

ex: disk_price=\$5.00 => The backslash in this example would hide the dollar sign from the shell. If the backslash were not there, the shell would try to find a variable named 5 and perform a variable substitution on that variable. Assuming that no variable named 5 were defined, the shell would assign a value of .00 to the disk_price variable. This is because the shell would substitute a value of null for the $5 variable (any undefined variable is assigned a null value). The disk_price example could also have used single quotes to hide the dollar sign from the shell.

\\, \', \", \? => all of these quoting characters and other special characters are escaped due to \.

If we put \ at end of line in a script, then it escapes newline char. Bash removes \n byte altogether (since \ tells it to hide newline char) and causes continuation of 1st line on 2nd line, w/o any newline in b/w. ex:

a=cde\
efg;

echo a=$a; #prints cdeefg (w/o any space b/w cde and efg). However, putting a space before/after the \ causes an error: "a=cde \" fails since the space ends the assignment at cde, and efg is treated as the next (unknown) cmd; "a=cde\ " fails since the \ now escapes the space instead of the newline, so the newline still ends the cmd and the next line efg is again treated as an unknown cmd.

 

3. End of cmd: A newline character (by pressing enter/return) is used to denote the end of 1 cmd line. For multiple cmds on the same line, a semicolon (;) can be used to separate the cmds. Since ; is a metacharacter, the parser sees it as a token even w/o spaces around it, but spaces improve readability.

Ex:
> wc temp ; rm temp ; mkdir ana => ; is used only when cmds are on the same line. Enter separates cmds on separate lines, so no ; is needed per line. Any unix cmd can be used directly in tcsh/bash scripts.

Control operators: Some special characters or combinations perform a control function (i.e. indicate separation of cmds or the end of a cmd). These are: |, |&, &, ;, ||, &&, ;;, ;&, ;;&, (, ). We already saw the semicolon as a control operator. We'll see the others below. The exit status of a combination of cmds is the exit status of the last cmd executed.

  • Pipe cmd (|): Pipe is one of the most used operators in linux, and does redirection. Pipe is done using "|" (the key on the right side of the keyboard just above the enter key). The o/p of each cmd is connected via a pipe (| or |&) to the i/p of the next cmd. "|" passes the cmd's stdout to the next cmd's stdin, while "|&" passes the cmd's stdout and stderr to the next cmd's stdin. This is a method of chaining commands together. Each cmd in a pipeline is executed in its own subshell.

    ex: cat *.txt | sort | uniq => merges all .txt files, sorts them, and then deletes duplicate lines. The pipe operator keeps passing the o/p of 1 cmd as i/p to the next cmd.

  • List: List is seq of 1 or more pipeline separated by operators (; or & or || or &&), and optionally terminated by one of these (; or & or newline).
    • OR/AND (|| &&): Many times, we have a chain of cmds, and we want subsequent cmds to be executed depending on whether the previous ones executed successfully. For ex, when running make, a lot of cmds make sense to run only if the previous make cmds ran w/o error. In such cases, && and || come in handy. They are logical AND/OR of 2 booleans. The exit status of the combination is the exit status of the last cmd executed (0=success, non-zero=failure; in shell logic, exit status 0 acts as TRUE and any non-zero status as FALSE). So, if the exit status can be decided by the 1st cmd itself, then the 2nd cmd is not executed (short-circuit evaluation; see the sketch after this list).
      • cmd1 && cmd2 => cmd2 is executed only if cmd1 is success (i.e returns exit status of 0. That means first cmd returns TRUE, so result = TRUE && cmd2 = cmd2. So, cmd2 is run).
      • cmd1 || cmd2 => cmd2 is executed only if cmd1 is failure (i.e returns exit status of anything non zero. That means first cmd returns FALSE, so result = FALSE || cmd2 = cmd2. i.e cmd2 has to be executed to determine the boolean value of this expression, so, cmd2 is run).
    • semicolon (;): This separates diff cmds, and indicates that shell should wait for current cmd to finish, before executing next cmd. Newline serves same purpose as ;
    • background (&): shell executes current cmd in background, so that next cmd does not wait for previous cmd to finish, but can start immediately.
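A small sketch of these list operators (the paths and file names are arbitrary):

mkdir /tmp/bld && cd /tmp/bld => cd runs only if mkdir succeeded (exit status 0)
grep "foo" no_such_file.txt || echo "grep failed" => echo runs only because grep failed (non-zero exit status)
sleep 30 & => runs in the background; the prompt returns immediately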

4. source or . cmd: the dot cmd (period) is equiv to the builtin "source" cmd. The "source" cmd is used to source a file (i.e. read and execute the cmds in the file specified) in the current shell, w/o spawning a subshell. This cmd works in all shells. When invoked from the cmd line, "source" or "." executes a script, while when invoked from within a script, they load the file sourced (i.e. load the code in that script, similar to #include in a C pgm).

ex: prompt> source abc.sh => executes the cmds of the abc.sh script in the current shell

ex: prompt> . abc.sh => same as above

5. backquote or backtick (`): This is the key just below the ESC key on the top left of the keyboard (it's NOT the single quote key found on the double quote key on the right middle of the keyboard). This makes the o/p of a cmd available for assignment to a var or to another cmd (cmd substitution). cmd substitution invokes a subshell and removes trailing newlines if present. The cmd may be any cmd that can be typed on an interactive shell. The parentheses form $( ) explained later achieves the same purpose as backticks. Since backtick retains its special meaning in double quotes, we can always enclose ` within " .. ".

ex: rm `cat file.txt` => here o/p of cmd cat is passed on to rm cmd.

ex: a="`ls -al`"; echo $a; => lists all the files. " .. " doesn't make any difference in the o/p. var "a" stores the o/p as a single string, not an array. In csh, this o/p is stored in an array.

ex: echo `pwd`/test => here cmd within backtick is expanded, so it will print something like this: /home/jack/test

6A. user interaction: All languages provide some way of getting i/p from a user and dumping o/p. In bash, we can use these builtin cmds to do this:

Output cmds: echo and printf cmds are supported. echo is a cmd supported by all shells, while printf here refers to the bash builtin (an external printf pgm also exists on most systems).

I. echo: the echo built-in command outputs its arguments, separated by spaces and terminated with a newline character. The return status is always zero. echo takes a couple of options (look in the manual).

 ex: echo -n abc rad 17 => even w/o single or double quotes, this prints everything, as everything up to the newline is considered echo's arguments. A newline is automatically added at the end, but not here, since the -n option is used (-n suppresses the trailing newline). There are many more options supported.

II. printf: This is another builtin cmd. It is specific to bash as a builtin, and its implementation may differ b/w different bash versions. It is a good replacement for echo, as it follows syntax similar to the C language.

ex: printf "a=$a b=%d \n" $b => $ is expanded, so $a takes value of var a. %d is similar to C lang, where args outside double quotes are substituted for %d. So, assuming a=1, b=2, it prints "a=1 b=2" with a newline at end (by default printf doesn't add a newline). NOTE: there is no comma outside double quotes before putting the var name (as in C lang)

Input cmds: read is the builtin i/p cmd supported. There is no other builtin way to read i/p in bash. Some options of read are bash specific and not supported by other shells. csh has "$<" for reading i/p.

I. read: the i/p line is read until enter is pressed. There are various options supported. By default, the line read is stored in the var "REPLY".

ex: read; echo "read line is = $REPLY" => If i/p entered is "My name", then var REPLY stores "My name".

ex: read Name Age Address; echo "name= $Name, age = $Age, addr = $Address"; This splits the line into various words. 1st word is assigned to $Name, 2nd word to $Age, and remaining words to $Address. The characters in the value of the IFS variable are used to split the input line into words or tokens; By default $IFS is space, tab or newline, so words are split on space boundary.

ex: read -a my_array => here various words of line are assigned to array = my_array. my_array[0] stores 1st word, my_array[1] stores 2nd word, and so on ...

ex: read -p "what's your name? " name => here, this prints the string as a prompt, so that we don't have to do an echo separately. $name stores the i/p line entered after the prompt "what's your name? " is echoed.

6B.  IO redirection: good link here:

http://www.tldp.org/LDP/abs/html/io-redirection.html

redirection means capturing o/p from a file, script, cmd, pgm etc and passing it to another file, script, cmd, pgm, etc. By default, there are always 3 files open: stdin (keyboard), stdout (screen) and stderr (error msg o/p to the screen). Each open file has a numeric file descriptor, so Unix assigns these 3 files, file descriptors of 0, 1 and 2. For opening additional files, there remain descriptors 3 to 9.

The file descriptors or handles for the default files are:

i/p = STDIN or file descriptor 0 (fd0). This is /dev/stdin

o/p = STDOUT or file descriptor 1 (fd1). This is /dev/stdout

error = STDERR or file descriptor 2 (fd2). This is /dev/stderr (which, like stdout, usually points to the terminal)

process file descriptors for each process are in /proc/<process_id>/fd/0,1,2 (for i/p, o/p, err). /dev/stdin, /dev/stdout, /dev/stderr are soft links to /proc/self/fd/0,1,2 (and /dev/fd is equiv to /proc/self/fd), which in turn point to whatever files the process actually has open. So, when we pipe the o/p of 1 cmd into the i/p of another cmd using pipe (|), the shell connects fd1 of the 1st cmd and fd0 of the 2nd cmd to the 2 ends of a kernel pipe (so that the o/p of the 1st cmd becomes the i/p of the 2nd cmd).
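On a typical linux system, you can see these links for yourself; a quick sketch ($$ is the current shell's pid, as described above):

ls -l /dev/stdin /dev/stdout /dev/stderr => shows symlinks into /proc/self/fd/
ls -l /proc/$$/fd => shows fds 0, 1, 2 of the current shell, pointing at the terminal device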

So, when we run the "read" cmd, it takes i/p from STDIN, and when we run printf, it dumps o/p to STDOUT. If we want to change this default behaviour (redirect i/p or o/p to other places instead of these 3 default files), we use the redirection operators < (to redirect i/p) and > (to redirect o/p). The > operator creates a new file (or overwrites an existing one), but if we want to append to an existing file, then use >>. Other redirect operators are >, >>, <, >&, &>. <&-, >&- are used for closing various file descriptors.

output redirection using > or >> => ">" redirects o/p to named file instead of outputting to stdout (i.e screen). > overwrites the file if present, while >> appends to the existing file if present.

 ex: ls -al > cmd.txt => So, here it lists the contents of current dir not on stdout (screen), but on file "cmd.txt". Any error from cmd is still directed to STDERR. >& or &> causes both STDOUT and STDERR to be redirected to file cmd.txt. It's preferable to have no space after > or >> (i.e ls >>cmd.txt)

input redirection using < => "<" redirects i/p to be taken from the named file instead of taking i/p from stdin (i.e keyboard).

 ex: grep "ajay" <names.txt => So, here it looks for name "ajay" in file names.txt instead of taking i/p from stdin (keyboard). We can have a space after <, but preferable to have no space.

 We can use file descriptor numbers too for redirection, i.e. N>file redirects file descriptor N to the named file, while M>&N makes fd M a copy of fd N

ex: ls -al 2> file1 => this redirects fd2 (or STDERR) to file1. N> implies redirect for fd0, fd1 or fd2 depending on N=0,1,2. When N not provided then > implies STDOUT while < implies STDIN.

ex: cmd1 2>error.txt => file desc 2 (i.e stderr) gets redirected to error.txt => error msg from o/p of cmd1 gets redirected to error.txt

ex: cmd2 &>out.txt => &> redirects both stdout and stderr to out.txt
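Combining these, a short sketch (file names arbitrary); note that 2>&1 means "point fd2 at whatever fd1 currently points to", so the order of redirections matters:

ls /etc /nosuchdir >out.txt 2>err.txt => the listing goes to out.txt, the error msg to err.txt
ls /etc /nosuchdir >all.txt 2>&1 => both stdout and stderr go to all.txt (same effect as &>all.txt)
sort <names.txt >sorted.txt => i/p from names.txt, sorted o/p to sorted.txt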

The here document (<<) and here string (<<<) operators also do redirection. Here's a little intro:

HEREDOC (<<): Here documents are supported in most shells. This is a form of i/p redirection. Frequently, your script might call on another program or script that requires input. The here document provides a way of instructing the shell to read input from the current source until a line containing only the search string (aka limit string) is found (no trailing or starting blanks). All of the lines read up to that point are then used as the standard input for a command. They are called heredoc probably because the document is right here in the script, instead of coming externally from some other file.

We can use any char inside the HEREDOC. If the search string is w/o any quoting (i.e. no " " or ' '), then all text within the heredoc is treated like regular bash lines, and parameter substitution, expansion, etc are done. Many times, we use a heredoc to generate a script to be used later. In such cases, we want to treat the text inside the heredoc literally, with no substitution/expansion done (to print the text as is with no modification). We can do this by putting single quotes or double quotes around the limit string. Per the bash manual, if ANY part of the limit string is quoted (with ' ', " " or \), no expansion is performed on the heredoc body; that's why double quotes work here too, even though inside normal double quotes $, ` and \ would retain their special meaning. Using a HEREDOC is a lot better than using a bunch of echo/printf to print those lines to a file, and then redirecting i/p from that file. There may be a space after << (doesn't matter).

ex: here NAMES is the search string. Last NAMES should be on a line by itself with no trailing spaces to be identified as end of HEREDOC.

cat << "NAMES"

Roy $name; c=$[$a+$b]; echo $c; => everything is printed literally since "NAMES" used. If we just used NAMES w/o the quotes in start, then $a $b, $c, $name will be substituted and expression will be evaluated.

Bob #next line has NAMES by itself (no spaces) to indicate end of heredoc

NAMES
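A sketch of the common use mentioned above, generating a script from within a script (the path /tmp/hello.sh is arbitrary); since the limit string 'EOF' is quoted, $USER is written literally into the file, and only expands when hello.sh later runs:

cat > /tmp/hello.sh <<'EOF'
#!/bin/bash
echo "running as $USER"
EOF
chmod 755 /tmp/hello.sh; /tmp/hello.sh => prints "running as <your username>"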

HERE string (<<<): a here string can be considered a stripped-down form of the here document. It consists of nothing more than COMMAND <<< $WORD, where $WORD is expanded and fed to the stdin of COMMAND.

ex: grep -q "txt" <<< "$VAR" => here i/p to grep is taken from $VAR

ex: String="This is"; read -r -a Words <<< "$String" => reads words from the given string

7. Brackets [ ], braces { } and parentheses ( ): [] and {} are used in pattern matching using glob. All of [], {}, () are used in pattern matching in BRE/ERE. See the regular expression section. However, they are used in other ways also: both the single [ ], ( ), { } and double [[ ]], (( )) versions are supported.

I. single ( ) { } [ ]:

( ) { } => these are used to group cmds, to be executed as a single unit. parenthesis (list) causes all cmds in list to be executed in separate subshell, while curly braces { list; } causes them to be executed in same shell. NOTE: parenthesis (list) do not require blank space to be recognized as parenthesis, as () are operators, and hence recognized as a separate token, blank space or not. However, curly braces { list; } has historically been a reserved word (just like for, if, etc), so it needs a blank or other metacharacter, else parser will not recognize it as separate token. Also a ; or newline is required to indicate end of list in { list; }, but not in (list).

subshell: any cmds enclosed within parentheses ( ) are run in a separate subshell, similar to the "direct" or "csh/bash/etc" execution of a shell script. Ex:
/home/proj/ > (cd ~ ; pwd; a=5;) => runs cd in a subshell and prints dir name after doing cd. then it returns back to main shell, forgetting all actions it did. a=5 is valid only in subshell, and not remembered outside of subshell
/home/kagrwal
/home/proj/ > pwd => note, pwd prints dir that it was in prior to executing cd cmd.
/home/proj/

NOTE: if we run any shell script, the kernel invokes a program to "run the script". It runs the script in a separate shell that it creates for the sole purpose of running the script. Any actions you do inside the script (cd .., etc) are carried out in the new shell that it created. At the end of execution of the script, that shell is killed, all actions that the script did in that new shell are gone along with it, and control returns back to the original shell that launched the script. If we do not want to spawn a new child shell, we can use the "exec" cmd to run our cmd, which will replace the current shell with the new cmd.

parentheses ( ) allow cmd substitution too. To save the output of any linux cmd to a var, we can use $ in front of the parentheses, and then use that var. Trailing newline chars in the cmd's o/p are stripped.

myvar=$( ls /etc | wc -l ) => returns number of word count from this cmd, and assigns the value to var "myvar"

echo $myvar => prints value stored in myvar variable. It can be number, string, etc

$(cmd) or `cmd` both achieve same purpose.

ex: echo $(date); or echo `date`; => both print date as "Mon Jun 24 16:41:55 PDT 2017".

parentheses ( ) also do array init as shown in the arrays section above.

{ } => Braces { } are also used to unambiguously identify variables. They protect the var within {} as one var. { ..} is optional for simple parameter expansion (i.e $name is actually simplified form of ${name})

ex: var1=abc; path=${var1}_file/txt; echo $path; => This assigns path to abc_file/txt. If we did path=$var1_file/txt, then parser would look for var named var1_file/txt, which doesn't exist. So, it will print nothing as $path is undefined.

{} are also used to substitute parts of var name:

ex: ${STRING/name/bob} => this replaces 1st match of pattern "name" with "bob" in var STRING

ex: ${STRING//maya/bob} => this replaces all matches (see // instead of /) of pattern "maya" with "bob" in var STRING

We can also assign/change the value of a parameter within { } by using =, +, -, ?, etc. With ":" before the operator, the operator acts when the parameter is unset OR null; w/o ":", it acts only when the parameter is unset (a null value still counts as set). := and :- are more commonly used.

ex: := (new value assigned to the parameter only if the parameter is unset or null)

IN_STEP=A;
IN_STEP=${IN_STEP:=rtl}; => here IN_STEP would be assigned rtl only if it were unset or null. Since IN_STEP already has the non-null value "A", it is not assigned the new value rtl, but retains its old value "A".
OUT_STEP=C; echo "Mine= ${OUT_STEP=B}"; => since ":" is omitted, B would be assigned only if OUT_STEP were unset. Parameter OUT_STEP exists (with value "C"), so it keeps "C", and "Mine= C" is printed.

ex: :- (here one of the 2 parameters is assigned to the whole thing, depending on if 1st parameter exists or not)

a=${test1:-tmp} => here if "test1" is unset or null, then "tmp" is substituted, else the value of "$test1" is substituted. So a=$test1 or a=tmp. Note, it's not $tmp, but the literal string tmp. (Unlike :=, the :- form doesn't assign anything to test1 itself.)

Indirect expansion of a parameter within braces is done when it's of the form ${!PARAMETER}, i.e. the 1st char is !. Bash uses the value of the variable named by the rest of "PARAMETER" as the name of the variable to expand; that variable's value is then used in the substitution, rather than the value of "PARAMETER" itself.

ex: echo "${!SH*}" => expands to the names of all vars starting with SH. Matching vars are SHELL, SHELLOPTS (reserved vars), so the names "SHELL SHELLOPTS" are printed, rather than the values of $SHELL and $SHELLOPTS. This is the special ${!prefix*} form; plain indirection ${!var} instead expands to the value of the var whose name is stored in var.
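ex: plain indirection ${!var} (a minimal sketch; the var name "name" is made up):

name=SHELL
echo ${!name}    # prints the value of $SHELL (e.g. /bin/bash): $name's value is used as the var name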

{ } can also be used to group a block of code. Spaces are needed around the braces, and each cmd inside (including the last one) must end with ; ex: a=3; { c=95; ... ; }; echo $c; => vars set inside { } are visible outside, since no subshell is spawned (unlike ( )).

{} are used in glob and RE/ERE too. They are used for expansion of the patterns inside them, and are explained more in the Regular expression topic. ex: echo a{c,d,e}f => prints acf adf aef. There can't be any space inside the braces, else they won't be recognized for expansion.

Other uses: ${#parameter}, ${parameter/pattern/string}, and many more.
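ex: ${#var} gives the length of the value (a quick sketch; values made up):

var=abcdef
echo ${#var}                  # prints 6, the number of chars in $var
arr=(a b c); echo ${#arr[@]}  # prints 3, the number of elements in array arr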

[ ] => square brackets are used for globbing as explained above, but [ is also a cmd by itself (i.e /usr/bin/[ is a binary executable, so putting "[" in a script calls the cmd "[". [ can be a builtin cmd or an external cmd). It has the same functionality as the test cmd (it's actually a synonym for test), except that the last argument needs to be the closing square bracket ]. Also, it needs blank space around it to be identified as a cmd, else the parser recognizes it as a globbing char. test is a builtin cmd and is frequently used as part of a conditional expr. It is used to test file attributes, and perform string and arithmetic comparisons. It has a lot of options, and can take anywhere from 0 to 5+ args, which can do a lot of complicated tests. Detailed doc for the test cmd is here: https://www.computerhope.com/unix/bash/test.htm

ex: num=4; if (test $num -gt 5); then echo "yes"; else echo "no"; fi => This tests if $num is greater than 5. Since $num is not greater than 5, "no" is printed.

ex: num=4; if [ $num -gt 5 ]; then echo "yes"; else echo "no"; fi => this is equiv to the above, as the cmd "test" is replaced by [ ... ]. NOTE: there are spaces on both sides of [. ] also needs space around it, though a ; right after it is parsed correctly. If we omit the space before ], we get the error "missing ]".

[ ] are used to denote array elements as explained in array section above, and for evaluating integer expressions as explained below.

arithmetic operators: $[expression] is used to evaluate an arithmetic expr, similar to $(( ... )) (note $[ ] is an older, deprecated form; $(( )) is the standard). The cmds "expr" or "let" can also be used to do arithmetic operations. Syntax similar to C lang is used here:

  • number arithmetic: +, -, *, /, %, **(exponent), id++/id-- (post inc/dec), ++id/--id (pre inc/dec)
  • bitwise: &, |, ^(bitwise xor), ~(bitwise negation), <<(left shift), >>(right shift). ~ is also used as expansion to home dir name.
  • logical: &&, ||, !(logical negation)
  • string comparison: ==(equality), !=(inequality), <, >. These are not arithmetic comparisons but lexicographic (alphabetic) comparisons on strings, based on ASCII ordering. We need to precede < and > with backslash inside test/[ ], so that these characters are not interpreted as redirection, but as comparators. So, test "Apple" \< "Banana" is true, and test "apple" \< "Banana" is false, because all uppercase letters have a lower ASCII number than their lowercase counterparts. To compare numbers, we use -lt, -gt with the test cmd explained above.
  • arithmetic comparison: -lt, -le, -gt, -ge, -eq, -ne => These are used for arith comparison (i.e [ 12 -gt 14 ] returns false, as 12<14 and not >).
  • assignment: =(assigns RHS to LHS), *= /= %= += -= <<= >>= &= ^= |= => these are assignments where LHS is operated on by the operator before = with RHS, and the result assigned back to LHS (i.e a*=b; is same as a=a*b).
  • matching: =~ is a matching operator (similar to perl syntax) where the string on the RHS is treated as an ERE and matched against the string on the LHS. Note it only works inside [[ ]], not [ ]. ex: [[ $line =~ ab*c ]]
  • conditional evaluation: expr ? expr1 : expr2 => similar to C if-else stmt
  • comma: comma is used as a separator b/w exprs

ex: c=$[2+3]; echo $c; => prints 5.

ex: a=7;b=10; c=$[a*b]; => gives 70. Note: inside arithmetic expansion we can use var names with or without $, i.e c=$[$a*$b] gives the same result, since bare var names are automatically dereferenced in arithmetic context.

ex: expr $a + $b; => this prints 17. Note: spaces have to be provided around the args of "expr" as its syntax demands that. expr $a+$b would just print the single string "7+10" instead of evaluating it. expr 5 + 4 will print 9.

ex: a=$(expr 10 \* $a ); echo $a => will print the value of 10*$a, which is 10*7=70. NOTE: * has to be escaped as \* else the shell expands it as a glob, and spaces are needed around each arg of expr (due to the syntax of expr).

NOTE: bash lacks an "expression grammar", i.e we can't directly operate on 2 numbers. ex: c=$a * $b is not valid, as direct number arithmetic or arithmetic comparison (as $a < $b) is not supported outside the arithmetic contexts $[ ] and (( )). That is why we have to use the cmds "expr" or "test" (or arithmetic expansion) to achieve this. This is a big drawback of bash. csh allows direct arithmetic operations.

ex: i=2; j=`expr $i '*' 5`; => as can be seen here, we can't directly do j=$i*5. We had to use expr along with cmd substitution.

II. double (( )) {{ }} [[ ]]

(( ... )) => double parentheses are used for arithmetic operations. We can't directly add numbers as c=a+b; we need to enclose them as c=$((a+b)) => this evaluates the expr and substitutes the result. Spaces are not required inside (( )). $((expr)) is the standard POSIX form and is preferred over the older deprecated $[expr].

ex: c=$((2+3)); => prints 5. same as c=$[2+3]

ex: ((a++)); => increments var a (C style manipulation of var). $ is not needed in front of (( unless we want to assign the result to some other var

(( ... )) is also used as a conditional construct to test an expr: the exit status is 0 (true) if the expr evaluates to non-zero.

ex:  (( $num == 40 ))

also in for loop as shown above.

ex: for ((i=0; i<10; i++)); => This works since any expr in (( .. )) is valid, so ((i=0)) assigns i to 0, ((i<10)) is conditional construct to test, and ((i++)) is arithmetic expr to inc i.

NOTE: $(cmd) and $((expr)) usage. Single parentheses are used for cmd substitution, while double parentheses are used for arithmetic expr evaluation/substitution.

[[ ... ]] => double square brackets are the new "test" cmd, which enables additional functionality not available with the old test cmd or single brackets. It was added later to bash, and is a keyword rather than a pgm, so we won't find a pgm named "[[". It does (basically) the same thing as a single bracket, but is a bash builtin and is not POSIX compliant, so use single [ .. ] instead of double [[ ... ]] to ensure portability. You can think of [[ ... ]] as a superset of [ ... ]: it can do everything the single bracket does, and a lot more. Also, [[ ... ]] is easier to use, as args inside don't require escape chars (no glob expansion on filenames etc is done in the new test). Also, ==, &&, || are supported in the new test. In the old test or [ ], = is used to test for equality (= can still be used in the new test also, but == is preferred there).

ex: var=abcd; if [[ $var == abc ]]; then echo "yes"; else echo "no"; fi => prints "no".
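ex: a sketch of glob vs regex matching inside [[ ]] (values made up):

var=abcd
if [[ $var == abc* ]]; then echo "glob match"; fi      # == in [[ ]] does glob matching => prints "glob match"
if [[ $var =~ ^ab.*d$ ]]; then echo "regex match"; fi  # =~ treats the RHS as an ERE => prints "regex match"
echo ${BASH_REMATCH[0]}                                # the string matched by =~, i.e abcd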

{{ ... }} => double braces are not defined as anything special in bash. So, do not use them. ${{SHELL}} gives a "bad substitution" error (${SHELL} is the correct form).

NOTE: $ in front of (), [] or {} has different meanings. $(cmd) causes cmd evaluation and substitution, while $[expr] (same as $((expr))) causes arithmetic expr evaluation and substitution. ${VAR} is the same as $VAR, and is used to remove var name ambiguity.

8. pattern matching: glob and RE/ERE are used in a lot of unix cmds for pattern matching, using the special characters below. More details in the Regular expression section. NOTE: many of these characters are used for other purposes too (dual/triple purpose, depending on what else is around them; for ex, as shown in the sections above, braces, brackets and parentheses are used for evaluation/substitution etc). So, whenever we use these, we have to be careful that they are interpreted correctly.

* ? [] ! {} => These characters are used as wildcards in pattern matching in glob. curly braces may be used too, depending on your system's settings.

. * ^ $ \ [] {} <> () ? + | [: :] => These characters are used as wildcards in pattern matching in RE/ERE. These are explained in Regular expression topic.

9. looping constructs: These 3 are used to form loops: until, while, for. The "break" and "continue" builtins are used to control loop execution. break exits the loop (not the script); continue jumps to the next iteration, skipping the remaining stmts in the loop after it.

  • while: while <test-cmds>; do <consequent-cmds>; done => execute <consequent-cmds> as long as <test-cmds> have an exit status of zero (i.e stop when the exit status is non-zero, which implies failure of <test-cmds>). "true" may be used as <test-cmd> to run the loop infinitely.
    • while [ $i -lt 4 ]; do i=$[$i+1]; done
  • until: until <test-cmds>; do <consequent-cmds>; done => execute <consequent-cmds> as long as <test-cmds> have a non-zero exit status (i.e stop when the exit status is 0, which implies success of <test-cmds>). This is the opposite of while, i.e continue the loop while <test-cmds> are "false".
    • until [ $i -ge 4 ]; do i=$[$i+1]; done => equiv to the above "while" ex.
  • for: There are 2 formats of the for loop (see the combined sketch below):
    • for name in <LIST> ... ; do <cmds>; done => name takes the value of each item of LIST in turn. If "in <LIST>" is not provided, "in $@" is used as default (i.e values from cmd line args)
      • for i in $(ls); do cat $i; done => or "for i in `ls`; ..."
    • for (( expr1; expr2; expr3 )); do <cmds>; done => similar to C pgm style of for loop
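ex: the three loop forms side by side (a minimal runnable sketch):

i=0; while [ $i -lt 3 ]; do echo "while: $i"; i=$((i+1)); done   # prints 0 1 2
i=0; until [ $i -ge 3 ]; do echo "until: $i"; i=$((i+1)); done   # same o/p, inverted test
for ((i=0; i<3; i++)); do echo "for: $i"; done                   # C style
for f in a b c; do echo "list: $f"; done                         # list style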

10. Conditional constructs: These 3 are used to test for conditions: if-else, case, select. Note the ending keyword for the if block is fi (if written backward), and for case it's esac (case written backward).

  • if-else: if <test-cmds>; then <consequent-cmds>; elif <more-test-cmds>; then <more-consequent-cmds>; else <alternate-consequent-cmds>; fi => execute <consequent-cmds> if <test-cmds> have exit status 0, else continue with further test-cmds.
    • ex: if [ -a file.txt ]; then echo "file exists"; else echo "doesn't exist"; fi => here the test-cmds are put inside [ ... ], as square brackets provide test (explained above). Option -e can also be used instead of -a.
    • ex: if [ -d $dirname ]; then echo "dir $dirname exists"; fi => This checks for existence of a directory named $dirname. There are many more options supported; look in the bash manual.
    • ex: if (( year % 4 == 0 )) || (( year != 1999 )); then echo "this"; fi => (( ... )) can be used to test exprs as explained above. Note: inside (( )), numeric comparison uses ==, !=, <, > (not -eq/-ne, which belong to test). We can also use C style ?: within (( ... )) to test.
    • ex: if (( var0 = var1<3?a:b )) => This is equiv to => if var1<3 then var0=a else var0=b. NOTE: whitespace inside (( )) is ignored, so spacing around the elements of ?: is optional.
    • if ...; then ... elif ...; then ... else ... fi => when written on one line, each test list needs a ; before its "then", and each cmd list needs a ; before elif/else/fi.
    • if [ "$T1" = "$T2" ]; then echo expression evaluated as true; else echo expression evaluated as false; fi => NOTE: = used instead of == as it's within [ ].
    • ex: rm abc*; if [ "$?" -eq 0 ]; then echo "success"; fi
  • case: case <word> in pattern1 ) <cmd_list1> ;; pattern2 ) <cmd_list2> ;; ... esac. Each pattern list plus its cmds is known as a clause. So, we can have multiple clauses, each for a set of matching patterns. Clauses are separated by ;;, but ;& and ;;& can also be used, which have different meanings. NOTE: the pattern list only has an ending parenthesis, no starting one. Also, a double semicolon is used instead of a single semicolon for separating clauses. The same effect as case can be achieved with if-else, but the code looks messy with too many if-else, so case is preferred in such cases.
    • case $animal in
    • horse | dog | cat) echo "animal 1";; # the leading parenthesis ( is optional. Also, | is used to match multiple patterns, so that if any of them match, this cmd executes
    • kangaroo | a[bcd]*) echo "animal2";;
    • *) echo "unknown";; #* means default match, as * matches everything.
    • esac
  • select: same syntax as for, except that "break" has to be used to get out of the select loop: select <name> in <words> ... ; do <cmds>; done. It is very useful for generating user option menus, similar to what you see when a bash script asks for your choice on screen (see the sketch below). More details in the pdf manual.
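ex: a minimal select menu sketch (the menu choices are made up):

PS3="pick an animal: "               # PS3 is the prompt that select displays
select animal in horse dog cat; do   # prints a numbered menu, then reads a number from the user
  echo "you picked: $animal"
  break                              # without break, select keeps looping
done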

11. null cmd or colon (:) => It's a NOP cmd. It's a shell builtin cmd, and its exit status is always true (i.e 0). It's used in while loops, if-else, etc (explained later) when there's no condition to be specified.

ex: while :; do ... done => here : returns true, so equiv to while true; do ... done. So, it's an endless loop.

ex: if condition then : else ... fi => here "then" doesn't have any stmt to execute, so : used.

ex => : ${a:=23} => Any line in bash is a cmd followed by args. If we wrote the bare expansion ${a:=23} on a line, its expanded value would itself be run as a cmd, which fails. Putting : in front makes : the cmd and the expansion its arg, so only the side effect (assigning a) remains. Note $(a=23) would not work for this: the assignment happens in a subshell and is lost. Similar ex => : $[ n = n + 1 ] => works fine

12. existence of file/dir: -e/-d (the example below is csh syntax; a bash version follows it)
#!/bin/csh
if (-d dir1) then ... endif
if (!(-e ${AMS_DIR}/net.v)) then ... else ... endif
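For comparison, a bash sketch of the same checks (AMS_DIR comes from the csh example above):

#!/bin/bash
if [ -d dir1 ]; then echo "dir1 exists"; fi
if [ ! -e ${AMS_DIR}/net.v ]; then echo "net.v missing"; else echo "net.v found"; fi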



 

Advanced Bash cmds: Read in bash manual for more on this.

 

 

 

 

World Basic Facts:

 

World stats from 2018 (updated to 2020 numbers in some places):

A very good place to see a lot of world stats is here: https://ourworldindata.org

 


 

Population:

Let's start with population, since that's the most important factor in determining the economy of a country and its prosperity. Population increases through births and decreases through deaths. Pretty simple. If every couple has 2 kids, then assuming 2 people (i.e the parents) die for every 2 kids born, the world population will remain constant. You will hear the term "fertility rate", which is the number of children per woman. The value at which the population holds constant (about 2, as seen in the couple example above) is known as the replacement rate; above it, population grows. Of course, depending on how long people live, this replacement rate may vary a little: if people live longer, a slightly lower rate may be enough to keep the population constant. If you ever heard the slogan "hum do, hamare do" (Hindi for "2 kids per family"), that's where it comes from - when a couple has 2 kids, the population holds steady.

Largest countries in world by population: http://www.worldometers.info/world-population/population-by-country/

Total population = 8B, China=1.4B, India=1.4B, USA=0.33B, Indonesia=0.28B, Pakistan=0.2B, Brazil=0.2B, Nigeria=0.2B.

Among 225 countries, only 90 have a population of over 10M (1 crore, roughly the population of a metropolitan city in India). In fact, only 13 countries have a population over 100M. After Bangladesh, India is the most densely populated country among the top 50 countries by population. Pakistan and Bangladesh, though much smaller in area, are still at no. 6 and no. 8 when it comes to population. If Pakistan and Bangladesh were 1 country (as they were before 1971), they would be the 3rd largest country, ahead of USA. In the next 30 years, India will be the most populous country, while Russia and Japan with their declining populations will fall out of the top 10. What you see is that more and more 3rd world countries are moving up the chart - USA and China will be the only developed countries in the top 10.

Area wise, Russia is the biggest country, followed by China, USA, Canada, Brazil and Australia, all of which are about half the size of Russia. Next comes India, which is much smaller at about 1/6 the size of Russia. Among large countries, Australia, Canada and Russia are very sparsely populated, with less than 10 people per km^2.

Past and future population growth of all major countries is listed here: https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_future_population

Population growth rate:

If we look at population growth from 1985 to 2020, world population went from 5B to 8B in 35 years, implying about a 1.4% growth rate. Growth has slowed to just over 1% as of 2021. It's inching down by about 0.03% every year, so the world population growth rate will be under 1% by 2022-2023 or so.

Death rate is around 1% of population per year, while birth rate is around 2%. That is what gives the net 1% population growth rate. Birth rates have been falling, while the death rate has hovered around 1% (it's falling too, but much more slowly). In most developed countries where population growth has stagnated, the main reason is the falling birth rate - it has come close to the death rate, so the population can't grow anymore.

Top causes of death worldwide: (out of deaths of 70M/year worldwide as of 2023):

  • Diseases: A lot of deaths are from diseases of old age, e.g. heart attack.
  • Accident: Accidents kill > 3M people every year worldwide. Out of these, 1.2M are crashes by cars and other vehicles. About 0.8M are due to falls among children and elderly people.

Why is the birth rate 2%? As a very simple scenario, consider a couple in their mid-thirties having 2 kids. Those 2 kids have 2 kids of their own around age 35, and those 2 have 2 more around age 35. That means about 6 new kids are born every ~100 years for every couple. So, every year it's 0.06 kids per 2 people, i.e 0.06/2*100 = 3 new births per 100 people, or a 3% birth rate. This birth rate depends on the number of kids per couple as well as the age at which they have them. In practice, couples are having fewer than 2 kids on avg, and having them in their 40's instead of 30's, which brings the birth rate down to about 2%.

Why is the death rate 1%? This is much simpler. Let's assume a person dies at age 70 on avg. If we assume a population of 70 people with ages from 0 to 70 in serial order, then 1 person out of 70 dies every year. That implies 1/70 deaths per year, i.e for every 100 people, death rate = 1/70*100 ≈ 1.4%. However, we have to consider the birth rate too, since that determines the mix of old and young people. With more young and fewer old people in that mix of 70, deaths will be < 1 per 70 people, bringing the death rate below 1.4%. As an example, consider the US population: 50M out of 330M people are over age 65. Assuming the avg age of death is 78 yrs (avg life expectancy), these 50M people will die in the next 13 years, implying 50M/330M = 0.15 of the population every 13 years, or ~0.011 per year; out of 100 people, that's 0.011*100 = 1.1 => 1.1% death rate. So, the mix of the population and avg life expectancy determine the death rate.

Country Population:

Biggest Asian country: China: Went from 1B in 1985 to 1.4B in 2020, implying about 1% growth. Its population growth has now stalled to almost 0 (due to its "one child per couple" policy), so it will likely stay at this level.

Other Asian countries: India, Pakistan, Bangladesh, Indonesia, Philippines, Vietnam: All 6 countries almost doubled their population in the last 35-40 years, implying almost a 2% growth rate. They are going to grow at close to 1% for the foreseeable future, implying their economies will keep growing just thru population growth. These are all 3rd world countries, so their economies run on the back of high internal population growth. The top 4 of these countries combined have a population over 2B (or 25% of world population).

Declining Asian countries: Russia, Japan and South Korea are all struggling with stagnant or declining population. Their economies will continue to suffer unless they can import people or export their products. Russia, since peaking at 148M in 2000, has declined to 146M as of 2020. Similarly Japan, since peaking at 127M in 1995, has declined to 126M as of 2020. South Korea had been growing at 0.5% in the last decade, but its population is now stagnant at around 50M.

S Korean population growth: https://www.bloomberg.com/news/articles/2022-08-24/fastest-aging-wealthy-economy-breaks-own-fertility-record-again

It shows S Korea's fertility rate at 0.8, and its population declining from 50M in 2020 to 24M in 2100.

North American countries: USA, Canada: Both USA and Canada have grown their population by about 50% in the last 35 years, which is commendable for developed nations. Their growth rate is still > 0.5%. Since they are developed countries, their internal growth rate is declining as people have fewer babies. A big reason for their high growth rate is immigration, which is basically importing people to juice up their economic numbers. Canada at 37M people is about 1/10th the size of USA, which is at 330M.

Oceanian country: Australia is the only developed country besides USA and Canada with strong population growth. It grew by about two-thirds in the last 35 years, from 15M in 1985 to 25M in 2020, and is expected to keep growing at > 1% for the foreseeable future. Australia is also big on immigration, though not as big as Canada.

Latin American countries: Brazil, Mexico, Colombia, Argentina: All 4 grew their population by 50%-60% in the last 35 years, which is slower than the Asian economies, but still a decent rate, and expected to continue. Also being 3rd world economies, they are able to grow population internally, as people in these countries continue having more babies. They are going to grow at close to 1% for the foreseeable future. Brazil has about 200M people, followed by Mexico at 125M, Colombia at 50M and Argentina at 45M.

European countries: Germany, UK, France, Italy, Spain: Most European countries are suffering population decline or zero growth. Again, the reason is that they are developed economies and people are having fewer kids. Germany, the largest economy and most populated European nation, was 78M in 1980, peaked at 82M in 2000 and started declining. However, as of 2020 its population has increased to 84M, and is growing at about 0.5M/year. The reason it was able to reverse the population decline was immigration. Britain falls in the same camp as Germany: its population also kept growing thru immigration, although at a lower rate, and is still growing by 0.5M/year. The next 3 countries, France, Italy and Spain, are stuck at a 0% growth rate, since they didn't import people in large enough numbers. So, these top 5 countries have about 300M people, growing at about 1M people per year, mostly thru importing people.

African countries: Nigeria, Ethiopia, Egypt, DR Congo, S. Africa, Kenya: African countries win the gold medal for population growth. Many African countries have more than doubled their population in last 35 years. All these countries have upward of 2% growth rate, and will continue to have higher rates for a long time. So, these African countries are going to rule the world, when it comes to exporting people. These 6 countries combined have > 600M people and will likely double their population in next 30-35 years.

 


 

Immigration:

Let's look at immigration component of the population growth: https://worldpopulationreview.com/country-rankings/immigration-by-country

 Countries by population and immigrant population (people born in other country). Data is as of 2020:

1. USA: Total population = 330M, Immigrant population = 50m (15% of total population). 50% of the population growth is due to immigrants. More details in USA basic facts.

2. Russia: Total population = 145M, Immigrant population = 12m (8% of total population). Russia not only has lots of immigrants, but also a lot of emigrants (people who leave to go to another country), which stood at about 10M. So, the net effect is that it doesn't gain from immigration. So, Russia's population will keep on declining, taking its GDP down with it. The only saving grace is oil, of which Russia is the biggest producer and 2nd biggest exporter.

3. Germany: Total population = 83M, Immigrant population = 10m (12% of total population). As per this link: https://en.wikipedia.org/wiki/Demographics_of_Germany, there were 20M ppl with an immigrant background, defined as ppl with at least 1 parent born outside Germany. So, 10M ppl were born outside Germany, and the remaining 10M are kids of these immigrants (the kids themselves born in Germany). So, 25% of the German population is not native. Germany is a huge immigrant hub. Since 1970, the natural population growth in Germany has been negative, -100K to -200K every year. Deaths have been at 1.1% of population, while births have been at 0.9%, resulting in a -0.2% growth rate, not accounting for immigrants. Hence there was a steep decline in the native population over the last 50 years. What saved it since the 1980's is the huge immigrant population coming in every year. Even the birth rate has improved, due to immigrants having more babies than native Germans. Since the mid-2000s, Germany is allowing even more immigrants, at about 0.5M/year, which keeps its population growth +ve at about 0.5% per year.

4. UK: Total population = 67M, Immigrant population = 10m (14% of total population). In 1950, the foreign-born population was 2M (or 5% of population). Now, as of 2020, it's 10M. Link here: https://en.wikipedia.org/wiki/Foreign-born_population_of_the_United_Kingdom. So, UK is also a big immigrant hub. The largest immigrant population is Indians - 0.8M of the population is India-born. India, Pakistan and Bangladesh comprise 1.5M of the population (roughly 2.5%). UK population increased by 8M from 2000 to 2018. Of that, about 5M was due to immigrants, accounting for 60% of the population increase. In the last couple of years, population has increased by 0.5M/year, of which 0.35M/year (or 70%) is due to import of people.

5. France: Total population = 65M, Immigrant population = 8m (12% of total population). 20% of the French population has an immigrant background, defined as ppl with at least 1 parent born outside France. However, most of the population growth in France is due to native French population growth, not immigrants. Here's the link: https://en.wikipedia.org/wiki/Demographics_of_France#Population_projections. Death rate is 0.9%, while birth rate is 1.1%, resulting in 0.2% (or 150K) population growth per year. That matches the net population growth closely, implying net immigration every year is small. Most of the immigration happened before 2000 (after world wars 1 and 2). France has the highest birth rate among these countries, so its native growth will keep sustaining it. However, the population growth rate is so small at 0.2% that it contributes almost nothing to GDP.

6. Canada: Total population = 37M, Immigrant population = 8m (20% of total population). Canada lives and breathes on immigration. 80% of the population growth in Canada is due to immigrants. Their population grew from 25M in 1985 to 37.5M in 2019, implying about a 50% increase. As of 2019, their growth rate was about 1.4% or 500K/year, the highest of any developed country. Canada sets an immigration target for each year, so they can raise the target as much as they want, depending on how many more people they need to import to juice up their GDP numbers. We can expect to see a 1.2% population increase per year for the foreseeable future. The Canadian Govt has an immigration target of 400K immigrants per year for the next few years, and Canada wants to get to 100M before the end of the century. Canada is the worst developed country to immigrate to, since they import slaves to work, who then are said to get all benefits in retirement. I guess it's still something for people from 3rd world countries, where they get nothing, so that keeps the "import of slaves" going !!

7. Australia: Total population = 25M, Immigrant population = 7m (30% of total population). Australia is also big on immigration, right behind Canada. Population of Australia increases by 1.2% (300K) per year of which 150K is due to immigrants. So, 50% of the population growth is due to immigrants. Link: https://worldpopulationreview.com/countries/australia-population. Australia is expected to keep growing at this growth rate for foreseeable future. Their population will almost double by 2100. This is the highest population growth of any developed country. Whatever is the shortfall in the native growth rate will be made up by immigrants.

8. Italy: Total population = 61M, Immigrant population = 6m (10% of total population). Population grew from 57M in 1985 to 61M in 2019, implying 0.2% population growth. In recent years, population growth has been negative at -0.2%, making it the fastest shrinking country in the world. In 2019, there were 650K deaths and 450K births, resulting in a -200K native population decline. Net immigration in Italy is about 100K/year, resulting in a net -100K/yr decline. In the absence of +ve population growth, house prices have been falling in Italy (Italy, Spain and Ireland are the only 3 countries in the EU where house prices have been falling), and construction has come down to really low levels. Home ownership is already high at 72%. With population projected to go to 40M by the end of the century, everyone will own a house even with zero construction of new houses. That's good news for people wanting to immigrate to Italy.

9. Spain: Total population = 46M, Immigrant population = 6m (12% of total population). Here's a link: https://en.wikipedia.org/wiki/Demographics_of_Spain. Population grew from 38M in 1985 to 46M in 2019, implying 0.4% population growth. However, most of this growth was driven in the 2000's via mass immigration. For 2019, there were 350K births and 400K deaths, resulting in a -50K population decline. This was largely offset by 200K-300K immigrants. However, immigration hasn't been steady, due to high unemployment in Spain. Spain's native population growth will remain negative even with +ve immigration. By 2060, Spain is still projected to have 40M people, which is much better than Italy. However, due to the declining population, Spain's economy will be heading south.

Most of the immigrants to these countries came from 3rd world countries such as India, Pakistan, Bangladesh, Philippines, China, Russia, Mexico, etc. India was the largest exporter of people, at 16M people settled in other countries. So, over 1% of the people born in India have already immigrated and settled in other countries over the last 20-30 years.

 


 

GDP (as of 2020):

When we say GDP, we are referring to nominal GDP (unless mentioned otherwise). Total world GDP is at $80T. Look in the GDP section for details. As you can see there, countries with the largest population and largest land area tend to have higher GDP. Printing money has the biggest effect on nominal GDP, followed by population growth. NOTE: GDP is measured in US dollars, so if more USD gets printed, the GDP of the US as measured in USD goes up, even though printing more USD just devalued the currency.

GDP numbers for top countries are: https://en.wikipedia.org/wiki/List_of_countries_by_GDP_(nominal)

Historical GDP is shown here for all countries:

https://en.wikipedia.org/wiki/List_of_countries_by_past_and_projected_GDP_(nominal)

All countries below have nominal GDP > $1T, and have large population base. Exceptions are Canada and Australia, which in spite of lower population, still manage to be in top 15 largest economies. However, both of them rely on huge immigration to drive up those GDP numbers.

1. USA: GDP= $21T. A large immigrant population and money printing allow it to keep growing its nominal GDP at 5%/year.

2. China: GDP=$15T. China's GDP grew rapidly, from about $1T in 2000 to $15T in 2020.

3. Japan: GDP=$5T. GDP has remained stuck at this level for the last 20 years. They have had close to 0% GDP growth since their "Real Estate Bubble" burst in the early 1990's.

4. Germany: GDP= $4T. A large immigrant inflow every year allows it to keep its GDP growing, even though its native population is declining. Germany's GDP grew from $2T in 2000 to $3.5T in 2009, and then from $3.5T in 2009 to $4T in 2020, implying a paltry 1% nominal GDP growth in the last 10 years. Most of it is driven by the 0.5% population growth due to immigrants.

5. India:

6. UK: Heavily reliant on immigration to bump up their population numbers, which in turn drive GDP. Also helping them is the heavy worldwide usage of their currency, the pound.

7. France

8. Italy

9. Canada

10. South Korea

11. Russia

12. Brazil

13. Australia

14. Spain

15. Indonesia

16. Mexico

 


 

Government Debt:

This refers to the total amount of money that the Central Govt owes to others. Govt debt is also called public debt or Federal debt or debt held by the government. The Central or Federal Govt accumulates debt when the money it collects as taxes is lower than the money it spends on various programs. We are talking about the debt of the Central Govt and NOT state Govts, since the Central Govt is the one that has the power to print money. The Central Govt in any country can take any amount of debt in its own currency, as it has the power to print money in that currency. The problem arises when country A takes debt from another country B in country B's currency. Country B demands interest and payback of principal. The only way for country A to pay is to somehow sell things to country B, so that it can get country B's currency notes to pay the interest. This is where countries go bankrupt. The amount of debt matters, but more important is which country's currency the debt is owed in.

Debt to GDP ratio matters, and any debt over 100% of GDP is usually considered risky. Japan has a debt to GDP ratio of over 250%, which is insane. However, almost all of its debt is in its own currency, so Japan can just print more Yen at any time, and eliminate the debt by paying itself. So, the risk of default is zero, and the debt doesn't matter at all, whether it's 1% of GDP or 1000% of GDP.

On the other hand, Greece has debt which is only 150% of GDP, but still came close to bankruptcy. The reason is that the debt of Greece is in Euros, which Greece can't print. The Euro is the common currency of the eurozone nations (19 nations as of 2020), and they collectively decide whether to print more currency or not. Of course, countries which are doing well don't want to print more currency, as the value of the currency goes down. But countries whose economies are doing poorly want to print as much money as they can. This is where countries like Greece end up in trouble, as the currency is not in their control.

What about USA? Well, USA is in the best position, because the US dollar is the world's reserve currency. So, no matter what kind of debt USA has and who owns it, it can always print as many US dollars as it wants and pay off the debt. So, the risk of default is zero, just as it is for Japan. No matter whether US debt is owned by China or Japan, US can just pay them anytime. It just chooses not to print trillions of dollars to pay the debt.

NOTE that as the amount of debt goes up, nominal GDP also goes up. Nominal GDP is just the total amount of money in the system. So, it's pretty hard for the debt to nominal GDP ratio to go too high. That is why governments like to show this number, as the debt to GDP ratio can never look too bad. Or if it looks bad today, it will eventually start improving as the debt makes its way into the GDP number, which the current govt can take the credit for.

This link shows GDP as well as debt for various countries. Many GDP numbers look to be wrong based on wikipedia GDP estimates.

https://usdebtclock.org/world-debt-clock.html

As we see below, countries with high GDP are also the countries with high debt (as higher debt drives GDP higher). Debt as of mid 2020 is

1. USA: $27T (GDP=$21T)

2. Japan: $12T (GDP=$5T) => highest debt to GDP ratio for any large economy.

3. China: $8.4T (GDP=$15T)

4. UK: $3.5T (GDP=$2.8T)

5. Italy/France/Germany/Spain: $2.9T/$2.9T/$2.4T/$1.6T (GDP=$2T/$2.5T/$3.8T/$1.3T) => Germany is in the best position of all euro nations when it comes to debt to GDP ratio, a paltry 70%, while for the other countries it's > 120%. That's why Germany keeps pushing back on printing more Euros to bail out other countries.

6. India: $2.5T (GDP=$2.9T)

7. Brazil: $1.8T (GDP=$1.8T)

8. Canada: $1.9T (GDP=$1.8T)

9. Mexico: $0.8T (GDP=$1.3T)

10. South Korea: $0.8T (GDP=$1.6T). Large population decline of 50% or more expected in next 100 years, so GDP will start declining too (unless they can make up for that via increased exports)

11. Australia: $0.7T (GDP=$1.4T)

12. Russia: $0.3T (GDP=$1.7T) => lowest debt to GDP ratio of any developed country. However, with the population declining, GDP will start falling too.

These top 15 countries account for 75% of world GDP ($85T), and their debt is also pretty close to their GDP level (except China, Germany, Mexico, Korea, Australia and Russia).

Government Bond interest rates:

Not only the Government debt matters, but the interest rate that it has to pay on that debt also matters. Of course the interest rate is decided by the central bank of the country which is a part of the government. So, Governments have power to decide how much money to print, as well as the interest rate at which they are going to loan that printed money to themselves as well as others. It loans itself that printed money at a rate that it finds convenient, so it's all a scam in the end.

These are the rates fixed by the Central banks of different countries: http://www.worldgovernmentbonds.com/central-bank-rates/

As you can see above, some countries such as Switzerland, Denmark and Japan have negative interest rates, implying debt will pay itself off if kept long enough.

These are the interest rates on Government bonds: http://www.worldgovernmentbonds.com/

As you can see almost half the European nations have negative interest rates on 10 year government bonds. Germany, Switzerland and Denmark have 10 yr interest rates below -0.5%. Home loan rates and deposit rates are all under 1% for most of the developed countries. In fact, Denmark mortgage rates went negative at -0.5% per year for a 10 year mortgage, implying you were being paid every month by the bank for having a mortgage. If you kept refinancing the mortgage, you will eventually owe nothing to the bank. Insane times !!!

What's puzzling is that interest rates on government bonds are negative for many European countries, even though the central bank rates are at 0%. Maybe the central banks are buying government bonds from the open market very aggressively at negative rates. But then why not take the central bank interest rate negative, to keep both rates in sync ??

 


 

Oil:

Oil production = 100M barrels per day as of 2018. That is also the consumption rate. 1 barrel is 42 gallons or about 159 litres, so per day we consume 100M x 159 ≈ 16B litres. That equates to about 2 litres per person per day. We consume 35B barrels per year. Assuming 1 barrel costs $100, we spend $3.5T per year on crude oil alone. That's a big contributor to world GDP, at about 4%. That's also a lot of money in nominal terms (around $450 per person per year), which if given directly to the bottom 25% of people, would not leave anyone poor. In fact, we consume about as much oil per person per day as the water we drink every day.

Largest producers of Oil: These countries below produce about 90% of world oil.

USA: produces 12M barrels per day. Consumes 20M, so imports 8M. => BIGGEST producer, BIGGEST consumer, BIGGEST importer

Russia: produces 12M barrels per day. Consumes 6M, so exports 6M. => BIGGEST producer, 2nd BIGGEST exporter

OPEC: Saudi Arabia, Iran, Iraq, UAE, Kuwait, Venezuela, Nigeria, Angola, Qatar, Algeria, Libya (all OPEC countries) = produces 40M barrels per day. consumes 15M only, exports about 25M. Biggest exporter of oil is Saudi Arabia at 8M (produces 10M, consumes 2M) => BIGGEST producer, BIGGEST exporter

China: produces 4M barrels per day. Consumes 12M, so imports 8M. => 2nd BIGGEST consumer, 2nd BIGGEST importer

Canada: produces 4M barrels per day. Consumes 1M, exports 3M. BIG producer, exports most of it.

Brazil: produces 3M barrels per day. Consumes most of it, exports 0.5M.

Mexico: produces 2M barrels per day. Consumes 1M, exports 1M.

India: produces 1M barrels per day. Consumes 6M, so imports 5M. => 3rd BIGGEST consumer, 3rd BIGGEST importer

Japan: produces 4K (almost nothing) barrels per day. Consumes 4M, so imports 4M. => 4th BIGGEST consumer, 4th BIGGEST importer

South Korea: produces nothing. Consumes 3M, so imports 3M. => 5th BIGGEST consumer, 5th BIGGEST importer

 


 

Phones:

1.5B smartphones sold in 2017 with total revenue of $0.5T (implying $300/phone). These smartphones also require phone service which can easily be $300/year (assuming $25/month for USA market). So, total money spent on phone +service every year is $1T, or more than 1% of GDP. Smartphone sales are projected to reach 2B in 2019 (with 1.3B of these to be 4G enabled), and all phone sales (including dumb mobile phone) to reach 2.35B. Since bottom 85% of world lives on < $20/day, they can't afford any of these smartphones or the phone services that go with it. Assuming top 15% or 1B people of the world buy these smartphones every year, not sure where the remaining 1B sales come from. Since just Samsung, Apple and few more sold over 0.5B high end expensive phones, almost everyone living on >$50/day is buying these phones every year. Hard to believe, that !!

 


 

Milk:

Milk is such a important part of food consumption everywhere in the world, that it's economic impact on the economy can't be neglected.

About 1 trillion litres of milk is produced every year (930M tonnes in 2022). This implies about a third of a litre per person per day across the world population. This seems reasonable, as most people who can afford milk drink a glass a day, and then also eat other products based on milk. Milk is mostly gotten from mammary animals such as cow, buffalo, goat and sheep. Plant-based milk comes from plants; we are talking about milk coming from animals in this section.

Wiki link => https://en.wikipedia.org/wiki/Milk

Milk is 87% water, so its density is close to that of water, at 1.03kg/litre (buffalo milk is slightly denser than cow milk).

India is the largest producer of milk at 200M tonnes/year, followed by USA at 100M tonnes (as of 2022). Worldwide, 250M dairy cows produce ~1T litres of milk, implying ~4000L/cow per year. In the US, a single cow produces ~10K litres/year, while in India only ~1K litres/yr. China, the 3rd largest producer, has a yield of ~2K litres/cow per year.

 


 


 

 

 

 

Debt in USA:

From the article in "Banks and CU", we see that total assets combined for banks and CU are about $20T (as of 2018). Total deposits=$13.4T, while total loans=$10.8T. This only includes loans sitting on banks' books. There are many loans that banks/CU have sold to other investors, by bundling them into mortgage securities with certain interest payments to the holders of such securities. Most of these securities are sold to govt backed agencies (explained below), which in turn sell them to other investors. So, the $11T of bank/CU loans only shows part of the loans owed by consumers. A lot of debt held by consumers has been securitized and is held by people like you and me when we buy such bonds.

Total debt is comprised in 2 parts:

1. Consumer debt => Debt taken by consumers to buy house, cars, etc.

2. Government debt => Debt taken by the government if it ends up spending more than what it collects in taxes, then it has to take debt to fund it's operations.

We'll look at both of these categories.

 


 

Consumer debt:

Total debt of consumers in USA is about $16T as of Q2, 2018. $10T of that is mortgage related, $4T is consumer credit (revolving+non-revolving), $2T is others

  • Mortgage debt = $15T. $15T includes not just consumer mortgage debt ($10T), but also mortgage debt made out to corporations, builders, etc ($5T). Only $10.7T is for 1-4 family residences (house, condo, etc). Non-residential = $2.8T (offices, buildings, etc), while multifamily residences (apartments) = $1.3T. Banks/CU have about $5T, life insurance companies = $0.5T, Federal National Mortgage Association (FNMA aka Fannie Mae)=$3.2T, Federal Home Loan Mortgage Corp (FHLMC, aka Freddie Mac)=$1.9T, mortgage pools/trusts=$3T (of which Govt National Mortgage Association (GNMA aka Ginnie Mae)=$1.9T, private mortgage conduits=$0.8T), individuals/others=$0.8T. Thus govt agencies own about $7.1T (50% of total mortgage debt), 90% of which is in 1-4 family residences. Ginnie Mae is the only govt owned corp, while Fannie Mae and Freddie Mac are govt sponsored entities (GSE). However, securities issued by all 3 of these are considered to be backed by the US govt (same guarantee as on govt issued treasuries). Most of the consumer debt that we talk about is the mortgage for 1-4 family residences, which is $10.7T. Banks/CU have only about $2.6T of it, govt agencies about $6T, private=$1.5T. Consumers owe about $10.2T of it, while the remainder might be on the books of builders temporarily?
  • Revolving credit = $1T (credit card loans). This is called revolving credit as the loan is held only temporarily (it's supposed to be paid in 30 days, and doesn't have a longer specific payment period). Since most consumers pay their credit card debt either wholly or partially every month, only the outstanding balances reported to credit bureaus at the end of each billing cycle are reported here. All of this debt carries high interest (easily $100B/yr in interest). Banks are holders of $0.9T, CU $50B, and financial companies about $25B. So, most of the credit card business is owned by banks, which loan out 90% of the money.
  • Non revolving credit = $2.9T (student loans=$1.5T, auto loan=$1.2T). Of this banks are holders of $0.7T, CU has $0.4T, fed govt has $1.2T (mostly student loans), and finance companies about $0.5T

 UPDATE 2024: As of Q4, 2024, Total debt is about $18T. $12.6T of that is mortgage related, $4.5T is consumer credit (revolving+non-revolving), $1T is others ($0.45T is Home equity loan, while $0.55T is others). So, consumer debt is growing by about 2%/year.

 

  • Mortgage debt = $12.6T (only Consumer mortgage debt).
  • Revolving credit = $1.2T (credit card loan).
  • Non revolving credit = $3.3T (student loans=$1.6T, auto loan=$1.65T).

 

 


 

Government debt:

Treasury department (the branch of government which deals with issuing debt for the govt) issues Treasury securities (debt) that pay you interest and also guarantee your principal. Principal is guaranteed by the govt of USA, as the govt can always print money (or give itself a credit for that amount in its account) and pay the principal back. Since this money is 100% risk free, it carries the lowest interest rates. Banks/CU can raise money thru their own debt offerings, but always at a higher rate than Treasury rates, since there's a risk of losing principal if the bank goes bankrupt. But since deposits in banks/CU are guaranteed by the govt up to $250K for single owners, those can also have deposit rates close to treasury rates. However, what we see today is that deposit rates at 99% of banks/CU are actually lower than treasury rates. In that case, just buy a treasury directly from the government: open an account on treasurydirect.gov, and you can buy as much as you want (with a few exceptions). Even better, the interest on treasuries is exempt from state and local income tax (not federal tax), so you may save some money if you live in a state with high state income tax.

Treasury dept sells Bills, Notes, Bonds, TIPS, FRN, etc. The US govt has a lot of debt ($22T as of Dec 2018), growing by $1T every year (or ~5% every year, same rate as GDP). Of this, $16T is debt held by the public, while $6T is intra-governmental holdings (money sitting in Social Security accounts). $6.3T of the public debt is held by foreign countries (China=$1.2T, Japan=$1T, Brazil=$0.3T, Ireland=$0.3T, UK=$0.3T). It paid $0.5T in interest on all this debt for fiscal year 2018 (implying an effective interest rate of ~2.5%). See this link for details: https://www.treasurydirect.gov/govt/reports/pd/mspd/2018/opds122018.pdf

  • Public debt = $16T: in form of securities issued => Bills=$2T, Notes=$10T, Bonds=$2T, TIPS=$1.5T, FRN(Floating Rate Notes)=$0.4T, GAS (Govt Account series)=$0.3T, US Savings=$0.2T (of this $16T, $6.3T is held by foreign countries, while $2.2T is held by federal Reserve). So, only half of the total public debt is actually held by public ($7.5T of securities is in accounts of US public, i.e insurance companies, mutual funds, business accounts, public company, pension funds etc).
  • Intra governmental debt = $6T. Of this SSA (federal old age and survivors insurance fund, aka social security fund)=$2.8T, OPM(Civil service Retirement and Disability fund)=$0.9T, DOD(Military retirement Fund)=$0.8T, and remaining from 100's of other funds.

More reports can be found here: https://www.treasurydirect.gov/govt/reports/reports.htm

Current Interest rates for treasury can be found here: https://www.treasurydirect.gov/GA-FI/FedInvest/todaySecurityPriceDate.htm

 


 

 

 

 

Linux Pattern matching in Commands:

There are many linux commands available, such as ls, rm, etc. We use file names as args with many of these unix cmds, but sometimes we also use wild card patterns with them to match more than one file. Before we talk about cmds, let's talk about pattern matching, as it forms the basis of the cmds.

Pattern matching:

 


 

glob: Expansion of wild card characters in simple unix cmds was originally done by a separate program called glob, present in /etc/glob; its output was then passed as args to the unix cmd. In later versions of linux, glob() was provided as a library function, usable by any program (including the shell). The most common wildcards in glob are *, ?, [ ] and !. These are called metacharacters, as they are not matched as literal characters; they have special meaning, as described below. Everything else is treated as a literal character.

  • * => matches 0 or more characters. ex: Law* matches Law, Lawyer, but not ByLaw. *Law* will match ByLaw. This happens because glob attempts to match the entire name, not a substring (different from RE). So, Law* only matches names starting with Law.
  • ? => matches exactly 1 character, ex: ?at matches cat, but not at
  • [abc] => matches one char from the bracket. The char can be anything including *, ?, etc, with the exception of - and ], explained below. ex: [CB]at matches Cat and Bat, but not cat. [aT[]r matches ar, Tr, [r.
  • [a-z] => matches one char from the range in the bracket. Ranges are like a-z, A-Z, 0-9. Note - is not treated as a literal character, but as the special range char. To match "-" as a literal, it's supposed to be the first char in the list (i.e [-a-c] will match -, a, b, c). Similarly, matching the opening bracket [ is fine, but the closing bracket ] is matched only when it's the first char (i.e []a-c] will match ], a, b, c). ex: num[ab-g0-7XY] matches num0, numb, numX, but not num00 or numx
  • [!abc] => matches one char that is not in bracket. ex: [!bc]at matches rat, Bat, but not cat or bat
  • [!0-7] => matches one char that is not from range in bracket. ex: num[!a-f] matches numx, but not numa or numxx
  • \ => backslash is used to escape the special meaning of the metacharacters above. For ex, if we want ? to be treated as a literal, instead of having its special meaning, we precede it with \ (i.e \? will treat ? as a literal). In that sense \ is also a metacharacter, for escaping other metacharacters. One thing to note: *, ?, [ ], ! are the only special characters in glob that need to be escaped with "\" to be treated as literals; everything else is already treated as a literal.

globbing on filenames is supported by all unix shells such as bash, csh, etc (both on the cmd line and in scripts). PHP, Perl and Python all have a glob() function. Also, wildcards here are used only for file name matching (not text matching as in RE, explained later), and the meaning of *, ?, [] is different from that in RE.
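ex: a quick globbing sketch (the file names are made up):

# suppose the dir contains: bat.txt cat.txt rat.txt cart.txt
ls ?at.txt       # bat.txt cat.txt rat.txt   (? = exactly 1 char)
ls [cb]at.txt    # bat.txt cat.txt
ls [!b]at.txt    # cat.txt rat.txt           (! negates inside [ ])
ls c*t.txt       # cat.txt cart.txt          (* = 0 or more chars)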

There are many variations of the glob cmd. The glob cmd used in tcl has multiple switches starting with -; -- indicates end of options. The glob in csh is slightly different from the one in tcl. In linux, it's a simple glob with no options, but there are symbolic constants (as GLOB_ONLYDIR, etc) which modify the behaviour of the glob() function (similar to the options of the tcl glob cmd). One of the most common options (GLOB_BRACE) enables curly braces {} (csh style) to match complete strings. Which of these options are enabled depends on your particular linux distro.

  • {string1,string2,...} => matches strings mentioned inside curly braces. {} can be nested too. strings themselves can be patterns as {*abc*,myname*,cd}*.c

ex: Linux: glob [a-c]*.so => finds all files starting with a,b,c and ending with .so

ex: Linux: glob {bti,chip}* => finds all files starting with bti or chip in their name. This is supported by default on CentOS.

ex: Tcl: glob -types {d f r w} * => find all types of file/dir which match types list. d=dir, f=plain file, r=rd permission, w=wrt permission.

 


 

Regular Expression: One problem with glob is that it only matches simple patterns; it cannot express multiple repetitions of a preceding string. This worked fine for early unix machines, but later, in the 1980's, people started using complex pattern matching, which came to be called "Regular Expression" or RE or regex. RE can describe the full set of regular languages over any given finite alphabet. This is a concept from compilers, where programs need to be parsed: RE are used to parse programs and extract tokens. RE supports more wildcards than glob, and with them it's able to match almost any kind of complex pattern. Tcl supports both globbing and RE.

A very good link on RE is: http://www.grymoire.com/Unix/Regular.html.

Another good link to play with any regex and see how it behaves is this link: https://regex101.com/

NOTE:

1. even though RE share many of the same wildcards as glob, RE are very different from glob. Shells such as bash, csh use glob, NOT RE. Similarly, unix cmds such as find, ls, etc use simple file pattern matching, i.e glob. The glob cmd is used internally to expand the file name pattern, and the result is returned to the cmd for processing.

2. RE engines may match the longest (greedy) or the shortest (lazy, non-greedy) possible pattern. POSIX standards mandate that the longest match be returned. So, A.*B applied to AABCAB returns the whole 6-letter match (even though AAB already matched in the 1st 3 letters).

3. Forward slash, /, which is used extensively in linux as the dir path separator, is not a special char in any glob or RE. This is very convenient, as a lot of searches are for paths, and luckily we don't need to escape the /.

In the 1980's (before the advent of Linux), there was no standard for RE. People wrote complex pattern matching in their programs, all different for different utilities such as vi, sed, etc. So, a company named "Sun Microsystems" went through every utility and forced each one to use one of two distinct regular expression libraries - regular or extended. So, we have "regular regular expressions" and "extended regular expressions", also known as regular/basic RE and advanced/extended RE, per the IEEE POSIX standard. Both serve as standards that have been adopted by many tools. Perl has its own RE, which is neither basic nor extended; perl RE have become a de-facto standard, since they have a rich and powerful set of atomic expressions.

There are 3 parts to RE:

  1.  Anchors are used to specify the position of the pattern in relation to a line of text.
  2. Character Sets match one or more characters in a single position.
  3. Modifiers specify how many times the previous character set is repeated.

ex: ^ab.* => Here ^ is an anchor, "a" and "b" are character sets, "." is a character set (any char), and * is a modifier on it.

These are the 2 types of RE:

1. Basic RE (BRE): the smaller set. It added . ^ and $ as metacharacters (on top of * [ ] \), but did not take ? from glob. ( ) { } < > are regarded as metacharacters only when preceded by \.

  • . => matches any single char except newline (exactly which character is considered newline is encoding and platform specific, but the Line Feed (LF) char is always considered newline). . inside a square bracket is treated as a literal. ex: [a.c] matches any of a or . or c, but a.c matches abc, adc, etc. To match a newline in linux, just use \n in the pattern, i.e .*\n.* will match 2 consecutive lines (see details on * in the next bullet)
  • * => matches 0 or more of the preceding char. Thus it's different from glob, where * matches 0 or more of any char. ex: Law* will match Law, La (0 or more of w; note w itself is not required to match, as it's the char being quantified by *), Laww, Lawww, but not Liw. It will also match Layer, Lawyer and ByLaw (anything containing La matches, since RE matches substrings too). We very commonly use .* to match anything (. says match any char except newline, and * following it says match 0 or more of this, basically implying match 0 or more of any char). i.e. a.*b will match ab, artsb, acb, but not "a" at the end of a line (a followed by newline).
  • [abc0-2z6-8] => same as glob.
  • Anchors: ^ and $ are used as beginning or end anchors. The use of "^" and "$" as indicators of the beginning or end of a line is a convention other utilities use. The vi editor uses these two characters as commands to go to the beginning or end of a line. The C shell uses "!^" to specify the first argument of the previous line, and "!$" is the last argument on the previous line. 
    • ^ => beginning of line anchor. Matches the starting position of any line. ex: ^Love will match any line starting with Love. ^ is an anchor only if it's the 1st char in a RE; otherwise it behaves as a literal.
    • $ => end of line anchor. Matches the ending position of any line. ex: Love$ will match any line ending with Love. $ is an anchor only if it's the last char in a RE; otherwise it behaves as a literal.
  • [^abc] or [^0-5] => here caret is used as a negation metacharacter when used inside square brackets (instead of ! in glob). Thus ^ has 2 meanings. Functionality is the same as glob. ex: [^ ] => matches anything that's not a space (there's a space after the caret in this example). If "-" is the 1st or last char in [ ], then the hyphen is treated as a literal for matching purposes. ex: [^-0-9] will match anything except hyphen and digit. Similarly, if ] is the 1st char after the opening bracket, then ] is treated as a literal. ex: []0-9] will match ] or a digit.
  • \ => backslash is a special metacharacter that turns any metacharacter above into a literal for matching purposes. This is called "escaping the metacharacter". For ex, if we try to match "done[" (done followed by a square bracket), RE will see [ as a metacharacter and complain of an invalid RE if it doesn't find a closing ]. In order to signal that [ is to be used as a literal, we put a backslash before it. ex: done\[ will now match done[. If we want to match done\, then we will need to escape \ by doing done\\
  • ( ) => defines a marked subexpression. Any string that matches the pattern in this bracket can be recalled later using \1, \2, ..., \9 (where \1 means the 1st matched subexpression and so on). BRE mode requires ( ) be escaped as \( \), or else ( ) will be treated as literals. ex: to match 5 letter palindromes (that read the same from front or back, eg: radar), do: \([a-z]\)\([a-z]\)[a-z]\2\1
  • {m,n} => matches the preceding char at least m times, but not more than n times. ex: a{3,5} matches aaa, aaaa, aaaaa, but not anything else. a{1,} matches 1 or more of "a". BRE mode requires { } be escaped as \{ \}, or else { } will be treated as literals (see the grep sketch after this list).
  • <the> => matches words only if they are on a word boundary (ideally a word boundary means the word has whitespace at both its beginning and end; however, there are some exceptions, as explained here). The character before the "t" must be either a newline character, or any character other than a number, letter, or underscore. The character after the "e" must also be a character other than a number, letter, or underscore, or it could be the end of line character. This makes it easy to match words without worrying about spaces, punctuation marks, etc. Ex: <[tT]he> will match The, .the, "is the way", but not "they". BRE mode requires < > be escaped as \< \>, or else < > will be treated as literals.

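Here is a minimal sketch of these BRE rules using grep, which defaults to BRE (the sample file and words are made up):

printf 'Law\nLayer\naaaa\nradar\ndone[\n' > words.txt   # hypothetical sample file
grep 'Law*' words.txt                           # * quantifies w => Law, Layer
grep 'a\{3,5\}' words.txt                       # escaped {} in BRE => aaaa
grep '\([a-z]\)\([a-z]\)[a-z]\2\1' words.txt    # escaped () + backrefs => radar
grep 'done\[' words.txt                         # \ escapes [ => done[
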
NOTE: the reason ( ) { } < > were treated as literals is that they weren't assigned special meaning in the early days. They were added later as metacharacters in RE. So, to not break existing programs, the only way was to use \ with ( ) { } < > when they are used as metacharacters.

2. Extended RE (ERE): It added the ?, + and | metacharacters, and removed the need for escaping ( ) { } (i.e. it started treating ( ) { } as genuine metacharacters; now you have to escape them to use them as literals. So, totally opposite of how it was in BRE, which is confusing). This could be done because ERE was newly defined, so there was no backward compatibility to preserve (unlike in BRE, where backward compatibility was important). < > was removed from ERE. ERE wasn't really needed, as whatever could be matched using ERE could be done using BRE, with one exception: the "|" operator in ERE has no equivalent matching operator in BRE.

  • ? => matches 0 or 1 of the preceding char. Thus it's different from glob. However, it's the same as \{0,1\} of BRE. ex: a.?b will match ab, acb, but not adcb (as .? will match 0 or 1 of any char except newline)
  • + => matches 1 or more of the preceding char. It's the same as \{1,\} of BRE. ex: a.+b will match acb, acdb, but not ab (as .+ will match 1 or more of any char except newline)
  • | => choice operator, matches the expression before or after the operator. ex: (cat|dog) will match cat or dog. This choice or alternation operator is the most useful addition to ERE, as without it, it's difficult to match different choices of words. | is usually put inside ( ). We can also have *, ?, + etc. following ( ) to look for repetitions of what's matched. Eg: (Tom|Rob)+ will match TomRob or TomTom. The lack of <> matching can be made up by using |. Ex: <the> is equiv to (^|[^_a-zA-Z0-9])the([^_a-zA-Z0-9]|$) => basically this says "the" should not have an alphanumeric char or underscore at its start or end; "the" could also be at the start of line or end of line. So, there was really no need for ERE, as we could have added the "|" operator to BRE; to not break backward compatibility, we could have escaped it as \| in BRE, and then BRE would have worked just the same as ERE. Unfortunately, that's not what happened, though emacs used this technique to get away from ERE altogether. (These ERE operators are demonstrated in the grep -E sketch after this list.)

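A matching sketch for ERE using grep -E (sample strings are made up):

echo 'color colour' | grep -oE 'colou?r'      # ? => color, colour
echo 'TomTomRob'    | grep -oE '(Tom|Rob)+'   # unescaped () and | plus + => TomTomRob
echo 'aab ab'       | grep -oE 'a+b'          # + => aab, ab
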
NOTE: use of *, ?, + in RE/ERE changes the meaning of the char preceding it, as that char is not used in its normal form for matching, but instead is used as a qualifier for *, ?, +. It behaves as if the previous character is glued to these *, ?, +. Ex: a.b would not match ab, as . implies a single char has to be in between a and b, but when we do a.*b, then it matches ab, as . loses its value of matching a single char. Instead . is glued to *, which combined together as .* means match 0 or more of any char. Similarly .? means match 0 or 1 of any char, and .+ matches 1 or more of any char.

Using *? is tricky => it's a lazy match, trying to match as little as possible while still satisfying the match criteria (by default, any match tries to be a greedy match, as per the POSIX std). So .*? will do a lazy match of .*, i.e. the least possible match of 0 or more of any char. Ex: a.*b will match the complete abdbcb (greedy match), but a.*?b will match the first 2 letters (ab) only.

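Lazy quantifiers are a Perl-style feature, not part of POSIX BRE/ERE; GNU grep exposes Perl-style RE via -P when built with PCRE support. A minimal sketch:

echo 'abdbcb' | grep -oP 'a.*b'     # greedy => abdbcb (longest match)
echo 'abdbcb' | grep -oP 'a.*?b'    # lazy   => ab (shortest match)
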
IMPORTANT: Forward slash / is NOT a regex metacharacter. If you look at the BRE and ERE metacharacters above, none of them is / (the only slash-related regex char is backslash \). So, when matching patterns containing a linux path (i.e. /home/Joe), you don't have to escape anything; match it directly by pasting it. So easy !!

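For example, a path can be pasted into the pattern as-is (the path is made up):

echo '/home/Joe/bin/tool' | grep -o '/home/Joe/[a-z]*'    # => /home/Joe/bin, no escaping of /
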
Character classes: There are also character classes, which provide a shorthand notation for matching digits, letters, spaces, etc. Just as \1 refers to the 1st matched subexpression, RE also has backslash shorthands \d, \w, \s, etc., heavily used in Perl. However, their definition and usage is not consistent across all tools. The POSIX std defines [: ... :] for such char classes, but the Perl-style \d, \s, \w are also widely supported across many cmds and tools. These are the differences b/w POSIX [: ... :] and Perl \d etc.:

  1. POSIX char classes can only be used within [ ], so we need to use [[:alpha:]0-9] to match alphabetic + numeric chars. [:xxxxx:] is a substitute for the character set only, i.e. [:digit:] is a substitute for 0-9, so [[:digit:]] is the replacement for [0-9].
  2. Perl style \d, \w stand for the whole bracketed set, i.e. \w is equiv to [_a-zA-Z0-9]; it matches any alphanumeric char or underscore.
  • [:alnum:] => matches any alphanumeric char. [:alnum:] is equiv to a-zA-Z0-9. [:alpha:] matches only letters (a-z, A-Z), not digits.
  • [:word:] or \w => alphanumeric + underscore. [:word:] is equiv to _a-zA-Z0-9. \w is equiv to [_a-zA-Z0-9]. \W is the negation of \w, i.e. \W means "not matching \w", equiv to [^_a-zA-Z0-9]
  • [:digit:] or \d => digits. [:digit:] is equiv to 0-9. \d is equiv to [0-9]. \D is the negation of \d
  • [:space:] or \s => whitespace chars. [:space:] is equiv to [ \t\r\n\v\f], and \s is also equiv to [ \t\r\n\v\f]. \S is the negation of \s
  • [:blank:] and \b => these two are often paired but are NOT the same: [:blank:] matches a space or tab char, while \b is a zero-width word boundary assertion. Word boundaries are very common when searching for separate words. ex: \b[a-zA-Z]+\b will match every word containing letters only. \b is equiv to (^\w|\w$|\W\w|\w\W). \B is the negation of \b (i.e. non word boundary). (See the sketch after this list.)

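A small sketch contrasting the two styles (grep -E for the POSIX classes, grep -P for the Perl shorthands; -P availability depends on how grep was built):

echo 'user_42 ok' | grep -oE '[[:digit:]]+'    # POSIX class inside [] => 42
echo 'user_42 ok' | grep -oE '[[:alnum:]_]+'   # alnum + underscore => user_42, ok
echo 'user_42 ok' | grep -oP '\d+'             # Perl shorthand => 42
echo 'user_42 ok' | grep -oP '\w+'             # => user_42, ok
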
Regex website:

The site below allows you to verify your regex. It flags any error in your regex, and lets you type a test string to match against. It's very helpful for checking the correctness of your regex.

https://regex101.com/

ex: In the regex field, type me.*\n.*, and as the test string, type these 3 lines:

1st line: me coming

2nd line: I go

3rd line: he is

Now on the right side, it shows any errors in the regex, and then shows the matching part. In this ex, the 1st 2 lines match completely (me.* matches to the end of the 1st line, \n matches the newline, and the trailing .* matches the whole 2nd line).

UNIX cmds:

Different Linux cmds/apps use different pattern matching. glob/BRE/ERE/char_class are supported by default or by adding options to cmds. Most Linux utilities use BRE by default (a short side-by-side sketch follows the list below).

  • vi, the earliest editor, uses BRE as expected. Other common linux utilities also use BRE.
  • grep uses BRE by default. egrep (or grep -E) uses ERE. You can use BRE/ERE for patterns, while filenames must still be in glob style.
  • sed uses BRE by default. "sed -r" (or "sed -E" in newer versions) uses ERE
  • awk uses ERE.
  • less supports ERE. However, depending on the version of less installed (type less --version to check your version), it may support GNU regex or something else. We type forward slash "/" once we are in the less screen to start a search, and use backslash "\" as the escape char. So /.* will match every line (since \n is not matched by .), \d+ will match digits, etc. To match "ma bc", we can just type "ma bc" or "ma\sbc"; both match.
  • ls supports glob. ex: ls {mint*a,chip}* => this lists all file names starting with mint and having "a" somewhere after that, or starting with chip. ls doesn't use RE, as there's no pattern argument in the ls cmd; the shell expands the glob before ls even runs.
  • emacs uses its own version of RE. See in emacs section.
  • find cmd has 2 args: one is the path and the other is the filename. The filename is always glob, and the path is also glob. See the "Linux cmds" section for more details
  • Perl uses its own version of RE. See in perl section.

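A small side-by-side sketch of the same interval match in a few of these tools (assumes a GNU userland):

echo 'aab' | grep 'a\{2\}b'                   # grep: BRE, braces escaped
echo 'aab' | grep -E 'a{2}b'                  # grep -E / egrep: ERE
echo 'aab' | sed -n -E 's/a{2}b/match/p'      # sed -E (or -r): ERE => prints match
echo 'aab' | awk '/a{2}b/ {print "match"}'    # awk: ERE (intervals supported in gawk)
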
unix executables = binaries and scripts:

There are two kinds of executable Unix programs: binaries and scripts. An "executable" file is recognized by the "x" permission set on the file; the "execute" permission tells the kernel that it's an executable file. Whenever you type the name of an executable file on the command line (i.e. emacs or any other application), the shell invokes the exec system call. The details of the whole process are explained on this link: https://stackoverflow.com/questions/8352535/how-does-kernel-get-an-executable-binary-file-running-under-linux. In short, the kernel checks the first few bytes of the file (called the magic number) to check whether it's a binary file or a script. Most executable files are binary files, so the kernel loads them in memory and the processor runs them directly. For ex: vi, soffice, etc. Shell programs like bash, csh etc. are also binary executables that are run the same way. Many of these binary executable programs take optional arguments that provide the names of files they work on. For ex: vi test.c. Here vi is the binary executable, which takes test.c as an argument. So, the "vi" binary executable program works on test.c, which is a plain text file. This test.c file doesn't need to be executable, as the vi program processes it; the processor never runs test.c, as it has no binary machine language code. Shell scripts invoked via an interpreter work the same way: in "csh test.csh", csh is the binary executable that works on test.csh, and csh needs only read permission on test.csh to run it. Execute permission is required only when you run the script directly (i.e. ./test.csh), because then the kernel itself must treat the file as a program.

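A short sketch of the two invocation styles (paths are made up; note the permission difference):

printf '#!/bin/bash\necho hello\n' > /tmp/demo.sh
bash /tmp/demo.sh     # interpreter named explicitly: only read permission is needed
chmod 755 /tmp/demo.sh
/tmp/demo.sh          # direct execution: kernel reads the #! line; x permission required
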
Now, whenever you provide a file to run, and if it's executable, the kernel runs it in the steps shown on the link above. To find out whether it's a binary or a script executable file or some other text file, unix uses the magic number concept. Any file can have a magic number as its first few bytes. This tells what program to use to run the file, when the name of a program to run it is not provided. The magic number is a binary bit pattern, but it may happen to correspond to printable ASCII characters. This allows magic numbers to be used in text files. For example, the magic number for PostScript files is 0x25 0x21, which is %!, and the magic number for executable script files is #!. Binary programs run on the hardware directly, while scripts need a program or interpreter to run them. When we generate a binary executable a.out for a C pgm, we get a binary that has the first few bytes as the magic number, then the next few as some other header info, and after that comes the real machine language instructions (such as MOV, LD, etc. for an x86 processor). That is why binaries generated for each OS differ from each other, and a binary for Linux will not run on Windows, even though the underlying hardware processor is the same and the generated machine code is also the same. The format of a binary executable in Unix is called the ELF format.

ELF executables (ELF stands for the Executable and Linkable Format) start with a 0x7F byte and then the ASCII letters “ELF”. (That is why when we run "hexdump a.out" on Linux, we see the first 4 bytes as "0x7f 0x45 0x4C 0x46"). Scripts start with hex code 0x23 0x21, which in ASCII is #!. This is the shebang line: it begins with the ASCII characters “#!” followed by a path to an interpreter, so that Linux knows that it is e.g. a Perl program or a shell script, and if a shell script, which shell should be used to interpret it. The magic number concept is used in Unix to type or identify more than just executable programs. For example, the two byte magic number 0x1f 0x8b identifies a particular species of compressed file (GNU gzip files).

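These magic numbers can be inspected directly (output in the comments is illustrative):

hexdump -C -n 4 /bin/ls        # 00000000  7f 45 4c 46   |.ELF|
head -c 2 /tmp/demo.sh; echo   # #!  (the script from the earlier sketch)
file /bin/ls /tmp/demo.sh      # the file cmd classifies files by magic number
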
Once the kernel sees "x" set on the file, it will check to see if the current user has the right permission to execute it. If so, it checks the first few bytes. If the first few bytes do not match any magic number, it will run the file using the current shell as the interpreter. If the first few bytes do match a magic number, it executes the program accordingly, calling the matching handler in the exec process. For binaries, it executes the file directly, while for scripts, it executes the file with the interpreter named after the magic number (#!) in that script file. Note that the file needs to have read permission set too, since the interpreter will need to read the file when it tries to run it.

If you try to run a script with no execute permission, the kernel generates an error and doesn't allow you to run it. The extension in the file name has no meaning in Linux; it's for user readability only. So, test.tcl doesn't mean anything to the Linux kernel; it just sees it as a long file name. The magic number in test.tcl is what tells it that it's a tcl file. If you open a "test.xls" file without providing the program name as in "soffice test.xls", then the magic number in test.xls is used to figure out which program to use. NOTE that test.xls is not an executable file (it's a plain read/write file), but it can still have a magic number. That magic number will be ignored by the preprocessor in the soffice program, but may be used by the kernel. That is why magic numbers in Unix script files have the comment character of that interpreter as the first byte (i.e. for csh scripts, # is the first byte of the magic number, so it's seen as a comment by the csh interpreter, and the behaviour of test.csh remains the same irrespective of whether it's invoked with an interpreter name or without one).

So, in summary, there are 3 ways to run executables:

1. a.out => the kernel sees it as a binary executable, and knows how to run it. No extra program is needed to run it.

2. csh test => csh is the binary executable, and "test" is the name of the csh file provided as an argument. The magic number in test, even if provided, is not used for anything; csh simply reads and interprets the file, so only read permission is needed on "test" (x is not required here).

3. test (or ./test) => the kernel checks the magic number in the file test. If no magic number is found, and the file is executable, it runs it using the current shell as the interpreter. If a magic number is found, and the file is executable, it uses the path of the interpreter provided in the file to run it. If the file is not executable, then the linux desktop manager/environment decides which program to use to open this file.

-------------------