Awk - A nifty little tool for text manipulation and more.

Posted: 06 Jan 2014 04:14
by berserker
awk has been around since the early Unix days of the 1970s and is a standard tool on most Unix-like operating systems. It is primarily used for text and string manipulation. GNU awk (gawk) is one of the most widely used versions nowadays, and it has also been ported to Windows, so it is convenient to use as part of your batch scripting tool set.

In this thread, I shall show some examples of how one can use this tool for easy file/text manipulation in a Windows batch environment (provided you are able to download third-party tools that are not installed by default). This is mostly for beginners who are new to awk or who are looking for tools to parse strings/text.

The syntax for awk is simple

Code: Select all

pattern { action }


For example, to print a file

Code: Select all

C:\> awk "{print}" myFile.txt

In the above example, the "action" is "print". This is the equivalent of the command

Code: Select all

 type myFile.txt


cmd.exe on Windows does not treat single quotes as quoting characters, so we have to use double quotes around the awk program.

awk has two special patterns, "BEGIN" and "END". The "BEGIN" block executes only once, before the first record is read by awk. For example, you can initialize variables inside this block

Code: Select all

awk "BEGIN{a=10} ....." myFile.txt


or just do some calculation (simple calculator)

Code: Select all

C:\>awk "BEGIN { print 1+2 } "
3

Likewise, the "END" block is executed only once, after all the records in the file have been read. For example, you want to print the last line of the file (gawk keeps the last record in $0 inside END; some older awk versions may not)

Code: Select all

C:\> more myFile.txt
C:\original\1\2\3
C:\original\1\2\4
C:\original\1\2\5
C:\original\1\2\36
test

C:\>awk "END{ print $0} " myFile.txt
test



"$0" means the current line/record.

to be continued...

-berserker

Parsing Structured Text

Posted: 06 Jan 2014 04:33
by berserker
One common task almost everyone does is getting information from files. If you have a field delimited text file to parse, then awk might be just the tool you need.

For example, if you have this simple delimited file where the delimiter is "|"

Code: Select all

C:\>type myFile.txt
1|2|3|4|5
6|7|8|9|10
a|b|c|d|e



and you wish to get the 3rd column. In awk, the 3rd column is denoted by $3. Likewise, 2nd column as $2 and so on. So to get the 3rd column, issue this command

Code: Select all


C:\>awk -F"|" "{print $3}" myFile.txt
3
8
c


The -F option sets the field delimiter. Here, "|" is specified as the field delimiter, so awk breaks each record into fields (tokens), with each field denoted by "$" and a number, e.g. $1 means the first field, $9 means the 9th field and so on.
The above is equivalent to the DOS for /f command with the tokens and delims options

Code: Select all

for /f "tokens=3 delims=|" ..........

To print the last field, use $NF. To print the second-to-last field, use $(NF-1)
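
As a quick sketch of $NF and $(NF-1), the snippet below runs the pipe-delimited sample through awk. It assumes a Unix-style shell (single quotes and printf, for example under Cygwin), not cmd.exe, and the shell variable name result is made up for illustration.

```shell
# Sample data from the earlier example, fed through a pipe instead of a file.
result=$(printf '1|2|3|4|5\n6|7|8|9|10\na|b|c|d|e\n' |
  awk -F'|' '{ print $(NF-1), $NF }')   # second-to-last field, then last field
echo "$result"
```

In cmd.exe the same idea would be written along the lines of awk -F"|" "{ print $(NF-1), $NF }" myFile.txt, with double quotes around the program as in the examples above.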


One of the features of awk and -F is the ability to take a regular expression, or multiple characters, as the field delimiter. For example, we have a file with the delimiter ",%#"

Code: Select all

C:\>type myFile.txt
1,%#2,%#3,%#4,%#5
6,%#7,%#8,%#9,%#10
a,%#b,%#c,%#d,%#e


Issue the same command as before, but pass ",%#" to -F and print the 2nd column

Code: Select all

C:\>awk -F",%#" "{print $2}" myFile.txt
2
7
b


to be continued

- berserker

Length of a string or record

Posted: 06 Jan 2014 04:54
by berserker
Often we need to get the length of a string or a line of record in the file. awk provides the length() function to do this. For example

Code: Select all

C:\>echo "test"| awk "{print length}"
6


Why is it 6 and not 4? This is because the "echo" command in DOS does not strip the double quotes; it sends them to awk as part of the string. Hence you get 6. To calculate the string length of some variable, you can just pipe it to awk without the quotes

Code: Select all

C:\>echo test| awk "{print length}"
4


Use the usual for loop (for /f ...) to capture the result

How about going through a file and displaying the lines that are of a certain length?
eg we have this file

Code: Select all

C:\>type myFile.txt
abcd
abcd
abcdefghi
abcdefghi
abcdefghijklmn


and we want to get those lines whose length is 4.

Code: Select all

C:\>awk "length==4" myFile.txt
abcd
abcd


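Writing length==4 this way, outside the braces, makes it the "pattern" part of the awk syntax; when no action is given, awk prints every record that matches. So it is not written like this (inside braces it becomes an action, the result of the comparison is thrown away, and nothing is printed):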

Code: Select all

c:\>awk "{length==4}" myFile.txt


The "pattern" part of the awk syntax is usually a regular expression or some conditions.
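
To illustrate a regular expression used as the pattern, here is a minimal sketch that combines a regex with a length condition and leaves out the action, so matching lines are simply printed. It assumes a Unix-style shell for the quoting; the sample lines and the variable name result are invented for the example.

```shell
# Pattern only, no action: print lines that start with "abcd" AND are longer than 4 characters.
result=$(printf 'abcd\nabcdefghi\nxyz\n' | awk '/^abcd/ && length($0) > 4')
echo "$result"
```

Only abcdefghi passes both conditions. In cmd.exe the program would go inside double quotes instead of single quotes.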

Another example, search for length greater than 4 and less than 10 will yield

Code: Select all


C:\>awk "length>4 && length <10" myFile.txt
abcdefghi
abcdefghi


If you want to write out the "action" part of the awk syntax, then the above is the same as

Code: Select all

C:\>awk "length>4 && length <10 {print} " myFile.txt
abcdefghi
abcdefghi


to be continued ...

-berserker

Re: Awk - A nifty little tool for text manipulation and more

Posted: 06 Jan 2014 04:55
by berserker
foxidrive wrote:That's useful. Keep the primer going. :)

BTW, the code output above after

Code: Select all

Issuing the same commands as before but pass ',%#' to -F


is the same as the earlier output, which doesn't make as much sense to a newbie reading the primer. You kind of expect something different.

Maybe a different text to show a different output would be useful. Even just print $2


ok thanks, will try to change it :)

Operators

Posted: 06 Jan 2014 05:22
by berserker
awk provides the usual maths operators to help you perform calculations in your script. Here I only list some that are commonly used.

Exponents

x ^ y
x ** y (not part of the POSIX standard; x ^ y is the portable form)

Addition, subtraction, division, multiplication -> +, - , / , *

Modulus : %

x++ , ++x : post and pre increment operators
x-- , --x : post and pre decrement operators

x += 1 : Adds 1 to the value of x
x -= 1 : Subtracts 1 from the value of x


Boolean operators:
! : not operator
&& : Logical AND
|| : Logical OR

Relational Operators
<, <=, ==, !=, >, >=

Regular expression matching operators
~ : matching
!~ : non-matching

Ternary operator (conditional expression )
?:
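
A small sketch that exercises a few of the operators listed above (modulus, ternary, regex match and exponent). It assumes a Unix-style shell for the single quotes; the values and the variable name result are only for illustration.

```shell
result=$(awk 'BEGIN {
    n = 7
    print n % 2                                               # modulus: remainder of 7 / 2
    print (n % 2 == 0 ? "even" : "odd")                       # ternary operator
    print ("dostips.com" ~ /\.com$/ ? "match" : "no match")   # regex match operator
    print 2 ^ 10                                              # exponent
}')
echo "$result"
```

This prints 1, odd, match and 1024 on separate lines.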


For square roots, there is the sqrt() function. eg sqrt(100)

For trigonometry, there are the sin(), cos() and atan2() functions (there is no built-in tan(); use sin(x)/cos(x) instead).

To generate random numbers, there is the rand() function eg

Code: Select all

C:\>awk "BEGIN{ print rand() }"
0.237788


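To generate different random numbers every time you run the awk command, seed the generator first with the srand() function. Called without an argument, srand() uses the current time as the seed, so two runs within the same second can still print the same number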

Code: Select all

C:\>gawk "BEGIN{ srand(); print rand() }"
0.14306

C:\>gawk "BEGIN{ srand(); print rand() }"
0.807121

C:\>gawk "BEGIN{ srand(); print rand() }"
0.663245




To concatenate strings , just write them next to each other, like this

Code: Select all

C:\>awk "BEGIN{ print \"2\" \"3\" }"
23


When writing the awk command on the command line, we have to take care of the double quotes that are used inside the awk program by escaping them as \". In a Unix shell, the same command can be written like this:

Code: Select all

awk 'BEGIN{ print "2" "3"}'



For more information on operators, please consult the manual.

to be continued ...

-berserker

Simple string manipulation

Posted: 06 Jan 2014 05:47
by berserker
Here I cover simple string manipulation in awk using its in-built string functions
1) Getting part of a string - substring-ing
2) Getting index of a string
3) Splitting a string
4) Uppercase and lowercase

1) Getting part of a string

awk provides the substr() function to get part of a string, for example

Code: Select all

C:\>echo chimpanzee| awk "{print substr($0,2,5) }"
himpa


$0 is the current record/line; in this case, it is the standard input passed to awk through the pipe. substr($0,2,5) just says to get 5 characters starting at position 2 of the current record (awk counts positions from 1). It is the same as the DOS built-in substring syntax

Code: Select all

%variable:~1,5%

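where %variable% is "chimpanzee" (the batch offset is counted from 0, hence ~1). Note that the "echo" command in DOS is particular about spaces (ref: foxidrive): any space before the pipe is echoed as part of the string, so in the example above there are no spaces after "echo chimpanzee"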


2) Getting index of a string
This is equivalent to saying "get the position of the first occurrence of a string inside another string". E.g. to find the first occurrence of the letter "h" in "elephant"

Code: Select all

C:\>echo elephant| awk "{print index($0,\"h\") }"
5


(take note of the escaping of double quotes when writing on the command line)
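If the letter is not found, index() will return 0. Note that this 0 is printed output, not awk's exit code, so ERRORLEVEL stays 0 either way; capture the output with a for /f loop, or have awk set its exit status with the exit statement if you want to test ERRORLEVEL. This is useful if you want to see if a string is found inside another string.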

Code: Select all

C:\>echo elephant| awk "{print index($0,\"z\") }"
0
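
Since index() only prints its result, here is a hedged sketch of turning it into an exit status with awk's exit statement, so the calling shell can branch on it. It is written for a Unix-style shell (if/then and single quotes are assumptions about the environment); in a batch file the equivalent check would be on ERRORLEVEL after the awk command. The variable name status is made up.

```shell
# awk exits with 0 when "z" is found in the input, and with 1 when it is not.
if echo elephant | awk '{ exit (index($0, "z") == 0) }'; then
    status="found"
else
    status="not found"
fi
echo "$status"
```

There is no "z" in "elephant", so index() returns 0, the comparison is true (1), and awk exits with status 1.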



3) Splitting a string
awk provides the split() function to split a string on a separator. For example, let's split the word "euphoria" on the letter "p"

Code: Select all

C:\>echo euphoria| awk "{ n=split($0,array,\"p\") } END{ print array[1], n} "
eu 2


Again, $0 means the current record (which is "euphoria", passed in from standard input). The split() function takes the string to split as its first argument, an array to fill as its second argument, and the separator as its last argument. This separator can be a regular expression.

The results of the split are stored in "array". In the above example, we print out the first item of the array at the END block. split() returns the number of items in the array. So in the above, "n" has a value of 2, meaning there are 2 items in the array.
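
To see every piece rather than just the first one, the return value of split() can drive a for loop. The following is a sketch only, assuming a Unix-style shell; the date string and the names parts and result are invented for the example.

```shell
result=$(echo 2014-01-06 | awk '{
    n = split($0, parts, "-")     # parts[1]="2014", parts[2]="01", parts[3]="06"
    for (i = 1; i <= n; i++)      # split() numbers the pieces 1..n, so this order is predictable
        print i, parts[i]
}')
echo "$result"
```

Looping from 1 to n is used here instead of for (i in parts) because the numbered pieces then come out in their original order.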


4) Uppercase and lowercase
Often you might want to change the case of words/strings as part of your task. Awk provides the built-in functions tolower() and toupper(). E.g.
this one-liner changes all the characters in the file to uppercase

Code: Select all

C:\>type myFile.txt
dostips.com

C:\>awk "{ print toupper($0) ;}" myFile.txt
DOSTIPS.COM



If you want to change only one string,

Code: Select all

C:\>echo test|awk "{print toupper($0) }"
TEST

C:\>echo TEST|awk "{print tolower($0) }"
test


As usual, capture the result using a DOS for loop.


to be continued...

-berserker

Re: Awk - A nifty little tool for text manipulation and more

Posted: 06 Jan 2014 06:02
by berserker
foxidrive wrote:It would be useful to also show there how to seed the randomize function, otherwise every time you execute that command it will show the same number. Newbies might scoff. ;)

good point. added.

Printing in awk

Posted: 06 Jan 2014 06:29
by berserker
awk provides three ways to print or format output,
1) print
2) printf and
3) sprintf()

1) print.
The basic statement to display output to the user is the print statement. It should not be too difficult to understand how to use it. Just

Code: Select all

print "your string"

You can also redirect output to a file from inside awk by using the output redirection operator ">"

Code: Select all

C:\>awk "BEGIN{print \"dostips.com\" > \"testfile\" }"

C:\>type testfile
dostips.com


2) printf().
The syntax of the printf statement looks like this:

Code: Select all

printf("format" , item1, item2 ...)

The printf statement is very similar to printf() from the C language, where you can put format specifiers such as %s (string), %d (integer), %f (float). Unlike print, it does not add a newline by itself. For example, to format a number to 2 decimal places

Code: Select all

C:\>awk "BEGIN{ printf(\"%.2f\" , 100) }"
100.00
C:\>awk "BEGIN{ printf(\"%.2f\" , 3.14244) }"
3.14


To right-justify a string in a field 15 characters wide

Code: Select all

C:\>awk "BEGIN{ printf(\"%15s\" , \"mystring\") }"
       mystring


If you want to pad a number with leading 0's, e.g.

Code: Select all

C:\>awk "BEGIN{ printf(\"%05d\" , 100) }"
00100



3) sprintf().
sprintf() takes the same format and arguments as printf(), but instead of printing, it returns the formatted string so that it can be saved to a variable.

Code: Select all

C:\>awk "BEGIN{PI = sprintf(\"%.4f\", 22/7); print PI }"
3.1429

here, the value of 22/7 is saved to "PI" variable with 4 decimal places. This variable can then be used in other parts of the awk script.
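
printf can also take several items at once, which is handy for lining up columns. Below is a minimal sketch, assuming a Unix-style shell; the two-column sample data and the variable name result are invented. Note the explicit \n at the end of the format, since printf does not add a newline on its own.

```shell
# %-8s left-justifies the name in 8 characters; %6.2f right-justifies the number with 2 decimals.
result=$(printf 'apple 1.5\nkiwi 12\n' | awk '{ printf("%-8s|%6.2f\n", $1, $2) }')
echo "$result"
```

On the cmd.exe command line the program would go in double quotes with the inner quotes escaped as \", and inside a batch file the % signs generally have to be doubled.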

For more info and examples on print, printf and sprintf please consult the manual.

to be continued

- berserker

Awk Loops

Posted: 06 Jan 2014 06:59
by berserker
Awk loops work the same as those in the C language. Here I touch on 2 of the most common loops,
1) for loop
2) while loop

The syntax for a "for" loop in awk is this

Code: Select all

for (initialization; condition; increment)
       body


eg to generate a range of numbers from 1 to 9

Code: Select all

C:\>awk "BEGIN{ for(i=1;i<10;i++){ print i }  }"
1
2
3
4
5
6
7
8
9


Use a DOS for loop to catch each number and use as desired. This is the same as

Code: Select all

FOR /L %%G IN (1,1,9) DO echo %%G 




The while loop is another popular looping construct in most programming languages. For example, setting a countdown and printing 10 "*"

Code: Select all

C:\>awk "BEGIN{count=10; while(count>0 ){ print \"*\" ; count--}  }"
*
*
*
*
*
*
*
*
*
*


To put it more clearly (as it would look in a script file, where the quotes need no escaping)

Code: Select all

BEGIN{
    count=10         # set count to 10.
    while(count>0 ) {
        print "*"     # print *
        count--       # decrement the count each time through the loop
    }
}


Of course, the above can be written with the for loop as well

Code: Select all

C:\>awk "BEGIN{for(c=10;c>0;c--) print \"*\"  }"
*
*
*
*
*
*
*
*
*
*



to be continued

-berserker

Awk arrays

Posted: 06 Jan 2014 07:20
by berserker
Most programming languages support data structures such as arrays that can be used to store a collection of similar items, instead of using individual variables. Awk has arrays too, and they are called associative arrays. That means each array is a collection of pairs, each consisting of an index and a value.

Here is a simple example of how to use arrays in Awk.

Code: Select all

C:\>awk "BEGIN{a[1]=\"one\" ; a[2]=10; print a[1]\",\"a[2] }"
one,10



In the above example, we create array "a" with index 1 having a value of "one" (a string) and index 2 having a value of 10 (a number). Both the indexes and the values of an awk array can be of different data types. E.g.

Code: Select all

C:\>awk "BEGIN{a[\"two\"]=2; print a[\"two\"] }"
2

here, the index is "two" (a string) and the value is the number 2.


To iterate over an array, use an awk for loop

Code: Select all

C:\>awk "BEGIN{a[1]=\"one\" ; a[\"two\"]=2; for(item in a) {print item\" \"a[item]  } }"
two 2
1 one


To put it more clearly:

Code: Select all

BEGIN{
    a[1]="one"
    a["two"]=2
    for( item in a )  {
       print item" "a[item]
    }
}

In awk, arrays are not stored in index order, unlike normal arrays in C. So when looping over the array with for (item in a) as above, the order in which the items come out is arbitrary.
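
A typical use of the index/value pairs is counting how often something occurs, with the text itself as the index. This is a sketch under the assumption of a Unix-style shell; the colour list and the names count and result are made up for the example.

```shell
result=$(printf 'red\nblue\nred\ngreen\nred\n' | awk '
    { count[$0]++ }     # index = the line text, value = number of times seen so far
    END { print count["red"], count["blue"], count["green"] }')
echo "$result"
```

The indexes are looked up by name in the END block, so the arbitrary order of for (item in array) does not matter here.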

To get the size of an array, you can use the length() function described earlier (passing an array to length() works in gawk, but not in every older awk)

Code: Select all

C:\>awk "BEGIN{a[1]=\"one\" ; a[\"two\"]=2; a[2]=100; print length(a) }"
3



To see if an item exists in array, we can use the if statement

Code: Select all

C:\>awk "BEGIN{a[1]=\"one\" ; a[\"one\"]=1; a[2]=100; if (2 in a) { print \"ok\"}  }"
ok


More clearly:

Code: Select all

BEGIN{
    a[1] ="one"    # define array indexes and values
    a["one"] = 1
    a[2] = 100
    if (2 in a) {
       print "ok"
    }
}


To remove an item in array, use the delete statement, eg

Code: Select all

C:\>awk "BEGIN{a[1]=\"one\" ; a[\"one\"]=1; a[2]=100; delete a[2]; if (2 in a) { print \"ok\"}else {print \"not ok\"}  }"
not ok


More clearly:

Code: Select all

BEGIN{
    a[1]="one"
    a["one"]=1
    a[2]=100
    delete a[2]    # delete the item with index 2
    if (2 in a) {
        print "ok"
    }else {
        print "not ok"
    }
}


To delete a whole array, just use delete followed by the array name, e.g. delete a


See the manual for more elaborate examples on using arrays

to be continued ...

- berserker

Awk Flow control

Posted: 06 Jan 2014 07:33
by berserker
Making decisions is part of our thought process every day. If you want to tell the computer to make decisions too, then the language must provide if/else constructs for that. :)

Awk provides the usual if/else/else if constructs that most languages have.

Code: Select all

C:\>awk "BEGIN{ b=2; if( b==2 ) print \"it is 2\" }"
it is 2


Basic construct

Code: Select all

if ( condition ) {
   ....
} else if (condition) {
   ...
} else {
  ....
}



The break Statement jumps out of loops, like this:

Code: Select all

for ( conditions ){
   ...
   break   #breaks out of for loop
   ...
}





The continue statement is also used in loops. It skips the rest of the loop body, causing the next cycle around the loop to begin immediately. E.g.

Code: Select all

   for (x = 0; x <= 10; x++) {
        if (x == 2) {
           continue     # this continue skips the print statement below
        }
        print "something"
   }
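
For a version that can actually be run, here is a sketch that puts continue and break into one loop inside a BEGIN block. It assumes a Unix-style shell for the quoting; the loop limits and the variable name result are chosen only for illustration.

```shell
result=$(awk 'BEGIN {
    for (x = 0; x <= 10; x++) {
        if (x == 2) continue     # skip the print for x = 2
        if (x == 5) break        # leave the loop entirely once x reaches 5
        print x
    }
}')
echo "$result"
```

It prints 0, 1, 3 and 4: the 2 is skipped by continue, and break stops the loop before 5 is printed.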



Later versions of gawk support the switch statement, but it's seldom needed as an if/else is good enough for most tasks. If you want to know more about switch statements, check the manual.

Re: Awk - A nifty little tool for text manipulation and more

Posted: 06 Jan 2014 07:36
by berserker
foxidrive wrote:
substr($0,2,5) just says to get the characters starting at position 2, ending at position 5 of the current record.


This starts at position 2 for a length of 5 characters, no?
yes. good spot. Will change.


foxidrive wrote:I noticed a couple of minor points here and there in your explanations:
The following is echoing "chimpanzee " with a trailing space. And when using echo "test"| awk then the quotes are part of the string too and not special characters.

Batch is particular with spaces in various spots.


Code: Select all

echo chimpanzee | awk "{print substr($0,2,5) }"

yes, i almost forgot about that space. though i have mentioned about the double quotes somewhere :). will change

Some common Awk Variables

Posted: 06 Jan 2014 08:24
by berserker
Awk has some built-in variables that you should be familiar with for parsing strings and files.
1) NR
2) NF
3) FS
4) RS
5) ORS
6) OFS


1) NR
NR stands for number of input records awk has processed since the beginning of the program's execution. For example, you want to find the line count of a file

Code: Select all

C:\>type myFile.txt
1,%#2,%#3,%#4,%#5
6,%#7,%#8,%#9,%#10
a,%#b,%#c,%#d,%#e

C:\>type myFile.txt | awk "END{print NR}"
3

This is the same as what the Unix wc -l command gives you.

2) NF
NF stands for the number of fields in the current input record. For example

Code: Select all

C:\>type myFile.txt
1,2,3,4,5
6,7,8,9,0,10

C:\> awk -F"," "{print NF}" myFile.txt
5
6

here, because we have set the -F option (field delimiter) as comma, then the first record will have 5 fields, and the 2nd record will have 6.
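
NR and NF are often used together, for example to label each line with its number and its field count. A minimal sketch, assuming a Unix-style shell and using the same two sample lines; the variable name result is made up.

```shell
result=$(printf '1,2,3,4,5\n6,7,8,9,0,10\n' |
  awk -F',' '{ print NR ": " NF " fields, last is " $NF }')
echo "$result"
```

Because NF holds the number of fields, $NF is the last field of the current record.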


3) FS

This is the input field separator, set the same way as with the -F option passed to awk. Usually it's defined in the BEGIN block before any records are processed

Code: Select all

awk "BEGIN{FS=\",\"} {print}" myFile.txt

FS can be a single character, multiple characters, or a regular expression


4) RS
RS stands for input record separator. By default awk's RS is the newline character; that's why awk processes lines one by one by default. You can set RS to a different value. For example, let's say you want to display the above myFile.txt with each number on a line by itself.

Code: Select all

C:\>more myFile.txt
1,2,3,4,5
6,7,8,9,0,10

C:\>awk "BEGIN{RS=\",\"}{ print $0 } " myFile.txt
1
2
3
4
5
6
7
8
9
0
10

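Here, RS is set to comma ",", so now each record is just a number by itself. (Strictly speaking, the newline is no longer a separator, so "5", the newline and "6" form one record; it simply prints as two lines.)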

5) ORS
ORS stands for output record separator. Its default is newline "\n", and it is appended to the output of every print statement. For example, let's say you want to "wrap" the lines in a file to become a single line, e.g.

Code: Select all

C:\>awk "BEGIN{ORS=\"#\"}{ print $0 } " myFile.txt
1,2,3,4,5#6,7,8,9,0,10#



you can change the ORS to "#", and the output will become one line. Notice the "#": originally each record ended with "\n", now it ends with "#". Hence this gives the effect of joining the lines into a single line.


6) OFS
This is the output field separator. Its default is a space, and it is the output between the fields printed by a print statement. For example, changing the field separator to "#"

Code: Select all

C:\>type myFile.txt
1,2,3,4,5
6,7,8,9,0,10

C:\>awk "BEGIN{OFS=\"#\"; FS=\",\"}{$1=$1;print } " myFile.txt
1#2#3#4#5
6#7#8#9#0#10



In the above example, changing OFS alone does not alter $0; the record needs to be rebuilt to reflect the change. Assigning to any field makes awk rebuild $0 from its fields using OFS, hence the common idiom $1=$1. (You can consult the manual for the full explanation.)
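
The effect of the rebuild can be seen by printing $0 before and after the assignment. This is a sketch only, assuming a Unix-style shell; the single sample line and the variable name result are for illustration.

```shell
result=$(printf '1,2,3,4,5\n' | awk 'BEGIN { FS = ","; OFS = "#" }
{
    print $1, $2     # commas in print insert OFS between the items
    print $0         # $0 has not been rebuilt yet, so it is still comma-separated
    $1 = $1          # assigning to a field rebuilds $0 using OFS
    print $0
}')
echo "$result"
```

The three lines printed are 1#2, then 1,2,3,4,5, then 1#2#3#4#5.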


to be continued ..

-berserker

Re: Awk - A nifty little tool for text manipulation and more

Posted: 06 Jan 2014 08:30
by berserker
foxidrive wrote:

Code: Select all

awk "BEGIN{a[1]=\"one\" ; a[\"two\"]=2; for(item in a) {print item\" \"a[item]  } }"


How is the order defined. This prints two 2 before 1 one


Arrays in awk are not indexed in order like the normal arrays we see in C, so the order is arbitrary. I will add this point in.

foxidrive wrote:I added a point to my previous comment about the errorlevel in gawk, you may have missed it.

yes, i missed it. thanks for spotting. By awk in this case, I am referring to gawk and not the old awk in old unixes. I will insert your comments as necessary.

Re: Awk - A nifty little tool for text manipulation and more

Posted: 06 Jan 2014 08:40
by berserker
foxidrive wrote:This code has unbalanced parentheses - and it lacks a command line version. Can that also be implemented?

Code: Select all

   for (x = 0; x <= 10; x++) {
        if (x == 2) {
           continue     # this continue skips the print statement below
        }
        print "something"


fixed the parenthesis. Actually you can write it all on the command line, using ";" where needed to terminate statements. But it's not usually advisable because it's hard to read. That's why awk can take its program from a file (as a script), much like how a vbscript is run through cscript, e.g. awk -f myawkscript.awk input_file_to_parse

I will come to that later, on how to write an awk script.