Awk tips and tricks and Bioinformatics applications
Awk programs
An Awk program is a sequence of “search pattern” and “action” pairs: each action runs on every input line that matches its pattern. The following example sums up all cell counts in this table:
$ cat << EOF > testfile
"promoter" "head" "body" "tail"
"FBgn0001168" 0 0 2 0
"FBgn0032600" 0 0 2 0
"FBgn0039536" 0 0 2 0
"FBgn0052816" 0 0 2 0
"FBgn0085819" 0 0 1 0
"FBgn0263993" 0 1 0 0
EOF
$ awk '{print $1}' testfile # print only the first column
$ awk '
BEGIN {
    a = 0
}
NR > 1 { # skip the first row (column header)
    # data rows carry the gene ID in $1, so the four counts are $2..$5
    a = a + $2 + $3 + $4 + $5
}
END {
    print a
}
' testfile
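The fixed list of columns can be replaced by a loop over NF, the built-in variable holding the number of fields in the current line. A minimal sketch, assuming the same layout as the table above (a row name followed by numeric columns); the file name counts.txt and the shell variable total are made up for this illustration:

```shell
# Recreate the example table under a hypothetical file name.
cat << 'EOF' > counts.txt
"promoter" "head" "body" "tail"
"FBgn0001168" 0 0 2 0
"FBgn0032600" 0 0 2 0
"FBgn0039536" 0 0 2 0
"FBgn0052816" 0 0 2 0
"FBgn0085819" 0 0 1 0
"FBgn0263993" 0 1 0 0
EOF

# Skip the header (NR > 1) and the row name (field 1), then add up every
# remaining field. "sum + 0" prints 0 instead of an empty line for empty input.
total=$(awk 'NR > 1 { for (i = 2; i <= NF; i++) sum += $i } END { print sum + 0 }' counts.txt)
echo "$total"   # prints 10
```

The loop keeps working if more count columns are added to the table later.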
Parameters
Field separators (-F)
The -F parameter takes a regexp. To define both ‘=’ and ‘;’ as field separators, do this:
awk -F'[=;]' '{print $2}' infile.txt
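To see which field ends up where, it helps to run the separator on a single line first. A small sketch using an invented GFF-style attribute string (the gene ID and name are sample data, not from a real file):

```shell
# One attribute string (invented sample data).
line='ID=gene0001;Name=dpp;biotype=protein_coding'

# Splitting on "=" or ";" yields: $1=ID  $2=gene0001  $3=Name  $4=dpp  ...
id=$(printf '%s\n' "$line" | awk -F'[=;]' '{ print $2 }')
name=$(printf '%s\n' "$line" | awk -F'[=;]' '{ print $4 }')
echo "$id $name"   # prints: gene0001 dpp
```

Keys and values alternate, so the values sit in the even-numbered fields.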
Examples
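Filter a tsv file by column 2 == "foo"
## $0 means the whole line; -F'\t' splits on tabs only, so fields may contain spaces
awk -F'\t' '{ if ($2=="foo") {print $0} }' inputfile.tsv
## or, for csv (simple files without quoted commas):
awk -F, '{ if ($2=="foo") {print $0} }' inputfile.csv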
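Two common variations are keeping the header line and taking the search value from a shell variable. A minimal sketch, assuming a tab-separated file with one header row; the file genes.tsv, its contents and the variable names are invented for this example:

```shell
# Small tab-separated table with a header row (invented sample data).
printf 'gene\ttissue\tcount\n'  > genes.tsv
printf 'g1\tfoo\t5\n'          >> genes.tsv
printf 'g2\tbar baz\t7\n'      >> genes.tsv
printf 'g3\tfoo\t1\n'          >> genes.tsv

# -v passes a shell value into awk; NR == 1 keeps the header line.
# A pattern without an action prints the matching line, i.e. $0.
want=foo
hits=$(awk -F'\t' -v val="$want" 'NR == 1 || $2 == val' genes.tsv)
printf '%s\n' "$hits"
```

This prints the header followed by the g1 and g3 rows. Passing the value with -v avoids pasting shell variables into the awk program text.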
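How many reads are mapped [awk, wc]
The third bit of the FLAG column (0x4, “unaligned read”) has to be unset. Header lines (starting with @) are skipped, otherwise they would be counted too. Note that and() is a gawk extension, and that the result is the number of alignment lines:
awk '!/^@/ && !and($2, 0x004)' infile.sam | wc -l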
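On an awk without and() (it is not part of POSIX awk), the same bit can be tested with plain arithmetic: int(flag / 4) % 2 is 1 exactly when the bit with value 4 is set. A sketch that counts mapped and unmapped lines in one pass; the file tiny.sam and its records are invented, and only the FLAG column matters here:

```shell
# Tiny SAM file (invented records).
{
  printf '@HD\tVN:1.6\tSO:unsorted\n'
  printf '@SQ\tSN:chr1\tLN:1000\n'
  printf 'r1\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r2\t16\tchr1\t20\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
  printf 'r1\t256\tchr1\t50\t0\t4M\t*\t0\t0\tACGT\tIIII\n'
} > tiny.sam

counts=$(awk -F'\t' '
    /^@/ { next }                                   # skip header lines
    (int($2 / 4) % 2 == 1) { unmapped++; next }     # bit 0x4 set: unmapped
    { mapped++ }                                    # everything else is mapped
    END { print mapped + 0, unmapped + 0 }
' tiny.sam)
echo "$counts"   # prints: 3 1
```

The count is per alignment line: r1 appears twice (primary and secondary), so three mapped lines correspond to two mapped reads.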
Get only aligned and only primary lines
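The flags 256 (0x100, secondary alignment) and 4 (0x4, unmapped read) both have to be off, hence the combined mask 0x104 (use 0x904 to also drop supplementary alignments, flag 2048):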
grep -v ^@ m850-short.sam | awk '!and($2, 0x104)'
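The mask test can also be written without gawk's and(). The sketch below defines a small helper, has_any (a made-up name, not a built-in), that checks whether any bit of a mask is set in a flag, and applies it with two masks: 260 (0x104) and 2308 (0x904, which additionally covers supplementary alignments). The file flags.sam and its records are invented for illustration:

```shell
# Invented records named a..e with FLAG 0, 16, 4, 256 and 2048.
{
  printf '@HD\tVN:1.6\n'
  printf 'a\t0\tchr1\t10\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'b\t16\tchr1\t20\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'c\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n'
  printf 'd\t256\tchr1\t30\t0\t4M\t*\t0\t0\tACGT\tIIII\n'
  printf 'e\t2048\tchr1\t40\t60\t4M\t*\t0\t0\tACGT\tIIII\n'
} > flags.sam

prog='
# Return 1 if flag and mask share at least one set bit (hypothetical helper).
function has_any(flag, mask,    bit) {
    for (bit = 1; bit <= mask; bit *= 2)
        if (int(mask / bit) % 2 && int(flag / bit) % 2)
            return 1
    return 0
}
# Collect the read names of all non-header lines that pass the filter.
!/^@/ && !has_any($2, mask) { out = out (out == "" ? "" : " ") $1 }
END { print out }
'

keep_260=$(awk -F'\t' -v mask=260 "$prog" flags.sam)
keep_2308=$(awk -F'\t' -v mask=2308 "$prog" flags.sam)
echo "$keep_260"    # prints: a b e
echo "$keep_2308"   # prints: a b
```

With mask 260 the supplementary record e survives; with 2308 only the mapped primary lines a and b remain.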