Reading and Processing Text: grep, sed, awk and Friends

Kai · 4 min read

This is the group of tools behind the power of the Linux command line. True to the Unix philosophy (Article 0), each one does one thing well — viewing, filtering, cutting, sorting, counting, transforming text. Learn each on its own, then Article 6 combines them with pipes.

Create a sample file in the lab to practice:

cd /tmp
printf "alice 90 math\nbob 75 science\ncarol 88 math\ndave 75 math\nalice 95 science\n" > scores.txt

Viewing content: cat, less, head, tail

cat scores.txt        # print the whole file to the screen
less scores.txt       # view a long file with paging (q to quit, / to search)
head -2 scores.txt    # first 2 lines
tail -1 scores.txt    # last line

cat suits short files; for long files use less (scroll with arrows/Space, quit with q). head/tail grab the first/last few lines.
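head and tail can also be chained to pick out a line in the middle of a file. The sketch below is one way to do it; it recreates the sample data in a scratch directory (demo_dir is a name made up for this example) and uses a pipe, which Article 6 explains properly.

```shell
# Recreate the sample data in a scratch directory so the snippet runs on its own
demo_dir=$(mktemp -d)
printf "alice 90 math\nbob 75 science\ncarol 88 math\ndave 75 math\nalice 95 science\n" > "$demo_dir/scores.txt"

# Line 3 only: take the first 3 lines, then keep the last of those
line_three=$(head -n 3 "$demo_dir/scores.txt" | tail -n 1)
echo "$line_three"        # carol 88 math

# Everything from line 2 onward (tail -n +N starts AT line N)
from_second=$(tail -n +2 "$demo_dir/scores.txt")
echo "$from_second"

rm -rf "$demo_dir"
```

tail -n +2 is a common way to skip a header row before handing the rest of the file to another tool.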

Especially useful when debugging: tail -f follows a log file in real time, printing each new line as it's written:

tail -f /var/log/syslog    # follow the log, Ctrl+C to stop (some distros use /var/log/messages instead)

Filtering lines: grep

grep prints the lines matching a pattern — the tool you use most to search through logs and files:

grep math scores.txt        # lines containing "math"
grep -i MATH scores.txt     # -i: case-insensitive
grep -v math scores.txt     # -v: invert (lines NOT containing "math")
grep -c math scores.txt     # -c: count matching lines
grep -n math scores.txt     # -n: include line numbers
grep -r "TODO" /path/to/dir # -r: search recursively through a directory

The first command, grep math scores.txt, prints:

alice 90 math
carol 88 math
dave 75 math

grep also understands regular expressions (regex), e.g. grep "^alice" scores.txt filters lines that start with "alice" (^ = start of line).
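A few more anchored patterns, as a small sketch on the same sample data. The scratch directory and variable names are made up for the example; -E (extended regex) is a standard grep option that makes | mean "or".

```shell
# Recreate the sample data in a scratch directory so the snippet runs on its own
demo_dir=$(mktemp -d)
printf "alice 90 math\nbob 75 science\ncarol 88 math\ndave 75 math\nalice 95 science\n" > "$demo_dir/scores.txt"

# ^ anchors the pattern to the start of the line
starts_with_alice=$(grep '^alice' "$demo_dir/scores.txt")

# $ anchors the pattern to the end of the line
ends_with_science=$(grep 'science$' "$demo_dir/scores.txt")

# -E enables extended regex: lines that start with "bob " or "dave "
bob_or_dave=$(grep -E '^(bob|dave) ' "$demo_dir/scores.txt")

printf '%s\n' "$starts_with_alice" "$ends_with_science" "$bob_or_dave"

rm -rf "$demo_dir"
```

Single quotes around the pattern keep the shell from interpreting characters such as $ and | before grep sees them.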

Cutting columns: cut

With columnar data, cut pulls out the column you need:

cut -d ' ' -f1 scores.txt   # -d ' ' splits on space, -f1 takes column 1
alice
bob
carol
dave
alice

-d specifies the delimiter, -f the column number. Handy with CSV files (-d ',') or /etc/passwd (-d ':').
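To make the CSV and /etc/passwd cases concrete, here is a minimal sketch. Both input files are small made-up samples (scores.csv and passwd.sample are hypothetical names), so nothing on the real system is read.

```shell
demo_dir=$(mktemp -d)

# Hypothetical samples, invented for this sketch
printf "name,score,subject\nalice,90,math\nbob,75,science\n" > "$demo_dir/scores.csv"
printf "root:x:0:0:root:/root:/bin/bash\nkai:x:1000:1000:Kai:/home/kai:/bin/bash\n" > "$demo_dir/passwd.sample"

# Columns 1 and 3 of the CSV (-f accepts a comma-separated list of columns)
names_and_subjects=$(cut -d ',' -f1,3 "$demo_dir/scores.csv")
echo "$names_and_subjects"

# Login name and shell: fields 1 and 7 of a passwd-style line
users_and_shells=$(cut -d ':' -f1,7 "$demo_dir/passwd.sample")
echo "$users_and_shells"

rm -rf "$demo_dir"
```

cut keeps the original delimiter between the selected columns, so the first result is still comma-separated. It also splits on every comma, so a CSV with quoted fields that contain commas needs a real CSV parser.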

Sorting and removing duplicates: sort, uniq

sort scores.txt              # sort alphabetically
sort -k2 -n scores.txt       # -k2: sort key starts at column 2, -n: sort NUMERICALLY (not by text)
cut -d ' ' -f1 scores.txt | sort | uniq      # unique names
cut -d ' ' -f1 scores.txt | sort | uniq -c   # count how many times each name appears

Note: uniq only collapses adjacent duplicate lines, so you almost always have to sort first. uniq -c counts the repetitions — the pattern sort | uniq -c | sort -rn is handy for ranking (e.g. the most frequent IPs in a log).
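The ranking pattern can be sketched on a tiny log like this. The file access.log and its contents are invented for the example, and the IP is assumed to be the first space-separated column.

```shell
demo_dir=$(mktemp -d)

# Hypothetical access log: the first column is the client IP
printf "10.0.0.1 GET /\n10.0.0.2 GET /a\n10.0.0.1 GET /b\n10.0.0.3 GET /\n10.0.0.1 GET /c\n10.0.0.2 GET /\n" > "$demo_dir/access.log"

# 1. cut     : keep only the IP column
# 2. sort    : put identical IPs next to each other (uniq needs that)
# 3. uniq -c : collapse each run and prefix it with its count
# 4. sort -rn: order by that count, numerically, largest first
ranking=$(cut -d ' ' -f1 "$demo_dir/access.log" | sort | uniq -c | sort -rn)
echo "$ranking"

rm -rf "$demo_dir"
```

Here 10.0.0.1 comes first with 3 hits, then 10.0.0.2 with 2, then 10.0.0.3 with 1. The amount of padding uniq -c puts in front of the count varies between implementations.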

Counting: wc

wc scores.txt        # number of lines, words, bytes (output below)
wc -l scores.txt     # count lines only (-l)
 5 15 73 scores.txt

wc -l is handy for counting: e.g. grep error log | wc -l counts the lines that contain "error" (a line with two matches still counts once).
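A short sketch of that counting idiom, using a made-up log file (app.log is a hypothetical name). It also shows grep -c, which gives the same number on its own.

```shell
demo_dir=$(mktemp -d)

# Hypothetical application log, invented for this sketch
printf "info start\nerror disk full\ninfo retry\nerror timeout\n" > "$demo_dir/app.log"

# Count matching lines by piping grep into wc -l
error_lines=$(grep error "$demo_dir/app.log" | wc -l)

# grep -c reports the same count without the second command
error_lines_c=$(grep -c error "$demo_dir/app.log")

# Feeding the file on stdin (<) makes wc print only the number, no filename
total_lines=$(wc -l < "$demo_dir/app.log")

echo "errors: $error_lines / $error_lines_c, total lines: $total_lines"

rm -rf "$demo_dir"
```

Some wc implementations pad the number with leading spaces, so trim it if you need to compare it as a string.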

Transforming text: sed

sed (stream editor) edits text as a stream; the most common use is substitution:

sed 's/math/TOAN/g' scores.txt    # replace "math" -> "TOAN" on every line
alice 90 TOAN
bob 75 science
carol 88 TOAN
...

The s/old/new/g syntax: s = substitute, g = global (every occurrence on the line, not just the first). By default sed prints the result and does not modify the original file. Add -i to edit the file in place (sed -i 's/.../.../g' file), but be careful: it overwrites the file. That form is for GNU sed, the one on Linux; BSD/macOS sed expects a backup suffix after -i.
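The effect of g, and a safer way to edit in place, can be seen in this small sketch. The file subjects.txt is made up for the example. Attaching a suffix directly to -i (here -i.bak) asks sed to keep a copy of the original; to my knowledge both GNU and BSD sed accept this form.

```shell
demo_dir=$(mktemp -d)

# Hypothetical input with "math" twice on the first line
printf "math math\nscience math\n" > "$demo_dir/subjects.txt"

# Without g: only the first match on each line is replaced
first_only=$(sed 's/math/TOAN/' "$demo_dir/subjects.txt")
echo "$first_only"

# With g: every match on each line is replaced
all_matches=$(sed 's/math/TOAN/g' "$demo_dir/subjects.txt")
echo "$all_matches"

# In-place edit that keeps the original as subjects.txt.bak
sed -i.bak 's/math/TOAN/g' "$demo_dir/subjects.txt"
edited_first_line=$(head -n 1 "$demo_dir/subjects.txt")
backup_first_line=$(head -n 1 "$demo_dir/subjects.txt.bak")
echo "edited: $edited_first_line | backup: $backup_first_line"

rm -rf "$demo_dir"
```

Running the substitution without -i first, checking the output, and only then adding -i is a cheap habit that avoids most accidents.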

Processing by column: awk

awk is more powerful for columnar data: it splits each line into fields $1, $2... and lets you filter and compute:

awk '{print $1, $2}' scores.txt          # print columns 1 and 2
awk '$2 > 80 {print $1, $2}' scores.txt   # only lines where column 2 > 80
alice 90
carol 88
alice 95

$2 > 80 is the condition, {print ...} is the action. awk can also sum, average, count by group... — a whole little language. For a beginner, remembering "split into columns $1 $2 ... then filter/print" covers most use.
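As a taste of the sum/average/group side of awk, here is a minimal sketch on the same sample data (the scratch directory and variable names are made up). An END block runs once after the last line, and awk arrays can be indexed by text such as the subject name.

```shell
# Recreate the sample data in a scratch directory so the snippet runs on its own
demo_dir=$(mktemp -d)
printf "alice 90 math\nbob 75 science\ncarol 88 math\ndave 75 math\nalice 95 science\n" > "$demo_dir/scores.txt"

# Average of column 2 over all lines (NR = number of lines read)
overall_avg=$(awk '{ sum += $2 } END { printf "%.1f\n", sum / NR }' "$demo_dir/scores.txt")
echo "$overall_avg"       # 84.6

# Count and average per subject (column 3).
# The order of "for (s in n)" is unspecified, so the result is piped through sort.
per_subject=$(awk '{ n[$3]++; total[$3] += $2 }
  END { for (s in n) printf "%s %d %.1f\n", s, n[s], total[s] / n[s] }' "$demo_dir/scores.txt" | sort)
echo "$per_subject"

rm -rf "$demo_dir"
```

The second command prints one line per subject with its name, the number of scores, and their average: math 3 84.3 and science 2 85.0.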

When to use which: grep to filter lines by pattern; cut to take a column simply; sed to substitute/transform text; awk when you need to process columns with conditions/computation. They overlap, but these are each one's strength.

🧹 Cleanup

rm -f scores.txt

Wrap-up

You have a text-processing toolset: cat/less/head/tail (view, with tail -f to follow logs), grep (filter lines), cut (take columns), sort/uniq (sort + dedupe), wc (count), sed (substitute), awk (process by column). Each is small and specialized for one job.

The real power comes from combining them: one's output becomes the next one's input. Article 6 explains that mechanism — pipes and redirection — along with the three data streams stdin/stdout/stderr behind it.