Reading and Processing Text: grep, sed, awk and Friends
This is the group of tools behind the power of the Linux command line. True to the Unix philosophy (Article 0), each one does one thing well — viewing, filtering, cutting, sorting, counting, transforming text. Learn each on its own, then Article 6 combines them with pipes.
Create a sample file in the lab to practice:
cd /tmp
printf "alice 90 math\nbob 75 science\ncarol 88 math\ndave 75 math\nalice 95 science\n" > scores.txt
Viewing content: cat, less, head, tail
cat scores.txt # print the whole file to the screen
less scores.txt # view a long file with paging (q to quit, / to search)
head -n 2 scores.txt # first 2 lines
tail -n 1 scores.txt # last line
cat suits short files; for long files use less (scroll with arrows/Space, quit with q). head/tail grab the first/last few lines.
Especially useful when debugging: tail -f follows a log file in real time, printing each new line as it's written:
tail -f /var/log/syslog # tail the log, Ctrl+C to stop
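As a small sketch of how head and tail pick lines by position, the commands below select a range from the sample file. The first two lines only recreate scores.txt so the snippet runs on its own, and the | pipe it borrows is explained in Article 6.

```shell
cd /tmp
printf "alice 90 math\nbob 75 science\ncarol 88 math\ndave 75 math\nalice 95 science\n" > scores.txt

tail -n +4 scores.txt                # +4: start AT line 4 and print to the end
head -n 3 scores.txt | tail -n 1     # first 3 lines, then the last of those: carol 88 math
```

Note the difference between tail -n 4 (the last four lines) and tail -n +4 (everything from line 4 onward).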
Filtering lines: grep
grep prints the lines matching a pattern — the tool you use most to search through logs and files:
grep math scores.txt # lines containing "math"
grep -i MATH scores.txt # -i: case-insensitive
grep -v math scores.txt # -v: invert (lines NOT containing "math")
grep -c math scores.txt # -c: count matching lines
grep -n math scores.txt # -n: include line numbers
grep -r "TODO" /path/to/dir # -r: search recursively through a directory
Output of the first command (grep math scores.txt):
alice 90 math
carol 88 math
dave 75 math
grep also understands regular expressions (regex), e.g. grep "^alice" scores.txt filters lines that start with "alice" (^ = start of line).
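A few more regex patterns, as an illustrative sketch on the same sample file (recreated in the first two lines so the snippet is self-contained). The patterns chosen here are examples only, not an exhaustive list:

```shell
cd /tmp
printf "alice 90 math\nbob 75 science\ncarol 88 math\ndave 75 math\nalice 95 science\n" > scores.txt

grep 'math$' scores.txt               # $ = end of line: lines that END with "math"
grep -E '^(alice|bob) ' scores.txt    # -E: extended regex, | means "or"
grep -c '^alice' scores.txt           # count lines starting with "alice": 2
```

Single quotes around the pattern keep the shell from interpreting characters such as $ and | before grep sees them.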
Cutting columns: cut
With columnar data, cut pulls out the column you need:
cut -d ' ' -f1 scores.txt # -d ' ' splits on space, -f1 takes column 1
alice
bob
carol
dave
alice
-d specifies the delimiter, -f the column number. Handy with CSV files (-d ',') or /etc/passwd (-d ':').
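To illustrate a different delimiter without depending on the real /etc/passwd, the sketch below feeds cut two made-up entries in the same colon-separated layout. The user names and paths are invented for the example.

```shell
# Two hypothetical entries in /etc/passwd format: name:password:UID:GID:comment:home:shell
printf "root:x:0:0:root:/root:/bin/bash\nalice:x:1000:1000:Alice:/home/alice:/bin/sh\n" | cut -d ':' -f1
# prints: root, then alice

# -f accepts a list: columns 1 and 7, still joined by the delimiter
printf "root:x:0:0:root:/root:/bin/bash\nalice:x:1000:1000:Alice:/home/alice:/bin/sh\n" | cut -d ':' -f1,7
# prints: root:/bin/bash, then alice:/bin/sh
```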
Sorting and removing duplicates: sort, uniq
sort scores.txt # sort alphabetically
sort -k2,2n scores.txt # -k2,2 limits the sort key to column 2, n sorts NUMERICALLY (not by text)
cut -d ' ' -f1 scores.txt | sort | uniq # unique names
cut -d ' ' -f1 scores.txt | sort | uniq -c # count how many times each name appears
Note: uniq only collapses adjacent duplicate lines, so you almost always have to sort first. uniq -c counts the repetitions — the pattern sort | uniq -c | sort -rn is handy for ranking (e.g. the most frequent IPs in a log).
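The ranking pattern can be tried on the sample file itself. This is a sketch that counts the subjects in column 3 rather than IPs in a log, and the first two lines only recreate the file:

```shell
cd /tmp
printf "alice 90 math\nbob 75 science\ncarol 88 math\ndave 75 math\nalice 95 science\n" > scores.txt

# Without sort: equal lines that are not adjacent are counted separately (4 output lines)
cut -d ' ' -f3 scores.txt | uniq -c

# Sort first, count, then rank by count, highest first: 3 math, then 2 science
cut -d ' ' -f3 scores.txt | sort | uniq -c | sort -rn
```

In the final sort, -n compares the counts as numbers and -r reverses the order so the largest count comes first.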
Counting: wc
wc scores.txt # number of lines, words, bytes
wc -l scores.txt # count lines only (-l)
Output of the first command (5 lines, 15 words, 73 bytes):
5 15 73 scores.txt
wc -l is extremely handy for counting: e.g. grep error log | wc -l counts the lines that contain "error" (log stands for your log file).
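A short sketch of the counting idiom on the sample file, recreated in the first two lines. It uses the < redirection and | pipe that Article 6 covers:

```shell
cd /tmp
printf "alice 90 math\nbob 75 science\ncarol 88 math\ndave 75 math\nalice 95 science\n" > scores.txt

wc -l < scores.txt               # reads from stdin, so only the number is printed: 5
grep math scores.txt | wc -l     # lines containing "math": 3
grep -c math scores.txt          # same count from grep alone: 3
```

Feeding the file through < is convenient in scripts, because the output is a bare number with no file name attached.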
Transforming text: sed
sed (stream editor) edits text as a stream; the most common use is substitution:
sed 's/math/TOAN/g' scores.txt # replace "math" -> "TOAN" on every line
alice 90 TOAN
bob 75 science
carol 88 TOAN
...
The s/old/new/g syntax: s = substitute, g = global (every occurrence on the line, not just the first). By default sed prints the result and does not modify the original file; add -i to edit the file in place (sed -i 's/.../.../g' file). Be careful: -i overwrites the file, so consider -i.bak, which keeps the original as file.bak.
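The effect of the g flag and of in-place editing can be seen in a minimal sketch. The scratch file name sed_demo.txt is made up for this example, so the real scores.txt stays untouched:

```shell
echo "math math math" | sed 's/math/TOAN/'     # no g: TOAN math math
echo "math math math" | sed 's/math/TOAN/g'    # with g: TOAN TOAN TOAN

cd /tmp
printf "alice 90 math\nbob 75 science\n" > sed_demo.txt   # hypothetical scratch file
sed -i.bak 's/math/TOAN/g' sed_demo.txt   # edit in place, keep the original as sed_demo.txt.bak
cat sed_demo.txt                          # alice 90 TOAN / bob 75 science
cat sed_demo.txt.bak                      # unchanged original
rm -f sed_demo.txt sed_demo.txt.bak       # remove the scratch files
```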
Processing by column: awk
awk is more powerful for columnar data: it splits each line into fields $1, $2... and lets you filter and compute:
awk '{print $1, $2}' scores.txt # print columns 1 and 2
awk '$2 > 80 {print $1, $2}' scores.txt # only lines where column 2 > 80
alice 90
carol 88
alice 95
$2 > 80 is the condition, {print ...} is the action. awk can also sum, average, count by group... — a whole little language. For a beginner, remembering "split into columns $1 $2 ... then filter/print" covers most use.
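As a taste of the "sum, average, count by group" side of awk, here is a hedged sketch on the sample file (recreated in the first two lines). The variable names total, sum, n and k are arbitrary choices for this example:

```shell
cd /tmp
printf "alice 90 math\nbob 75 science\ncarol 88 math\ndave 75 math\nalice 95 science\n" > scores.txt

# Sum and average of column 2; the END block runs once after the last line, NR = number of lines
awk '{total += $2} END {print total, total / NR}' scores.txt     # 423 84.6

# Average per name: arrays indexed by column 1; sort because awk's loop order is unspecified
awk '{sum[$1] += $2; n[$1]++} END {for (k in sum) print k, sum[k] / n[k]}' scores.txt | sort
# alice 92.5, bob 75, carol 88, dave 75
```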
When to use which:
grep to filter lines by pattern; cut to take a column simply; sed to substitute/transform text; awk when you need to process columns with conditions/computation. They overlap, but these are each one's strength.
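To make the overlap concrete, the sketch below solves one task (the names of everyone with a math score) in two ways. The sample file is recreated first so the snippet runs on its own:

```shell
cd /tmp
printf "alice 90 math\nbob 75 science\ncarol 88 math\ndave 75 math\nalice 95 science\n" > scores.txt

grep ' math$' scores.txt | cut -d ' ' -f1     # filter lines with grep, then take column 1 with cut
awk '$3 == "math" {print $1}' scores.txt      # the same result with awk alone
```

Both print alice, carol and dave. The grep and cut version reads as two simple steps, while the awk version tests the column itself and so cannot match "math" by accident elsewhere on the line.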
🧹 Cleanup
rm -f /tmp/scores.txt
Wrap-up
You have a text-processing toolset: cat/less/head/tail (view, with tail -f to follow logs), grep (filter lines), cut (take columns), sort/uniq (sort + dedupe), wc (count), sed (substitute), awk (process by column). Each is small and specialized for one job.
The real power comes from combining them: one's output becomes the next one's input. Article 6 explains that mechanism — pipes and redirection — along with the three data streams stdin/stdout/stderr behind it.