As I’ve become a better shell programmer over the last year or two, I’ve been surprised to discover tools I didn’t know about. It eventually dawned on me, as I did more and more brute-force processing of large datasets, as well as some of the more delicate work that went into Aspersa (now part of Percona Toolkit), that many tasks I used to do with SQL and spreadsheets can be accomplished easily with well-structured text files and Unix utilities. And they don’t require loading the data into a database or spreadsheet (the latter of which almost always performs terribly).
To give an idea, here are some of the relational operations (in SQL speak) you can perform:
- SELECT col1, col2… can be implemented with several Unix utilities: `cut` and `awk` are the two most obvious. I tend to use `awk` only when needed, or when it’s more convenient to combine operations into a single tool.
- JOIN can be implemented with the… wait for it… `join` utility. You’ll need to `sort` its input first, though.
- Many GROUP BY operations can be performed with combinations of `grep -c`, `sort` with or without its `-u`, `-r`, `-n`, and `-k` options (look at the man page; you can apply options to individual sort keys), and `uniq` with or without the `-c` option. Many more can be done with 20 or 30 characters of `awk`.
- Output formatting is easy with `column`, especially with the `-t` option. (There’s a short sketch of all of these right after this list.)
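
To make that mapping concrete, here’s a minimal sketch of each of those operations against a small hypothetical file; the file name, columns, and sample data are invented for illustration:

```bash
# orders.txt: customer product amount, one record per line (invented sample data)
$ cat orders.txt
alice widget 10
bob gadget 25
alice gadget 5
bob widget 10

# SELECT customer, amount:
$ cut -d' ' -f1,3 orders.txt
alice 10
bob 25
alice 5
bob 10

# SELECT customer, COUNT(*) ... GROUP BY customer, via sort and uniq -c:
$ cut -d' ' -f1 orders.txt | sort | uniq -c
      2 alice
      2 bob

# SELECT customer, SUM(amount) ... GROUP BY customer, in a few characters of awk:
$ awk '{s[$1] += $3} END {for (c in s) print c, s[c]}' orders.txt
alice 15
bob 35

# Pretty-print any of it as aligned columns:
$ awk '{s[$1] += $3} END {for (c in s) print c, s[c]}' orders.txt | column -t
alice  15
bob    35
```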
In addition to the above, Bash’s process substitution syntax, `<(...)`, can help avoid a lot of temporary files. For example, if you want to join two unsorted files, you can do it like this:
```bash
$ join <(sort file1) <(sort file2)
```
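
For instance, with two hypothetical two-column files keyed on the first field (names and data invented for illustration), that one-liner behaves like an inner join on the key:

```bash
$ cat users.txt
2 bob
1 alice
$ cat logins.txt
1 2021-03-01
2 2021-03-04

# join matches rows on the first field of each input by default, which is
# why both files are sorted on the fly via process substitution:
$ join <(sort users.txt) <(sort logins.txt)
1 alice 2021-03-01
2 bob 2021-03-04
```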
That’s only an overview; I end up hacking together a bunch of things, and I’m sure I’m forgetting something. But the overall point I wanted to share is that pipe-and-filter programming with whitespace-delimited files is a much more powerful (and performant) paradigm than I realized a few years ago.
As a concrete example, I remember a mailing list thread that began with “I have a 500GB file of 600 billion strings, max length 2000 characters, unsorted, non-unique, and I need a list of the unique strings.” Suggestions included Hadoop, custom programs, Gearman, more Hadoop, and so on. The ultimate solution was `sort -u` and `sort --merge`, trivially parallelized with Bash. (By the way, an easy way to parallelize things is `xargs -P`.)
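
Here’s a rough sketch of how that split/sort/merge approach fits together, assuming GNU coreutils; the file names, chunk size, and parallelism level are invented for illustration:

```bash
# Split the big file into 10-million-line chunks named chunk.aa, chunk.ab, ...
$ split -l 10000000 huge.txt chunk.

# Sort and de-duplicate each chunk, running 8 sorts in parallel via xargs -P:
$ printf '%s\n' chunk.* | xargs -P 8 -I{} sort -u -o {}.sorted {}

# Merge the already-sorted chunks (no re-sort), dropping duplicates as we go:
$ sort --merge -u chunk.*.sorted > unique.txt
```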
What are your favorite “low-level” power programming techniques?