
Implementing SQL with Unix utilities

As I’ve become a better shell programmer over the last year or two, I’ve been surprised to discover some tools I didn’t know about. It eventually dawned on me, as I did more and more brute-force processing of large datasets, as well as some of the more delicate work that went into Aspersa (now part of Percona Toolkit), that many tasks I used to do with SQL and spreadsheets can be accomplished easily with well-structured text files and Unix utilities. And these approaches don’t require loading data into a database or spreadsheet (the latter of which almost always performs terribly).

To give an idea, here are some of the relational operations (in SQL speak) you can perform:

  1. SELECT col1, col2… can be implemented with several variants of Unix utilities: cut and awk are the two most obvious. I tend to use awk only when needed, or when it’s more convenient to combine operations into a single tool.
  2. JOIN can be implemented with the… wait for it… join utility. You’ll need to sort its input first, though.
  3. Many GROUP BY operations can be performed with combinations of grep -c, sort with or without the -urnk options (look at the man page — you can apply options to individual sort keys), and uniq with or without the -c option. Many more can be done with 20 or 30 characters of awk.
  4. Output formatting is easy with column, especially with the -t option. (A rough sketch of each of these follows the list.)
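
To make this concrete, here is a rough sketch of each of those operations. The file names and column positions are made up for illustration, and I’m assuming single-space-delimited files:

$ cut -d' ' -f1,3 data.txt                        # SELECT col1, col3
$ awk '{print $1, $3}' data.txt                   # the same projection with awk
$ sort a.txt > a.sorted; sort b.txt > b.sorted    # join needs sorted input
$ join a.sorted b.sorted                          # JOIN on the first column of each file
$ awk '{print $2}' data.txt | sort | uniq -c      # SELECT col2, COUNT(*) ... GROUP BY col2
$ awk '{print $2}' data.txt | sort | uniq -c | column -t   # the same, lined up in columns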

In addition to the above, Bash’s process substitution syntax can help you avoid a lot of temporary files. For example, if you want to join two unsorted files, you can do it like this:

$ join <(sort file1) <(sort file2)

That’s kind of an overview — I end up hacking together a bunch of things, and I’m sure I’m forgetting something. But pipe-and-filter programming with whitespace-delimited files is generally a much more powerful (and performant) paradigm than I realized a few years ago, and that’s the point I wanted to share overall.

As a concrete example, I remember a mailing list thread that began with “I have a 500GB file of 600 billion strings, max length 2000 characters, unsorted, non-unique, and I need a list of the unique strings.” Suggestions included Hadoop, custom programs, Gearman, more Hadoop, and so on — and the ultimate solution was sort -u and sort --merge, trivially parallelized with Bash. (By the way, an easy way to parallelize things is xargs -P.)
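
For a rough idea of what that approach looks like, here is a minimal sketch. The file names, chunk size, and level of parallelism are made up for illustration, and I’m assuming GNU coreutils:

$ split -l 10000000 huge.txt chunk.                       # break the big file into pieces
$ ls chunk.* | xargs -P 8 -I{} sort -u -o {}.sorted {}    # sort and dedup each piece, 8 at a time
$ sort --merge -u chunk.*.sorted > unique.txt             # merge the sorted pieces into one unique list

The merge pass only interleaves input that is already sorted, so it never needs to hold more than a line from each piece in memory.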

What are your favorite “low-level” power programming techniques?
