Ego Pontem

Staying in touch. Kyoto, Japan. August 2023.

Tech writers sometimes need to get their hands dirty. Sometimes they want to but can’t. How would you convert a list of regional data centres in CSV to an unordered list in markdown? CSV and markdown are both text formats, so you could use an editor to search and replace and copy and paste. That’s especially tedious if you have to do all that work again when the CSV file changes.

Or you could ask a developer to rig up a script for you. You’d have to bribe the developer’s manager for it and wait a few weeks. But then you could run the script yourself whenever you need to.

Or you could write your own damned script. It’s not that hard and once you learn some basics you’ll be surprised how useful your new skills are in other situations.

You’ll need a command line

First, you’ll need access to a computer that has a command line and related tools.

We’ll use the most popular command line, GNU Bash. But there are others that do the same things. Apple macOS, Linux, and other Unix-y systems have a command line already set up for you.

Do this: If you’re stuck with Microsoft Windows, I suggest installing Windows Subsystem for Linux (WSL), which gives you the tools you need.

You get a command line by running a terminal application. For example, on macOS it’s called, well, Terminal. On Linux and Unix-y systems, run whatever terminal thing it has. On Windows, once WSL is installed, open Command Prompt or Windows Terminal and enter wsl to get a Linux command line.

And you’ll need a text editor

Use a text editor, not a word processor. Microsoft Word won’t work. If you don’t already have your fave text editor set up, use TextEdit on macOS, Notepad on Windows. Linux and Unix-y machines most likely have GNU nano or something like it.

Finally, you’ll need awk

Yeah, awk. Awk is a command-line tool for processing text files. It’s really good with tabular text files, where each row is a line of text and each column is marked with a separator. For CSV, the separator is a comma.

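We’ll actually be using gawk, which is GNU’s implementation of awk. It’s probably the most popular awk out there. The --csv option we rely on below was added in gawk 5.3, so if gawk doesn’t recognize the option, check your version with gawk --version.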
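Not sure whether gawk is on your machine? Here's a minimal sketch of a check you could run. The awk_cmd variable name is made up for this example, and the messages are only suggestions.

```shell
# Prefer gawk; fall back to plain awk if gawk isn't installed.
if command -v gawk >/dev/null 2>&1; then
    awk_cmd=gawk
    # Show just the first line of gawk's version banner.
    gawk --version | head -n 1
elif command -v awk >/dev/null 2>&1; then
    awk_cmd=awk
    echo "gawk not found, but a plain awk is available"
else
    awk_cmd=""
    echo "no awk found; try installing gawk with your package manager"
fi
```

The command -v part asks the command line whether it knows a command by that name, without running it.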

Step 1: Make a directory for your project

Do this: Start your terminal application.

When the command line is ready for your command, it shows a prompt. Typically this prompt is a dollar sign, or maybe your username or the name of the computer followed by a dollar sign.

$ ▮

When you see a prompt, you can enter your command then press the Enter key to run it.

For your first command, let’s make a folder for your project. You’ll put the files you’ll work on in there.

Do this: Enter mkdir datacenters.

$ mkdir datacenters
$ ▮

The mkdir command makes a folder with the name you specify. As you can see, it doesn’t give any indication about its work unless there’s an error. Since there was no problem making the new folder, you just get another prompt. You can see the result of its work by seeing what’s in the current directory.

Do this: Enter ls.

$ ls
Desktop Documents Downloads Photos datacenters
$ ▮

The ls command outputs the names of the files and directories in the current directory. The other names (Desktop, Documents, and so on) are other files and folders in the same folder as your new datacenters folder.

Let’s go to the new directory, which makes it the new, current directory.

Do this: Enter cd datacenters.

$ cd datacenters
$ pwd
/home/marc/datacenters
$ ▮

The cd command changes the current directory. The pwd command outputs the current directory’s full path. My current directory is /home/marc/datacenters. Yours will be something similar that also ends with datacenters.
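Once you're comfortable, the whole of Step 1 fits in a few lines. This is a sketch under a couple of assumptions that aren't part of the steps above: it works in a throwaway directory made by mktemp -d so nothing in your home directory changes, and it uses mkdir's -p option, which keeps mkdir quiet if the directory already exists.

```shell
# Work in a throwaway location (assumption for this sketch).
scratch=$(mktemp -d)
cd "$scratch"

# -p: don't complain if the directory is already there.
mkdir -p datacenters
cd datacenters

# Confirm where we ended up; the path ends with datacenters.
pwd
```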

Step 2: Get the CSV file

Now let’s get some CSV to work with. Your spreadsheet might look like this:

[Image: screenshot of a spreadsheet with data centres]

The eventual result we want from this spreadsheet is a file named datacenters.md:

There's a data centre ready to serve storage,
compute, and databases for our customers
around the globe:

- Catania
- Geneva
- Kyoto
- La Plata
- Montreal

Please contact our Sales department for more info.

I’ve exported this spreadsheet as a CSV file:

city,state-prov,country,storage,compute,database,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902

Do this: In your text editor, create a file named in.csv then copy the CSV above and paste it. Save the file in your project directory.

You can check to see that your CSV file is in your project directory and has the correct contents.

Do this: Enter cat in.csv.

$ cat in.csv
city,state-prov,country,storage,compute,database,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902
$ ▮

The cat tool outputs the contents of the file you specify. You could have also used the less tool. It shows a file’s contents one screenful at a time. You can go up or down a screenful with the Page Up and Page Down keys. You can return to the prompt by pressing the q key (lowercase), for “quit”.
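If you'd rather not leave the command line, you can also create in.csv without a text editor. The sketch below uses a "here document": everything between the two EOF markers is fed to cat, and > sends cat's output into the file. The line_count variable is just a name picked for this example.

```shell
# Write the CSV into in.csv in the current directory.
cat > in.csv <<'EOF'
city,state-prov,country,storage,compute,database,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902
EOF

# Count the lines: one header plus five data centres.
line_count=$(wc -l < in.csv | tr -d ' ')
echo "in.csv has $line_count lines"
```

Careful: this replaces any in.csv that's already in the current directory.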

Step 3: Let’s get awking

We’ll write our csv-to-markdown script incrementally. This is a natural way to do it because the command line makes it easy to interact and iterate.

Let’s create the simplest awk program, an empty file.

Do this: Enter touch datacenters.awk.

What happened? Nothing, except that the touch command created a new, empty file named datacenters.awk.

Do this: Enter ls -l.

$ ls -l
total 4
-rw-r--r--  1 marc  marc    0 Nov 17 10:51 datacenters.awk
-rw-r--r--  1 marc  marc  259 Nov 17 10:51 in.csv
$ ▮

The -l, a hyphen followed by a lowercase L, in ls -l is an option. This option tells the ls tool to list files in long format. You can ignore most of this output, but take a look at the column with 0 and 259. This is the column for file sizes. Notice how datacenters.awk has 0 bytes: it is indeed empty.

Now let’s see this script in action.

Do this: Enter gawk --csv -f datacenters.awk in.csv

$ gawk --csv -f datacenters.awk in.csv
$ ▮

Excellent. Nothing happened. That’s what we expected, after all, because our awk script is empty.

Let’s take a closer look at the gawk command you entered. You can probably figure out what it means:

gawk: This is the command to run.
--csv: This option tells gawk that we’re working with CSV input. Notice that there are two hyphens instead of one. This is called a long option because it’s more than one letter long. Normal options are a single letter or digit and preceded by a single hyphen.
-f datacenters.awk: The -f option specifies the script for gawk to use. An awk script can have any file name, but we end the file’s name with .awk, the conventional file name extension for awk scripts.
in.csv: This is our input file for our awk script.

A couple of things to keep in mind:

A command line expects items, like options and file names in a command, to be separated by spaces. In other words, as a beginner at least, avoid putting spaces in your file names. The command line has the ability to handle spaces in file names, but it can get complicated so we won’t cover that here.
Uppercase and lowercase matter in Linux and Unix-y systems. To keep things simple, we’ll use lowercase letters as much as possible, including file names.

Step 4: The identity script

Now we’ll edit our awk script to make it do something, more or less. Well, more less than more. We’ll create an identity script.

In mathematics, the identity function returns the value that you give it. In other words, it doesn’t do anything more than repeat what you tell it. In the command line, an identity script outputs its input. How is that useful? It isn’t immediately useful, but it’s a good starting point to build on.

Do this: Open datacenters.awk in your text editor then copy and paste the following. It’s just a single line. Make sure you end it by pressing Enter. Save your file and quit the editor to return to the command line.

{ print; }

Simple, right? Let’s unpack our script. An awk script is pretty straightforward. It’s organized into pairs of patterns and actions. The awk tool reads its input one line at a time. For each line, it checks to see if the script has any patterns that match the line. For each pattern that is true for the line, awk performs the pattern’s action.

What we’ve done is create a single pattern and action in our script. You can’t see the pattern because we’re relying on the default pattern, also called the empty pattern. The empty pattern is always true for every line.

An action is wrapped in { and }. An empty action, { }, does nothing. (If you leave out the action and its braces entirely, awk prints the matching line instead.) We want our action to repeat the line that we’re currently processing. That’s what the print statement does. With no arguments, print outputs the current line. The statement ends with a semi-colon (;). We use semi-colons to separate statements in an action. This is optional when there’s only one statement in an action, but we put it here out of habit.
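Awk lets you leave out either half of a pattern-action pair, and it can help to see both cases side by side. Here's a small sketch that feeds awk three made-up lines instead of our CSV. The variable names (sample, all_lines, second_line) are invented for this example.

```shell
# Three lines of sample input (made up for this sketch).
sample='one
two
three'

# Action without a pattern: the action runs for every line.
all_lines=$(printf '%s\n' "$sample" | awk '{ print; }')

# Pattern without an action: awk prints each line the pattern matches.
second_line=$(printf '%s\n' "$sample" | awk 'NR == 2')

echo "$all_lines"
echo "$second_line"
```

The first awk command repeats all three lines. The second outputs only two, because NR == 2 is true just for the second line of input.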

Now let’s run our command again to see if it really is the identity script.

Do this: In the command line, enter gawk --csv -f datacenters.awk in.csv

$ gawk --csv -f datacenters.awk in.csv
city,state-prov,country,storage,compute,database,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902
$ ▮

There you go. Our identity script does what we expect.

Step 5: Pick a specific column

We want our script to only output contents of the city column, which is the first column. To do this, we give the print statement an argument that specifies this column.

Do this: Open datacenters.awk in your text editor then make the following change:

{ print $1; }

The $1 argument for the print statement specifies the first column, our city column.

Do this: In the command line, enter gawk --csv -f datacenters.awk in.csv

$ gawk --csv -f datacenters.awk in.csv
city
Catania
Geneva
Kyoto
La Plata
Montreal
$ ▮
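You aren't limited to one column. As a sketch, here's print with two columns separated by a comma, which makes awk put a space between them in the output. One assumption to flag: this example uses plain awk with -F, (comma as the column separator) instead of gawk --csv, so it also runs on awks without CSV support. That shortcut only works because none of the sample values contain commas or quotes. The sample rows are trimmed to three columns.

```shell
# A cut-down version of our CSV (first three columns only).
csv='city,state-prov,country
Catania,Sicily,IT
Kyoto,Kyoto,JP'

# $1 is the city column, $3 is the country column.
city_country=$(printf '%s\n' "$csv" | awk -F, '{ print $1, $3; }')

echo "$city_country"
```

This outputs city country, Catania IT, and Kyoto JP, each on its own line.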

Step 6: Format for markdown

We’re getting closer! Let’s format our output as an unordered list in markdown. Each list item in an unordered list starts with a hyphen, followed by a space, then the text for the item.

Do this: Open datacenters.awk in your text editor then make the following change:

{ print "- " $1; }

We’ve given print two values side by side: a string containing the beginning of a list item in markdown (a hyphen and a space), then the value of our first column. When awk sees two values next to each other like this, it joins them into a single string. Notice that we wrapped the literal string in double quotes.

Do this: In the command line, enter gawk --csv -f datacenters.awk in.csv

$ gawk --csv -f datacenters.awk in.csv
- city
- Catania
- Geneva
- Kyoto
- La Plata
- Montreal
$ ▮

Look at that, you’ve converted CSV to markdown! Now we can put some finishing touches to get the final output we’re after.
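Placing strings and columns next to each other works for more than a leading hyphen. Here's a sketch that wraps the city in markdown bold and adds the country code in parentheses. As an assumption for this example, it uses plain awk with -F, rather than gawk --csv, which is fine here only because the sample values have no commas or quotes in them.

```shell
# One sample row, trimmed to three columns.
row='Kyoto,Kyoto,JP'

# Strings and columns placed side by side are joined into one string.
item=$(printf '%s\n' "$row" | awk -F, '{ print "- **" $1 "** (" $3 ")"; }')

echo "$item"
```

The result is - **Kyoto** (JP).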

Step 7: Ignore the first line

You’ve probably been annoyed by it by now, the column name city in the first line of our output. What we want to do is ignore this first line in the input so it doesn’t show up in the output. We can do this with a new pattern-action.

Do this: Open datacenters.awk in your text editor then change it to this:

NR == 1 { next; }
{ print "- " $1; }

You already know what the 2nd line in our script does. Let’s take a look at the new, first line. Unlike the 2nd line in our script, this new pattern-action has an explicit pattern, NR == 1. It uses awk’s built-in variable named NR. Its value is the number of the input line that awk is currently processing. For the first line of input, NR’s value is 1. So that’s what we check for. NR == 1 means “Is NR’s value equal to 1?” When this pattern is true, awk does its action.

The action for this pattern is the next statement, which tells awk to stop looking for more matching patterns for this line and move on to the next line. Notice that we put this pattern-action at the beginning of our script. We don’t want awk to process any other actions when NR is 1.

Do this: In the command line, enter gawk --csv -f datacenters.awk in.csv

$ gawk --csv -f datacenters.awk in.csv
- Catania
- Geneva
- Kyoto
- La Plata
- Montreal
$ ▮

There. Our markdown output shows just the cities without the first line.
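To get a feel for NR, it can help to print it next to each line. The sketch below does that with three made-up lines, then shows another way to skip the header: a single NR > 1 pattern instead of a separate next rule. That second form is an alternative for illustration, not what our script uses.

```shell
# A tiny one-column input: a header and two cities.
lines='city
Catania
Geneva'

# Show the value of NR for each line of input.
numbered=$(printf '%s\n' "$lines" | awk '{ print NR ": " $0; }')

# Only act on lines after the first.
skipped=$(printf '%s\n' "$lines" | awk 'NR > 1 { print "- " $0; }')

echo "$numbered"
echo "$skipped"
```

Here $0 means the whole current line. The first command numbers the lines 1 to 3. The second outputs list items for Catania and Geneva only.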

Step 8: Beginning and end

We want our output to have text before and after the list of cities. For that we add a couple of new pattern-action pairs. Take your time with this one. It’s our biggest change to our script so far.

Do this: Open datacenters.awk in your text editor then change it to this:

# Convert a CSV list of data centres to a markdown list.
BEGIN {
        print "There's a data centre ready to serve storage,";
        print "compute, and databases for our customers";
        print "around the globe:";
        print "";
}

END {
        print "";
        print "Please contact our Sales department for more info.";
}

NR == 1 {
        next;
}

{
        print "- " $1;
}

There are a few new things going on here:

The first line starts with #. Awk ignores everything after the # until the end of the line that it’s on. This is a comment. A comment is for people, not awk. Use comments to remind yourself, and others, of what the script does, tricky parts that aren’t obvious, and so on.
We’ve organized the actions a little differently. The opening { and closing } aren’t on the same line, and statements in the actions are indented. We do this to make it easier to read the script and organize the pattern-action pairs.
The BEGIN and END patterns do exactly what you expect. The BEGIN pattern is true only before awk has read the first line of input. The END pattern is true only after it has read the last line of input.
Notice the print ""; statements. Each one prints an empty line so we can separate the paragraphs from the unordered list. We use "" as an argument to print so that it doesn’t output its default value.

Do this: In the command line, enter gawk --csv -f datacenters.awk in.csv

$ gawk --csv -f datacenters.awk in.csv
There's a data centre ready to serve storage,
compute, and databases for our customers
around the globe:

- Catania
- Geneva
- Kyoto
- La Plata
- Montreal

Please contact our Sales department for more info.
$ ▮
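BEGIN and END are handy for more than fixed text. As a sketch, the example below counts the list items as it goes and reports the total in the END action. The count variable and the wording of the output are made up for this example, and the input is a one-column list rather than our full CSV.

```shell
summary=$(printf 'city\nCatania\nGeneva\nKyoto\n' | awk '
BEGIN  { print "Cities:"; }
NR > 1 { count++; print "- " $0; }
END    { print "Total: " count; }
')

echo "$summary"
```

Awk variables like count start out empty, which counts as zero, so count++ works without any setup. The last line of output is Total: 3.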

Someone changed their mind

The Product Manager’s barber’s plumber’s cousin wants data centres listed only if they offer database services. You’ll have to update the markdown.

Guess what, you can do that easily. Let’s take a look at our CSV data to figure this out.

Step 9: Filter for databases

Do this: Enter head -1 in.csv.

$ head -1 in.csv
city,state-prov,country,storage,compute,database,id
$ ▮

The head tool outputs the first lines of its input. In this case, we use the option -1 (that’s a hyphen with a number 1) to specify just the first line, which contains the names of the columns.

The database column is the 6th column. We’ll use this to update our script with a new pattern for outputting markdown list items.

Do this: Open datacenters.awk in your text editor then change it to this:

BEGIN {
        print "There's a data centre ready to serve databases";
        print "for our customers around the globe:";
        print "";
}

END {
        print "";
        print "Please contact our Sales department for more info.";
}

NR == 1 {
        next;
}

$6 == "TRUE" {
        print "- " $1;
}

Here’s what we did:

In the BEGIN pattern, we revised the intro text.
We replaced the default pattern for the action that outputs list items with a pattern that checks to see if the database column is equal to TRUE.

Let’s see if we get what we expect.

Do this: In the command line, enter gawk --csv -f datacenters.awk in.csv

$ gawk --csv -f datacenters.awk in.csv
There's a data centre ready to serve databases
for our customers around the globe:

- Catania
- Geneva
- Kyoto
- La Plata

Please contact our Sales department for more info.
$ ▮
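What if the cousin changes their mind again and wants only data centres with both storage and databases? A pattern can combine checks with && ("and"). Here's a sketch of that idea. It assumes plain awk with -F, in place of gawk --csv, which works for this data only because no value contains a comma or quotes.

```shell
csv='city,state-prov,country,storage,compute,database,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902'

# $4 is the storage column, $6 is the database column.
both=$(printf '%s\n' "$csv" |
    awk -F, 'NR > 1 && $4 == "TRUE" && $6 == "TRUE" { print "- " $1; }')

echo "$both"
```

Only Catania and La Plata have TRUE in both columns, so those are the two list items you get.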

Step 10: Generate an output file

So far we’ve seen our output show up on the terminal. That’s handy because we can see immediately if our script is doing what we want it to. You can redirect this output to a file instead, ready to copy or send to whoever or whatever needs it.

Do this: Enter gawk --csv -f datacenters.awk in.csv > out.md.

$ gawk --csv -f datacenters.awk in.csv > out.md
$ ▮

Notice the > out.md we’ve added to the end of our command. The greater-than sign (>) tells the command line to redirect output from the terminal to a file named out.md. I’ll leave it to you to figure out if out.md contains what you expect it to.

Actually, > is redirecting from standard output. Standard output is the name for, well, the usual output of a command-line program. By default, standard output goes to the terminal. But you can use > to redirect it to a file.

Surprise! There’s also standard input. Standard input is the usual input to a command-line program, typically you at the keyboard. Some commands, like awk, use standard input if you don’t specify an explicit input file. You can tell the command line to redirect standard input from a file with <.

Do this: Enter gawk --csv -f datacenters.awk < in.csv > out.md.

$ gawk --csv -f datacenters.awk < in.csv > out.md
$ ▮

Notice how we don’t tell gawk which file to use to get its input. Instead, we’ve redirected standard input from in.csv. Since awk has no explicit input file to work with, it uses standard input, which in this case comes from in.csv.

In practice, you’re right to think that this doesn’t make any difference to our input or output in this case. But redirection is a command-line superhero-level power for lots of things that are beyond the scope of this little page.
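You don't have to take my word for it that both forms give the same result. The sketch below runs a small script both ways and compares the two output files with cmp, a tool that checks whether two files are identical. The file names, the scratch directory, and the two-column sample CSV are all made up for this example, and it uses awk -F, rather than gawk --csv (fine here because the values contain no commas or quotes).

```shell
# Keep the example files out of the way in a throwaway directory.
workdir=$(mktemp -d)
printf 'city,country\nKyoto,JP\nGeneva,US\n' > "$workdir/in.csv"

# Input file named on the command line.
awk -F, 'NR > 1 { print "- " $1; }' "$workdir/in.csv" > "$workdir/by_name.md"

# Input redirected from the file to standard input.
awk -F, 'NR > 1 { print "- " $1; }' < "$workdir/in.csv" > "$workdir/by_stdin.md"

# cmp -s stays silent and just reports success or failure.
if cmp -s "$workdir/by_name.md" "$workdir/by_stdin.md"; then
    same=yes
else
    same=no
fi
echo "identical: $same"
```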

Where to go from here

Here’s what you’ve learned:

Some rudiments of using the command line.
How to filter and format CSV into markdown.

Just these skills can solve quite a few problems. Awk alone is quite the Swiss army knife for tabular data.

The Linux and Unix world has a lot of other text-processing tools besides awk. You’ve already used cat and head. There are many others, including tail, sort, and uniq.

Go ahead, explore with your new skills. Try these exercises:

Fancier printing: The printf statement in awk lets you format text and numbers in more sophisticated ways than print. Change the script to also include the data centre’s ID.
Control statements: The if and switch statements let you check for conditions inside an action. Try either one of them to convert the 2-letter codes in the country column to a full country name. For example, an input of CA should be output as Canada.
Awk patterns: You can also match text, including regular expressions. Wrap the text or regular expression in forward slashes (/), as in /Kyoto/, to match lines that contain it.
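If you want a nudge for the first and last exercises, here's one possible starting point, not the only answer. It combines a regular expression pattern with printf. The pattern /^K/ matches lines that start with a capital K, and each %s in the printf format is replaced by the next argument. As an assumption for this sketch, it uses plain awk with -F, instead of gawk --csv, which works because the sample values have no commas or quotes.

```shell
csv='city,state-prov,country,storage,compute,database,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283'

# $1 is the city column, $7 is the id column.
# Unlike print, printf doesn't add a newline, so we end the format with \n.
with_id=$(printf '%s\n' "$csv" |
    awk -F, 'NR > 1 && /^K/ { printf "- %s (ID %s)\n", $1, $7; }')

echo "$with_id"
```

With this sample, the only match is Kyoto, so the output is - Kyoto (ID 81283).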