Ego Pontem

Learn | Nov 20 2024

Staying in touch. Kyoto, Japan. August 2023.

Tech writers sometimes need to get their hands dirty. Sometimes they want to but can’t. How would you a convert a list of regional data centres in CSV to an unordered list in markdown? CSV and markdown are both text formats, so you could use an editor to search and replace and copy and paste. That’s especially tedious if you have to do all that work again when the CSV file changes.

Or you could write your own damned script. It’s not that hard and once you learn some basics you’ll be surprised how useful your new skills are in other situations.

What you’ll learn

You’ll learn something practical: convert CSV to markdown. The goal here isn’t so much to give you foundational knowledge as to shine a bit of light on some practical info to make your day-to-day tech writing less tedious.

Step 0: You’ll need a command line

Is this your first time on the command line? Take a guided tour first.

Step 1: Make a directory for your project

What we call a folder in a GUI is called a directory in the command line. Folder and directory are synonymous, so we’ll stick with directory.

Let’s make a directory for your project. You’ll put the files you’ll work on in there.

Do this: Start Tech Writer Tools.

What happened: The shell shows a prompt to let you know that it’s ready for your next command:

techwriter:~$ ▮

The prompt shows your username (which is techwriter in this sandbox), the current directory (which we’ll get into in a moment), and ends with a dollar sign, $.

The shell’s shorthand notation for your home direcory is a tilde (~). According to the prompt, then, your current directory (~) is your home directory.

What exactly is your home directory? Conceptually, it’s the directory where you keep your documents, photos, and other files. Your computer has other directories, like the directory for your computer’s operating system. It’s always a good idea to keep your personal and professional files in your home directory.

Where exactly is your home directory? Let’s find out.

Do this: Type pwd then press Enter.

techwriter:~$ pwd
/home/techwriter
techwriter:~$ ▮

What happened: The pwd command tells the shell to output the full path of your present working directory, also known as the current directory.

In this case, you’re in your home directory, /home/techwriter. Note that this is your home directory in Tech Writer Tools, not your regular home directory on your computer.

Now let’s create a new directory in your home directory.

Do this: Type mkdir datacenters then press Enter.

techwriter:~/$ mkdir datacenters
techwriter:~/$ ▮

What happened: The mkdir command makes a directory with the name you specify. As you can see, it doesn’t give any indication about its work unless there’s an error. Since there was no problem making the new directory, you just get another prompt.

Also notice that, according to your prompt, your current directory is still your home directory. The mkdir command creates directories but doesn’t change the current directory.

Let’s see the what’s in the current directory.

Do this: Enter ls. (From now on, when you’re asked to “Enter” a command, type the command then press Enter.)

techwriter:~/$ ls
datacenters welcome.txt
techwriter:~/$ ▮

What happened: By itself, the ls command outputs the names of the files and directories in the current directory. The other name, welcome.txt in this case, is another other file in the same directory as your new datacenters directory.

Let’s go to the new directory, which makes it the new, current directory.

Do this: Enter cd datacenters.

techwriter:~/$ cd datacenters
techwriter:~/datacenters$ ▮

What happened: Your current directory is now datacenters, as shown by the prompt. You don’t have to, but you can confirm that with pwd. Let’s do it anyway.

Do this: Enter pwd.

techwriter:~/datacenters$ pwd
/home/techwriter/datacenters
techwriter:~/datacenters$ ▮

For the rest of this lesson, we’ll just show the $ prompt, not the full prompt.

Step 2: Get the CSV file

Now lets get some CSV to work with. Your spreadsheet might look like this:

Screenshot of a spreadsheet with data centres

The eventual result we want from this spreadsheet is a file named output.md:

There's a data centre ready to serve storage,
compute, or fresh donuts to our customers
around the globe:

- Catania
- Geneva
- Kyoto
- La Plata
- Montreal

Please contact our Sales department for more info.

I’ve exported this spreadsheet as a CSV file:

city,state-prov,country,storage,compute,donut,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902

Do this: Copy the CSV text above into the clipboard.

Do this: Enter nano input.csv.

This starts the nano editor and opens a file named input.csv.

Do this: Use the Terminal app to paste the clipboard into nano.

What happened: Your Terminal app’s window should look a lot like this:

Do this: Press Ctrl + S. In other words, hold down the Control key, press the S key, then let go of both keys.

What happened: You’ve just saved your CSV data in the input.csv file.

Do this: Press Ctrl + X to leave nano.

You’ll be back at the prompt again.

$ ▮

You can check to see that your CSV file is in your project directory and has the correct contents.

Do this: Enter cat input.csv.

$ cat input.csv
city,state-prov,country,storage,compute,donut,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902
$ ▮

What happened: The cat tool outputs the contents of the file you specify.

Step 3: Let’s get awking

We’ll write our csv-to-markdown script incrementally. By “incrementally” I mean that we’ll start at the simplest possible place then gradually build up a solution that solves our problem. This is a natural way to do it, the command line makes it easy to interact and iterate.

Let’s create the simplest awk program, an empty file.

Do this: Enter touch datacenters.awk.

What happened: Nothing, except that the touch command created a new, empty file named datacenters.awk. Don’t believe me? Look for yourself.

Do this: Enter ls -l.

$ ls -l
total 4
-rw-r--r--  1 techwriter  techwriter    0 Nov 17 10:51 datacenters.awk
-rw-r--r--  1 techwriter  techwriter  259 Nov 17 10:51 input.csv
$ ▮

What happened: The -l, a hyphen followed by a lowercase L in ls -l, is an option. This option tells the ls tool to list files in long format. You can ignore most of this output, but take a look at the column with 0 and 259. This is the column for file sizes. Notice how datacenters.awk has 0 bytes, it is indeed empty.

Now let’s see this script in action.

Do this: Enter gawk --csv -f datacenters.awk input.csv

$ gawk --csv -f datacenters.awk input.csv
$ ▮

What happened: Excellent. Nothing happened. That’s what we expected, after all, because our awk script is empty.

Let’s take a closer look at the gawk command you entered. You can probably figure out what it means:

gawk: This is the command to run.
--csv: This option tells gawk that we’re working with CSV input. Notice that there are two hyphens instead of one. This is called a long option because it’s more than one letter long. Normal options are a single letter or digit and preceded by a single hyphen.
-f datacenters.awk: The -f option specifies the script for gawk to use. An awk script can have any file name, but we end the file’s name with .awk, the conventional file name extension for awk scripts.
input.csv: This is our input file for our awk script.

A couple of things to keep in mind:

In a command line, the shell expects items, like options and file names in a command, to be separated by spaces. So, as a beginner at least, avoid putting spaces in your file names. The command line has the ability to handle spaces in file names, but it can get complicated so we won’t cover that here.
Uppercase and lowercase matter in Linux and Unix-y systems. To keep things simple, we’ll use lowercase letters as much as possible, including file names.

Step 4: The identity script

Now we’ll edit our awk script to make it do something, more or less. Well, more less than more. We’ll create an identity script.

In mathematics, the identity function returns the value that you give it. In other words, it doesn’t do anything more than repeat what you tell it. In the command line, an identity script outputs its input. How is that useful? It isn’t immediately useful, but it’s a good starting point to build on.

Do this: Enter nano datacenters.awk then copy and paste the following. It’s just a single line. Make sure you end it by pressing Enter.

{ print; }

Do this: Press Ctrl + S then Ctrl + X to save datacenters.awk and quit nano.

It’s a simple script, right? Let’s unpack it.

Awk works in a straightforward way. It reads its input one line at a time. For each line, it checks to see if there’s anything to do. If there is, awk does it.

How does awk know what to do? That’s what an awk script is for. An awk script is pretty straightfoward. It’s organized into pattern-action pairs:

pattern: For each input line, awk checks the script for any patterns that match the input line.
action: For each pattern that is true for the line, awk performs the pattern’s action.

What we’ve done is create a single pattern and action in our script.

{ print; }

Actually, you can’t see the pattern because we’re relying on the default pattern, also called the empty pattern. The empty pattern is always true for every line.

An action is wrapped in { and }. The default action is to do nothing. But we want our action to repeat the line that we’re currently processing. That’s what the print statement does. The default for the print statement is to print the matching line. The statement ends with a semi-colon (;). We use semi-colons to separate statements in an action. This is optional when there’s only one statement in an action, but we put it here as a good habit.

So our simple datacenters.awk script has a single pattern-action that matches all lines and outputs them.

Now let’s run our command again to see if it really is the identity script.

Do this: In the command line, enter gawk --csv -f datacenters.awk input.csv

$ gawk --csv -f datacenters.awk input.csv
city,state-prov,country,storage,compute,donut,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902
$ ▮

What happened: There you go. Our identity script does what we expect.

Step 5: Pick a specific column

We want our script to only output contents of the city column, which is the first column. To do this, we give the print statement an argument that specifies this column.

Do this: Enter nano datacenters.awk, make the following change, then save and quit nano:

{ print $1; }

The $1 argument is a dollar sign ($) followed immediately by the number 1. This argument for the print statement specifies the first column, our city column.

Do this: In the command line, enter gawk --csv -f datacenters.awk input.csv.

$ gawk --csv -f datacenters.awk input.csv
city
Catania
Geneva
Kyoto
La Plata
Montreal
$ ▮

Step 6: Format for markdown

We’re getting closer! Let’s format our output as an unordered list in markdown. In markdown, each list item in an unordered list starts with a hyphen, followed by a space, then the text for the item.

Do this: Open datacenters.awk in nano, make the following change, then save and quit.

{ print "- " $1; }

We kept the $1 but we’ve preceded it by a new argument, "- ". This new argument is a text string, often called just a string. A string is a piece of text that should be used literally. In awk, we specify a string by wrapping it in straight quotes.

In this case, we want each line of output to start with a hyphen and a space followed by the value of the first column in the input line.

Do this: In the command line, enter gawk --csv -f datacenters.awk input.csv

$ gawk --csv -f datacenters.awk input.csv
- city
- Catania
- Geneva
- Kyoto
- La Plata
- Montreal
$ ▮

Look at that, you’ve converted CSV to markdown! Now we can put some finishing touches to get the final output we’re after.

Step 7: Ignore the first line

You’ve probably been annoyed by it by now, the column name, city, is in the first line of our output. Want we want to do is ignore this first line in the input so it doesn’t show up in the output. We can do this with a new pattern-action.

Do this: Edit datacenters.awk with the following, then save and exit nano:

NR == 1 { next; }
{ print "- " $1; }

You already know that the 2nd line in our script does. Let’s take a look at the new, first line.

Unlike the 2nd line in our script, this new pattern-action has an explicit pattern, NR == 1. It uses awk’s built-in variable named NR. Its value is the number of the input line that awk is currently processing. For the first line of input, NR’s value is 1. So that’s what we check for. NR == 1 means “Is NR’s value equal to 1?” When this pattern is true, awk does its action.

Yes, that’s a double equal sign (==), which is a test for equality. A single equal sign means something else entirely, which we won’t get into here.

The action for this pattern is the next statement, which tells awk to stop looking for more matching patterns for this line and move on to the next line. Notice that we put this pattern-action at the beginning of our script. We don’t want awk to process any other actions when NR is 1.

Do this: In the command line, enter gawk --csv -f datacenters.awk input.csv

$ gawk --csv -f datacenters.awk input.csv
- Catania
- Geneva
- Kyoto
- La Plata
- Montreal
$ ▮

There. Our markdown output shows just the cities without the first line.

Step 8: Beginning and end

We want our output to have text before and after the list of cities. For that we add a couple of new pattern-action pairs. Take your time with this one, it’s our biggest change to our script so far.

Do this: Edit datacenters.awk in nano with the following, then save and quit.

# Convert datacenter csv to markdown

BEGIN {
        print "There's a data centre ready to serve storage,"
        print "compute, or fresh donuts to our customers"
        print "around the globe:";
        print "";
}

END {
        print "";
        print "Please contact our Sales department for more info.";
}

NR == 1 {
        next;
}

{
        print "- " $1;
}

There are a few new things going on here:

The first line starts with #. Awk ignores everything after the # until the end of the line that it’s on. This is a comment. A comment is for people, not awk. Use comments to remind yourself, and others, of what the script does, tricky parts that aren’t obvious, and so on.
We’ve organized the actions a little differently. The opening { and } aren’t on the same line, and statements in the actions are indented. We do this to make it easier to read the script and organize the pattern-action pairs. Awk really doesn’t care where we put the spaces.
The BEGIN and END patterns do exactly what you expect. The BEGIN pattern is true only before Awk has read the first line of input, END is true only after it has read the last line of input.
Notice the print ""; statements. This prints an empty line so we can separate the paragraphs from the unordered list. We use "" as an argument to print so that it doesn’t output its default value.

Do this: In the command line, enter gawk --csv -f datacenters.awk input.csv

$ gawk --csv -f datacenters.awk input.csv
There's a data centre ready to serve storage,
compute, or fresh donuts to our customers
around the globe:

- Catania
- Geneva
- Kyoto
- La Plata
- Montreal

Please contact our Sales department for more info.
$ ▮

Step 9: Filter for donuts

The Product Manager’s barber’s plumber’s cousin wants datacenters listed only if they offer fresh donuts. You’ll have to update the markdown.

Guess what, you can do that easily. Let’s take look at our CSV data to figure this out.

Do this: Enter head -1 input.csv.

$ head -1 input.csv
city,state-prov,country,storage,compute,donut,id
$ ▮

What happened: The head tool outputs the first lines of its input. In this case, we use the option -1 (that’s a hyphen with a number 1) to specify just the first line, which contains the names of the columns.

The donut column is the 6th column. We’ll use this to update our script with a new pattern for outputting markdown list items.

Do this: Edit datacenters.awk in nano with the following, then save and quit nano:

# Convert datacenter csv to markdown

BEGIN {
        print "There's a data centre ready to serve donuts";
        print "to our customers around the globe:";
        print "";
}

END {
        print "";
        print "Please contact our Sales department for more info.";
}

NR == 1 {
        next;
}

$6 == "TRUE" {
        print "- " $1;
}

Here’s what we did:

In the BEGIN pattern, we revised the intro text.
We replaced the default pattern for the action that outputs list items with a pattern that checks to see if the donut column is equal to TRUE.

Let’s see if we get what we expect.

Do this: In the command line, enter gawk --csv -f datacenters.awk input.csv

$ gawk --csv -f datacenters.awk input.csv
There's a data centre ready to serve donuts
to our customers around the globe:

- Catania
- Geneva
- Kyoto
- La Plata

Please contact our Sales department for more info.
$ ▮

You did it!

Congratulations, you wrote a script that generates markdown from CSV.

What you learned

The shell is highly interactive. Take advantage of that to solve your problem. Start with an identity script then incrementally add to it until it does what you want.
Create directories with mkdir, change the current directory with cd, see what the current directory is with pwd, list directory contents with ls.
You can output the contents of a file with cat.
You can create new, empty files with touch.
You can see the first few lines of a file with head.
Awk processes tabular text files, where every line is a record, and every record contains columns. In a CSV file, each row is a line of text and each line of text contains columns that are separated with a comma (,).
An awk script is organized into pattern-action pairs. The default pattern is to match every line. The default action is to do nothing.