Tech writers sometimes need to get their hands dirty. Sometimes they want to but can’t. How would you convert a list of regional data centres in CSV to an unordered list in markdown? CSV and markdown are both text formats, so you could use an editor to search and replace and copy and paste. That’s especially tedious if you have to do all that work again when the CSV file changes.
Or you could ask a developer to rig up a script for you. You’d have to bribe the developer’s manager for it and wait a few weeks. But then you could run the script yourself whenever you need to.
Or you could write your own damned script. It’s not that hard and once you learn some basics you’ll be surprised how useful your new skills are in other situations.
You’ll need a command line
First, you’ll need access to a computer that has a command line and related tools.
You get a command line by running a terminal application. For example, on macOS and Windows it’s called, well, Terminal. On Linux and Unix-y systems, run whatever terminal thing it has.
Do this: Install Tech Writer Tools.
Tech Writer Tools comes with a bunch of tools. Here are the ones you’ll be using:
GNU Bash is a shell, the most popular one. A shell is a tool itself. It takes your commands and runs them. It’s the tool that you interact with to run other tools.
GNU nano is a text editor. It’s not a word processor, so something like Microsoft Word won’t do. For one thing, nano works in a terminal.
Awk. Yeah, awk. It’s a command-line tool for processing text files. It’s really good with tabular text files. Tabular is just another way to say that each row is a line of text and each column is marked with a separator. For example, in a CSV file, the separator is a comma.
We’ll actually be using gawk, which is GNU’s implementation of awk. It’s probably the most popular awk out there.
The Tech Writer Tools sandbox
For this lesson, don’t worry about breaking things because there’s nothing to break. By default, Tech Writer Tools acts like a sandbox. Anything you do in it doesn’t affect the rest of your computer.
Be careful, though. This also means that whatever you do in Tech Writer Tools gets wiped clean when you stop Docker Desktop.
There’s an easy way to save your work automatically, but we’ll keep Tech Writer Tools as a sandbox for now.
Step 1: Make a directory for your project
Do this: Start Tech Writer Tools.
When the shell is ready for your next command, it shows a prompt.
The prompt ends with a dollar sign, $. When you’re inside Tech Writer Tools, your username is techwriter and your prompt will look like this:
techwriter:~/$ ▮
For the rest of this lesson, we’ll just show the $ prompt, not the full prompt.
For your first command, let’s make a folder for your project. You’ll put the files you’ll work on in there.
Do this: Type mkdir datacenters then press Enter.
$ mkdir datacenters
$ ▮
The mkdir command makes a folder with the name you specify. As you can see, it doesn’t give any indication about its work unless there’s an error. Since there was no problem making the new folder, you just get another prompt. You can see the result of its work by seeing what’s in the current directory.
Do this: Enter ls. (From now on, when you’re asked to “Enter” a command, type the command then press Enter.)
$ ls
datacenters welcome.txt
$ ▮
The ls command outputs the names of the files and directories in the current directory. The other name, welcome.txt in this case, is another file in the same folder as your new datacenters folder.
Let’s go to the new directory, which makes it the new, current directory.
Do this: Enter cd datacenters.
$ cd datacenters
Do this: Enter pwd.
$ pwd
/home/techwriter/datacenters
$ ▮
The cd command changes the current directory. The pwd command outputs the current directory’s full path, which is /home/techwriter/datacenters.
Step 2: Get the CSV file
Now let’s get some CSV to work with. Your spreadsheet might look like this:
The eventual result we want from this spreadsheet is a file named output.md:
There's a data centre ready to serve storage,
compute, and fresh donuts for our customers
around the globe:

- Catania
- Geneva
- Kyoto
- La Plata
- Montreal

Please contact our Sales department for more info.
I’ve exported this spreadsheet as a CSV file:
city,state-prov,country,storage,compute,donut,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902
Do this: Copy the CSV text above into the clipboard.
Do this: Enter nano input.csv.
This starts the nano editor and opens a file named input.csv.
Use the terminal app to paste the clipboard into nano.
Do this: Press Control-S. In other words, hold down the Control key, press the S key, then let go of both keys.
You’ve just saved your CSV data in the input.csv file.
Do this: Press Control-X to leave nano.
You’ll be back at the prompt again.
$ ▮
You can check to see that your CSV file is in your project directory and has the correct contents.
Do this: Enter cat input.csv.
$ cat input.csv
city,state-prov,country,storage,compute,donut,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902
$ ▮
The cat tool outputs the contents of the file you specify. You could have also used the less tool. It shows a file’s contents one screenful at a time. You can go up or down a screenful with the Page Up and Page Down keys. You can return to the prompt by pressing the q key (lowercase), for “quit”.
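If pasting into nano is awkward in your terminal, a here-document is another way to get the same text into a file. This is only a sketch and not part of the lesson's steps. The file name sample.csv is made up so that your input.csv stays untouched, and the > and < signs are redirection, which Step 10 explains.

```shell
# Write everything between the two EOF markers into sample.csv
# (a made-up file name for this sketch).
# The quotes around 'EOF' tell the shell to leave the text exactly as typed.
cat > sample.csv <<'EOF'
city,state-prov,country,storage,compute,donut,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902
EOF

# Count the lines to check that all six made it into the file.
wc -l < sample.csv
```

If the count comes back as 6, the header line and all five data centres are in the file.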
Step 3: Let’s get awking
We’ll write our csv-to-markdown script incrementally. This is a natural way to do it because the command line makes it easy to interact and iterate.
Let’s create the simplest awk program, an empty file.
Do this: Enter touch datacenters.awk.
What happened? Nothing, except that the touch command created a new, empty file named datacenters.awk.
Do this: Enter ls -l.
$ ls -l
total 4
-rw-r--r-- 1 techwriter techwriter 0 Nov 17 10:51 datacenters.awk
-rw-r--r-- 1 techwriter techwriter 259 Nov 17 10:51 input.csv
$ ▮
The -l, a hyphen followed by a lowercase L, in ls -l is an option. This option tells the ls tool to list files in long format. You can ignore most of this output, but take a look at the column with 0 and 259. This is the column for file sizes. Notice how datacenters.awk has 0 bytes: it is indeed empty.
Now let’s see this script in action.
Do this: Enter gawk --csv -f datacenters.awk input.csv
$ gawk --csv -f datacenters.awk input.csv
$ ▮
Excellent. Nothing happened. That’s what we expected, after all, because our awk script is empty.
Let’s take a closer look at the gawk command you entered. You can probably figure out what it means:
- gawk: This is the command to run.
- --csv: This option tells gawk that we’re working with CSV input. Notice that there are two hyphens instead of one. This is called a long option because it’s more than one letter long. Short options are a single letter or digit preceded by a single hyphen.
- -f datacenters.awk: The -f option specifies the script for gawk to use. An awk script can have any file name, but we end the file’s name with .awk, the conventional file name extension for awk scripts.
- input.csv: This is our input file for our awk script.
A couple of things to keep in mind:
- A command line expects items, like options and file names in a command, to be separated by spaces. In other words, as a beginner at least, avoid putting spaces in your file names. The command line has the ability to handle spaces in file names, but it can get complicated so we won’t cover that here.
- Uppercase and lowercase matter in Linux and Unix-y systems. To keep things simple, we’ll use lowercase letters as much as possible, including file names.
Step 4: The identity script
Now we’ll edit our awk script to make it do something, more or less. Well, more less than more. We’ll create an identity script.
In mathematics, the identity function returns the value that you give it. In other words, it doesn’t do anything more than repeat what you tell it. In the command line, an identity script outputs its input. How is that useful? It isn’t immediately useful, but it’s a good starting point to build on.
Do this: Enter nano datacenters.awk then copy and paste the following. It’s just a single line. Make sure you end it by pressing Enter.
{ print; }
Do this: Press Control-S then Control-X to save datacenters.awk and quit nano.
It’s a simple script, right? Let’s unpack it.
Awk works in a straightforward way. It reads its input one line at a time. For each line, it checks to see if there’s anything to do. If there is, awk does it.
How does awk know what to do? That’s what an awk script is for. An awk script is organized into pattern-action pairs. For each input line, awk checks the script for any patterns that match the input line. For each pattern that is true for the line, awk performs the pattern’s action.
What we’ve done is create a single pattern and action in our script.
{ print; }
Actually, you can’t see the pattern because we’re relying on the default pattern, also called the empty pattern. The empty pattern is always true for every line.
An action is wrapped in { and }. An action with nothing between the braces does nothing. But we want our action to repeat the line that we’re currently processing. That’s what the print statement does. The default for the print statement is to print the matching line. The statement ends with a semi-colon (;). We use semi-colons to separate statements in an action. This is optional when there’s only one statement in an action, but we put it here as a good habit.
So our simple datacenters.awk script has a single pattern-action that matches all lines and outputs them.
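You can also try tiny awk programs without a script file by putting them in single quotes on the command line. The sketch below is an illustration only. It assumes a plain awk command is available, and sample.csv is a made-up file name. It compares print with no argument, print $0 ($0 is awk's name for the whole current line), and an action with nothing in it.

```shell
# Make a tiny two-line file to experiment with (made-up name).
printf 'Catania,IT\nKyoto,JP\n' > sample.csv

# print with no argument outputs the whole current line.
awk '{ print; }' sample.csv

# $0 means "the whole current line", so this does the same thing.
awk '{ print $0; }' sample.csv

# An action with nothing in it does nothing, so this prints nothing.
awk '{ }' sample.csv
```

The first two commands each output both lines of the file. The third outputs nothing at all.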
Now let’s run our command again to see if it really is the identity script.
Do this: In the command line, enter gawk --csv -f datacenters.awk input.csv
$ gawk --csv -f datacenters.awk input.csv
city,state-prov,country,storage,compute,donut,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902
$ ▮
There you go. Our identity script does what we expect.
Step 5: Pick a specific column
We want our script to only output contents of the city column, which is the first column. To do this, we give the print statement an argument that specifies this column.
Do this: Enter nano datacenters.awk, make the following change, then save and quit nano:
{ print $1; }
The $1 argument for the print statement specifies the first column, our city column.
Do this: In the command line, enter gawk --csv -f datacenters.awk input.csv
$ gawk --csv -f datacenters.awk input.csv
city
Catania
Geneva
Kyoto
La Plata
Montreal
$ ▮
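The same idea works for any column: $2 is the second column, $3 is the third, and so on. Here's a small sketch to play with. It makes a couple of assumptions that aren't in the lesson: it uses plain awk with -F, (which says "columns are separated by commas") instead of gawk --csv, which is fine here because none of the values contain commas, and it uses a made-up file named sample.csv.

```shell
# A cut-down copy of the data (made-up file name).
cat > sample.csv <<'EOF'
city,state-prov,country
Catania,Sicily,IT
Kyoto,Kyoto,JP
EOF

# $3 is the third column, which is the country column.
awk -F, '{ print $3; }' sample.csv
```

This outputs country, IT, and JP, each on its own line.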
Step 6: Format for markdown
We’re getting closer! Let’s format our output as an unordered list in markdown. Each list item in an unordered list starts with a hyphen, followed by a space, then the text for the item.
Do this: Open datacenters.awk in nano, make the following change, then save and quit.
{ print "- " $1; }
We’ve given print two values side by side: a string containing the beginning of a list item in markdown (a hyphen and a space), then the value of our first column. When values sit next to each other like this, awk joins them into a single string before printing it. Notice that we wrapped the string in double quotes.
Do this: In the command line, enter gawk --csv -f datacenters.awk input.csv
$ gawk --csv -f datacenters.awk input.csv
- city
- Catania
- Geneva
- Kyoto
- La Plata
- Montreal
$ ▮
Look at that, you’ve converted CSV to markdown! Now we can put some finishing touches to get the final output we’re after.
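It can help to see the difference between putting values side by side and separating them with a comma in a print statement. This is a sketch under a couple of assumptions: plain awk with -F, to split on commas rather than gawk --csv, and a made-up file named sample.csv.

```shell
# One line of data is enough for this experiment (made-up file name).
printf 'Catania,Sicily\n' > sample.csv

# Side by side: awk glues the values together with nothing in between.
awk -F, '{ print $1 $2; }' sample.csv     # CataniaSicily

# Separated by a comma: awk puts a space between the values.
awk -F, '{ print $1, $2; }' sample.csv    # Catania Sicily
```

That's why the script spells out the space inside the quotes in "- ". With values side by side, awk adds nothing on its own.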
Step 7: Ignore the first line
You’ve probably been annoyed by it by now: the column name, city, is in the first line of our output. What we want to do is ignore this first line in the input so it doesn’t show up in the output. We can do this with a new pattern-action.
Do this: Edit datacenters.awk with the following, then save and exit nano:
NR == 1 { next; }
{ print "- " $1; }
You already know what the 2nd line in our script does. Let’s take a look at the new, first line. Unlike the 2nd line in our script, this new pattern-action has an explicit pattern, NR == 1. It uses awk’s built-in variable named NR. Its value is the number of the input line that awk is currently processing. For the first line of input, NR’s value is 1. So that’s what we check for. NR == 1 means “Is NR’s value equal to 1?” When this pattern is true, awk does its action.
The action for this pattern is the next statement, which tells awk to stop looking for more matching patterns for this line and move on to the next line. Notice that we put this pattern-action at the beginning of our script. We don’t want awk to process any other actions when NR is 1.
Do this: In the command line, enter gawk --csv -f datacenters.awk input.csv
$ gawk --csv -f datacenters.awk input.csv
- Catania
- Geneva
- Kyoto
- La Plata
- Montreal
$ ▮
There. Our markdown output shows just the cities without the first line.
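If NR still feels abstract, you can ask awk to show its value for each line. A small sketch, assuming plain awk with -F, for comma-separated columns (not gawk --csv) and a made-up file named sample.csv:

```shell
# A cut-down copy of the data (made-up file name).
cat > sample.csv <<'EOF'
city,country
Catania,IT
Kyoto,JP
EOF

# Show NR next to the first column of every line.
awk -F, '{ print NR ": " $1; }' sample.csv

# Same thing, but skip the line where NR is 1.
awk -F, 'NR == 1 { next; } { print NR ": " $1; }' sample.csv
```

The first command outputs 1: city, 2: Catania, and 3: Kyoto. The second outputs only the last two, because next makes awk drop line 1 before it reaches the print action.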
Step 8: Beginning and end
We want our output to have text before and after the list of cities. For that we add a couple of new pattern-action pairs. Take your time with this one because it’s our biggest change to our script so far.
Do this: Edit datacenters.awk in nano with the following, then save and quit.
# Convert datacenter csv to markdown
BEGIN {
    print "There's a data centre ready to serve storage,";
    print "compute, and fresh donuts for our customers";
    print "around the globe:";
    print "";
}
END {
    print "";
    print "Please contact our Sales department for more info.";
}
NR == 1 {
    next;
}
{
    print "- " $1;
}
There are a few new things going on here:
- The first line starts with #. Awk ignores everything after the # until the end of the line that it’s on. This is a comment. A comment is for people, not awk. Use comments to remind yourself, and others, of what the script does, tricky parts that aren’t obvious, and so on.
- We’ve organized the actions a little differently. The opening { and closing } aren’t on the same line, and statements in the actions are indented. We do this to make it easier to read the script and organize the pattern-action pairs.
- The BEGIN and END patterns do exactly what you expect. The BEGIN pattern is true only before awk has read the first line of input. END is true only after it has read the last line of input.
- Notice the print ""; statements. Each one prints an empty line so we can separate the paragraphs from the unordered list. We use "" as an argument to print so that it doesn’t output its default value.
Do this: In the command line, enter gawk --csv -f datacenters.awk input.csv
$ gawk --csv -f datacenters.awk input.csv
There's a data centre ready to serve storage,
compute, and fresh donuts for our customers
around the globe:

- Catania
- Geneva
- Kyoto
- La Plata
- Montreal

Please contact our Sales department for more info.
$ ▮
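You may have noticed that END sits above the other pattern-action pairs in the script even though its text comes out last. That's fine: BEGIN and END run before and after the input no matter where they appear. Here's a small sketch to convince yourself. It assumes a plain awk command, and sample.csv is a made-up file name.

```shell
# Two lines of input are enough (made-up file name).
printf 'Catania\nKyoto\n' > sample.csv

# BEGIN first, END last.
awk '
BEGIN { print "start"; }
      { print "line: " $0; }
END   { print "end"; }
' sample.csv

# Same pairs in the opposite order. The output is the same.
awk '
END   { print "end"; }
      { print "line: " $0; }
BEGIN { print "start"; }
' sample.csv
```

Both commands output start, then one line for each city, then end.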
Someone changed their mind
The Product Manager’s barber’s plumber’s cousin wants datacenters listed only if they offer fresh donuts. You’ll have to update the markdown.
Guess what, you can do that easily. Let’s take a look at our CSV data to figure this out.
Step 9: Filter for donuts
Do this: Enter head -1 input.csv.
$ head -1 input.csv
city,state-prov,country,storage,compute,donut,id
$ ▮
The head tool outputs the first lines of its input. In this case, we use the option -1 (that’s a hyphen with a number 1) to specify just the first line, which contains the names of the columns.
The donut column is the 6th column. We’ll use this to update our script with a new pattern for outputting markdown list items.
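Counting columns by eye gets error-prone with wider spreadsheets, so you could let awk number them for you. This sketch goes a little beyond the lesson: it uses awk's built-in NF variable (the number of columns in the current line) and a for loop, neither of which we've covered. It also assumes plain awk with -F, for comma-separated columns instead of gawk --csv, and a made-up file named sample.csv.

```shell
# Just the header line (made-up file name).
printf 'city,state-prov,country,storage,compute,donut,id\n' > sample.csv

# For the first line only, print each column's number and name.
# NF is the number of columns; $i is the column at position i.
awk -F, 'NR == 1 { for (i = 1; i <= NF; i++) print i ": " $i; }' sample.csv
```

The output runs from 1: city to 7: id, with 6: donut confirming the column number.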
Do this: Edit datacenters.awk in nano with the following, then save and quit nano:
# Convert datacenter csv to markdown
BEGIN {
    print "There's a data centre ready to serve donuts";
    print "for our customers around the globe:";
    print "";
}
END {
    print "";
    print "Please contact our Sales department for more info.";
}
NR == 1 {
    next;
}
$6 == "TRUE" {
    print "- " $1;
}
Here’s what we did:
- In the BEGIN pattern, we revised the intro text.
- We replaced the default pattern for the action that outputs list items with a pattern that checks to see if the donut column is equal to TRUE.
Let’s see if we get what we expect.
Do this: In the command line, enter gawk --csv -f datacenters.awk input.csv
$ gawk --csv -f datacenters.awk input.csv
There's a data centre ready to serve donuts
for our customers around the globe:

- Catania
- Geneva
- Kyoto
- La Plata

Please contact our Sales department for more info.
$ ▮
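If the next request is "donuts and storage", a pattern can check two columns at once by joining the checks with &&, which means "and". This is a sketch, not a step in the lesson. It assumes plain awk with -F, for comma-separated columns instead of gawk --csv, and a made-up file named sample.csv.

```shell
# A copy of the lesson's data (made-up file name).
cat > sample.csv <<'EOF'
city,state-prov,country,storage,compute,donut,id
Catania,Sicily,IT,TRUE,FALSE,TRUE,32848
Geneva,New York,US,FALSE,FALSE,TRUE,28342
Kyoto,Kyoto,JP,FALSE,TRUE,TRUE,81283
La Plata,Buenos Aires,AR,TRUE,TRUE,TRUE,90123
Montreal,Quebec,CA,TRUE,FALSE,FALSE,17902
EOF

# List a city only if storage ($4) AND donut ($6) are both TRUE.
awk -F, '$4 == "TRUE" && $6 == "TRUE" { print "- " $1; }' sample.csv
```

Only Catania and La Plata offer both, so those are the two list items you get.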
Step 10: Generate an output file
So far we’ve seen our output show up on the terminal. That’s handy because we can see immediately if our script is doing what we want it to. You can redirect this output to a file instead, ready to copy or send to whoever or whatever needs it.
Do this: Enter gawk --csv -f datacenters.awk input.csv > output.md.
$ gawk --csv -f datacenters.awk input.csv > output.md
$ ▮
Notice the > output.md we’ve added to the end of our command. The greater-than sign (>) tells the command line to redirect output from the terminal to a file named output.md.
Do this: I’ll leave it to you to figure out if output.md contains what you expect it to.
Actually, > is redirecting from standard output. Standard output is the name for, well, the usual output of a command-line program. By default, standard output goes to the terminal. But you can use > to redirect it to a file.
Surprise! There’s also standard input. Standard input is the usual input to a command-line program, typically that’s you at the keyboard. Some commands, like awk, use standard input if you don’t specify an explicit input file. You can tell the command line to redirect standard input from a file with <.
Do this: Enter gawk --csv -f datacenters.awk < input.csv > output.md.
$ gawk --csv -f datacenters.awk < input.csv > output.md
$ ▮
Notice how we don’t tell gawk which file to use to get its input. Instead, we’ve redirected standard input from input.csv. Since awk has no explicit input file to work with, it uses standard input, which in this case comes from input.csv.
In practice, you’re right to think that this doesn’t make any difference to our input or output in this case. But redirection is a command-line superhero-level power for lots of things that are beyond the scope of this little page.
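One thing worth knowing about > is that it replaces whatever was already in the file. Its sibling >> adds to the end of the file instead. Here's a small sketch of the difference. It assumes a plain awk command, and the file names sample.csv and sample.md are made up.

```shell
# Two cities to work with (made-up file name).
printf 'Catania\nKyoto\n' > sample.csv

# Run the same command twice. The second run replaces the first run's output.
awk '{ print "- " $0; }' < sample.csv > sample.md
awk '{ print "- " $0; }' < sample.csv > sample.md
wc -l < sample.md     # 2 lines, not 4

# >> adds to the end of the file instead of replacing it.
echo "- Montreal" >> sample.md
cat sample.md
```

After the last command, sample.md has three list items: Catania, Kyoto, and Montreal.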
Where to go from here
Here’s what you’ve learned:
- Some rudiments of using the command line.
- How to filter and format CSV into markdown.
Just these skills can solve quite a few problems. Awk alone is quite the Swiss army knife for tabular data.
The Linux and Unix world has a lot of other text-processing tools besides awk. You’ve already used cat and head. There are many others, including tail, sort, and uniq.
Go ahead, explore with your new skills. Try these exercises:
- Fancier printing: The printf statement in awk lets you format text and numbers in more sophisticated ways than print. Change the script to also include the data center’s ID.
- Control statements: The if and switch statements let you check for conditions inside an action. Try either one of them to convert the 2-letter codes in the country column to a full country name. For example, an input of CA should be output as Canada.
- Awk patterns: You can also match text, including regular expressions. Start and end a pattern with a forward slash (/) to match lines that contain the text between the slashes.
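To get you started on that last idea, here's a sketch of slash patterns in action. It assumes plain awk with -F, for comma-separated columns instead of gawk --csv, and sample.csv is a made-up file name.

```shell
# A cut-down copy of the data (made-up file name).
cat > sample.csv <<'EOF'
city,state-prov,country
Catania,Sicily,IT
Kyoto,Kyoto,JP
La Plata,Buenos Aires,AR
EOF

# Match every line that contains the text "Kyoto".
awk -F, '/Kyoto/ { print "- " $1; }' sample.csv    # - Kyoto

# In a regular expression, ^ means "at the start of the line".
awk -F, '/^La/ { print "- " $1; }' sample.csv      # - La Plata
```

A slash pattern looks at the whole line, not just one column, so /Kyoto/ would also match a line where Kyoto appears only in the state-prov column.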