This article describes a collection of shell tools that I find useful for day-to-day MLOps or cloud development: fd, bat, ripgrep, fzf, jq, and perl. (You can close this page if you are already familiar with them.) These tools cover file search, file traversal, file peek, text search, text transformation (filtering and substitution), and JSON processing. The sections below describe how these tools work together to accomplish common tasks, along with some opinionated good practices.

This tool set is especially useful for cloud development. Cloud-based work delegates the heavy computing to the cloud, while most of your local work is processing text, either structuring data or reformatting commands. This workflow usually involves a quick peek into a file followed by a pipeline to parse and transform text (a local ETL pipeline, so to speak), so a handy toolkit is very useful. The standard cat/find/grep/sed/awk tools are not always satisfactory, and the tools below are the enhancements.

File traversal: fd, bat, fzf

These are newer tools that either replace their traditional counterparts or simply make your life easier. fd is a drop-in replacement for find, bat for cat, while fzf is a fuzzy-finder add-on. They all provide highlighted output, along with performance boosts and sensible defaults for common usage. They are well known, so I have nothing special to add here.
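A typical combination might look like the following sketch (file names and patterns are illustrative):

# find YAML files whose name contains "config", limited to two directory levels
fd -e yaml --max-depth 2 config
# peek at a file with syntax highlighting and line numbers
bat some_config.yaml
# fuzzy-pick a file interactively and open it in your editor
${EDITOR:-vim} "$(fzf)"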

Text search: ripgrep

ripgrep is a Rust-based replacement for grep, and it is very fast. I like that it prints file names and line numbers alongside the matched lines by default. My common usage pattern is to find a code snippet and its context within certain files:

# non-recursive search of 'def' within .py source files in the current folder, along with 2 lines before and after
rg --max-depth 1 -g '*.py' -A2 -B2 'def'
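A related pattern is listing only the files that contain a match and feeding them to another command (the pattern and glob here are illustrative):

# list only matching files, then view them with syntax highlighting
rg -l 'TODO' -g '*.py' | xargs bat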

Text transformation: jq and perl

jq parses JSON nicely and is almost a required tool if you work with the cloud, so nothing special to offer here either.
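As a minimal sketch, extracting a nested field from a JSON stream looks like this (the input shape is illustrative):

# pull a single nested field out of a JSON document
echo '{"name": "job-1", "status": {"phase": "Running"}}' | jq -r '.status.phase'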

For text processing pipelines the well-known go-to tool set is the combination of grep/sed/awk. However, I prefer perl for its versatility and its support for full regex syntax. It also comes preinstalled on most Unix systems, so to me it is an absolute gain compared to the others.

perl is used for "one-liners" here, as opposed to running a standalone script. The most common options are -e, -n, -p, and -F. -e supplies the program on the command line and is therefore required for all one-liners. We then choose either -ne or -pe (see below). -F sets the field separator for autosplit, which puts the fields of each line into the @F array, and is used for column processing (i.e. the awk usage scenario).
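For the column-processing case, a minimal sketch looks like this (the field index and delimiter are illustrative):

# print the second whitespace-separated field of each line
some_commands | perl -lane 'print $F[1]'
# the same, but with comma-separated fields
some_commands | perl -F, -lane 'print $F[1]'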

There are two main patterns for perl processing: "filter" and/or "map". If you need to filter the stdin stream by a pattern, you use -ne, which gives you detailed control over the operation (e.g. filter, then print):

some_commands | perl -ne '/match1/ && /match2/ && print'  # or perl -ne '!/match/ && print'

If you are just transforming the stdin stream, you use -pe, which adds some syntactic sugar under the hood: every line is printed after your code runs, so you can directly provide the search-and-substitution expression. However, because every line gets printed, you cannot filter lines by pattern in this mode:

some_commands | perl -pe 's/match/replace/'
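A couple of variations I reach for often (the patterns and file name are illustrative):

# global, case-insensitive substitution
some_commands | perl -pe 's/match/replace/gi'
# edit a file in place instead of reading from stdin
perl -pi -e 's/match/replace/g' some_file.txt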

To me it is easy to confuse the options and regex flavors of grep/sed/awk, whereas perl gives you one consistent syntax. I am not sure why perl is not as popular in shell pipelines as I would expect.
