Data Science on the Command Line
As data sets get larger and more prevalent, researchers are having to do a lot more of the legwork when it comes to core programming, and so spend more time with tools like Git and Linux (something we rarely had to do before!).
For the software engineers reading this post: you probably won’t find the following super useful, but as someone who’s been through those early self-taught days as a junior researcher, I feel the pain of budding Data Scientists and ML researchers!
Given all that, I thought about which commands I use daily and which commands I wished I had known earlier. So from that, I now present my top 5 Linux commands that have helped me in my career!
Command 1: grep
grep sounds like the noise frogs make, but it actually stands for "global regular expression print". That long phrase doesn’t make much sense at first, but the essential use case for the grep command is to search for a particular string in a given file.
The command is fairly quick and incredibly helpful when you’re trying to diagnose an issue on your production box, for example when you suspect a TXT file has some bad data.
As an example, say we’re searching for the string 'this' in any file whose name begins with demo_:
$ grep "this" demo_*
demo_file:this line is the 1st lower case line in this file.
demo_file:Two lines above this line is empty.
demo_file:And this is the last line.
demo_file1:this line is the 1st lower case line in this file.
demo_file1:Two lines above this line is empty.
demo_file1:And this is the last line.
Not so bad, huh? We can see on the left-hand side that there are two files that begin with demo (demo_file and demo_file1), each followed by the lines that contain the string.
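To tie this back to the bad-data scenario, here is a minimal sketch of a few grep options that tend to be handy for that kind of check. The file name demo_data.txt and its contents are made up for illustration, and treating "NaN" as the bad value is just an assumption.

```shell
# Build a small throwaway file so the example is self-contained.
workdir=$(mktemp -d)
printf 'id,value\n1,10\n2,NaN\n3,30\n4,nan\n' > "$workdir/demo_data.txt"

# -n prefixes each matching line with its line number.
grep -n "NaN" "$workdir/demo_data.txt"

# -i ignores case, -c prints only the number of matching lines.
bad_rows=$(grep -i -c "nan" "$workdir/demo_data.txt")
echo "rows containing NaN (any case): $bad_rows"

# -v inverts the match, keeping only the lines that do NOT contain it.
grep -i -v "nan" "$workdir/demo_data.txt" > "$workdir/clean.txt"
```

Counting first with -c and only then filtering with -v is a cheap way to see how much data you are about to drop before you drop it.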
Command 2: wget
Now we move on to something a little more sophisticated, but still something we use quite a lot. The wget command is a useful utility for downloading files from the internet. It is non-interactive, meaning it can work in the background, so it can be used in scripts and cron jobs.
The utility is called as follows:
wget <URL> -O <file_name>
Where the following is an example if we wanted to download a file:
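As a hedged sketch of how this might look inside a script or cron job: the URL and output file name below are hypothetical placeholders, not taken from this post. The --tries and --timeout options bound how long an unattended job can wait, and the if guard lets the script react when the download fails.

```shell
# Hypothetical URL and file name, for illustration only.
url="https://example.com/data.csv"
out_file="data.csv"

# -q silences progress output; -O names the local file.
# --tries and --timeout (in seconds) stop an unattended job from waiting indefinitely.
if wget -q --tries=1 --timeout=10 "$url" -O "$out_file"; then
    status="saved $out_file"
else
    status="download failed"
fi
echo "$status"
```

Checking the exit status matters in cron jobs in particular, since nobody is watching the terminal when the download breaks.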
Command 3: wc
Often you have a file of arbitrary length and something smells fishy: maybe the size of the file seems too small for the number of rows you expect, or maybe you’re just curious how many words are in it. Either way, you want to inspect it a bit more and need a command to do so.
The wc command helps out here, as it counts a few different things for the file in question:
# wc --help
Usage: wc [OPTION]... [FILE]...
-c, --bytes print the byte counts
-m, --chars print the character counts
-l, --lines print the newline counts
-L, --max-line-length print the length of the longest line
-w, --words print the word counts
--help display this help and exit
--version output version information and exit
So, say we want to count the number of lines in a file:
wc -l tecmint.txt
or maybe the number of characters:
wc -m tecmint.txt
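For the "does this file have the number of rows I expect" check, a small sketch along these lines may help. The CSV contents are made up for illustration, and the file is assumed to have exactly one header line.

```shell
# Throwaway CSV: one header line plus two data rows.
tmpfile=$(mktemp)
printf 'id,value\n1,10\n2,20\n' > "$tmpfile"

# Reading from stdin (<) makes wc print only the number, without the file name.
# tr strips the padding that some wc implementations add.
line_count=$(wc -l < "$tmpfile" | tr -d ' ')
data_rows=$((line_count - 1))   # subtract the header line
echo "lines: $line_count, data rows: $data_rows"

# wc also reads from a pipe, e.g. to count the lines that grep matched.
matches=$(grep "0" "$tmpfile" | wc -l | tr -d ' ')
echo "lines containing a 0: $matches"
```

Keep in mind that wc -l counts newline characters, so a file whose last line has no trailing newline will come out one lower than you might expect.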
Command 4: Vi
The vi command is super helpful, as it lets you open and explore a file. The command works as follows:
vi <file_name>
This takes you into a full-screen text editor. In this editor, you can use the following keys to navigate:
k Up one line
j Down one line
h Left one character
l Right one character (or use <Spacebar>)
w Right one word
b Left one word
However, in reality, you’ll find that navigation comes pretty naturally. The following commands will be the most useful though:
ZZ Write (if there were changes), then quit
:wq Write, then quit
:q Quit (will only work if file has not been changed)
:q! Quit without saving changes to file
You’ll learn to love vi, I swear!
Command 5: CTRL+R
So I’ve saved the best for last as I really use this command quite a lot.
CTRL+R isn’t really a command but more of a shortcut. It lets you search your history of used commands by typing something that resembles the command, and then similar commands that you’ve used before come up!
For example, say you’ve just run a really long command and, for whatever reason, your terminal session breaks and you have to run it again. With this shortcut, you can quickly search for it instead of reconstructing the command from scratch!
Let’s say I’m trying to remember a command that begins with hi, but I can’t remember all of it. I press ctrl+r, type hi, and then see what it recommends:
Perfect! The command history has been recommended, and that’s exactly the command we were looking for. If you press tab at this point, the autocomplete fills in the line:
I’ve actually always struggled to use both Linux and Git, but over time I’ve managed to remember a few key commands that have helped my development as an independent researcher. I can work fairly independently now, and it’s thanks to the above command line tools that I’m able to do so.
Therefore, I really recommend spending a few hours getting used to Linux, as the small lessons you learn now will really help your use of the system going forward. It’s pure upside!
Thanks again! If you have any questions or need any help, please message =]