Step-by-Step Tutorial

Step 1: Create a Working Directory on Your Computer

You will need to set up a working location on your local computer. This will be a folder/directory that will contain copies of the two primary code bases used in this tutorial. For the remainder of this tutorial we will assume that you have created a new folder called web-of-science-export within your computer’s Desktop folder. We will then refer to this location using the Unix-style path convention ~/Desktop/web-of-science-export.

Attention Windows Users

If you are a Windows user, Git's default line-ending conversion can prevent the scripts from running correctly on the Unix-based CHTC systems and with the Bash commands. Before cloning the repositories in the steps below, run these two commands so that Git keeps Unix-style (LF) line endings in your copies. They print nothing when they succeed, so the absence of an error message generally means they worked:

git config --global core.autocrlf false
git config --global core.eol lf

Step 2: Clone the CHTC Recipes Project

On your computer, navigate in your terminal program to the project folder:

cd ~/Desktop/web-of-science-export

Clone the CHTC Recipes project:

git clone https://github.com/UW-Madison-Library/chtc-recipes.git

We will only be using the recipe/project in the sub-folder wos-findbyexport.

Step 3: Clone & Build the Web of Science Explorer Package

The CHTC Recipes themselves are collections of scripts for the CHTC’s Condor high throughput system. This code base includes Unix/Linux bash scripts, Python scripts and Condor scripts. The Python code requires the use of a custom Python library developed by the Libraries called the Web of Science Explorer. When running these scripts on the CHTC’s computing cluster, each compute node in the cluster will run the code using a Docker container. This container is predefined by the Libraries and includes all the dependencies required, namely Python 3 and the Web of Science Explorer package.

If you would like to use the Python code base on your own computer while testing out your own custom scripts, you will need to clone or download the project from GitHub and install it from the source code. Cloning and installing the wos-explorer package is a relatively quick and easy process.

Navigate back to the web-of-science-export directory and clone the wos-explorer project:

cd ~/Desktop/web-of-science-export
git clone https://github.com/UW-Madison-Library/wos-explorer.git

Next we will build this code base as a package so that it can be used by the CHTC recipe. The commands below assume that python refers to Python 3 on your computer; if it refers to Python 2, substitute python3. The Web of Science Explorer package and these tutorials have only been tested using Python 3.

cd wos-explorer
python setup.py sdist

A compressed tar file will appear in the dist directory that can be installed using the Python pip command:

cd dist/
pip install wos_explorer-0.8.0.tar.gz

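Note that the version number in the generated file name may differ from the one shown here; use the name of the file that actually appears in your dist directory. Once the package is installed, you can import objects into your own Python scripts: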

from wos_explorer.article_collection import ArticleCollection
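
To confirm that the install succeeded before writing your own scripts, a check along the following lines may help. This is a minimal sketch that uses only the Python standard library; the helper name is_installed is illustrative and is not part of the Web of Science Explorer package.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module with this name can be found."""
    return importlib.util.find_spec(module_name) is not None


# Report whether the package installed in the previous step is visible
# to the Python interpreter that is running this script.
if is_installed("wos_explorer"):
    print("wos_explorer is installed")
else:
    print("wos_explorer was not found; re-run the pip install step")
```

If the package is reported as missing even though pip finished without errors, pip and python may be pointing at different Python installations.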

Step 4: Preparing Your Input Data: savedrecs.txt

To get started, log into the Web of Science database on the UW–Madison Library page.

  1. Perform your search.
  2. At the top of the results you will see a button that says “Export”. From the menu, select “Fast 5K”.
  3. In the window that appears, select “Tab Delimited (Mac)” from the File Format dropdown. Even if you are using a Windows operating system, you should select the Mac option so that the file uses Unix-style line endings, because this file will ultimately be used in a Linux computing environment on the CHTC servers.
  4. A file named savedrecs.txt will automatically download. Move that file to the ~/Desktop/web-of-science-export/chtc-recipes/wos-findbyexport directory on your computer.
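
Because the export needs Unix-style line endings, it can be worth checking the downloaded file before uploading it. The sketch below counts carriage-return bytes, which should be absent from a file with Unix line endings. The function name count_carriage_returns is illustrative, and the script assumes it is run from the directory that contains savedrecs.txt.

```python
from pathlib import Path


def count_carriage_returns(path):
    """Count carriage-return bytes; zero means no Windows-style (CRLF) endings."""
    return Path(path).read_bytes().count(b"\r")


# Only inspect the export if it is present in the current directory.
export_file = Path("savedrecs.txt")
if export_file.exists():
    found = count_carriage_returns(export_file)
    if found == 0:
        print("savedrecs.txt has no carriage returns")
    else:
        print(f"savedrecs.txt contains {found} carriage returns; re-export it")
```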

Step 5: Prepare a Working Directory on the CHTC Server

To log in to the CHTC server, first confirm that you are connected to WiscVPN, then connect using the credentials you received during your CHTC onboarding meeting:

ssh <netid>@submit-1.chtc.wisc.edu

Enter your password when prompted (it will not appear on the screen). Note that your submit server number may vary. For example, your CHTC account may have been assigned to the submit-2 server. If you have trouble logging in, you may need to check with the CHTC Research Facilitators to verify that their servers are configured to allow connections from the VPN pool that you are using. The general UW-Madison VPN pool should permit access.

Once you are logged in you will see “CHTC” spelled out in oversized characters.

Next make the directory where you will upload the wos-findbyexport scripts.

mkdir wos-findbyexport

Type exit to return to your own computer.

Step 6: Uploading wos-findbyexport to the CHTC Server

Once back on your computer, navigate to the directory that holds the wos-findbyexport scripts (~/Desktop/web-of-science-export/chtc-recipes/wos-findbyexport) and use the secure copy command, scp, to copy them to the CHTC submit server. A command like this will suffice:

scp -r ./* <netid>@submit-1.chtc.wisc.edu:/home/<netid>/wos-findbyexport

It will prompt you for your password. As it copies the files from your machine to the CHTC server they will print to the screen.

Step 7: Run the Jobs on the CHTC Server

You are now ready to run the jobs on the CHTC server. For your reference, here is the link to the CHTC instructions on starting your job on the CHTC server:

Running Your First CHTC Jobs

The first step is to log in to the submit server again, navigate to the wos-findbyexport directory you created, and run the Condor submit command followed by the .dag file you want it to execute:

condor_submit_dag wos-findbywosexport.dag

The .dag file will schedule which jobs to run, so for now, all you need to do is check in on the process periodically to make sure it continues to run correctly.

There are several commands that allow you to check on the process, each with its own features. The most basic one is:

condor_q

The CHTC has created an extended guide on how to evaluate your jobs as they run using variations on the condor_q command. This guide lists the commands you can run to evaluate multiple aspects of the jobs as they run or to view them in certain formats to fit your needs.

Learning About Your Jobs Using condor_q

Step 8: Download the Output Data

Once you are sure the process has completed, you can begin viewing the output files to check the results. You will be looking for the JSON output files located in the CHTC’s staging file system. You can access the staging file system using the transfer.chtc.wisc.edu server.

Because the CHTC servers are not intended for storage, it is best practice to download your output files and then remove them from the server. You can use any file transfer program that supports secure copy (SCP) to do this. To fetch the files using the Unix secure copy utility, run the following command in your terminal window:

scp <USERNAME>@transfer.chtc.wisc.edu:/staging/<USERNAME>/findbyexportfile-matches/*.gz .
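
Once the compressed files are on your computer, you may want to inspect them without unpacking them by hand. The sketch below reads one gzip-compressed JSON file using only the Python standard library. This tutorial does not describe the internal layout of the output, so the function tries the file as a single JSON document first and falls back to one JSON document per line; both layouts are assumptions, and the name load_records is illustrative.

```python
import gzip
import json


def load_records(path):
    """Load a gzip-compressed JSON file.

    Tries the whole file as one JSON document first. If that fails,
    treats each non-empty line as a separate JSON document.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        text = handle.read()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return [json.loads(line) for line in text.splitlines() if line.strip()]
```

To use it, call load_records with the name of one of the downloaded .gz files and examine the returned value in a Python session before building a larger analysis on top of it.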

After you have downloaded all the necessary files, remove everything from the submit server.