Processing Overview

After you have set up your programming environment and the accounts needed to run the process, as described on this guide’s Prerequisites page, you are ready to begin the process outlined in the step-by-step tutorial. Below is a high-level breakdown of how each step in the process works. For a description of how the codebase works, see the README on the wos-findbyexport GitHub page.

Figure 1: Web of Science Find by Export File jobs represented as a directed acyclic graph (DAG).

The file you submit to the computing cluster defines the overall process as a directed acyclic graph (DAG). It gives a macro-level view of the entire process as it will be carried out on the CHTC servers. You can view the DAG file in the chtc-recipes GitHub repository: wos-findbywosexport.dag. The DAG file simply schedules the order in which the jobs are submitted to the CHTC servers.
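To make the scheduling idea concrete, here is a minimal Python sketch of how a set of dependencies determines the order in which jobs can run. It is an illustration only and does not reproduce the contents of the DAG file: the job names are hypothetical labels for the four steps described on this page, and the real file is written for the scheduler on the CHTC servers, not in Python.

```python
from graphlib import TopologicalSorter

# Hypothetical job names standing in for the four steps. Each job maps to
# the set of jobs that must finish before it is allowed to start.
dependencies = {
    "parse_publication_years": set(),
    "find_by_export_file": {"parse_publication_years"},
    "sort_references": {"find_by_export_file"},
    "find_references": {"sort_references"},
}


def submission_order(deps):
    """Return the job names in an order that respects every dependency."""
    return list(TopologicalSorter(deps).static_order())


order = submission_order(dependencies)
print(" -> ".join(order))
```

Because every job here depends on exactly one predecessor, only one valid order exists. A graph with a cycle would have no valid order at all, which is why the schedule must be acyclic.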

Running through this schedule completes four main processes (see Steps 1 through 4 in the diagram above). An important feature of the process is that each stage must finish completely before the next one can begin. For example, even though some sub-jobs in Step 2 finish before others, the process waits until all Step 2 jobs are complete before moving on to Step 3. This is what gives the DAG its one-directional, stage-by-stage character:

  1. Step 1 uses the scripts in the parsepublicationyears folder in GitHub. This code reads the savedrecs.txt file you downloaded, identifies the years in which the articles in those results were published, and writes them to the file years.txt.
  2. Step 2, found in findbyexportfile, kicks off one Condor job for each entry in the years.txt file. This produces two data outputs:
    1. Article records corresponding to savedrecs.txt
    2. A list of the articles referenced by the articles in savedrecs.txt
  3. Step 3, found in the sortreferences file, takes the outputs from Step 2 and sorts the cited reference IDs by publication year, so that the dataset only needs to be searched for the years in which the referenced articles were published. The sorted reference year/article ID combinations are then fed into Step 4.
  4. Step 4 uses the code in the findreferences file to search the dataset for all the matching references and writes their metadata to the output files for you to examine and interpret.
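The ideas behind Step 1 and Step 3 can be sketched in a few lines of Python. This is not the code from the repository. It assumes that each record in savedrecs.txt carries a publication year on a line such as "PY 2015", and it uses made-up reference IDs; the function names and sample data are hypothetical and exist only for illustration.

```python
import re
from collections import defaultdict


def parse_publication_years(export_text):
    """Collect the distinct publication years found in an export file.

    Assumption: each record has a line of the form "PY 2015". Adjust the
    pattern if your export file uses a different layout.
    """
    years = re.findall(r"^PY[ \t]+(\d{4})[ \t]*$", export_text, flags=re.MULTILINE)
    return sorted(set(years))


def group_references_by_year(references):
    """Group (year, reference_id) pairs so each year is searched only once."""
    by_year = defaultdict(list)
    for year, ref_id in references:
        by_year[year].append(ref_id)
    return {year: sorted(ids) for year, ids in sorted(by_year.items())}


# Hypothetical two-record export used only to exercise the parser.
sample_export = (
    "PT J\nTI Example title\nPY 2015\nER\n\n"
    "PT J\nTI Another title\nPY 2012\nER\n"
)
years = parse_publication_years(sample_export)

# Hypothetical (year, reference ID) pairs standing in for Step 2 output.
sample_refs = [("2010", "REF:3"), ("2008", "REF:1"), ("2010", "REF:2")]
grouped = group_references_by_year(sample_refs)

print(years)
print(grouped)
```

Grouping the reference IDs by year first means a later search only has to open the portions of the dataset for years that actually appear in the list, which is the motivation given for Step 3.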