Prerequisites

Programming Environment

This guide provides an overview of using the UW-Madison Libraries’ CHTC recipes code to process the Clarivate Web of Science data set within the UW-Madison high throughput computing cluster. The guide assumes that you have a basic familiarity with Python programming in a Unix- or Linux-style command line environment. In addition, you should be comfortable using the git version control software to obtain a copy of the two required code bases.

The Python code and scripts used in this project are relatively simple. The Web of Science Explorer is a utility that makes it easier to process the raw data. The other Python code (CHTC recipes) consists of procedural scripts that are run in the computing cluster. These scripts are orchestrated by a handful of Bash shell scripts.
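As a quick self-check before starting, the short sketch below reports whether the command line tools this guide relies on can be found on your machine. It is not part of the Web of Science Explorer or the CHTC recipes code. The tool list and the function name missing_tools are illustrative assumptions, so adjust them to match your own setup.

```python
import shutil
import sys

# Assumption: the command line tools this guide relies on.
# Edit the tuple if your environment uses different names.
REQUIRED_TOOLS = ("git", "bash")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the names in `tools` that cannot be found on the PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    # Report the interpreter that is running this script.
    print("Python version:", sys.version.split()[0])

    missing = missing_tools()
    if missing:
        print("Not found on PATH:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

If a tool is reported as not found, install it or add it to your PATH before continuing with the tutorial.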

For individuals who do not yet have experience with Linux, Python and git, please consider taking the Software Carpentry course that is offered a few times each year by the UW-Madison Data Science Hub. See the Data Science Workshops website for information on upcoming Software Carpentry sessions at UW-Madison.

Request Access to the Web of Science Data Set

Request access to the Clarivate Web of Science data set using the Libraries’ Technical Assistance request form. You will be contacted by someone from the Library Technology Group and asked to confirm that you have read and agree to the publisher’s terms and conditions for use of the data set.

Web of Science User Account

Set up a user account within the Web of Science user interface. This account is required for exporting a set of search results that will be used as input for data processing jobs.

  1. Go to the UW-Madison Libraries’ database entry for the Web of Science
  2. Click on the link to access the database
  3. Once logged into the Web of Science, click on the Sign In link and create a new account

One of the first steps in the step-by-step tutorial requires that you download a set of search results from the Web of Science database user interface. You will need a Web of Science user account to export a data file in the required format, the Fast 5K export format.

CHTC Account

Finally, you will also need to set up a user account with the Center for High Throughput Computing (CHTC). If you are setting up your account for the first time, you will also need to have an introductory meeting with one of the CHTC Research Computing Facilitators. In this meeting they will provide you with an overview of the Condor parallel computing cluster you will be using.

Begin the process for setting up your account at the Getting Started With CHTC web page.

After you submit the form, one of the facilitators will email you to schedule a meeting. The facilitator will also ask you to have someone from the Libraries’ Technical Assistance verify that you have access to the Clarivate Web of Science data set and that you are working on the project.

In this meeting, the facilitator will explain the different types of resources at the CHTC, how they work, and which of the resources your project will use. The facilitator will also introduce you to the CHTC workflow, the server you will be using to submit the code and its outputs, and the basics of working with the Condor high throughput computing environment.

After the meeting, you will receive an email with links to the content you covered with the facilitator, links to the CHTC’s troubleshooting resources, and contact information for any help you may need. Your facilitator will also check in with you periodically, and you will receive emails about any CHTC service or access interruptions.

Visit the CHTC webpage in advance of your meeting to acquaint yourself with the resources there.