Clarivate Web of Science Data Set

The Big Ten Academic Alliance (BTAA) has negotiated a contract with the publisher Clarivate for each member institution to receive a copy of the raw data for its Web of Science database.

This data set is intended for academic research and data science projects, such as numerical or statistical analyses of data elements. Use is restricted to current faculty, staff, students and researchers at UW-Madison who are based in the United States. The terms and conditions for use of the data set include:

  • Commercial use of the data set or derived data is strictly prohibited
  • Distribution of the data set or derivative data sets created from the original is prohibited
  • The intellectual property rights for the data or derivative data sets are owned by Clarivate and may not be shared
  • After affiliation with UW-Madison ends, individuals may no longer use the data set [1]

See below for the terms governing UW-Madison’s copy of the data set.

If you would like to begin using this data set, please contact the UW-Madison Libraries’ Library Technology Group using our Technical Assistance contact form.

Code Examples & Tutorial

The UW-Madison Libraries have developed code samples to help researchers get started with the data set.

Web of Science Explorer
https://github.com/UW-Madison-Library/wos-explorer

A simple Python utility for finding and parsing article records within the Web of Science data set. The scripts work with a JSON serialization of the data that stores one record per line, which lets them stream through the data with a low memory footprint.
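To illustrate why one-record-per-line JSON keeps memory use low, here is a minimal sketch of the streaming pattern. It is not the wos-explorer API: the function names `stream_records` and `find_records` and the `id` and `title` fields are hypothetical, chosen only for this example, and the real record layout may differ.

```python
import io
import json
from typing import Callable, Iterable, Iterator


def stream_records(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed record per non-empty line.

    Only the current line is held in memory, so the total file size
    does not affect the memory footprint.
    """
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def find_records(
    lines: Iterable[str], predicate: Callable[[dict], bool]
) -> Iterator[dict]:
    """Lazily yield only the records for which predicate(record) is true."""
    return (record for record in stream_records(lines) if predicate(record))


if __name__ == "__main__":
    # Hypothetical sample data; the field names are assumptions.
    sample = io.StringIO(
        '{"id": "rec-1", "title": "Graph theory basics"}\n'
        '{"id": "rec-2", "title": "Protein folding"}\n'
    )
    for record in find_records(sample, lambda r: "graph" in r["title"].lower()):
        print(record["id"])  # prints "rec-1"
```

An open file handle can be passed in place of the in-memory sample, since iterating over a text file also yields one line at a time.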

CHTC Recipes
https://github.com/UW-Madison-Library/chtc-recipes

This project contains two simple examples that demonstrate how to work with the Web of Science data within UW-Madison’s Center for High Throughput Computing (CHTC) parallel computing environment.

Web of Science CHTC Tutorial

A tutorial for getting up and running using the Python and CHTC projects referenced above.

About the Data Set

This data set is based on the content of the UW-Madison Libraries' subscriptions to the Web of Science database and the specific data files included in our subscription package. The data covers information published between 1900 and 2021. Each spring, the Libraries receive new data covering publications through the end of the previous calendar year.

The data is available in XML format and is estimated to be about one terabyte in size when uncompressed.
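At this size, the XML cannot reasonably be loaded into memory as a whole. One common approach is incremental parsing with Python's standard `xml.etree.ElementTree.iterparse`. The sketch below assumes each record is wrapped in a `REC` element with an identifier in a `UID` child; those element names and the function `iter_record_ids` are assumptions made for illustration and should be checked against the actual files.

```python
import io
import xml.etree.ElementTree as ET
from typing import BinaryIO, Iterator, Optional


def local_name(tag: str) -> str:
    """Strip any '{namespace}' prefix from an element tag."""
    return tag.rsplit("}", 1)[-1]


def iter_record_ids(
    source: BinaryIO, record_tag: str = "REC", id_tag: str = "UID"
) -> Iterator[Optional[str]]:
    """Yield the identifier text of each record element, one at a time.

    record_tag and id_tag are assumed names, not confirmed by the
    data set documentation.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if local_name(elem.tag) == record_tag:
            for child in elem:
                if local_name(child.tag) == id_tag:
                    yield child.text
                    break
            # Drop the record's children once it has been processed.
            # The emptied element itself stays attached to the root.
            elem.clear()


if __name__ == "__main__":
    # Hypothetical sample document for demonstration only.
    sample = io.BytesIO(
        b"<records>"
        b"<REC><UID>rec-1</UID></REC>"
        b"<REC><UID>rec-2</UID></REC>"
        b"</records>"
    )
    print(list(iter_record_ids(sample)))  # prints ['rec-1', 'rec-2']
```

Matching on the local tag name lets the same sketch work whether or not the files declare a default XML namespace.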

Excerpted Terms and Conditions

Rights and Restrictions on Client’s Use of Data Set

The following excerpt comes from the 2017 contract between the BTAA and Clarivate.

(1) Data is to be used only for academic research and data science projects
(2) Data is restricted to the use of faculty, staff, students and researchers at Big Ten Academic Alliance member institutions located in the United States as identified above in the Project Scope. For researchers based outside of the United States, Clarivate requires the right to review access for membership prior to granting access to the data.
(3) Commercial use of the data set or derived data is strictly prohibited
(4) Data License does not include sharing outside Big Ten Academic Alliance institutions (ex. other institutions, government agencies, or corporate entities).
(5) Access to the data archive is dependent on the Institutions Web of Science back file depth for all institutions. Universities who do not have a current license for the full Web of Science archive must request access for any missing file depth(s) to Clarivate for individual research projects.
(6) Access is contingent on Participating Member maintaining existing Web of Science subscription.
(7) Big Ten Academic Alliance shall make Individual Participating Member Institutions aware of the Rights and Restrictions outlined in the Statement of Work.

Supplemental Terms and Conditions Applicable to the Data Set

The following excerpt comes from the original 2015 license for the data, also known as the Custom Dataset.

With respect to any license of a Custom Dataset, Client may use such Custom Dataset to perform numerical or statistical analyses of data elements derived from a Content Service. In addition, notwithstanding any language to the contrary contained herein, Client may (i) download the Custom Dataset for use in data analytics, and proprietary or third party tools; (ii) use web crawlers to extract patterns from the Custom Dataset; and (iii) create derivative databases consisting of the above-mentioned analytics; provided, however, that all Intellectual Property Rights to such Custom Dataset or derivative databases shall be owned by [Clarivate, formerly Thomson Reuters]; all such rights granted in this clause are limited to Client’s internal, non-commercial use of the Custom Dataset, and Client may not distribute or sublicense to any third party any portion of the Custom Dataset or derivative databases created under this clause. Use of the Custom Dataset may also be limited to a specific project if so designated on the Cover Sheet.

[1] Individuals requiring access to the data set after leaving UW-Madison should contact the libraries for assistance. Clarivate is willing to discuss special terms for providing access on a case-by-case basis to researchers needing continued access.