Workflow

This section describes how doiget-tdm can fit in with a text data mining workflow.

Obtain DOIs of interest

The first step is to obtain a set of DOIs for which you would like to acquire full-text content. Obtaining such DOIs is outside of the scope of doiget-tdm and depends on your text data mining goals. For example, you might query the Crossref API to find all the DOIs for a particular author, or institution, or journal, etc. For input into doiget-tdm, you can save these DOIs either as a text file with one DOI per line or as a CSV file with a column named “DOI”.
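As a minimal sketch of preparing this input, the following writes a set of DOIs in both accepted forms using only the Python standard library. The function names, the output file names, and the second DOI are hypothetical placeholders; how the DOIs are obtained is left to you.

```python
import csv
from pathlib import Path


def write_doi_text(dois, path):
    """Write one DOI per line to a plain-text file."""
    Path(path).write_text("".join(f"{doi}\n" for doi in dois), encoding="utf-8")


def write_doi_csv(dois, path):
    """Write the DOIs to a CSV file with a single column named "DOI"."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["DOI"])
        writer.writerows([doi] for doi in dois)


# The first DOI is the example used later in this section;
# the second is a made-up placeholder for illustration only.
dois = ["10.1371/journal.pbio.1002611", "10.1234/example.5678"]

write_doi_text(dois, "doi_list.txt")
write_doi_csv(dois, "doi_list.csv")
```

Either file could then be passed to doiget-tdm in place of doi_list.txt in the commands that follow.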

Begin configuring doiget-tdm

There are a few configuration options for doiget-tdm that are worth considering at this point, particularly:

data_dir

This is the directory in which the full-text content and metadata for the DOIs are stored.

data_dir_n_groups

If you have a large number of DOIs, it is recommended to set this option so that the per-DOI directories are spread across group subdirectories, rather than all sitting directly within data_dir.

email_address

It is highly recommended to provide your email address. It is passed to Crossref when obtaining DOI metadata, which gives access to the ‘polite’ pool of their API.

See Configuration for how to specify these options, and for details on the other configuration options that are available in doiget-tdm.

Evaluate publishers

Each DOI will have an associated publisher, and this publisher will affect the process used to acquire its full-text content. Hence, the next step is to evaluate the publishers that are responsible for the DOIs in the set of interest.

Unless the publisher was defined as part of the query used to obtain the set of DOIs, it is likely that the DOIs will be distributed across multiple publishers. To identify the publishers accurately, you can use doiget-tdm to download the metadata (from Crossref) for each of the DOIs. For example, assuming you have saved the DOIs of interest in a file named doi_list.txt, you can run the following command to acquire the DOI metadata:

doiget-tdm acquire --only-metadata doi_list.txt

To determine the publishers for the set of DOIs, you can then use doiget-tdm to generate a status report on the set of DOIs:

doiget-tdm status doi_list.txt

This will print a summary of the status of the DOI set, including the distribution of publishers.

Note

You can also provide a --output-path option to doiget-tdm status to save a file that has one row per DOI and columns that relate to aspects of the metadata like the publisher.
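If you save such a file, the publisher distribution could be tallied along the following lines. This is only a sketch: it assumes the saved file is a CSV with a header row and a column named "publisher", neither of which is confirmed here, so check the actual file and adjust the column name. The function name count_publishers is hypothetical.

```python
import csv
from collections import Counter


def count_publishers(report_path, column="publisher"):
    """Count the rows per publisher in a per-DOI report file.

    Assumes a CSV file with a header row. The default column name is an
    assumption and should be changed to match the actual report.
    """
    with open(report_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        return Counter(row[column] for row in reader)
```

Calling count_publishers(path).most_common() on the saved file would then list the publishers from most to least frequent.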

Make publisher agreements and update doiget-tdm configuration

As described in Available publishers, many of the publishers require permission to perform TDM and have particular configuration requirements. Obtaining such access is often facilitated by an institutional librarian, through whom the publisher subscriptions are arranged.

If there are publishers that are in the set of DOIs but not in the Available publishers for doiget-tdm, you can investigate Defining a new publisher to add custom functionality to doiget-tdm.

Acquire full-text

You can then acquire the full-text by running:

doiget-tdm acquire doi_list.txt

For each DOI in the file doi_list.txt, doiget-tdm will infer the publisher that is responsible for the DOI and will then use publisher-specific logic for acquiring the full-text content — using the publisher-specific configuration that you have provided.

Note

Some publishers require requests to be made from a specific IP address, so you might need to run this command on multiple machines. Such publishers tend to have a valid_hostname configuration option; doiget-tdm only attempts to acquire the full-text content for a DOI from such a publisher if the hostname of the requesting machine matches the value of valid_hostname. Alternatively, you can provide one or more member IDs (using the --only-member-id parameter), in which case doiget-tdm will only attempt to acquire DOIs with matching member IDs.
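The hostname check described in this note can be pictured with the sketch below. It is not doiget-tdm's implementation: the function name should_attempt is hypothetical, and the exact-match comparison against the value returned by socket.gethostname() is an assumption about how such a check might work.

```python
import socket


def should_attempt(valid_hostname, current_hostname=None):
    """Return True if this machine's hostname matches valid_hostname.

    Illustrative only: an exact string comparison is assumed here.
    """
    if current_hostname is None:
        # Look up the hostname of the machine running the code.
        current_hostname = socket.gethostname()
    return current_hostname == valid_hostname
```

Printing socket.gethostname() on each machine is a quick way to find the value that a valid_hostname setting would need to match.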

Use full-text content

The acquisition of full-text content will store files within your defined data_dir. You can interact with the content using Python or directly in the filesystem.

Accessing within Python

If you are doing further processing in Python, you can use the doiget-tdm API. For example, for the DOI “10.1371/journal.pbio.1002611”:

import doiget_tdm

doi = doiget_tdm.DOI(doi="10.1371/journal.pbio.1002611")
work = doiget_tdm.Work(doi=doi)

if work.metadata.exists:
    # print a summary of the metadata
    work.metadata.show()

# check whether there is a local copy of the full-text
if work.fulltext.exists:
    # load the full-text content, as bytes
    fulltext = work.fulltext.load()
    fulltext_content = fulltext.data
    fulltext_format = fulltext.fmt

You can also iterate through all the entries in the data directory:

import doiget_tdm

for work in doiget_tdm.iter_unsorted_works():
    pass

Accessing within the filesystem

The retrieved files will be stored within the directory specified by the data_dir configuration option. The specific location within data_dir depends on the value of the data_dir_n_groups configuration option:

data_dir_n_groups is 0

Files for the DOI are stored in ${DATA_DIR}/${QUOTED_DOI}/

data_dir_n_groups is > 0

Files for the DOI are stored in ${DATA_DIR}/${DOI_GROUP}/${QUOTED_DOI}/

Here, ${DATA_DIR} is the value of data_dir, ${DOI_GROUP} is a number between 0 and data_dir_n_groups - 1, and ${QUOTED_DOI} is the DOI string in ‘quoted’ form (see quote).

For example, the data for the DOI “10.1371/journal.pbio.1002611” will be stored in:

data_dir_n_groups is 0

${DATA_DIR}/10.1371%2Fjournal.pbio.1002611/

data_dir_n_groups is 5,000

${DATA_DIR}/1785/10.1371%2Fjournal.pbio.1002611/

Note

The ‘quoted’ form of DOI strings is used because DOI strings can contain characters such as /, which filesystems interpret as a directory separator.
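The layout above can be reproduced with a short sketch. It assumes that the ‘quoted’ form corresponds to percent-encoding via urllib.parse.quote with no characters treated as safe, which matches the example path shown; the function name doi_data_path is hypothetical. How doiget-tdm derives the group number is not described here, so the group is passed in rather than computed.

```python
from pathlib import Path
from urllib.parse import quote


def doi_data_path(data_dir, doi, doi_group=None):
    """Build the directory path for a DOI within data_dir.

    doi_group is None when data_dir_n_groups is 0; otherwise it is the
    already-known group number for the DOI (not computed here).
    """
    # Percent-encode every reserved character, including "/".
    quoted_doi = quote(doi, safe="")
    base = Path(data_dir)
    if doi_group is None:
        return base / quoted_doi
    return base / str(doi_group) / quoted_doi
```

For the example DOI, quote("10.1371/journal.pbio.1002611", safe="") gives 10.1371%2Fjournal.pbio.1002611, the directory name shown above.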

The filesystem path for a given DOI (or set of DOIs) can be obtained from the command line using the show-doi-data-path command in doiget-tdm.