Creating a database¶

A database is created from a CrossRef public data export via the crossref-lmdb create command - see Command-line reference for the command options and defaults.

For example, the following command will read from a public data export in the public_data_export directory (i.e., the directory containing the series of 0.json.gz, 1.json.gz, etc. files) and create a database in the db subdirectory:

crossref-lmdb create --public-data-dir public_data_export/ --db-dir db/

Warning

Creating a database takes a very long time! When using the full 2024 public data export, it could take several days.

Filtering items¶

The intended usage of this database is where you do not need all of the metadata within the public data export; for example, you may only want metadata for DOIs that relate to journal articles or span a particular range of published years. Items can be prevented from entering the database by providing a --filter-path argument to crossref-lmdb create. This argument needs to be a Python file that contains a function called filter_func. This function must accept one argument, a dict-like representation of the item metadata, and must return True if the item is to be included in the database and False otherwise.

Note

The function must be self-contained, in that any import statements must appear within the body of the function itself.

For example, if we only wanted to include journal articles in the database, we could create a file called journal_article_filter.py containing the code:

def filter_func(item):
    return "type" in item and item["type"] == "journal-article"

If we only wanted to include publications from say 2021 to 2023, we could instead specify the function as:

def filter_func(item):

    import datetime
    import crossref_lmdb.date

    pub_date = crossref_lmdb.date.get_published_date(item=item)

    if pub_date is None:
        return False

    start_date = datetime.date(year=2021, month=1, day=1)
    end_date = datetime.date(year=2024, month=1, day=1)

    date_ok = start_date <= pub_date < end_date

    return date_ok

Note

We are using a helper function from crossref_lmdb.date to extract a Python date object from the item metadata.

Resuming database creation¶

If the database creation gets interrupted, it can be resumed by using the --start-from-file-num command-line option. The argument to --start-from-file-num is the file number to resume from, which is the value that is reported by the progress bar.

Maximum database size¶

The crossref-lmdb create command has an option called --max-db-size-gb, which is required by LMDB to constrain the maximum allowable database size. On Linux and Mac platforms, this is not pre-filled and so it is safe to use a large value (see the LMDB documentation about the map_size argument for more details). However, on Windows it does seem to be pre-filled and so it needs to be set to a value that is appropriate for your anticipated database size (the default is also lowered to 2 GB).