3.2. Input files

Input scripts contain the settings that tell FitSNAP how to perform a fit. They are configuration files in the format read by Python’s native ConfigParser class. Each file is composed of sections, and each section contains keys with values, for example:

[SECTION1]
key1 = value1
key2 = value2

[SECTION2]
key3 = value3
key4 = value4
key5 = value5

In FitSNAP, each section declares settings for a certain aspect of the machine learning problem. For example, a BISPECTRUM section has keys that determine settings for the bispectrum descriptors that describe interatomic geometry, a CALCULATOR section has keys that determine which LAMMPS computes to use for calculating the descriptors, a SOLVER section has keys that determine which numerical solver to use for performing the fit, and so forth.
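As a minimal sketch of how such a file can be read, the snippet below parses a small two-section input with Python's standard configparser module. The section and key names are the generic placeholders from the layout above. Setting optionxform = str, which keeps the capitalization of keys such as numTypes (the default ConfigParser lowercases keys), is a choice made for this sketch and not a statement about FitSNAP's internals.

```python
import configparser

# Hypothetical two-section input, mirroring the generic layout shown above.
text = """
[SECTION1]
key1 = value1
key2 = value2

[SECTION2]
key3 = value3
"""

config = configparser.ConfigParser()
config.optionxform = str  # keep key capitalization (sketch assumption)
config.read_string(text)

# Walk over every section and print its key/value pairs.
# All values are read as strings and must be converted by the caller.
for section in config.sections():
    for key, value in config[section].items():
        print(f"{section}.{key} = {value}")
```

To read a file from disk instead of a string, config.read("path/to/input") can be used in place of read_string.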

There are many examples in the GitHub repo. For instance, the linear SNAP tantalum example has the following input script:

[BISPECTRUM]
numTypes = 1
twojmax = 6
rcutfac = 4.67637
rfac0 = 0.99363
rmin0 = 0.0
wj = 1.0
radelem = 0.5
type = Ta
wselfallflag = 0
chemflag = 0
bzeroflag = 0
quadraticflag = 0

[CALCULATOR]
calculator = LAMMPSSNAP
energy = 1
force = 1
stress = 1

[ESHIFT]
Ta = 0.0

[SOLVER]
solver = SVD
compute_testerrs = 1
detailed_errors = 1

[SCRAPER]
scraper = JSON

[PATH]
dataPath = JSON

[OUTFILE]
metrics = Ta_metrics.md
potential = Ta_pot

[REFERENCE]
units = metal
atom_style = atomic
pair_style = hybrid/overlay zero 10.0 zbl 4.0 4.8
pair_coeff1 = * * zero
pair_coeff2 = * * zbl 73 73

[GROUPS]
# name size eweight fweight vweight
group_sections = name training_size testing_size eweight fweight vweight
group_types = str float float float float float
smartweights = 0
random_sampling = 0
Displaced_A15 =  1.0    0.0       100             1               1.00E-08
Displaced_BCC =  1.0    0.0       100             1               1.00E-08
Displaced_FCC =  1.0    0.0       100             1               1.00E-08
Elastic_BCC   =  1.0    0.0     1.00E-08        1.00E-08        0.0001
Elastic_FCC   =  1.0    0.0     1.00E-09        1.00E-09        1.00E-09
GSF_110       =  1.0    0.0      100             1               1.00E-08
GSF_112       =  1.0    0.0      100             1               1.00E-08
Liquid        =  1.0    0.0       4.67E+02        1               1.00E-08
Surface       =  1.0    0.0       100             1               1.00E-08
Volume_A15    =  1.0    0.0      1.00E+00        1.00E-09        1.00E-09
Volume_BCC    =  1.0    0.0      1.00E+00        1.00E-09        1.00E-09
Volume_FCC    =  1.0    0.0      1.00E+00        1.00E-09        1.00E-09

[EXTRAS]
dump_descriptors = 1
dump_truth = 1
dump_weights = 1
dump_dataframe = 1

[MEMORY]
override = 0

We explain the sections and their keys in more detail below.

3.2.1. [BISPECTRUM]

This section contains settings for the SNAP bispectrum descriptors from Thompson et al.

  • numTypes is the number of atom types in the set of configurations located by the [PATH] section.

  • type contains a list of element symbols, one for each type. Make sure these are ordered correctly, e.g. if LAMMPS type 1 atoms are Ga and LAMMPS type 2 atoms are N, list this as Ga N.

The remaining keywords are thoroughly explained in the LAMMPS docs on computing SNAP descriptors, but we give an overview here. These are hyperparameters that *could* be optimized for your specific system, but this is not a requirement. You may also use the default values, or the values used in our examples, which are often well behaved for other systems.

  • twojmax determines the number of bispectrum coefficients for each element type. Give an argument for each element type, e.g. for two element types we may use 6 6, declaring twojmax = 6 for each type. A higher twojmax increases the number of bispectrum components for each atom, potentially giving more accuracy at increased cost. We recommend using a twojmax of 4, 6, or 8, which correspond to 14, 30, and 55 bispectrum components, respectively. Default value is 6.

  • rcutfac is a cutoff radius parameter. One value is used for all element types. We recommend a cutoff between 4 and 5 Angstroms for most systems. Default value is 4.67 Angstroms.

  • rfac0 is a parameter used in distance to angle conversion, between 0 and 1. Default value is 0.99363.

  • rmin0 is another parameter used in distance to angle conversion, between 0 and 1. Default value is 0.

  • wj is a list of neighbor weights. Give one argument for each element type, e.g. for two element types we may use 1.0 0.5, declaring a weight of 1.0 for neighbors of type 1 and 0.5 for neighbors of type 2. We recommend taking values from the existing multi-element examples.

  • radelem is a list of cutoff radii, one for each element type. Each value is multiplied by 2 * rcutfac to give the effective cutoff radius of that type, i.e. 2 * rcutfac * radelem.

  • wselfallflag is 0 or 1, determining whether self-contribution is for elements of a central atom or for all elements, respectively.

  • chemflag is 0 or 1, determining whether to use explicit multi-element SNAP descriptors as explained in Cusentino et al., and used in the InP example. This increases the number of SNAP descriptors to resolve multi-element environment descriptions, and therefore comes at increased cost but offers higher accuracy. This option is not required for multi-element systems; the default value is 0.

  • bzeroflag is 0 or 1, determining whether or not B0, the bispectrum components of an atom with no neighbors, are subtracted from the calculated bispectrum components.

  • quadraticflag is 0 or 1, determining whether or not to use quadratic descriptors in a linear model, as done by Wood and Thompson, and illustrated in the Ta_Quadratic example.
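The component counts quoted for twojmax and the effective cutoff formula for radelem can be checked with a short sketch. The counting loop below is an assumption based on our understanding of how the LAMMPS SNAP code enumerates bispectrum components; the function names are hypothetical and not part of FitSNAP.

```python
def num_bispectrum_components(twojmax: int) -> int:
    """Count (j1, j2, j) index triples for a given twojmax (assumed counting rule)."""
    count = 0
    for j1 in range(twojmax + 1):
        for j2 in range(j1 + 1):
            # j runs from j1 - j2 up to min(twojmax, j1 + j2) in steps of 2
            for j in range(j1 - j2, min(twojmax, j1 + j2) + 1, 2):
                if j >= j1:
                    count += 1
    return count


def effective_cutoff(rcutfac: float, radelem: float) -> float:
    """Effective cutoff radius of one element type: 2 * rcutfac * radelem."""
    return 2.0 * rcutfac * radelem


# Recommended twojmax values and the resulting number of components.
for twojmax in (4, 6, 8):
    print(twojmax, num_bispectrum_components(twojmax))

# Values from the tantalum example: rcutfac = 4.67637, radelem = 0.5.
print(effective_cutoff(4.67637, 0.5))
```

With radelem = 0.5, as in the tantalum example, the effective cutoff equals rcutfac itself.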

The following keywords are necessary for extracting per-atom descriptors and individual derivatives of bispectrum components with respect to neighbors, which are required for neural network potentials. See more info in PyTorch Models.

  • bikflag is 0 or 1, determining whether to compute per-atom bispectrum descriptors instead of sums of components for each atom. We do the latter for linear fitting because of the nature of the linear problem, which saves memory, but per-atom descriptors are required for neural networks.

  • dgradflag is 0 or 1, determining whether to compute individual derivatives of descriptors with respect to neighboring atoms, which is required for neural networks.

3.2.2. [CALCULATOR]

This section houses keywords determining which calculator to use, i.e. which descriptors to calculate.

  • calculator is the name of the LAMMPS connection for getting descriptors, e.g. for SNAP descriptors use LAMMPSSNAP.

  • energy is 0 or 1, determining whether to calculate descriptors associated with energies.

  • force is 0 or 1, determining whether to calculate descriptor gradients associated with forces.

  • stress is 0 or 1, determining whether to calculate descriptor gradients associated with virial terms for calculating and fitting to stresses.

  • per_atom_energy is 0 or 1, determining whether to use per-atom energy descriptors in association with bikflag = 1.

  • nonlinear is 0 or 1, and should be 1 if using nonlinear solvers such as PyTorch models.
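The relationships between these keys and the neural network keywords in [BISPECTRUM] can be summarized in a small consistency check. This is only a sketch: check_nn_settings is a hypothetical helper, not a FitSNAP function, and treating a missing key as 0 is an assumption of the sketch.

```python
import configparser

# Hypothetical input fragment for a neural network fit.
text = """
[BISPECTRUM]
bikflag = 1
dgradflag = 0

[CALCULATOR]
calculator = LAMMPSSNAP
per_atom_energy = 1
nonlinear = 1

[SOLVER]
solver = PYTORCH
"""


def check_nn_settings(config):
    """Hypothetical helper: list settings that conflict with the descriptions above."""
    bis, calc = config["BISPECTRUM"], config["CALCULATOR"]
    bikflag = bis.get("bikflag", "0")  # missing key treated as 0 (assumption)
    problems = []
    if config["SOLVER"].get("solver") == "PYTORCH":
        if calc.get("nonlinear", "0") != "1":
            problems.append("nonlinear should be 1 for PyTorch models")
        if bikflag != "1":
            problems.append("bikflag = 1 is required for neural networks")
        if bis.get("dgradflag", "0") != "1":
            problems.append("dgradflag = 1 is required for neural networks")
    if calc.get("per_atom_energy", "0") == "1" and bikflag != "1":
        problems.append("per_atom_energy = 1 is used together with bikflag = 1")
    return problems


config = configparser.ConfigParser()
config.read_string(text)
problems = check_nn_settings(config)
print(problems)
```

Here the only reported problem is the dgradflag setting, since the fragment requests a PyTorch solver but leaves dgradflag at 0.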

3.2.3. [ESHIFT]

This section declares an energy shift applied to each atom type. These values may be chosen freely. For example, they could come from the per-atom energy predicted in a vacuum from ab initio calculations. They may also be treated as hyperparameters.

3.2.4. [SOLVER]

This section contains keywords associated with specific machine learning solvers.

  • solver is the name of the solver. We recommend using SVD for linear solvers and PYTORCH for neural networks.

3.2.5. [SCRAPER]

This section declares which file scraper to use for gathering training data.

  • scraper is either JSON or XYZ.

If using the XYZ scraper, each Group of configurations has its own XYZ file containing configurations of atoms concatenated together, in extended XYZ format. Follow the example in examples/Ta_XYZ.

If using the JSON scraper, each Group may have its own directory containing separate JSON files for each configuration. Guarantee compatibility with FitSNAP by using our tools/VASP2JSON.py conversion script; this requires that your DFT training data be in VASP OUTCAR format. Likewise for tools/VASPxml2JSON.py.

We are also working on a scraper that directly reads VASP output; more documentation on this coming soon.

3.2.6. [PATH]

This section contains a dataPath keyword that locates the directory of the training data. For example, if the training data is in a directory called JSON in the parent directory relative to where we run the FitSNAP executable, this section looks like:

[PATH]
dataPath = ../JSON

3.2.7. [OUTFILE]

This section declares the names of output files.

  • metrics gives the name of the error metrics markdown file. If using LAMMPS metal units, energy mean absolute errors are in eV and force errors are in eV/Angstrom.

  • potential gives the prefix of the LAMMPS-ready potential files to dump.

3.2.8. [REFERENCE]

This section includes settings for an optional potential with which to overlay our machine-learned potential. We call this a “reference potential”, which is a pair style defined in LAMMPS. If you choose to use a reference potential, the energies and forces from the reference potential will be subtracted from the target ab initio training data. We also declare units in this section.

The minimal working setup uses no reference potential at all, in which case the reference section looks like (using metal units):

[REFERENCE]
units = metal
pair_style = zero 10.0
pair_coeff = * *

The rest of the keywords are associated with the particular LAMMPS pair style you wish to use.
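The subtraction of reference energies and forces from the training targets described above can be pictured with a few lines of arithmetic. This is an illustrative sketch only: all numbers are made up, the variable names are hypothetical, and it does not reproduce how FitSNAP stores its data internally.

```python
# Made-up values for one two-atom configuration (metal units: eV, eV/Angstrom).
dft_energy = -11.85
ref_energy = 0.25
dft_forces = [[0.10, -0.20, 0.05], [-0.10, 0.20, -0.05]]
ref_forces = [[0.02, -0.01, 0.00], [-0.02, 0.01, 0.00]]

# The fit targets are the ab initio values minus the reference contributions.
target_energy = dft_energy - ref_energy
target_forces = [
    [f - r for f, r in zip(atom_dft, atom_ref)]
    for atom_dft, atom_ref in zip(dft_forces, ref_forces)
]

print(target_energy)
print(target_forces)
```

With pair_style zero the reference energy and forces are zero, so the targets are simply the ab initio values.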

3.2.9. [GROUPS]

Each group should be its own sub-directory in the directory given by the dataPath keyword in the [PATH] section. There are a few different allowed syntaxes; subdirectory names in the first column are common to all options.

group_sections declares which parameters you want to set for each group of configurations.

For example:

group_sections = name training_size testing_size eweight fweight vweight

means you will supply group names, training size as a decimal fraction, testing size as a decimal fraction, energy weight, force weight, and virial weight, respectively. We must also declare the data types associated with these variables, given by

group_types = str float float float float float

Then we may declare the group names and parameters associated with them. For a particular group called Liquid for example, this looks like:

Liquid        =  1.0    0.0       4.67E+02        1       1.00E-08

where Liquid is the name of the group, 1.0 is the training fraction, 0.0 is the testing fraction, 4.67E+02 is the energy weight, 1 is the force weight, and 1.00E-08 is the virial weight.
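To make the correspondence between group_sections, group_types, and a group line explicit, the sketch below pairs each column with its name and type. The parse_group function and the casts mapping are hypothetical and written only for illustration; FitSNAP's own parsing may differ.

```python
group_sections = "name training_size testing_size eweight fweight vweight".split()
group_types = "str float float float float float".split()

# Map the type names used in group_types to Python constructors (sketch assumption).
casts = {"str": str, "float": float}


def parse_group(name, value_string):
    """Pair a group's name and its whitespace-separated values with group_sections."""
    values = [name] + value_string.split()
    if len(values) != len(group_sections):
        raise ValueError(f"expected {len(group_sections)} columns, got {len(values)}")
    return {
        field: casts[type_name](value)
        for field, type_name, value in zip(group_sections, group_types, values)
    }


# The key is the group name; the value holds the remaining columns.
liquid = parse_group("Liquid", "1.0    0.0       4.67E+02        1       1.00E-08")
print(liquid)
```

The group name fills the first column, so the value string only needs to supply the remaining five entries.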

Other available keywords are

  • random_sampling is 0 or 1. If 1, configurations in the groups are randomly sampled between their training and testing fractions.

  • smartweights is 0 or 1. If 1, we declare statistically distributed weights given your supplied weights.

A few examples are found in the examples directory.

3.2.10. [EXTRAS]

This section contains keywords on optional info to dump. By default, linear models output error metric markdown files that should be sufficient in most cases. If more detailed errors are required, please see the output Pandas dataframe FitSNAP.df used by linear models. Examples and library tools for analyzing this dataframe are found in our Colab Python notebook tutorial.

3.2.11. [MEMORY]

This section contains keywords for dealing with memory. We recommend using defaults.