DFSCLUSTERING
This is part of the crystallization module
It is only available if you configure PLUMED with ./configure –enable-modules=crystallization . Furthermore, this feature is still being developed so take care when using it and report any problems on the mailing list.

Cluster atoms based on their proximities and find the average properties of those atoms in a cluster.

This collective variable was developed for looking at nucleation phenomena, where you are interested in using studying the behavior of atoms in small aggregates or nuclei. In these sorts of problems you might be interested in the degree the atoms in a nucleus have adopted their crystalline structure or (in the case of heterogenous nucleation of a solute from a solvent) you might be interested in how many atoms are present in the largest cluster.

This collective variable is a function of a multicolvar and as such it must take a multicolvar as input. At first this may seem rather peverse but after a while one hopefully will come to appreciate just how many different properties of the cluster simply by using DFSCLUSTERING in tandem with the different multicolvars. As examples of some things that could be done you could use DFSCLUSTERING in tandem with COORDINATIONNUMBER to look at the coordination numbers of the atoms in the largest cluster or you could use DFSCLUSTERING in tandem with Q6 to look at the crystallinity in the largest cluster. Regardless of what you do, however, this works because a multicolvar is essentially calculating a large number of (we will call them) symmetry functions and each of these symmetry functions can be ascribed to a particular location in space. As an example the symmetry functions calculated by the command COORDINATIONNUMBER SPECIES=1-10 are the coordination numbers of atoms 1 through 10. In other words, the first of these symmetry functions measures the number of atoms that are within a certain cutoff of atom 1, the second measures the number of atoms that are within a cutoff of atom 2 and so on. In terms of location it seems logical to suppose that the first of these symmetry functions is located at atom 1's position, the second is at atom 2's position and so on. My point is that from the point of view of DFSCLUSTERING it does not particularly matter what is calculated by the underlying multicolvar. This action operates on the assumption that what we have is a set of positions in space that have a scalar (or vector) quantity associated with them.

Lets summarise thus far: we use multicolvar to convert an atomic configuration into a set of coordinates that have symmetry function values associated with them. Usually this means that we calculate something akin to a coordination number for each of the atoms in our system and that the positions of these ``coordination numbers" are essentially the positions of the atoms. The DFS clustering begins here. We start by calculating an adjacency matrix based on the positions of the various atoms. In other words, we construct an \(n\)-symmetry functions by \(n\)-symmetry functions matrix in which element \(i,j\) is either 0 or 1 and measures whether or not symmetry functions \(i\) and \(j\) are adjacent. As with NLINKS or SPRINT this measure of adjacency can measure whether or not the positions of the two symmetry functions are within a certain cutoff. Alternatively, if the symmetry functions are vector quanties such as MOLECULES or Q4 it can measure whether the two symmetry functions are within a certain cutoff of each other and whether or not the vector symmetry functions have the same orientation. An important difference between this Action and NLINKS or SPRINT is that in DFCLUSTERING a hard cutoff is used - although the user inputs a soft switching function the code sets any number that is greater than the tolerance set using the keyword TOL (which is by default machine epsilon) equal to one while any number less than this tolerance is set equal to zero. As a consequence of this we can use the adjacency matrix to determine whether or not a particular pair of atoms are connected or not. We can thus use a DFS clustering algorithm to determine the sizes of the various connected components in the underlying graph. These connected components should then correspond to the various clusters in our system.

Once we have determined which atoms are in each of the connected components of our system we can then determine what the properties of the atoms are in each of the clusters for our system. In other just as we would in any other multicolvar we can determine the average properties of the atoms in the largest cluster. Alternatively, we can determine how many of the atoms in the largest cluster in our system have a coordination number that is greater than a particular threshold and so on. To be clear the mathematics that we use in determining these quantities is identical to that used in multicolvar. The only difference is that now, rather than summing these quantities over all the symmetry functions, we sum only those symmetry functions for the atoms in a particular connected cluster.

Description of components

When the label of this action is used as the input for a second you are not referring to a scalar quantity as you are in regular collective variables. The label is used to reference the full set of quantities calculated by the action. This is usual when using MultiColvar functions. Generally when doing this the previously calculated multicolvar will be referenced using the DATA keyword rather than ARG.

This Action can be used to calculate the following scalar quantities directly. These quantities are calculated by employing the keywords listed below. These quantities can then be referenced elsewhere in the input file by using this Action's label followed by a dot and the name of the quantity. Some amongst them can be calculated multiple times with different parameters. In this case the quantities calculated can be referenced elsewhere in the input by using the name of the quantity followed by a numerical identifier e.g. label.lessthan-1, label.lessthan-2 etc. When doing this and, for clarity we have made the label of the components customizable. As such by using the LABEL keyword in the description of the keyword input you can customize the component name

Quantity Keyword Description
vmean VMEAN the norm of the mean vector. The output component can be refererred to elsewhere in the input file by using the label.vmean
vsum VSUM the norm of sum of vectors. The output component can be refererred to elsewhere in the input file by using the label.vsum
altmin ALT_MIN the minimum value. This is calculated using the formula described in the description of the keyword so as to make it continuous.
between BETWEEN the number/fraction of values within a certain range. This is calculated using one of the formula described in the description of the keyword so as to make it continuous. You can calculate this quantity multiple times using different parameters.
highest HIGHEST the lowest of the quantitities calculated by this action
lessthan LESS_THAN the number of values less than a target value. This is calculated using one of the formula described in the description of the keyword so as to make it continuous. You can calculate this quantity multiple times using different parameters.
lowest LOWEST the lowest of the quantitities calculated by this action
max MAX the maximum value. This is calculated using the formula described in the description of the keyword so as to make it continuous.
mean MEAN the mean value. The output component can be refererred to elsewhere in the input file by using the label.mean
min MIN the minimum value. This is calculated using the formula described in the description of the keyword so as to make it continuous.
moment MOMENTS the central moments of the distribution of values. The second moment would be referenced elsewhere in the input file using label.moment-2, the third as label.moment-3, etc.
morethan MORE_THAN the number of values more than a target value. This is calculated using one of the formula described in the description of the keyword so as to make it continuous. You can calculate this quantity multiple times using different parameters.
sum SUM the sum of values
Compulsory keywords
DATA the labels of the action that calculates the multicolvars we are interested in
CLUSTER ( default=1 ) which cluster would you like to look at 1 is the largest cluster, 2 is the second largest, 3 is the the third largest and so on.
WTOL if the base multicolvars have weights then you must define a hard cutoff on those you want to consider explicitally
Options
NOPBC ( default=off ) ignore the periodic boundary conditions when calculating distances
SERIAL ( default=off ) do the calculation in serial. Do not parallelize
TIMINGS ( default=off ) output information on the timings of the various parts of the calculation
USE_ORIENTATION ( default=off ) When computing whether two atoms/molecules are adjacent also take their orientations into account
MEAN ( default=off ) take the mean of these variables. The final value can be referenced using label.mean
VMEAN ( default=off ) calculate the norm of the mean vector. The final value can be referenced using label.vmean
VSUM ( default=off ) calculate the norm of the sum of vectors. The final value can be referenced using label.vsum
SUM ( default=off ) calculate the sum of all the quantities. The final value can be referenced using label.sum
HIGHMEM ( default=off ) use a more memory intensive version of this collective variable
LOWEST ( default=off ) calculate the lowest of these variables. The final value can be referenced using label.lowest
HIGHEST

( default=off ) calculate the highest of these variables. The final value can be referenced using label.highest

SWITCH This keyword is used if you want to employ an alternative to the continuous swiching function defined above. The following provides information on the switchingfunction that are available. When this keyword is present you no longer need the NN, MM, D_0 and R_0 keywords. You can use multiple instances of this keyword i.e. SWITCH1, SWITCH2, SWITCH3...
MORE_THAN calculate the number of variables more than a certain target value. This quantity is calculated using \(\sum_i 1.0 - \sigma(s_i)\), where \(\sigma(s)\) is a switchingfunction. The final value can be referenced using label.morethan. You can use multiple instances of this keyword i.e. MORE_THAN1, MORE_THAN2, MORE_THAN3... The corresponding values are then referenced using label.morethan-1, label.morethan-2, label.morethan-3...
LESS_THAN calculate the number of variables less than a certain target value. This quantity is calculated using \(\sum_i \sigma(s_i)\), where \(\sigma(s)\) is a switchingfunction. The final value can be referenced using label.lessthan. You can use multiple instances of this keyword i.e. LESS_THAN1, LESS_THAN2, LESS_THAN3... The corresponding values are then referenced using label.lessthan-1, label.lessthan-2, label.lessthan-3...
BETWEEN calculate the number of values that are within a certain range. These quantities are calculated using kernel density estimation as described on histogrambead. The final value can be referenced using label.between. You can use multiple instances of this keyword i.e. BETWEEN1, BETWEEN2, BETWEEN3... The corresponding values are then referenced using label.between-1, label.between-2, label.between-3...
HISTOGRAM calculate a discretized histogram of the distribution of values. This shortcut allows you to calculates NBIN quantites like BETWEEN.
MOMENTS calculate the moments of the distribution of collective variables. The \(m\)th moment of a distribution is calculated using \(\frac{1}{N} \sum_{i=1}^N ( s_i - \overline{s} )^m \), where \(\overline{s}\) is the average for the distribution. The moments keyword takes a lists of integers as input or a range. Each integer is a value of \(m\). The final calculated values can be referenced using moment- \(m\).
ALT_MIN calculate the minimum value. To make this quantity continuous the minimum is calculated using \( \textrm{min} = -\frac{1}{\beta} \log \sum_i \exp\left( -\beta s_i \right) \) The value of \(\beta\) in this function is specified using (BETA= \(\beta\)). The final value can be referenced using label.altmin.
MIN calculate the minimum value. To make this quantity continuous the minimum is calculated using \( \textrm{min} = \frac{\beta}{ \log \sum_i \exp\left( \frac{\beta}{s_i} \right) } \) The value of \(\beta\) in this function is specified using (BETA= \(\beta\)) The final value can be referenced using label.min.
MAX

calculate the maximum value. To make this quantity continuous the maximum is calculated using \( \textrm{max} = \beta \log \sum_i \exp\left( \frac{s_i}{\beta}\right) \) The value of \(\beta\) in this function is specified using (BETA= \(\beta\)) The final value can be referenced using label.max.

Examples

In this example the FCCUBIC multicolvar is used to determine how similar the environment around each atom is to an FCC unit cell. DFS clustering is used with the adjacency matrix constructed using the switching function specified by SWITCH={CUBIC D_0=0.4 D_MAX=0.5} - in pratice this means the clustering algorithm views two atoms as connected as long as they are within 0.5 nm of each other (as the tolerance is machine epsilon). The final quantity calculated measures the number of atoms in the largest cluster that have an FCCUBIC parameter greater than 0.035.

cubic1: FCCUBIC SPECIES=1-1000 SWITCH={CUBIC D_0=0.4  D_MAX=0.5} 
clust: DFSCLUSTERING DATA=cubic1 CLUSTER=1 SWITCH={CUBIC D_0=0.4   D_MAX=0.5} MORE_THAN={CUBIC D_0=0.035  D_MAX=0.045}

This second example adds a layer of complexity to the previous one. Once again the FCCUBIC multicolvar is used to determine how similar the environment around each atom is to an FCC unit cell. However, in this case the intervening MFILTER_MORE means that when the clustering is performed by DFSCLUSTERING only those atoms from the FCCUBIC action that have an fcccubic switching function greater than 0.045 are explicitly clustered using the DFS algorithm. As described on the manual page for MFILTER_MORE this command takes each of the various symmetry functions input by the input multicolvar (in this case the FCCUBIC multicolvar labelled cubic1) and ascribes each of them a weight, \(w_i\) using:

\[ w_i = 1.0 - s( v_i ) \]

where \(v_i\) is the value of the \(i\)th symmetry function and \(s()\) is the switching function specified through the SWITCH keyword. Within the DFSCLUSTERING these continuous switching functions have to be converted into discrete (discontinuous) switching functions, which is done through the WTOL keyword. The value given to this keyword ensures that any symmetry function calculated by cubic1 whose weight (calculated by MFILTER_MORE) is less than 0.03 is ignored during the DFS clustering.

Once again the final quantity to be calculated here is the number of atoms in the largest cluster that have an FCCUBIC parameter greater than 0.035. There is a very important and subtle distinction between what is being calculated here and what was calculated in the first example, however. The difference is best understood by considering an example. Imagine we had a very large extended but largely disordered cluster of atoms. By virtue of sheer blind luck a few widely-spaced atoms in this disordered cluster would have a configuration around them that might be very similar to the structure inside an fcc crystal. This despite the fact that the overall structure of the cluster does not resemble an fcc crystal. Suppose that this were clustered using the first command above. The final result you would get would be equal to the number of fcc-like atoms in the disordered cluster as essentially all the atoms in the system are connected - the fact that there is not an ordered arrangement is not particularly important. Now suppose instead that this structure were analysed using the command below. The final value of this quantity would most likely be equal to one as it is unlikely that two of the fcc-like atoms are close together in the cluster. The point is that by inserting the MFILTER_MORE command we have restricted the DFS clustering to only those atoms that are reasonably close in structure to fcc. These atoms are not necessarily close together in are disordered agregate and are thus not necessarily connected.

cubic1: FCCUBIC SPECIES=1-1000 SWITCH={CUBIC D_0=0.4  D_MAX=0.5} TOL=0.03 #MEAN
cf: MFILTER_MORE DATA=cubic1 SWITCH={CUBIC D_0=0.035 D_MAX=0.045}
clust: DFSCLUSTERING DATA=cf WTOL=0.03 CLUSTER=1 SWITCH={CUBIC D_0=0.4   D_MAX=0.5} MORE_THAN={CUBIC D_0=0.035  D_MAX=0.045}