Month: November 2014

Machine Learning in R, Packages

Machine Learning Algorithms

1. Prediction 

predict function

e.g.

> predicted_values <- predict(lm_model, newdata=data.frame(x1=x1_test, x2=x2_test))
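A minimal end-to-end sketch of fitting a linear model and predicting on new data is given here. The simulated data and the names train and test are assumptions made for illustration only. The point it shows is that the columns of newdata must carry the same names as the predictors in the model formula.

```r
set.seed(1)                                   # illustrative simulated data
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 1 + 2 * x1 - x2 + rnorm(n, sd = 0.1)

train    <- data.frame(y, x1, x2)
lm_model <- lm(y ~ x1 + x2, data = train)

# Test predictors; the data frame columns are named x1 and x2 to match the formula
x1_test <- c(0, 1)
x2_test <- c(0, 1)
test    <- data.frame(x1 = x1_test, x2 = x2_test)

predicted_values <- predict(lm_model, newdata = test)
print(predicted_values)
```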

2. Apriori 

Install the arules package.

The dataset must be a binary incidence matrix.

e.g.

> library(arules)

> dataset <- read.csv("C:\\Datasets\\mushroom.csv", header=TRUE)

> mushroom_rules <- apriori(as.matrix(dataset), parameter = list(supp = 0.8, conf = 0.9))

> summary(mushroom_rules)

> inspect(mushroom_rules)
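The sketch below shows what a binary incidence matrix looks like and how it can be handed to apriori. The item names and the tiny matrix are made up for illustration, the thresholds are arbitrary, and the arules package is assumed to be installed. Rows are transactions and columns are items.

```r
library(arules)  # assumed installed via install.packages("arules")

# Hypothetical incidence matrix: 1 means the item occurs in the transaction
incidence <- matrix(
  c(1, 1, 0,
    1, 1, 1,
    1, 0, 1,
    1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("bread", "milk", "eggs"))
)

# Convert the 0-1 matrix to a logical matrix, then to a transactions object
trans <- as(incidence == 1, "transactions")

rules <- apriori(trans, parameter = list(supp = 0.5, conf = 0.9))
inspect(rules)
```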

3. Logistic Regression

No extra package is needed.

> glm_mod <- glm(y ~ x1+x2, family=binomial(link="logit"), data=as.data.frame(cbind(y,x1,x2)))
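As a rough illustration, the following sketch fits the same kind of model on simulated 0-1 data and turns the fitted probabilities into class labels. The simulated data, the 0.5 threshold, and the names train, prob and pred_class are assumptions.

```r
set.seed(1)                                   # illustrative simulated data
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
p  <- plogis(-0.5 + 2 * x1 - x2)              # true class probabilities
y  <- rbinom(n, size = 1, prob = p)

train   <- data.frame(y, x1, x2)
glm_mod <- glm(y ~ x1 + x2, family = binomial(link = "logit"), data = train)

# type = "response" returns probabilities rather than log-odds
prob <- predict(glm_mod, newdata = data.frame(x1 = 0.5, x2 = -0.5),
                type = "response")
pred_class <- as.integer(prob > 0.5)          # threshold chosen for illustration
```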

4. K-Means Clustering

No extra package is needed.

If X is the data matrix and m is the number of clusters, then the command is:

> kmeans_model <- kmeans(x=X, centers=m)
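A small sketch on simulated data shows what the returned object contains. The two well-separated groups and the nstart value are assumptions chosen for illustration. Setting nstart above 1 reruns the algorithm from several random starts and keeps the best result.

```r
set.seed(1)
# Two simulated groups of 50 points each in two dimensions
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.3), ncol = 2))
m <- 2

kmeans_model <- kmeans(x = X, centers = m, nstart = 10)

kmeans_model$cluster        # cluster index assigned to each row of X
kmeans_model$centers        # one row of coordinates per cluster center
kmeans_model$tot.withinss   # total within-cluster sum of squares
```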

5. k-Nearest Neighbor Classification

Install the class package.

Let X_train and X_test be matrices of the training and test data respectively, and labels be a binary vector of class attributes for the training examples.

For k equal to K, the command is:

> knn_model <- knn(train=X_train, test=X_test, cl=as.factor(labels), k=K)

Then knn_model is a factor vector of class attributes for the test set.
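The following sketch runs the call on simulated data so the shape of the inputs and of the result is visible. The data and the choice K = 3 are assumptions for illustration.

```r
library(class)
set.seed(1)

# 20 training points per class, in two dimensions
X_train <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
                 matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2))
labels  <- rep(c(0, 1), each = 20)

# Two test points, one near each class center
X_test <- rbind(c(0, 0), c(3, 3))
K <- 3

knn_model <- knn(train = X_train, test = X_test, cl = as.factor(labels), k = K)
print(knn_model)            # factor with one predicted label per test row
```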

6. Naive Bayes

Install the e1071 package.

> nB_model <- naiveBayes(y~ x1 + x2, data=as.data.frame(cbind(y,x1,x2)))
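A short sketch of fitting and predicting with naiveBayes follows. The simulated data and the class names "a" and "b" are assumptions, and the e1071 package is assumed to be installed. Predictions can be requested either as class labels or as posterior probabilities.

```r
library(e1071)  # assumed installed via install.packages("e1071")
set.seed(1)

n  <- 100
y  <- factor(rep(c("a", "b"), each = n / 2))
x1 <- rnorm(n, mean = ifelse(y == "a", 0, 3))
x2 <- rnorm(n, mean = ifelse(y == "a", 0, 3))
train <- data.frame(y, x1, x2)

nB_model <- naiveBayes(y ~ x1 + x2, data = train)

newdata    <- data.frame(x1 = c(0, 3), x2 = c(0, 3))
pred_class <- predict(nB_model, newdata)                # predicted class labels
pred_prob  <- predict(nB_model, newdata, type = "raw")  # posterior probabilities
```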

7. Decision Trees (CART)

CART is implemented in the rpart package.

> cart_model <- rpart(y ~ x1 + x2, data=as.data.frame(cbind(y, x1, x2)), method="class")
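As an illustration, the sketch below grows a classification tree on simulated data where the class depends only on x1, then predicts labels for two new points. The data and the labels "yes" and "no" are assumptions.

```r
library(rpart)
set.seed(1)

n  <- 200
x1 <- runif(n)
x2 <- runif(n)
y  <- factor(ifelse(x1 > 0.5, "yes", "no"))   # class determined by x1 alone
train <- data.frame(y, x1, x2)

cart_model <- rpart(y ~ x1 + x2, data = train, method = "class")
print(cart_model)                              # text description of the splits

newdata <- data.frame(x1 = c(0.1, 0.9), x2 = c(0.5, 0.5))
pred    <- predict(cart_model, newdata = newdata, type = "class")
```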

8. AdaBoost

There are a number of different boosting functions in R.

One implementation uses decision trees as base classifiers, so the rpart package should be loaded.

The boosting function ada is in the ada package.

Let X be the matrix of features, and labels be a vector of 0-1 class labels.

> boost_model <- ada(x=X, y=labels)

9. Support Vector Machines (SVM)

Install the e1071 package.

Let X be the matrix of features, and labels be a vector of 0-1 class labels.

Let the regularization parameter be C.

> svm_model <- svm(x=X, y=as.factor(labels), kernel="radial", cost=C)

> summary(svm_model)
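A sketch of the full fit-and-predict cycle on simulated data is shown here. The data and the value C = 1 are assumptions, and the e1071 package is assumed to be installed. Because labels is passed as a factor, svm performs classification rather than regression.

```r
library(e1071)  # assumed installed via install.packages("e1071")
set.seed(1)

# Two simulated classes of 50 points each in two dimensions
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 3, sd = 0.5), ncol = 2))
labels <- rep(c(0, 1), each = 50)
C <- 1                                         # illustrative cost value

svm_model <- svm(x = X, y = as.factor(labels), kernel = "radial", cost = C)

pred <- predict(svm_model, X)                  # predictions on the training data
train_accuracy <- mean(as.character(pred) == as.character(labels))
```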

Dispersion correction treatment in DFT

In DFT, approximations must be made for how electrons interact with each other. These approximations are the exchange-correlation (XC) functionals.

Standard XC functionals include:

  • Local density approximation (LDA)
  • Generalized gradient approximation (GGA) functionals
  • Hybrid XC functionals

Standard XC functionals do not describe dispersion because:

  1. instantaneous density fluctuations are not considered
  2. they are “short-sighted” in that they consider only local properties to calculate the XC energy

Ground – Binding with incorrect asymptotics

Ground-level methods do not describe the long-range asymptotics. They give incorrect shapes of binding curves and underestimate the binding of well-separated molecules.

Results obtained with LDA for dispersion-bonded systems have limited and inconsistent accuracy, and the asymptotic form of the interaction is incorrect.

Step one – Simple C6 corrections (DFT-D)

The basic requirement for a DFT-based dispersion scheme is that it yields reasonable −1/r^6 asymptotic behavior for the interaction of particles in the gas phase, where r is the distance between the particles.

Approach: add an additional energy term which accounts for the missing long-range attraction.
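In its simplest pairwise form, the added term can be written as follows. This is the standard textbook expression rather than a formula taken from these notes; A and B label atoms, r_AB is their separation, and C_6^{AB} is the pair dispersion coefficient.

```latex
E_{\mathrm{DFT\text{-}D}} = E_{\mathrm{DFT}} + E_{\mathrm{disp}},
\qquad
E_{\mathrm{disp}} = -\sum_{A<B} \frac{C_6^{AB}}{r_{AB}^{6}}
```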

Four shortcomings:

  • The C6/r^6 dependence represents only the leading term of the correction and neglects both many-body dispersion effects and faster decaying terms such as C8/r^8 or C10/r^10.
  • It is not clear where one should obtain the C6 coefficients. The reliance on experimental data (ionization potentials and polarizabilities) limits the set of elements that can be treated to those present in typical organic molecules.
  • The C6 coefficients are kept constant during the calculation, and so effects of different chemical states of the atom or the influence of its environment are neglected.
  • The C6/r^6 function diverges for small separations (small r) and this divergence must be removed.

With the simple correction schemes the dispersion correction diverges at short interatomic separations and so must be “damped”. The damping function f(r_AB, A, B) is equal to one for large r_AB and decreases Edisp to zero or to a constant for small r_AB.
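With damping, the pairwise dispersion energy takes the form below. The second expression is one common choice of damping function, the Fermi-type form used in Grimme's DFT-D2, given only as an example; the steepness d and the pair radius R_0^{AB} are parameters of that particular scheme and are not taken from these notes.

```latex
E_{\mathrm{disp}} = -\sum_{A<B} f(r_{AB}, A, B)\,\frac{C_6^{AB}}{r_{AB}^{6}},
\qquad
f(r_{AB}, A, B) = \frac{1}{1 + \exp\!\left[-d\left(\dfrac{r_{AB}}{R_0^{AB}} - 1\right)\right]}
```

For r_AB much larger than R_0^{AB} the exponential vanishes and f tends to one, and for small r_AB the damping function suppresses the divergent C6/r^6 term.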

Issues with damping function:

  • The shape of the underlying binding curve is sensitive to the XC functional used and so the damping functions must be adjusted so as to be compatible with each exchange-correlation or exchange functional.
  • This fitting is also sensitive to the definition of atomic size and must be done carefully since the damping function can actually affect the binding energies even more than the asymptotic C6 coefficients.
  • The fitting also effectively includes the effects of C8/r^8 or C10/r^10 and higher contributions.

Step two – Environment-dependent corrections

In the simple “DFT-D” schemes, the dispersion coefficients are predetermined and constant quantities. The errors introduced by this approximation can be large.

The unifying concept:

The dispersion coefficient of an atom in a molecule depends on the effective volume of the atom. When the atom is “squeezed”, its electron cloud becomes less polarizable leading to a decrease of the C6 coefficients.

Three step-two methods:

  • DFT-D3 of Grimme

Captures the environmental dependence of the C6 coefficients by considering the number of neighbors each atom has.

  • vdW(TS) of Tkatchenko and Scheffler

Relies on reference atomic polarizabilities and reference atomic C6 coefficients to calculate the dispersion energy.

During the calculation on the system of interest, the electron density of a molecule is divided between the individual atoms, and for each atom its corresponding density is compared to the density of a free atom.

  • BJ model of Becke and Johnson

Based on the fact that around an electron at r1 there will be a region of electron density depletion, the so-called XC hole. This creates asymmetric electron density and thus non-zero dipole and higher-order electrostatic moments, which causes polarization in other atoms to an extent given by their polarizability.

C6 coefficients are altered through two effects:

  1. The polarizabilities of atoms in molecules are scaled from their reference atom values according to their effective atomic volumes.
  2. The dipole moments respond to the chemical environment through changes of the exchange hole, although this effect seems to be difficult to quantify precisely.
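The volume scaling mentioned above is commonly written as follows. The notation is an assumption introduced here: V_A^eff is the effective volume of atom A in the molecule and V_A^free that of the free atom. The quadratic scaling of C6 is the form used in the Tkatchenko-Scheffler scheme.

```latex
\alpha_A^{\mathrm{eff}} = \frac{V_A^{\mathrm{eff}}}{V_A^{\mathrm{free}}}\,\alpha_A^{\mathrm{free}},
\qquad
C_{6,AA}^{\mathrm{eff}} = \left(\frac{V_A^{\mathrm{eff}}}{V_A^{\mathrm{free}}}\right)^{2} C_{6,AA}^{\mathrm{free}}
```

A “squeezed” atom has a volume ratio below one, so both its polarizability and its C6 coefficient decrease, consistent with the unifying concept stated earlier.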

Step three – Long-range density functionals

These approaches do not rely on external input parameters but rather obtain the dispersion interaction directly from the electron density.

They are termed non-local correlation functionals since they add non-local (i.e., long-range) correlations to local or semi-local correlation functionals.
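The non-local correlation energy in functionals of this type (for example the vdW-DF family) has the general form below. The symbols are assumptions introduced here: n is the electron density and φ is an integration kernel that depends on the two positions and on the density in their vicinity.

```latex
E_c^{\mathrm{nl}} = \frac{1}{2} \iint n(\mathbf{r}_1)\,\varphi(\mathbf{r}_1, \mathbf{r}_2)\,n(\mathbf{r}_2)\,
\mathrm{d}\mathbf{r}_1\,\mathrm{d}\mathbf{r}_2
```

Because the double integral couples the density at every pair of points, the energy at r_1 depends on the density far away at r_2, which is what the local and semi-local functionals lack.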