# Calculating ROC curves¶

(Original entry published in CBDD Research Group Blog.)

Here you will find a a short tutorial about how to generate receiver operating
characteristic (ROC) curves and other statistics after running rDock molecular
docking (for other programs such as Vina or Glide, just a little modification on
the way `dataforR_uq.txt`

file is interpreted will make it work, see below).

I assume all of you are familiar with what ROC curves are, what are they for and how they are made.

Just in case, a very brief summary would be:

ROC curves are graphic representations of the relation existing between the sensibility and the specificity of a test. It is generated by plotting the fraction of true positives out of the total actual positives versus the fraction of false positives out of the total actual negatives.

In our case, we will use it for checking whether a docking program is able to select active ligands with respect to inactive ligands (decoys) and whether it is able to select these active ligands in the top % of a ranked database.

R Library ROCR is mandatory (try with command

`install.packages("ROCR")`

in R before downloading from source).

The example selected for this tutorial is a system from the DUD benchmark set, “hivpr” or “hiv protease”.

These are the files you will need (all can be downloaded in this Dropbox shared folder):

List of active ligands (

`ligands.txt`

)List of inactive ligands (

`decoys.txt`

)Output file with the docked poses of each ligand with the corresponding docking scores (

`hivpr_all_results.sd.gz`

)R script with all the R commands in this tutorial (

`ROC_curves.R`

)

Before getting into R, the resulted docked poses have to be filtered out for only having the best pose for each ligand (the smallest score – or highest in negative value). To do so run:

```
gunzip hivpr_all_results.sd.gz
sdsort -n -s -fSCORE hivpr_all_results.sd | sdfilter -f'$_COUNT == 1' > hivpr_1poseperlig.sd
# sdsort with -n and -s flags will sort internally each ligand by increasing
# score and sdfilter will get only the first entry of each ligand.
sdreport -t hivpr_1poseperlig.sd | awk '{print $2,$3,$4,$5,$6,$7}' > dataforR_uq.txt
# sdreport will print all the scores of the output in a tabular format and,
# with command awk, we will format the results.
```

Note

`sdsort`

and `sdreport`

are really useful tools for managing sd formatted
compound collections. They are very user-friendly and free to download. They
are provided along with rDock software in rDock website. Go to Download section for
downloading rDock.

This `dataforR_uq.txt`

(also in the Dropbox folder) file must contain one
entry per ligand with the docked scores (what R will use to rank and plot the
ROC curves).

## R commands for generating ROC curves¶

Then, run the following commands in R for plotting the ROC curves:

```
# load ROCR
library(ROCR);
# load ligands and decoys
lig
```

Which will give us the following plot:

Afterwards, other useful statistics such as AUC or Enrichment factors can also be calculated:

```
# AUC (area under the curve)
auc_rdock <- performance(predINTERuq, "auc")
auc.area_rdock <- slot(auc_rdock, "y.values")[[1]]
cat("AUC: \n")
cat(auc.area_rdock)
cat("\n\n")
```

```
AUC:
0.7700965
```

```
# Enrichment Factors
EF_rdock <- perfINTERuq@y.values[[1]] / perfINTERuq@x.values[[1]]
EF_rdock_1 <- EF_rdock[which(perfINTERuq@x.values[[1]] > 0.01)[1]]
EF_rdock_20 <- EF_rdock[which(perfINTERuq@x.values[[1]] > 0.2)[1]]
cat("Enrichment Factor top 1%:\n")
cat(EF_rdock_1)
cat("\n\n")
```

```
Enrichment Factor top 1%:
11.11817
```

```
cat("Enrichment Factor top 20%:\n")
cat(EF_rdock_20)
cat("\n\n")
```

```
Enrichment Factor top 20%:
3.200686
```

Moreover, a good analysis of these curves is to re-plot them in semilogarithmic scale (x axis in logarithmic scale). This way, one can focus on the early enrichment of the database and have a more detailed view of the selected actives in the top % of all the ligands.

```
jpeg("hivpr_semilog_ROC.jpg")
rdockforsemilog=perfINTERuq@x.values[[1]]
rdockforsemilog[rdockforsemilog < 0.0005]=0.0005
plot(rdockforsemilog, perfINTERuq@y.values[[1]],type="l", xlab="False Positive Rate", ylab="True Positive Rate", xaxt="n", log="x", col="blue", main="hivpr - Semilog ROC Curves")
axis(1, c(0, 0.001, 0.01, 0.1, 1))
x<-seq(0, 1, 0.001)
points(x, x, col="gray", type="l")
dev.off()
```

Obtaining the following semi-logarithmic ROC curves: