diff --git a/R-package/R/ndarray.R b/R-package/R/ndarray.R
index b5537a298593..fa79120f0aab 100644
--- a/R-package/R/ndarray.R
+++ b/R-package/R/ndarray.R
@@ -95,6 +95,10 @@ mx.nd.copyto <- function(src, ctx) {
 #'
 #' @return An \code{mx.ndarray}
 #'
+#' @rdname mx.nd.array
+#'
+#' @return An Rcpp_MXNDArray object
+#'
 #' @examples
 #' mat = mx.nd.array(x)
 #' mat = 1 - mat + (2 * mat)/(mat + 0.5)
diff --git a/R-package/vignettes/mnistCompetition.Rmd b/R-package/vignettes/mnistCompetition.Rmd
new file mode 100644
index 000000000000..b749bc9cb4e0
--- /dev/null
+++ b/R-package/vignettes/mnistCompetition.Rmd
@@ -0,0 +1,113 @@
+---
+title: "Handwritten Digits Classification Competition"
+author: "Tong He"
+date: "October 17, 2015"
+output: html_document
+---
+
+[MNIST](http://yann.lecun.com/exdb/mnist/) is a handwritten digit image data set created by Yann LeCun. Every digit is represented by a 28x28 image. It has become a standard data set for testing classifiers on simple image input. Neural networks are undoubtedly strong models for image classification tasks. There is a [long-term hosted competition](https://www.kaggle.com/c/digit-recognizer) on Kaggle using this data set. We will present the basic usage of `mxnet` to compete in this challenge.
+
+## Data Loading
+
+First, let us download the data from [here](https://www.kaggle.com/c/digit-recognizer/data), and put the files under the `data/` folder in your working directory.
+
+Then we can read them into R and convert them to matrices:
+
+```{r, eval=FALSE}
+train <- read.csv('data/train.csv', header=TRUE)
+test <- read.csv('data/test.csv', header=TRUE)
+train <- data.matrix(train)
+test <- data.matrix(test)
+
+train.x <- train[,-1]
+train.y <- train[,1]
+```
+
+Here every image is represented as a single row in train/test.
+The greyscale values of each image fall in the range [0, 255]; we can linearly transform them into [0, 1] by
+
+```{r, eval = FALSE}
+train.x <- train.x/255
+test <- test/255
+```
+
+In the label part, we see that the counts of the different digits are fairly even:
+
+```{r, eval=FALSE}
+table(train.y)
+```
+
+## Network Configuration
+
+Now we have the data. The next step is to configure the structure of our network.
+
+```{r}
+data <- mx.symbol.Variable("data")
+fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
+act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
+fc2 <- mx.symbol.FullyConnected(act1, name = "fc2", num_hidden = 64)
+act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
+fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
+softmax <- mx.symbol.Softmax(fc3, name = "sm")
+```
+
+1. In `mxnet`, we use its own data type `symbol` to configure the network. `data <- mx.symbol.Variable("data")` uses `data` to represent the input data, i.e. the input layer.
+2. Then we set the first hidden layer with `fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)`. This layer takes `data` as its input, along with a name and the number of hidden neurons.
+3. The activation is set by `act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")`. The activation function takes the output of the first hidden layer `fc1`.
+4. The second hidden layer takes the result from `act1` as its input, with its name "fc2" and 64 hidden neurons.
+5. The second activation is almost the same as `act1`, except with a different input source and name.
+6. Here comes the output layer. Since there are only 10 digits, we set the number of neurons to 10.
+7. Finally we set the activation to softmax to get a probabilistic prediction.
+
+## Training
+
+We are almost ready for the training process. Before we start the computation, let's decide which device we should use.
+
+```{r}
+devices <- lapply(1:2, function(i) {
+  mx.cpu(i)
+})
+```
+
+Here we assign two threads of our CPU to `mxnet`. After all this preparation, you can run the following command to train the neural network!
+
+```{r}
+set.seed(0)
+model <- mx.model.FeedForward.create(softmax, X=train.x, y=train.y,
+                                     ctx=devices, num.round=10, array.batch.size=100,
+                                     learning.rate=0.07, momentum=0.9,
+                                     initializer=mx.init.uniform(0.07),
+                                     epoch.end.callback=mx.callback.log.train.metric(100))
+```
+
+## Prediction and Submission
+
+To make a prediction, we can simply write
+
+```{r}
+preds <- predict(model, test)
+dim(preds)
+```
+
+It is a matrix with 28000 rows and 10 columns, containing the class probabilities from the output layer. To extract the most likely label for each row, we can use `max.col` in R:
+
+```{r}
+pred.label <- max.col(preds) - 1
+table(pred.label)
+```
+
+With a little extra effort to match the required CSV format, we have our submission to the competition!
+
+```{r}
+submission <- data.frame(ImageId=1:nrow(test), Label=pred.label)
+write.csv(submission, file='submission.csv', row.names=FALSE, quote=FALSE)
+```
+
+
+
+
+
+
+
+
+
diff --git a/R-package/vignettes/ndarrayAndSymbolTutorial.Rmd b/R-package/vignettes/ndarrayAndSymbolTutorial.Rmd
new file mode 100644
index 000000000000..2b608066b753
--- /dev/null
+++ b/R-package/vignettes/ndarrayAndSymbolTutorial.Rmd
@@ -0,0 +1,286 @@
+MXNet R Tutorial on NDArray and Symbol
+============================
+
+This vignette gives a general overview of MXNet's R package. MXNet combines several elements that make it possible to build flexible and efficient applications. There are mainly three concepts:
+
+* [NDArray](#ndarray-vectorized-tensor-computations-on-cpus-and-gpus)
+  offers matrix and tensor computations on both CPU and GPU, with automatic
+  parallelization
+* [Symbol](#symbol-and-automatic-differentiation) makes defining a neural
+  network extremely easy, and provides automatic differentiation.
+* [KVStore](#distributed-key-value-store) eases data synchronization between
+  multiple GPUs and machines.
+
+## NDArray: Vectorized tensor computations on CPUs and GPUs
+
+`NDArray` is the basic vectorized operation unit in MXNet for matrix and tensor computations.
+Users can perform the usual calculations as on an R array, but with two additional features:
+
+1. **multiple devices**: all operations can be run on various devices including
+CPU and GPU
+2. **automatic parallelization**: all operations are automatically executed in
+  parallel with each other
+
+### Creation and Initialization
+
+Let's create an `NDArray` on either GPU or CPU:
+
+```{r}
+require(mxnet)
+a <- mx.nd.zeros(c(2, 3))           # create a 2-by-3 matrix on cpu
+b <- mx.nd.zeros(c(2, 3), mx.gpu()) # create a 2-by-3 matrix on gpu 0
+c <- mx.nd.zeros(c(2, 3), mx.gpu(2)) # create a 2-by-3 matrix on gpu 2
+c$dim()
+```
+
+We can also initialize an `NDArray` object in various ways:
+
+```{r}
+a <- mx.nd.ones(c(4, 4))
+b <- mx.rnorm(c(4, 5))
+c <- mx.nd.array(1:5)
+```
+
+To check the numbers in an `NDArray`, we can simply run
+
+```{r}
+a <- mx.nd.ones(c(2, 3))
+b <- as.array(a)
+class(b)
+b
+```
+
+### Basic Operations
+
+#### Element-wise operations
+
+You can perform element-wise operations on `NDArray` objects (note both operands must have the same shape):
+
+```{r}
+a <- mx.nd.ones(c(2, 3)) * 2
+b <- mx.nd.ones(c(2, 3)) / 8
+as.array(a)
+as.array(b)
+c <- a + b
+as.array(c)
+d <- c / a - 5
+as.array(d)
+```
+
+If two `NDArray`s sit on different devices, we need to explicitly move them
+into the same one.
For instance:
+
+```{r}
+a <- mx.nd.ones(c(2, 3)) * 2
+b <- mx.nd.ones(c(2, 3), mx.gpu()) / 8
+c <- mx.nd.copyto(a, mx.gpu()) * b
+as.array(c)
+```
+
+#### Load and Save
+
+You can save an `NDArray` object to disk with `mx.nd.save`:
+
+```{r}
+a <- mx.nd.ones(c(2, 3))
+mx.nd.save(a, 'temp.ndarray')
+```
+
+You can also load it back easily:
+
+```{r}
+a <- mx.nd.load('temp.ndarray')
+as.array(a[[1]])
+```
+
+In case you want to save data to a distributed file system such as S3 or HDFS,
+you can directly save to and load from it. For example:
+
+```{r,eval=FALSE}
+mx.nd.save(a, 's3://mybucket/mydata.bin')
+mx.nd.save(a, 'hdfs:///users/myname/mydata.bin')
+```
+
+### Automatic Parallelization
+
+`NDArray` can automatically execute operations in parallel. This is desirable when we
+use multiple resources such as CPU, GPU, and CPU-to-GPU memory bandwidth.
+
+For example, if we write `a <- a + 1` followed by `b <- b + 1`, and `a` is on CPU while
+`b` is on GPU, then we want to execute them in parallel to improve
+efficiency. Furthermore, data copies between CPU and GPU are expensive, so we
+want to run them in parallel with other computations as well.
+
+However, finding by eye which code can be executed in parallel is hard. In the
+following example, `a <- a + 1` and `c <- c * 3` can be executed in parallel, but `a <- a + 1` and
+`b <- b * 3` must be executed sequentially.
+
+```{r}
+a <- mx.nd.ones(c(2,3))
+b <- a
+c <- mx.nd.copyto(a, mx.cpu())
+a <- a + 1
+b <- b * 3
+c <- c * 3
+```
+
+Luckily, MXNet can automatically resolve the dependencies and
+execute operations in parallel with correctness guaranteed. In other words, we
+can write a program as if it were single-threaded, and MXNet will
+automatically dispatch it onto multiple devices, such as multiple GPU cards or multiple
+machines.
+
+This is achieved by lazy evaluation. Each operation we write down is issued to an
+internal engine, and then returns.
For example, if we run `a <- a + 1`, it
+returns immediately after pushing the plus operator to the engine. This
+asynchronicity allows us to push more operators to the engine, so it can determine
+the read and write dependencies and find the best way to execute the operations in
+parallel.
+
+The actual computations are performed when we copy the results to some
+other place, such as `as.array(a)` or `mx.nd.save(a, 'temp.dat')`. Therefore, to
+write highly parallelized code, we only need to postpone asking for
+the results.
+
+## Symbol and Automatic Differentiation
+
+With the computational unit `NDArray`, we need a way to construct neural networks. MXNet provides a symbolic interface, named Symbol, to do so. Symbol combines both flexibility and efficiency.
+
+### Basic Composition of Symbols
+
+The following code creates a two-layer perceptron network:
+
+```{r}
+require(mxnet)
+net <- mx.symbol.Variable('data')
+net <- mx.symbol.FullyConnected(data=net, name='fc1', num_hidden=128)
+net <- mx.symbol.Activation(data=net, name='relu1', act_type="relu")
+net <- mx.symbol.FullyConnected(data=net, name='fc2', num_hidden=64)
+net <- mx.symbol.Softmax(data=net, name='out')
+class(net)
+```
+
+Each symbol takes a (unique) string name. *Variable* often defines the inputs,
+or free variables. Other symbols take a symbol as their input (*data*),
+and may accept other hyper-parameters such as the number of hidden neurons (*num_hidden*)
+or the activation type (*act_type*).
+
+The symbol can be simply viewed as a function taking several arguments, whose
+names are automatically generated and can be obtained with
+
+```{r}
+arguments(net)
+```
+
+As can be seen, these arguments are the parameters needed by each symbol:
+
+- *data* : input data needed by the variable *data*
+- *fc1_weight* and *fc1_bias* : the weight and bias for the first fully connected layer *fc1*
+- *fc2_weight* and *fc2_bias* : the weight and bias for the second fully connected layer *fc2*
+- *out_label* : the label needed by the loss
+
+We can also specify the automatically generated names explicitly:
+
+```{r}
+net <- mx.symbol.Variable('data')
+w <- mx.symbol.Variable('myweight')
+net <- mx.symbol.FullyConnected(data=net, weight=w, name='fc1', num_hidden=128)
+arguments(net)
+```
+
+### More Complicated Composition
+
+MXNet provides well-optimized symbols (see
+[src/operator](https://github.com/dmlc/mxnet/tree/master/src/operator)) for
+commonly used layers in deep learning. We can also easily define new operators
+in Python. The following example first performs an element-wise add between two
+symbols, then feeds the result into the fully connected operator.
+
+```{r}
+lhs <- mx.symbol.Variable('data1')
+rhs <- mx.symbol.Variable('data2')
+net <- mx.symbol.FullyConnected(data=lhs + rhs, name='fc1', num_hidden=128)
+arguments(net)
+```
+
+We can also construct symbols in a more flexible way than the single
+forward composition shown above.
+
+```{r}
+net <- mx.symbol.Variable('data')
+net <- mx.symbol.FullyConnected(data=net, name='fc1', num_hidden=128)
+net2 <- mx.symbol.Variable('data2')
+net2 <- mx.symbol.FullyConnected(data=net2, name='net2', num_hidden=128)
+composed_net <- net(data=net2, name='compose')
+arguments(composed_net)
+```
+
+In the above example, *net* is used as a function applied to an existing symbol;
+in the resulting *composed_net*, the original argument *data* is replaced
+by *net2*.
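The substitution mechanism described above can be pictured with ordinary R functions. A rough base-R analogy (illustrative only, not mxnet code; the function names here are made up):

```r
# Base-R analogy (not mxnet code): a symbol behaves like a partially
# specified function, and composition substitutes one graph into another.
fc1 <- function(input) input * 2                  # stands in for a layer
net <- function(data) fc1(data)                   # original network
net2 <- function(data2) data2 + 1                 # another input pipeline
composed_net <- function(data2) net(net2(data2))  # 'data' replaced by net2
composed_net(3)  # 8
```

The real Symbol composition does the same graph substitution, but symbolically and without executing anything until bind time.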
+
+### Argument Shape Inference
+
+Now that we know how to define a symbol, we can infer the shapes of
+all its arguments given the input data shape.
+
+```{r}
+net <- mx.symbol.Variable('data')
+net <- mx.symbol.FullyConnected(data=net, name='fc1', num_hidden=10)
+```
+
+Shape inference can be used as an early debugging mechanism to detect
+shape inconsistencies.
+
+### Bind the Symbols and Run
+
+Now we can bind the free variables of the symbol and perform the forward and backward passes.
+The bind function will create an ```Executor``` that can be used to carry out the real computations.
+
+For neural nets, a more commonly used pattern is ```simple_bind```, which will create
+all the argument arrays for you. Then you can call forward, and backward (if the gradient is needed),
+to get the gradient.
+
+```{r, eval=FALSE}
+A <- mx.symbol.Variable('A')
+B <- mx.symbol.Variable('B')
+C <- A * B
+
+texec <- mx.simple.bind(C)
+texec$forward()
+texec$backward()
+```
+
+The [model API](../../R-package/R/model.R) is a thin wrapper around the symbolic executors to support neural net training.
+
+You are also highly encouraged to read [Symbolic Configuration and Execution in Pictures](symbol_in_pictures.md),
+which provides a detailed explanation of the concepts in pictures.
+
+### How Efficient is the Symbolic API
+
+In short, it is designed to be very efficient in both memory and runtime.
+
+The major reason for introducing the Symbolic API is to bring the efficient C++
+operations of powerful toolkits such as cxxnet and caffe together with the
+flexible dynamic NDArray operations. All the memory and computation resources are
+allocated statically during Bind, to maximize the runtime performance and memory
+utilization.
+
+The coarse-grained operators are equivalent to cxxnet layers, which are
+extremely efficient. We also provide fine-grained operators for more flexible
+composition.
Because we also do more in-place memory allocation, mxnet can
+be ***more memory efficient*** than cxxnet, achieving the same runtime with
+greater flexibility.
+
+
+
+
+
+
+
+
+
+
diff --git a/doc/R-package/Makefile b/doc/R-package/Makefile
index a59a3fde4220..7ca47d63776d 100644
--- a/doc/R-package/Makefile
+++ b/doc/R-package/Makefile
@@ -3,6 +3,8 @@ PKGROOT=../../R-package
 
 # ADD The Markdown to be built here
 classifyRealImageWithPretrainedModel.md:
+mnistCompetition.md:
+ndarrayAndSymbolTutorial.md:
 
 # General Rules for build rmarkdowns, need knitr
 %.md: $(PKGROOT)/vignettes/%.Rmd
diff --git a/doc/R-package/index.md b/doc/R-package/index.md
index d8e381f8f442..20d5e70f1ac3 100644
--- a/doc/R-package/index.md
+++ b/doc/R-package/index.md
@@ -10,6 +10,8 @@ The MXNet R packages brings flexible and efficient GPU computing and deep learni
 Tutorials
 ---------
 * [Classify Realworld Images with Pretrained Model](classifyRealImageWithPretrainedModel.md)
+* [Handwritten Digits Classification Competition](mnistCompetition.md)
+* [Tutorial on NDArray and Symbol](ndarrayAndSymbolTutorial.md)
 
 Installation
 ------------
diff --git a/doc/R-package/mnistCompetition.md b/doc/R-package/mnistCompetition.md
new file mode 100644
index 000000000000..dd806dfe777b
--- /dev/null
+++ b/doc/R-package/mnistCompetition.md
@@ -0,0 +1,209 @@
+---
+title: "Handwritten Digits Classification Competition"
+author: "Tong He"
+date: "October 17, 2015"
+output: html_document
+---
+
+[MNIST](http://yann.lecun.com/exdb/mnist/) is a handwritten digit image data set created by Yann LeCun. Every digit is represented by a 28x28 image. It has become a standard data set for testing classifiers on simple image input. Neural networks are undoubtedly strong models for image classification tasks. There is a [long-term hosted competition](https://www.kaggle.com/c/digit-recognizer) on Kaggle using this data set. We will present the basic usage of `mxnet` to compete in this challenge.
+
+## Data Loading
+
+First, let us download the data from [here](https://www.kaggle.com/c/digit-recognizer/data), and put the files under the `data/` folder in your working directory.
+
+Then we can read them into R and convert them to matrices:
+
+
+```r
+train <- read.csv('data/train.csv', header=TRUE)
+test <- read.csv('data/test.csv', header=TRUE)
+train <- data.matrix(train)
+test <- data.matrix(test)
+
+train.x <- train[,-1]
+train.y <- train[,1]
+```
+
+Here every image is represented as a single row in train/test. The greyscale values of each image fall in the range [0, 255]; we can linearly transform them into [0, 1] by
+
+
+```r
+train.x <- train.x/255
+test <- test/255
+```
+
+In the label part, we see that the counts of the different digits are fairly even:
+
+
+```r
+table(train.y)
+```
+
+## Network Configuration
+
+Now we have the data. The next step is to configure the structure of our network.
+
+
+```r
+data <- mx.symbol.Variable("data")
+```
+
+```
+## Error in eval(expr, envir, enclos): could not find function "mx.symbol.Variable"
+```
+
+```r
+fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)
+```
+
+```
+## Error in eval(expr, envir, enclos): could not find function "mx.symbol.FullyConnected"
+```
+
+```r
+act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")
+```
+
+```
+## Error in eval(expr, envir, enclos): could not find function "mx.symbol.Activation"
+```
+
+```r
+fc2 <- mx.symbol.FullyConnected(act1, name = "fc2", num_hidden = 64)
+```
+
+```
+## Error in eval(expr, envir, enclos): could not find function "mx.symbol.FullyConnected"
+```
+
+```r
+act2 <- mx.symbol.Activation(fc2, name="relu2", act_type="relu")
+```
+
+```
+## Error in eval(expr, envir, enclos): could not find function "mx.symbol.Activation"
+```
+
+```r
+fc3 <- mx.symbol.FullyConnected(act2, name="fc3", num_hidden=10)
+```
+
+```
+## Error in eval(expr, envir, enclos): could not find function "mx.symbol.FullyConnected"
+```
+
+```r
+softmax <- mx.symbol.Softmax(fc3, name = "sm")
+```
+
+```
+## Error in eval(expr, envir, enclos): could not find function "mx.symbol.Softmax"
+```
+
+1. In `mxnet`, we use its own data type `symbol` to configure the network. `data <- mx.symbol.Variable("data")` uses `data` to represent the input data, i.e. the input layer.
+2. Then we set the first hidden layer with `fc1 <- mx.symbol.FullyConnected(data, name="fc1", num_hidden=128)`. This layer takes `data` as its input, along with a name and the number of hidden neurons.
+3. The activation is set by `act1 <- mx.symbol.Activation(fc1, name="relu1", act_type="relu")`. The activation function takes the output of the first hidden layer `fc1`.
+4. The second hidden layer takes the result from `act1` as its input, with its name "fc2" and 64 hidden neurons.
+5. The second activation is almost the same as `act1`, except with a different input source and name.
+6. Here comes the output layer. Since there are only 10 digits, we set the number of neurons to 10.
+7. Finally we set the activation to softmax to get a probabilistic prediction.
+
+## Training
+
+We are almost ready for the training process. Before we start the computation, let's decide which device we should use.
+
+
+```r
+devices <- lapply(1:2, function(i) {
+  mx.cpu(i)
+})
+```
+
+```
+## Error in FUN(1:2[[1L]], ...): could not find function "mx.cpu"
+```
+
+Here we assign two threads of our CPU to `mxnet`. After all this preparation, you can run the following command to train the neural network!
+
+
+```r
+set.seed(0)
+model <- mx.model.FeedForward.create(softmax, X=train.x, y=train.y,
+                                     ctx=devices, num.round=10, array.batch.size=100,
+                                     learning.rate=0.07, momentum=0.9,
+                                     initializer=mx.init.uniform(0.07),
+                                     epoch.end.callback=mx.callback.log.train.metric(100))
+```
+
+```
+## Error in eval(expr, envir, enclos): could not find function "mx.model.FeedForward.create"
+```
+
+## Prediction and Submission
+
+To make a prediction, we can simply write
+
+
+```r
+preds <- predict(model, test)
+```
+
+```
+## Error in predict(model, test): object 'model' not found
+```
+
+```r
+dim(preds)
+```
+
+```
+## Error in eval(expr, envir, enclos): object 'preds' not found
+```
+
+It is a matrix with 28000 rows and 10 columns, containing the class probabilities from the output layer. To extract the most likely label for each row, we can use `max.col` in R:
+
+
+```r
+pred.label <- max.col(preds) - 1
+```
+
+```
+## Error in as.matrix(m): object 'preds' not found
+```
+
+```r
+table(pred.label)
+```
+
+```
+## Error in table(pred.label): object 'pred.label' not found
+```
+
+With a little extra effort to match the required CSV format, we have our submission to the competition!
+
+
+```r
+submission <- data.frame(ImageId=1:nrow(test), Label=pred.label)
+```
+
+```
+## Error in nrow(test): object 'test' not found
+```
+
+```r
+write.csv(submission, file='submission.csv', row.names=FALSE, quote=FALSE)
+```
+
+```
+## Error in is.data.frame(x): object 'submission' not found
+```
+
+
+
+
+
+
+
+
+
diff --git a/doc/R-package/ndarrayAndSymbolTutorial.md b/doc/R-package/ndarrayAndSymbolTutorial.md
new file mode 100644
index 000000000000..b5572c5e3d9d
--- /dev/null
+++ b/doc/R-package/ndarrayAndSymbolTutorial.md
@@ -0,0 +1,454 @@
+MXNet R Tutorial on NDArray and Symbol
+============================
+
+This vignette gives a general overview of MXNet's R package. MXNet combines several elements that make it possible to build flexible and efficient applications.
There are mainly three concepts:
+
+* [NDArray](#ndarray-vectorized-tensor-computations-on-cpus-and-gpus)
+  offers matrix and tensor computations on both CPU and GPU, with automatic
+  parallelization
+* [Symbol](#symbol-and-automatic-differentiation) makes defining a neural
+  network extremely easy, and provides automatic differentiation.
+* [KVStore](#distributed-key-value-store) eases data synchronization between
+  multiple GPUs and machines.
+
+## NDArray: Vectorized tensor computations on CPUs and GPUs
+
+`NDArray` is the basic vectorized operation unit in MXNet for matrix and tensor computations.
+Users can perform the usual calculations as on an R array, but with two additional features:
+
+1. **multiple devices**: all operations can be run on various devices including
+CPU and GPU
+2. **automatic parallelization**: all operations are automatically executed in
+  parallel with each other
+
+### Creation and Initialization
+
+Let's create an `NDArray` on either GPU or CPU:
+
+
+```r
+require(mxnet)
+```
+
+```
+## Loading required package: mxnet
+## Loading required package: methods
+```
+
+```r
+a <- mx.nd.zeros(c(2, 3))           # create a 2-by-3 matrix on cpu
+b <- mx.nd.zeros(c(2, 3), mx.gpu()) # create a 2-by-3 matrix on gpu 0
+```
+
+```
+## Error in eval(expr, envir, enclos): [15:41:37] src/storage/storage.cc:43: Please compile with CUDA enabled
+```
+
+```r
+c <- mx.nd.zeros(c(2, 3), mx.gpu(2)) # create a 2-by-3 matrix on gpu 2
+```
+
+```
+## Error in eval(expr, envir, enclos): [15:41:37] src/storage/storage.cc:43: Please compile with CUDA enabled
+```
+
+```r
+c$dim()
+```
+
+```
+## Error in c$dim: object of type 'builtin' is not subsettable
+```
+
+We can also initialize an `NDArray` object in various ways:
+
+
+```r
+a <- mx.nd.ones(c(4, 4))
+b <- mx.rnorm(c(4, 5))
+c <- mx.nd.array(1:5)
+```
+
+To check the numbers in an `NDArray`, we can simply run
+
+
+```r
+a <- mx.nd.ones(c(2, 3))
+b <- as.array(a)
+class(b)
+```
+
+```
+## [1] "matrix"
+```
+
+```r
+b
+```
+
+```
+##      [,1] [,2] [,3]
+## [1,]    1    1    1
+## [2,]    1    1    1
+```
+
+### Basic Operations
+
+#### Element-wise operations
+
+You can perform element-wise operations on `NDArray` objects:
+
+
+```r
+a <- mx.nd.ones(c(2, 3)) * 2
+b <- mx.nd.ones(c(2, 4)) / 8
+as.array(a)
+```
+
+```
+##      [,1] [,2] [,3]
+## [1,]    2    2    2
+## [2,]    2    2    2
+```
+
+```r
+as.array(b)
+```
+
+```
+##       [,1]  [,2]  [,3]  [,4]
+## [1,] 0.125 0.125 0.125 0.125
+## [2,] 0.125 0.125 0.125 0.125
+```
+
+```r
+c <- a + b
+```
+
+```
+## Error in eval(expr, envir, enclos): [15:41:37] src/ndarray/./ndarray_function.h:20: Check failed: lshape == rshape operands shape mismatch
+```
+
+```r
+as.array(c)
+```
+
+```
+## [1] 1 2 3 4 5
+```
+
+```r
+d <- c / a - 5
+```
+
+```
+## Error in eval(expr, envir, enclos): [15:41:37] src/ndarray/./ndarray_function.h:20: Check failed: lshape == rshape operands shape mismatch
+```
+
+```r
+as.array(d)
+```
+
+```
+## Error in as.array(d): object 'd' not found
+```
+
+If two `NDArray`s sit on different devices, we need to explicitly move them
+into the same one.
For instance:
+
+
+```r
+a <- mx.nd.ones(c(2, 3)) * 2
+b <- mx.nd.ones(c(2, 3), mx.gpu()) / 8
+```
+
+```
+## Error in eval(expr, envir, enclos): [15:41:37] src/storage/storage.cc:43: Please compile with CUDA enabled
+```
+
+```r
+c <- mx.nd.copyto(a, mx.gpu()) * b
+```
+
+```
+## Error in eval(expr, envir, enclos): [15:41:37] src/storage/storage.cc:43: Please compile with CUDA enabled
+```
+
+```r
+as.array(c)
+```
+
+```
+## [1] 1 2 3 4 5
+```
+
+#### Load and Save
+
+You can save an `NDArray` object to disk with `mx.nd.save`:
+
+
+```r
+a <- mx.nd.ones(c(2, 3))
+mx.nd.save(a, 'temp.ndarray')
+```
+
+```
+## Error in eval(expr, envir, enclos): could not convert using R function : as.list
+```
+
+You can also load it back easily:
+
+
+```r
+a <- mx.nd.load('temp.ndarray')
+```
+
+```
+## Error in eval(expr, envir, enclos): [15:41:37] src/io/local_filesys.cc:149: Check failed: allow_null LocalFileSystem: fail to open "temp.ndarray"
+```
+
+```r
+as.array(a[[1]])
+```
+
+```
+## Error in a[[1]]: object of type 'externalptr' is not subsettable
+```
+
+In case you want to save data to a distributed file system such as S3 or HDFS,
+you can directly save to and load from it. For example:
+
+
+```r
+mx.nd.save(a, 's3://mybucket/mydata.bin')
+mx.nd.save(a, 'hdfs:///users/myname/mydata.bin')
+```
+
+### Automatic Parallelization
+
+`NDArray` can automatically execute operations in parallel. This is desirable when we
+use multiple resources such as CPU, GPU, and CPU-to-GPU memory bandwidth.
+
+For example, if we write `a <- a + 1` followed by `b <- b + 1`, and `a` is on CPU while
+`b` is on GPU, then we want to execute them in parallel to improve
+efficiency. Furthermore, data copies between CPU and GPU are expensive, so we
+want to run them in parallel with other computations as well.
+
+However, finding by eye which code can be executed in parallel is hard.
In the
+following example, `a <- a + 1` and `c <- c * 3` can be executed in parallel, but `a <- a + 1` and
+`b <- b * 3` must be executed sequentially.
+
+
+```r
+a <- mx.nd.ones(c(2,3))
+b <- a
+c <- mx.nd.copyto(a, mx.cpu())
+a <- a + 1
+b <- b * 3
+c <- c * 3
+```
+
+Luckily, MXNet can automatically resolve the dependencies and
+execute operations in parallel with correctness guaranteed. In other words, we
+can write a program as if it were single-threaded, and MXNet will
+automatically dispatch it onto multiple devices, such as multiple GPU cards or multiple
+machines.
+
+This is achieved by lazy evaluation. Each operation we write down is issued to an
+internal engine, and then returns. For example, if we run `a <- a + 1`, it
+returns immediately after pushing the plus operator to the engine. This
+asynchronicity allows us to push more operators to the engine, so it can determine
+the read and write dependencies and find the best way to execute the operations in
+parallel.
+
+The actual computations are performed when we copy the results to some
+other place, such as `as.array(a)` or `mx.nd.save(a, 'temp.dat')`. Therefore, to
+write highly parallelized code, we only need to postpone asking for
+the results.
+
+## Symbol and Automatic Differentiation
+
+With the computational unit `NDArray`, we need a way to construct neural networks. MXNet provides a symbolic interface, named Symbol, to do so. Symbol combines both flexibility and efficiency.
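The configure-then-execute idea behind Symbol can be sketched with base R's `quote`/`eval` (an analogy only; mxnet's Symbol is far more capable, and none of this is mxnet code):

```r
# Base-R analogy (not mxnet code): a symbol is an unevaluated computation
# graph with free variables; binding supplies values and runs it.
graph <- quote(data * 2 + 1)  # "configure" the computation
bound <- list(data = 5)       # "bind" the free variable 'data'
eval(graph, bound)            # "forward" pass: returns 11
```

The key point of the analogy is that `graph` carries no values of its own; it can be inspected, transformed, or differentiated before anything is executed.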
+
+### Basic Composition of Symbols
+
+The following code creates a two-layer perceptron network:
+
+
+```r
+require(mxnet)
+net <- mx.symbol.Variable('data')
+net <- mx.symbol.FullyConnected(data=net, name='fc1', num_hidden=128)
+net <- mx.symbol.Activation(data=net, name='relu1', act_type="relu")
+net <- mx.symbol.FullyConnected(data=net, name='fc2', num_hidden=64)
+net <- mx.symbol.Softmax(data=net, name='out')
+class(net)
+```
+
+```
+## [1] "Rcpp_MXSymbol"
+## attr(,"package")
+## [1] "mxnet"
+```
+
+Each symbol takes a (unique) string name. *Variable* often defines the inputs,
+or free variables. Other symbols take a symbol as their input (*data*),
+and may accept other hyper-parameters such as the number of hidden neurons (*num_hidden*)
+or the activation type (*act_type*).
+
+The symbol can be simply viewed as a function taking several arguments, whose
+names are automatically generated and can be obtained with
+
+
+```r
+arguments(net)
+```
+
+```
+## [1] "data"       "fc1_weight" "fc1_bias"   "fc2_weight" "fc2_bias"
+## [6] "out_label"
+```
+
+As can be seen, these arguments are the parameters needed by each symbol:
+
+- *data* : input data needed by the variable *data*
+- *fc1_weight* and *fc1_bias* : the weight and bias for the first fully connected layer *fc1*
+- *fc2_weight* and *fc2_bias* : the weight and bias for the second fully connected layer *fc2*
+- *out_label* : the label needed by the loss
+
+We can also specify the automatically generated names explicitly:
+
+
+```r
+net <- mx.symbol.Variable('data')
+w <- mx.symbol.Variable('myweight')
+net <- sym.FullyConnected(data=data, weight=w, name='fc1', num_hidden=128)
+```
+
+```
+## Error in eval(expr, envir, enclos): could not find function "sym.FullyConnected"
+```
+
+```r
+arguments(net)
+```
+
+```
+## [1] "data"
+```
+
+### More Complicated Composition
+
+MXNet provides well-optimized symbols (see
+[src/operator](https://github.com/dmlc/mxnet/tree/master/src/operator)) for
+commonly used layers in deep learning.
We can also easily define new operators
+in Python. The following example first performs an element-wise add between two
+symbols, then feeds the result into the fully connected operator.
+
+
+```r
+lhs <- mx.symbol.Variable('data1')
+rhs <- mx.symbol.Variable('data2')
+net <- mx.symbol.FullyConnected(data=lhs + rhs, name='fc1', num_hidden=128)
+arguments(net)
+```
+
+```
+## [1] "data1"      "data2"      "fc1_weight" "fc1_bias"
+```
+
+We can also construct symbols in a more flexible way than the single
+forward composition shown above.
+
+
+```r
+net <- mx.symbol.Variable('data')
+net <- mx.symbol.FullyConnected(data=net, name='fc1', num_hidden=128)
+net2 <- mx.symbol.Variable('data2')
+net2 <- mx.symbol.FullyConnected(data=net2, name='net2', num_hidden=128)
+composed_net <- net(data=net2, name='compose')
+```
+
+```
+## Error in eval(expr, envir, enclos): could not find function "net"
+```
+
+```r
+arguments(composed_net)
+```
+
+```
+## Error in inherits(x, "Rcpp_MXSymbol"): object 'composed_net' not found
+```
+
+In the above example, *net* is used as a function applied to an existing symbol;
+in the resulting *composed_net*, the original argument *data* is replaced
+by *net2*.
+
+### Argument Shape Inference
+
+Now that we know how to define a symbol, we can infer the shapes of
+all its arguments given the input data shape.
+
+
+```r
+net <- mx.symbol.Variable('data')
+net <- mx.symbol.FullyConnected(data=net, name='fc1', num_hidden=10)
+```
+
+Shape inference can be used as an early debugging mechanism to detect
+shape inconsistencies.
+
+### Bind the Symbols and Run
+
+Now we can bind the free variables of the symbol and perform the forward and backward passes.
+The bind function will create an ```Executor``` that can be used to carry out the real computations.
+
+For neural nets, a more commonly used pattern is ```simple_bind```, which will create
+all the argument arrays for you.
Then you can call forward, and backward (if the gradient is needed),
+to get the gradient.
+
+
+```r
+A <- mx.symbol.Variable('A')
+B <- mx.symbol.Variable('B')
+C <- A * B
+
+texec <- mx.simple.bind(C)
+texec$forward()
+texec$backward()
+```
+
+The [model API](../../R-package/R/model.R) is a thin wrapper around the symbolic executors to support neural net training.
+
+You are also highly encouraged to read [Symbolic Configuration and Execution in Pictures](symbol_in_pictures.md),
+which provides a detailed explanation of the concepts in pictures.
+
+### How Efficient is the Symbolic API
+
+In short, it is designed to be very efficient in both memory and runtime.
+
+The major reason for introducing the Symbolic API is to bring the efficient C++
+operations of powerful toolkits such as cxxnet and caffe together with the
+flexible dynamic NDArray operations. All the memory and computation resources are
+allocated statically during Bind, to maximize the runtime performance and memory
+utilization.
+
+The coarse-grained operators are equivalent to cxxnet layers, which are
+extremely efficient. We also provide fine-grained operators for more flexible
+composition. Because we also do more in-place memory allocation, mxnet can
+be ***more memory efficient*** than cxxnet, achieving the same runtime with
+greater flexibility.
+
+
+
+
+
+
+
+
+
+
+
diff --git a/doc/conf.py b/doc/conf.py
index 2551a291a3e2..413f3f661c56 100644
--- a/doc/conf.py
+++ b/doc/conf.py
@@ -42,6 +42,7 @@ MarkdownParser.github_doc_root = github_doc_root
 source_parsers = {
     '.md': MarkdownParser,
+    '.Rmd': MarkdownParser,
 }
 os.environ['MXNET_BUILD_DOC'] = '1'
 # Version information.
@@ -71,7 +72,7 @@
 # The suffix(es) of source filenames.
 # You can specify multiple suffix as a list of string:
 # source_suffix = ['.rst', '.md']
-source_suffix = ['.rst', '.md']
+source_suffix = ['.rst', '.md', '.Rmd']
 
 # The encoding of source files.
 #source_encoding = 'utf-8-sig'