Big Data and R

Theodora Fragkouli | January 2nd, 2014 | Last Updated: January 2nd, 2014

This blog post presents tips on computing with Big Data in R, using Revolution R Enterprise 7.0 and RevoScaleR, Revolution’s R package for HPA computing, as introduced on the Revolution Analytics blog. For more detailed information, take a look at Tips on Computing with Big Data in R.

1 Upgrade your hardware

Since bigger is better, increasing memory and adding as many cores as R can use is very helpful. To keep those extra cores busy, also try to avoid bottlenecks in disk I/O and in the speed of RAM.

2 Upgrade your software

Since R allows its core math libraries to be replaced, any function that makes use of computational linear algebra algorithms can get a performance boost. Revolution R Enterprise, for example, links in the Intel Math Kernel Library (MKL).

3 Minimize copies of the data

R does quite a bit of automatic copying. For example, when a data frame is passed into a function, a copy of the data is made if the data frame is modified, and putting a data frame into a list also automatically causes a copy to be made. Moreover, many basic analysis algorithms, such as lm and glm, produce multiple copies of a data set as the computations progress. With big data each extra copy can exhaust the available memory, so memory management is important.
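
The copy-on-modify behaviour described above can be observed with a small base R sketch. The function name add_column and the toy data frame are illustrative assumptions, and tracemem() only works when R was built with memory profiling, so that part is guarded.

```r
# Copy-on-modify: modifying a data frame inside a function leaves the
# caller's object untouched, because R works on a copy.
add_column <- function(df) {        # hypothetical helper for illustration
  df$z <- df$x * 2                  # the modification triggers the copy
  df
}

original <- data.frame(x = 1:5)
modified <- add_column(original)

names(original)   # still only "x"
names(modified)   # "x" and "z"

# tracemem() reports copies, but only if this R build supports memory
# profiling (an assumption about the local build, hence the guard).
if (capabilities("profmem")) {
  invisible(tracemem(original))
  original$x[1] <- 10L              # may print a tracemem message
  untracemem(original)
}
```

Keeping modifications inside as few functions as possible reduces the number of such copies.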

4 Process data in chunks

Processing data a chunk at a time can scale computations without increasing memory requirements. There are several CRAN packages, including biglm, bigmemory, ff and ffbase, that can implement external memory algorithms or help with writing them. Revolution R Enterprise’s RevoScaleR package takes chunking algorithms to the next level by automatically taking advantage of the available computational resources to run its algorithms in parallel.
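
As a minimal sketch of the chunking idea, the following base R code computes a mean by reading a CSV file a fixed number of rows at a time. It does not use any of the packages named above; the temporary file, the column name x and the chunk size are assumptions made for the demonstration.

```r
# Write a small demo file (a stand-in for a file too large for memory).
path <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:1000), path, row.names = FALSE)

chunk_size <- 100                    # rows per chunk (hypothetical value)
con <- file(path, open = "r")
invisible(readLines(con, n = 1))     # skip the header line

total <- 0
n <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break      # end of file reached
  chunk <- read.csv(text = lines, header = FALSE, col.names = "x")
  total <- total + sum(chunk$x)      # update running statistics only
  n <- n + nrow(chunk)
}
close(con)

chunk_mean <- total / n
```

Only one chunk and two running totals are held in memory at any time, regardless of the file size.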

5 Compute in parallel across cores or nodes

To scale computations to big data, the CRAN package foreach provides easy-to-use tools for executing R functions in parallel, both on a single computer and across multiple computers. The foreach() function is particularly useful for “embarrassingly parallel” computations that do not involve communication among different tasks.
The statistical functions and machine learning algorithms in the RevoScaleR package are all Parallel External Memory Algorithms (PEMAs). They automatically take advantage of all of the cores available on a machine or on a cluster (including LSF and Hadoop clusters).
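
An embarrassingly parallel computation can be sketched as follows. To avoid extra dependencies this example uses mclapply() from the parallel package that ships with R rather than foreach; the block layout and the choice of two cores are assumptions.

```r
library(parallel)

# Split the work into independent blocks (no communication between tasks).
blocks <- split(1:1000, rep(1:4, each = 250))

# mclapply() forks worker processes; forking is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

partial_sums <- mclapply(blocks,
                         function(idx) sum(as.numeric(idx)^2),
                         mc.cores = n_cores)

# Combine the per-block results into the final answer.
total <- Reduce(`+`, partial_sums)
```

Each block is processed without reference to the others, which is what makes the final combination step a simple sum.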

6 Take advantage of integers

In R, the two choices for “continuous” data are numeric, an 8-byte (double) floating point number, and integer, a 4-byte integer. There are circumstances where storing and processing integer data can provide the dual advantages of using less memory and decreasing processing time. For example, when working with integers, a tabulation is generally much faster than sorting and gives exact values for all empirical quantiles. Even when you are not working with integers, scaling and converting to integers can produce fast and accurate estimates of quantiles. As an example, if the data consists of floating point values in the range from 0 to 1,000, converting to integers and tabulating will bound the median or any other quantile to within two adjacent integers. Interpolation can then give an even closer approximation.
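
The tabulation approach for the 0 to 1,000 example can be sketched in base R. The data are simulated and the variable names are assumptions; the point is that the median is bracketed without sorting.

```r
set.seed(1)
x <- runif(10001, min = 0, max = 1000)   # simulated floating point data

# Convert to integers and tabulate; shift by 1 because tabulate() counts
# the values 1..nbins and our integers start at 0.
xi <- as.integer(floor(x))
counts <- tabulate(xi + 1L, nbins = 1001L)

# Walk the cumulative counts to find the bin that holds the median,
# which is the middle order statistic for an odd number of values.
target <- (length(x) + 1) / 2
bin <- which(cumsum(counts) >= target)[1] - 1L

# The median lies between two adjacent integers: bin and bin + 1.
c(lower = bin, upper = bin + 1L)
```

Interpolating within the bin, using the counts on either side, would narrow the estimate further.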

7 Store data efficiently

When big data has to be accessed efficiently from disk, appropriate data types should be used to save storage space and access time. Integers should be preferred when possible, and floating point data can usually be stored as 32-bit floats instead of 64-bit doubles, since floats can represent about 7 decimal digits of precision, which is more than enough for most data, and they take up half the space of doubles. Save the 64-bit doubles for computations.
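
The space saving can be illustrated with base R's writeBin(), which can write numeric values to disk as 4-byte floats. This is only a sketch with simulated data and temporary files; it is not how RevoScaleR stores data.

```r
set.seed(1)
n <- 100000
x <- round(runif(n, 0, 100), 2)

f_double <- tempfile()
f_float  <- tempfile()
writeBin(x, f_double)               # 8-byte doubles (the default)
writeBin(x, f_float, size = 4)      # 4-byte single-precision floats

sizes <- file.size(c(f_double, f_float))   # float file is half the size

# Read the floats back; readBin() returns doubles for computation.
x_back  <- readBin(f_float, what = "numeric", n = n, size = 4)
max_err <- max(abs(x_back - x))     # small rounding error from float storage
```

The values are stored compactly on disk and widened to doubles only when they are loaded for computation.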

8 Only read the data needed

Reading from disk only the variables needed for computations and analysis, instead of reading the whole data set, can speed up the analysis considerably.
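
With plain CSV files, one way to do this in base R is the colClasses argument of read.csv(), where "NULL" skips a column entirely. The file and its column names are made up for the example.

```r
# Demo file with three columns, of which only one is needed.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id    = 1:5,
                     age   = c(31, 45, 27, 52, 38),
                     notes = letters[1:5]),
          path, row.names = FALSE)

# "NULL" tells read.csv() to skip a column; only age is kept.
needed <- read.csv(path, colClasses = c("NULL", "numeric", "NULL"))

names(needed)
```

Skipped columns are never converted or stored, which saves both time and memory.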

9 Avoid loops when transforming data

Loops in R can be very slow compared with R’s core vector operations, which are typically written in C, C++ or Fortran, so they should be avoided wherever a vectorized alternative exists.
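
A small sketch of the same transformation written both ways; the transformation itself is an arbitrary example.

```r
set.seed(1)
x <- runif(1e5)

# Loop version: one R-level iteration per element.
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
  loop_result[i] <- log(x[i]) * 2 + 1
}

# Vectorized version: the arithmetic runs in compiled code.
vec_result <- log(x) * 2 + 1

# Both give the same values; wrap each in system.time() to compare speed.
all.equal(loop_result, vec_result)
```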

10 Use C, C++, or Fortran for critical functions

Since R integrates easily with other languages, including C, C++, and Fortran, one can pass R data objects to those languages, do some computations there, and return the results as R data objects. This makes it practical to write performance-critical functions in a compiled language. The CRAN package Rcpp, for example, makes it easy to call C and C++ code from R.

11 Process data transformations in batches

To avoid the overhead of making multiple passes over large data sets, write chunking algorithms that apply all of the transformations to each chunk. RevoScaleR’s rxDataStep() function is designed for one-pass processing: it permits multiple data transformations to be performed on each chunk.
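
The one-pass idea can be sketched in base R without rxDataStep(). All transformations are bundled into a single function, here called transform_chunk (a hypothetical name), and applied to each chunk as it is read; the file contents and the chunk size of two rows are assumptions for the demo.

```r
in_path  <- tempfile(fileext = ".csv")
out_path <- tempfile(fileext = ".csv")
write.csv(data.frame(price = c(10, 20, 30, 40, 50, 60),
                     qty   = c(1, 2, 3, 4, 5, 6)),
          in_path, row.names = FALSE)

# All transformations live in one function, so each chunk is touched once.
transform_chunk <- function(chunk) {
  chunk$revenue    <- chunk$price * chunk$qty
  chunk$log_price  <- log(chunk$price)
  chunk$bulk_order <- chunk$qty >= 4
  chunk
}

con <- file(in_path, open = "r")
header <- readLines(con, n = 1)
col_names <- strsplit(gsub('"', "", header), ",")[[1]]
first <- TRUE
repeat {
  lines <- readLines(con, n = 2)           # tiny chunk size for the demo
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = col_names)
  out <- transform_chunk(chunk)
  # Only the first chunk writes the header; later chunks append.
  write.table(out, out_path, sep = ",", row.names = FALSE,
              col.names = first, append = !first)
  first <- FALSE
}
close(con)

result <- read.csv(out_path)
```

Adding another transformation means adding a line to transform_chunk, not another pass over the file.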

12 Use row-oriented data transformations where possible

When writing chunking algorithms, try to avoid algorithms that cross chunk boundaries. In general, data transformations for a single row of data should not be dependent on values in other rows. The key idea is that a transformation expression should give the same result even if only some of the rows of data are in memory at one time. Data manipulations requiring lags can be done but require special handling.
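
The special handling that lags require can be sketched as carrying the last row of each chunk forward to the next one. The data and the chunk size of three are assumptions; the check at the end compares the chunked result with the whole-data computation.

```r
x <- c(5, 7, 6, 9, 12, 11, 15)
chunks <- split(x, ceiling(seq_along(x) / 3))   # chunks of up to 3 rows

carry <- NA_real_            # last value of the previous chunk
diffs <- numeric(0)
for (chunk in chunks) {
  # Lag-1 values for this chunk, using the value carried across the boundary.
  lagged <- c(carry, head(chunk, -1))
  diffs  <- c(diffs, chunk - lagged)
  carry  <- tail(chunk, 1)   # remember the boundary row for the next chunk
}

# The whole-data computation, for comparison.
whole <- c(NA, diff(x))
all.equal(diffs, whole)
```

Without the carried value, the first row of every chunk would get a missing difference.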

13 Handle categorical variables efficiently and with care

Working with categorical or factor variables in big data sets can be challenging. For example, using R’s factor() function in a transformation on a chunk of data, without explicitly specifying all of the levels that are present in the entire data set, can produce incompatible factor levels from chunk to chunk. Also, building models with factors that have hundreds of levels may cause hundreds of dummy variables to be created, which really eats up memory. The functions in the RevoScaleR package that deal with factors minimize memory use and generally do not explicitly create dummy variables to represent factors.
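
The factor-level problem, and the fix of specifying levels explicitly, can be seen in a few lines of base R. The level set and the two chunks are made-up examples.

```r
all_levels <- c("red", "green", "blue")   # hypothetical full level set

chunk1 <- c("red", "red", "green")
chunk2 <- c("blue", "green")

# Without explicit levels each chunk gets its own level set, so the
# underlying integer codes are not comparable between chunks.
levels(factor(chunk1))
levels(factor(chunk2))

# With explicit levels the codes agree across chunks.
f1 <- factor(chunk1, levels = all_levels)
f2 <- factor(chunk2, levels = all_levels)
identical(levels(f1), levels(f2))

# Counts can then be accumulated chunk by chunk.
counts <- table(f1) + table(f2)
```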

14 Be aware of output with the same number of rows as your input

When output has the same number of rows as the data, for example, when computing predictions and residuals, the output should be written out to a file rather than kept in memory.
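
A minimal sketch of writing predictions and residuals to a file chunk by chunk instead of accumulating them in memory. The simulated data, the lm() model and the chunk size of 250 are assumptions; in a real big-data setting the chunks would be read from disk rather than taken from an in-memory data frame.

```r
set.seed(42)
dat <- data.frame(x = runif(1000))
dat$y <- 3 * dat$x + rnorm(1000, sd = 0.1)
fit <- lm(y ~ x, data = dat)

out_path  <- tempfile(fileext = ".csv")
chunk_ids <- ceiling(seq_len(nrow(dat)) / 250)   # 4 chunks of 250 rows

for (id in unique(chunk_ids)) {
  chunk <- dat[chunk_ids == id, , drop = FALSE]
  pred  <- predict(fit, newdata = chunk)
  out   <- data.frame(pred = pred, resid = chunk$y - pred)
  # Append to the file; only the first chunk writes the header.
  write.table(out, out_path, sep = ",", row.names = FALSE,
              col.names = (id == 1), append = (id > 1))
}

n_written <- nrow(read.csv(out_path))
```

Only one chunk of output exists in memory at a time; the full set of predictions lives on disk.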

15 Think twice before sorting

Since sorting is a time-intensive operation, we should use implementations of algorithms that avoid it. In R, the RevoScaleR function rxDTree() avoids sorting by working with histograms of the data rather than with the raw data itself.
