Theodora Fragkouli

About Theodora Fragkouli

Theodora graduated from the Computer Engineering and Informatics Department of the University of Patras. She also holds a Master's degree in Economics from the National and Technical University of Athens. During her studies she was involved in a large number of projects ranging from programming and software engineering to telecommunications, hardware design and analysis.

Big Data and R

This blog post presents tips on computing with Big Data in R, using Revolution R Enterprise 7.0 and RevoScaleR, Revolution’s R package for High Performance Analytics (HPA), as introduced on the Revolution Analytics blog. For more detailed information, take a look at Tips on Computing with Big Data in R.

1 Upgrade your hardware

Since bigger is better, increasing memory and adding as many cores as R can use is very helpful. Also try to avoid bottlenecks in disk I/O and in the speed of RAM, so that the additional cores can actually be kept busy.

2 Upgrade your software

Since R allows its core math libraries to be replaced, any function that relies on computational linear algebra algorithms can get a performance boost from a faster library. Revolution R Enterprise links in the Intel Math Kernel Library (MKL).

3 Minimize copies of the data

R does quite a bit of automatic copying. For example, when a data frame is passed into a function, a copy of the data is made if the data frame is modified, and putting a data frame into a list also automatically causes a copy to be made. Moreover, many basic analysis algorithms, such as lm and glm, produce multiple copies of a data set as the computations progress. With big data these copies can exhaust memory quickly, so careful memory management is important.
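
The copy-on-modify behaviour described above can be illustrated with a minimal base R sketch. The data frame and the helper add_column() are hypothetical and exist only for this illustration.

```r
# Copy-on-modify: changing a data frame inside a function leaves the
# caller's object untouched, because R copies the data before modifying it.
df <- data.frame(x = 1:5, y = c(2.5, 3.5, 4.5, 5.5, 6.5))

# Hypothetical helper that modifies its argument.
add_column <- function(d) {
  d$z <- d$x * 2   # the modification works on a copy of d
  d
}

modified <- add_column(df)

# The original keeps its two columns; the result is a separate object.
original_cols <- names(df)
modified_cols <- names(modified)
```

With a data frame of several gigabytes, each such modification inside a function may temporarily need a comparable amount of additional memory.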

4 Process data in chunks

Processing data a chunk at a time can scale computations without increasing memory requirements. There are several CRAN packages including biglm, bigmemory, ff and ffbase that can implement external memory algorithms or help with writing them. Revolution R Enterprise’s RevoScaleR package takes chunking algorithms to the next level by automatically taking advantage of the computational resources to run its algorithms in parallel.
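
As a rough sketch of the idea, the following base R code computes a mean by reading a file one chunk at a time. It does not use any of the packages named above; the file, the column and the chunk size are assumptions made for the example.

```r
# Write a small example file (a stand-in for data too large for memory).
path <- tempfile(fileext = ".csv")
write.csv(data.frame(value = 1:1000), path, row.names = FALSE)

chunk_size <- 100   # rows per chunk (assumed; tune to available memory)
con <- file(path, open = "r")
invisible(readLines(con, n = 1))   # skip the header line

total <- 0
n <- 0
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  values <- as.numeric(lines)      # the file has a single numeric column
  total <- total + sum(values)     # keep only the running statistics
  n <- n + length(values)
}
close(con)
unlink(path)

chunked_mean <- total / n
```

Only one chunk and two running totals are in memory at any time, so the memory footprint does not grow with the size of the file.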

5 Compute in parallel across cores or nodes

To scale computations to big data, the CRAN package foreach provides easy-to-use tools for executing R functions in parallel, both on a single computer and across multiple computers. The foreach() function is particularly useful for “embarrassingly parallel” computations that do not involve communication among different tasks.

The statistical functions and machine learning algorithms in the RevoScaleR package are all Parallel External Memory Algorithms (PEMAs). They automatically take advantage of all of the cores available on a machine or on a cluster (including LSF and Hadoop clusters).
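
A minimal sketch of an embarrassingly parallel computation is shown below. It uses mclapply() from the parallel package that ships with R rather than foreach, so that it runs without extra packages; the function simulate_mean() and the choice of two cores are assumptions made for the example.

```r
library(parallel)

# An embarrassingly parallel task: each simulation is independent.
simulate_mean <- function(seed) {
  set.seed(seed)
  mean(rnorm(1000))
}

# mclapply() forks worker processes; forking is unavailable on Windows,
# so fall back to a single core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L
parallel_results <- unlist(mclapply(1:8, simulate_mean, mc.cores = n_cores))

# The same computation run serially, for comparison.
serial_results <- sapply(1:8, simulate_mean)
```

Because every task sets its own seed, the parallel and the serial run return the same values, which makes it easy to check the parallel version.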

6 Take advantage of integers

In R, the two choices for “continuous” data are numeric, an 8-byte (double) floating point number, and integer, a 4-byte integer. There are circumstances where storing and processing integer data can provide the dual advantages of using less memory and decreasing processing time. For example, when working with integers, a tabulation is generally much faster than sorting and gives exact values for all empirical quantiles. Even when you are not working with integers, scaling and converting to integers can produce fast and accurate estimates of quantiles. As an example, if the data consists of floating point values in the range from 0 to 1,000, converting to integers and tabulating will bound the median or any other quantile to within two adjacent integers. Interpolation can then get you an even closer approximation.
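
The tabulation idea can be sketched in base R as follows. The simulated data and all variable names are assumptions made for the example; the sorting-based median() is used only as a reference.

```r
set.seed(42)
x <- sample(0:1000, 100001, replace = TRUE)   # integer data in 0..1000

# Count occurrences of each value; shift by 1 because tabulate() counts
# positive integers starting at 1.
counts <- tabulate(x + 1L, nbins = 1001)

# The median is the smallest value whose cumulative count reaches the
# middle observation (n is odd here, so the median is a single value).
middle <- (length(x) + 1) / 2
median_by_table <- which(cumsum(counts) >= middle)[1] - 1L
median_by_sort <- median(x)   # sorting-based reference

# Floating point data in the same range: truncating to integers and
# tabulating bounds the median between two adjacent integers.
y <- runif(100001, 0, 1000)
counts_y <- tabulate(floor(y) + 1, nbins = 1001)
lower_bound <- which(cumsum(counts_y) >= middle)[1] - 1L
upper_bound <- lower_bound + 1L
```

The table has only 1,001 entries no matter how many observations there are, so it can also be accumulated chunk by chunk.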

7 Store data efficiently

When big data has to be accessed efficiently from disk, appropriate data types should be used to save storage space and access time. Integers should be preferred when possible, and 32-bit floats should be preferred over 64-bit doubles: a float can represent about 7 decimal digits of precision, which is more than enough for most data, and it takes up half the space of a double. Save the 64-bit doubles for computations.
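
The space saving can be checked with a small base R sketch that writes the same values to disk in 8-byte and in 4-byte form. The plain binary files used here are an assumption for the illustration and are not the storage format of any particular package.

```r
set.seed(1)
values <- round(runif(10000, 0, 1000), 2)

path_double <- tempfile()
path_float <- tempfile()

# writeBin() can store numeric values in 8 bytes (double precision)
# or in 4 bytes (single precision).
writeBin(values, path_double, size = 8)
writeBin(values, path_float, size = 4)

bytes_double <- file.size(path_double)
bytes_float <- file.size(path_float)

# Reading the 4-byte values back converts them to doubles for computation.
restored <- readBin(path_float, what = "numeric", n = length(values), size = 4)
max_relative_error <- max(abs(restored - values) / pmax(abs(values), 1))

unlink(c(path_double, path_float))
```

The single precision file is half the size, and the values read back agree with the originals to roughly 7 significant digits.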

8 Only read the data needed

Reading from disk only the variables needed for computations and analysis, instead of reading a whole data set, can speed up the analysis considerably.
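
For delimited text files, base R's read.csv() can skip unwanted columns through its colClasses argument. The file and its column names below are made up for the illustration.

```r
# Small example file with three columns, of which only one is needed.
path <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3,
                     age = c(34, 51, 27),
                     notes = c("a", "b", "c")),
          path, row.names = FALSE)

# "NULL" in colClasses tells read.csv() to skip that column entirely;
# NA lets R determine the type of the column as usual.
needed <- read.csv(path, colClasses = c("NULL", NA, "NULL"))
unlink(path)
```

Here only the age column is parsed and stored; the other two columns are never turned into R objects.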

9 Avoid loops when transforming data

Since loops in R can be very slow compared with R’s core vector operations, which are typically written in C, C++ or Fortran, they should be avoided wherever a vectorized alternative exists.
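
A small sketch of the same transformation written both ways; the transformation itself is arbitrary and chosen only for the example.

```r
x <- as.numeric(1:100000)

# Loop version: one interpreted iteration per element.
loop_result <- numeric(length(x))
for (i in seq_along(x)) {
  loop_result[i] <- x[i] * 2 + 1
}

# Vectorized version: the arithmetic runs in R's compiled code.
vector_result <- x * 2 + 1
```

Both versions give identical results; wrapping each in system.time() shows the difference in run time on a given machine.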

10 Use C, C++, or Fortran for critical functions

Since R integrates easily with other languages, including C, C++, and Fortran, one can pass R data objects to code written in those languages, do some computations there, and return the results as R data objects. This makes it practical to write performance-critical functions in a compiled language. The CRAN package Rcpp, for example, makes it easy to call C and C++ code from R.

11 Process data transformations in batches

To avoid the overhead of making multiple passes over large data sets, write chunking algorithms that apply all of the transformations to each chunk. RevoScaleR’s rxDataStep() function is designed for one-pass processing by permitting multiple data transformations to be performed on each chunk.
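
The one-pass idea can be sketched in base R, without rxDataStep(), roughly as follows. The in-memory data frame is a stand-in for chunks read from disk, and transform_chunk() is a hypothetical helper.

```r
data <- data.frame(price = c(10, 20, 30, 40, 50, 60),
                   qty = c(1, 2, 3, 4, 5, 6))

# Row indices of three chunks with two rows each.
chunk_rows <- split(seq_len(nrow(data)), ceiling(seq_len(nrow(data)) / 2))

# Hypothetical transformation step: all derived columns are computed
# together, so every chunk is visited only once.
transform_chunk <- function(chunk) {
  chunk$revenue <- chunk$price * chunk$qty
  chunk$log_price <- log(chunk$price)
  chunk$bulk <- chunk$qty >= 4
  chunk
}

result <- do.call(rbind, lapply(chunk_rows, function(rows) {
  transform_chunk(data[rows, ])
}))
```

Applying the three transformations in separate passes would read every chunk three times; bundling them in one function reads each chunk once.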

12 Use row-oriented data transformations where possible

When writing chunking algorithms, try to avoid algorithms that cross chunk boundaries. In general, data transformations for a single row of data should not be dependent on values in other rows. The key idea is that a transformation expression should give the same result even if only some of the rows of data are in memory at one time. Data manipulations requiring lags can be done but require special handling.
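
As an illustration of the special handling that lags require, the sketch below computes first differences chunk by chunk by carrying the last value of each chunk over to the next one. The data and the chunk size are assumptions made for the example.

```r
x <- c(5, 7, 6, 9, 12, 11, 15)
chunks <- split(x, ceiling(seq_along(x) / 3))   # chunks of 3, 3 and 1 values

carry <- NA_real_   # last value of the previous chunk (none before chunk 1)
diff_chunked <- numeric(0)
for (chunk in chunks) {
  lagged <- c(carry, chunk[-length(chunk)])   # lag of one row within the chunk
  diff_chunked <- c(diff_chunked, chunk - lagged)
  carry <- chunk[length(chunk)]               # hand over to the next chunk
}

# Reference computed with all rows in memory at once.
diff_full <- c(NA, diff(x))
```

Without the carried value, the first row of every chunk would have a missing lag, and the result would depend on where the chunk boundaries happen to fall.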

13 Handle categorical variables efficiently and with care

Working with categorical or factor variables in big data sets can be challenging. For example, using R’s factor() function in a transformation on a chunk of data without explicitly specifying all of the levels that are present in the entire data set can produce incompatible factor levels from chunk to chunk. Also, building models with factors that have hundreds of levels may cause hundreds of dummy variables to be created, which really eats up memory. The functions in the RevoScaleR package that deal with factors minimize memory use and generally do not explicitly create dummy variables to represent factors.
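
The level mismatch is easy to reproduce in base R. The two chunks and the full level set below are made up for the illustration, and the sketch assumes the full set of levels is known in advance.

```r
chunk1 <- c("red", "green", "red")
chunk2 <- c("blue", "red")

# Without explicit levels, each chunk gets its own level set and
# therefore its own integer codes.
f1_bad <- factor(chunk1)   # levels: green, red
f2_bad <- factor(chunk2)   # levels: blue, red

# With the full level set, the codes are consistent across chunks.
all_levels <- c("blue", "green", "red")
f1 <- factor(chunk1, levels = all_levels)
f2 <- factor(chunk2, levels = all_levels)

# Per-chunk counts can now be added position by position.
total_counts <- table(f1) + table(f2)
```

In the first version "red" is coded as 2 in both chunks only by coincidence, while "green" and "blue" both get code 1; with explicit levels every category keeps the same code everywhere.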

14 Be aware of output with the same number of rows as your input

When output has the same number of rows as the data, for example, when computing predictions and residuals, the output should be written out to a file rather than kept in memory.
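
A minimal sketch of writing predictions and residuals to a file chunk by chunk instead of holding them in memory. The simulated data, the model and the chunk size are assumptions; for simplicity the model is fitted in memory here, and only the output step is the point of the example.

```r
set.seed(7)
data <- data.frame(x = 1:1000)
data$y <- 3 + 2 * data$x + rnorm(1000)
fit <- lm(y ~ x, data = data)

out_path <- tempfile(fileext = ".csv")
chunk_size <- 250
starts <- seq(1, nrow(data), by = chunk_size)

for (start in starts) {
  rows <- start:min(start + chunk_size - 1, nrow(data))
  chunk <- data[rows, ]
  out <- data.frame(predicted = predict(fit, newdata = chunk))
  out$residual <- chunk$y - out$predicted
  # Append each chunk to the file; write the header only once.
  write.table(out, out_path, sep = ",", row.names = FALSE,
              col.names = (start == 1), append = (start != 1))
}

rows_written <- length(readLines(out_path)) - 1   # minus the header line
unlink(out_path)
```

At no point are more than chunk_size rows of output held in memory, however many rows the input has.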

15 Think twice before sorting

Since sorting is a time-intensive operation, we should use implementations of algorithms that avoid it. In R, the RevoScaleR function rxDTree() avoids sorting by working with histograms of the data rather than with the raw data itself.

Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
