#NoEstimates

The main difficulty with forecasting the future is that it hasn't yet happened. – James Burke

When I first heard about #NoEstimates, I thought it was not only provocative, it could also be damaging. The idea of working without estimates seems preposterous to many people. It did to me. I mean, how can you plan anything without estimates?

How we use estimates

When I started my career as a software developer, there was a running joke in the company: for each level of management, multiply an estimate by 1.3. If a developer said a task would take 3 months, the team leader would "refine" the estimate to 5 months, and at the program level it grew even more. A joke, but not remote from what I did later as a manager.

Let's first get this out of the way: modifying an estimate is downright disrespectful. As a manager, it says I think I know better than the people who are actually doing the work. Another thing that happens a lot is that estimates turn into commitments in the eyes of management, which the team now needs to meet. This post is not about these abusive things, although they do exist.

Why did the estimation process work like that? A couple of assumptions:

- Work is comprised of learning and development, and these are not linear or sequential. Estimating the work is complex.
- Developers are an optimistic bunch. They don't think about things going wrong. They under-promise, so they can over-deliver.
- They cannot foresee all surprises, and we have more experience, so we'll introduce some buffers. When were they right on the estimates last time?

The results were task lengths in the project plan. So now we "know" the estimate is 6 months, instead of the original 3-month estimation. Of course, we don't know, and we're aware of that. After all, we know that plans change over time. The difference is that now we have more confidence in the estimate, so we can plan ahead with dependent work.
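The compounding effect of that running joke is easy to check with a quick calculation (the 1.3 factor per management level is, of course, the joke's own assumption, not a real estimation rule):

```java
// Toy illustration of the running joke: each management level multiplies
// the estimate by 1.3, so buffers compound rather than add.
public class EstimateBuffer {

    // Apply the 1.3 "safety factor" once per management level
    public static double buffered(double developerEstimate, int levels) {
        double result = developerEstimate;
        for (int i = 0; i < levels; i++) {
            result *= 1.3;
        }
        return result;
    }

    public static void main(String[] args) {
        // A 3-month developer estimate after two levels of management:
        // 3 * 1.3 * 1.3 = 5.07 months, roughly the "5 months" of the joke
        System.out.println(buffered(3.0, 2));
    }
}
```

Two levels turn 3 months into about 5; a third level would push it past 6.5.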
Why we need estimates

The idea of estimates is to provide enough confidence in the organization to make decisions about the future. To answer questions like:

- Do we have enough capacity to take on more work after the project?
- Should we do this project at all?
- When should marketing and sales be ready for the launch?
- What should the people do until then?

These are very good business questions. The problem is our track record: we're horrible estimators (I point you to the last bullet in the previous list). We don't know much about the future. The whole process of massaging estimates so we can feel better about them seems like relying on a set of crystal balls. And we use these balls to make business decisions. There should be a better way.

So what are the alternatives? That is the interesting question. Once we understand that estimates are just one way of making business decisions, and a crappy one at that, we can have an open discussion. The alternative can be cost-of-delay. It can be empirical evidence to forecast against. It can be limited safe-to-fail experiments. It can be any combination or modification of these things, and it can be things we haven't discovered yet.

#NoEstimates is not really about estimates. It's about making confident, rational, trustworthy decisions. I know what results estimates give. Let's seek out better ones.

For more information about #NoEstimates, you can read more on the blogs of Woody Zuill, Neil Killick and Vasco Duarte.

Reference: #NoEstimates from our JCG partner Gil Zilberfeld at the Geek Out of Water blog.

The Simple Story Paradox

I've recently been following the #isTDDDead debate between Kent Beck (@kentbeck), David Heinemeier Hansson (@dhh), and Martin Fowler (@martinfowler) with some interest. I think that it's particularly beneficial that ideas which are often taken for granted can be challenged in a constructive manner. That way you can figure out whether they stand up to scrutiny or fall flat on their faces.

The discussion began with @dhh making the following points on TDD and test technique, which I hope I've got right. Firstly, the strict definition of TDD includes the following:

- TDD is used to drive unit tests
- You can't have collaborators
- You can't touch the database
- You can't touch the file system
- Fast unit tests, complete in the blink of an eye.

He went on to say that you therefore drive your system's architecture from the use of mocks, and in that way the architecture suffers damage from the drive to isolate and mock everything, whilst the mandatory enforcement of the 'red, green, refactor' cycle is too prescriptive. He also stated that a lot of people mistakenly believe that you can't have confidence in your code and can't deliver incremental functionality with tests unless you go through this mandated, well-paved road of TDD. @Kent_Beck said that TDD didn't necessarily include heavy mocking, and the discussion continued…

I've paraphrased a little here; however, it was the difference in the interpretation and experience of using TDD that got me thinking. Was it really a problem with TDD, or was it with @dhh's experience of other developers' interpretation of TDD? I don't want to put words into @dhh's mouth, but it seems like the problem is the dogmatic application of the TDD technique even when it isn't applicable. I came away with the impression that, in certain development houses, TDD had degenerated into little more than Cargo Cult Programming.
The term Cargo Cult Programming seems to derive from a paper written by someone whom I found truly inspirational, the late Professor Richard Feynman. He presented a paper entitled Cargo Cult Science – Some remarks on science, pseudoscience and learning how not to fool yourself as part of Caltech's 1974 commencement address. This later became part of his autobiography: Surely You're Joking, Mr. Feynman!, a book that I implore you to read. In it, Feynman highlights experiments from several pseudosciences, such as educational science, psychology and parapsychology, where the scientific approach of keeping an open mind, questioning everything and looking for flaws in your theory has been replaced by belief, ritualism and faith: a willingness to take other people's results for granted in lieu of an experimental control. Taken from the 1974 paper, Feynman sums up Cargo Cult Science as:

"In the South Seas there is a cargo cult of people. During the war they saw airplanes land with lots of good materials, and they want the same thing to happen now. So they've arranged to imitate things like runways, to put fires along the sides of the runways, to make a wooden hut for a man to sit in, with two wooden pieces on his head like headphones and bars of bamboo sticking out like antennas–he's the controller–and they wait for the airplanes to land. They're doing everything right. The form is perfect. It looks exactly the way it looked before. But it doesn't work. No airplanes land. So I call these things cargo cult science, because they follow all the apparent precepts and forms of scientific investigation, but they're missing something essential, because the planes don't land."

You can apply this idea to programming, where you'll find teams and individuals carrying out ritualised procedures and using techniques without really understanding the theory behind them, in the hope that they'll work and because they are 'the right thing to do'.
In the second talk in the series, @dhh came up with an example of what he called "test induced design damage", and at this I got excited because it's something I've seen a number of times. The only reservation I had about the gist code was that to me it didn't seem to result from TDD; that argument seems a little limited. I'd say that it was more a result of Cargo Cult Programming, because in the instances where I've come across this example TDD wasn't used. If you've seen the Gist, you may know what I'm talking about; however, that code is in Ruby, which is something I've little experience of. In order to explore this in more detail, I thought that I'd create a Spring MVC version and go from there.

The scenario here is one where we have a very simple story: all the code does is read an object from the database and place it into the model for display. There's no additional processing, no business logic and no calculations to perform. The agile story would go something like this:

Title: View User Details
As an admin user
I want to click on a link
So that I can verify a user's details

In this 'proper' N-tier sample, I have a User model object, a controller, a service layer and a DAO, together with their interfaces and tests. And there's the paradox: you set out to write the best code you possibly can to implement the story, using the well-known and probably most popular MVC 'N' layer pattern, and end up with something that's total overkill for such a simple scenario. Something, as @dhh would say, is damaged. In my sample code, I'm using the JdbcTemplate class to retrieve a user's details from a MySQL database, but any DB access API will do.
This is the sample code demonstrating the conventional, 'right' way of implementing the story; prepare to do a lot of scrolling…

```java
public class User {

  public static User NULL_USER = new User(-1, "Not Available", "", new Date());

  private final long id;
  private final String name;
  private final String email;
  private final Date createDate;

  public User(long id, String name, String email, Date createDate) {
    this.id = id;
    this.name = name;
    this.email = email;
    this.createDate = createDate;
  }

  public long getId() {
    return id;
  }

  public String getName() {
    return name;
  }

  public String getEmail() {
    return email;
  }

  public Date getCreateDate() {
    return createDate;
  }
}

@Controller
public class UserController {

  @Autowired
  private UserService userService;

  @RequestMapping("/find1")
  public String findUser(@RequestParam("user") String name, Model model) {
    User user = userService.findUser(name);
    model.addAttribute("user", user);
    return "user";
  }
}

public interface UserService {
  public abstract User findUser(String name);
}

@Service
public class UserServiceImpl implements UserService {

  @Autowired
  private UserDao userDao;

  /**
   * @see com.captaindebug.cargocult.ntier.UserService#findUser(java.lang.String)
   */
  @Override
  public User findUser(String name) {
    return userDao.findUser(name);
  }
}

public interface UserDao {
  public abstract User findUser(String name);
}

@Repository
public class UserDaoImpl implements UserDao {

  private static final String FIND_USER_BY_NAME = "SELECT id, name, email, createdDate FROM Users WHERE name=?";

  @Autowired
  private JdbcTemplate jdbcTemplate;

  /**
   * @see com.captaindebug.cargocult.ntier.UserDao#findUser(java.lang.String)
   */
  @Override
  public User findUser(String name) {
    User user;
    try {
      FindUserMapper rowMapper = new FindUserMapper();
      user = jdbcTemplate.queryForObject(FIND_USER_BY_NAME, rowMapper, name);
    } catch (EmptyResultDataAccessException e) {
      user = User.NULL_USER;
    }
    return user;
  }
}
```

If you take a look at this code, paradoxically it looks fine; in fact it looks like a classic textbook example of how to write an 'N' tier MVC application. The controller passes responsibility for sorting out the business rules to the service layer, and the service layer retrieves data from the DB using a data access object, which in turn uses a RowMapper<> helper class to retrieve a User object. When the controller has a User object, it injects it into the model ready for display. This pattern is clear and extensible; we're isolating the database from the service and the service from the controller by using interfaces, and we're testing everything using both JUnit with Mockito, and integration tests. This should be the last word in textbook MVC coding, or is it? Let's look at the code.

Firstly, there's the unnecessary use of interfaces. Some would argue that it's easy to switch database implementations, but who ever does that?[1] Plus, modern mocking tools can create their proxies using class definitions, so unless your design specifically requires multiple implementations of the same interface, using interfaces is pointless. Next, there is the UserServiceImpl, which is a classic example of the lazy class anti-pattern, because it does nothing except pointlessly delegate to the data access object. Likewise, the controller is also pretty lazy, as it delegates to the lazy UserServiceImpl before adding the resulting User class to the model: in fact, all these classes are examples of the lazy class anti-pattern. Having written some lazy classes, they are now needlessly tested to death, even the non-event UserServiceImpl class. It's only worth writing tests for classes that actually perform some logic.
```java
public class UserControllerTest {

  private static final String NAME = "Woody Allen";

  private UserController instance;

  @Mock
  private Model model;
  @Mock
  private UserService userService;

  @Before
  public void setUp() throws Exception {
    MockitoAnnotations.initMocks(this);
    instance = new UserController();
    ReflectionTestUtils.setField(instance, "userService", userService);
  }

  @Test
  public void testFindUser_valid_user() {
    User expected = new User(0L, NAME, "aaa@bbb.com", new Date());
    when(userService.findUser(NAME)).thenReturn(expected);
    String result = instance.findUser(NAME, model);
    assertEquals("user", result);
    verify(model).addAttribute("user", expected);
  }

  @Test
  public void testFindUser_null_user() {
    when(userService.findUser(null)).thenReturn(User.NULL_USER);
    String result = instance.findUser(null, model);
    assertEquals("user", result);
    verify(model).addAttribute("user", User.NULL_USER);
  }

  @Test
  public void testFindUser_empty_user() {
    when(userService.findUser("")).thenReturn(User.NULL_USER);
    String result = instance.findUser("", model);
    assertEquals("user", result);
    verify(model).addAttribute("user", User.NULL_USER);
  }
}

public class UserServiceTest {

  private static final String NAME = "Annie Hall";

  private UserService instance;

  @Mock
  private UserDao userDao;

  @Before
  public void setUp() throws Exception {
    MockitoAnnotations.initMocks(this);
    instance = new UserServiceImpl();
    ReflectionTestUtils.setField(instance, "userDao", userDao);
  }

  @Test
  public void testFindUser_valid_user() {
    User expected = new User(0L, NAME, "aaa@bbb.com", new Date());
    when(userDao.findUser(NAME)).thenReturn(expected);
    User result = instance.findUser(NAME);
    assertEquals(expected, result);
  }

  @Test
  public void testFindUser_null_user() {
    when(userDao.findUser(null)).thenReturn(User.NULL_USER);
    User result = instance.findUser(null);
    assertEquals(User.NULL_USER, result);
  }

  @Test
  public void testFindUser_empty_user() {
    when(userDao.findUser("")).thenReturn(User.NULL_USER);
    User result = instance.findUser("");
    assertEquals(User.NULL_USER, result);
  }
}

public class UserDaoTest {

  private static final String NAME = "Woody Allen";

  private UserDao instance;

  @Mock
  private JdbcTemplate jdbcTemplate;

  @Before
  public void setUp() throws Exception {
    MockitoAnnotations.initMocks(this);
    instance = new UserDaoImpl();
    ReflectionTestUtils.setField(instance, "jdbcTemplate", jdbcTemplate);
  }

  @SuppressWarnings({ "unchecked", "rawtypes" })
  @Test
  public void testFindUser_valid_user() {
    User expected = new User(0L, NAME, "aaa@bbb.com", new Date());
    when(jdbcTemplate.queryForObject(anyString(), (RowMapper) anyObject(), eq(NAME))).thenReturn(expected);
    User result = instance.findUser(NAME);
    assertEquals(expected, result);
  }

  @SuppressWarnings({ "unchecked", "rawtypes" })
  @Test
  public void testFindUser_null_user() {
    when(jdbcTemplate.queryForObject(anyString(), (RowMapper) anyObject(), isNull())).thenReturn(User.NULL_USER);
    User result = instance.findUser(null);
    assertEquals(User.NULL_USER, result);
  }

  @SuppressWarnings({ "unchecked", "rawtypes" })
  @Test
  public void testFindUser_empty_user() {
    when(jdbcTemplate.queryForObject(anyString(), (RowMapper) anyObject(), eq(""))).thenReturn(User.NULL_USER);
    User result = instance.findUser("");
    assertEquals(User.NULL_USER, result);
  }
}

@RunWith(SpringJUnit4ClassRunner.class)
@WebAppConfiguration
@ContextConfiguration({ "file:src/main/webapp/WEB-INF/spring/appServlet/servlet-context.xml",
    "file:src/test/resources/test-datasource.xml" })
public class UserControllerIntTest {

  @Autowired
  private WebApplicationContext wac;

  private MockMvc mockMvc;

  @Before
  public void setUp() throws Exception {
    mockMvc = MockMvcBuilders.webAppContextSetup(wac).build();
  }

  @Test
  public void testFindUser_happy_flow() throws Exception {
    ResultActions resultActions = mockMvc.perform(get("/find1").accept(MediaType.ALL).param("user", "Tom"));
    resultActions.andExpect(status().isOk());
    resultActions.andExpect(view().name("user"));
    resultActions.andExpect(model().attributeExists("user"));
    resultActions.andDo(print());
    MvcResult result = resultActions.andReturn();
    ModelAndView modelAndView = result.getModelAndView();
    Map<String, Object> model = modelAndView.getModel();
    User user = (User) model.get("user");
    assertEquals("Tom", user.getName());
    assertEquals("tom@gmail.com", user.getEmail());
  }
}
```

In writing this sample code, I've added everything I could think of into the mix. You may think that this example is 'over the top' in its construction, especially with the inclusion of redundant interfaces, but I have seen code like this. The benefits of this pattern are that it follows a distinct design understood by most developers; it's clean and extensible. The downside is that there are lots of classes. More classes take more time to write and, if you ever have to maintain or enhance this code, they're more difficult to get to grips with.

So, what's the solution? That's difficult to answer. In the #isTDDDead debate, @dhh gives the solution as placing all the code in one class, mixing the data access with the population of the model. If you implement this solution for our user story you still get a User class, but the number of classes you need shrinks dramatically.
```java
@Controller
public class UserAccessor {

  private static final String FIND_USER_BY_NAME = "SELECT id, name, email, createdDate FROM Users WHERE name=?";

  @Autowired
  private JdbcTemplate jdbcTemplate;

  @RequestMapping("/find2")
  public String findUser2(@RequestParam("user") String name, Model model) {
    User user;
    try {
      FindUserMapper rowMapper = new FindUserMapper();
      user = jdbcTemplate.queryForObject(FIND_USER_BY_NAME, rowMapper, name);
    } catch (EmptyResultDataAccessException e) {
      user = User.NULL_USER;
    }
    model.addAttribute("user", user);
    return "user";
  }

  private class FindUserMapper implements RowMapper<User>, Serializable {

    private static final long serialVersionUID = 1L;

    @Override
    public User mapRow(ResultSet rs, int rowNum) throws SQLException {
      User user = new User(rs.getLong("id"), //
          rs.getString("name"), //
          rs.getString("email"), //
          rs.getDate("createdDate"));
      return user;
    }
  }
}

@RunWith(SpringJUnit4ClassRunner.class)
@WebAppConfiguration
@ContextConfiguration({ "file:src/main/webapp/WEB-INF/spring/appServlet/servlet-context.xml",
    "file:src/test/resources/test-datasource.xml" })
public class UserAccessorIntTest {

  @Autowired
  private WebApplicationContext wac;

  private MockMvc mockMvc;

  @Before
  public void setUp() throws Exception {
    mockMvc = MockMvcBuilders.webAppContextSetup(wac).build();
  }

  @Test
  public void testFindUser_happy_flow() throws Exception {
    ResultActions resultActions = mockMvc.perform(get("/find2").accept(MediaType.ALL).param("user", "Tom"));
    resultActions.andExpect(status().isOk());
    resultActions.andExpect(view().name("user"));
    resultActions.andExpect(model().attributeExists("user"));
    resultActions.andDo(print());
    MvcResult result = resultActions.andReturn();
    ModelAndView modelAndView = result.getModelAndView();
    Map<String, Object> model = modelAndView.getModel();
    User user = (User) model.get("user");
    assertEquals("Tom", user.getName());
    assertEquals("tom@gmail.com", user.getEmail());
  }

  @Test
  public void testFindUser_empty_user() throws Exception {
    ResultActions resultActions = mockMvc.perform(get("/find2").accept(MediaType.ALL).param("user", ""));
    resultActions.andExpect(status().isOk());
    resultActions.andExpect(view().name("user"));
    resultActions.andExpect(model().attributeExists("user"));
    resultActions.andExpect(model().attribute("user", User.NULL_USER));
    resultActions.andDo(print());
  }
}
```

The solution above cuts the number of first-level classes to two: an implementation class and a test class. All test scenarios are catered for in a very few end-to-end integration tests. These tests will access the database, but is that so bad in this case? If each trip to the DB takes around 20ms or less, then they'll still complete within a fraction of a second; that should be fast enough.

In terms of enhancing or maintaining this code, one small single class is easier to learn than several even smaller classes. If you did have to add in a whole bunch of business rules or other complexity, then changing this code into the 'N' layer pattern would not be difficult; however, the problem is that if/when a change is necessary it may be given to an inexperienced developer who won't be confident enough to carry out the necessary refactoring. The upshot is, and you must have seen this lots of times, that the new change could be shoehorned on top of this one-class solution, leading to a mess of spaghetti code. In implementing a solution like this, you may not be very popular, because the code is unconventional. That's one of the reasons I think this single-class solution is something that a lot of people would see as contentious.
It's this idea of a standard 'right way' and 'wrong way' of writing code, rigorously applied in every case, that has led to this perfectly good design becoming a problem. I guess it's all a matter of horses for courses: choosing the right design for the right situation. If I were implementing a complex story, I wouldn't hesitate to split up the various responsibilities, but in the simple case it's just not worth it. I'll therefore end by asking: if anyone has a better solution for the Simple Story Paradox shown above, please let me know.

[1] I've worked on one project in umpteen years of programming where the underlying database was changed to meet a customer requirement. That was many years and many thousands of miles away, and the code was written in C++ and Visual Basic.

The code for this blog is available on GitHub at https://github.com/roghughe/captaindebug/tree/master/cargo-cult

Reference: The Simple Story Paradox from our JCG partner Roger Hughes at the Captain Debug's Blog.

gonsole weeks: content assist for git commands

While Eclipse ships with a comprehensive Git tool, it seems that for certain tasks many developers switch to the command line. This gave Frank and me the idea to start an open source project to provide a git console integration for the IDE. What has happened so far during the gonsole weeks can be read in git init gonsole and eclipse egit integration.

In recent days we have been working on content assist and made the appearance more colorful. The color settings aren't yet configurable, so you'll have to live with what we found appropriate! Furthermore, we have basic content assist in place. In its current state it helps you with finding the right git commands. If you type 's' followed by Ctrl+Space, for example, it will show you that there is a show, a show-ref, and a status command.

While the feature itself might not look very impressive, it provides the basis for more content assists. Showing the documentation for a selected command isn't far away. The same goes for completion proposals for command arguments like branches and repositories.

The content assist code was written backed only by end-to-end tests, which turned out to be a quite effective way to work exploratively. Now we will re-construct the functionality test-driven, while the end-to-end tests ensure that we do not break the overall features. In the meanwhile you may want to try the software yourself and install it from this update site: http://rherrmann.github.io/gonsole/repository/

That's it for now, let's get back to TDD. And maybe next time we can show you more content assist features…

Reference: gonsole weeks: content assist for git commands from our JCG partner Rudiger Herrmann at the Code Affine blog.
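At its core, the command-name proposal part of such a content assist is just prefix matching over the known git command names. A minimal sketch of that idea (this is not the actual gonsole implementation; the command list and method names are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Minimal sketch of command-name content assist: given the text typed so
// far, propose every git command that starts with that prefix.
public class CommandProposals {

    // A small illustrative subset of git's command names
    private static final List<String> GIT_COMMANDS = Arrays.asList(
        "add", "branch", "checkout", "commit", "show", "show-ref", "status");

    public static List<String> propose(String prefix) {
        List<String> proposals = new ArrayList<>();
        for (String command : GIT_COMMANDS) {
            if (command.startsWith(prefix)) {
                proposals.add(command);
            }
        }
        return proposals;
    }

    public static void main(String[] args) {
        // Typing 's' followed by Ctrl+Space would offer these three commands
        System.out.println(propose("s")); // [show, show-ref, status]
    }
}
```

The interesting work in a real IDE is wiring such a matcher into the editor's proposal popup and keeping the command list in sync with the installed git version.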

Spring Professional Study Notes

Before we get down to my own additions to the existing resources mentioned in my certification guide post, I would like to reiterate all the resources that I used for my preparation:

- Course-ware
- Spring in Action, Third Edition by Craig Walls
- Jeanne's study notes
- Spring Framework Reference Documentation

Bean life-cycle

Bean initialization process

One of the key areas of the certification is the life-cycle management provided by Spring. By the time of the exam you should know this diagram by heart. This picture describes the process of loading beans into the application context. The starting point is the definition of beans in a beans XML file (but this also works for programmatic configuration). When the context gets initialized, all configuration files and classes are loaded into the application. Since all these sources contain different representations of beans, there needs to be a merging step that unifies bean definitions into one internal format. After initialization of the whole configuration, it is checked for errors and invalid configuration. When the configuration is validated, a dependency tree is built and indexed.

As a next step, Spring applies BeanFactoryPostProcessors (BFPPs from now on). A BFPP allows for custom modification of an application context's bean definitions. Application contexts can auto-detect BFPP beans in their bean definitions and apply them before any other beans get created. Classic examples of BFPPs include PropertyResourceConfigurer and PropertyPlaceholderConfigurer. When this phase is over and Spring owns all the final bean definitions, the process leaves the 'happens once' part and enters the 'happens for each bean' part.

Please note that even though I used graphical elements usually depicting UML elements, this picture is not UML compliant.

When in the second phase, the first thing performed is the evaluation of all SpEL expressions.
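A classic BFPP job from the 'happens once' phase described above is placeholder resolution: rewriting property values inside the bean definitions before any bean instance exists. The following is a self-contained simulation of that idea, not real Spring code; a plain map of property values stands in for the BeanFactory that a real PropertyPlaceholderConfigurer would receive:

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained simulation of what a PropertyPlaceholderConfigurer-style
// BeanFactoryPostProcessor does: it edits bean *definitions* (here, a plain
// map of property values) before any bean instance is created.
public class PlaceholderDemo {

    // Replace ${key} tokens in every bean-definition property value
    public static Map<String, String> postProcess(Map<String, String> beanDefinition,
                                                  Map<String, String> properties) {
        Map<String, String> resolved = new HashMap<>();
        for (Map.Entry<String, String> entry : beanDefinition.entrySet()) {
            String value = entry.getValue();
            for (Map.Entry<String, String> prop : properties.entrySet()) {
                value = value.replace("${" + prop.getKey() + "}", prop.getValue());
            }
            resolved.put(entry.getKey(), value);
        }
        return resolved;
    }

    public static String demo() {
        Map<String, String> dataSourceDefinition = new HashMap<>();
        dataSourceDefinition.put("url", "${db.url}");      // definition as written in XML

        Map<String, String> properties = new HashMap<>();
        properties.put("db.url", "jdbc:mysql://localhost/users");

        // After post-processing, the definition holds the concrete value
        return postProcess(dataSourceDefinition, properties).get("url");
    }

    public static void main(String[] args) {
        System.out.println(demo()); // jdbc:mysql://localhost/users
    }
}
```

The key point the simulation makes is timing: the rewrite happens on definitions, before instantiation, which is why BFPPs run before any other beans get created.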
The Spring Expression Language (SpEL for short) is an expression language that supports querying and manipulating an object graph at runtime. Now that all SpEL expressions are evaluated, Spring performs dependency injection (constructor and setter). As a next step, Spring applies BeanPostProcessors (BPPs from now on). A BPP is a factory hook that allows for custom modification of new bean instances. Application contexts can auto-detect BPP beans in their bean definitions and apply them to any beans subsequently created. Calling postProcessBeforeInitialization and postProcessAfterInitialization provides bean life-cycle hooks from outside the bean definition, with no regard to bean type.

To specify bean life-cycle hooks that are bean specific, you can choose from three possible ways to do so (ordered with respect to execution priority):

- @PostConstruct: annotating a method with the JSR-250 @PostConstruct annotation
- afterPropertiesSet: implementing the InitializingBean interface and providing an implementation for the afterPropertiesSet method
- init-method: defining the init-method attribute of the bean element in the XML configuration file

When a bean goes through this process, it gets stored in a store, which is basically a map with the bean ID as key and the bean object as value. When all beans have been processed, the context is initialized and we can call getBean on the context and retrieve bean instances.

Bean destruction process

The whole process begins when the ApplicationContext is closed (whether by calling the close method explicitly or from the container where the application is running). When this happens, all beans and the context itself are destroyed.
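Before moving on to destruction, the execution order of the three initialization hooks listed above is worth fixing in memory. This is a plain-Java simulation of that ordering, not real Spring; the toy "container" simply invokes the hooks in the documented sequence:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified simulation (not real Spring) of the order in which the three
// bean-specific initialization hooks fire: @PostConstruct first, then
// InitializingBean#afterPropertiesSet, then the XML init-method.
public class InitOrderDemo {

    public static final List<String> CALLS = new ArrayList<>();

    static class TrackedBean {
        // In real Spring this method would carry the JSR-250 @PostConstruct annotation
        void postConstruct()      { CALLS.add("@PostConstruct"); }
        // Equivalent of implementing InitializingBean#afterPropertiesSet
        void afterPropertiesSet() { CALLS.add("afterPropertiesSet"); }
        // Equivalent of init-method="customInit" in the XML bean definition
        void customInit()         { CALLS.add("init-method"); }
    }

    // A toy "container" applying the hooks in Spring's documented order
    public static TrackedBean createBean() {
        TrackedBean bean = new TrackedBean(); // instantiation + DI would happen here
        bean.postConstruct();
        bean.afterPropertiesSet();
        bean.customInit();
        return bean;
    }

    public static void main(String[] args) {
        createBean();
        System.out.println(CALLS); // [@PostConstruct, afterPropertiesSet, init-method]
    }
}
```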
Just like with bean initialization, Spring provides life-cycle hooks for bean destruction, with three possible ways to do so (ordered with respect to execution priority):

- @PreDestroy: annotating a method with the JSR-250 @PreDestroy annotation
- destroy: implementing the DisposableBean interface and providing an implementation for the destroy method
- destroy-method: defining the destroy-method attribute of the bean element in the XML configuration file

However, there is one tricky part to this. When you are dealing with a prototype bean, an interesting behavior emerges upon context closing. After a prototype bean is fully initialized and all initializing life-cycle hooks are executed, the container hands over the reference and keeps no reference to the prototype bean from then on. This means that no destruction life-cycle hooks will be executed.

Request processing in Spring MVC

When it comes to Spring MVC, it is important to be familiar with the basic principles of how Spring turns requests into responses. It all begins with the ContextLoaderListener that ties the life-cycle of the ApplicationContext to the life-cycle of the ServletContext. Then there is a DelegatingFilterProxy, which is a proxy for a standard servlet filter delegating to a Spring-managed bean implementing the javax.servlet.Filter interface. What DelegatingFilterProxy does is delegate the Filter's methods through to a bean obtained from the Spring application context. This enables the bean to benefit from the Spring web application context life-cycle support and configuration flexibility. The bean must implement javax.servlet.Filter and it must have the same name as that in the filter-name element. DelegatingFilterProxy delegates all mapped requests to a central servlet that dispatches requests to controllers and offers other functionality that facilitates the development of web applications. Spring's DispatcherServlet, however, does more than just that. It is completely integrated with the Spring IoC container.
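Before diving into the MVC details, the prototype-destruction caveat above is worth seeing concretely. This self-contained toy (not real Spring) tracks singletons only, so closing the context runs destruction callbacks for the singleton but never for the prototype it handed out:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified simulation (not real Spring) of context shutdown: the container
// only tracks singletons, so closing it runs destruction callbacks for the
// singleton but never for the prototype it handed out earlier.
public class DestroyDemo {

    public static final List<String> DESTROYED = new ArrayList<>();

    static class Bean {
        final String name;
        Bean(String name) { this.name = name; }
        // Stands in for @PreDestroy / DisposableBean#destroy / destroy-method
        void destroy() { DESTROYED.add(name); }
    }

    static class ToyContext {
        private final List<Bean> singletons = new ArrayList<>();

        Bean getSingleton(String name) {
            Bean bean = new Bean(name);
            singletons.add(bean);   // container keeps a reference
            return bean;
        }

        Bean getPrototype(String name) {
            return new Bean(name);  // reference handed over, none retained
        }

        void close() {
            for (Bean bean : singletons) {
                bean.destroy();     // destruction hooks run for tracked beans only
            }
        }
    }

    public static void main(String[] args) {
        ToyContext context = new ToyContext();
        context.getSingleton("singletonBean");
        context.getPrototype("prototypeBean");
        context.close();
        System.out.println(DESTROYED); // [singletonBean] -- the prototype is skipped
    }
}
```

If a prototype bean holds resources that need releasing, the client code that requested it is responsible for cleaning it up.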
Each DispatcherServlet has its own WebApplicationContext, which inherits all the beans already defined in the root WebApplicationContext. WebApplicationContext is an extension of the plain old ApplicationContext that owns a few specific beans. These beans provide a handy bundle of tools I named 'common things', including support for: resolving the locale a client is using, resolving themes of your web application, mapping exceptions to views, parsing multipart requests from HTML form uploads, and a few others.

After all these things are taken care of, the DispatcherServlet needs to determine where to dispatch the incoming request. In order to do so, the DispatcherServlet turns to a HandlerMapping, which in turn maps requests to controllers. Spring's handler mapping mechanism includes handler interceptors, which are useful when you want to apply specific functionality to certain requests, for example, checking for a principal.

Please note that even though I used graphical elements usually depicting UML elements, this picture is not UML compliant.

Before the execution reaches the controller, there are certain steps that must happen, like resolving various annotations. The main purpose of a HandlerAdapter is to shield the DispatcherServlet from such details. The HandlerExecutionChain object wraps the handler (controller or method). It may also contain a set of interceptor objects of type HandlerInterceptor. Each interceptor may veto the execution of the handling request. By the time execution reaches a controller or method, the HandlerAdapter has already performed dependency injection, type conversion, validation according to JSR-303 and so on. When inside the controller, you can call bean methods just like on any standard bean in your application. When the controller finishes its logic and fills the Model with relevant data, the HandlerAdapter retrieves a string (but it could be a special object type as well) that is later resolved to a View object.
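Stripped of all the Spring machinery, the pipeline described so far boils down to: look up a handler for the request path, invoke it so it fills the model, and take away a logical view name. A purely illustrative toy version (not Spring's API):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Purely illustrative toy version of the DispatcherServlet pipeline:
// a handler mapping from request path to controller function, whose
// returned string is the logical view name resolved afterwards.
public class ToyDispatcher {

    // "HandlerMapping": request path -> controller (fills the model, returns a view name)
    private final Map<String, Function<Map<String, Object>, String>> handlerMapping = new HashMap<>();

    public void register(String path, Function<Map<String, Object>, String> controller) {
        handlerMapping.put(path, controller);
    }

    // "DispatcherServlet": look up the handler, invoke it, return the view name
    public String dispatch(String path, Map<String, Object> model) {
        Function<Map<String, Object>, String> controller = handlerMapping.get(path);
        if (controller == null) {
            return "404";
        }
        return controller.apply(model);
    }

    public static void main(String[] args) {
        ToyDispatcher dispatcher = new ToyDispatcher();
        dispatcher.register("/find1", model -> {
            model.put("user", "Tom");   // controller fills the model
            return "user";              // logical view name
        });

        Map<String, Object> model = new HashMap<>();
        System.out.println(dispatcher.dispatch("/find1", model)); // user
        System.out.println(model.get("user"));                    // Tom
    }
}
```

What the toy leaves out is precisely what the surrounding notes describe: interceptors that can veto the call, the HandlerAdapter's argument binding and validation, and the view-resolution step that turns "user" into an actual template.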
To resolve the view, Spring maps the returned string to a view. Views in Spring are addressed by a logical view name and are resolved by a view resolver. In the end, Spring inserts the model object into the view and renders the result, which is in turn returned to the client in the form of a response. Remote method invocation protocols When it comes to remote method invocation and the protocols Spring supports, it is useful to know the basic properties and limitations of said protocols.

RMI protocols information:

Protocol      Port              Serialization
RMI           1099 + 'random'   IO Serializable
Hessian       80 + 443          Binary XML (compressed)
Burlap        80 + 443          Plain XML (overhead!)
HttpInvoker   80 + 443          IO Serializable

Reference: Spring Professional Study Notes from our JCG partner Jakub Stas at the Jakub Stas blog....

5 tips to improve performance in Android applications

If your application has many time-intensive operations, here are some tricks to improve performance and provide a better experience for your users. Operations that can take a long time should run on their own thread and not on the main (UI) thread. If an operation takes too long while it runs on the main thread, the Android OS may show an Application Not Responding (ANR) dialog: from there, the user may choose to wait or to close your application. This message is not very user-friendly and your application should never have an occasion to trigger it. In particular, web service calls to an external API are especially sensitive to this and should always be on their own thread, since a network slowdown or a problem on the remote end can trigger an ANR, blocking the execution of your application. You can also take advantage of threads to pre-calculate graphics that are displayed later on the main thread. If your application requires a lot of calls to external APIs, avoid sending the calls again and again when the wifi and cellular networks are not available. It is a waste of resources to prepare the whole request, send it off and wait for a timeout when it is sure to fail. You can poll the status of the connection regularly, switch to an offline mode if no network is available, and reactivate the calls as soon as the network comes back. Take advantage of caching to reduce the impact of expensive operations. Calculations that are long but whose result won't change, or graphics that will be reused, can be kept in memory. You can also cache the results of calls to external APIs in a local database so you won't depend on that resource being available at all times. A call to a local database can be faster, will not use up your users' data plan and will work even if the device is offline. On the other hand, you should plan for a way to refresh that data from time to time, for example keeping a time and date stamp and refreshing it when it's getting old.
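The first tip can be sketched in plain Java, leaving out the Android-specific plumbing (in a real app the callback would have to be posted back to the main thread, e.g. via a Handler). The method name runLongOperation and the callback shape are invented for illustration: the slow work runs on a worker thread, and the caller stays free.

```java
import java.util.concurrent.CountDownLatch;
import java.util.function.Consumer;

public class BackgroundWork {
    // Run a slow operation off the calling thread and hand the result
    // to a callback when it completes.
    static Thread runLongOperation(Consumer<String> onResult) {
        Thread worker = new Thread(() -> {
            // Simulate a slow web-service call.
            try {
                Thread.sleep(100);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            onResult.accept("response from API");
        });
        worker.start();
        return worker;
    }

    public static void main(String[] args) throws InterruptedException {
        CountDownLatch done = new CountDownLatch(1);
        final String[] result = new String[1];
        runLongOperation(r -> { result[0] = r; done.countDown(); });
        System.out.println("main thread is free while the worker runs");
        done.await();  // only blocking here so the demo can print the result
        System.out.println(result[0]);  // response from API
    }
}
```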
Save the current state of your activities to avoid having to recalculate it when the application is opened again. The data loaded by your activities or the result of any long-running operation should be saved when the onSaveInstanceState event is raised and restored when the onRestoreInstanceState event is raised. Since the state is saved with a serializable Bundle object, the easiest way to manage state is to have a serializable state object containing all the information needed to restore the activity, so only this object needs to be saved. The information entered by the user in View controls is already saved automatically by the Android SDK and does not need to be kept in the state. Remember, the activity state may be lost when the user leaves your application or rotates the screen, not only when the user navigates to another activity. Make sure your layouts are as simple as possible, without unnecessary layout elements. When the view hierarchy gets too deep, the UI engine has trouble traversing all the views and calculating the position of all elements. For example, if you create a custom control and include it in another layout element, it can add an extra view that is not necessary to display the UI and that will slightly slow down the application. You can analyse your view hierarchy to see where your layout can be flattened with the Hierarchy Viewer tool. The tool can be opened from Eclipse using the Dump View Hierarchy for UI Automator icon in the DDMS perspective, or by launching the standalone tool hierarchyviewer in the <sdk>\tools\ directory. If you have other unexplained slowdowns in your application, you should profile it to help identify bottlenecks. In that case, you should take a look at my article about profiling Android applications.Reference: 5 tips to improve performance in Android applications from our JCG partner Cindy Potvin at the Web, Mobile and Android Programming blog....

MongoDB and Grails

So recently, I had a requirement to store unstructured JSON data that was coming back from a web service. The web service was returning various soccer teams from around the world. Amongst the data contained in most of the soccer teams was a list of the soccer players who were part of the team. Some of the teams had 12 players, some had 20, some had even more than 20. The players had their own attributes, some easy to predict, some impossible. In the entire data structure, the only attribute that I knew would definitely be coming back was the team's teamname. After that, it depended on each team.

{
  "teams": [
    {
      "teamname": "Kung fu pirates",
      "founded": 1962,
      "players": [
        {"name": "Robbie Fowler", "age": 56},
        {"name": "Larry David", "age": 55}
        ...
      ]
    },
    {
      "teamname": "Climate change observers",
      "founded": 1942,
      "players": [
        {"name": "Jim Carrey", "age": 26},
        {"name": "Carl Craig", "age": 35}
        ...
      ]
    },
    ...
  ]
}

There are several different ways to store this data. I decided to go for MongoDB. Main reasons:

I wanted to store the data in as close a format as possible to the JSON responses I was getting back from the web service. This would mean less code, less bugs, less hassle.
I wanted something that had a low learning curve, good documentation and good industry support (stackoverflow threads, blog posts etc.).
Something that had a Grails plugin that was documented, had footfall and looked like it was maintained.
Features such as text stemming were nice-to-haves. Some support would have been nice, but it didn't need to be cutting edge.
It would have to support good JSON search facilities, indexing, etc.

MongoDB ticked all the boxes. So this is how I got it all working. After I installed MongoDB as per Mongo's instructions and the MongoDB Grails plugin, it was time to write some code. Now here's the neat part: there was hardly any code. I created a domain object for the Team.
class Team implements Serializable {

    static mapWith = "mongo"

    static constraints = {
    }

    static mapping = {
        teamname index: true
    }

    String teamname
    List players

    static embedded = ['players']
}

Regarding the Team domain object:

The first point to make about the Team domain object was that I didn't even need to create it. The reason why I did use this approach was so that I could use GORM-style APIs such as Team.find() if I wanted to.
Players are just a List of objects. I didn't bother creating a Player object. I like the idea of always ensuring the players for the team are in a List data structure, but I didn't see the need to type anything further.
The players are marked as embedded. This means the team and players are stored in a single denormalised data structure. This allows – amongst other things – the ability to retrieve and manipulate the team data in a single database operation.
I marked the teamname as an index.
I marked the domain object with: static mapWith = "mongo". This means that if I was also using another persistence solution with my GORM (Postgres, MySQL, etc.) I am telling the GORM that this Team domain class is only for Mongo – keep your relational hands off it. See here for info. Note: this is a good reminder that the GORM is a higher level of abstraction than Hibernate. It is possible to have a GORM object that doesn't use Hibernate but instead goes to a NoSQL store and doesn't go near Hibernate.

You'll note that in the JSON there are team attributes such as founded that haven't been explicitly declared in the Team class. This is where Groovy and NoSQL play really well with each other. We can use some of the metaprogramming features of Groovy to dynamically add attributes to the Team domain object.
private List importTeams(int page) {
    def rs = restClient.get("teams")  // invoke web service
    List teams = rs.responseData.teams.collect { teamResponse ->
        Team team = new Team(teamname: teamResponse.teamname)
        team.save()  // Save is needed before we can dynamically add attributes
        teamResponse.each { key, value ->
            team["$key"] = value
        }
        team.save()  // We need the second save to ensure the dynamically added attributes get saved
        return team
    }
    log.info("importTeams(), teams=${teams}")
    teams
}

Ok, so the main points in our importTeams() method:

After getting our JSON response we run a collect function on the teams array. This will create the Team domain objects.
We use some metaprogramming to dynamically add any attribute that comes back in the JSON team structure to the Team object. Note: we have to invoke save() first to be able to dynamically add attributes that are not declared in the Team domain object. We also have to invoke save() again to ensure the dynamically added attributes are persisted. This may change in future versions of the MongoDB plugin, but it is what I had to do to get it working (I was using MongoDB plugin version 3.0.1).

So what's next? Write some queries. Ok, so two choices here. First, you can use the dynamic finders and criteria queries with the GORM thanks to the MongoDB plugin. But I didn't do this. Why? I wanted to write the queries as close as possible to how they are supposed to be written in Mongo. There were a number of reasons for this:

A leaky abstraction is inevitable here. Sooner or later you are going to have to write a query that the GORM won't do very well. Better to approach this head on.
I wanted to be able to run the queries in the Mongo console first, check explain plans if I needed to, and then use the same query in my code.
It is easier to do this if I write the query directly, without having to worry about what the GORM is going to do.

The general format of queries is:

teams = Team.collection.find(queryMap) // where queryMap is a map of fields and the various values you are searching for

Ok, some examples of queries...

Team.collection.find(["teamname":"hicks"]) // Find a team named hicks
Team.collection.find(["teamname":"hicks", "players.name":"Robbie Fowler"]) // As above, but the team must also have a Robbie Fowler
Team.collection.find(["players.name":"Robbie Fowler"]) // Any team that has a Robbie Fowler
Team.collection.find(["teamname":"hicks", "players.name":"Robbie Fowler"], ["players.$":1]) // As above, but returns the matching player only
Team.collection.find(["teamname":~/ick/]) // Match on the regular expression /ick/, i.e. any team whose name contains the text ick

Anything else? Yeah, sure. I wanted to connect to a Mongo instance on my own machine when in development, but to a Mongo machine on a dedicated server in other environments (CI, stage, production). To do this, I updated my DataSource.groovy as:

environments {
    development {
        grails {
            mongo {
                host = "localhost"
                port = 27017
                username = "test"
                password = "test"
                databaseName = "mydb"
            }
        }
        dataSource {
            dbCreate = "create-drop" // one of 'create', 'create-drop', 'update', 'validate', ''
            url = "jdbc:h2:mem:devDb;MVCC=TRUE;LOCK_TIMEOUT=10000"
        }
    }
    ci {
        println("In bamboo environment")
        grails {
            mongo {
                host = ""
                port = 27017
                username = "shop"
                password = "shop"
                databaseName = "tony"
            }
        }
        dataSource {
            dbCreate = "create-drop" // one of 'create', 'create-drop', 'update', 'validate', ''
            url = "jdbc:h2:mem:devDb;MVCC=TRUE;LOCK_TIMEOUT=10000"
        }
    }
}

You'll see I have configured multiple datasources (MongoDB and a relational dataSource). I am not advocating using both MongoDB and a relational database, just pointing out that it is possible.
The other point is that the MongoDB configuration always sits under the grails { mongo { ... } } block. Ok, this is a simple introductory post; I will try to post up something more sophisticated soon. Until the next time, take care of yourselves.Reference: MongoDB and Grails from our JCG partner Alex Staveley at the Dublin's Tech Blog blog....

How to increase productivity

Unlocking productivity is one of the biggest concerns for anyone in a management role. However, people rarely agree on the best approaches to improving performance. Over the years, I have observed different managers using opposite practices to get the best performance out of the team they are managing. Unfortunately, some of them work and others don't. Worse, what does not increase performance might actually reduce it. In this article, I would like to review what I have seen and learned over the years and share my personal view on the best approaches to unlocking productivity. What factors define a team's performance? Let's start by analyzing what composes a team. Obviously, a team is composed of team members, each with their own expertise, strengths and weaknesses. However, the total productivity of the team is not necessarily the sum of each individual's productivity. Other factors like teamwork, processes and environment also have a major impact on total performance, which can be positive or negative. To sum up, the three major factors discussed in this article are technical skills, working process and culture. Technical Skills In a factory, we can count total productivity as the sum of the individual productivity of each worker, but this simplicity does not apply to the IT field. The difference lies in the nature of the work. Programming is still, to this day, innovative work that cannot be automated. In the IT industry, nothing is more valuable than innovation and vision. That explains why Japan may be well known for producing high-quality cars, while the US is much more famous for producing well-known IT companies. In contrast to the factory environment, in a software team, developers do not necessarily do, or excel at, the same things. Even if they graduated from the same school and took the same job, personal preferences and self-study quickly make developers' skills diverge.
For the sake of increasing total productivity, this may be a good thing. There is no use in all members being competent at the same kinds of tasks. As it is too difficult to be good at everything, life will be much easier if the members of the team can compensate for each other's weaknesses. It is not easy to improve a team's technical skills, as it takes many years for a developer to build up his or her skill set. The fastest way to pump up the team's skill set is to recruit new talent that offers what the team lacks. That's why a popular practice in the industry is to let the team recruit a new member themselves. Because of this, a team which is slowly built over the years normally offers a more balanced skill set. While recruitment is a quick, short-term solution, the long-term solution is to keep the team up to date with the latest trends in technology. In this field, if you do not go forward, you go backward. There is no skill set that can be useful forever. One of my colleagues even emphasizes that upgrading developers' skills is beneficial to the company in the long run. Even if we do not count inflation, it is quite common for a company to offer pay raises after each annual review to retain staff. If the staff do not acquire new skills, the company is effectively paying a higher price every year for a depreciating asset. It may be a good idea for the company to use monetary prizes, tied to KPIs, to motivate self-study and upgrading. There are lots of training courses in the industry, but they are not necessarily the best method for upgrading skills. Personally, I feel most of the courses offer more branding value than real-life usage. If a developer is keen on learning, there is quite sufficient knowledge on the internet to pick up anything. Therefore, unless it is for a commercial API or product, money is better spent on monetary prizes than on training courses.
Another well-known challenge for self-study is natural human laziness. There is nothing surprising about it. However, the best way to fight laziness is to find fun in learning new things. This can only be achieved if developers treat programming as their hobby more than as their profession. Even if not, it is quite reasonable that one should re-invest effort in one's bread-and-butter tools. One of my friends even argues that if singers and musicians take responsibility for their own training, programmers should do the same. Sometimes, we may feel lost due to the huge number of technologies thrown at us every year. I feel that too. My approach to self-study is to add a delay before absorbing new concepts and ideas. I try to understand them, but do not invest too much until the new concepts and ideas are reasonably accepted by the market. Working Process The working process can contribute greatly to a team's performance, positively or negatively. Great developers write great code, but they will not be able to do so if they waste too much effort on something non-essential. Obviously, when the process is wrong, developers may feel uncomfortable in their daily life. Unhappy developers may not perform at their best. There is no clear guideline to judge whether a working process is well defined, but the people in the environment will feel it right away if something is wrong. However, it is not as easy to get it right, as the people who have the right to make decisions are not necessarily the ones who suffer from a bad process. We need an environment with effective feedback channels to improve the working process. A common pitfall for the working process is the lack of a results-oriented nature. The process is less effective if it is too reporting-oriented, attitude-oriented or based on unrealistic assumptions. To define the process, it may be good if the executive can decide whether he wants to build an innovative company or an operations-oriented company.
Some examples of the former kind are Google, Facebook and Twitter, while the latter may be GM, Ford and Toyota. It is not that an operations-oriented company cannot innovate, but its process was not built with innovation as the first priority. Therefore, the metrics for measuring performance may be slightly different, which produces different results in the long term. Not all companies in the IT field are innovative. One counter-example is the outsourcing companies and software houses in Asia. To encourage innovation, the working process needs to focus on people: minimize hassle, maximize collaboration and sharing. Through my years in the industry with Waterfall, not-so-Agile and Agile companies, I feel that Agile works quite well for the IT field. It was built on the right assumption that software development is innovative work and less predictable than other kinds of engineering. Company Culture When Steve Jobs passed away in 2011, I bought his authorized biography by Walter Isaacson. The book clearly explains how Sony failed to keep its competitive edge because of inner competition among its departments. Microsoft suffered similar problems due to the controversial stack-ranking system that enforced inner competition. I think the IT field is getting more complicated and we need more collaboration than in the past to implement new ideas. It is tough to maintain collaboration when your company grows into a multi-cultural MNC. However, it still can be done if management has the right mindset and continuously communicates its vision to the team. As above, management needs to be clear about whether they want to build an innovative company, as this requires a distinct culture, one that is more open and highly motivated. In Silicon Valley, office life consumes a great part of the day, as most developers are geeks and they love nothing more than coding. However, this is not necessarily a good practice, as all of us have a family to take care of.
It is up to each individual to define his or her own work-life balance, but the requirement is that employees come to the office fully charged and excited. They must feel that their work is appreciated and that they have support when they need it. Conclusions To make it short, here are the kinds of things management can do to increase the productivity of the team:

Involve the team in recruitment. Recruit people who treat programming as a hobby.
Offer monetary prizes or other kinds of encouragement for self-study and self-upgrading. Save money by not having company-sponsored courses (except for commercial products).
Make sure that the working process is result-oriented.
Apply Agile practices.
Encourage collaboration, eliminate inner competition.
Encourage sharing.
Encourage feedback.
Maintain employees' work-life balance and motivation.
Make sure employees can find support when they need it.

Reference: How to increase productivity from our JCG partner Nguyen Anh Tuan at the Developers Corner blog....

10 things you can do as a developer to make your app secure: #5 Authentication Controls

This is part #5 of a series of posts on the OWASP Top 10 Proactive Development Controls. In the previous post, we covered how important it is to think about Access Control and Authorization early in design, and common mistakes to avoid in implementing access controls. Now, on to Authentication, which is about proving the identity of a user, and then managing and protecting the user's identity as they use the system. Don't try to build this all yourself Building your own bullet-proof authentication and session management scheme is not easy. There are lots of places to make mistakes, which is why "Broken Authentication and Session Management" is #2 on the OWASP Top 10 list of serious application security problems. If your application framework doesn't take care of this properly for you, then look at a library like Apache Shiro or OWASP's ESAPI which provide functions for authentication and secure session management. Users and Passwords One of the most common authentication methods is to ask the user to enter an id and password. Like other fundamentals in application security this is simple in concept, but there are still places where you can make mistakes which the bad guys can and will take advantage of. First, carefully consider rules for user-ids and passwords, including password length and complexity. If you are using an email address as the user-id, be careful to keep this safe: bad guys may be interested in harvesting email addresses for other purposes. OWASP's ESAPI has methods to generateStrongPassword and verifyPasswordStrength which apply a set of checks that you could use to come up with secure passwords. Commonly-followed rules for password complexity are proving to be less useful in protecting a user's identity than letting users take advantage of a password manager or enter long password strings.
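A minimal strength check along those lines might look like the sketch below. This is an illustrative toy, not ESAPI's verifyPasswordStrength, and the thresholds are arbitrary; notably, it also shows the weakness just described, since a formulaic password sails through character-class rules.

```java
public class PasswordStrength {
    // Rough sketch: require minimum length, then accept either a long
    // passphrase or a shorter password using several character classes.
    // Thresholds here are arbitrary; real checks (e.g. ESAPI's) do more.
    static boolean isStrongEnough(String password) {
        if (password == null || password.length() < 8) {
            return false;
        }
        int classes = 0;
        if (password.chars().anyMatch(Character::isLowerCase)) classes++;
        if (password.chars().anyMatch(Character::isUpperCase)) classes++;
        if (password.chars().anyMatch(Character::isDigit)) classes++;
        if (password.chars().anyMatch(c -> !Character.isLetterOrDigit(c))) classes++;
        return password.length() >= 20 || classes >= 3;
    }

    public static void main(String[] args) {
        System.out.println(isStrongEnough("password"));                     // false
        System.out.println(isStrongEnough("Password1!"));                   // true: complexity rules pass it
        System.out.println(isStrongEnough("correct horse battery staple")); // true: long passphrase
    }
}
```

Note how the weak-but-formulaic "Password1!" passes, which is exactly why long passphrases and password managers beat complexity rules.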
A long password phrase or random data from a password manager is much harder to crack than something like "Password1!", which is likely to pass most application password strength checks. Password recovery is another important function that you need to be careful with – otherwise attackers can easily and quickly steal user information and hijack a user's identity. OWASP's Forgot Password Cheat Sheet can help you design a secure recovery function, covering things like choosing and using good User Security Questions, properly verifying the answers to these questions, and using a side channel to send a reset password token. Storing passwords securely is another critically important step. It's not enough just to salt and hash passwords any more. OWASP's Password Storage Cheat Sheet explains what you need to do and what algorithms to use. Session Management Once you establish a user's identity, you need to associate a unique session id with all of the actions that the user performs after they log in. The user's session id has to be carefully managed and protected: attackers can impersonate the user by guessing or stealing the session id. OWASP's Session Management Cheat Sheet explains how to properly set up a session, how to manage the different stages in a session id life cycle, and common attacks and defences on session management. For more information on Authentication and how to do it right, go to OWASP's Authentication Cheat Sheet. We're halfway through the list. On to #6: Protecting Data and Privacy.Reference: 10 things you can do as a developer to make your app secure: #5 Authentication Controls from our JCG partner Jim Bird at the Building Real Software blog....

Lambda Expressions and Stream API: basic examples

This blog post contains a list of basic Lambda expressions and Stream API examples I used in a live coding presentation I gave in June 2014 at Java User Group – Politechnica Gedanensis (Technical University of Gdańsk) and at Goyello. Lambda Expressions Syntax The most common example:

Runnable runnable = () -> System.out.println("Hello!");
Thread t = new Thread(runnable);
t.start();
t.join();

One can write this differently:

Thread t = new Thread(() -> System.out.println("Hello!"));
t.start();
t.join();

What about arguments?

Comparator<String> stringComparator = (s1, s2) -> s1.compareTo(s2);

And expanding to a full expression:

Comparator<String> stringComparator = (String s1, String s2) -> {
    System.out.println("Comparing...");
    return s1.compareTo(s2);
};

Functional interface Lambda expressions let you express instances of single-method classes more compactly. Single-method interfaces are called functional interfaces and can be annotated with @FunctionalInterface:

@FunctionalInterface
public interface MyFunctionalInterface<T> {
    boolean test(T t);
}

// Usage
MyFunctionalInterface<String> l = s -> s.startsWith("A");

Method references Method references are compact, easy-to-read lambda expressions for methods that already have a name. Let's look at this simple example:

public class Sample {

    public static void main(String[] args) {
        Runnable runnable = Sample::run;
    }

    private static void run() {
        System.out.println("Hello!");
    }
}

Another example:

public static void main(String[] args) {
    Sample sample = new Sample();
    Comparator<String> stringLengthComparator = sample::compareLength;
}

private int compareLength(String s1, String s2) {
    return s1.length() - s2.length();
}

Stream API – basics A stream is a sequence of elements supporting sequential and parallel bulk operations.
Iterating over a list

List<String> list = Arrays.asList("one", "two", "three", "four", "five", "six");

list.stream()
    .forEach(s -> System.out.println(s));

Filtering Java 8 introduced default methods in interfaces. They are handy in the Stream API:

Predicate<String> lowerThanOrEqualToFour = s -> s.length() <= 4;
Predicate<String> greaterThanOrEqualToThree = s -> s.length() >= 3;

list.stream()
    .filter(lowerThanOrEqualToFour.and(greaterThanOrEqualToThree))
    .forEach(s -> System.out.println(s));

Sorting

Predicate<String> lowerThanOrEqualToFour = s -> s.length() <= 4;
Predicate<String> greaterThanOrEqualToThree = s -> s.length() >= 3;
Comparator<String> byLastLetter = (s1, s2) -> s1.charAt(s1.length() - 1) - s2.charAt(s2.length() - 1);
Comparator<String> byLength = (s1, s2) -> s1.length() - s2.length();

list.stream()
    .filter(lowerThanOrEqualToFour.and(greaterThanOrEqualToThree))
    .sorted(byLastLetter.thenComparing(byLength))
    .forEach(s -> System.out.println(s));

In the above example the default method and of java.util.function.Predicate is used. Default (and static) methods are new to interfaces in Java 8.
Limit

Predicate<String> lowerThanOrEqualToFour = s -> s.length() <= 4;
Predicate<String> greaterThanOrEqualToThree = s -> s.length() >= 3;
Comparator<String> byLastLetter = (s1, s2) -> s1.charAt(s1.length() - 1) - s2.charAt(s2.length() - 1);
Comparator<String> byLength = (s1, s2) -> s1.length() - s2.length();

list.stream()
    .filter(lowerThanOrEqualToFour.and(greaterThanOrEqualToThree))
    .sorted(byLastLetter.thenComparing(byLength))
    .limit(4)
    .forEach(s -> System.out.println(s));

Collect to a list

Predicate<String> lowerThanOrEqualToFour = s -> s.length() <= 4;
Predicate<String> greaterThanOrEqualToThree = s -> s.length() >= 3;
Comparator<String> byLastLetter = (s1, s2) -> s1.charAt(s1.length() - 1) - s2.charAt(s2.length() - 1);
Comparator<String> byLength = (s1, s2) -> s1.length() - s2.length();

List<String> result = list.stream()
    .filter(lowerThanOrEqualToFour.and(greaterThanOrEqualToThree))
    .sorted(byLastLetter.thenComparing(byLength))
    .limit(4)
    .collect(Collectors.toList());

Parallel processing I used a quite common example of iterating over a list of files:

public static void main(String[] args) {
    File[] files = new File("c:/windows").listFiles();
    Stream.of(files)
        .parallel()
        .forEach(Sample::process);
}

private static void process(File file) {
    try {
        Thread.sleep(1000);
    } catch (InterruptedException e) {
    }
    System.out.println("Processing -> " + file);
}

Please note that while showing the examples I explained some known drawbacks with parallel processing of streams.
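One of those drawbacks can be shown in a couple of lines: with a parallel stream, forEach gives no ordering guarantee, while forEachOrdered restores the encounter order at the cost of some parallelism.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class ParallelCaveat {
    public static void main(String[] args) {
        List<Integer> input = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);

        // forEach on a parallel stream may observe elements in any order.
        List<Integer> unordered = new CopyOnWriteArrayList<>();
        input.parallelStream().forEach(unordered::add);

        // forEachOrdered preserves the encounter order even in parallel.
        List<Integer> ordered = new CopyOnWriteArrayList<>();
        input.parallelStream().forEachOrdered(ordered::add);

        System.out.println(ordered.equals(input));  // true
        System.out.println(unordered);              // some permutation of the input
    }
}
```

Side effects in forEach, like the list mutation above, also need a thread-safe target when the stream is parallel, which is why CopyOnWriteArrayList is used here.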
Stream API – more examples Mapping Iterate over files in a directory and return a FileSize object:

class FileSize {

    private final File file;
    private final Long size;

    FileSize(File file, Long size) {
        this.file = file;
        this.size = size;
    }

    File getFile() {
        return file;
    }

    Long getSize() {
        return size;
    }

    String getName() {
        return getFile().getName();
    }

    String getFirstLetter() {
        return getName().substring(0, 1);
    }

    @Override
    public String toString() {
        return "FileSize{file=" + file + ", size=" + size + "}";
    }
}

The final code of the mapping:

File[] files = new File("c:/windows").listFiles();
List<FileSize> result = Stream.of(files)
    .map(f -> new FileSize(f, f.length()))
    .collect(Collectors.toList());

Grouping Group FileSize objects by the first letter of the file name:

Map<String, List<FileSize>> result = Stream.of(files)
    .map(f -> new FileSize(f, f.length()))
    .collect(Collectors.groupingBy(FileSize::getFirstLetter));

Reduce Get the biggest/smallest file in a directory:

Optional<FileSize> filesize = Stream.of(files)
    .map(f -> new FileSize(f, f.length()))
    .reduce((fs1, fs2) -> fs1.getSize() > fs2.getSize() ? fs1 : fs2);

In case you don't need a FileSize object, but only a number:

OptionalLong max = Stream.of(files)
    .map(f -> new FileSize(f, f.length()))
    .mapToLong(fs -> fs.getSize())
    .max();

Reference: Lambda Expressions and Stream API: basic examples from our JCG partner Rafal Borowiec at the Codeleak.pl blog....

Graph Degree Distributions using R over Hadoop

There are two common types of graph engines. One type is focused on providing real-time, traversal-based algorithms over linked-list graphs represented on a single server. Such engines are typically called graph databases and some of the vendors include Neo4j, OrientDB, DEX, and InfiniteGraph. The other type of graph engine is focused on batch processing using vertex-centric message passing within a graph represented across a cluster of machines. Graph engines of this form include Hama, Golden Orb, Giraph, and Pregel. The purpose of this post is to demonstrate how to express the computation of two fundamental graph statistics — each as a graph traversal and as a MapReduce algorithm. The graph engines explored for this purpose are Neo4j and Hadoop. However, with respect to Hadoop, instead of focusing on a particular vertex-centric BSP-based graph-processing package such as Hama or Giraph, the results presented are via native Hadoop (HDFS + MapReduce). Moreover, instead of developing the MapReduce algorithms in Java, the R programming language is used. RHadoop is a small, open-source package developed by Revolution Analytics that binds R to Hadoop and allows for the representation of MapReduce algorithms using native R. The two graph algorithms presented compute degree statistics: vertex in-degree and graph in-degree distribution. Both are related, and in fact, the results of the first can be used as the input to the second. That is, graph in-degree distribution is a function of vertex in-degree. Together, these two fundamental statistics serve as a foundation for more advanced statistics developed in the domains of graph theory and network science.

Vertex in-degree: How many incoming edges does vertex X have?
Graph in-degree distribution: How many vertices have X number of incoming edges?

These two algorithms are calculated over an artificially generated graph that contains 100,000 vertices and 704,002 edges. A subset is diagrammed on the left.
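Both statistics can be computed from a plain edge list in a few lines. The sketch below is a toy, independent of both Neo4j and Hadoop, but it makes the relationship between the two concrete: the distribution is just a group-count over the per-vertex in-degrees.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DegreeStats {
    // Vertex in-degree: count each edge once, for its target vertex.
    static Map<Integer, Integer> inDegrees(List<int[]> edges, int vertexCount) {
        Map<Integer, Integer> inDegree = new HashMap<>();
        for (int v = 0; v < vertexCount; v++) {
            inDegree.put(v, 0);
        }
        for (int[] edge : edges) {
            inDegree.merge(edge[1], 1, Integer::sum);  // edge[1] is the target
        }
        return inDegree;
    }

    // Graph in-degree distribution: group-count the per-vertex in-degrees.
    static Map<Integer, Integer> distribution(Map<Integer, Integer> inDegree) {
        Map<Integer, Integer> dist = new HashMap<>();
        for (int degree : inDegree.values()) {
            dist.merge(degree, 1, Integer::sum);
        }
        return dist;
    }

    public static void main(String[] args) {
        // The same toy edges as the first five edges of the generated graph:
        // 2->0, 2->1, 3->0, 4->0, 4->1
        List<int[]> edges = Arrays.asList(
                new int[]{2, 0}, new int[]{2, 1}, new int[]{3, 0},
                new int[]{4, 0}, new int[]{4, 1});
        Map<Integer, Integer> inDegree = inDegrees(edges, 5);
        System.out.println(inDegree);
        System.out.println(distribution(inDegree));
    }
}
```

The two loops correspond directly to the two map/reduce passes developed later: the first pass keys counts by target vertex, the second keys counts by degree value.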
The algorithm used to generate the graph is called preferential attachment. Preferential attachment yields graphs with "natural statistics": degree distributions analogous to those of real-world graphs/networks. The respective igraph R code is provided below. Once the graph is constructed and simplified (i.e. no more than one edge between any two vertices and no self-loops), the vertices and edges are counted. Next, the first five edges are displayed; the first edge reads, "vertex 2 is connected to vertex 0." Finally, the graph is persisted to disk as a GraphML file.

~$ r

R version 2.13.1 (2011-07-08)
Copyright (C) 2011 The R Foundation for Statistical Computing

> library(igraph)
> g <- simplify(barabasi.game(100000, m=10))
> length(V(g))
[1] 100000
> length(E(g))
[1] 704002
> E(g)[1:5]
Edge sequence:
[1] 2 -> 0
[2] 2 -> 1
[3] 3 -> 0
[4] 4 -> 0
[5] 4 -> 1
> write.graph(g, '/tmp/barabasi.xml', format='graphml')

Graph Statistics using Neo4j

When a graph is on the order of 10 billion elements (vertices + edges), a single-server graph database is sufficient for performing graph analytics. As a side note, when those analytics/algorithms are "ego-centric" (i.e. when the traversal emanates from a single vertex or small set of vertices), they can typically be evaluated in real time (e.g. < 1000 ms). To compute these in-degree statistics, Gremlin is used. Gremlin is a graph traversal language developed by TinkerPop that is distributed with Neo4j, OrientDB, DEX, InfiniteGraph, and the RDF engine Stardog. The Gremlin code below loads the GraphML file created by R in the previous section into Neo4j and then counts the vertices and edges in the graph.
~$ gremlin

         \,,,/
         (o o)
-----oOOo-(_)-oOOo-----
gremlin> g = new Neo4jGraph('/tmp/barabasi')
==>neo4jgraph[EmbeddedGraphDatabase [/tmp/barabasi]]
gremlin> g.loadGraphML('/tmp/barabasi.xml')
==>null
gremlin> g.V.count()
==>100000
gremlin> g.E.count()
==>704002

The Gremlin code to calculate vertex in-degree is provided below. The first line iterates over all vertices and outputs each vertex with its in-degree. The second line adds a range filter so that only the first five vertices and their in-degree counts are displayed. Note that the clarifying diagrams demonstrate the transformations on a toy graph, not the 100,000-vertex graph used in the experiment.

gremlin> g.V.transform{[it, it.in.count()]}
...
gremlin> g.V.transform{[it, it.in.count()]}[0..4]
==>[v[1], 99104]
==>[v[2], 26432]
==>[v[3], 20896]
==>[v[4], 5685]
==>[v[5], 2194]

Next, to calculate the in-degree distribution of the graph, the following Gremlin traversal can be evaluated. This expression iterates through all the vertices in the graph, emits their in-degree, and then counts the number of times each particular in-degree is encountered. These counts are saved into an internal map maintained by groupCount, and the final cap step yields that map. A range filter is applied so that only the top five counts are displayed. The first line emitted says: "There are 52,611 vertices that do not have any incoming edges." The second line says: "There are 16,758 vertices that have one incoming edge."

gremlin> g.V.transform{it.in.count()}.groupCount.cap
...
gremlin> g.V.transform{it.in.count()}.groupCount.cap.next()[0..4]
==>0=52611
==>1=16758
==>2=8216
==>3=4805
==>4=3191

To calculate both statistics at once, using the results of the first computation as the input to the second, the following traversal can be executed. This representation has a direct correlate to how vertex in-degree and graph in-degree distribution are calculated with MapReduce (demonstrated in the next section).
gremlin> degreeV = [:]
gremlin> degreeG = [:]
gremlin> g.V.transform{[it, it.in.count()]}.sideEffect{degreeV[it[0]] = it[1]}.transform{it[1]}.groupCount(degreeG)
...
gremlin> degreeV[0..4]
==>v[1]=99104
==>v[2]=26432
==>v[3]=20896
==>v[4]=5685
==>v[5]=2194
gremlin> degreeG.sort{a,b -> b.value <=> a.value}[0..4]
==>0=52611
==>1=16758
==>2=8216
==>3=4805
==>4=3191

Graph Statistics using Hadoop

When a graph is on the order of 100+ billion elements (vertices + edges), a single-server graph database will not be able to represent or process the graph; a multi-machine graph engine is required. While native Hadoop is not a graph engine, a graph can be represented in its distributed file system (HDFS) and processed using its distributed processing framework (MapReduce). The graph generated previously is loaded up in R and its vertices and edges are counted. Next, the graph is represented as an edge list. An edge list (for a single-relational graph) is a list of pairs, where each pair is ordered and denotes the tail vertex id and the head vertex id of an edge. The edge list can be pushed to HDFS using RHadoop; the variable edge.list represents a pointer to this HDFS file.

> g <- read.graph('/tmp/barabasi.xml', format='graphml')
> length(V(g))
[1] 100000
> length(E(g))
[1] 704002
> edge.list <- to.dfs(get.edgelist(g))

In order to calculate vertex in-degree, a MapReduce job is evaluated on edge.list. The map function is fed key/value pairs where the key is an edge id and the value is the ids of the tail and head vertices of the edge (represented as a list). For each key/value input, the head vertex (i.e. the incoming vertex) is emitted along with the number 1. The reduce function is fed key/value pairs where the keys are vertices and the values are lists of 1s. The output of the reduce job is a vertex id and the length of its list of 1s (i.e. the number of times that vertex was seen as the incoming/head vertex of an edge).
The results of this MapReduce job are saved to HDFS, and degree.V is the pointer to that file. The final expression in the code chunk below reads the first key/value pair from degree.V: vertex 10030 has an in-degree of 5.

> degree.V <- mapreduce(edge.list, map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k,length(v)))
> from.dfs(degree.V)[[1]]
$key
[1] 10030
$val
[1] 5
attr(,"rmr.keyval")
[1] TRUE

In order to calculate graph in-degree distribution, a MapReduce job is evaluated on degree.V. The map function is fed the key/value results stored in degree.V and emits the in-degree of each vertex with the number 1 as its value. For example, if vertex 6 has an in-degree of 100, then the map function emits the key/value pair [100,1]. Next, the reduce function is fed keys that represent degrees, with values that are lists of 1s, one for each time that degree was seen. The output of the reduce function is the key along with the length of its list of 1s (i.e. the number of times a degree of a particular count was encountered). The final code fragment below grabs the first key/value pair from degree.g: degree 1354 was encountered 1 time.

> degree.g <- mapreduce(degree.V, map=function(k,v) keyval(v,1), reduce=function(k,v) keyval(k,length(v)))
> from.dfs(degree.g)[[1]]
$key
[1] 1354
$val
[1] 1
attr(,"rmr.keyval")
[1] TRUE

In concert, these two computations can be composed into a single MapReduce expression.

> degree.g <- mapreduce(mapreduce(edge.list, map=function(k,v) keyval(v[2],1), reduce=function(k,v) keyval(k,length(v))), map=function(k,v) keyval(v,1), reduce=function(k,v) keyval(k,length(v)))

Note that while a graph can be on the order of 100+ billion elements, the degree distribution is much smaller and can typically fit into memory. In general, edge.list > degree.V > degree.g. Due to this fact, it is possible to pull the degree.g file off of HDFS, place it into main memory, and plot the results stored within.
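The two chained jobs can be emulated in plain, single-process Python to make the map/shuffle/reduce mechanics explicit. This is only a sketch of the logic, not Hadoop or RHadoop code; the map_reduce helper and the toy edge list are inventions for illustration:

```python
from itertools import groupby

def map_reduce(pairs, mapper, reducer):
    # Minimal single-process MapReduce: map each (key, value) pair,
    # shuffle (sort and group by emitted key), then reduce each group.
    mapped = [mapper(k, v) for k, v in pairs]
    mapped.sort(key=lambda kv: kv[0])
    return [reducer(key, [v for _k, v in group])
            for key, group in groupby(mapped, key=lambda kv: kv[0])]

# Edge list keyed by edge id, valued by (tail, head), as in the R code.
edge_list = [(0, (2, 0)), (1, (2, 1)), (2, (3, 0)), (3, (4, 0)), (4, (4, 1))]

# Job 1: vertex in-degree. Emit (head, 1) for each edge (v[1] here is
# the head, matching v[2] in R's 1-indexed keyval(v[2],1)), then count
# the 1s per vertex.
degree_v = map_reduce(edge_list,
                      mapper=lambda k, v: (v[1], 1),
                      reducer=lambda k, ones: (k, len(ones)))

# Job 2: graph in-degree distribution. Emit (degree, 1) for each
# vertex, then count the 1s per degree.
degree_g = map_reduce(degree_v,
                      mapper=lambda k, v: (v, 1),
                      reducer=lambda k, ones: (k, len(ones)))

print(degree_v)  # [(0, 3), (1, 2)]
print(degree_g)  # [(2, 1), (3, 1)]
```

The composition mirrors the single nested mapreduce expression: the output pairs of job 1 are fed directly to job 2 without any intermediate reshaping.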
The degree.g distribution is plotted on a log/log plot. As suspected, the preferential attachment algorithm generated a graph with natural "scale-free" statistics: most vertices have a small in-degree and very few have a large in-degree.

> degree.g.memory <- from.dfs(degree.g)
> plot(keys(degree.g.memory), values(degree.g.memory), log='xy', main='Graph In-Degree Distribution', xlab='in-degree', ylab='frequency')

Related Material

Cohen, J., "Graph Twiddling in a MapReduce World," Computing in Science & Engineering, IEEE, 11(4), pp. 29-41, July 2009.

Reference: Graph Degree Distributions using R over Hadoop from our JCG partner Marko Rodriguez at the AURELIUS blog.
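The scale-free skew observed above is a property of preferential attachment itself, and it can be reproduced with a small self-contained sketch: grow a graph where each new vertex attaches to existing vertices with probability proportional to their degree, then inspect the in-degree distribution. This is a simplified Barabási-Albert variant in Python for illustration only, not the igraph/R pipeline used in the post:

```python
import random
from collections import Counter

def preferential_attachment(n, m, seed=42):
    """Grow an n-vertex graph; each new vertex adds up to m edges to
    existing vertices chosen with probability proportional to degree."""
    random.seed(seed)
    edges = []
    endpoints = [0]  # one entry per edge endpoint => degree-biased sampling
    for v in range(1, n):
        # Sample up to m distinct, degree-weighted attachment targets.
        targets = {random.choice(endpoints) for _ in range(m)}
        for t in targets:
            edges.append((v, t))  # edge v -> t
        endpoints.extend(targets)
        endpoints.extend([v] * len(targets))
    return edges

edges = preferential_attachment(2000, 3)
in_degree = Counter(t for _v, t in edges)

# A few hubs dominate while the typical vertex has a small in-degree.
print(max(in_degree.values()), sorted(in_degree.values())[len(in_degree) // 2])
```

Plotting this distribution on a log/log scale should show the same heavy-tailed shape as the barabasi.game graph above.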
Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use | Privacy Policy