JFreeChart with Groovy and Apache POI

The point of this article is to show you how to parse data from an Excel spreadsheet and turn it into a series of graphs. Recently I was looking for an opportunity to get some practice with JFreeChart and ended up looking at a dataset released by the Canadian government as part of their 'Open Data' initiative. The particular set of data is entitled 'Number of Seedlings Planted by Ownership, Species' and is delivered as an Excel spreadsheet, hence the need for the Apache POI library in order to read the data in. As is fairly usual, at least in my experience, the Excel spreadsheet is designed primarily for human consumption, which adds a degree of complexity to the parsing. Fortunately the spreadsheet does follow a repetitive pattern that can be accounted for fairly easily, so this is not insurmountable. Still, we want to get the data out of Excel to make it more approachable for machine consumption, so the first step is to convert it to a JSON representation. Once it is in this much more transportable form we can readily convert the data into graph visualizations using JFreeChart.

The spreadsheet format

Excel as a workplace tool is very well established, can increase individual productivity and is definitely a boon to your average office worker. The problem is that once the data is there, it's often trapped there. Data tends to be laid out based on human aesthetics and not on parsability, meaning that unless you want to use Excel itself to do further analysis, there aren't a lot of options. Exports to more neutral formats like CSV suffer from the same problem: there's no way to read in the data coherently without designing a custom parser. In this particular case, parsing the spreadsheet has to take into account merged cells, where one column is meant to represent a fixed value for a number of sequential rows, and column headers that do not represent all of the actual columns.
Here we have a 'notes' column for each province that immediately follows its data column. As the header cells are merged across both of these columns, they cannot be used directly to parse the data. Data is broken down into several domains that lead to repetitions in the format. The data contains a mix of numbers where results are available and text where they are not; the meanings of the text entries are described in a table at the end of the spreadsheet. Section titles and headers are repeated throughout the document, apparently trying to match some print layout, or perhaps just trying to provide some assistance to those scrolling through the long document.

Data in the spreadsheet is first divided into reporting by Provincial crown land, private land, Federal land, and finally a total for all of them. Within each of these sections, data is reported for each tree species on a yearly basis across all Provinces and Territories, along with aggregate totals of these figures across Canada. Each of these species data-tables has an identical row/column structure, which allows us to create a single parsing structure sufficient for reading in data from each of them separately.

Converting the spreadsheet to JSON

For parsing the Excel document, I'm using the Apache POI library and a Groovy wrapper class to assist in processing. The wrapper class is very simple but allows us to abstract most of the mechanics of dealing with the Excel document away. The full source is available on this blog post from author Goran Ehrsson. The key benefit is the ability to specify a window of the file to process based on 'offset' and 'max' parameters provided in a simple map. Here's an example for reading data for the text symbols table at the end of the spreadsheet. We define a Map which states which sheet to read from, which line to start on (offset) and how many lines to process.
The ExcelBuilder class (which isn't really a builder at all) takes in the path to a File object and under the hood reads it into a POI HSSFWorkbook, which is then referenced by the call to the eachLine method.

public static final Map SYMBOLS = [sheet: SHEET1, offset: 910, max: 8]
...
final ExcelBuilder excelReader = new ExcelBuilder(data.absolutePath)
Map<String, String> symbolTable = [:]
excelReader.eachLine(SYMBOLS) { HSSFRow row ->
    symbolTable[row.getCell(0).stringCellValue] = row.getCell(1).stringCellValue
}

Eventually when we turn this into JSON, it will look like this:

"Symbols": {
    "...": "Figures not appropriate or not applicable",
    "..": "Figures not available",
    "--": "Amount too small to be expressed",
    "-": "Nil or zero",
    "p": "Preliminary figures",
    "r": "Revised figures",
    "e": "Estimated by provincial or territorial forestry agency",
    "E": "Estimated by the Canadian Forest Service or by Statistics Canada"
}

Now processing the other data blocks gets a little bit trickier. The first column consists of 2 merged cells, and all but one of the other headers actually represents two columns of information: a count and an optional notation. The merged column is handled by a simple EMPTY placeholder and the extra columns by processing the list of headers:

public static final List<String> HEADERS = ['Species', 'EMPTY', 'Year', 'NL', 'PE', 'NS', 'NB', 'QC', 'ON', 'MB', 'SK', 'AB', 'BC', 'YT', 'NT *a', 'NU', 'CA']

/**
 * For each header add a second following header for a 'notes' column
 * @param strings
 * @return expanded list of headers
 */
private List<String> expandHeaders(List<String> strings) {
    strings.collect { [it, "${it}_notes"] }.flatten()
}

Each data block corresponds to a particular species of tree, broken down by year and Province or Territory. Each species is represented by a map which defines where in the document that information is contained, so we can iterate over a collection of these maps and aggregate data quite easily.
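The merged-cell handling described above boils down to a forward-fill: when a value spans several rows, only the first row carries it, and the parser has to carry it forward. Here is a framework-free sketch of that idea in plain Java (the two-column layout is invented for illustration, not the actual spreadsheet structure):

```java
import java.util.ArrayList;
import java.util.List;

public class ForwardFill {
    // Carry the last non-blank value of the first column forward,
    // mimicking how a merged cell applies to several sequential rows.
    public static List<String[]> fill(List<String[]> rows) {
        List<String[]> result = new ArrayList<>();
        String current = "";
        for (String[] row : rows) {
            if (!row[0].isEmpty()) {
                current = row[0]; // a new merged block starts here
            }
            result.add(new String[] { current, row[1] });
        }
        return result;
    }
}
```

Rows like {"Pine", "1990"}, {"", "1991"} come out as {"Pine", "1990"}, {"Pine", "1991"}, which is exactly the shape the per-species parsing needs.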
This set of constants and code is sufficient for parsing all of the data in the document.

public static final int HEADER_OFFSET = 3
public static final int YEARS = 21
public static final Map PINE = [sheet: SHEET1, offset: 6, max: YEARS, species: 'Pine']
public static final Map SPRUCE = [sheet: SHEET1, offset: 29, max: YEARS, species: 'Spruce']
public static final Map FIR = [sheet: SHEET1, offset: 61, max: YEARS, species: 'Fir']
public static final Map DOUGLAS_FIR = [sheet: SHEET1, offset: 84, max: YEARS, species: 'Douglas-fir']
public static final Map MISCELLANEOUS_SOFTWOODS = [sheet: SHEET1, offset: 116, max: YEARS, species: 'Miscellaneous softwoods']
public static final Map MISCELLANEOUS_HARDWOODS = [sheet: SHEET1, offset: 139, max: YEARS, species: 'Miscellaneous hardwoods']
public static final Map UNSPECIFIED = [sheet: SHEET1, offset: 171, max: YEARS, species: 'Unspecified']
public static final Map TOTAL_PLANTING = [sheet: SHEET1, offset: 194, max: YEARS, species: 'Total planting']
public static final List<Map> PROVINCIAL = [PINE, SPRUCE, FIR, DOUGLAS_FIR, MISCELLANEOUS_SOFTWOODS, MISCELLANEOUS_HARDWOODS, UNSPECIFIED, TOTAL_PLANTING]
public static final List<String> AREAS = HEADERS[HEADER_OFFSET..-1]

Closure collector = { Map species ->
    Map speciesMap = [name: species.species]
    excelReader.eachLine(species) { HSSFRow row ->
        //ensure that we are reading from the correct place in the file
        if (row.rowNum == species.offset) {
            assert row.getCell(0).stringCellValue == species.species
        }
        //process rows
        if (row.rowNum > species.offset) {
            final int year = row.getCell(HEADERS.indexOf('Year')).stringCellValue as int
            Map yearMap = [:]
            expandHeaders(AREAS).eachWithIndex { String header, int index ->
                final HSSFCell cell = row.getCell(index + HEADER_OFFSET)
                yearMap[header] = cell.cellType == HSSFCell.CELL_TYPE_STRING ?
                        cell.stringCellValue : cell.numericCellValue
            }
            speciesMap[year] = yearMap.asImmutable()
        }
    }
    speciesMap.asImmutable()
}

The defined collector Closure returns a map of all species data for one of the four groupings (Provincial, private land, Federal and totals). The only thing that differentiates these groups is their offset in the file, so we can define maps for the structure of each simply by updating the offsets of the first.

public static final List<Map> PROVINCIAL = [PINE, SPRUCE, FIR, DOUGLAS_FIR, MISCELLANEOUS_SOFTWOODS, MISCELLANEOUS_HARDWOODS, UNSPECIFIED, TOTAL_PLANTING]
public static final List<Map> PRIVATE_LAND = offset(PROVINCIAL, 220)
public static final List<Map> FEDERAL = offset(PROVINCIAL, 441)
public static final List<Map> TOTAL = offset(PROVINCIAL, 662)

private static List<Map> offset(List<Map> maps, int offset) {
    maps.collect { Map map ->
        Map offsetMap = new LinkedHashMap(map)
        offsetMap.offset = offsetMap.offset + offset
        offsetMap
    }
}

Finally, we can iterate over these simple map structures applying the collector Closure, and we end up with a single map representing all of the data.

def parsedSpreadsheet = [PROVINCIAL, PRIVATE_LAND, FEDERAL, TOTAL].collect {
    it.collect(collector)
}
Map resultsMap = [:]
GROUPINGS.eachWithIndex { String groupName, int index ->
    resultsMap[groupName] = parsedSpreadsheet[index]
}
resultsMap['Symbols'] = symbolTable

And the JsonBuilder class provides an easy way to convert any map to a JSON document, ready to write out the results.

Map map = new NaturalResourcesCanadaExcelParser().convertToMap(data)
new File('src/test/resources/NaturalResourcesCanadaNewSeedlings.json').withWriter { Writer writer ->
    writer << new JsonBuilder(map).toPrettyString()
}

Parsing JSON into JFreeChart line charts

All right, so now that we've turned the data into a slightly more consumable format, it's time to visualize it.
For this case I'm using a combination of the JFreeChart library and the GroovyChart project, which provides a nice DSL syntax for working with the JFreeChart API. It doesn't look to be under development presently, but aside from the fact that the jar isn't published to an available repository, it was totally up to this task. We're going to create four charts for each of the fourteen areas represented, for a total of 56 graphs overall. All of these graphs contain plotlines for each of the eight tree species tracked, which means that overall we need to create 448 distinct time series. I didn't do any formal timings of how long this takes, but in general it came in somewhere under ten seconds to generate all of these. Just for fun, I added GPars to the mix to parallelize creation of the charts, but since writing the images to disk is going to be the most expensive part of this process, I don't imagine it's speeding things up terribly much.

First, reading in the JSON data from a file is simple with JsonSlurper.

def data
new File(jsonFilename).withReader { Reader reader ->
    data = new JsonSlurper().parse(reader)
}
assert data

Here's a sample of what the JSON data looks like for one species over a single year, broken down first by one of the four major groups, then by tree species, then by year and finally by Province or Territory.

{
    "Provincial": [
        {
            "name": "Pine",
            "1990": {
                "NL": 583.0, "NL_notes": "",
                "PE": 52.0, "PE_notes": "",
                "NS": 4.0, "NS_notes": "",
                "NB": 4715.0, "NB_notes": "",
                "QC": 33422.0, "QC_notes": "",
                "ON": 51062.0, "ON_notes": "",
                "MB": 2985.0, "MB_notes": "",
                "SK": 4671.0, "SK_notes": "",
                "AB": 8130.0, "AB_notes": "",
                "BC": 89167.0, "BC_notes": "e",
                "YT": "-", "YT_notes": "",
                "NT *a": 15.0, "NT *a_notes": "",
                "NU": "..", "NU_notes": "",
                "CA": 194806.0, "CA_notes": "e"
            },
            ...

Building the charts is a simple matter of iterating over the resulting map of parsed data.
In this case we're ignoring the 'notes' data but have included it in the dataset in case we want to use it later. We're also just ignoring any non-numeric values.

GROUPINGS.each { group ->
    withPool {
        AREAS.eachParallel { area ->
            ChartBuilder builder = new ChartBuilder()
            String title = sanitizeName("$group-$area")
            TimeseriesChart chart = builder.timeserieschart(title: group,
                    timeAxisLabel: 'Year',
                    valueAxisLabel: 'Number of Seedlings (1000s)',
                    legend: true,
                    tooltips: false,
                    urls: false) {
                timeSeriesCollection {
                    data."$group".each { species ->
                        Set years = (species.keySet() - 'name').collect { it as int }
                        timeSeries(name:, timePeriodClass: '') {
                            years.sort().each { year ->
                                final value = species."$year"."$area"
                                //check that it's a numeric value
                                if (!(value instanceof String)) {
                                    add(period: new Year(year), value: value)
                                }
                            }
                        }
                    }
                }
            }
            ...
        }
    }
}

Then we apply some additional formatting to the JFreeChart to enhance the output styling, insert an image into the background, and fix the plot color schemes.

JFreeChart innerChart = chart.chart
String longName = PROVINCE_SHORT_FORM_MAPPINGS.find { it.value == area }.key
innerChart.addSubtitle(new TextTitle(longName))
innerChart.setBackgroundPaint(Color.white)
innerChart.plot.setBackgroundPaint(Color.lightGray.brighter())
innerChart.plot.setBackgroundImageAlignment(Align.TOP_RIGHT)
innerChart.plot.setBackgroundImage(logo)
[Color.BLUE, Color.GREEN, Color.ORANGE, Color.CYAN, Color.MAGENTA, Color.BLACK, Color.PINK, Color.RED].eachWithIndex { color, int index ->
    innerChart.XYPlot.renderer.setSeriesPaint(index, color)
}

And we write out each of the charts to a formulaically named png file.
def fileTitle = "$FILE_PREFIX-${title}.png"
File outputDir = new File(outputDirectory)
if (!outputDir.exists()) {
    outputDir.mkdirs()
}
File file = new File(outputDir, fileTitle)
if (file.exists()) {
    file.delete()
}
ChartUtilities.saveChartAsPNG(file, innerChart, 550, 300)

To tie it all together, an html page is created using MarkupBuilder to showcase all of the results, organized by Province or Territory.

def buildHtml(inputDirectory) {
    File inputDir = new File(inputDirectory)
    assert inputDir.exists()
    Writer writer = new StringWriter()
    MarkupBuilder builder = new MarkupBuilder(writer)
    builder.html {
        head {
            title('Number of Seedlings Planted by Ownership, Species')
            style(type: 'text/css') {
                mkp.yield(CSS)
            }
        }
        body {
            ul {
                AREAS.each { area ->
                    String areaName = sanitizeName(area)
                    div(class: 'area rounded-corners', id: areaName) {
                        h2(PROVINCE_SHORT_FORM_MAPPINGS.find { it.value == area }.key)
                        inputDir.eachFileMatch(~/.*$areaName\.png/) {
                            img(src:
                        }
                    }
                }
            }
            script(type: 'text/javascript', src: '', '')
            script(type: 'text/javascript') {
                mkp.yield(JQUERY_FUNCTION)
            }
        }
    }
    writer.toString()
}

The generated html page assumes that all images are co-located in the same folder, presents four images per Province/Territory and, just for fun, uses JQuery to attach a click handler to each of the headers. Click on a header and the images in that div will animate into the background. I'm sure the actual JQuery being used could be improved upon, but it serves its purpose. Here's a sample of the html output:

<ul>
    <div class='area rounded-corners' id='NL'>
        <h2>Newfoundland and Labrador</h2>
        <img src='naturalResourcesCanadaNewSeedlings-Federal-NL.png' />
        <img src='naturalResourcesCanadaNewSeedlings-PrivateLand-NL.png' />
        <img src='naturalResourcesCanadaNewSeedlings-Provincial-NL.png' />
        <img src='naturalResourcesCanadaNewSeedlings-Total-NL.png' />
    </div>
    ...

Source code and Links

The source code is available on GitHub.
So is the final resulting html page. The entire source required to go from Excel to charts embedded in an html page comes in at slightly under 300 lines of code, and I don't think the results are too bad for the couple of hours of effort involved. Finally, the JSON results are also hosted on the GitHub pages for the project for anyone else who might want to delve into the data.

Some reading related to this topic:
Groovy loves POI and POI loves Groovy
Writing batch import scripts with Grails, GSQL and GPars
GSheets – A Groovy Builder based on Apache POI
Groovy for the Office

Related links:
Groovy inspect()/Eval for Externalizing Data
Groovy reverse map sort done easy
Five Cool Things You Can Do With Groovy Scripts

Reference: JFreeChart with Groovy and Apache POI from our JCG partner Kelly Robinson at The Kaptain on … stuff blog....

Spring @Configuration and FactoryBean

Consider a FactoryBean for defining a cache using a Spring configuration file:

<cache:annotation-driven />
<context:component-scan base-package='org.bk.samples.cachexml'></context:component-scan>
<bean id='cacheManager' class=''>
    <property name='caches'>
        <set>
            <ref bean='defaultCache'/>
        </set>
    </property>
</bean>
<bean name='defaultCache' class='org.springframework.cache.concurrent.ConcurrentMapCacheFactoryBean'>
    <property name='name' value='default'/>
</bean>

The factory bean ConcurrentMapCacheFactoryBean is a bean which is in turn responsible for creating a Cache bean. My first attempt at translating this setup to a @Configuration style was the following:

@Bean
public SimpleCacheManager cacheManager(){
    SimpleCacheManager cacheManager = new SimpleCacheManager();
    List<Cache> caches = new ArrayList<Cache>();
    ConcurrentMapCacheFactoryBean cacheFactoryBean = new ConcurrentMapCacheFactoryBean();
    cacheFactoryBean.setName("default");
    caches.add(cacheFactoryBean.getObject());
    cacheManager.setCaches(caches);
    return cacheManager;
}

This did not work, however. The reason is that here I have bypassed some Spring bean lifecycle mechanisms altogether. It turns out that ConcurrentMapCacheFactoryBean also implements the InitializingBean interface and does an eager initialization of the cache in the afterPropertiesSet method of InitializingBean. By directly calling factoryBean.getObject(), I was completely bypassing the afterPropertiesSet method. There are two possible solutions: 1.
Define the FactoryBean the same way it is defined in the XML:

@Bean
public SimpleCacheManager cacheManager(){
    SimpleCacheManager cacheManager = new SimpleCacheManager();
    List<Cache> caches = new ArrayList<Cache>();
    caches.add(cacheBean().getObject());
    cacheManager.setCaches(caches);
    return cacheManager;
}

@Bean
public ConcurrentMapCacheFactoryBean cacheBean(){
    ConcurrentMapCacheFactoryBean cacheFactoryBean = new ConcurrentMapCacheFactoryBean();
    cacheFactoryBean.setName("default");
    return cacheFactoryBean;
}

In this case, there is an explicit FactoryBean being returned from a @Bean method, and Spring will take care of calling the lifecycle methods on this bean.

2. Replicate the behavior in the relevant lifecycle methods. In this specific instance I know that the FactoryBean instantiates the ConcurrentMapCache in the afterPropertiesSet method, so I can replicate this behavior directly this way:

@Bean
public SimpleCacheManager cacheManager(){
    SimpleCacheManager cacheManager = new SimpleCacheManager();
    List<Cache> caches = new ArrayList<Cache>();
    caches.add(cacheBean());
    cacheManager.setCaches(caches);
    return cacheManager;
}

@Bean
public Cache cacheBean(){
    Cache cache = new ConcurrentMapCache("default");
    return cache;
}

Something to keep in mind when translating a FactoryBean from xml to @Configuration.
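The underlying pitfall is framework-independent: if a factory defers its setup to an init callback, calling its product getter directly skips that callback. A stripped-down sketch of the lifecycle, with hypothetical class names standing in for Spring's FactoryBean/InitializingBean machinery:

```java
public class CacheFactory {
    private Object cache; // built in the init callback, not in the constructor

    // Plays the role of InitializingBean.afterPropertiesSet():
    // the container calls this after wiring up properties.
    public void afterPropertiesSet() {
        cache = new Object();
    }

    // Plays the role of FactoryBean.getObject().
    public Object getObject() {
        return cache; // null if the lifecycle callback was skipped
    }
}
```

Calling `new CacheFactory().getObject()` by hand returns null, which is exactly the bug above; letting the container (or the caller) invoke `afterPropertiesSet()` first yields a real cache.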
Note: A working one page test as a gist is available here:

package org.bk.samples.cache;

import static org.hamcrest.MatcherAssert.assertThat;
import static org.hamcrest.Matchers.equalTo;

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.cache.Cache;
import org.springframework.cache.annotation.Cacheable;
import org.springframework.cache.annotation.EnableCaching;
import org.springframework.cache.concurrent.ConcurrentMapCacheFactoryBean;
import;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.ComponentScan;
import org.springframework.context.annotation.Configuration;
import org.springframework.stereotype.Component;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(classes={TestSpringCache.TestConfiguration.class})
public class TestSpringCache {

    @Autowired
    TestService testService;

    @Test
    public void testCache() {
        String response1 = testService.cachedMethod("param1", "param2");
        String response2 = testService.cachedMethod("param1", "param2");
        assertThat(response2, equalTo(response1));
    }

    @Configuration
    @EnableCaching
    @ComponentScan("org.bk.samples.cache")
    public static class TestConfiguration {
        @Bean
        public SimpleCacheManager cacheManager(){
            SimpleCacheManager cacheManager = new SimpleCacheManager();
            List<Cache> caches = new ArrayList<Cache>();
            caches.add(cacheBean().getObject());
            cacheManager.setCaches(caches);
            return cacheManager;
        }

        @Bean
        public ConcurrentMapCacheFactoryBean cacheBean(){
            ConcurrentMapCacheFactoryBean cacheFactoryBean = new ConcurrentMapCacheFactoryBean();
            cacheFactoryBean.setName("default");
            return cacheFactoryBean;
        }
    }
}

interface TestService {
    String cachedMethod(String param1, String param2);
}

@Component
class TestServiceImpl implements TestService {
    @Cacheable(value="default", key="#p0.concat('-').concat(#p1)")
    public String cachedMethod(String param1, String param2){
        return "response " + new Random().nextInt();
    }
}

Reference: Spring @Configuration and FactoryBean from our JCG partner Biju Kunjummen at the all and sundry blog....

Fixing Bugs that can’t be Reproduced

There are bugs that can't be reproduced, or at least not easily: intermittent and transient errors; bugs that disappear when you try to look for them; bugs that occur as the result of a long chain of independent operations or cross-request timing. Some of these bugs are only found in high-scale production systems that have been running for a long time under heavy load. Capers Jones calls these bugs "abeyant defects" and estimates that in big systems, as much as 10% of bugs cannot be reproduced, or are too expensive to try reproducing. These bugs can cost 100x more to fix than a simple defect – an "average" bug like this can take more than a week for somebody to find (if it can be found at all) by walking through the design and code, and up to another week or two to fix.

Heisenbugs

One class of bugs that can't be reproduced are Heisenbugs: bugs that disappear when you attempt to trace or isolate them. When you add tracing code, or step through the problem in a debugger, the problem goes away. In Debug It!, Paul Butcher offers some hope for dealing with these bugs. He says Heisenbugs are caused by non-deterministic behavior, which in turn can only be caused by:

Unpredictable initial state – a common problem in C/C++ code
Interaction with external systems – which can be isolated and stubbed out, although this is not always easy
Deliberate randomness – random factors can also be stubbed out in testing
Concurrency – the most common cause of Heisenbugs today, at least in Java

Knowing this (or at least making yourself believe it while you are trying to find a problem like this) can help you decide where to start looking for a cause, and how to go forward. But unfortunately, it doesn't mean that you will find the bug, at least not soon.

Race Conditions

Race conditions used to be problems only for systems programmers and people writing communications handlers. But almost everybody runs into race conditions today – is anybody writing code that isn't multi-threaded anymore?
Races are synchronization errors that occur when two or more threads or processes access the same data or resource without a consistent locking approach. Races can result in corrupted data – memory getting stomped on (especially in C/C++), changes applied more than once (balances going negative) or changes being lost (credits without debits) – inconsistent UI behaviour, random crashes due to null pointer problems (a thread references an object that has already been freed by another thread), intermittent timeouts, and actions executed out of sequence, including time-of-check/time-of-use security violations. The results will depend on which thread or process wins the race a particular time. Because races are the result of unlucky timing, and because the problem may not be visible right away (e.g., something gets stepped on but you don't know until much later, when something else tries to use it), they're usually hard to understand and hard to make happen.

You can't fix (or probably even find) a race condition without understanding concurrency. And the fact that you have a race condition is a good sign that whoever wrote this code didn't understand concurrency, so you're probably not dealing with only one mistake. You'll have to be careful in diagnosing, and especially in fixing, the bug, to make sure that you don't change the problem from a race into a stall, thread starvation, livelock or deadlock instead, by getting the synchronization approach wrong.

Fixing bugs that can't be reproduced

If you think you're dealing with a concurrency problem, a race condition or a timing-related bug, try introducing long pauses between different threads – this will expand the window for races and timing-related problems to occur, which should make the problem more obvious.
This is what IBM Research's ConTest tool does (or did – unfortunately, ConTest seems to have disappeared from the IBM alphaWorks site): messing with thread scheduling to make deadlocks and races occur more often.

If you can't reproduce the bug, that doesn't mean you should give up. There are still some things to look at and try. It's often faster to find concurrency bugs, timing problems and other hard problems by working back from the error and stepping through the code – in a debugger or by hand – to build up your own model of how the code is supposed to work. "Even if you can't easily reproduce the bug in the lab, use the debugger to understand the affected code, judiciously stepping into or over functions based on your level of 'need to know' about that code path. Examine live data and stack traces to augment your knowledge of the code paths." Jeff Vroom, Debugging Hard Problems

I've worked with brilliant programmers who can work back through even the nastiest code, tracing through what's going on until they see the error. For the rest of us, debugging is a perfect time to pair up. While I'm not sure that pair programming makes a lot of sense as an everyday practice for everyday work, I haven't seen too many ugly bugs solved without two smart people stepping through the code and logs together.

It's also important to check compiler and static analysis warnings – these tools can help point to easily-overlooked coding mistakes that could be the cause of your bug, and maybe even other bugs. Findbugs has concurrency bug pattern checkers that can point out common concurrency mistakes, as well as lots of other coding mistakes that you can miss finding on your own. There are also dynamic analysis tools that are supposed to help find race conditions and other kinds of concurrency bugs at run-time, but I haven't seen any of them actually work. If anybody has had success with tools like this in real-world applications I would like to hear about it.
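The lost-update races discussed above are easy to demonstrate deliberately, which helps build intuition before hunting a real one: two threads increment a shared counter without synchronization, and because `counter++` is a non-atomic read-modify-write, increments go missing. A minimal, self-contained sketch:

```java
public class LostUpdate {
    static int unsafeCounter = 0;

    public static int race(int incrementsPerThread) {
        unsafeCounter = 0;
        Runnable work = () -> {
            for (int i = 0; i < incrementsPerThread; i++) {
                unsafeCounter++; // read-modify-write: not atomic, so updates can be lost
            }
        };
        Thread t1 = new Thread(work);
        Thread t2 = new Thread(work);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        // With proper locking this would always be 2 * incrementsPerThread;
        // without it, the result is usually smaller - and varies run to run.
        return unsafeCounter;
    }
}
```

Run it a few times with a large enough count and the total fluctuates below the expected value, which is exactly the "depends on who wins the race" behaviour described earlier. Replacing the `int` with an `AtomicInteger`, or synchronizing the increment, makes the result deterministic again.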
Shotgun Debugging, Offensive Programming, Brute-Force Debugging and other ideas

Debug It! recommends that if you are desperate, you just take a copy of the code, try changing something, anything, and see what happens. Sometimes the new information that results from your change may point you in a new direction. This kind of undirected "Shotgun Debugging" is basically wishful thinking, relying on luck and accidental success – what Andy Hunt and Dave Thomas call "Programming by Coincidence". It isn't something that you want to rely on, or be proud of. But there are some changes that do make a lot of sense to try.

In Code Complete, Steve McConnell recommends making "offensive coding" changes: adding asserts and other run-time debugging checks that will cause the code to fail if something "impossible" happens, because something "impossible" apparently is happening.

Jeff Vroom suggests writing your own debugging code: "For certain types of complex code, I will write debugging code, which I put in temporarily just to isolate a specific code path where a simple breakpoint won't do. I've found using the debugger's conditional breakpoints is usually too slow when the code path you are testing is complicated. You may hit a specific method 1000s of times before the one that causes the failure. The only way to stop in the right iteration is to add specific code to test for values of input parameters… Once you stop at the interesting point, you can examine all of the relevant state and use that to understand more about the program."

Paul Butcher suggests that before you give up trying to reproduce and fix a problem, you see if there are any other bugs reported in the same area and try to fix them – even if they aren't serious bugs. The rationale: fixing other bugs may clear up the situation (your bug was being masked by another one), and by working on something else in the same area, you may learn more about the code and get some new ideas about your original problem.
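McConnell's "offensive coding" advice can be as simple as a run-time check that fails loudly the moment an "impossible" state shows up, instead of letting the corruption propagate and surface somewhere unrelated. A small sketch (the account invariant here is illustrative, not from the article):

```java
public class Account {
    private long balanceCents;

    public Account(long balanceCents) {
        this.balanceCents = balanceCents;
    }

    public void debit(long cents) {
        if (cents < 0) {
            throw new IllegalArgumentException("negative debit: " + cents);
        }
        balanceCents -= cents;
        // Offensive check: a negative balance is "impossible" here, so fail
        // immediately at the point of corruption rather than much later.
        if (balanceCents < 0) {
            throw new IllegalStateException("balance went negative: " + balanceCents);
        }
    }

    public long balance() { return balanceCents; }
}
```

The point is that the stack trace now points at the first moment the invariant broke, which is usually far closer to the real bug than the place where the bad value is eventually consumed.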
Refactoring, or even quickly rewriting some of the code where you think the problem might be, can sometimes help you to see the problem more clearly, especially if the code is difficult to follow. Or it could possibly move the problem somewhere else, making it easier to find. If you can't find it and fix it, at least add defensive code: tighten up error handling and input checking, and add some logging – if something is happening that can't happen or won't happen for you, you need more information on where to look when it happens again.

Question Everything

Finally, The Pragmatic Programmer tells us to remember that nothing is impossible. If you aren't getting closer to the answer, question your assumptions. Code that has always worked might not work in this particular case. Unless you have a complete trace of the problem in the original report, it's possible that the report is incomplete or misleading – that the problem you need to solve is not the problem that you are looking at. If you're trapped in a real debugging nightmare, look outside of your own code. The bug may not be in your code, but in underlying third party code in your service stack or the OS. In big systems under heavy load, I've run into problems in the operating system kernel, web servers, messaging middleware, virtual machines, and the DBMS. Your job when debugging a problem like this is to make sure that you know where the bug isn't (in your code), and try to come up with as simple a test case as possible that shows this – when you're working with a vendor, they're not going to be able to set up and run a full-scale enterprise app in order to reproduce your problem.

Hard bugs that can't be reproduced can blow schedules and service levels to hell, and wear out even your best people. Taking all of this into account, let's get back to the original problem of how or whether to estimate bug fixes, in the next post.
Reference: Fixing Bugs that can’t be Reproduced from our JCG partner Jim Bird at the Building Real Software blog....

Enterprise SOAP Services in a Web 2.0 REST/JSON World

With the popularity of JSON and other NoSQL data standards, the complexity and in some cases the plain verbosity of XML formats are being shunned. However, XML and the abilities and standards that have formed around it have become key to the enterprise and their business processes, and the needs of their customers require that they start to support multiple formats. To this end, tooling frameworks like Axis2 have started to add support for taking WSDL based services and generating JSON responses.

Enterprises need to live in this post-SOAP world, but leverage the expertise and SOA infrastructures they have developed over the years. Axis2 is one way to do it, but it doesn't provide monitoring and policy support out of the box. Another alternative is the Turmeric SOA project from eBay. Natively, out of the box, one can take a WSDL, like the one provided by the STAR standards organization, and add support not only for SOAP 1.1/1.2, but also for REST style services serving XML, JSON, and any other data format you would need to support.

There is a catch, though: Turmeric SOA was not designed with full SOAP and the W3C web service stack in mind. It uses WSDL only to describe the operations and the data formats supported by the service, so advanced features like WS-Security, WS-Reliable Messaging, and XML Encryption are not natively built into Turmeric. Depending on your needs you will need to work with pipeline Handlers and enhance the protocol processors to support some of the advanced features. However, these are items that can be worked around, and it can interoperate with existing web services to act as a proxy.

As an example, the STAR organisation provides a web service specification that has been implemented in the automotive industry to provide transports for their specifications. Using a framework like Turmeric SOA, existing applications can be made available to trading partners and consumers of the service in multiple formats.
As an example, one could provide data in RESTful XML:

<?xml version='1.0' encoding='UTF-8'?>
<ns2:PullMessageResponse xmlns:ms="" xmlns:ns2="">
  <ns2:payload>
    <ns2:content id="1"/>
  </ns2:payload>
</ns2:PullMessageResponse>

Or one can provide the same data represented in a JSON format:

{
  "jsonns.ns2": "",
  "jsonns.ns3": "",
  "": "",
  "jsonns.xs": "",
  "jsonns.xsi": "",
  "ns2.PullMessageResponse": {
    "ns2.payload": {
      "ns2.content": [ { "@id": "1" } ]
    }
  }
}

The above is generated from the same web service, with just a header changed to indicate the data format that should be returned. No changes to the business logic or the web service implementation code itself are required. In Turmeric this is handled with the Data Binding Framework and its corresponding pipeline handlers; with Axis2 it is a message request configuration entry. Regardless of how it is done, it is important to be able to leverage existing services while providing the data in the format that your consumers require.

For those that are interested, I've created a sample STAR web service that can be used with Turmeric SOA to see how this works. Code is available at github.

While Turmeric handles the basics of the SOAP protocol, advanced items like using and storing information in the soap:header are not as easily supported. You can get to the information, but because WSDL in Turmeric services is there only to describe the data formats and messages, the underlying SOAP transport is not necessarily leveraged to the full specification. Depending on your requirements, Axis2 may be better, but Turmeric SOA provides additional items like service metrics, a monitoring console, policy administration, rate limiting through XACML 2.0 based policies, and much more. If you already have existing web services written the W3C way, but need to provide data in other formats, Turmeric can be used alongside them. It isn't a one-or-the-other proposition.
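To make the "same service, different wire format" idea concrete, here is a minimal, self-contained sketch of what a data-binding layer does: render one payload as either XML or JSON depending on a requested format flag. This is illustrative only; the class and method names are invented, and real frameworks like Turmeric's Data Binding Framework or Axis2's message builders do this generically from the service's data model.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: the same payload rendered in two formats, chosen by a flag,
// the way a data-binding pipeline picks a serializer from a request header.
public class FormatNegotiationSketch {

    static String render(Map<String, String> payload, String format) {
        StringBuilder sb = new StringBuilder();
        if ("JSON".equalsIgnoreCase(format)) {
            sb.append("{");
            boolean first = true;
            for (Map.Entry<String, String> e : payload.entrySet()) {
                if (!first) sb.append(",");
                sb.append("\"").append(e.getKey()).append("\":\"")
                  .append(e.getValue()).append("\"");
                first = false;
            }
            sb.append("}");
        } else { // default to XML
            sb.append("<payload>");
            for (Map.Entry<String, String> e : payload.entrySet()) {
                sb.append("<").append(e.getKey()).append(">")
                  .append(e.getValue())
                  .append("</").append(e.getKey()).append(">");
            }
            sb.append("</payload>");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> payload = new LinkedHashMap<>();
        payload.put("content", "1");
        System.out.println(render(payload, "XML"));
        System.out.println(render(payload, "JSON"));
    }
}
```

The business logic builds the payload once; only the rendering step varies, which is exactly why no service implementation code has to change when a new format is requested.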
Leverage the tools that provide you the greatest flexibility to deliver the data to your consumers with the least amount of effort.

Reference: Enterprise SOAP Services in a Web 2.0 REST/JSON World from our JCG partner David Carver at the Intellectual Cramps blog.

Using EasyMock or Mockito

I have been using EasyMock for most of my time, but recently I worked with a few people who were pretty much inclined to use Mockito. Not intending to use two frameworks for the same purpose in the same project, I adopted Mockito. So for the last couple of months I have been using Mockito, and here is my comparative analysis of the two. The people with whom I have worked cite test readability as the reason for using Mockito, but I have a different opinion on the matter. Suppose we have the following code that we intend to test:

public class MyApp {
    MyService service;
    OtherService otherService;

    void operationOne() {
        service.operationOne();
    }

    void operationTwo(String args) {
        String operationTwo = otherService.operationTwo(args);
        otherService.operationThree(operationTwo);
    }

    void operationThree() {
        service.operationOne();
        otherService.operationThree("success");
    }
}

class MyService {
    void operationOne() {}
}

class OtherService {
    public String operationTwo(String args) {
        return args;
    }

    public void operationThree(String operationTwo) {}
}

Now let me write a simple test case for this class, first using EasyMock and then using Mockito.
public class MyAppEasyMockTest {
    MyApp app;
    MyService service;
    OtherService otherService;

    @Before
    public void initialize() {
        service = EasyMock.createMock(MyService.class);
        otherService = EasyMock.createMock(OtherService.class);
        app = new MyApp();
        app.service = service;
        app.otherService = otherService;
    }

    @Test
    public void verifySimpleCall() {
        service.operationOne();
        EasyMock.replay(service);
        app.operationOne();
        EasyMock.verify(service);
    }
}

public class MyAppMockitoTest {
    MyApp app;
    MyService service;
    OtherService otherService;

    @Before
    public void initialize() {
        service = Mockito.mock(MyService.class);
        otherService = Mockito.mock(OtherService.class);
        app = new MyApp();
        app.service = service;
        app.otherService = otherService;
    }

    @Test
    public void verifySimpleCall() {
        app.operationOne();
        Mockito.verify(service).operationOne();
    }
}

This is a really simple test, and I must say the Mockito one is more readable. But according to the classic testing methodology, the Mockito test is not complete. We have verified the call that we are looking for, but if tomorrow I change the source code by adding one more call to service, the test will not break:

void operationOne() {
    service.operationOne();
    service.someOtherOp();
}

Now this makes me feel that the tests are not good enough. But thankfully Mockito gives us verifyNoMoreInteractions, which can be used to complete the test. Now let me write a few more tests for the MyApp class.
public class MyAppEasyMockTest {
    @Test
    public void verifyMultipleCalls() {
        String args = "one";
        EasyMock.expect(otherService.operationTwo(args)).andReturn(args);
        otherService.operationThree(args);
        EasyMock.replay(otherService);
        app.operationTwo(args);
        EasyMock.verify(otherService);
    }

    @Test(expected = RuntimeException.class)
    public void verifyException() {
        service.operationOne();
        EasyMock.expectLastCall().andThrow(new RuntimeException());
        EasyMock.replay(service);
        app.operationOne();
    }

    @Test
    public void captureArguments() {
        Capture<String> captured = new Capture<String>();
        service.operationOne();
        otherService.operationThree(EasyMock.capture(captured));
        EasyMock.replay(service, otherService);
        app.operationThree();
        EasyMock.verify(service, otherService);
        assertTrue(captured.getValue().contains("success"));
    }
}

public class MyAppMockitoTest {
    @Test
    public void verifyMultipleCalls() {
        String args = "one";
        Mockito.when(otherService.operationTwo(args)).thenReturn(args);
        app.operationTwo(args);
        Mockito.verify(otherService).operationTwo(args);
        Mockito.verify(otherService).operationThree(args);
        Mockito.verifyNoMoreInteractions(otherService);
        Mockito.verifyZeroInteractions(service);
    }

    @Test(expected = RuntimeException.class)
    public void verifyException() {
        Mockito.doThrow(new RuntimeException()).when(service).operationOne();
        app.operationOne();
    }

    @Test
    public void captureArguments() {
        app.operationThree();
        ArgumentCaptor<String> capturedArgs = ArgumentCaptor.forClass(String.class);
        Mockito.verify(service).operationOne();
        Mockito.verify(otherService).operationThree(capturedArgs.capture());
        assertTrue(capturedArgs.getValue().contains("success"));
        Mockito.verifyNoMoreInteractions(service, otherService);
    }
}

These are some practical testing scenarios where we would like to assert arguments, exceptions, etc.
If I look at and compare the tests written using EasyMock with the ones using Mockito, I tend to feel that both are equal in readability; neither does a better job. The large number of expect-and-return calls in EasyMock makes those tests less readable, while the verify statements of Mockito often compromise test readability. As per the Mockito documentation, verifyZeroInteractions and verifyNoMoreInteractions should not be used in every test that you write, but if I leave them out of my tests, then my tests are not good enough.

Moreover, in tests everything should be under the control of the developer, i.e. how the interactions are happening and what interactions are happening. In EasyMock this aspect is more visible, as the developer must put down all of these interactions in his code; in Mockito, the framework takes care of all interactions, and the developer is concerned only with their verification (if any). But this can lead to testing scenarios where the developer is not in control of all interactions.

There are a few nice things that Mockito has, like the JUnit runner that can be used to create mocks of all the required dependencies. It is a nice way of removing some of the infrastructure code, and EasyMock should have one too.

@RunWith(MockitoJUnitRunner.class)
public class MyAppMockitoTest {
    MyApp app;
    @Mock MyService service;
    @Mock OtherService otherService;

    @Before
    public void initialize() {
        app = new MyApp();
        app.service = service;
        app.otherService = otherService;
    }
}

Conclusion: Having used both frameworks, I feel that except for simple test cases, both EasyMock and Mockito lead to test cases that are equal in readability. But EasyMock is better for unit testing, as it forces the developer to take control of things. Mockito, due to its assumptions and considerations, hides this control under the carpet, and thus is not as good a choice. But Mockito offers certain things that are quite useful (e.g. the JUnit runner, call chaining), and EasyMock should add them in its next release.

Reference: Using EasyMock or Mockito from our JCG partner Rahul Sharma at The road so far… blog.

ADF : Dynamic View Object

Today I want to write about dynamic view objects, which allow changing the data source (SQL query) and attributes at run time. I will use the oracle.jbo.ApplicationModule::createViewObjectFromQueryStmt method for this, and will present how to do it step by step.

Create View Object and Application Module
1- Right click on the Model project and choose New.
2- Choose "ADF Business Component" from the left pane, then choose "View Object" from the list and click the "OK" button.
3- Enter "DynamicVO" in "Name", choose the "Sql Query" radio button, and click the "Next" button.
4- Write "select * from dual" in the Select field and click the "Next" button until you reach the "Step 8 of 9" window.
5- Check the "Add to Application Module" check box and click the "Finish" button.

Implement Changes in Application Module
1- Open the application module "AppModule", then open the Java tab and check the "Generate Application Module Class AppModuleImpl" check box.
2- Open the class and add the below method for the dynamic view object:

public void changeDynamicVoQuery(String sqlStatement) {
    ViewObject dynamicVO = this.findViewObject("DynamicVO1");
    dynamicVO.remove();
    dynamicVO = this.createViewObjectFromQueryStmt("DynamicVO1", sqlStatement);
    dynamicVO.executeQuery();
}

3- Open "AppModule", then open the Java tab and add the changeDynamicVoQuery method to the Client Interface.

Test Business Component
1- Right click on AppModule in the Application Navigator and choose Run from the drop down list.
2- Right click on AppModule in the left pane and choose Show from the drop down list. Write "Select * from Emp" in the sqlStatement parameter and click the Execute button; the result will be Success.
3- Double click on DynamicVO1 in the left pane. It will display the data of the dynamic view object: the data returned by "Select * from Emp", which I entered above, not by "Select * from dual", which was used at design time of the view object.

To use dynamic view objects in ADF Faces, you should use ADF Dynamic Table or ADF Dynamic Form.
You can download the sample application from here.

Reference: ADF : Dynamic View Object from our JCG partner Mahmoud A. ElSayed at the Dive in Oracle blog.

Bug Fixing – to Estimate, or not to Estimate, that is the question

According to Steve McConnell in Code Complete (data from 1975-1992), most bugs don't take long to fix. About 85% of errors can be fixed in less than a few hours. Some more can be fixed in a few hours to a few days. But the rest take longer, sometimes much longer, as I talked about in an earlier post. Given all of these factors and uncertainty, how do you estimate a bug fix? Or should you bother?

Block out some time for bug fixing

Some teams don't estimate bug fixes upfront. Instead they allocate a block of time, some kind of buffer for bug fixing, as a regular part of the team's work, especially if they are working in time boxes. Developers come back with an estimate only if it looks like the fix will require a substantial change, after they've dug into the code and found out that the fix isn't going to be easy, that it may require a redesign or changes to complex or critical code that needs careful review and testing.

Use a rule-of-thumb placeholder for each bug fix

Another approach is to use a rough rule of thumb, a standard placeholder for every bug fix. Estimate ½ day of development work for each bug, for example. According to this post on Stack Overflow, the ½ day suggestion comes from Jeff Sutherland, one of the inventors of Scrum. This placeholder should work for most bugs. If it takes a developer more than ½ day to come up with a fix, then they probably need help, and people need to know anyway. Pick a placeholder and use it for a while. If it seems too small or too big, change it. Iterate. You will always have bugs to fix. You might get better at fixing them over time, or they might get harder to find and fix once you've got past the obvious ones. Or you could use the data earlier from Capers Jones on how long it takes to fix a bug by the type of bug. A day or half day works well on average, especially since most bugs are coding bugs (on average 3 hours) or data bugs (6.5 hours).
Even design bugs on average take only a little more than a day to resolve.

Collect some data - and use it

Steve McConnell, in Software Estimation: Demystifying the Black Art, says that it's always better to use data than to guess. He suggests collecting time data for as little as a few weeks, or maybe a couple of months, on how long on average it takes to fix a bug, and using this as a guide for estimating bug fixes going forward. If you have enough defect data, you can be smarter about how to use it. If you are tracking bugs in a bug database like Jira, and if programmers are tracking how much time they spend on fixing each bug for billing or time accounting purposes (which you can also do in Jira), then you can mine the bug database for similar bugs and see how long they took to fix, and maybe get some ideas on how to fix the bug that you are working on by reviewing what other people did before you. You can group different bugs into buckets (by size - small, medium, large, x-large - or by type) and then come up with an average estimate, and maybe even a best case, worst case and most likely case for each type.

Use benchmarks

For a maintenance team (a sustaining engineering or break/fix team responsible for software repairs only), you could use industry productivity benchmarks to project how many bugs your team can handle. Capers Jones in Estimating Software Costs says that the average programmer (in the US, in 2009) can fix 8-10 bugs per month (of course, if you're an above-average programmer working in Canada in 2012, you'll have to set these numbers much higher). Inexperienced programmers can be expected to fix 6 a month, while experienced developers using good tools can fix up to 20 per month. If you're focusing on fixing security vulnerabilities reported by a pen tester or a scan, check out the remediation statistical data that Denim Group has started to collect, to get an idea of how long it might take to fix a SQL injection bug or an XSS vulnerability.
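The bucket-and-average idea above is trivial to mechanize. A minimal sketch, with made-up bucket names and hypothetical fix times standing in for data mined from a tracker like Jira:

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical historical data: hours spent fixing past bugs, keyed by size bucket.
// A real version would pull these numbers from work logs in the bug database.
public class BugFixEstimator {

    static double averageFixHours(List<Double> fixTimes) {
        // Plain average of historical fix times for one bucket.
        double total = 0;
        for (double t : fixTimes) total += t;
        return total / fixTimes.size();
    }

    public static void main(String[] args) {
        Map<String, List<Double>> history = new LinkedHashMap<>();
        history.put("small", Arrays.asList(1.0, 2.0, 3.0));
        history.put("medium", Arrays.asList(4.0, 6.0, 8.0));
        history.put("large", Arrays.asList(16.0, 24.0));

        for (Map.Entry<String, List<Double>> bucket : history.entrySet()) {
            System.out.printf("%s: %.1f hours on average%n",
                    bucket.getKey(), averageFixHours(bucket.getValue()));
        }
    }
}
```

The per-bucket average becomes the placeholder estimate for the next bug of that size; tracking the min and max per bucket would give the best-case/worst-case spread as well.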
So, do you estimate bug fixes, or not?

Because you can't estimate how long it will take to fix a bug until you've figured out what's wrong, and most of the work in fixing a bug involves figuring out what's wrong, it doesn't make sense to try to do an in-depth estimate of how long it will take to fix each bug as it comes up. Using simple historical data, a benchmark, or even a rough-guess placeholder as a rule of thumb all seem to work just as well. Whatever you do, do it in the simplest and most efficient way possible, don't waste time trying to get it perfect, and realize that you won't always be able to depend on it. Remember the 10x rule: some outlier bugs can take up to 10x as long to find and fix as an average bug. And some bugs can't be found or fixed at all, or at least not with the information that you have today. When you're wrong (and sometimes you're going to be wrong), you can be really wrong, and even careful estimating isn't going to help. So stick with a simple, efficient approach, and be prepared when you hit a hard problem, because it's gonna happen.

Reference: Bug Fixing - to Estimate, or not to Estimate, that is the question from our JCG partner Jim Bird at the Building Real Software blog.

Mahout and Scalding for poker collusion detection

When I was reading a very bright book on Mahout, Mahout In Action (which is a great hands-on intro to machine learning as well), one of the examples caught my attention. The authors of the book were using the well-known K-means clustering algorithm for finding similar players on StackOverflow, where the criterion of similarity was the set of authors of the questions/answers the users were up-/downvoting.

In very simple words, the K-means algorithm iteratively finds clusters of points/vectors located close to each other in a multidimensional space. Applied to the problem of finding similar players on StackOverflow, we assume that every axis in the multidimensional space is a user, where the distance from zero is the sum of points awarded to the questions/answers given by other players (those dimensions are also often called "features", and the distance a "feature weight").

Obviously, the same approach can be applied to one of the most sophisticated problems in massively-multiplayer online poker: collusion detection. We make a simple assumption that if two or more players have played too many games with each other (taking into account that any of the players could simply have been an active player who played a lot of games with anyone), they might be in collusion. We break a massive set of players into small, tight clusters (preferably with 2-8 players in each) using the K-means clustering algorithm. In the basic implementation that we will go through below, every user is represented as a vector, where the axes are the other players that she has played with (and the weight of the feature is the number of games played together).

Stage 1.
Building a dictionary

As the first step, we need to build a dictionary/enumeration of all the players involved in the subset of hand history that we analyze:

// extract user ID from hand history record
val userId = (playerHistory: PlayerHandHistory) =>
  new Text(playerHistory.getUserId.toString)

// Builds a basic dictionary (enumeration, in fact) of all the players that participated
// in the selected subset of hand history records
class Builder(args: Args) extends Job(args) {

  // input tap is an HTable with hand history entries:
  // hand history id -> hand history record, serialized with ProtoBuf
  val input = new HBaseSource("hand", args("hbasehost"), 'handId, Array("d"), Array('blob))
  // output tap - plain text file with player IDs
  val output = TextLine(args("output"))

  input
    .read
    .flatMap('blob -> 'player) {
      // every hand history record contains the list of players that participated in the hand
      blob: Array[Byte] =>
        // at the first stage, we simply extract the list of IDs and add it to the flat list
        HandHistory.parseFrom(blob)
    }
    .unique('player)  // remove duplicate user IDs
    .project('player) // leave only the 'player column from the tuple
    .write(output)
}

1003
1004
1005
1006
1007
...

Stage 2. Adding indices to the dictionary

Secondly, we map user IDs to the position/index of each player in the vector.

class Indexer(args: Args) extends Job(args) {

  val output = WritableSequenceFile(args("output"), classOf[Text], classOf[IntWritable], 'userId -> 'idx)

  TextLine(args("input")).read
    .map(('offset -> 'line) -> ('userId -> 'idx)) {
      // dictionary lines are read with indices from the TextLine source out of the box;
      // for some reason, in my case, indices were multiplied by 5, so I had to divide them
      tuple: (Int, String) =>
        (new Text(tuple._2.toString) -> new IntWritable((tuple._1 / 5)))
    }
    .project(('userId -> 'idx)) // only the userId -> index tuple is passed to the output
    .write(output)
}

1003 0
1004 1
1005 2
1006 3
1007 4
...

Stage 3.
Building vectors

We build the vectors that will be passed as input to the K-means clustering algorithm. As noted above, every position in the vector corresponds to another player the player has played with:

/**
 * The K-means clustering algorithm requires the input to be represented as vectors.
 * In our case, the vector itself represents the player, where the other users the player
 * has played with are the vector axes/features (the weight of the feature is the number
 * of games played together).
 * User: remeniuk
 */
class VectorBuilder(args: Args) extends Job(args) {

  import Dictionary._

  // initializes the dictionary pipe
  val dictionary = TextLine(args("dictionary"))
    .read
    .map(('offset -> 'line) -> ('userId -> 'dictionaryIdx)) {
      tuple: (Int, String) =>
        (tuple._2 -> tuple._1 / 5)
    }
    .project(('userId -> 'dictionaryIdx))

  val input = new HBaseSource("hand", args("hbasehost"), 'handId, Array("d"), Array('blob))
  val output = WritableSequenceFile(args("output"), classOf[Text], classOf[VectorWritable], 'player1Id -> 'vector)

  input
    .read
    .flatMap('blob -> ('player1Id -> 'player2Id)) {
      // builds a flat list of pairs of users that played together
      blob: Array[Byte] =>
        val playerList = HandsHistoryCoreInternalDomain.HandHistory.parseFrom(blob)
        playerList.flatMap { playerId =>
          playerList.filterNot(_ == playerId).map(otherPlayerId => (playerId -> otherPlayerId.toString))
        }
    }
    // joins the list of pairs of users that played together with the dictionary,
    // so that the second member of the tuple (the ID of the second player)
    // is replaced with the index in the dictionary
    .joinWithSmaller('player2Id -> 'userId, dictionary)
    .groupBy('player1Id -> 'dictionaryIdx) {
      // groups pairs of players that played together, counting the number of hands
      group => group.size
    }
    .map(('player1Id, 'dictionaryIdx, 'size) -> ('playerId, 'partialVector)) {
      tuple: (String, Int, Int) =>
        // turns a tuple of two users into a vector with one feature
        val partialVector = new NamedVector(
          new SequentialAccessSparseVector(args("dictionarySize").toInt), tuple._1)
        partialVector.set(tuple._2, tuple._3)
        (new Text(tuple._1), new VectorWritable(partialVector))
    }
    .groupBy('player1Id) {
      // combines partial vectors into one vector that represents the number of hands
      // played with other players
      group => group.reduce('partialVector -> 'vector) {
        (left: VectorWritable, right: VectorWritable) =>
          new VectorWritable(left.get().plus(right.get()))
      }
    }
    .write(output)
}

1003 {3:5.0,5:4.0,6:4.0,9:4.0}
1004 {2:4.0,4:4.0,8:4.0,37:4.0}
1005 {1:4.0,4:5.0,8:4.0,37:4.0}
1006 {0:5.0,5:4.0,6:4.0,9:4.0}
1007 {1:4.0,2:5.0,8:4.0,37:4.0}
...

The entire workflow required to vectorize the input:

val conf = new Configuration
conf.set("io.serializations", "," + "")

// the path where the vectors will be stored
val vectorsPath = new Path("job/vectors")
// enumeration of all users involved in a selected subset of hand history records
val dictionaryPath = new Path("job/dictionary")
// text file with the dictionary size
val dictionarySizePath = new Path("job/dictionary-size")
// indexed dictionary (every user ID in the dictionary is mapped to an index, from 0)
val indexedDictionaryPath = new Path("job/indexed-dictionary")

println("Building dictionary...")
// extracts the IDs of all the users participating in the selected subset of hand history records
Tool.main(Array(classOf[Dictionary.Builder].getName, "--hdfs",
  "--hbasehost", "localhost", "--output", dictionaryPath.toString))
// adds indices to the dictionary
Tool.main(Array(classOf[Dictionary.Indexer].getName, "--hdfs",
  "--input", dictionaryPath.toString, "--output", indexedDictionaryPath.toString))
// calculates the dictionary size and stores it to the FS
Tool.main(Array(classOf[Dictionary.Size].getName, "--hdfs",
  "--input", dictionaryPath.toString, "--output", dictionarySizePath.toString))

// reads the dictionary size
val fs = FileSystem.get(dictionaryPath.toUri, conf)
val dictionarySize = new BufferedReader(
  new InputStreamReader( Path(dictionarySizePath, "part-00000"))))
  .readLine().toInt

println("Vectorizing...")
// builds vectors (player -> other players in the game);
// the IDs of the other players (in the vectors) are replaced with indices taken from the dictionary
Tool.main(Array(classOf[VectorBuilder].getName, "--hdfs",
  "--dictionary", dictionaryPath.toString, "--hbasehost", "localhost",
  "--output", vectorsPath.toString, "--dictionarySize", dictionarySize.toString))

Stage 4. Generating n random clusters

Randomly selected clusters/centroids are the entry point for the K-means algorithm:

// randomly selected clusters that will be passed as input to K-means
val inputClustersPath = new Path("job/input-clusters")
val distanceMeasure = new EuclideanDistanceMeasure

println("Making random seeds...")
// builds 30 initial random centroids
RandomSeedGenerator.buildRandom(conf, vectorsPath, inputClustersPath, 30, distanceMeasure)

Stage 5. Running the K-means algorithm

With every iteration, K-means finds better centroids and clusters. As a result, we have 30 clusters of players that played with each other most often:

// clusterization results
val outputClustersPath = new Path("job/output-clusters")
// textual dump of the clusterization results
val dumpPath = "job/dump"

println("Running K-means...")
// runs the K-means algorithm with up to 20 iterations to find clusters of colluding players
// (the assumption of collusion is made on the basis of the number of hands played together
// with other player[s]), vectorsPath, inputClustersPath, outputClustersPath,
  new CosineDistanceMeasure(), 0.01, 20, true, 0, false)

println("Printing results...")

// dumps the clusters to a text file
val clusterizationResult = finalClusterPath(conf, outputClustersPath, 20)
val clusteredPoints = new Path(outputClustersPath, "clusteredPoints")
val clusterDumper = new ClusterDumper(clusterizationResult, clusteredPoints)
clusterDumper.setNumTopFeatures(10)
clusterDumper.setOutputFile(dumpPath)
clusterDumper.setTermDictionary(new Path(indexedDictionaryPath, "part-00000").toString, "sequencefile")
clusterDumper.printClusters(null)

Results

Let's go to "job/dump" now; this file contains the textual dumps of all the clusters generated by K-means. Here's a small fragment of the file:

VL-0{n=5 c=[1003:3.400, 1006:3.400, 1008:3.200, 1009:3.200, 1012:3.200] r=[1003:1.744, 1006:1.744, 1008:1.600, 1009:1.600, 1012:1.600]}
  Top Terms:
    1006 => 3.4
    1003 => 3.4
    1012 => 3.2
    1009 => 3.2
    1008 => 3.2
VL-15{n=1 c=[1016:4.000, 1019:3.000, 1020:3.000, 1021:3.000, 1022:3.000, 1023:3.000, 1024:3.000, 1025:3.000] r=[]}
  Top Terms:
    1016 => 4.0
    1025 => 3.0
    1024 => 3.0
    1023 => 3.0
    1022 => 3.0
    1021 => 3.0
    1020 => 3.0
    1019 => 3.0

As we can see, 2 clusters of players have been detected: one with 8 players that have played a lot of games with each other, and a second with 4 players.

Reference: Poker collusion detection with Mahout and Scalding from our JCG partner Vasil Remeniuk at the Vasil Remeniuk blog.
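For readers new to K-means, the assignment step at the heart of the pipeline above can be sketched in a few lines of plain Java. This is only a toy stand-in for what Mahout does at scale, using Euclidean distance and invented sample numbers; the real job runs many such assignment-and-recompute iterations over MapReduce.

```java
// Toy illustration of one K-means assignment step: each point (a player's
// games-played-together vector) is assigned to the nearest centroid.
// K-means repeats this, recomputing centroids, until the clusters converge.
public class KMeansAssignStep {

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // returns the index of the centroid closest to the point
    static int assign(double[] point, double[][] centroids) {
        int best = 0;
        for (int i = 1; i < centroids.length; i++) {
            if (distance(point, centroids[i]) < distance(point, centroids[best])) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // two centroids in a three-"player" feature space (made-up values)
        double[][] centroids = { {5.0, 0.0, 0.0}, {0.0, 4.0, 4.0} };
        // a player who mostly played with players 2 and 3
        double[] player = {1.0, 3.0, 5.0};
        System.out.println("assigned to cluster " + assign(player, centroids));
    }
}
```

Players whose vectors keep landing in the same small cluster are exactly the "played together suspiciously often" groups the dump above surfaces.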

Hibernate caches basics

Recently I have experimented with Hibernate caching. In this post I would like to share my experience and point out some of the details of the Hibernate second level cache. Along the way I will direct you to some articles that helped me implement the cache. Let's get started from the ground up.

Caching in Hibernate

Caching functionality is designed to reduce the amount of necessary database access. When objects are cached, they reside in memory. You have the flexibility to limit the usage of memory and store the items in disk storage. The implementation will depend on the underlying cache manager. There are various flavors of caching available, but it is better to cache non-transactional and read-only data. Hibernate provides 3 types of caching.

1. Session Cache

The session cache caches objects within the current session. It is enabled by default in Hibernate. Read more about the session cache. Objects in the session cache reside in the same memory location.

2. Second Level Cache

The second level cache is responsible for caching objects across sessions. When this is turned on, objects will first be searched for in the cache, and if they are not found, a database query will be fired. Read here on how to implement the second level cache. The second level cache will be used when objects are loaded using their primary key; this includes the fetching of associations. In the case of the second level cache the objects are constructed, and hence all of them will reside in different memory locations.

3. Query Cache

The query cache is used to cache the results of a query. Read here on how to implement the query cache. When the query cache is turned on, the results of the query are stored against the combination of query and parameters. Every time the query is fired, the cache manager checks for that combination of parameters and query. If the results are found in the cache, they are returned; otherwise a database transaction is initiated.
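As a concrete sketch of how the second-level and query caches are typically switched on: the property names below are Hibernate's standard configuration keys, but the exact region factory class name varies by Hibernate/Ehcache version, so treat that line as an assumption to verify against your versions.

```properties
# enable the second level cache (explicit is safer than relying on defaults)
hibernate.cache.use_second_level_cache=true
# the query cache is separate and off by default
hibernate.cache.use_query_cache=true
# cache provider / region factory - the class name differs across Hibernate versions
hibernate.cache.region.factory_class=net.sf.ehcache.hibernate.EhCacheRegionFactory
```

An entity then opts in with @Cache(usage = CacheConcurrencyStrategy.READ_ONLY) (or another concurrency strategy), and an individual query opts in with query.setCacheable(true); neither cache is used by a class or query that does not opt in.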
As you can see, it is not a good idea to cache a query if it has a large number of parameters, or if a single parameter can take a large number of values: for each such combination the results are stored in memory, which can lead to extensive memory usage.

Finally, here is a list of good articles written on this topic:
1. Speed Up Your Hibernate Applications with Second-Level Caching
2. Hibernate: Truly Understanding the Second-Level and Query Caches
3. EhCache Integration with Spring and Hibernate. Step by Step Tutorial
4. Configuring Ehcache with hibernate

Reference: All about Hibernate Second Level Cache from our JCG partner Manu PK at The Object Oriented Life blog.

Java threads: How many should I create?

Introduction

“How many threads should I create?” Many years ago one of my friends asked me this question, and I gave him the answer that follows the old guideline: “number of CPU cores + 1”. Most of you will be nodding as you read this. Unfortunately, all of us were wrong at that point. Right now I would answer it this way: if your architecture is based on a shared-resource model, then your thread count should be “number of CPU cores + 1” for better throughput; but if your architecture is a shared-nothing model (like SEDA or actors), then you can create as many threads as you need.

Walk Through

So here comes the question: why did so many of our elders continually give us the guideline “number of CPU cores + 1”? Because they told us that context switching between threads was heavy and would block your system's scalability. But nobody noticed the programming or architecture model they were working under. If you read carefully, you will find that most of them were describing programming or architecture models based on a shared-resource model. A couple of examples:

1. Socket programming: the socket layer is shared by many requests, so you need a context switch between every request.
2. Information provider systems: most customers will continually access the same resource.

In these cases, multiple requests access the same resource, so the system must add a lock to that resource to satisfy its consistency requirements. Lock contention then comes into play, and the context switching between many threads becomes very heavy. After I noticed this interesting thing, I considered whether other programming or architecture models could work around that limitation. If the shared-resource model fails when creating more Java threads, maybe we can try a shared-nothing model. Fortunately, I got a chance to build a system that needed large scalability: the system needed to send out lots of notifications very quickly.
So I decided to go ahead with the SEDA model as a trial, leveraging my multiple-lane commonj pattern. Currently I can run the Java application with a maximum of around 600 threads on one machine, with a Java heap setting of 1.5 gigabytes. The average memory consumption of one Java thread is around 512 kilobytes, so 600 threads need roughly 300M of memory (including Java native and Java heap). And if your system design is good, that 300M usage will not actually be a burden. By the way, on Windows you can't create more than about 1000 threads, since Windows can't handle that many threads very well, but you can create 1000 threads on Linux if you leverage NPTL. So the claim that Java can't handle large numbers of concurrent jobs was never 100% true.

Someone may ask about the thread lifecycle transitions themselves: ready - runnable - running - waiting. I would say that Java and the latest OSes already handle them surprisingly efficiently, and if you have a multi-core CPU and turn on NUMA, overall performance will be enhanced even further. So this is not your bottleneck, at least in the very beginning phase. Of course, creating a thread and bringing it to the running stage are very heavy operations, so please leverage a thread pool (JDK: executors).

Conclusion

In the future, how will you answer the question “How many Java threads should I create?” I hope your answer will change to:

1. If your architecture is based on a shared-resource model, then your thread count should be “number of CPU cores + 1” for better throughput.
2. If your architecture is a shared-nothing model (like SEDA or actors), then you can create as many threads as you need.

Reference: How many java threads should I create? from our JCG partner Andy Song at the song andy's Stuff blog.
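The thread-pool advice above (JDK executors) looks like this in its minimal form; the pool size and the no-op "notification" task are illustrative only:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Reuses a fixed pool of threads instead of paying the thread-creation
// cost once per task - the point of the "leverage executors" advice.
public class NotificationPool {

    public static int sendAll(int notifications, int poolSize) {
        AtomicInteger sent = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(poolSize);
        for (int i = 0; i < notifications; i++) {
            pool.execute(() -> sent.incrementAndGet()); // stand-in for real notification work
        }
        pool.shutdown(); // stop accepting new work, let queued tasks finish
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return sent.get();
    }

    public static void main(String[] args) {
        System.out.println(sendAll(1000, 8) + " notifications sent");
    }
}
```

With a SEDA-style design, each stage would own a pool like this and hand work to the next stage through a queue, so no two stages contend on a shared resource.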
Java Code Geeks and all content copyright © 2010-2015, Exelixis Media Ltd | Terms of Use | Privacy Policy | Contact
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.