What's New Here?


GC overhead limit exceeded – Java Heap analysis

This post is the continuation of our original GC overhead limit exceeded problem patterns post. Proper Java Heap analysis is critical in order to eliminate your OutOfMemoryError: GC overhead problem. If you are not familiar with this Java HotSpot 1.6 error, I recommend that you first review my first article on this subject.

This article will provide you with a sample program and a tutorial on how to analyze your Java HotSpot Heap footprint using Memory Analyzer following an OutOfMemoryError. I highly recommend that you execute and analyze the Heap Dump yourself using this tutorial in order to better understand these principles.

Troubleshooting tools
** all these tools can be downloaded for free **
- Eclipse Indigo Release
- Memory Analyzer via IBM Support Assistant 4.1 (HotSpot Heap Dump analysis)
- Java VM: Windows HotSpot JRE 1.6.0_24 64-bit

Sample Java program
The simple sample Java program below will be used to trigger an OutOfMemoryError, allowing us to analyze the generated HotSpot Heap Dump file.
Simply create a new Java class, JVMOutOfMemoryErrorSimulator.java, in the Eclipse project of your choice and either rename or keep the current package as is. This program basically creates multiple String instances within a Map data structure until Java Heap depletion.

** please make sure your Eclipse compiler and JRE are 1.6 **

   package org.ph.javaee.javaheap;

   import java.util.Map;
   import java.util.HashMap;

   /**
    * JVMOutOfMemoryErrorSimulator
    *
    * @author PH
    */
   public class JVMOutOfMemoryErrorSimulator {

      private final static int NB_ITERATIONS = 500000;

      // ~1 KB data footprint
      private final static String LEAKING_DATA_PREFIX =
            "datadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadata"
          + "datadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadata"
          + "datadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadata"
          + "datadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadata"
          + "datadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadata"
          + "datadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadatadata";

      // Map used to store our leaking String instances
      private static Map<String, String> leakingMap;

      static {
         leakingMap = new HashMap<String, String>();
      }

      /**
       * @param args
       */
      public static void main(String[] args) {

         System.out.println("JVM OutOfMemoryError Simulator 1.0");
         System.out.println("Author: Pierre-Hugues Charbonneau");
         System.out.println("http://javaeesupportpatterns.blogspot.com/");

         try {
            for (int i = 0; i < NB_ITERATIONS; i++) {
               String data = LEAKING_DATA_PREFIX + i;

               // Add data to our leaking Map data structure...
               leakingMap.put(data, data);
            }
         } catch (Throwable any) {
            if (any instanceof java.lang.OutOfMemoryError) {
               System.out.println("OutOfMemoryError triggered! "
                     + any.getMessage() + " [" + any + "]");
            } else {
               System.out.println("Unexpected Exception! "
                     + any.getMessage() + " [" + any + "]");
            }
         }

         System.out.println("simulator done!");
      }
   }

Step #1 – Setup your JVM start-up arguments
First, set up your Eclipse Java runtime arguments as per below. For our example, we used an external JRE 1.6 outside the Eclipse IDE with a Java Heap maximum capacity of 512 MB. The key JVM argument allowing us to generate a Heap Dump is -XX:+HeapDumpOnOutOfMemoryError, which tells the JVM to generate a Heap Dump following an OutOfMemoryError condition.

Step #2 – Run the sample Java program
The next step is to run our Java program. Depending on your computer specs, this program will run for between 5 and 30 seconds before exiting with an OutOfMemoryError. As you can see, the JVM generated a Heap Dump file java_pid3880.hprof. It is now time to fire up the Memory Analyzer tool and analyze the JVM Heap Dump.

Step #3 – Load the Heap Dump
Analyzing a Heap Dump is an activity that can be simple or very complex. The goal of this tutorial is to give you the basics of Heap Dump analysis. For more advanced Heap Dump analysis, please refer to the other case studies of this Blog.

Step #4 – Analyze the Heap Dump
Below are the snapshots and analysis steps that you can follow to understand the memory leak that we simulated in our sample Java program. As you can see, the Heap Dump analysis using the Memory Analyzer tool was able to easily identify our primary leaking Java class and data structure.

Conclusion
I hope this simple Java program and Heap Dump analysis tutorial has helped you understand the basic principles of Java Heap analysis using the raw Heap Dump data. This analysis is critical when dealing with OutOfMemoryError: GC overhead problems, since those are symptoms of either a Java Heap leak or a Java Heap footprint / tuning problem.

Reference: GC overhead limit exceeded – Java Heap analysis from our JCG partner Pierre-Hugues Charbonneau at the Java EE Support Patterns & Java Tutorial blog....
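As a complement to the post-mortem Heap Dump analysis shown above, you can also watch Java Heap occupancy at runtime through the standard java.lang.management API. The sketch below is mine, not part of the original article (the class name HeapWatcher is made up); it simply prints the current heap usage, which is handy for spotting a growing leak like the one simulated by JVMOutOfMemoryErrorSimulator before the OutOfMemoryError fires:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;

// Minimal heap-usage probe: prints current heap occupancy so you can
// watch a leak grow before the OutOfMemoryError and Heap Dump occur.
public class HeapWatcher {

    public static void main(String[] args) {
        MemoryMXBean memoryBean = ManagementFactory.getMemoryMXBean();
        MemoryUsage heap = memoryBean.getHeapMemoryUsage();

        long usedMb = heap.getUsed() / (1024 * 1024);
        long maxMb = heap.getMax() / (1024 * 1024);

        // Prints something like "Heap used: 12 MB of 494 MB"
        System.out.println("Heap used: " + usedMb + " MB of " + maxMb + " MB");
    }
}
```

Calling code like this periodically (or exposing it via JMX) gives an early warning, but it does not replace Heap Dump analysis: only the dump tells you which objects retain the memory.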

MapReduce Questions and Answers Part 2

4 Inverted Indexing for Text Retrieval
The chapter contains a lot of details about integer number encoding and compression. Since these topics are not directly about MapReduce, I made no questions about them.

4.4 Inverted Indexing: Revised Implementation
Explain the inverted index retrieval algorithm. You may assume that each document fits into the memory. Assume also that there is a huge number of documents. Which part should be optimized by integer encoding and compression?

Input: text documents
  key: document id
  value: text document
Output: key/value pairs where
  key: word
  value: list[documentId, numberOfOccurrences]
  list elements must be sorted by numberOfOccurrences

The answer is: Mapper counts the number of occurrences of each word in the document. As the whole document fits into the memory, we can hold partial results in a map.

Intermediate key/values:
  key: word, numberOfOccurrences
  value: documentId

A custom partitioner groups intermediate key/values by word. A custom sort sorts them primarily by word and secondarily by the number of occurrences. The reducer uses the initialize method to initialize the list of all postings. The reduce method handles two cases:
- current word equal to previous word – add documentId and numberOfOccurrences to the posting list.
- current word not equal to previous word – emit the previous word and its posting list; initialize a new posting list.

The posting list in the reducer should be compressed.
class MAPPER
   method MAP(docid, doc d)
      H = new hash map
      for all term w in doc d do
         H(w) = H(w) + 1
      for all term w in H do
         emit(pair(w, H(w)), docid)

class REDUCER
   variable previous_word = 0
   variable PL = new list of postings

   method REDUCE(pair(w, #occurrences), docid)
      if w <> previous_word && previous_word <> 0 do
         emit(previous_word, PL)
         PL = new list of postings
      PL.add(pair(#occurrences, docid))
      previous_word = w

class PARTITIONING_COMPARATOR
   method compare(key (w1, o1), key (w2, o2))
      if w1 = w2
         return keys are equal
      return keys are different

class SORTING_COMPARATOR
   method compare(key (w1, o1), key (w2, o2))
      if w1 = w2 do
         return compare(o1, o2)
      return compare(w1, w2)

5 Graph Algorithms
The chapter contains two algorithms: shortest path in the graph and the page ranking algorithm. The questions are straightforward.

5.2 Parallel Breadth-First Search
Find the shortest path from one node, origin, to all other nodes. Each edge has a weight associated with it. Input key/value pairs have already been preprocessed into a convenient form.

Input: graph
  key: node id
  value: distance to origin, list[adjacent node, edge length]
Output: key/value pairs where
  key: node id
  value: distance to origin, list[adjacent node, edge length]

The answer is: The algorithm requires multiple iterations. It stops when an iteration does not change any 'distance to origin'. At worst, there will be O(n) iterations where n is the number of nodes in the graph.

Mapper passes the original graph to the next iteration as it is. In addition, it generates a key/value pair for each adjacent node. The value contains the minimum known distance from origin if the route would go through the node.

class MAPPER
   method MAP(node, pair(dist, adjacencylist))
      emit(node, pair(dist, adjacencylist))
      for all (closenode, nodedist) in adjacencylist do
         emit(closenode, pair(dist + nodedist, empty))

Reducer finds the minimum known distance to each node.
It passes the distance along with the original graph to the next iteration. It also increments a global counter whenever the minimum known distance to any node changes.

class REDUCER
   method REDUCE(node, list(dist, adjacencylist))
      minimum = infinity
      previous_iteration_solution = infinity
      original_graph = empty
      for all (dist, adjacencylist) in list do
         if adjacencylist not empty do
            original_graph = adjacencylist
            previous_iteration_solution = dist
         if minimum > dist
            minimum = dist
      if previous_iteration_solution <> minimum
         increment global counter
      emit(node, pair(minimum, original_graph))

If the global counter is 0, the algorithm stops. Otherwise another iteration is needed.

Explain the page rank algorithm, assume alpha = 0.

The answer is: Page rank P(n) of a page n is calculated from the page ranks of all pages linking to it.

P(n) = sum_m (P(m)/C(m))

The sum goes through all pages m linking to the page n. C(m) is the number of outgoing links of the page m.

The page rank algorithm runs in iterations. Mapper passes the page rank contribution of each page to adjacent pages. Reducer updates the page rank of each node. The algorithm stops when page ranks no longer move.

class MAPPER
   method MAP(page, (page_rank, adjacency_list))
      emit(page, (0, adjacency_list))
      contribution = page_rank/adjacency_list.length
      for all node in adjacency_list do
         emit(node, (contribution, empty))

class REDUCER
   method REDUCE(page, contributions[c1, c2, ..., cn])
      rank = 0
      adjacency_list = new list
      for all c in contributions do
         adjacency_list.addAll(c.adjacency_list)
         rank = rank + c.contribution
      emit(page, (rank, adjacency_list))

6 EM Algorithms For Text Processing
I made no questions out of this chapter.

Exercises
This chapter contains hands-on exercises for MapReduce. Some of them require multiple iterations.

Warm Up
Count the number of occurrences of every word in a text collection.

Input: key: document id, value: text document.
Output: key: word, value: number of occurrences.

The answer is:

Intermediate pairs:
  key: word
  value: integer – how many times was the word seen in the input.

class MAPPER
   method MAP(docid a, doc d)
      for all term w in doc d do
         emit(w, 1)

class COMBINER
   method COMBINE(word w, counts[c1, c2, ..., cn])
      s = 0
      for all c in counts[c1, c2, ..., cn] do
         s = s + c
      emit(word w, s)

class REDUCER
   method REDUCE(word w, counts[c1, c2, ..., cn])
      s = 0
      for all c in counts[c1, c2, ..., cn] do
         s = s + c
      emit(word w, s)

An alternative solution would use in-mapper combining.

Web Store
The website user log contains user ids and the length of each session. The website has a modest number of registered users. Compute the average session length for each user.

Input: key: user id, value: session length.
Output: key: user id, value: average session length.

The answer is: As the number of registered users is modest, we can use in-mapper combining.

class MAPPER
   variable total_time = new hash map
   variable sessions_number = new hash map

   method MAP(user_id, session_length)
      total_time(user_id) = total_time(user_id) + session_length
      sessions_number(user_id) = sessions_number(user_id) + 1

   method CLOSE
      for all user_id in total_time
         tt = total_time(user_id)
         sn = sessions_number(user_id)
         emit(user_id, pair(tt, sn))

class REDUCER
   method REDUCE(user_id, [pairs(time, sessions_number)])
      total_time = 0
      total_sessions = 0
      for all pairs in [pairs(time, sessions_number)] do
         total_time = total_time + time
         total_sessions = total_sessions + sessions_number
      emit(user_id, total_time/total_sessions)

The web store log contains a user id and the bought item for each sale. You need to implement "buyers of item also bought" functionality. Whenever an item is shown, the store will suggest the five items most often bought by the item's buyers.

Input: key: user id, value: bought item.
Output: key: item, value: list of the five most common "buyers of item also bought" items.

The answer is: Our solution has two iterations.
The first iteration generates lists of all items bought by the same user. Grouping is done by the framework; both mapper and reducer perform an identity function.

Input: key: user id, value: bought item.
Output: key: user id, value: list of all bought items.

class MAPPER
   method MAP(user_id, item)
      emit(user_id, item)

class REDUCER
   method REDUCE(user_id, items[i1, i2, ..., in])
      emit(user_id, items)

The second iteration solves the co-occurrences problem on the list items. It uses the stripes approach. The only difference against the standard solution is that we emit only the five most common co-occurrences.

Input: key: user id, value: list of all bought items.
Output: key: item, value: list of the five most common co-occurrences.

class MAPPER
   method MAP(user_id, items[i1, i2, ..., in])
      for all item in items do
         H = new hash map
         for all item j in items do
            H(j) = H(j) + 1
         emit(item, H)

class REDUCER
   method REDUCE(item, stripes[H1, H2, ..., Hn])
      T = new hash map
      for all H in stripes do
         for all (key/value) in H do
            T(key) = T(key) + value
      emit(item, max_five(T))

The web store log contains a user id, timestamp, item and number of bought pieces for each sale. The store is looking for items whose sales rise or decline at the same time. Find the 20 item couples with the maximum number of such months.

Input: key: user id, value: timestamp, bought item, count.
Output: key: item, item value: number of months when both items' sales rose or declined.
#: the output contains the 20 key/value pairs with maximum value

The answer is: Our solution requires multiple MapReduce iterations. We have to:
- calculate whether item sales for any given month went up or down,
- create lists of items with the same sales change during the same month,
- find the number of co-occurrences in those lists,
- choose the items with maximum co-occurrences.

The first iteration calculates sales changes for any given month. We have to supply a mapper, partitioner, custom sort and reducer. Mapper generates one intermediate key/value pair for each input key/value.
The key is composed of the sold item and the sales month. The value contains the number of sold pieces. Partitioner sends all key/value pairs with the same item to the same reducer. Custom sort sorts them by months. Finally, the reducer calculates the sales changes.

Input: key: user id, value: timestamp, item, count.
Intermediate key/values: key: item, month value: count.
Output: key: month, up/down/equal value: item.

class MAPPER
   method MAP(user_id, (timestamp, item, count))
      month = get_month(timestamp)
      emit((item, month), count)

class PARTITIONING_COMPARATOR
   method compare(key (item1, month1), key (item2, month2))
      if item1 = item2
         return keys are equal
      return keys are different

class SORTING_COMPARATOR
   method compare(key (item1, month1), key (item2, month2))
      if item1 = item2 do
         return compare(month1, month2)
      return compare(item1, item2)

class REDUCER
   method REDUCE((item, month), counts[c1, c2, ..., cn])
      c = sum([c1, c2, ..., cn])
      if last_item = item
         if last_month + 1 = month
            // emit correct up/down/equal flags
            if last_count < c
               emit((item, month), up)
            if last_count > c
               emit((item, month), down)
            if last_count = c
               emit((item, month), equal)
         else
            // no sales during some months
            emit((item, last_month + 1), down)
            emit((item, month), up)
      else
         // new item
         emit((last_item, last_month + 1), down)
         emit((item, month), up)
      last_item = item
      last_count = c
      last_month = month

The second iteration groups the first iteration's results by keys. It generates lists of items with the same sales change during the same month. The framework does all the work; both mapper and reducer perform an identity function.

Input: key: month, up/down/equal value: item.
Output: key: month, up/down/equal value: [items].

The third iteration performs the standard 'co-occurrences by pairs' algorithm.

Input: key: month, up/down/equal value: [items].
Intermediate key/values: key: item, item value: partial number of co-occurrences.
Output: key: item, item value: number of months when both items' sales rose or declined.
#: the output contains all item couples

class MAPPER
   method MAP((month, change), items[i1, i2, ..., in])
      for each i in items do
         for each j in items do
            if i != j
               emit((i, j), 1)

class COMBINER
   method COMBINE((item1, item2), co-occurrences[c1, c2, ..., cn])
      s = 0
      for all c in co-occurrences[c1, c2, ..., cn] do
         s = s + c
      emit((item1, item2), s)

class REDUCER
   method REDUCE((item1, item2), co-occurrences[c1, c2, ..., cn])
      s = 0
      for all c in co-occurrences[c1, c2, ..., cn] do
         s = s + c
      emit((item1, item2), s)

Finally, we have to choose the 20 key/value pairs with maximum value. Each mapper selects its 20 key/value pairs with maximum value and emits them with the same key. There will be only one reducer, which selects the final 20 key/value pairs.

Input: key: item, item value: number of months when both items' sales rose or declined.
#: the input contains all item couples
Intermediate key/values: key: 1 value: item, item, number of months when both items' sales rose or declined.
#: contains the 20 key/value pairs with maximum value for each mapper
Output: key: item, item value: number of months when both items' sales rose or declined.
#: the output contains the 20 key/value pairs with maximum value

The code is very simple but long.

Criminal Agency
Inputs to all exercises in this chapter use the same data structure. A criminal agency stole Facebook's friendship database and wants to analyze the new data. Friendships are stored in the form of key/value pairs; each friendship corresponds to two key/value pairs:

Friends:
  key: first friend name value: second friend name
  key: second friend name value: first friend name

The agency also owns criminal records of all citizens:

Criminal record:
  key: citizen name value: when, where, accomplices, description

Find at-risk youths. A person is considered an at-risk youth if more than half of his friends have a criminal record.

The answer is: Our solution has two iterations. The first iteration joins the two sets and flags each 'value friend' with a has_record/law_abiding flag.
Output: key: first friend value: second friend, has_record/law_abiding

The mapper flags each key with the data set name. The partitioner groups data according to the names in keys, and the sorter puts criminal records before friendships. We could use local aggregation to remove multiple criminal records for the same person.

class MAPPER
   method MAP(name, value)
      if value is a name do
         emit(pair(name, friendship), value)
      else
         emit(pair(name, criminal), value)

class PARTITIONING_COMPARATOR
   method compare(key (name1, dataset1), key (name2, dataset2))
      if name1 = name2
         return keys are equal
      return keys are different

class SORTING_COMPARATOR
   method compare(key (name1, dataset1), key (name2, dataset2))
      if name1 = name2 AND dataset1 is criminal
         return key1 is lower
      if name1 = name2 AND dataset2 is criminal
         return key2 is lower
      return compare(name1, name2)

class REDUCER
   variable previous_name

   method REDUCE(pair(name, flag), items[i1, i2, ..., in])
      if flag is criminal do
         previous_name = name
         has_record = criminal
         return
      if previous_name <> name do
         has_record = law_abiding
      else
         has_record = criminal
      previous_name = name
      for all i in items do
         emit(i, pair(name, has_record))

The second iteration counts both the total number of friends and the number of friends with a criminal record. The reducer emits key/value pairs only for at-risk youths. This iteration could also use some kind of local aggregation.

Intermediate key/value: key: name value: total friend criminals, total friends
# totals are relative only to subsets of the data
Output: key: name value: empty
# only at-risk youths

class MAPPER
   method MAP(name, pair(friend_name, has_record))
      if has_record is law_abiding do
         emit(name, pair(0, 1))
      else
         emit(name, pair(1, 1))

class REDUCER
   method REDUCE(name, items[pair(criminals, total)])
      total = 0
      criminals = 0
      for all i in items do
         total = total + i.total
         criminals = criminals + i.criminals
      if criminals / total > 0.5 do
         emit(name, empty)

Find gangs.
A gang is a group of people that:
- has exactly 5 members,
- each member is a friend of all other members,
- each two members committed at least 3 crimes together.

The answer is: Again, we need three iterations. The idea is to first clean up the graph of all useless edges, so that only criminal contacts remain. Then, we split the graph into smaller, manageable sub-graphs. We attach all criminal contacts and the edges between them to each person:

Last iteration reducer's input:
  key: person
  values: all his criminal contacts and the relationships between them.

The final reducer takes the smaller graph represented by the value in each key/value pair and finds complete sub-graphs with 4 vertices in it. Add the person from the key to it, and you have found a complete sub-graph with 5 vertices. The reducer may use any polynomial algorithm.

The first iteration uses the pairs approach to clean the graph. We omit both local aggregation and removal of duplicities. Both would make the algorithm more efficient.

Intermediate key/values: key: first friend, second friend, friendship/accomplice value: 1
Output: key: first friend, second friend value: empty
# only friends with a common criminal record

class MAPPER
   method MAP(name, value)
      if value is a name do
         emit(triple(name, value, friendship), empty)
      else
         for all crime_accomplice in value.accomplices do
            emit(triple(name, crime_accomplice, accomplice), 1)

class PARTITIONING_COMPARATOR
   method compare(key (name1, accomplice1, flag1), key (name2, accomplice2, flag2))
      if name1 = name2 AND accomplice1 = accomplice2
         return keys are equal
      return keys are different

class SORTING_COMPARATOR
   method compare(key (name1, accomplice1, flag1), key (name2, accomplice2, flag2))
      if name1 = name2 AND accomplice1 = accomplice2 AND flag1 is friendship
         return key1 is lower
      if name1 = name2 AND accomplice1 = accomplice2 AND flag2 is friendship
         return key2 is lower
      return compare(pair(name1, accomplice1), pair(name2, accomplice2))

class REDUCER
   variable previous_name
   variable previous_accomplice

   method sameAsPrevious(name, accomplice)
      if previous_name <> name
         return false
      if previous_accomplice <> accomplice
         return false
      return true

   method REDUCE(triple(name, accomplice, flag), items[i1, i2, ..., in])
      if sameAsPrevious(name, accomplice) do
         if items.length > 2 do
            emit(name, accomplice)
         return
      if flag is friendship do
         previous_name = name
         previous_accomplice = accomplice

The second iteration attaches lists of all 'second degree' friends to edges:

Input: key: first friend, second friend value: empty
Intermediate key/values:
  key: first friend value: first friend, second friend
  key: second friend value: first friend, second friend
Output:
  key: first friend, second friend value: all friends of second friend
  key: second friend, first friend value: all friends of first friend

class MAPPER
   method MAP((first friend, second friend), empty)
      emit(first friend, (first friend, second friend))
      emit(second friend, (first friend, second friend))

class REDUCER
   method REDUCE(name, edges[e1, e2, ..., en])
      friends = new Set
      friends.add(name)
      for all edge in edges do
         friends.add(edge.v1, edge.v2)
      for all edge in edges do
         emit(edge, friends)

Finally, the mapper and the shuffle and sort phase together generate lists of all friends of any given person and the relationships between them.

Input: key: friend 1, friend 2 value: all friends of friend 2
Intermediate key/values: key: friend 1 value: friend 2, all friends of friend 2
Reducer's input (after shuffle and sort): key: person values: all his friends and the relationships between them.
Output: key: first friend, second friend, third friend, fourth friend, fifth friend value: gang

class MAPPER
   method MAP((friend 1, friend 2), all friends of friend 2)
      emit(friend 1, (friend 2, all friends of friend 2))

class REDUCER
   method REDUCE(name, graph attached to it)
      any polynomial algorithm will work

Reference: MapReduce Questions and Answers from our JCG partner Maria Jurcovicova at the This is Stuff blog....

MapReduce Questions and Answers Part 1

With all the hype and buzz surrounding NoSQL, I decided to have a look at it. I quickly found that there is not one NoSQL I could learn. Rather, there are various different solutions with different purposes and trade-offs. Those various solutions tend to have one thing in common: processing of data in NoSQL storage is usually done using a MapReduce framework.

A search on MapReduce turned up various scattered blog posts, some university course pages and one book that seems to contain almost everything the other sources did. This post contains MapReduce questions and answers based on that book. Basically, if I were a student, this is what I would have made as test preparation notes. If I were a teacher, this is what I would ask on the exam. The first chapter gives credit where the credit is due, the rest contains questions. The last chapter contains hands-on coding exercises.

The Book
The book is named Data-Intensive Text Processing with MapReduce. If you are unsure whether you want to buy it or not, a pre-production manuscript is available for free. Do not be fooled by the title. The book is more about MapReduce itself and less about text processing. The first half of the book describes general purpose tricks (design patterns) useful for any task. The second half contains chapters on text processing, graph algorithms and expectation maximization. The book contains almost everything I found on various blogs, university course pages and much more.

Questions
Questions are split by book chapters. With one minor exception (counters), questions are mostly based on the first half of the book. The second half contains concrete algorithms and I wanted to focus on general purpose knowledge. That does not mean that learning them is not useful. Especially the Graph Algorithms chapter contains easily generalizable ideas.

2 MapReduce Basics
2.2 Mappers and Reducers
Describe the general MapReduce algorithm. Split it into phases.
For each phase include:
- who is responsible (framework/programmer/customizable),
- what it does,
- phase input,
- phase output.

The answer is: MapReduce has four phases:
- map,
- combine,
- shuffle and sort,
- reduce.

The map phase is done by mappers. Mappers run on unsorted input key/value pairs. The same physical nodes that keep the input data also run the mappers. Each mapper emits zero, one or multiple output key/value pairs for each input key/value pair. Output key/value pairs are called intermediate key/value pairs. Their type is usually different from the input key/value pair type. The mapper must be supplied by the programmer.

The combine phase is done by combiners. The combiner should combine key/value pairs with the same key together. Each combiner may run zero, one or multiple times. The framework decides whether and how many times to run the combiner; the programmer has no control over it. The combiner's output key/value pair type must be the same as its input key/value pair type.

The shuffle and sort phase is done by the framework. Data from all mappers are grouped by the key, split among reducers and sorted by the key. Each reducer obtains all values associated with the same key. The programmer may supply a custom compare function for sorting and a partitioner for the data split. All key/value pairs going to the same reducer are sorted by the key, but there is no global sorting.

The reduce phase is done by reducers. A reducer obtains key/[values list] pairs sorted by the key. The values list contains all values with the same key produced by mappers. Each reducer emits zero, one or multiple output key/value pairs for each input key/value pair. The output key/value pair type is usually different from the input key/value pair type. The reducer must be supplied by the programmer.

If the algorithm requires multiple MapReduce iterations, each mapper or reducer may increment a global counter. The driver program would read the counter after the reduce phase. It then decides whether the next iteration is needed or not.

Note: chapter 2 does not mention counters. They are explained later, in chapter 5.
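The four phases above can be illustrated with a tiny single-process word-count simulation in plain Java. This is my own sketch, not code from the book: a real framework runs mappers and reducers on different nodes, and the combine phase is omitted here because the framework may legally run the combiner zero times.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process sketch of the map -> shuffle and sort -> reduce pipeline.
public class WordCountSimulation {

    public static void main(String[] args) {
        List<String> documents = List.of("b a b", "a b");

        // Map phase: emit one (word, 1) intermediate pair per term, unsorted.
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String doc : documents) {
            for (String word : doc.split(" ")) {
                intermediate.add(Map.entry(word, 1));
            }
        }

        // Shuffle and sort phase: group all values by key, sorted by key.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }

        // Reduce phase: each reducer call sums the values list of one key.
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
            counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));

        System.out.println(counts); // {a=2, b=3}
    }
}
```

Note how the reducer only ever sees a key together with the complete list of its values, which is exactly why reducers cannot start before all mappers have finished.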
Decide if the statement is true or false: All MapReduce implementations implement exactly the same algorithm.
The answer is: False. For example, Google's implementation does not allow a change of key in the reducer, but provides sorting for values. Hadoop does not provide values sorting, but the reducer can change the key.

True or false: Each mapper must generate the same number of key/value pairs as its input had.
The answer is: False. A mapper may generate any number of key/value pairs (including zero).

True or false: Mappers input key/value pairs are sorted by the key.
The answer is: False. A mapper's input is not sorted in any way.

True or false: Mappers output key/value pairs must be of the same type as its input.
The answer is: False. A mapper may produce key/value pairs of any type.

True or false: Reducer is applied to all values associated with the same key.
The answer is: True. A reducer is applied to all values associated with the same key.

True or false: Reducers input key/value pairs are sorted by the key.
The answer is: True. A reducer's input key/value pairs are sorted by the key.

True or false: Each reducer must generate the same number of key/value pairs as its input had.
The answer is: False. A reducer may generate any number of key/value pairs (including zero).

True or false: Reducers output key/value pairs must be of the same type as its input.
The answer is: False. The statement is false in Hadoop and true in Google's implementation.

2.3 The Execution Framework
What happens in case of hardware/software failure?
The answer is: The MapReduce framework must be able to recover from both hardware (disk failures, RAM errors) and software (bugs, unexpected exceptions) errors. Both are common and expected.

Is it possible to start reducers while some mappers still run? Why?
The answer is: No. A reducer's input is grouped by the key.
The last mapper could theoretically produce a key already consumed by a running reducer.

Define a straggler.
The answer is: A straggler is either a map or reduce task that takes an unusually long time to complete.

What is speculative execution (also called backup tasks)? What problem does it solve?
The answer is: An identical copy of the same task is executed on multiple nodes. The output of the fastest task is used. Speculative execution helps if the task is slow because of a hardware problem. It does not help if the distribution of values over keys is skewed.

2.4 Partitioners and Combiners
What does the partitioner do?
The answer is: The partitioner divides key/value pairs produced by map tasks between reducers.

What does the combiner do?
The answer is: The combiner does local aggregation of key/value pairs produced by the mapper before or during the shuffle and sort phase. In general, it reduces the amount of data to be transferred between nodes. The framework decides how many times to run it. A combiner may run zero, one or multiple times on the same input.

Decide if the statement is true or false: Each combiner runs exactly once.
The answer is: False. The framework decides whether the combiner runs zero, one or multiple times.

2.5 The Distributed File System
Briefly describe the HDFS architecture.
The answer is: HDFS has one namenode and a lot of datanodes. The namenode is the master; it coordinates file operations, ensures the integrity of the system and keeps the namespace (metadata, directory structure, file to block mapping etc.). Data are stored in big blocks on datanodes. Each block is stored on multiple, by default three, datanodes. The namenode checks whether the datanodes work correctly and manages data replication. A client contacts the namenode, which answers with a data block id and a datanode id. The datanode then sends the data directly to the client.

Decide if the statement is true or false: HDFS is good at managing a large number of small files.
The answer is: False. HDFS is good at managing large files.
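The default partitioning scheme described in 2.4 is typically just a hash of the key modulo the number of reducers; what matters is that every pair with the same key lands on the same reducer. A minimal plain-Java sketch (the class name and numbers are mine, not from the book):

```java
// Sketch of a default hash partitioner: the reducer index is derived from
// the key's hash code alone, so equal keys always hit the same reducer.
public class HashPartitioner {

    // Returns the index of the reducer responsible for this key.
    static int partition(String key, int numReducers) {
        // Mask off the sign bit so the index is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int reducers = 4;
        // The same key always maps to the same reducer...
        System.out.println(partition("word", reducers) == partition("word", reducers)); // true
        // ...and every index stays within [0, numReducers).
        int p = partition("another", reducers);
        System.out.println(p >= 0 && p < reducers); // true
    }
}
```

A custom partitioner replaces the hash with domain logic, e.g. partitioning by only part of a composite key, as the inverted index and sales-change exercises in the companion post require.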
2.6 Hadoop Cluster Architecture
Explain the difference between the jobtracker and the tasktracker.
The answer is: The client executes jobs on the jobtracker. The jobtracker runs on the master. It monitors MapReduce jobs and coordinates mappers and reducers. The tasktracker runs both user code and the datanode daemon on slave nodes. It is never contacted by the client.

Explain the mapper lifecycle.
The answer is: The initialization method is called before any other method. It has no parameters and no output. The map method is called separately for each key/value pair. It processes input key/value pairs and emits intermediate key/value pairs. The close method runs after all input key/value pairs have been processed. The method should close all open resources. It may also emit key/value pairs.

Explain the reducer lifecycle.
The answer is: The initialization method is called before any other method. It has no parameters and no output. The reduce method is called separately for each key/[values list] pair. It processes intermediate key/value pairs and emits final key/value pairs. Its input is a key and an iterator over all intermediate values associated with that key. The close method runs after all input key/value pairs have been processed. The method should close all open resources. It may also emit key/value pairs.

3 MapReduce Algorithm Design

3.1 Local Aggregation
What is local aggregation and why is it used?
The answer is: Either a combiner or a mapper combines key/value pairs with the same key together. They may also do some additional preprocessing of the combined values. Only key/value pairs produced by the same mapper are combined. Key/value pairs created by map tasks are transferred between nodes during the shuffle and sort phase. Local aggregation reduces the amount of data to be transferred. If the distribution of values over keys is skewed, data preprocessing in the combiner helps to eliminate reduce stragglers.

What is in-mapper combining? State advantages and disadvantages over writing a custom combiner.
The answer is: Local aggregation (combining of key/value pairs) done inside the mapper. The map method does not emit key/value pairs; it only updates an internal data structure. The close method combines and preprocesses all stored data and emits the final key/value pairs. The internal data structure is initialized in the initialization method.

Advantages:
- It will run exactly once. A combiner may run multiple times or not at all.
- We are sure it will run during the map phase. A combiner may run either after the map phase or before the reduce phase; the latter case provides no reduction in transferred data.
- In-mapper combining is typically more efficient. A combiner does not reduce the amount of data produced by mappers, it only groups generated data together, which causes unnecessary object creation, destruction, serialization, and deserialization.

Disadvantages:
- Scalability bottleneck: the technique depends on having enough memory to store all partial results. We have to flush partial results regularly to avoid it. Combiner use produces no such scalability bottleneck.

3.2 Pairs and Stripes

Explain the Pairs design pattern on a co-occurrence example. Include advantages/disadvantages against the Stripes approach, possible optimizations, and their efficacy.
The answer is: The mapper generates keys composed from pairs of words that occurred together; the value contains the number 1. The framework groups key/value pairs with the same word pair together, and the reducer simply counts the number of values for each incoming key. Each final pair encodes a cell in the co-occurrence matrix. Local aggregation, e.g. a combiner or in-mapper combining, can be used.

Advantages:
- Simple values, less serialization/deserialization overhead.
- Simpler memory management. No scalability bottleneck (one appears only if the in-mapper optimization is used).

Disadvantages:
- Huge amount of intermediate key/value pairs, so the shuffle and sort phase is slower.
- Local aggregation is less effective – too many distinct keys.

Explain the Stripes design pattern on a co-occurrence example.
Include advantages/disadvantages against the Pairs approach, possible optimizations, and their efficacy.
The answer is: The mapper generates a distinct key from each encountered word. The associated value contains a map of all co-occurring words as map keys and the number of co-occurrences as map values. The framework groups same words together, and the reducer merges the value maps. Each final pair encodes a row in the co-occurrence matrix. A combiner or in-mapper combining can be used.

Advantages:
- Small amount of intermediate key/value pairs, so the shuffle and sort phase is faster. Intermediate keys are smaller.
- Effective local aggregation – a smaller number of distinct keys.

Disadvantages:
- Complex values, more serialization/deserialization overhead.
- More complex memory management. As the value maps may grow too big, the approach has a potential scalability bottleneck.

Explain the scalability bottleneck caused by the Stripes approach.
The answer is: The Stripes solution keeps a map of co-occurring words in memory. As the number of co-occurring words is unlimited, the map size is unlimited too. A huge map does not fit into memory and causes paging or out of memory errors.
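To make the Stripes mapper and reducer logic concrete, here is a minimal pure-Java simulation (no Hadoop dependency; the class and method names are illustrative, not from the original Q&A):

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative pure-Java simulation of the Stripes approach: the mapper builds
// one stripe (neighbor -> count map) per word, the reducer merges stripes
// element-wise into a row of the co-occurrence matrix.
public class StripesCoOccurrence {

    // Mapper: for each word, build a stripe counting every other word in the line.
    static Map<String, Map<String, Integer>> map(List<String> tokens) {
        Map<String, Map<String, Integer>> stripes = new HashMap<>();
        for (int i = 0; i < tokens.size(); i++) {
            Map<String, Integer> stripe =
                stripes.computeIfAbsent(tokens.get(i), w -> new HashMap<>());
            for (int j = 0; j < tokens.size(); j++) {
                if (j != i) stripe.merge(tokens.get(j), 1, Integer::sum);
            }
        }
        return stripes;
    }

    // Reducer: element-wise merge of all stripes emitted for the same word.
    static Map<String, Integer> reduce(List<Map<String, Integer>> stripes) {
        Map<String, Integer> merged = new HashMap<>();
        for (Map<String, Integer> s : stripes)
            s.forEach((word, n) -> merged.merge(word, n, Integer::sum));
        return merged;
    }

    public static void main(String[] args) {
        // Two "map tasks" produce stripes for the word "a"...
        Map<String, Map<String, Integer>> s1 = map(List.of("a", "b", "a"));
        Map<String, Map<String, Integer>> s2 = map(List.of("a", "c"));
        // ...and the reducer merges them into one co-occurrence matrix row.
        Map<String, Integer> row = reduce(List.of(s1.get("a"), s2.get("a")));
        System.out.println(row.get("b") + " " + row.get("c")); // 2 1
    }
}
```

The scalability bottleneck described above is visible here: each stripe is an in-memory map whose size grows with the number of distinct co-occurring words.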
3.3 Computing Relative Frequencies

Relative frequencies of co-occurrences problem:
Input: text documents
  key: document id
  value: text document
Output: key/value pairs where
  key: pair(word1, word2)
  value: #co-occurrences(word1, word2) / #co-occurrences(word1, any word)

Fix the following solution to the relative frequencies of co-occurrences problem:

class MAPPER
  method INITIALIZE
    H = new hash map
  method MAP(docid a, doc d)
    for all term w in doc d do
      for all term u in neighbors(w) do
        H(w) = H(w) + 1
        emit(pair(u, w), count 1)
  method CLOSE
    for all term w in H
      emit(pair(w, *), H(w))

class REDUCER
  variable total_occurrences = 0
  method REDUCE(pair (p, u), counts [c1, c2, ..., cn])
    s = 0
    for all c in counts [c1, c2, ..., cn] do
      s = s + c
    if u = *
      total_occurrences = s
    else
      emit(pair p, s / total_occurrences)

class SORTING_COMPARATOR
  method compare(key (p1, u1), key (p2, u2))
    if p1 = p2 AND u1 = *
      return key1 is lower
    if p1 = p2 AND u2 = *
      return key2 is lower
    return compare(p1, p2)

The answer is: The partitioner is missing; the framework could send the key/value pairs with totals to a different reducer than the key/value pairs with word pairs.

class PARTITIONING_COMPARATOR
  method compare(key (p1, u1), key (p2, u2))
    if p1 = p2
      return keys are equal
    return keys are different

Describe the order inversion design pattern.
The answer is: Order inversion is used if the algorithm requires two passes through mapper-generated key/value pairs with the same key. The first pass generates some overall statistic which is then applied to the data during the second pass. Without the pattern, the reducer would need to buffer data in memory just to be able to pass through them twice. The first pass result is calculated by the mappers and stored in some internal data structure. The mapper emits the result in its close method, after all the usual intermediate key/value pairs. The pattern requires custom partitioning and sorting: the first pass result must come to the reducer before the usual key/value pairs and, of course, it must come to the same reducer.
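The corrected partitioner only needs to consider the left word of each pair, so that the marginal pair (w, *) and every joint pair (w, u) reach the same reducer. A tiny hash-based Java sketch of that routing rule (purely illustrative):

```java
// Illustrative order-inversion partitioner: route by the left word only, so the
// marginal pair (w, *) lands on the same reducer as every joint pair (w, u).
public class LeftWordPartitioner {

    static int partition(String left, String right, int numReducers) {
        // 'right' is deliberately ignored; only the left word decides the route
        return (left.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // (dog, *) must land on the same reducer as (dog, cat) and (dog, bone)
        boolean sameReducer =
                partition("dog", "*", 10) == partition("dog", "cat", 10)
             && partition("dog", "*", 10) == partition("dog", "bone", 10);
        System.out.println(sameReducer); // true
    }
}
```

Combined with the custom sort that puts (w, *) first, this guarantees total_occurrences is set before any joint count is divided by it.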
3.4 Secondary Sorting

Describe the value-to-key design pattern.
The answer is: The Hadoop implementation does not provide sorting of grouped values in the reducer's input; value-to-key is used as a workaround. Part of the value is added to the key. A custom sort then sorts primarily by the key and secondarily by the added value part. A custom partitioner must move all data with the same original key to the same reducer.

3.5 Relational Joins

Describe a reduce-side join between tables with a one-to-one relationship.
The answer is: The mapper produces key/value pairs with join ids as keys and row values as values. Corresponding rows from both tables are grouped together by the framework during the shuffle and sort phase. The reduce method in the reducer obtains a join id and two values, each representing a row from one table, and joins the data.

Describe a reduce-side join between tables with a one-to-many relationship.
The answer is: We assume that the join key is the primary key in a table called S; the second table is called T. In other words, the table S is on the 'one' side of the relationship and the table T is on the 'many' side. We have to implement a mapper, a custom sorter, a partitioner, and a reducer. The mapper produces a key composed from the join id and a table flag. The partitioner splits the data in such a way that all key/value pairs with the same join id go to the same reducer. The custom sort puts the key/value pair generated from the table S right before the key/value pairs with the same join id from the table T. The reducer's input looks like this:

((JoinId1, s) -> row)
((JoinId1, t) -> [rows])
((JoinId2, s) -> row)
((JoinId2, t) -> [rows])
...
((JoinIdn, s) -> row)
((JoinIdn, t) -> [rows])

The reducer joins all rows from an s pair with all rows from the following t pair.

Describe a reduce-side join between tables with a many-to-many relationship.
The answer is: We assume that the data are stored in tables called S and T, and that the table S is smaller. We have to implement a mapper, a custom sorter, a partitioner, and a reducer.
The mapper produces a key composed from the join id and a table flag. The partitioner splits the data in such a way that all key/value pairs with the same join id go to the same reducer. The custom sort puts the key/value pairs generated from the table S right before all key/value pairs with data from the table T. The reducer's input looks like this:

((JoinId1, s) -> [rows])
((JoinId1, t) -> [rows])
((JoinId2, s) -> [rows])
((JoinId2, t) -> [rows])
...
((JoinIdn, s) -> [rows])
((JoinIdn, t) -> [rows])

The reducer buffers all rows with the same JoinId from the table S into memory and joins them with the following T table rows. All data from the smaller table must fit into memory – the algorithm has a scalability bottleneck problem.

Describe a map-side join between two database tables.
The answer is: A map-side join works only if the following assumptions hold:
- both datasets are sorted by the join key,
- both datasets are partitioned the same way.

The mapper maps over the larger dataset and reads the corresponding part of the smaller dataset inside the mapper. As the smaller set is partitioned the same way as the bigger one, only one map task accesses the same data. As the data are sorted by the join key, we can perform an O(n) merge join.

Describe a memory-backed join.
The answer is: The smaller dataset is loaded into memory in every mapper. The mapper loops over the larger dataset and joins it with the data in memory. If the smaller set is too big to fit into memory, the dataset is loaded into memcached or some other caching solution.

Which one is faster, a map-side join or a reduce-side join?
The answer is: A map-side join is faster.

Go to Part 2

Reference: MapReduce Questions and Answers from our JCG partner Maria Jurcovicova at the This is Stuff blog....
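The O(n) merge join that makes the map-side join fast can be sketched in plain Java over two lists pre-sorted by the join key (the list-of-entries representation and row labels are illustrative, not from the original Q&A):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative O(n) merge join over two datasets sorted by join key —
// the core idea behind the map-side join described above.
public class SortedMergeJoin {

    static List<String> join(List<Map.Entry<Integer, String>> s,
                             List<Map.Entry<Integer, String>> t) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < s.size() && j < t.size()) {
            int cmp = s.get(i).getKey().compareTo(t.get(j).getKey());
            if (cmp < 0) i++;          // S key too small: advance S
            else if (cmp > 0) j++;     // T key too small: advance T
            else {
                // one-to-many: pair the S row with every T row sharing the key
                int key = s.get(i).getKey();
                while (j < t.size() && t.get(j).getKey() == key) {
                    out.add(s.get(i).getValue() + "-" + t.get(j).getValue());
                    j++;
                }
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<Integer, String>> s = List.of(
            Map.entry(1, "s1"), Map.entry(2, "s2"));
        List<Map.Entry<Integer, String>> t = List.of(
            Map.entry(1, "t1"), Map.entry(1, "t2"), Map.entry(3, "t3"));
        System.out.println(join(s, t)); // [s1-t1, s1-t2]
    }
}
```

Because both inputs are already sorted, each row is visited at most once, which is exactly why the sorted-and-copartitioned assumptions matter.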

Secure Password Storage – Don’ts, dos and a Java example

The importance of storing passwords securely

As software developers, one of our most important responsibilities is the protection of our users' personal information. Without technical knowledge of our applications, users have no choice but to trust that we're fulfilling this responsibility. Sadly, when it comes to passwords, the software development community has a spotty track record. While it's impossible to build a 100% secure system, there are fortunately some simple steps we can take to make our users' passwords safe enough to send would-be hackers in search of easier prey. If you don't want all the background, feel free to skip to the Java SE example below.

The Don'ts

First, let's quickly discuss some of the things you shouldn't do when building an application that requires authentication:

Don't store authentication data unless you really have to. This may seem like a cop-out, but before you start building a database of user credentials, consider letting someone else handle it. If you're building a public application, consider using OAuth providers such as Google or Facebook. If you're building an internal enterprise application, consider using any internal authentication services that may already exist, like a corporate LDAP or Kerberos service. Whether it's a public or internal application, your users will appreciate not needing to remember another user ID and password, and it's one less database out there for hackers to attack.

If you must store authentication data, for Gosling's sake don't store the passwords in clear text. This should be obvious, but it bears mentioning. Let's at least make the hackers break a sweat.

Don't use two-way encryption unless you really need to retrieve the clear-text password. You only need to know the user's clear-text password if you are using their credentials to interact with an external system on their behalf. Even then, you're better off having the user authenticate with that system directly.
To be clear, you do not need to use the user's original clear-text password to perform authentication in your application. I'll go into more detail on this later, but when performing authentication, you will be applying an encryption algorithm to the password the user entered and comparing it to the encrypted password you've stored.

Don't use outdated hashing algorithms like MD5. Honestly, hashing a password with MD5 is virtually useless. Here's an MD5-hashed password: 569a70c2ccd0ac41c9d1637afe8cd932. Go to http://www.md5hacker.com/ and you can crack it in seconds.

Don't come up with your own encryption scheme. There are a handful of brilliant encryption experts in the world who are capable of outwitting hackers and devising a new encryption algorithm. I am not one of them, and most likely, neither are you. If a hacker gets access to your user database, they can probably get your code too. Unless you've invented the next great successor to PBKDF2 or bcrypt, they will be cackling maniacally as they quickly crack all your users' passwords and publish them on the darknet.

The Dos

Okay, enough lecturing on what not to do. Here are the things you need to focus on:

Choose a one-way encryption algorithm. As I mentioned above, once you've encrypted and stored a user's password, you never need to know the real value again. When a user attempts to authenticate, you'll just apply the same algorithm to the password they entered, and compare that to the encrypted password that you stored.

Make the encryption as slow as your application can tolerate. Any modern password encryption algorithm should allow you to provide parameters that increase the time needed to encrypt a password (e.g. in PBKDF2, specifying the number of iterations). Why is slow good? Your users won't notice if it takes an extra 100ms to encrypt their password, but a hacker trying a brute-force attack will notice the difference as they run the algorithm billions of times.

Pick a well-known algorithm.
The National Institute of Standards and Technology (NIST) recommends PBKDF2 for passwords. bcrypt is a popular and established alternative, and scrypt is a relatively new algorithm that has been well-received. All these are popular for a reason: they're good.

PBKDF2

Before I show you some concrete code, let's talk a little about why PBKDF2 is a good choice for encrypting passwords:

Recommended by NIST. Section 5.3 of Special Publication 800-132 recommends PBKDF2 for encrypting passwords. Security officials will love that.

Adjustable key stretching to defeat brute-force attacks. The basic idea of key stretching is that after you apply your hashing algorithm to the password, you then continue to apply the same algorithm to the result many times (the iteration count). If hackers are trying to crack your passwords, this greatly increases the time it takes to try the billions of possible passwords. As mentioned previously, the slower, the better. PBKDF2 lets you specify the number of iterations to apply, allowing you to make it as slow as you like.

A required salt to defeat rainbow table attacks and prevent collisions with other users. A salt is a randomly generated sequence of bits that is unique to each user and is added to the user's password as part of the hashing. This prevents rainbow table attacks by making a precomputed list of results infeasible. And since each user gets their own salt, even if two users have the same password, the encrypted values will be different. There is a lot of conflicting information out there on whether the salts should be stored someplace separate from the encrypted passwords. Since the key stretching in PBKDF2 already protects us from brute-force attacks, I feel it is unnecessary to try to hide the salt. Section 3.1 of NIST SP 800-132 also defines the salt as a "non-secret binary value," so that's what I go with.

Part of Java SE 6. No additional libraries necessary.
This is particularly attractive to those working in environments with restrictive open-source policies.

Finally, a concrete example

Okay, here's some code to encrypt passwords using PBKDF2. Only Java SE 6 is required.

import java.security.NoSuchAlgorithmException;
import java.security.SecureRandom;
import java.security.spec.InvalidKeySpecException;
import java.security.spec.KeySpec;
import java.util.Arrays;

import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

public class PasswordEncryptionService {

    public boolean authenticate(String attemptedPassword, byte[] encryptedPassword, byte[] salt)
            throws NoSuchAlgorithmException, InvalidKeySpecException {
        // Encrypt the clear-text password using the same salt that was used to
        // encrypt the original password
        byte[] encryptedAttemptedPassword = getEncryptedPassword(attemptedPassword, salt);

        // Authentication succeeds if the encrypted password that the user entered
        // is equal to the stored hash
        return Arrays.equals(encryptedPassword, encryptedAttemptedPassword);
    }

    public byte[] getEncryptedPassword(String password, byte[] salt)
            throws NoSuchAlgorithmException, InvalidKeySpecException {
        // PBKDF2 with SHA-1 as the hashing algorithm. Note that NIST
        // specifically names SHA-1 as an acceptable hashing algorithm for PBKDF2
        String algorithm = "PBKDF2WithHmacSHA1";
        // SHA-1 generates 160-bit hashes, so that's what makes sense here
        int derivedKeyLength = 160;
        // Pick an iteration count that works for you.
        // NIST recommends at least 1,000 iterations:
        // http://csrc.nist.gov/publications/nistpubs/800-132/nist-sp800-132.pdf
        // iOS 4.x reportedly uses 10,000:
        // http://blog.crackpassword.com/2010/09/smartphone-forensics-cracking-blackberry-backup-passwords/
        int iterations = 20000;

        KeySpec spec = new PBEKeySpec(password.toCharArray(), salt, iterations, derivedKeyLength);

        SecretKeyFactory f = SecretKeyFactory.getInstance(algorithm);

        return f.generateSecret(spec).getEncoded();
    }

    public byte[] generateSalt() throws NoSuchAlgorithmException {
        // VERY important to use SecureRandom instead of just Random
        SecureRandom random = SecureRandom.getInstance("SHA1PRNG");

        // Generate an 8 byte (64 bit) salt as recommended by RSA PKCS5
        byte[] salt = new byte[8];
        random.nextBytes(salt);

        return salt;
    }
}

The flow goes something like this:

When adding a new user, call generateSalt(), then getEncryptedPassword(), and store both the encrypted password and the salt. Do not store the clear-text password. Don't worry about keeping the salt in a separate table or location from the encrypted password; as discussed above, the salt is non-secret.

When authenticating a user, retrieve the previously encrypted password and salt from the database, then send those and the clear-text password they entered to authenticate(). If it returns true, authentication succeeded.

When a user changes their password, it's safe to reuse their old salt; you can just call getEncryptedPassword() with the old salt.

Easy enough, right? If you're building or maintaining an application that violates any of the "don'ts" above, then please do your users a favor and use something like PBKDF2 or bcrypt. Help them, Obi-Wan Developer, you're their only hope.

Reference: Secure Password Storage – Lots of don'ts, a few dos, and a concrete Java SE example from our JCG partner Jerry Orr at the Jerry on Java blog....
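The sign-up/login flow described above can be exercised end to end with a small standalone sketch that inlines the same PBKDF2 parameters (the class and method names here are illustrative, not part of the original article):

```java
import java.security.SecureRandom;
import java.security.spec.KeySpec;
import java.util.Arrays;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.PBEKeySpec;

// Hypothetical end-to-end demo of the sign-up/login flow using the same
// PBKDF2 parameters as the service above: 20,000 iterations, 160-bit key.
public class PasswordFlowDemo {

    static byte[] encrypt(String password, byte[] salt) throws Exception {
        KeySpec spec = new PBEKeySpec(password.toCharArray(), salt, 20000, 160);
        return SecretKeyFactory.getInstance("PBKDF2WithHmacSHA1")
                               .generateSecret(spec).getEncoded();
    }

    public static void main(String[] args) throws Exception {
        // Sign-up: generate a salt, store salt + encrypted password
        byte[] salt = new byte[8];
        SecureRandom.getInstance("SHA1PRNG").nextBytes(salt);
        byte[] stored = encrypt("correct horse battery staple", salt);

        // Login: re-encrypt the attempt with the stored salt and compare
        System.out.println(Arrays.equals(stored, encrypt("correct horse battery staple", salt)));
        System.out.println(Arrays.equals(stored, encrypt("wrong guess", salt)));
    }
}
```

Note that the clear-text password never needs to be stored: the stored hash is recomputed from the attempt and the stored salt at every login.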

Birt Made Report Easy

Here is a complete guide to building reports in Eclipse using the Birt plugin. Birt, or Business Intelligence and Reporting Tools, is a tool that can build reports without writing too much Java code. If you are using iReport you know what I'm talking about :) (Crystal Reports.. no point :D) The main drawback of iReport is that we have to write too much code to build a simple report that can integrate with a Java application. But in Birt there is no need to write code, just drag and drop. Birt is an Eclipse plugin (Eclipse only) that you can download. So in this article we demonstrate how to build some sample reports with the ease of drag and drop :) Here we go…

Step 1
Download Eclipse (http://www.eclipse.org/). You can download the Eclipse package for report developers (it comes with the Birt plugin). If you already have Eclipse, just install the Birt plugin: Help –> Install New Software –> Click Add. Give the name as Birt and the location as http://download.eclipse.org/birt/update-site/3.7-interim/ and click Ok. Select all packages (tick) and accept the license agreement (whether we like it or not :P ). It will take some time to download all the packages; after the download, restart Eclipse.

Step 2
Before creating the project, adjust your perspective to Report Design. To do that: Window –> Show View –> Other. Select the Report tab and select all the fields.

Step 3
Now it's time to create a new project: File –> New –> Other –> Business Intelligence and Reporting Tools –> Report Project. Give a name for the project and click Finish.

Step 4
Now the project is created, but we have to create a report. To do that: File –> New –> Other –> Business Intelligence and Reporting Tools –> Report. Give a name for the report and click Next. Here you can choose the report template. (If you have one you can use it, but first you have to register the report template; we will talk about this later.) I choose Blank Report and click Finish.

Step 5
Now the report is ok, but we need to display data.
Click Palette on the left-hand side (this is the place where all the controls live) and drag and drop the Table component onto the report design. Give the number of rows and columns and click Ok. Now you can see the table in the report view (the layout you are currently in).

Step 6
So the basic designing part is done. But the data?? (give me 5 min) Click on the Data Explorer (to the right of the Palette) –> right click and select New Data Source. Here we can configure the database. For this demo we configure the built-in database that comes with Birt (Classic Models; we will talk about the others later). Select Classic Models –> Next –> Finish.

Step 7
Right click Data Set –> New Data Set –> Next. In the query window you can see the databases available. Select CLASSICMODELS and expand it. In the query text area you can write the SQL query. Here we select the customer name, customer first name, and customer phone (you can do this by double clicking the table values in Available Items). After clicking Ok you can see the Edit Data Set window (if you have any query issue it will show an error window).

Step 8
Click the Preview tab (in the layout area) and you can see the output of the data. Run –> View Report –> In Web Browser –> you can see the report in a web browser, with more advanced functions like export to doc, pdf, ppt and many more.

These are some basic things about Birt reports, but Birt is more than that. You can write JavaScript to get more functionality. We will talk about this later — that means the advanced Birt :)

Reference: Birt Made Report Easy from our JCG partner Rajith Delantha at the Looping around with Rajith… blog....

JSON for polymorphic Java object serialization

For a long time now JSON has been a de facto standard for all kinds of data serialization between client and server. Among others, its strengths are simplicity and human-readability. But with simplicity come some limitations, and I would like to talk about one of them today: storing and retrieving polymorphic Java objects. Let's start with a simple problem: a hierarchy of filters. There is one abstract class AbstractFilter and two subclasses, RegexFilter and StringMatchFilter.

package bean.json.examples;

public abstract class AbstractFilter {
    public abstract void filter();
}

Here is the RegexFilter class:

package bean.json.examples;

public class RegexFilter extends AbstractFilter {
    private String pattern;

    public RegexFilter( final String pattern ) {
        this.pattern = pattern;
    }

    public void setPattern( final String pattern ) {
        this.pattern = pattern;
    }

    public String getPattern() {
        return pattern;
    }

    @Override
    public void filter() {
        // Do some work here
    }
}

And here is the StringMatchFilter class:

package bean.json.examples;

public class StringMatchFilter extends AbstractFilter {
    private String[] matches;
    private boolean caseInsensitive;

    public StringMatchFilter() {
    }

    public StringMatchFilter( final String[] matches, final boolean caseInsensitive ) {
        this.matches = matches;
        this.caseInsensitive = caseInsensitive;
    }

    public String[] getMatches() {
        return matches;
    }

    public void setCaseInsensitive( final boolean caseInsensitive ) {
        this.caseInsensitive = caseInsensitive;
    }

    public void setMatches( final String[] matches ) {
        this.matches = matches;
    }

    public boolean isCaseInsensitive() {
        return caseInsensitive;
    }

    @Override
    public void filter() {
        // Do some work here
    }
}

Nothing fancy, pure Java beans. Now what if we need to store a list of AbstractFilter instances to JSON, and more importantly, to reconstruct this list back from JSON?
The following class Filters demonstrates what I mean:

package bean.json.examples;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;

public class Filters {
    private Collection< AbstractFilter > filters = new ArrayList< AbstractFilter >();

    public Filters() {
    }

    public Filters( final AbstractFilter ... filters ) {
        this.filters.addAll( Arrays.asList( filters ) );
    }

    public Collection< AbstractFilter > getFilters() {
        return filters;
    }

    public void setFilters( final Collection< AbstractFilter > filters ) {
        this.filters = filters;
    }
}

As JSON is a textual, platform-independent format, it doesn't carry any type-specific information, so reconstructing the exact subclasses requires a bit of extra metadata. Thanks to the awesome Jackson JSON processor this can easily be done. So let's add the Jackson JSON processor to our POM file:

<project>
    <modelVersion>4.0.0</modelVersion>
    <groupId>bean.json</groupId>
    <artifactId>examples</artifactId>
    <version>0.0.1-SNAPSHOT</version>
    <packaging>jar</packaging>

    <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    </properties>

    <dependencies>
        <dependency>
            <groupId>org.codehaus.jackson</groupId>
            <artifactId>jackson-mapper-asl</artifactId>
            <version>1.9.6</version>
        </dependency>
    </dependencies>
</project>

Having done this step, we need to tell Jackson that we intend to store the type information together with our objects in JSON, so it will be possible to reconstruct the exact objects from JSON later. A few annotations on AbstractFilter do exactly that.

import org.codehaus.jackson.annotate.JsonSubTypes;
import org.codehaus.jackson.annotate.JsonSubTypes.Type;
import org.codehaus.jackson.annotate.JsonTypeInfo;
import org.codehaus.jackson.annotate.JsonTypeInfo.Id;

@JsonTypeInfo( use = Id.NAME )
@JsonSubTypes( {
    @Type( name = "Regex", value = RegexFilter.class ),
    @Type( name = "StringMatch", value = StringMatchFilter.class )
} )
public abstract class AbstractFilter {
    // ...
}

And … that's it!
The following helper class does the dirty job of serializing filters to a string and deserializing them back using Jackson's ObjectMapper:

package bean.json.examples;

import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;

import org.codehaus.jackson.map.ObjectMapper;

public class FilterSerializer {
    private final ObjectMapper mapper = new ObjectMapper();

    public String serialize( final Filters filters ) {
        final StringWriter writer = new StringWriter();
        try {
            mapper.writeValue( writer, filters );
            return writer.toString();
        } catch( final IOException ex ) {
            throw new RuntimeException( ex.getMessage(), ex );
        } finally {
            try {
                writer.close();
            } catch ( final IOException ex ) {
                /* Nothing to do here */
            }
        }
    }

    public Filters deserialize( final String str ) {
        final StringReader reader = new StringReader( str );
        try {
            return mapper.readValue( reader, Filters.class );
        } catch( final IOException ex ) {
            throw new RuntimeException( ex.getMessage(), ex );
        } finally {
            reader.close();
        }
    }
}

Let's see this in action. The following code example

final String json = new FilterSerializer().serialize(
    new Filters(
        new RegexFilter( "\\d+" ),
        new StringMatchFilter( new String[] { "String1", "String2" }, true )
    )
);

produces the following JSON:

{
  "filters": [
    {"@type":"Regex","pattern":"\\d+"},
    {"@type":"StringMatch","matches":["String1","String2"],"caseInsensitive":true}
  ]
}

As you can see, each entry in the "filters" collection has a property "@type" with the value we specified by annotating the AbstractFilter class. Calling new FilterSerializer().deserialize( json ) produces an equivalent Filters object instance.

Reference: JSON for polymorphic Java object serialization from our JCG partner Andrey Redko at the Andriy Redko {devmind} blog....

Java High CPU troubleshooting guide – part 1

This article is part 1 of a series that will provide you with a comprehensive guide on how to troubleshoot and identify the root cause of Java high CPU problems. This guide is also applicable to standalone Java programs, but it is designed to help individuals involved in day-to-day Java EE enterprise production support. It will also include a list of the most common high CPU problems along with high-level resolution strategies.

Production problem resolution mindset review

Before we go any further, it is important to review your production problem resolution mindset. One of the most common "reflexes" that I have seen in my experience with Java EE production support teams is that a Java VM / middleware restart is often the first recovery action performed. While a premature restart can quickly eliminate the business impact, it can also prevent you from getting all the technical facts, reducing your capability to identify the root cause and exposing the platform to future re-occurrences of the problem. Before pulling the trigger and shutting down your Java VM process, ask yourself the following question: do I have all the data available to perform a root cause analysis post-restart? If the answer is no, then my recommendation is to review and improve your current platform monitoring and / or troubleshooting approaches. Proper performance data gathering before and during a high CPU problem is critical.

Java high CPU – what is it exactly?

Now back to our original topic: a high CPU problem is defined by an observation of one or many Java VM processes consuming excessive CPU utilization on your physical host(s). Excessive CPU can also be described as abnormally high CPU utilization vs. a known & established baseline. E.g. if the average CPU utilization of your Java VM under peak load conditions is 40%, then the excessive CPU threshold can be set around 80%.

A typical Java VM process contains several Java threads, some waiting to do work and others currently executing tasks.
The number of threads can be very low in the case of a single Java program and very high for Java EE enterprise platforms processing heavy concurrent transactions. In order to understand and identify the source of high CPU of one or many of your Java processes, you will need to perform a full breakdown of all the threads of your Java VM so you can pinpoint the biggest contributors. This analysis exercise can be visualized as per the below diagram.

Understand your average CPU utilization

As I mentioned in the earlier section, it is very important that you understand your current average CPU utilization, what I refer to as a baseline. This is crucial data that needs to be monitored on a regular basis as part of a comprehensive and ongoing platform capacity planning strategy. Proper understanding and tracking of the average and "healthy" CPU utilization observed from your Java VM processes will allow you to quickly detect abnormal CPU surge scenarios and correlate them with possible root causes (a problem introduced by a project, an unexpected load increase, etc.). Finally, this will provide you with the proper threshold values necessary to configure pro-active CPU related alerts using the monitoring tools of your choice.

Understand your production environment and available tools

As the middleware and / or application support prime, you really need to understand your production environment, including the out-of-the-box tools available for you to perform low-level troubleshooting tasks. This may be trivial for some individuals, but if you just recently started to work on a new Java or Java EE platform for a new client, my recommendation is that you spend enough time understanding your client's environment specifications & business situation as per below:

- Physical & virtual host configuration and capacity (total number of assigned CPU cores, RAM, etc.)
- OS vendor, version, and patch level
- Middleware vendor, versions, and patch level
- Java vendor & versions (including 32-bit vs.
64-bit), including patch level
- Third-party APIs used within the Java or Java EE applications
- Existing monitoring tools that you can leverage for historical data and trend analysis
- History of the environment: known issues, resource utilization, etc.
- Business traffic breakdown per application along with average & peak traffic levels of the platform, including peak business periods

Collecting all the proper facts as per above will definitely help your root cause analysis process going forward, including for high CPU related problems.

Your homework before jumping to part 2

Before we jump into part 2 of this CPU troubleshooting guide, I highly recommend that you study and understand each of the following articles below. Focus on the ones applicable to your environment. Each of these articles provides a technical step-by-step on how to break down CPU per thread for a Java VM — a key troubleshooting skill to acquire in order to investigate Java CPU related problems. Parts of this technique are common across operating systems, while others are OS-specific.

# CPU per Thread analysis on Solaris
http://javaeesupportpatterns.blogspot.com/2011/12/prstat-solaris-pinpoint-high-cpu-java.html
# CPU per Thread analysis on Linux
http://javaeesupportpatterns.blogspot.com/2012/02/prstat-linux-how-to-pinpoint-high-cpu.html
# CPU per Thread analysis on AIX
http://javaeesupportpatterns.blogspot.com/2011/12/prstat-aix-how-to-pinpoint-high-cpu.html
# CPU per Thread analysis on Windows
http://javaeesupportpatterns.blogspot.com/2012/04/java-thread-cpu-analysis-on-windows.html

I hope this article has provided you with a good starting point on Java CPU problems. Part 2 will provide you with the troubleshooting guide itself, including flow diagrams allowing you to choose the right investigation path depending on your problem case.

Reference: Java High CPU troubleshooting guide – part 1 from our JCG partner Pierre-Hugues Charbonneau at the Java EE Support Patterns & Java Tutorial blog....
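Besides the OS-level tools covered in the articles above, the JVM itself can report per-thread CPU time through the standard java.lang.management API. Here is a quick sketch of such a snapshot utility (illustrative only, not part of the original guide; the OS tools remain necessary to correlate with system-wide CPU):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Illustrative snapshot of CPU time consumed so far by each live thread
// in this JVM, printed as "<thread name> <cpu ms>".
public class ThreadCpuSnapshot {
    public static void main(String[] args) {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        if (!mx.isThreadCpuTimeSupported()) {
            System.out.println("Per-thread CPU time is not supported on this JVM");
            return;
        }
        for (long id : mx.getAllThreadIds()) {
            ThreadInfo info = mx.getThreadInfo(id);
            long cpuNanos = mx.getThreadCpuTime(id); // -1 if unavailable
            if (info != null && cpuNanos >= 0) {
                System.out.printf("%-40s %10.1f ms%n",
                        info.getThreadName(), cpuNanos / 1_000_000.0);
            }
        }
    }
}
```

Comparing two such snapshots taken a few seconds apart shows which threads are burning CPU; matching the thread names against a thread dump then reveals what they are executing.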

The depths of Java: API leak exposed through covariance

Java can be very tricky sometimes, especially in API design. Let’s have a look at a very interesting showcase. jOOQ strongly separates API from implementation. All API is in the org.jooq package, and public. Most implementation is in the org.jooq.impl package and package-private. Only factories and some dedicated base implementations are public. This allows for very powerful package-level encapsulation, exposing mostly only interfaces to jOOQ users.

A simplified example of package-level encapsulation

Here’s roughly how jOOQ models SQL tables. The (overly simplified) API:

package org.jooq;

/**
 * A table in a database
 */
public interface Table {

    /**
     * Join two tables
     */
    Table join(Table table);
}

And two (overly simplified) implementation classes:

package org.jooq.impl;

import org.jooq.Table;

/**
 * Base implementation
 */
abstract class AbstractTable implements Table {

    @Override
    public Table join(Table table) {
        return null;
    }
}

/**
 * Custom implementation, publicly exposed to client code
 */
public class CustomTable extends AbstractTable {
}

How the internal API is exposed

Let’s assume that the internal API does some tricks with covariance:

abstract class AbstractTable implements Table, InternalStuff {

    // Note, this method returns AbstractTable, as it might
    // prove to be convenient to expose some internal API
    // facts within the internal API itself
    @Override
    public AbstractTable join(Table table) {
        return null;
    }

    /**
     * Some internal API method, also package-private
     */
    void doThings() {}

    void doMoreThings() {
        // Use the internal API
        join(this).doThings();
    }
}

This all looks safe at first sight, but is it? AbstractTable is package-private, but CustomTable extends it and inherits all of its API, including the covariant method override of “AbstractTable join(Table)”. What does that result in?
Check out the following piece of client code:

package org.jooq.test;

import org.jooq.Table;
import org.jooq.impl.CustomTable;

public class Test {
    public static void main(String[] args) {
        Table joined = new CustomTable();

        // This works, no knowledge of AbstractTable exposed to the compiler
        Table table1 = new CustomTable();
        Table join1 = table1.join(joined);

        // This works, even if join exposes AbstractTable
        CustomTable table2 = new CustomTable();
        Table join2 = table2.join(joined);

        // This doesn't work. The type AbstractTable is not visible,
        // so the result of join cannot be dereferenced:
        // Table join3 = table2.join(joined).join(joined);

        // The API flaw can be circumvented with casting, hiding these
        // implementation details again
        Table join4 = ((Table) table2.join(joined)).join(joined);
    }
}

Conclusion

Tampering with visibilities in class hierarchies can be dangerous. Beware of the fact that API methods declared in interfaces are always public, regardless of any covariant implementations that involve non-public artefacts. This can be quite annoying for API users when not properly dealt with by API designers. Fixed in the next version of jOOQ.

Reference: The depths of Java: API leak exposed through covariance from our JCG partner Lukas Eder at the JAVA, SQL, AND JOOQ blog....
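One way the flaw can be avoided (a sketch of mine, not necessarily how jOOQ fixed it): keep the public override’s return type exactly as declared in the interface, and confine the covariant convenience to a separately named package-private method. The name join0 and the demo classes are invented for illustration, and everything is collapsed into one file here, so the package-private aspect is only indicated in comments.

```java
interface Table {
    Table join(Table table);
}

abstract class AbstractTable implements Table {

    // Public signature stays exactly as declared in the interface,
    // so nothing package-private leaks into client code
    @Override
    public Table join(Table table) {
        return join0(table);
    }

    // Covariant variant for internal use only; package-private in the
    // real layout, so invisible outside org.jooq.impl
    AbstractTable join0(Table table) {
        return this; // placeholder implementation
    }

    void doThings() {
        System.out.println("internal work done");
    }

    void doMoreThings() {
        // Internal code still gets the convenient covariant type
        join0(this).doThings();
    }
}

class CustomTable extends AbstractTable {
}

public class CovarianceFixDemo {
    public static void main(String[] args) {
        Table t = new CustomTable();
        // Chaining now compiles for clients, because join returns Table
        Table chained = t.join(new CustomTable()).join(new CustomTable());
        System.out.println(chained != null); // prints true
    }
}
```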

5′ on IT-Architecture: three laws of good software architecture

The issue with architectural decisions is that they affect the whole system, and you often need to make them early in the development process. It takes a lot of effort to change such a decision a couple of months later. From an economic standpoint, architectural decisions are often irrevocable. Good architecture is one that allows an architect to make late decisions without a major effect on efforts and costs. Let’s put that on record.

Law 1: Good architecture is one that enables architects to have a minimum of irrevocable decisions.

To minimize the set of irrevocable decisions, the system needs to be responsive to change. There is a major lesson I have learned about software development projects: nothing is permanent except change. The client changes his opinion about requirements. The stakeholders change their viewpoint of what’s important. People join and leave the project team. The fact that change alone is unchanging leads me to the second law of good architecture:

Law 2: To make decisions revocable you need to design for flexibility.

This is the most provocative statement, and I have had controversial discussions about it. The reason is that flexibility introduces the need for abstraction. Abstraction uses a strategy of simplification, wherein formerly concrete details are left ambiguous, vague, or undefined (from Wikipedia). This simplification process isn’t always simple to do, and isn’t always simple for others to follow. “Making something easy to change makes the overall system a little more complex, and making everything easy to change makes the entire system very complex. Complexity is what makes software hard to change.” (from M. Fowler) This is one core problem of building good software architecture: developing software that is easy to change but at the same time understandable. There are several concepts that try to tackle this paradoxical problem: design patterns and object-oriented design principles. Polymorphism, loose coupling and high cohesion are flexibility enablers to me.

Law 3: To make use of flexibility one needs to refactor mercilessly.

Flexibility is not an end in itself. You need to actively make use of flexible design. If something changes and makes a previous design or architectural decision obsolete, you need to go into the code and change the software. Otherwise the effort of building flexible software is wasted, and technical debt may cause late delays and a maintenance nightmare. Taking rigorous action on your code base requires continuous feedback about the qualities of your software. To be able to refactor, it is therefore essential that the code base is covered by a sufficient amount of automated tests. In an ideal scenario everything is integrated into a continuous integration environment to receive permanent feedback about the health of your code base.

Reference: “5′ on IT-Architecture: three laws of good software architecture” from our JCG partner Niklas....
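As a tiny illustration of these flexibility enablers (my example, not the author’s): coupling a class to an interface rather than a concrete implementation keeps the choice of implementation revocable. All names here are invented for illustration.

```java
// Loose coupling via an abstraction: the pricing decision stays revocable
interface PricingPolicy {
    double priceOf(double base);
}

class FlatDiscount implements PricingPolicy {
    public double priceOf(double base) { return base * 0.9; }
}

class NoDiscount implements PricingPolicy {
    public double priceOf(double base) { return base; }
}

class Checkout {
    // Depends only on the abstraction, not on any concrete policy
    private final PricingPolicy policy;

    Checkout(PricingPolicy policy) { this.policy = policy; }

    double total(double base) { return policy.priceOf(base); }
}

public class PricingDemo {
    public static void main(String[] args) {
        // Swapping the policy is a one-line, revocable decision
        System.out.println(new Checkout(new FlatDiscount()).total(100.0)); // 90.0
        System.out.println(new Checkout(new NoDiscount()).total(100.0));   // 100.0
    }
}
```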

Programs and Technical Debt

Once you have a program (a collection of interrelated projects focused on one business goal) and you have technical debt, you have a much bigger problem. Not just because the technical debt is likely bigger. Not just because you have more people. But because you also have geographically distributed teams, and those teams are almost always separated by function and time zone.

So, my nice example of a collocated team in Thoughts on Infrastructure, Technical Debt, and Automated Test Framework rarely occurs in a program, unless you have cross-functional teams collocated in a program. If you do, great. You know what to do. But let’s assume you don’t have them. Let’s assume you have what I see in my consulting practice: an architecture group in one location, or an architect in one location and architects around the world; developers and “their” testers in multiple time zones; product owners separated from their developers and testers. The program is called agile because the program is working in iterations. And, because it’s a program, the software pre-dates the agile transition in the organization, so you have legacy technical debt up the wazoo (the technical term). What do you do?

Let’s walk through an example and see how it might work. Here’s a story which is a composite from several clients; no clients were harmed in the telling of this story. Let’s assume you are working on release 5.0 of a custom email client. Release 4 was the previous release. Release 4 had trouble. It was late by 6 months and quite buggy. Someone sold agile as the way to make software bug-free and on-time. You do not have automated tests for much of the code, unit tests or system tests. You have a list of defects that makes Jack the Ripper’s list of killings look like child’s play. But agile is your silver bullet.

The program manager is based in London.
The testers for the entire program are in Bangalore, because management had previously fired all the local testers and outsourced the testing. That was back in release 2. They have since hired all the Bangalore testers as employees of the Bangalore subsidiary. The program architect is based in San Francisco, and there is an architect team that is dispersed into 4 other teams: Denver, LA, Munich, and Paris. The developers are clustered in “Development Centers of Excellence”: Denver, LA, Cambridge, Paris, London, Munich, and Milan. That’s 8 development teams.

Oh, and if you think I’m kidding with this scenario, I’m not. This is what most of my clients with geographically distributed teams and programs face on a daily basis. They deserve your sympathy and empathy. Do not tell them, “Don’t go agile.” That’s nuts. They have a right to go agile. You can tell them, “Don’t go Scrum.” That’s reasonable. Scrum is for a cross-functional co-located team. Agile is for everyone; Scrum is for a specific subset.

What do you do?

Assign specific testers to specific development teams. No calling people resources; that allows managers to treat people like resources and plug-and-play them. You need to get rock-solid teams together.

Once you have teams together, you can name them. Name teams so the teams reflect the feature groups they work on. What does an email product do? It gets email, it sorts email, it deletes, it forwards, it creates new mailboxes, and so on. The eight feature teams had to be named for the feature areas: Platform for the general features, Sort, Delete, Forward. There were two teams who worked on Platform. They were called Platform 1 and Platform 2. At one point, someone suggested they call themselves Thing1 and Thing2 from the Dr. Seuss book.

Make sure you have enough product owners so they can develop roadmaps for each feature area. With a roadmap, the teams know where they are going.
Even more importantly, the architects know where the program is going.

Architects think and provide just enough guidance ahead. In a small project, the architecture can probably evolve with the project. In a larger program, that risk is too large. You have too many people developing in parallel for the architecture to evolve on its own with no guidance. But I do not mean there should be a Master Architect Who Knows All Handing Down the Architecture From On High. NO NO NO. I want the architect who is a working member of the development team, who also is part of an architecture community of practice team, who curates the architecture, who guides the business value of the architecture. I do not want Big Architecture Up Front. But Thinking Up Front? Sure, that’s a great idea. Stuck on only one idea? Bad. Willing to spike an idea? Great. Willing to play in a sandbox and debate several ideas? Great. I wrote about this before, in How Agile Architects Lead.

Decide what done means for every feature. You must have acceptance criteria for each feature. What does that mean? You need a product owner present for each team. You still need the conversations with each team to discuss what done means. Especially with a geographically distributed team, you need the conversation when you create the backlog at the beginning of the iteration. The US development teams had trouble planning their iterations with their testers, because of the time zone differences with the testers. So, they asked their product owners to write more than just a few phrases on the cards, because that would help them get through the iteration planning meeting faster. Someone was going to get up early or stay up late, and either way, someone was going to suffer. It made more sense to have a little bit more preparation than less sleep.

Decide to do continuous integration and stick with it.
Especially if you know you have technical debt and you don’t want to create more, you have to do continuous integration now. That prevents more technical debt.

I have recommended to some teams that they have one-week iterations, so that they stop the estimation nonsense and make their stories small. The point of estimation is so that you have an idea of what you can do as a team and not commit to more than that. The idea is that if you know what it takes to make your stories small, you will. Instead, we have all these crazy rituals around estimation and management tracking velocity, of all things. (Yes, I’ve been drafting this post for a long time, and I wrote Why Does Management Care About Velocity? last week.) You know, velocity is a little like weight. Only you and your doctor need to know your weight. If you are healthy, you are fine. If you are not, you need to change something. If your team velocity is not healthy, you, as a team, need to change it. But your management has no business butting its head in. Only you can change it.

When you limit the iteration length, you tend to have the team swarm around a story. This is a tendency, not a given. If I really were the Empress of the Universe, I would decree this, but I’m not, so I won’t. If you want to decrease technical debt, or even eliminate it on your program, explain that your team will only work on one story at a time until that story is done. That story will be polished and gleaming. Fast. You will not have to worry about what kinds of testing will be done. All of it will be done.

Explicitly discuss what you will automate for testing and when. In a program, I assume we will have automated system tests first. I assume we will do exploratory tests later. That’s because if you don’t start building something for test automation when you start the program and refactor as you proceed, you can never catch up. I assume every time we fix a defect, we will have an automated test for it.
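The “every defect fix gets a test” rule can be sketched in a few lines. Everything in this snippet is invented for illustration (the MailboxName class, the defect number); it just shows the shape of a regression test that pins the fixed behavior down. Plain asserts keep it dependency-free; in a real program this would live in the automated test suite.

```java
class MailboxName {
    // Hypothetical defect #4217: mailbox names with stray whitespace
    // and mixed case collided; the fix normalizes them
    static String normalize(String raw) {
        return raw.trim().toLowerCase();
    }
}

public class MailboxNameRegressionTest {
    public static void main(String[] args) {
        // Pin down the fixed behavior so the defect cannot silently return;
        // run with java -ea to enable assertions
        assert MailboxName.normalize("Inbox ").equals("inbox");
        assert MailboxName.normalize(" INBOX").equals("inbox");
        System.out.println("regression tests passed");
    }
}
```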
I also assume we build these assumptions into how we develop :-)

So far, this is all about preventing more technical debt, not what happens when you trip over technical debt as you enter code or tests you never looked at before. If you expected to walk into a closet, take out a shirt, and close the closet door, that’s one thing. But if you stepped into something out of one of those death-by-hoarding shows on TV, you have an obligation to do something. You can document the problem as you encounter it; you can let the product owner know; file a defect report; write a test so you can contain the debt; and maybe you have more options. Whatever you do, make sure you have done something. Do not open the door, see the mess inside, and close the door on the mess. It’s tempting. Oh my, it is tempting.

See, on programs, because of their size, everything is magnified. With more people and more teams, everything is harder. Things happen faster. If you have co-located cross-functional teams, no problem. But if you don’t have co-located cross-functional teams, you have to work with what you have. And if you already have a big legacy product, you want to address technical debt in small chunks, refactoring in small bits, integrating as you proceed.

My philosophy is this: the bigger the program, the more you need to become accustomed to working in small chunks, integrating as you go. Fully implement a small story, integrate it on the mainline. Everyone on the program does that. If you need help from an integration team, so be it. But if everyone only implements small stories, and everyone takes care of their own technical debt as they discover it, you don’t need an army of integration people. You only need an army of integration people when you have technical debt around integration and release. Fix that, and everyone can become responsible for their own integration.

And, if you can’t release, that’s where the architects should start.
If you can’t do continuous integration, that’s where the architects should start. Because that’s what’s preventing you from making progress on the product. Work backwards from release, and then the architects can work on the rest of the product. Until you can release and build reliably, the rest of the product doesn’t matter.Reference: Programs and Technical Debt from our JCG partner Johanna Rothman at the Managing Product Development blog....
Java Code Geeks and all content copyright © 2010-2014, Exelixis Media Ltd | Terms of Use
All trademarks and registered trademarks appearing on Java Code Geeks are the property of their respective owners.
Java is a trademark or registered trademark of Oracle Corporation in the United States and other countries.
Java Code Geeks is not connected to Oracle Corporation and is not sponsored by Oracle Corporation.
