
What's New Here?


Using Git

When it comes to software version/configuration management, there are plenty of vendor and open source implementations to choose from, but in recent years none has paralleled Git in terms of being development- and hacking-friendly. I have used quite a few different software management tools, from the CVS/SVN family to the ClearCase/Perforce family (which I personally find absolutely horrible), but it is with Git that I finally think software versioning is no longer a necessary evil but something that actually helps the software development process. Perhaps I will need to corroborate this statement with some examples later, but using Git can actually encourage developers to experiment and be creative with their code, knowing that they can always reset any code changes without penalty or overhead. I will not spend too much time on Git's history (if you are interested, you can always read Wikipedia). If you are like me, previously using CVS/SVN or ClearCase/Perforce to manage your software, I hope this article will improve your understanding of how Git works and how it can increase your productivity as well.

How a distributed version control system works

As the name suggests, a DVCS does not require a centralised server to be present for you to use it. It is perfectly fine to use Git as a way to store a history of your changes in your local system, and it is able to do so efficiently and conveniently. However, as with all software management tools, one of the main benefits is being able to collaborate effectively in a team and manage changes made to a software repository. So how does Git (or any DVCS) allow you to work in a standalone manner and yet collaborate on the same codebase? The answer is that Git stores a copy of the whole software repository on each local machine that contains the codebase. This might seem like a very inefficient and space-consuming method, but in practice it is not a big issue if your files are mostly text (as most source code is), since they are stored as blobs and highly compressed. So when you use Git, you are actually working within your local environment, which means that apart from a few commands that do require network communication, most commands are very responsive. When you "commit" code into a Git repository, you are not actually in "collaborative" mode yet, as your commit is stored in your local system. This concept is somewhat different from other VCS systems, where a "commit" puts your latest changes into a common repository where everyone can sync with or access it. The concept can be illustrated with an image I pilfered from Atlassian's site here. So that being said, how do I actually share my code with the rest of my team? The wonderful thing about Git is that it allows you to define your workflow. Do you want to synchronize your code directly with your peers? Or would you prefer a traditional "centralized" model where everyone updates a "centralized" server with their code? The most common way is the latter, where each developer or local workstation synchronizes their codebase with this centralized repository. This centralized repo is also the authoritative copy of the repository, which all workstations should clone from and update to.

So, for the centralized model, everyone performs a "git push" to this central repo, which every workstation nominates as its "origin" for the software repo. Before a "git push" succeeds, Git needs to ensure that no one else has modified the base copy that you retrieved from the centralised server. Otherwise, Git will require that you perform a "git pull" to merge the changes made by others into your local repository (which might sometimes result in a conflicted state if someone changed the same files you have). Only after this is done will you be allowed to push the new merged commit to the server. If you have yet to make changes to your local copy and there is a new commit on the "centralized" server since your last synchronization, all you need to do is perform a "git pull" and Git does something called a "fast-forward", which essentially brings your local copy up to the latest code from the centralized server. If all this sounds rather convoluted, it is actually simpler than it sounds. I would recommend Scott Chacon's Pro Git book, which clearly explains Git's workings (here's a link to his blog).

So what if it's a DVCS? When you use Git in your software development, you will start to realize that making experimental code changes is not as painful as it used to be with other tools. The main reasons for that are the ability to do a "git stash" whenever you want to, or a "git branch", which, as the name implies, creates a new branch off your current working code. Branching in Git is an extremely cheap operation, and it allows you to create an experimental branch almost instantaneously without having to explain yourself to all your teammates who are working on the same codebase. This is because you choose what you want to push to the "centralized" Git repo. Also, whenever you need to check out a version of the software from history or from a remote, you can "git stash" your work into a temporary store in the Git repository and retrieve the stash later, when you are done with whatever you needed to do with that version. Ever tried creating a branch in ClearCase or Perforce? I shudder at the mere thought of doing it. SVN does it better, but it is still a rather slow operation that requires plenty of network transfers. Once you have branched using Git, you will never want to go back to your old VCS tool. Reference: Using Git from our JCG partner Lim Han at the Developers Corner blog....

Java 8 Friday: Better Exceptions

At Data Geekery, we love Java. And as we're really into jOOQ's fluent API and query DSL, we're absolutely thrilled about what Java 8 will bring to our ecosystem.

Java 8 Friday

Every Friday, we're showing you a couple of nice new tutorial-style Java 8 features, which take advantage of lambda expressions, extension methods, and other great stuff. You'll find the source code on GitHub.

Better Exceptions

I had the idea when I stumbled upon JUnit GitHub issue #706, which is about a new method proposal: ExpectedException#expect(Throwable, Callable). One suggestion was to create an interceptor for exceptions like this:

assertEquals(Exception.class, thrown(() -> foo()).getClass());
assertEquals("yikes!", thrown(() -> foo()).getMessage());

On the other hand, why not just add something completely new along the lines of this?

// This is needed to allow for throwing Throwables
// from lambda expressions
@FunctionalInterface
interface ThrowableRunnable {
    void run() throws Throwable;
}

// Assert a Throwable type
static void assertThrows(
    Class<? extends Throwable> throwable,
    ThrowableRunnable runnable
) {
    assertThrows(throwable, runnable, t -> {});
}

// Assert a Throwable type and implement more
// assertions in a consumer
static void assertThrows(
    Class<? extends Throwable> throwable,
    ThrowableRunnable runnable,
    Consumer<Throwable> exceptionConsumer
) {
    boolean fail = false;

    try {
        runnable.run();
        fail = true;
    }
    catch (Throwable t) {
        if (!throwable.isInstance(t))
            Assert.fail("Bad exception type");

        exceptionConsumer.accept(t);
    }

    if (fail)
        Assert.fail("No exception was thrown");
}

So the above methods both assert that a given throwable is thrown from a given runnable – ThrowableRunnable to be precise, because most functional interfaces, unfortunately, don't allow for throwing checked exceptions. See this article for details. We're now using the above hypothetical JUnit API as such:

assertThrows(Exception.class, () -> {
    throw new Exception();
});

assertThrows(Exception.class, () -> {
    throw new Exception("Message");
}, e -> assertEquals("Message", e.getMessage()));

In fact, we could even go further and declare an exception swallowing helper method like this:

// This essentially swallows exceptions
static void withExceptions(
    ThrowableRunnable runnable
) {
    withExceptions(runnable, t -> {});
}

// This delegates exception handling to a consumer
static void withExceptions(
    ThrowableRunnable runnable,
    Consumer<Throwable> exceptionConsumer
) {
    try {
        runnable.run();
    }
    catch (Throwable t) {
        exceptionConsumer.accept(t);
    }
}

This is useful to swallow all sorts of exceptions. The following two idioms are thus equivalent:

try {
    // This will fail
    assertThrows(SQLException.class, () -> {
        throw new Exception();
    });
}
catch (Throwable t) {
    t.printStackTrace();
}

withExceptions(
    // This will fail
    () -> assertThrows(SQLException.class, () -> {
        throw new Exception();
    }),
    t -> t.printStackTrace()
);

Obviously, these idioms aren't necessarily more useful than an actual try .. catch .. finally block, specifically because they do not support proper typing of exceptions (at least not in this example), nor do they support the try-with-resources statement. Nonetheless, such utility methods will come in handy every now and then. Reference: Java 8 Friday: Better Exceptions from our JCG partner Lukas Eder at the JAVA, SQL, AND JOOQ blog....

ElasticSearch-Hadoop: Indexing product views count and customer top search query from Hadoop to ElasticSearch

This post covers how to use ElasticSearch-Hadoop to read data from a Hadoop system and index it in ElasticSearch. The functionality it covers is indexing the product views count and the top search query per customer over the last n days. The analyzed data can further be used on the website to display the customer's recently viewed items, the product views count and the top search query string. This is in continuation of the previous posts on:

Customer product search clicks analytics using big data
Flume: Gathering customer product search clicks data using Apache Flume
Hive: Query customer top search query and product views count using Apache Hive

We already have the customer search clicks data gathered using Flume and stored in Hadoop HDFS and ElasticSearch, and we have seen how to analyze the same data using Hive to generate statistical data. Here we will further see how to use the analyzed data to enhance the customer experience on the website and make it relevant for the end customers.

Recently Viewed Items

We have already covered in the first part how we can use the Flume ElasticSearch sink to index the recently viewed items directly to an ElasticSearch instance, and that data can be used to display real-time clicked items for the customer.

ElasticSearch-Hadoop

Elasticsearch for Apache Hadoop allows Hadoop jobs to interact with ElasticSearch with a small library and easy setup. Elasticsearch-hadoop-hive allows access to ElasticSearch using Hive. As shared in the previous post, we have the product views count and the customer top search query data extracted in Hive tables. We will read and index the same data to ElasticSearch so that it can be used for display purposes on the website.

Product views count functionality

Take a scenario to display each product's total views by customers in the last n days. For a better user experience, you can use the same functionality to show the end customer how other customers perceive the same product.

Hive data for product views

Select sample data from the hive table:

# search.search_productviews : id, productid, viewcount
61, 61, 15
48, 48, 8
16, 16, 40
85, 85, 7

Product views count indexing

Create the Hive external table "search_productviews_to_es" to index data to the ElasticSearch instance:

Use search;
DROP TABLE IF EXISTS search_productviews_to_es;
CREATE EXTERNAL TABLE search_productviews_to_es (id STRING, productid BIGINT, viewcount INT)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES('es.resource' = 'productviews/productview',
                  'es.nodes' = 'localhost', 'es.port' = '9210',
                  'es.input.json' = 'false', 'es.write.operation' = 'index',
                  'es.mapping.id' = 'id', 'es.index.auto.create' = 'yes');
INSERT OVERWRITE TABLE search_productviews_to_es
    SELECT qcust.id, qcust.productid, qcust.viewcount FROM search_productviews qcust;

The external table search_productviews_to_es is created pointing to the ES instance.
The ElasticSearch instance configuration used is localhost:9210.
The index "productviews" and document type "productview" will be used to index the data.
The index and mappings will be created automatically if they do not exist.
Insert overwrite will override the data if it already exists, based on the id field.
Data is inserted by selecting data from another Hive table, "search_productviews", which stores the analytic/statistical data.

Execute the Hive script in Java to index the product views data, HiveSearchClicksServiceImpl.java:

Collection<HiveScript> scripts = new ArrayList<>();
HiveScript script = new HiveScript(new ClassPathResource("hive/load-search_productviews_to_es.q"));
scripts.add(script);
hiveRunner.setScripts(scripts);
hiveRunner.call();

productviews index sample data

The sample data in the ElasticSearch index is stored as below:

{id=48, productid=48, viewcount=10}
{id=49, productid=49, viewcount=20}
{id=5, productid=5, viewcount=18}
{id=6, productid=6, viewcount=9}

Customer top search query string functionality

Take a scenario where you may want to display the top search query strings of a single customer or of all the customers on the website. You can use the same data to display a top search query cloud on the website.

Hive data for customer top search queries

Select sample data from the hive table:

# search.search_customerquery : id, querystring, count, customerid
61_queryString59, queryString59, 5, 61
298_queryString48, queryString48, 3, 298
440_queryString16, queryString16, 1, 440
47_queryString85, queryString85, 1, 47

Customer top search queries indexing

Create the Hive external table "search_customerquery_to_es" to index data to the ElasticSearch instance:

Use search;
DROP TABLE IF EXISTS search_customerquery_to_es;
CREATE EXTERNAL TABLE search_customerquery_to_es (id String, customerid BIGINT, querystring String, querycount INT)
    STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
    TBLPROPERTIES('es.resource' = 'topqueries/custquery',
                  'es.nodes' = 'localhost', 'es.port' = '9210',
                  'es.input.json' = 'false', 'es.write.operation' = 'index',
                  'es.mapping.id' = 'id', 'es.index.auto.create' = 'yes');
INSERT OVERWRITE TABLE search_customerquery_to_es
    SELECT qcust.id, qcust.customerid, qcust.queryString, qcust.querycount FROM search_customerquery qcust;

The external table search_customerquery_to_es is created pointing to the ES instance.
The ElasticSearch instance configuration used is localhost:9210.
The index "topqueries" and document type "custquery" will be used to index the data.
The index and mappings will be created automatically if they do not exist.
Insert overwrite will override the data if it already exists, based on the id field.
Data is inserted by selecting data from another Hive table, "search_customerquery", which stores the analytic/statistical data.

Execute the Hive script in Java to index the data, HiveSearchClicksServiceImpl.java:

Collection<HiveScript> scripts = new ArrayList<>();
HiveScript script = new HiveScript(new ClassPathResource("hive/load-search_customerquery_to_es.q"));
scripts.add(script);
hiveRunner.setScripts(scripts);
hiveRunner.call();

topqueries index sample data

The topqueries index data on the ElasticSearch instance is as shown below:

{id=474_queryString95, querystring=queryString95, querycount=10, customerid=474}
{id=482_queryString43, querystring=queryString43, querycount=5, customerid=482}
{id=482_queryString64, querystring=queryString64, querycount=7, customerid=482}
{id=483_queryString6, querystring=queryString6, querycount=2, customerid=483}
{id=487_queryString86, querystring=queryString86, querycount=111, customerid=487}
{id=494_queryString67, querystring=queryString67, querycount=1, customerid=494}

The functionality described above is only sample functionality and of course needs to be extended to map to a specific business scenario. This may cover the business scenario of displaying a search query cloud to customers on the website, or feed further Business Intelligence analytics.

Spring Data

Spring Data ElasticSearch has also been included for testing purposes, to create ES repositories to count total records and delete all records. Check the service ElasticSearchRepoServiceImpl.java for details.

Total product views:

@Document(indexName = "productviews", type = "productview", indexStoreType = "fs", shards = 1, replicas = 0, refreshInterval = "-1")
public class ProductView {
    @Id
    private String id;
    @Version
    private Long version;
    private Long productId;
    private int viewCount;
    ...
}

public interface ProductViewElasticsearchRepository extends ElasticsearchCrudRepository<ProductView, String> {
}

long count = productViewElasticsearchRepository.count();

Customer top search queries:

@Document(indexName = "topqueries", type = "custquery", indexStoreType = "fs", shards = 1, replicas = 0, refreshInterval = "-1")
public class CustomerTopQuery {
    @Id
    private String id;
    @Version
    private Long version;
    private Long customerId;
    private String queryString;
    private int count;
    ...
}

public interface TopQueryElasticsearchRepository extends ElasticsearchCrudRepository<CustomerTopQuery, String> {
}

long count = topQueryElasticsearchRepository.count();

In later posts we will cover analyzing the data further using scheduled jobs:

Using Oozie to schedule coordinated jobs for the hive partition and a bundle job to index data to ElasticSearch.
Using Pig to count the total number of unique customers etc.

Reference: ElasticSearch-Hadoop: Indexing product views count and customer top search query from Hadoop to ElasticSearch from our JCG partner Jaibeer Malik at the Jai's Weblog blog....
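As a small, hedged illustration of how those Spring Data repositories might be exercised, here is a sketch only: the ProductViewIndexChecker class name is made up, and it assumes the ProductView entity and ProductViewElasticsearchRepository shown above are on the classpath and wired by Spring against the localhost:9210 instance.

```java
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Service;

// Hypothetical helper, not part of the article's code base.
@Service
public class ProductViewIndexChecker {

    @Autowired
    private ProductViewElasticsearchRepository productViewElasticsearchRepository;

    // Counts every document the Hive job wrote into the "productviews" index.
    public long countIndexedProductViews() {
        return productViewElasticsearchRepository.count();
    }

    // Standard CrudRepository semantics: wipes the index content,
    // handy when resetting test data between runs.
    public void deleteAllProductViews() {
        productViewElasticsearchRepository.deleteAll();
    }
}
```

Because ElasticsearchCrudRepository extends Spring Data's CrudRepository, count() and deleteAll() come for free without writing any query code.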

GitHub vs. Bitbucket: It’s More Than Just Features

Let's go back to 2005, when BitKeeper, host of the Linux kernel project back then, pulled the trigger and changed its core policies around pricing. The kernel's license was an especially thorny issue after a free BitKeeper clone was created by Andrew Tridgell – a key figure in the open-source community. Linus Torvalds didn't like how the whole thing unfolded (to say the least), and began working on his own distributed version control system called Git (British slang for a rotten person). He's famously quoted for it: "I'm an egotistical bastard, so I name all my projects after myself. First Linux, now Git". Mercurial was another worthy alternative being developed for the Linux kernel by Matt Mackall with a similar purpose. Git eventually prevailed, and 3 years after that Bitbucket and GitHub were born. If one existed, I would pay honest money to watch a documentary about the full story. But now that we're done with this short piece of repo-history, let's dig deeper into what each service offers us today, and also share some insights we gathered over time from our own experience with buckets and octocats.

Does my code have to be public? Bitbucket and GitHub take different approaches to private and public repositories. This is at the heart of their pricing model, or even philosophy you might say. We'll talk more about these differences below. Bitbucket offers unlimited free private repos while GitHub charges for them. Public repositories are unlimited and free in both services for an unlimited number of contributors. Bottom line: No, you'll get free private repositories on Bitbucket and pay for them on GitHub.

Where is it easier to work on open-source projects? The difference in approach continues with the second aspect of pricing – the number of collaborators. Bitbucket's main offering is a free account with up to 5 collaborators on private repositories, while GitHub's focus is on its public repositories, so it has an edge there. Although they offer many similar features for code hosting, GitHub has been focused on open-source while Bitbucket seems to be more focused on enterprise developers, especially after its acquisition by Atlassian in 2010. Bottom line: GitHub is the undisputed home for open-source.

Mirror mirror on the wall, who forks best of them all? GitHub is definitely winning the popularity contest, having hit the 4M user mark. Bitbucket on the other hand is no underdog, offering a well rounded experience, as well as being part of Atlassian's product suite. Both offer a slick front-end which includes issue tracking, wikis, easy to use REST APIs, and rich GUI and command line tools for Windows, Mac, Linux and even mobile. You could argue GitHub is ahead here, but it's sometimes a matter of taste. One central feature available on GitHub but not on Bitbucket is Gists, which let you apply version control to shareable code snippets or just plain text. There is a popular open issue on Bitbucket to implement this with Mercurial, but for now it doesn't look like it's happening. Another highly ranked open issue on Bitbucket which is already available on GitHub is two-factor authentication. Almost forgot, you can't spoon on GitHub! Bottom line: It's a matter of taste.

Pages – The 2048 Effect. A nice feature both services share is Pages – hosting simple HTML pages and opening up projects to users who may not necessarily be developers. You could say it's a hellish feature for developers, having burnt some fine productive hours playing 2048 and its clones… This feature is pretty much the same on both services. You can create a repo named either username.bitbucket.org or username.github.io and get your own nifty URL. github.io URLs are turning out to be a semi-obligatory feature for many open-source libraries and projects, paired with the complementary "Fork me on GitHub" banner. Watch out though: if you're using a custom domain, it might cost you some precious loading time. Bottom line: Awesome feature, available on both services.

To see and be seen. The difference between GitHub's and Bitbucket's approach is also evident in the Explore page. On Bitbucket this only shows up as a simple Search. GitHub on the other hand boasts trending repos and showcases popular topics, not to mention its use as a portfolio for developers and an open job board. Bottom line: Unless someone is specifically looking for your project, it won't be found on Bitbucket.

Where do they stand with community support? A quick look at the newest questions on Stack Overflow will reveal that GitHub is asked about every couple of minutes, while Bitbucket questions take about an hour or two to resurface. You will find an answer to either question you might have, though. Great resources and online communities are also available on the websites themselves, but GitHub is definitely miles ahead here. When we looked into the most popular libraries in Java, Ruby and JS, there was no doubt the place to look was GitHub. Again, its open-source nature has gained it a golden reputation. Bottom line: GitHub, GitHub and GitHub.

Can I switch between the two? Yes you can. Bitbucket makes it pretty straightforward to import your repositories from GitHub. I can't say the same for the other way around, but it is possible of course, and a few walkthroughs are available out there. Some more work will be needed to transfer issues and wikis as well. Bottom line: It's possible, but you'll sweat less moving to Bitbucket.

What is it about Git and Mercurial? Although not the center of this post, you can't talk about Bitbucket and GitHub without answering this question for yourself. Bitbucket was conceived as a tool for Mercurial and added Git support in 2011, while GitHub was all about Git from the beginning. There is no absolute right decision and they are actually very similar; check out this comparison right here. The main trade-off is a steep learning curve for Git in exchange for greater control than in Mercurial. If you're migrating to a distributed version control system for the first time from systems like CVS or SVN, it is often considered easier with Mercurial. Bottom line: Mercurial is faster to learn, but Git offers greater control.

What to expect with pricing? Besides the enterprise options, Bitbucket puts a price tag of between $10-200 for 5-Unlimited collaborators. On GitHub, pricing is divided into personal and organizational accounts. The organizational accounts offer a team management layer and range between $25-200 per month for 10-125 private repositories. Personal accounts range between $7-50 per month for 5-50 private repositories. Academics enjoy free or discounted accounts on both. Bottom line: Check out the attached comparison table and see for yourself.

What happened to on-premise? Both services offer on-premise solutions, and this is where Bitbucket hopes to outgun GitHub with a similar product by Atlassian called Stash. Unlike other account types, pricing here is a bigger pain point. GitHub asks for $5,000 per 20 developers, while Bitbucket's Stash starts at only $10 for small teams of up to 10 developers, and $1,800 for 11-25 developers. When rising to hundreds of developers, Stash offers much lower prices for similar functionality as in GitHub. Some famous users of GitHub Enterprise are Blizzard, Rackspace and Etsy. NASA, Netflix and Philips are with Stash. Bottom line: GitHub Enterprise is way more expensive than Stash, offering similar functionality.

And what about you? Hope this was helpful and helped clarify things. Reference: GitHub vs. Bitbucket: It's More Than Just Features from our JCG partner Alex Zhitnitsky at the Takipi blog....

How to build Java based cloud application

Recently, we were tasked to develop a SaaS application for big data analysis. To do the data mining, the system needs to store billions of public posts in the database and run a classification process on them. Classification in our context is a slow, resource-intensive and painful process that assigns a topic or sentiment to every record in the database. The process can last up to 24 hours with our testing data. To cope with these requirements, our obvious choice was to build a cloud application on Amazon Web Services. After working on the project for a while, I want to share my own thoughts, understanding and approach to building a Java based cloud application.

What is Cloud Computing

Let's start with Wikipedia first: "Cloud computing involves distributed computing over a network, where a program or application may run on many connected computers at the same time." The definition may be a bit ambiguous, but that is understandable, as "in the cloud" itself is more of a marketing term than a technical term. For a newbie, it is easier to understand if we define it in a more practical way: the only difference between a traditional web application and a cloud web application is the ability to scale perfectly. A cloud application should be able to cope with an unlimited amount of work given unlimited hardware. Cloud applications are getting popular nowadays because of the higher requirements of modern applications. In the past, Google was famous for building high-scale applications that contain almost all available information on the internet. Now, however, many other corporations need to build applications that serve a similar scale of data and computation (Facebook, Youtube, LinkedIn, Twitter, and also the people who crawl and process their data, like us). This amount of data and processing cannot be handled with the traditional way of developing applications. That leads us to an entirely different approach to building applications that can scale very well: the cloud application.

Why the traditional approach of developing web applications does not scale well enough

Let's take a look at why a traditional application cannot serve that scale of data. If you have developed a traditional web application, it should be pretty similar to the diagram above. There are some minor variations, such as merging the application server and web server, or having multiple enterprise servers. However, most of the time the database is relational. Web servers are normally stateful, while enterprise servers can serve both stateless and stateful services. There are some crucial weaknesses that prevent this architecture from scaling well enough. Let's start our analysis by defining perfect scalability first. Perfect scalability is achieved if a system can always provide identical response times for double the amount of work given double the bandwidth and double the hardware. Perfect scalability cannot be achieved in real life; rather, developers only aim to achieve near-perfect scalability. For example, DNS servers are out of our control; hence, theoretically, we cannot serve a higher number of requests than the DNS servers can. This is the upper bound for any system, even Google.

SQL

Coming back to the diagram above, the biggest weakness is database scalability. When the number of requests and the size of the data are small enough, developers should not notice any performance impact when the load increases. As the load grows higher, the impact can become very obvious if the CPU is 100% utilized or the memory fully occupied. At this point, the most realistic option is to pump more memory and CPU into the database system. After this, the system may perform well again. Unfortunately, this approach cannot be repeated forever whenever problems arise. There is a limit where, no matter how much RAM and CPU you have, performance will slowly get worse. This is expected, because there will be certain records that need to be created, read, updated and deleted (CRUD) by many requests. No matter whether you choose to cache them, store them in memory or do whatever trick, they are unique records, persisted on a single machine, and there is a limit to the number of access requests that can be sent to a single memory address. This is the unavoidable limit, as SQL is built for integrity. To ensure integrity, it is necessary that any piece of information in an SQL server be unique. This characteristic is still applicable even after data segregation or replication is done (at least for the primary instance). In contrast, NoSQL does not attempt to normalize data. Instead, it chooses to store aggregate objects, which may contain duplicated information. Therefore, NoSQL is only applicable if data integrity is not compulsory. The example above (from couchbase.com) shows how data is stored in a document database versus a relational database. If a family contains many members, a relational database stores only a single address for all of them, while a NoSQL database simply replicates the housing address. When a family relocates, the housing addresses of all members may not be updated in a single transaction, which causes a data integrity violation. However, for our application and many others, this temporary violation is acceptable. For example, you may not need the number of page views on your social page or the number of public posts on a social website to be 100% accurate. Data duplication effectively removes the concurrent access to a single memory address that we mentioned above and gives developers the option to store data anywhere they want, as long as the changes in one node can slowly be synced to the other nodes. This architecture is much more scalable.

Stateful

The next problem is stateful services. A stateful service requires the same set of hardware to serve requests from the same client. When the number of clients increases, the best possible move is to deploy more application servers and web servers into the system. However, resource allocation cannot be fully optimized with stateful services. For traditional applications, the load balancer does not have any information about system load and normally spreads the requests to different servers using the Round Robin technique. The problem here is that not all requests are equal and not all clients send an identical number of requests. That causes some servers to be heavily overloaded while others are still idle.

Mixing of data retrieval and processing

For traditional applications, the server that retrieves data from the database ends up processing it. There is no clear separation between processing data and retrieving data. Both of these tasks can cause a bottleneck in the system. If the bottleneck comes from data retrieval, data processing is under-utilized, and vice versa.

Rethinking the best approaches to build a scalable application

Looking at what has been adopted in our IT field recently, I hardly find anything that is a new invention. Rather, these are adoptions of practices that have been used successfully in real life to solve scalability issues. To illustrate this, let's imagine a real-life situation of tackling a scalability issue.

Hospital

Assume that we have a small hospital. Our hospital mostly serves loyal customers. Each loyal customer has a personal doctor, who keeps track of his or her medical record. Because of this, customers only need to show their ICs to be served by their preferred doctors. To make things challenging, our hospital is functioning before the internet era.

Stateless versus stateful

Does the description above look similar enough to a stateful service? Now, your hospital is getting famous and the number of customers suddenly surges. Provided that you have enough infrastructure, the obvious option is to hire more doctors and nurses. However, customers are not willing to try out new doctors. That causes the new staff to be free while the old staff are busy. To ensure optimization, you change the hospital policy so that customers must keep their own medical records and the hospital will assign them to any available doctor. This new practice resolves all of your headaches and gives you the option to deploy more seasonal staff to cope with a sudden surge of clients. Well, this policy may not make the customers happy, but in the IT field, stateless and stateful services provide identical results.

Data Duplication

Let's say the number of customers constantly surges and you start to consider opening more branches. At the same time, there is a newly rising problem: customers constantly complain about the need to bring their medical records when visiting the hospital. To solve this problem, you come back to the original policy of storing the medical records at the hospital. However, as you now have more than one branch, each branch needs to store a copy of a user's medical records. At the end of the day or the week, any record change needs to be synced to every branch.

Separation of Services

After running the hospital for a few months, you recognize that resource allocation is not very optimized. For example, you have blood test and X-ray facilities in both branch A and branch B. However, there are many customers doing blood tests in branch A and many people taking X-rays in branch B. This causes customers to keep waiting at one branch while no one visits the other. To optimize resources, you shut down the under-utilized facilities and set up a dedicated blood test centre and X-ray centre. Customers will be sent from the branches to the specialized centres for these special services.

Adhoc Resources

It is hard to do resource planning for a hospital. There are seasonal diseases that only happen at a certain time of the year. Moreover, a catastrophe may happen at any time, causing a sudden surge of warded patients for a short period. To cope with this, you may want to sign an agreement with the city council to temporarily rent facilities when needed and hire more part-time staff.

Apply these ideas to build a cloud application

Now, after looking at the example above, you may feel that most of the ideas make sense. It only took a short while before developers started to apply these ideas to building web applications. Then, we moved into the cloud application era.

How to build a cloud application

To build a cloud application, we need to find ways to apply the ideas mentioned above to our application. Here is my suggested approach.

Infrastructure

If you start to think about building a cloud application, infrastructure is the first concern. If your platform does not support ad hoc resources (dynamically bursting the existing server spec or spawning new instances), it is very hard to build a cloud application. At the moment, we choose AWS because it is the most mature platform on the market. We moved from internal hosting to AWS hosting one year ago due to some major benefits:

Multiple locations: Our customers come from all 5 continents; using Amazon Regions, we can deploy instances closer to the customer's location and thereby reduce response time.
Monitoring and Auto Scaling: Amazon offers quite a decent monitoring service for its platform. Based on server load, it is possible to do Auto Scaling.
Content Delivery Network: Amazon CloudFront gives us the option to offload static content from our main deployment, which improves page load time. Similar to normal instances, static content can be served from the instances nearest to the customer.
Synchronized and distributed caching: MemCache has been our preferred caching solution over the years. However, one major concern is the lack of support for synchronization among the nodes. Amazon ElastiCache gives us the option to use MemCache without worrying about node synchronization.
Management API: This is one major advantage. Recently, we started to make use of the Management API to spawn instances for a short while to run integration tests.

Database

Provided that you have selected the platform for developing your cloud application, the next step is selecting the right database for your system. The first decision you need to make is whether SQL or NoSQL is the right choice. If the system is not data intensive, SQL should be fine; if the reverse is true, you should consider NoSQL. Sometimes, multiple databases can be used together. For example, if we want to implement a social network application like Facebook, it is possible to store system settings or even user profiles in an SQL database. In contrast, user posts must be stored in a NoSQL database due to the huge volume of data. Moreover, we can choose SOLR to store public posts, due to its strong searching capability, and MongoDB for storing user activities. If possible, choose a database system that supports clustering, data segregation and load balancing. If not, you may end up implementing all of these features yourself. For example, SOLR should be the better choice compared to Lucene, unless we want to do our own data segregation.

Computing intensive or data intensive

It is better if we know whether the system is data intensive or computing intensive. For example, a social network like Facebook is pretty much data intensive, while our big data analysis is both data intensive and computing intensive. For a data intensive system, we can let any node in the cloud retrieve data and do the processing as well. For a computing intensive system, it is better to split data retrieval from data processing. Data intensive systems normally serve real-time data, while computing intensive systems run background jobs to process data. Mixing these two heavy tasks in the same environment may end up reducing the system's effectiveness. For the computing cloud, it is better to have a framework to monitor load, distribute tasks and collect results at the end of the computing process. If you do not need the processing to be real time, Hadoop is the best choice on the market. If real-time computation is required, consider Apache Storm.

Design patterns for a cloud application

To build a successful cloud application, there are some things that we should keep in mind:

Stateless. It is a must to make all your services and servers stateless. If a service needs user data, include it as a parameter in the API. It is worth noting that to implement a stateless session on the web server, we have a few choices to consider: cookie-based sessions, distributed cache sessions and database sessions. The solutions above are sorted from top to bottom with decreasing scalability but easier management.

Idempotence. For a cloud application, most of the API calls happen over the network rather than as internal method calls. Therefore, it is better if we can make the method calls safe. If you stick to the stateless principle above, it is likely that the services you implement are already idempotent.

Remote Facade. Remote Facade is different from the Facade pattern. They may look similar in terms of practice, but they aim to fix different problems. As most of your API calls happen over the network, the network latency contributes a great part of the response time. With the Remote Facade pattern, developers should build a coarse-grained API so that the number of calls can be reduced (a minimal sketch of this idea follows at the end of this post). In layman's terms, it is better to go to the supermarket and buy 10 things in one trip rather than visit 10 times, buying one thing each time.

Data Access Object. As you may transfer data around, be careful with the amount of data you transfer. It is best to send only the minimum data required.

Play safe. This is not a design pattern, but you will thank yourself for playing safe in the future. Due to the nature of distributed computing, when something goes wrong, it is very difficult to find out which part is at fault. If possible, implement health checks, ping, thorough logging and a debug mode for every component in the system.

Conclusion

I hope this approach to building cloud applications can bring some benefit to everyone. If you have other opinions or experiences, kindly give feedback and share with us. In the next article, I will share the design of our Social Monitoring Tool. Reference: How to build Java based cloud application from our JCG partner Tony Nguyen at the Developers Corner blog....
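To make the Remote Facade idea above a little more concrete, here is a minimal, hypothetical Java sketch. The OrderFacade, OrderSummary and fine-grained service interfaces are invented names for illustration only, not part of any framework mentioned in the article; the point is simply that one coarse-grained, stateless call replaces several fine-grained network round trips.

```java
// Fine-grained services: calling each one separately over the network
// would cost one round trip per call.
interface CustomerService { String customerName(long customerId); }
interface PricingService  { double total(long orderId); }
interface ShippingService { String status(long orderId); }

// Coarse-grained result object carrying only the data the caller needs.
final class OrderSummary {
    final String customerName;
    final double total;
    final String shippingStatus;

    OrderSummary(String customerName, double total, String shippingStatus) {
        this.customerName = customerName;
        this.total = total;
        this.shippingStatus = shippingStatus;
    }
}

// Remote Facade: a single entry point that aggregates the fine-grained
// calls on the server side, so the client pays for one network round trip
// instead of three.
final class OrderFacade {
    private final CustomerService customers;
    private final PricingService pricing;
    private final ShippingService shipping;

    OrderFacade(CustomerService customers, PricingService pricing, ShippingService shipping) {
        this.customers = customers;
        this.pricing = pricing;
        this.shipping = shipping;
    }

    // All state arrives as parameters, keeping the facade stateless and
    // therefore easy to scale horizontally.
    OrderSummary orderSummary(long customerId, long orderId) {
        return new OrderSummary(
            customers.customerName(customerId),
            pricing.total(orderId),
            shipping.status(orderId));
    }
}
```

On the client side, a single orderSummary(customerId, orderId) call then carries everything needed to render a page, which is exactly the "buy 10 things in one trip" trade-off described above.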

Why use SerialVersionUID inside Serializable class in Java

Serialization and SerialVersionUID always remain a puzzle for many Java developers. I often see questions like what is this SerialVersionUID, or what will happen if I don't declare SerialVersionUID in my Serializable class? Apart from the complexity involved and its rare use, one more reason for these questions is the Eclipse IDE's warning about the absence of a SerialVersionUID, e.g. "The Serializable class Customer does not declare a static final SerialVersionUID field of type long". In this article, you will not only learn the basics of Java SerialVersionUID but also its effect on the serialization and de-serialization process. When you declare a class as Serializable by implementing the marker interface java.io.Serializable, the Java runtime persists instances of that class to disk using the default serialization mechanism, provided you have not customized the process using the Externalizable interface. During serialization, the Java runtime creates a version number for the class, so that it can de-serialize it later. This version number is known as the SerialVersionUID in Java. If, during de-serialization, the SerialVersionUID doesn't match, the process will fail with an InvalidClassException, as in Exception in thread "main" java.io.InvalidClassException, also printing the class name and the respective SerialVersionUIDs. The quick solution to this problem is copying the SerialVersionUID and declaring it as a private static final long constant in your class. In this article, we will learn why we should use SerialVersionUID in Java and how to use the serialver JDK tool for generating this ID. If you are new to serialization, you can also see the Top 10 Java Serialization Interview Questions to gauge your knowledge and find gaps in your understanding for further reading. Similar to concurrency and multi-threading, serialization is another topic which deserves a couple of readings.

Why use SerialVersionUID in Java

As I said, when we don't declare SerialVersionUID as a static, final long value in our class, the serialization mechanism creates it for us. This mechanism is sensitive to many details, including the fields in your class, their access modifiers, the interfaces they implement and even different compiler implementations. Any change to the class, or using a different compiler, may result in a different SerialVersionUID, which may eventually stop you from reloading serialized data. It's too risky to rely on the Java serialization mechanism for generating this id, and that's why it's recommended to declare an explicit SerialVersionUID in your Serializable class. I strongly suggest reading Joshua Bloch's classic Java title, Effective Java, to understand Java serialization and the issues of handling it incorrectly. By the way, the JDK also provides a tool called serialver, located in the bin directory of the JAVA_HOME folder (on my machine C:\Program Files\Java\jdk1.6.0_26\bin\serialver.exe), which can be used to generate a SerialVersionUID for old classes. This is very helpful in case you have made changes in your class which are breaking serialization and your application is not able to reload serialized instances. You can simply use this tool to create a SerialVersionUID for the old instances and then use it explicitly in your class by declaring a private, static, final, long SerialVersionUID field. By the way, it's highly recommended, for both performance and security reasons, to use a customized binary format for serialization; once again, Effective Java has a couple of items which explain the benefits of a custom format in great detail.
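As a small, hedged illustration of that advice, the sketch below shows what an explicitly versioned Serializable class might look like. The Customer class and its fields are hypothetical (borrowed from the Eclipse warning quoted above); only the java.io usage is standard:

```java
import java.io.Serializable;

public class Customer implements Serializable {

    // Declared explicitly so that compatible changes to the class (or a
    // different compiler) do not silently change the version and break
    // de-serialization with an InvalidClassException.
    private static final long serialVersionUID = 1L;

    private String name;
    private transient String sessionToken; // excluded from the serialized form

    public Customer(String name) {
        this.name = name;
    }

    public String getName() {
        return name;
    }
}
```

If the class later evolves in an incompatible way, the constant should be changed deliberately; as long as it matches the value recorded in the serialized stream, ObjectInputStream will keep reloading old instances.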
How to use the serialver JDK tool to generate SerialVersionUID

You can use the JDK's serialver tool to generate a SerialVersionUID for classes. This is particularly useful for evolving classes, and it returns the SerialVersionUID in a format that is easy to copy. You can use the serialver JDK tool as shown in the example below:

$ serialver
use: serialver [-classpath classpath] [-show] [classname...]

$ serialver -classpath . Hello
Class Hello is not Serializable.

$ serialver -classpath . Hello
Hello: static final long SerialVersionUID = -4862926644813433707L;

You can even use the serialver tool in GUI form by running the command $ serialver -show; this will open the serial version inspector, which takes a full class name and shows its serial version.

Summary

Now that we know what SerialVersionUID is and why it's important to declare it in a Serializable class, it's time to revise some of the important facts related to Java SerialVersionUID.

SerialVersionUID is used to version serialized data. You can only de-serialize a class if its SerialVersionUID matches that of the serialized instance.
When we don't declare a SerialVersionUID in our class, the Java runtime generates it for us, but that process is sensitive to a lot of class metadata, including the number of fields, the types of fields, the access modifiers of fields, the interfaces implemented by the class, etc. You can find accurate information in the serialization documentation from Oracle.
It's recommended to declare SerialVersionUID as a private static final long variable to avoid the default mechanism. Some IDEs like Eclipse also display a warning if you miss it, e.g. "The Serializable class Customer does not declare a static final SerialVersionUID field of type long". Though you can disable these warnings by going to Window > Preferences > Java > Compiler > Errors / Warnings > Potential Programming Problems, I suggest not doing that. The only case where I see being careless as acceptable is when restoring the data is not needed. Here is how this warning looks in the Eclipse IDE; all you need to do is accept the first quick fix.
You can even use the serialver tool from the JDK to generate the serial version for classes in Java. It also has a GUI, which can be enabled by passing the -show parameter.
It's a serialization best practice in Java to explicitly declare SerialVersionUID, to avoid any issues during de-serialization, especially if you are running a client-server application which relies on serialized data, e.g. RMI.

That's all about SerialVersionUID in Java. Now we know why it's important to declare SerialVersionUID right in the class, and you can thank your IDE for reminding you about something which may potentially break de-serialization of your class. If you want to learn more about serialization and related concepts, you can also see these amazing articles:

Difference between transient and volatile variable in Java
Difference between Serializable and Externalizable interface in Java
When to use transient variable in Java

Reference: Why use SerialVersionUID inside Serializable class in Java from our JCG partner Javin Paul at the Javarevisited blog....

MySQL Transaction Isolation Levels and Locks

Recently, an application that my team was working on encountered problems with a MySQL deadlock situation, and it took us some time to figure out the reasons behind it. The application was deployed on a 2-node cluster and both nodes are connected to an AWS MySQL database. The MySQL db tables are mostly based on InnoDB, which supports transactions (meaning all the usual commit and rollback semantics) as well as the row-level locking that the MyISAM engine does not provide. The problem arose when our users, due to a poorly designed user interface, were able to execute the same long-running operation twice on the database. As it turned out, because we have a dual-node cluster, each of the user operations originated from a different web application (which in turn meant 2 different transactions running the same queries). The deadlocking query happened to be an "INSERT INTO T… SELECT FROM S WHERE" query, which introduced shared locks on the records that were used in the SELECT query. It didn't help that both T and S in this case happened to be the same table. In effect, both the shared locks and the exclusive locks were applied to the same table. A possible cause of the deadlock can be explained with the following table. This is based on the assumption that we are using the default REPEATABLE_READ transaction isolation level (I will explain the concept of transaction isolation later). Assume that we have a table as such:

RowId | Value
1 | Collection 1
2 | Collection 2
... | Collection N
450000 | Collection 450000

The following is a sample sequence that could possibly cause a deadlock, based on the 2 transactions running an SQL query like "INSERT INTO T SELECT FROM T WHERE …":

Time | Transaction 1 | Transaction 2 | Comment
T1 | Statement executed | | Transaction 1 statement executed. A shared lock is applied to the records read by the selection.
T2 | Read lock s1 on Rows 10-20 | | The lock is on the index across a range. InnoDB has a concept of gap locks.
T3 | | Statement executed | Transaction 2 statement executed. A shared lock similar to s1 is applied by the selection.
T4 | | Read lock s2 on Rows 10-20 | Shared read locks allow both transactions to read the records only.
T5 | Insert lock x1 into Row 13 in index wanted | | Transaction 1 attempts to get an exclusive lock on Row 13 for insertion, but Transaction 2 is holding a shared lock.
T6 | | Insert lock x2 into Row 13 in index wanted | Transaction 2 attempts to get an exclusive lock on Row 13 for insertion, but Transaction 1 is holding a shared lock.
T7 | | | Deadlock!

The above scenario occurs only when we use REPEATABLE_READ (which introduces shared read locks). If we were to lower the transaction isolation level to READ_COMMITTED, we would reduce the chances of a deadlock happening. Of course, this would mean relaxing the consistency of the database records. In the case of our data requirements, we do not have such strict requirements for strong consistency, so it is acceptable for one transaction to read records that are committed by other transactions. So, to delve deeper into the idea of transaction isolation: this concept has been defined by ANSI/ISO SQL as the following, from the highest isolation level to the lowest.

Serializable. This is the highest isolation level and usually requires the use of shared read locks and exclusive write locks (as in the case of MySQL). What this means in essence is that any query made will require a shared read lock on the records, which prevents another transaction's query from modifying these records. Every update statement will require an exclusive write lock. Also, range locks must be acquired when a select statement with a WHERE condition is used. This is implemented as a gap lock in MySQL.

Repeatable Reads. This is the default level used in MySQL. It is mainly similar to Serializable, besides the fact that a range lock is not used. However, the way that MySQL implements this level seems to me a little different. Based on Wikipedia's article on transaction isolation, a range lock is not implemented and so phantom reads can still occur. Phantom reads refer to the possibility that select queries will return additional records when the same query is made within a transaction. However, what I understand from MySQL's documentation is that range locks are still used and the same select queries made in the same transaction will always return the same records. Maybe I'm mistaken in my understanding, and if there are any mistakes in my interpretation, I stand ready to be corrected.

Read Committed. This is an isolation level that will maintain a write lock until the end of the transaction, but read locks are released at the end of the SELECT statement. It does not promise that a SELECT statement will find the same data if it is re-run in the same transaction. It will, however, guarantee that the data that is read is not "dirty" and has been committed.

Read Uncommitted. This is an isolation level that I doubt would be useful for most use cases. Basically, it allows a transaction to see all data that has been modified, including "dirty" or uncommitted data. This is the lowest isolation level.

Having gone through the different transaction isolation levels, we can see how the selection of the transaction isolation level determines the kind of database locking mechanism. From a practical standpoint, the default MySQL isolation level (REPEATABLE_READ) might not always be a good choice when you are dealing with a scenario like ours, where there is really no need for such strong consistency in the data reads. I believe that by lowering the isolation level, we are likely to reduce the chances that database queries meet with a deadlock (a small JDBC sketch of doing so follows at the end of this post). Also, it might even allow higher concurrent access to your database, which improves the performance of your queries. Of course, this comes with the caveat that you need to understand how important consistent reads are for your application. If you are dealing with data where precision is paramount (e.g. your bank accounts), then it is definitely necessary to impose as much isolation as possible, so that you would not read inconsistent information within your transaction. Reference: MySQL Transaction Isolation Levels and Locks from our JCG partner Lim Han at the Developers Corner blog....
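As promised above, here is a minimal, hypothetical JDBC sketch of lowering the isolation level for a single connection; the URL, credentials and the SQL statement are placeholders, and only the java.sql API usage is standard:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class ReadCommittedExample {
    public static void main(String[] args) throws SQLException {
        // Placeholder endpoint and credentials - substitute your own.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/mydb", "user", "password")) {

            // Relax MySQL's default REPEATABLE READ to READ COMMITTED
            // for this connection only, reducing shared-lock contention.
            conn.setTransactionIsolation(Connection.TRANSACTION_READ_COMMITTED);
            conn.setAutoCommit(false);

            try (Statement stmt = conn.createStatement()) {
                // Placeholder for the long-running INSERT INTO ... SELECT ... work.
                stmt.executeUpdate("INSERT INTO t SELECT * FROM s WHERE id < 100");
            }
            conn.commit();
        }
    }
}
```

The same effect can also be configured on the MySQL side per session or globally; setting it on the connection simply keeps the decision next to the code that needs the relaxed consistency.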

See Your Solr Cache Sizes: Eclipse Memory Analyzer

Solr uses different caches to prevent too much IO access and calculations during requests. When indexing doesn’t happen too frequently you can get huge performance gains by employing those caches. Depending on the structure of your index data and the size of the caches they can become rather large and use a substantial part of your heap memory. In this post I would like to show how you can use the Eclipse Memory Analyzer to see how much space your caches are really using in memory. Configuring the Caches All the Solr caches can be configured in solrconfig.xml in the query section. You will find definitions like this: <filterCache class="solr.FastLRUCache" size="8000" initialSize="512" autowarmCount="0"/> This is an example of a filter cache configured to use the FastLRUCache, a maximum size of 8000 items and no autowarming. Solr ships with two commonly used cache implementations, the FastLRUCache, that uses a ConcurrentHashMap and the LRUCache, that synchronizes the calls. Some of the caches are still configured to use the LRUCache but on some read heavy projects I had good results with changing those to FastLRUCache as well. Additionaly, starting from Solr 3.6 there is also the LFUCache. I have never used it and it is still marked as experimental and subject to change. Solr comes with the following caches: FilterCache Caches a bitset of the filter queries. This can be a very effective cache if you are reusing filters. QueryResultCache Stores an ordered list of the document ids for a query. DocumentCache Caches the stored fields of the Lucene documents. If you have large or many fields this cache can become rather large. FieldValueCache A cache that is mainly used for faceting. Additionaly you will see references to the FieldCache which is not a cache managed by Solr and can not be configured. In the default configuration Solr only caches 512 items per cache which can often be too small. You can see the usage of your cache in the administration view of Solr in the section Plugin/Stats/Caches of your core. This will tell you the hit rate as well as the evictions for your caches.The stats are a good starting point for tuning your caches but you should be aware that by setting the size too large you can see some unwanted GC activity. That is why it might be useful to look at the real size of your caches in memory instead of the item count alone. Eclipse MAT Eclipse MAT is a great tool for looking at your heap in memory and see which objects occupy the space. As the name implies it is based on Eclipse and can either be downloaded as a standalone tool or is available via update sites for integration in an existing instance. Heap dumps can be aquired using the tool directly but you can also open existing dumps. On opening it will automatically calculate a chart of the largest objects that might already contain some of the cache objects, if you are keeping lots of items in the cache.Using the links below the pie chart you can also open further automatic reports, e.g. the Top Consumers, a more detailed page on large objects.Even if you do see some of the cache classes here, you can’t really see which of the caches it is that consumes the memory. Using the Query Browser menu on top of the report you can also list instances of classes directly, no matter how large those are.We are choosing List Objects with outgoing references and enter the class name for the FastLRUCache org.apache.solr.search.FastLRUCache. For the default configuration you will see two instances. 
On opening a dump, MAT will automatically calculate a chart of the largest objects, which might already contain some of the cache objects if you are keeping lots of items in the cache. Using the links below the pie chart you can also open further automatic reports, e.g. the Top Consumers report, a more detailed page on large objects. Even if you do see some of the cache classes here, you can’t really tell which of the caches is consuming the memory. Using the Query Browser menu on top of the report you can instead list instances of a class directly, no matter how large they are. We choose List Objects with outgoing references and enter the class name of the FastLRUCache, org.apache.solr.search.FastLRUCache. For the default configuration you will see two instances.

When clicking on one of the instances you can see the name of the cache in the lower left window, in this case the filter cache. There are two numbers available for the heap size: the shallow size and the retained size. When looking at the caches we are interested in the retained size, as this is the amount of memory that would be freed once the instance is garbage collected, i.e. the part of the heap that is held only by the cache. In our case this is around 700kB, but this can grow a lot. You can do the same inspection for org.apache.solr.search.LRUCache to see the real size of those caches as well.

Conclusion

The caches can get a lot bigger than in our example here. Eclipse Memory Analyzer has already helped me a lot to see whether there are any problems with a heap that is growing too large.

Reference: See Your Solr Cache Sizes: Eclipse Memory Analyzer from our JCG partner Florian Hopf at the Dev Time blog....

Integration Tests for External Services

Our systems often depend on 3rd party services (they may even be services internal to the company that we have no control over). Such services include social networks exposing APIs, SaaS with APIs like Salesforce, authentication providers, or any system that our system communicates with but that is outside our product lifecycle. In regular integration tests, we would have an integration deployment of all sub-systems in order to test how they work together. In the case of external services, however, we can only work with the real deployment (given some API credentials). What options do we have to write integration tests, i.e. to check whether our system properly integrates with the external system?

If the service provides a sandbox, that’s the way to go – you have a target environment where you can do anything, and it will be short-lived and not visible to any end-users. This is, however, rare, as most external services do not provide such sandboxes.

Another option is to have an integration test account – e.g. you register an application at twitter, called “yourproduct-test”, create a test twitter account, and provide these credentials to the integration test. That works well if you don’t have complex scenarios involving multi-step interactions and a lot of preconditions. For example, if your application is analyzing tweets over a period of time, you can’t post tweets with the test account in the past.

The third option is mocks. Normally, mocks and integration tests are mutually exclusive, but not in this case. You don’t want to test whether the external service conforms to its specification (or API documentation) – you want to test whether your application invokes it in a proper way and properly processes its responses. Therefore it should be OK to run a mock of the external system that returns predefined results for a predefined set of criteria. These results and criteria should correspond directly to the specifications. This can be easily achieved by running an embedded mock server. There are multiple tools that can be used to do that – here’s a list of some of them for Java: WireMock, MockServer, MockWebServer, Apache Wink. The first three are specifically created for the above use case, while Apache Wink has a simple mock server class as part of a larger project. So, if you want to test whether your application properly posts tweets after each successful purchase, you can (using WireMock, for example) do it as follows:

@Rule
public WireMockRule wireMockRule = new WireMockRule(8089);

@Test
public void purchaseTweetTest() {
    stubFor(post(urlEqualTo("/statuses/update.json"))
        .willReturn(aResponse()
            .withStatus(200)
            .withHeader("Content-Type", "application/json")
            .withBody(getMockJsonResponse())));

    // ...
    purchaseService.completePurchase(purchase);

    verify(postRequestedFor(urlMatching("/statuses/update.json"))
        .withRequestBody(matching(".*purchaseId: " + purchaseId + ".*")));
}

That way you will verify whether your communication with the external service is handled properly in your application, i.e. whether you integrate properly, but you won’t test against an actual system.
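One practical detail the snippet glosses over is that the application under test must send its requests to the embedded server rather than to the real API, otherwise the stub is never hit. How you do that depends entirely on your own code; the sketch below is not from the original post and assumes a hypothetical TwitterClient whose base URL is injectable, so production code can use the real endpoint while the test passes in http://localhost:8089 to match the WireMockRule above.

// Hypothetical client: the endpoint is a constructor argument rather than a hard-coded URL.
public class TwitterClient {

    private final String baseUrl;

    public TwitterClient(String baseUrl) {
        this.baseUrl = baseUrl;
    }

    public String statusUpdateUrl() {
        // Production: "https://api.twitter.com" -> https://api.twitter.com/statuses/update.json
        // Test:       "http://localhost:8089"   -> http://localhost:8089/statuses/update.json
        return baseUrl + "/statuses/update.json";
    }
}

With the endpoint injectable like this (via a constructor argument, a property, or your DI container), the test exercises the same code path as production; only the host it talks to changes.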
Mocking the external service, of course, has a drawback – the rules that you put in your mocks may not be the same as the rules in the external system. You may have misinterpreted the specification/documentation, or it may not cover all corner cases. But for the sake of automated tests, I think this is preferable to supporting test accounts that you can’t properly clean up or seed with test data. These automated integration tests can be accompanied by manual testing on a staging environment, to make sure that the integration really works with the actual external system.

Reference: Integration Tests for External Services from our JCG partner Bozhidar Bozhanov at the Bozho’s tech blog....

Writing Clean Tests – Beware of Magic

It is pretty hard to come up with a good definition for clean code because every one of us has our own definition for the word clean. However, there is one definition which seems to be universal: clean code is easy to read. This might come as a surprise to some of you, but I think that this definition applies to test code as well. It is in our best interests to make our tests as readable as possible because:

- If our tests are easy to read, it is easy to understand how our code works.
- If our tests are easy to read, it is easy to find the problem if a test fails (without using a debugger).

It isn’t hard to write clean tests, but it takes a lot of practice, and that is why so many developers are struggling with it. I have struggled with this too, and that is why I decided to share my findings with you. This is the third part of my tutorial which describes how we can write clean tests. This time we will learn two techniques which can be used to remove magic numbers from our tests.

Constants to the Rescue

We use constants in our code because without constants our code would be littered with magic numbers. Using magic numbers has two consequences:

- Our code is hard to read because magic numbers are just values without meaning.
- Our code is hard to maintain because if we have to change the value of a magic number, we have to find all occurrences of that magic number and update every one of them.

In other words:

- Constants help us to replace magic numbers with something that describes the reason for their existence.
- Constants make our code easier to maintain because if the value of a constant changes, we have to make that change in only one place.

If we think about the magic numbers found in our test cases, we notice that they can be divided into two groups:

- Magic numbers which are relevant to a single test class. A typical example of this kind of magic number is the property value of an object which is created in a test method. We should declare these constants in the test class.
- Magic numbers which are relevant to multiple test classes. A good example of this kind of magic number is the content type of a request processed by a Spring MVC controller. We should add these constants to a non-instantiable class.

Let’s take a closer look at both situations.

Declaring Constants in the Test Class

So, why should we declare some constants in our test class? After all, if we think about the benefits of using constants, the first thing that comes to mind is that we should eliminate magic numbers from our tests by creating classes which contain the constants used in our tests. For example, we could create a TodoConstants class which contains the constants used in the TodoControllerTest, TodoCrudServiceTest, and TodoTest classes. This is a bad idea. Although it is sometimes wise to share data in this way, we shouldn’t make this decision lightly, because most of the time our only motivation to introduce shared constants is to avoid typos and magic numbers. Also, if the magic numbers are relevant only to a single test class, it makes no sense to introduce this kind of dependency between our tests just because we want to minimize the number of created constants. In my opinion, the simplest way to deal with this kind of situation is to declare the constants in the test class. Let’s find out how we can improve the unit test described in the previous part of this tutorial.
That unit test is written to test the registerNewUserAccount() method of the RepositoryUserService class, and it verifies that this method works correctly when a new user account is created by using a social sign-in provider and a unique email address. The source code of that test case looks as follows:

import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.mockito.Mock;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.runners.MockitoJUnitRunner;
import org.mockito.stubbing.Answer;
import org.springframework.security.crypto.password.PasswordEncoder;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;
import static org.mockito.Matchers.isA;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoMoreInteractions;
import static org.mockito.Mockito.verifyZeroInteractions;
import static org.mockito.Mockito.when;

@RunWith(MockitoJUnitRunner.class)
public class RepositoryUserServiceTest {

    private RepositoryUserService registrationService;

    @Mock
    private PasswordEncoder passwordEncoder;

    @Mock
    private UserRepository repository;

    @Before
    public void setUp() {
        registrationService = new RepositoryUserService(passwordEncoder, repository);
    }

    @Test
    public void registerNewUserAccount_SocialSignInAndUniqueEmail_ShouldCreateNewUserAccountAndSetSignInProvider() throws DuplicateEmailException {
        RegistrationForm registration = new RegistrationForm();
        registration.setEmail("john.smith@gmail.com");
        registration.setFirstName("John");
        registration.setLastName("Smith");
        registration.setSignInProvider(SocialMediaService.TWITTER);

        when(repository.findByEmail("john.smith@gmail.com")).thenReturn(null);

        when(repository.save(isA(User.class))).thenAnswer(new Answer<User>() {
            @Override
            public User answer(InvocationOnMock invocation) throws Throwable {
                Object[] arguments = invocation.getArguments();
                return (User) arguments[0];
            }
        });

        User createdUserAccount = registrationService.registerNewUserAccount(registration);

        assertEquals("john.smith@gmail.com", createdUserAccount.getEmail());
        assertEquals("John", createdUserAccount.getFirstName());
        assertEquals("Smith", createdUserAccount.getLastName());
        assertEquals(SocialMediaService.TWITTER, createdUserAccount.getSignInProvider());
        assertEquals(Role.ROLE_USER, createdUserAccount.getRole());
        assertNull(createdUserAccount.getPassword());

        verify(repository, times(1)).findByEmail("john.smith@gmail.com");
        verify(repository, times(1)).save(createdUserAccount);
        verifyNoMoreInteractions(repository);
        verifyZeroInteractions(passwordEncoder);
    }
}

The problem is that this test case uses magic numbers when it creates a new RegistrationForm object, configures the behavior of the UserRepository mock, verifies that the information of the returned User object is correct, and verifies that the correct methods of the UserRepository mock are called in the tested service method.
After we have removed these magic numbers by declaring constants in our test class, the source code of our test looks as follows:

import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.mockito.Mock;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.runners.MockitoJUnitRunner;
import org.mockito.stubbing.Answer;
import org.springframework.security.crypto.password.PasswordEncoder;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;
import static org.mockito.Matchers.isA;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoMoreInteractions;
import static org.mockito.Mockito.verifyZeroInteractions;
import static org.mockito.Mockito.when;

@RunWith(MockitoJUnitRunner.class)
public class RepositoryUserServiceTest {

    private static final String REGISTRATION_EMAIL_ADDRESS = "john.smith@gmail.com";
    private static final String REGISTRATION_FIRST_NAME = "John";
    private static final String REGISTRATION_LAST_NAME = "Smith";
    private static final Role ROLE_REGISTERED_USER = Role.ROLE_USER;
    private static final SocialMediaService SOCIAL_SIGN_IN_PROVIDER = SocialMediaService.TWITTER;

    private RepositoryUserService registrationService;

    @Mock
    private PasswordEncoder passwordEncoder;

    @Mock
    private UserRepository repository;

    @Before
    public void setUp() {
        registrationService = new RepositoryUserService(passwordEncoder, repository);
    }

    @Test
    public void registerNewUserAccount_SocialSignInAndUniqueEmail_ShouldCreateNewUserAccountAndSetSignInProvider() throws DuplicateEmailException {
        RegistrationForm registration = new RegistrationForm();
        registration.setEmail(REGISTRATION_EMAIL_ADDRESS);
        registration.setFirstName(REGISTRATION_FIRST_NAME);
        registration.setLastName(REGISTRATION_LAST_NAME);
        registration.setSignInProvider(SOCIAL_SIGN_IN_PROVIDER);

        when(repository.findByEmail(REGISTRATION_EMAIL_ADDRESS)).thenReturn(null);

        when(repository.save(isA(User.class))).thenAnswer(new Answer<User>() {
            @Override
            public User answer(InvocationOnMock invocation) throws Throwable {
                Object[] arguments = invocation.getArguments();
                return (User) arguments[0];
            }
        });

        User createdUserAccount = registrationService.registerNewUserAccount(registration);

        assertEquals(REGISTRATION_EMAIL_ADDRESS, createdUserAccount.getEmail());
        assertEquals(REGISTRATION_FIRST_NAME, createdUserAccount.getFirstName());
        assertEquals(REGISTRATION_LAST_NAME, createdUserAccount.getLastName());
        assertEquals(SOCIAL_SIGN_IN_PROVIDER, createdUserAccount.getSignInProvider());
        assertEquals(ROLE_REGISTERED_USER, createdUserAccount.getRole());
        assertNull(createdUserAccount.getPassword());

        verify(repository, times(1)).findByEmail(REGISTRATION_EMAIL_ADDRESS);
        verify(repository, times(1)).save(createdUserAccount);
        verifyNoMoreInteractions(repository);
        verifyZeroInteractions(passwordEncoder);
    }
}

This example demonstrates that declaring constants in the test class has three benefits:

- Our test case is easier to read because the magic numbers are replaced with properly named constants.
- Our test case is easier to maintain because we can change the values of the constants without making any changes to the actual test case.
- It is easier to write new tests for the registerNewUserAccount() method of the RepositoryUserService class because we can use constants instead of magic numbers. This means that we don’t have to worry about typos; a sketch of such a test is shown below.
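To illustrate that last point, here is a sketch (not part of the original tutorial) of what a second test reusing the same constants might look like. It assumes that registerNewUserAccount() throws DuplicateEmailException when the repository already contains a user with the given email address, and it returns a Mockito mock of the User class to avoid guessing how User instances are constructed.

    @Test(expected = DuplicateEmailException.class)
    public void registerNewUserAccount_SocialSignInAndDuplicateEmail_ShouldThrowException() throws DuplicateEmailException {
        RegistrationForm registration = new RegistrationForm();
        registration.setEmail(REGISTRATION_EMAIL_ADDRESS);
        registration.setFirstName(REGISTRATION_FIRST_NAME);
        registration.setLastName(REGISTRATION_LAST_NAME);
        registration.setSignInProvider(SOCIAL_SIGN_IN_PROVIDER);

        // The repository already knows this email address, so the registration must fail.
        // Requires the additional static import: import static org.mockito.Mockito.mock;
        when(repository.findByEmail(REGISTRATION_EMAIL_ADDRESS)).thenReturn(mock(User.class));

        registrationService.registerNewUserAccount(registration);
    }

Because the sketch reuses the constants declared above, a typo in the email address or the name fields is impossible, and the fixture reads the same way as the first test.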
However, sometimes our tests use magic numbers which are truly relevant to multiple test classes. Let’s find out how we can deal with this situation.

Adding Constants to a Non-Instantiable Class

If a constant is relevant to multiple test classes, it makes no sense to declare the constant in every test class which uses it. Let’s take a look at one situation where it makes sense to add a constant to a non-instantiable class. Let’s assume that we have to write two unit tests for a REST API:

- The first unit test ensures that we cannot add an empty todo entry to the database.
- The second unit test ensures that we cannot add an empty note to the database.

These unit tests use the Spring MVC Test framework. If you are not familiar with it, you might want to take a look at my Spring MVC Test tutorial. The source code of the first unit test looks as follows:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.MediaType;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.springframework.test.context.web.WebAppConfiguration;
import org.springframework.test.web.servlet.MockMvc;
import org.springframework.test.web.servlet.setup.MockMvcBuilders;
import org.springframework.web.context.WebApplicationContext;

import java.nio.charset.Charset;

import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.post;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(classes = {WebUnitTestContext.class})
@WebAppConfiguration
public class TodoControllerTest {

    private static final MediaType APPLICATION_JSON_UTF8 = new MediaType(
            MediaType.APPLICATION_JSON.getType(),
            MediaType.APPLICATION_JSON.getSubtype(),
            Charset.forName("utf8")
    );

    private MockMvc mockMvc;

    @Autowired
    private ObjectMapper objectMapper;

    @Autowired
    private WebApplicationContext webAppContext;

    @Before
    public void setUp() {
        mockMvc = MockMvcBuilders.webAppContextSetup(webAppContext).build();
    }

    @Test
    public void add_EmptyTodoEntry_ShouldReturnHttpRequestStatusBadRequest() throws Exception {
        TodoDTO addedTodoEntry = new TodoDTO();

        mockMvc.perform(post("/api/todo")
                .contentType(APPLICATION_JSON_UTF8)
                .content(objectMapper.writeValueAsBytes(addedTodoEntry))
        )
                .andExpect(status().isBadRequest());
    }
}

The source code of the second unit test looks as follows:

import com.fasterxml.jackson.databind.ObjectMapper;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.http.MediaType;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.springframework.test.context.web.WebAppConfiguration;
import org.springframework.test.web.servlet.MockMvc;
import org.springframework.test.web.servlet.setup.MockMvcBuilders;
import org.springframework.web.context.WebApplicationContext;

import java.nio.charset.Charset;

import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.post;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(classes = {WebUnitTestContext.class})
@WebAppConfiguration
public class NoteControllerTest {

    private static final MediaType APPLICATION_JSON_UTF8 = new MediaType(
            MediaType.APPLICATION_JSON.getType(),
            MediaType.APPLICATION_JSON.getSubtype(),
            Charset.forName("utf8")
    );

    private MockMvc mockMvc;

    @Autowired
    private ObjectMapper objectMapper;

    @Autowired
    private WebApplicationContext webAppContext;

    @Before
    public void setUp() {
        mockMvc = MockMvcBuilders.webAppContextSetup(webAppContext).build();
    }

    @Test
    public void add_EmptyNote_ShouldReturnHttpRequestStatusBadRequest() throws Exception {
        NoteDTO addedNote = new NoteDTO();

        mockMvc.perform(post("/api/note")
                .contentType(APPLICATION_JSON_UTF8)
                .content(objectMapper.writeValueAsBytes(addedNote))
        )
                .andExpect(status().isBadRequest());
    }
}

Both of these test classes declare a constant called APPLICATION_JSON_UTF8. This constant specifies the content type and the character set of the request. It is also clear that we need this constant in every test class which contains tests for our controller methods. Does this mean that we should declare this constant in every such test class? No! We should move this constant to a non-instantiable class for two reasons:

- It is relevant to multiple test classes.
- Moving it to a separate class makes it easier for us to write new tests for our controller methods and to maintain our existing tests.

Let’s create a final WebTestConstants class, move the APPLICATION_JSON_UTF8 constant to that class, and add a private constructor to the created class. The source code of the WebTestConstants class looks as follows:

import org.springframework.http.MediaType;

import java.nio.charset.Charset;

public final class WebTestConstants {

    public static final MediaType APPLICATION_JSON_UTF8 = new MediaType(
            MediaType.APPLICATION_JSON.getType(),
            MediaType.APPLICATION_JSON.getSubtype(),
            Charset.forName("utf8")
    );

    private WebTestConstants() {
    }
}

After we have done this, we can remove the APPLICATION_JSON_UTF8 constant from our test classes.
The source code of our new test looks as follows:

import com.fasterxml.jackson.databind.ObjectMapper;
import net.petrikainulainen.spring.jooq.config.WebUnitTestContext;
import net.petrikainulainen.spring.jooq.todo.dto.TodoDTO;
import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.test.context.ContextConfiguration;
import org.springframework.test.context.junit4.SpringJUnit4ClassRunner;
import org.springframework.test.context.web.WebAppConfiguration;
import org.springframework.test.web.servlet.MockMvc;
import org.springframework.test.web.servlet.setup.MockMvcBuilders;
import org.springframework.web.context.WebApplicationContext;

import static org.springframework.test.web.servlet.request.MockMvcRequestBuilders.post;
import static org.springframework.test.web.servlet.result.MockMvcResultMatchers.status;

@RunWith(SpringJUnit4ClassRunner.class)
@ContextConfiguration(classes = {WebUnitTestContext.class})
@WebAppConfiguration
public class TodoControllerTest {

    private MockMvc mockMvc;

    @Autowired
    private ObjectMapper objectMapper;

    @Autowired
    private WebApplicationContext webAppContext;

    @Before
    public void setUp() {
        mockMvc = MockMvcBuilders.webAppContextSetup(webAppContext).build();
    }

    @Test
    public void add_EmptyTodoEntry_ShouldReturnHttpRequestStatusBadRequest() throws Exception {
        TodoDTO addedTodoEntry = new TodoDTO();

        mockMvc.perform(post("/api/todo")
                .contentType(WebTestConstants.APPLICATION_JSON_UTF8)
                .content(objectMapper.writeValueAsBytes(addedTodoEntry))
        )
                .andExpect(status().isBadRequest());
    }
}

We have just removed duplicate code from our test classes and reduced the effort required to write new tests for our controllers. Pretty cool, huh? If we change the value of a constant which is added to a constants class, the change affects every test case which uses that constant. That is why we should minimize the number of constants which are added to a constants class.

Summary

We have now learned that constants can help us to write clean tests and reduce the effort required to write new tests and maintain our existing tests. There are a couple of things which we should remember when we put the advice given in this blog post into practice:

- We must give good names to constants and constants classes. If we don’t do that, we aren’t leveraging the full potential of these techniques.
- We shouldn’t introduce new constants without figuring out what we want to achieve with each constant. The reality is often a lot more complex than the examples of this blog post. If we write code on autopilot, the odds are that we will miss the best solution to the problem at hand.

Reference: Writing Clean Tests – Beware of Magic from our JCG partner Petri Kainulainen at the Petri Kainulainen blog....