
What's New Here?


Parsing a file with Stream API in Java 8

Streams are everywhere in Java 8. Just look around and you will surely find them. This also applies to java.io.BufferedReader. Parsing a file in Java 8 with the Stream API is extremely easy. I have a CSV file that I want to read. An example below:

username;visited
jdoe;10
kolorobot;4

The contract for my reader is to provide the header as a list of strings and all records as a list of lists of strings. My reader accepts a java.io.Reader as the source to read from. I will start with reading the header. The algorithm for reading the header is as follows: open the source for reading, get the first line, split it by the separator, convert it to a list of strings, and return it. And the implementation:

class CsvReader {

    private static final String SEPARATOR = ";";

    private final Reader source;

    CsvReader(Reader source) {
        this.source = source;
    }

    List<String> readHeader() {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.lines()
                    .findFirst()
                    .map(line -> Arrays.asList(line.split(SEPARATOR)))
                    .get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

Fairly simple and self-explanatory. Similarly, I created a method to read all records. The algorithm for reading the records is as follows: open the source for reading, skip the first line (the header), map each remaining line to a list of strings by splitting it on the separator, and collect the results into a list. And the implementation:

class CsvReader {

    List<List<String>> readRecords() {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.lines()
                    .skip(1)
                    .map(line -> Arrays.asList(line.split(SEPARATOR)))
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

Nothing fancy here. What you may notice is that the mapper in both methods is exactly the same. In fact, it can easily be extracted to a variable:

Function<String, List<String>> mapper = line -> Arrays.asList(line.split(SEPARATOR));

To finish up, I created a simple test.

public class CsvReaderTest {

    @Test
    public void readsHeader() {
        CsvReader csvReader = createCsvReader();
        List<String> header = csvReader.readHeader();
        assertThat(header)
                .contains("username")
                .contains("visited")
                .hasSize(2);
    }

    @Test
    public void readsRecords() {
        CsvReader csvReader = createCsvReader();
        List<List<String>> records = csvReader.readRecords();
        assertThat(records)
                .contains(Arrays.asList("jdoe", "10"))
                .contains(Arrays.asList("kolorobot", "4"))
                .hasSize(2);
    }

    private CsvReader createCsvReader() {
        try {
            Path path = Paths.get("src/test/resources", "sample.csv");
            Reader reader = Files.newBufferedReader(path, Charset.forName("UTF-8"));
            return new CsvReader(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

Reference: Parsing a file with Stream API in Java 8 from our JCG partner Rafal Borowiec at the Codeleak.pl blog....
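For convenience, the two methods and the shared mapper can be pulled together into a single class. The following is only a sketch assembled from the listings above, not the author's final code; it keeps the .get() call from the article, so readHeader() will throw NoSuchElementException on an empty source:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.stream.Collectors;

class CsvReader {

    private static final String SEPARATOR = ";";

    // The mapper shared by both methods: split a line on the separator
    // and wrap the resulting tokens in a list.
    private static final Function<String, List<String>> LINE_MAPPER =
            line -> Arrays.asList(line.split(SEPARATOR));

    private final Reader source;

    CsvReader(Reader source) {
        this.source = source;
    }

    List<String> readHeader() {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.lines()
                    .findFirst()
                    .map(LINE_MAPPER)
                    .get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    List<List<String>> readRecords() {
        try (BufferedReader reader = new BufferedReader(source)) {
            return reader.lines()
                    .skip(1) // skip the header line
                    .map(LINE_MAPPER)
                    .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}

Note that, as in the article's test, each CsvReader instance should be used for a single call only, because the underlying Reader is closed when the try-with-resources block completes.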

Grails tip: refactoring your URLs

On the current project I am working on, we use a lot of integration tests. For you non-Grails users out there, integration tests exercise your Controller APIs, your Services and any persistence that might happen, all very neatly. The only slice of the cake they don't test from a back-end perspective are your Grails filters, for which you'd need something like a functional test. In the Grails world, Controller APIs are mapped to URL requests in the URLMappings.groovy file. This is just a simple piece of Groovy that configures which HTTP requests go to which Controller. For example:

class UrlMappings {
    static mappings = {
        "/sports/rugby/ball" (controller: "rugbyBall",
            action: [POST: "createBall", DELETE: "removeBall", GET: "getBall"])
        ...

So in the above example, the HTTP request /sports/rugby/ball will go to the RugbyBallController and will end up in the methods createBall(), removeBall() or getBall(), depending on whether the request is a POST, DELETE or GET. Now suppose you have your project all set up to serve up the CRUD operations for the rugby ball, and after a few hectic sprints some software entropy creeps in and you need to refactor your Controller APIs. But before you race ahead and do that, your project manager looks you in the eye and says: "You must support all existing APIs as clients are using them." This is how refactoring generally works in the real world when things go into production. There is always a phase of supporting the old and the new, deprecating the old, and then, when everyone is happy, removing it. Anyway, you begin by updating your URLMappings.groovy:

class UrlMappings {
    static mappings = {
        // Old APIs
        "/sports/rugby/ball" (controller: "rugbyBall",
            action: [POST: "oldCreateBall", DELETE: "oldRemoveBall", GET: "oldGetBall"])
        ...
        // New APIs
        "/sports/rugby/v2/ball" (controller: "rugbyBall",
            action: [POST: "createBall", DELETE: "removeBall", GET: "getBall"])
        ...

The URLMappings.groovy shows the old and the new. The old APIs are going to controller methods that you have renamed. Clients using these APIs are not impacted because they only send HTTP requests; they do not know which Controller is behind these endpoints. The old APIs already have really good integration tests, and our project manager has mandated that the new APIs must have similarly good integration tests before they go anywhere near pre-production.

def "test adding a single item to your cart"() {
    setup: "Set up the Cart and Landing Controller"
    //...

    when:
    //...
    rugbyBallController.oldGetBall();
    rugbyBall = JSON.parse(rugbyBallController.response.contentAsString)

    then:
    rugbyBall.isOval();
}

Mr. Project Manager says: "I want all these new tests added by Friday or you are not going for a pint after work. You need a quick way to get your integration tests done." Thinking about that cool lager and its quenching effect on the back of your throat, you remember Groovy's excellent support for invoking methods dynamically, where you can specify the name of the method as a variable:

myObject."$myMethod"() // myMethod is a Groovy String variable.

In the above code snippet, myMethod is a variable that corresponds to the name of the method you want to invoke on myObject. "$myMethod" means: evaluate the variable myMethod (which of course will be the method name), and the () of course just invokes the method. The eureka moment happens when you remember that the old and new APIs return exactly the same JSON. All you need to do is run the same test twice, once for the old code and once for the new.
Since you are using the Spock framework for your integration tests, that is easily achieved using a where block:

def "test adding a single item to your cart"(String method) {
    setup: "Set up the Cart and Landing Controller"
    //...

    when:
    //...
    rugbyBallController."$method"();
    rugbyBall = JSON.parse(rugbyBallController.response.contentAsString)

    then:
    rugbyBall.isOval();

    where:
    method << ["oldGetBall", "getBall"]
}

Happy days. Now go off and drink that lager. Reference: Grails tip: refactoring your URLs from our JCG partner Alex Staveley at the Dublin's Tech Blog blog....
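For readers who mostly work in plain Java, the closest equivalent of Groovy's myObject."$myMethod"() is reflection. The snippet below is only an illustration of that idea (the class and the example call are made up for this post, not part of the original article):

import java.lang.reflect.Method;

public class DynamicCallExample {

    // Looks up a no-argument method by name and invokes it on the target,
    // which is roughly what Groovy does at runtime for target."$name"().
    static Object invokeByName(Object target, String name) throws Exception {
        Method method = target.getClass().getMethod(name);
        return method.invoke(target);
    }

    public static void main(String[] args) throws Exception {
        // Prints "RUGBY BALL" by calling String.toUpperCase() via its name.
        System.out.println(invokeByName("rugby ball", "toUpperCase"));
    }
}

Groovy simply hides this dispatch-by-name behind much friendlier syntax, which is why the trick above fits so naturally into a Spock where block.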

Writing Clean Tests – New Considered Harmful

It is pretty hard to figure out a good definition for clean code because every one of us has our own definition for the word clean. However, there is one definition which seems to be universal: clean code is easy to read. This might come as a surprise to some of you, but I think that this definition applies to test code as well. It is in our best interests to make our tests as readable as possible because:

- If our tests are easy to read, it is easy to understand how our code works.
- If our tests are easy to read, it is easy to find the problem if a test fails (without using a debugger).

It isn't hard to write clean tests, but it takes a lot of practice, and that is why so many developers are struggling with it. I have struggled with this too, and that is why I decided to share my findings with you. This is the fourth part of my tutorial which describes how we can write clean tests. This time we will learn why we should not create objects in our test methods by using the new keyword. We will also learn how we can replace the new keyword with factory methods and test data builders.

New Is Not the New Black

During this tutorial we have been refactoring a unit test which ensures that the registerNewUserAccount(RegistrationForm userAccountData) method of the RepositoryUserService class works as expected when a new user account is created by using a unique email address and a social sign in provider. The RegistrationForm class is a data transfer object (DTO), and our unit test sets its property values by using setter methods. The source code of our unit test looks as follows (the relevant code is highlighted):

import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.mockito.Mock;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.runners.MockitoJUnitRunner;
import org.mockito.stubbing.Answer;
import org.springframework.security.crypto.password.PasswordEncoder;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;
import static org.mockito.Matchers.isA;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoMoreInteractions;
import static org.mockito.Mockito.verifyZeroInteractions;
import static org.mockito.Mockito.when;

@RunWith(MockitoJUnitRunner.class)
public class RepositoryUserServiceTest {

    private static final String REGISTRATION_EMAIL_ADDRESS = "john.smith@gmail.com";
    private static final String REGISTRATION_FIRST_NAME = "John";
    private static final String REGISTRATION_LAST_NAME = "Smith";
    private static final Role ROLE_REGISTERED_USER = Role.ROLE_USER;
    private static final SocialMediaService SOCIAL_SIGN_IN_PROVIDER = SocialMediaService.TWITTER;

    private RepositoryUserService registrationService;

    @Mock
    private PasswordEncoder passwordEncoder;

    @Mock
    private UserRepository repository;

    @Before
    public void setUp() {
        registrationService = new RepositoryUserService(passwordEncoder, repository);
    }

    @Test
    public void registerNewUserAccount_SocialSignInAndUniqueEmail_ShouldCreateNewUserAccountAndSetSignInProvider() throws DuplicateEmailException {
        RegistrationForm registration = new RegistrationForm();
        registration.setEmail(REGISTRATION_EMAIL_ADDRESS);
        registration.setFirstName(REGISTRATION_FIRST_NAME);
        registration.setLastName(REGISTRATION_LAST_NAME);
        registration.setSignInProvider(SOCIAL_SIGN_IN_PROVIDER);

        when(repository.findByEmail(REGISTRATION_EMAIL_ADDRESS)).thenReturn(null);

        when(repository.save(isA(User.class))).thenAnswer(new Answer<User>() {
            @Override
            public User answer(InvocationOnMock invocation) throws Throwable {
                Object[] arguments = invocation.getArguments();
                return (User) arguments[0];
            }
        });

        User createdUserAccount = registrationService.registerNewUserAccount(registration);

        assertEquals(REGISTRATION_EMAIL_ADDRESS, createdUserAccount.getEmail());
        assertEquals(REGISTRATION_FIRST_NAME, createdUserAccount.getFirstName());
        assertEquals(REGISTRATION_LAST_NAME, createdUserAccount.getLastName());
        assertEquals(SOCIAL_SIGN_IN_PROVIDER, createdUserAccount.getSignInProvider());
        assertEquals(ROLE_REGISTERED_USER, createdUserAccount.getRole());
        assertNull(createdUserAccount.getPassword());

        verify(repository, times(1)).findByEmail(REGISTRATION_EMAIL_ADDRESS);
        verify(repository, times(1)).save(createdUserAccount);
        verifyNoMoreInteractions(repository);
        verifyZeroInteractions(passwordEncoder);
    }
}

So, what is the problem? The highlighted part of our unit test is short and it is relatively easy to read. In my opinion, the biggest problem of this code is that it is data centric. It creates a new RegistrationForm object and sets the property values of the created object, but it doesn't describe the meaning of these property values. If we create new objects in the test method by using the new keyword, our tests become harder to read because:

- The reader has to know the different states of the created object. For example, in our case the reader has to know that if we create a new RegistrationForm object and set the property values of the email, firstName, lastName, and signInProvider properties, it means that the object is a registration which is made by using a social sign in provider.
- If the created object has many properties, the code which creates it litters the source code of our tests. We should remember that even though we need these objects in our tests, we should focus on describing the behavior of the tested method / feature.

Although it isn't realistic to assume that we can completely eliminate these drawbacks, we should do our best to minimize their effect and make our tests as easy to read as possible. Let's find out how we can do this by using factory methods.

Using Factory Methods

When we create new objects by using factory methods, we should name the factory methods and their method parameters in such a way that it makes our code easier to read and write. Let's take a look at two different factory methods and see what kind of an effect they have on the readability of our unit test. These factory methods are typically added to an object mother class because they are often useful to more than one test class. However, because I want to keep things simple, I will add them directly to the test class. The name of the first factory method is newRegistrationViaSocialSignIn(), and it has no method parameters.
After we have added this factory method to our test class, the source code of our unit test looks as follows (the relevant parts are highlighted):

import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.mockito.Mock;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.runners.MockitoJUnitRunner;
import org.mockito.stubbing.Answer;
import org.springframework.security.crypto.password.PasswordEncoder;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;
import static org.mockito.Matchers.isA;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoMoreInteractions;
import static org.mockito.Mockito.verifyZeroInteractions;
import static org.mockito.Mockito.when;

@RunWith(MockitoJUnitRunner.class)
public class RepositoryUserServiceTest {

    private static final String REGISTRATION_EMAIL_ADDRESS = "john.smith@gmail.com";
    private static final String REGISTRATION_FIRST_NAME = "John";
    private static final String REGISTRATION_LAST_NAME = "Smith";
    private static final Role ROLE_REGISTERED_USER = Role.ROLE_USER;
    private static final SocialMediaService SOCIAL_SIGN_IN_PROVIDER = SocialMediaService.TWITTER;

    private RepositoryUserService registrationService;

    @Mock
    private PasswordEncoder passwordEncoder;

    @Mock
    private UserRepository repository;

    @Before
    public void setUp() {
        registrationService = new RepositoryUserService(passwordEncoder, repository);
    }

    @Test
    public void registerNewUserAccount_SocialSignInAndUniqueEmail_ShouldCreateNewUserAccountAndSetSignInProvider() throws DuplicateEmailException {
        RegistrationForm registration = newRegistrationViaSocialSignIn();

        when(repository.findByEmail(REGISTRATION_EMAIL_ADDRESS)).thenReturn(null);

        when(repository.save(isA(User.class))).thenAnswer(new Answer<User>() {
            @Override
            public User answer(InvocationOnMock invocation) throws Throwable {
                Object[] arguments = invocation.getArguments();
                return (User) arguments[0];
            }
        });

        User createdUserAccount = registrationService.registerNewUserAccount(registration);

        assertEquals(REGISTRATION_EMAIL_ADDRESS, createdUserAccount.getEmail());
        assertEquals(REGISTRATION_FIRST_NAME, createdUserAccount.getFirstName());
        assertEquals(REGISTRATION_LAST_NAME, createdUserAccount.getLastName());
        assertEquals(SOCIAL_SIGN_IN_PROVIDER, createdUserAccount.getSignInProvider());
        assertEquals(ROLE_REGISTERED_USER, createdUserAccount.getRole());
        assertNull(createdUserAccount.getPassword());

        verify(repository, times(1)).findByEmail(REGISTRATION_EMAIL_ADDRESS);
        verify(repository, times(1)).save(createdUserAccount);
        verifyNoMoreInteractions(repository);
        verifyZeroInteractions(passwordEncoder);
    }

    private RegistrationForm newRegistrationViaSocialSignIn() {
        RegistrationForm registration = new RegistrationForm();
        registration.setEmail(REGISTRATION_EMAIL_ADDRESS);
        registration.setFirstName(REGISTRATION_FIRST_NAME);
        registration.setLastName(REGISTRATION_LAST_NAME);
        registration.setSignInProvider(SOCIAL_SIGN_IN_PROVIDER);

        return registration;
    }
}

The first factory method has the following consequences:

- The part of our test method which creates the new RegistrationForm object is a lot cleaner than before, and the name of the factory method describes the state of the created RegistrationForm object.
- The configuration of our mock object is harder to read because the value of the email property is "hidden" inside our factory method.
- Our assertions are harder to read because the property values of the created RegistrationForm object are "hidden" inside our factory method.

If we used the object mother pattern, the problem would be even bigger because we would have to move the related constants to the object mother class. I think it is fair to say that even though the first factory method has its benefits, it has serious drawbacks as well. Let's see if the second factory method can eliminate those drawbacks. The name of the second factory method is newRegistrationViaSocialSignIn(), and it takes the email address, first name, last name, and social sign in provider as method parameters. After we have added this factory method to our test class, the source code of our unit test looks as follows (the relevant parts are highlighted):

import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.mockito.Mock;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.runners.MockitoJUnitRunner;
import org.mockito.stubbing.Answer;
import org.springframework.security.crypto.password.PasswordEncoder;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;
import static org.mockito.Matchers.isA;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoMoreInteractions;
import static org.mockito.Mockito.verifyZeroInteractions;
import static org.mockito.Mockito.when;

@RunWith(MockitoJUnitRunner.class)
public class RepositoryUserServiceTest {

    private static final String REGISTRATION_EMAIL_ADDRESS = "john.smith@gmail.com";
    private static final String REGISTRATION_FIRST_NAME = "John";
    private static final String REGISTRATION_LAST_NAME = "Smith";
    private static final Role ROLE_REGISTERED_USER = Role.ROLE_USER;
    private static final SocialMediaService SOCIAL_SIGN_IN_PROVIDER = SocialMediaService.TWITTER;

    private RepositoryUserService registrationService;

    @Mock
    private PasswordEncoder passwordEncoder;

    @Mock
    private UserRepository repository;

    @Before
    public void setUp() {
        registrationService = new RepositoryUserService(passwordEncoder, repository);
    }

    @Test
    public void registerNewUserAccount_SocialSignInAndUniqueEmail_ShouldCreateNewUserAccountAndSetSignInProvider() throws DuplicateEmailException {
        RegistrationForm registration = newRegistrationViaSocialSignIn(REGISTRATION_EMAIL_ADDRESS,
                REGISTRATION_FIRST_NAME,
                REGISTRATION_LAST_NAME,
                SOCIAL_SIGN_IN_PROVIDER
        );

        when(repository.findByEmail(REGISTRATION_EMAIL_ADDRESS)).thenReturn(null);

        when(repository.save(isA(User.class))).thenAnswer(new Answer<User>() {
            @Override
            public User answer(InvocationOnMock invocation) throws Throwable {
                Object[] arguments = invocation.getArguments();
                return (User) arguments[0];
            }
        });

        User createdUserAccount = registrationService.registerNewUserAccount(registration);

        assertEquals(REGISTRATION_EMAIL_ADDRESS, createdUserAccount.getEmail());
        assertEquals(REGISTRATION_FIRST_NAME, createdUserAccount.getFirstName());
        assertEquals(REGISTRATION_LAST_NAME, createdUserAccount.getLastName());
        assertEquals(SOCIAL_SIGN_IN_PROVIDER, createdUserAccount.getSignInProvider());
        assertEquals(ROLE_REGISTERED_USER, createdUserAccount.getRole());
        assertNull(createdUserAccount.getPassword());

        verify(repository, times(1)).findByEmail(REGISTRATION_EMAIL_ADDRESS);
        verify(repository, times(1)).save(createdUserAccount);
        verifyNoMoreInteractions(repository);
        verifyZeroInteractions(passwordEncoder);
    }

    private RegistrationForm newRegistrationViaSocialSignIn(String emailAddress, String firstName, String lastName, SocialMediaService signInProvider) {
        RegistrationForm registration = new RegistrationForm();
        registration.setEmail(emailAddress);
        registration.setFirstName(firstName);
        registration.setLastName(lastName);
        registration.setSignInProvider(signInProvider);

        return registration;
    }
}

The second factory method has the following consequences:

- The part of our test method which creates the new RegistrationForm object is a bit messier than the same code which uses the first factory method. However, it is still cleaner than the original code because the name of the factory method describes the state of the created object.
- It seems to eliminate the drawbacks of the first factory method because the property values of the created object are not "hidden" inside the factory method.

Seems cool, right? It would be really easy to think that all is well in paradise, but that is not the case. Although we have seen that factory methods can make our tests more readable, they are a good choice only when the following conditions are met:

- The factory method doesn't have too many method parameters. When the number of method parameters grows, our tests become harder to write and read. The obvious question is: how many method parameters can a factory method have? Unfortunately, it is hard to give an exact answer to that question, but I think that using a factory method is a good choice if the factory method has only a handful of method parameters.
- The test data doesn't have too much variation. The problem with factory methods is that a single factory method is typically suitable for one use case. If we need to support N use cases, we need N factory methods. This is a problem because over time our factory methods become bloated, messy, and hard to maintain (especially if we use the object mother pattern).

Let's find out if test data builders can solve some of these problems.

Using Test Data Builders

A test data builder is a class which creates new objects by using the builder pattern. The builder pattern described in Effective Java has many benefits, but our primary motivation is to provide a fluent API for creating the objects used in our tests. We can create a test data builder class which creates new RegistrationForm objects by following these steps:

- Create a RegistrationFormBuilder class.
- Add a RegistrationForm field to the created class. This field contains a reference to the created object.
- Add a default constructor to the created class and implement it by creating a new RegistrationForm object.
- Add methods which are used to set the property values of the created RegistrationForm object. Each method sets the property value by calling the correct setter method and returns a reference to the RegistrationFormBuilder object. Remember that the names of these methods can either make or break our DSL.
- Add a build() method to the created class and implement it by returning the created RegistrationForm object.

The source code of our test data builder class looks as follows:

public class RegistrationFormBuilder {

    private RegistrationForm registration;

    public RegistrationFormBuilder() {
        registration = new RegistrationForm();
    }

    public RegistrationFormBuilder email(String email) {
        registration.setEmail(email);
        return this;
    }

    public RegistrationFormBuilder firstName(String firstName) {
        registration.setFirstName(firstName);
        return this;
    }

    public RegistrationFormBuilder lastName(String lastName) {
        registration.setLastName(lastName);
        return this;
    }

    public RegistrationFormBuilder isSocialSignInViaSignInProvider(SocialMediaService signInProvider) {
        registration.setSignInProvider(signInProvider);
        return this;
    }

    public RegistrationForm build() {
        return registration;
    }
}

After we have modified our unit test to use the new test data builder class, its source code looks as follows (the relevant part is highlighted):

import org.junit.Before;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.mockito.Mock;
import org.mockito.invocation.InvocationOnMock;
import org.mockito.runners.MockitoJUnitRunner;
import org.mockito.stubbing.Answer;
import org.springframework.security.crypto.password.PasswordEncoder;

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertNull;
import static org.mockito.Matchers.isA;
import static org.mockito.Mockito.times;
import static org.mockito.Mockito.verify;
import static org.mockito.Mockito.verifyNoMoreInteractions;
import static org.mockito.Mockito.verifyZeroInteractions;
import static org.mockito.Mockito.when;

@RunWith(MockitoJUnitRunner.class)
public class RepositoryUserServiceTest {

    private static final String REGISTRATION_EMAIL_ADDRESS = "john.smith@gmail.com";
    private static final String REGISTRATION_FIRST_NAME = "John";
    private static final String REGISTRATION_LAST_NAME = "Smith";
    private static final Role ROLE_REGISTERED_USER = Role.ROLE_USER;
    private static final SocialMediaService SOCIAL_SIGN_IN_PROVIDER = SocialMediaService.TWITTER;

    private RepositoryUserService registrationService;

    @Mock
    private PasswordEncoder passwordEncoder;

    @Mock
    private UserRepository repository;

    @Before
    public void setUp() {
        registrationService = new RepositoryUserService(passwordEncoder, repository);
    }

    @Test
    public void registerNewUserAccount_SocialSignInAndUniqueEmail_ShouldCreateNewUserAccountAndSetSignInProvider() throws DuplicateEmailException {
        RegistrationForm registration = new RegistrationFormBuilder()
                .email(REGISTRATION_EMAIL_ADDRESS)
                .firstName(REGISTRATION_FIRST_NAME)
                .lastName(REGISTRATION_LAST_NAME)
                .isSocialSignInViaSignInProvider(SOCIAL_SIGN_IN_PROVIDER)
                .build();

        when(repository.findByEmail(REGISTRATION_EMAIL_ADDRESS)).thenReturn(null);

        when(repository.save(isA(User.class))).thenAnswer(new Answer<User>() {
            @Override
            public User answer(InvocationOnMock invocation) throws Throwable {
                Object[] arguments = invocation.getArguments();
                return (User) arguments[0];
            }
        });

        User createdUserAccount = registrationService.registerNewUserAccount(registration);

        assertEquals(REGISTRATION_EMAIL_ADDRESS, createdUserAccount.getEmail());
        assertEquals(REGISTRATION_FIRST_NAME, createdUserAccount.getFirstName());
        assertEquals(REGISTRATION_LAST_NAME, createdUserAccount.getLastName());
        assertEquals(SOCIAL_SIGN_IN_PROVIDER, createdUserAccount.getSignInProvider());
        assertEquals(ROLE_REGISTERED_USER, createdUserAccount.getRole());
        assertNull(createdUserAccount.getPassword());

        verify(repository, times(1)).findByEmail(REGISTRATION_EMAIL_ADDRESS);
        verify(repository, times(1)).save(createdUserAccount);
        verifyNoMoreInteractions(repository);
        verifyZeroInteractions(passwordEncoder);
    }
}

As we can see, test data builders have the following benefits:

- The code which creates new RegistrationForm objects is both easy to read and write. I am a big fan of fluent APIs, and I think that this code is both beautiful and elegant.
- The builder pattern ensures that the variation found in our test data is no longer a problem because we can simply add new methods to the test data builder class.
- The configuration of our mock object and our assertions are easy to read because the constants are visible in our test method and our DSL emphasizes the meaning of each property value.

So, should we use the builder pattern for everything? NO! We should use test data builders only when it makes sense. In other words, we should use them when:

- We have to set more than a handful of property values.
- Our test data has a lot of variation.

The builder pattern is a perfect choice if one of these conditions is true. The reason for this is that we can create a domain-specific language by naming the setter-like methods of the builder class. This makes our tests easy to read and write even if we have to create a lot of different objects and set a lot of property values. That is the power of the builder pattern. If you want to learn more about fluent APIs, you should read the following articles:

- Fluent Interface
- The Java Fluent API Designer Crash Course
- Building a fluent API (Internal DSL) in Java

That is all for today. Let's move on and summarize what we learned from this blog post.

Summary

We learned why it is a bad idea to create objects in the test method by using the new keyword, and we learned two different ways to create the objects which are used in our tests. To be more specific, this blog post has taught us three things:

- It is a bad idea to create the required objects in the test method by using the new keyword because it makes our tests messy and hard to read.
- If we have to set only a handful of property values and our test data doesn't have a lot of variation, we should create the required object by using a factory method.
- If we have to set a lot of property values and / or our test data has a lot of variation, we should create the required object by using a test data builder.

Reference: Writing Clean Tests – New Considered Harmful from our JCG partner Petri Kainulainen at the Petri Kainulainen blog....
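As a small addendum to the article above: the object mother pattern is mentioned several times but never shown. A rough sketch of what it could look like for RegistrationForm objects follows; the class name RegistrationForms is my own choice for illustration, not something from the original post. It simply collects shared factory methods in one place and can delegate to the test data builder:

public final class RegistrationForms {

    private RegistrationForms() {
        // Static factory methods only.
    }

    public static RegistrationForm newRegistrationViaSocialSignIn(String email,
            String firstName, String lastName, SocialMediaService signInProvider) {
        return new RegistrationFormBuilder()
                .email(email)
                .firstName(firstName)
                .lastName(lastName)
                .isSocialSignInViaSignInProvider(signInProvider)
                .build();
    }
}

A test class would then call RegistrationForms.newRegistrationViaSocialSignIn(...) instead of declaring its own factory method; the drawback discussed in the article still applies, because such a shared class tends to grow as more test classes and use cases depend on it.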

Yet Another 10 Common Mistakes Java Developers Make When Writing SQL (You Won’t BELIEVE the Last One)

(Sorry for that click-bait heading. Couldn't resist!) We're on a mission. To teach you SQL. But mostly, we want to teach you how to appreciate SQL. You'll love it! Getting SQL right or wrong shouldn't be about that You're-Doing-It-Wrong™ attitude that can often be encountered when evangelists promote their object of evangelism. Getting SQL right should be about the fun you'll have once you do get it right. The things you start appreciating when you notice that you can easily replace 2000 lines of slow, hard-to-maintain, and ugly imperative (or object-oriented) code with 300 lines of lean functional code (e.g. using Java 8), or even better, with 50 lines of SQL. We're glad to see that our blogging friends have started appreciating SQL, and most specifically window functions, after reading our posts. For instance, take:

- Vlad Mihalcea's Time to Break Free from the SQL-92 Mindset
- Petri Kainulainen's revelations that led to him starting his jOOQ tutorial series (among other reasons)
- Eugen Paraschiv (from Baeldung)'s cracking up about Es-Queue-El

So, after our previous, very popular posts:

- 10 Common Mistakes Java Developers Make when Writing SQL
- 10 More Common Mistakes Java Developers Make when Writing SQL

… we'll bring you: Yet Another 10 Common Mistakes Java Developers Make When Writing SQL. And of course, this doesn't apply to Java developers alone, but it's written from the perspective of a Java (and SQL) developer. So here we go (again):

1. Not Using Window Functions

After all that we've been preaching, this must be our number 1 mistake in this series. Window functions are probably the coolest SQL feature of them all. They're so incredibly useful, they should be the number one reason for anyone to switch to a better database, e.g. PostgreSQL:

"Mind bending talk by @lukaseder about @JavaOOQ at tonight's @jsugtu. My new resolution: Install PostgreSQL and study SQL standard at once." — Peter Kofler (@codecopkofler) April 7, 2014

If free and/or Open Source is important to you, you have absolutely no better choice than using PostgreSQL (and you'll even get to use the free jOOQ Open Source Edition, if you're a Java developer). And if you're lucky enough to work in an environment with Oracle or SQL Server (or DB2, Sybase) licenses, you get even more out of your new favourite tool. We won't repeat all the window function goodness in this section, we've blogged about them often enough:

- Probably the Coolest SQL Feature: Window Functions
- NoSQL? No, SQL! – How to Calculate Running Totals
- How can I do This? – With SQL of Course!
- CUME_DIST(), a Lesser-Known SQL Gem
- Popular ORMs Don't do SQL
- SQL Trick: row_number() is to SELECT what dense_rank() is to SELECT DISTINCT
- ORM vs. SQL, compared to C vs. ASM

The Cure: Remove MySQL. Take a decent database. And start playing with window functions. You'll never go back, guaranteed.

2. Not declaring NOT NULL constraints

This one was already part of a previous list where we claimed that you should add as much metadata as possible to your schema, because your database will be able to leverage that metadata for optimisations. For instance, if your database knows that a foreign key value in BOOK.AUTHOR_ID must also be contained exactly once in AUTHOR.ID, then a whole set of optimisations can be achieved in complex queries. Now let's have another look at NOT NULL constraints. If you're using Oracle, NULL values will not be part of your index. This doesn't matter if you're expressing an IN constraint, for instance:

SELECT * FROM table
WHERE value IN (
  SELECT nullable_column FROM ...
)

But what happens with a NOT IN constraint?

SELECT * FROM table
WHERE value NOT IN (
  SELECT nullable_column FROM ...
)

Due to SQL's slightly unintuitive way of handling NULL, there is a slight risk of the second query unexpectedly not returning any results at all, namely if there is at least one NULL value as a result of the subquery. This is true for all databases that get SQL right. But because the index on nullable_column doesn't contain any NULL values, Oracle has to look up the complete content in the table, resulting in a FULL TABLE SCAN. Now that is unexpected! Details about this can be seen in this article.

The Cure: Carefully review all your nullable, yet indexed columns, and check if you really cannot add a NOT NULL constraint to those columns.

The Tool: If you're using Oracle, use this query to detect all nullable, yet indexed columns:

SELECT
  i.table_name,
  i.index_name,
  LISTAGG(
    LPAD(i.column_position, 2) || ': ' ||
    RPAD(i.column_name, 30) || ' ' ||
    DECODE(t.nullable, 'Y', '(NULL)', '(NOT NULL)'),
    ', '
  ) WITHIN GROUP (ORDER BY i.column_position)
    AS "NULLABLE columns in indexes"
FROM user_ind_columns i
JOIN user_tab_cols t
ON (t.table_name, t.column_name) =
   ((i.table_name, i.column_name))
WHERE EXISTS (
  SELECT 1
  FROM user_tab_cols t
  WHERE (t.table_name, t.column_name, t.nullable) =
        ((i.table_name, i.column_name, 'Y'))
)
GROUP BY i.table_name, i.index_name
ORDER BY i.index_name ASC;

Example output:

TABLE_NAME | INDEX_NAME   | NULLABLE columns in indexes
-----------+--------------+----------------------------
PERSON     | I_PERSON_DOB | 1: DATE_OF_BIRTH (NULL)

And then, fix it! (Accidental criticism of Maven is irrelevant here!) If you're curious about more details, see also these posts:

- The Index You've Added is Useless. Why?
- NULL in SQL. Explaining its Behaviour
- Indexing NULL in the Oracle Database

3. Using PL/SQL Package State

Now, this is a boring one if you're not using Oracle, but if you are (and you're a Java developer), be very wary of PL/SQL package state. Are you really doing what you think you're doing? Yes, PL/SQL has package state, e.g.

CREATE OR REPLACE PACKAGE pkg IS
  -- Package state here!
  n NUMBER := 1;

  FUNCTION next_n RETURN NUMBER;
END pkg;

CREATE OR REPLACE PACKAGE BODY pkg IS
  FUNCTION next_n RETURN NUMBER
  IS
  BEGIN
    n := n + 1;
    RETURN n;
  END next_n;
END pkg;

Wonderful, so you've created yourself an in-memory counter that generates a new number every time you call pkg.next_n. But who owns that counter? Yes, the session. Each session has its own initialised "package instance". But no, it's probably not the session you might have thought of. We Java developers connect to databases through connection pools. When we obtain a JDBC Connection from such a pool, we recycle that connection from a previous "session", e.g. a previous HTTP Request (not HTTP Session!). But that's not the same. The database session (probably) outlives the HTTP Request and will be inherited by the next request, possibly from an entirely different user. Now, imagine you had a credit card number in that package…?

Not The Cure: Nope. Don't just jump to using SERIALLY_REUSABLE packages:

CREATE OR REPLACE PACKAGE pkg IS
  PRAGMA SERIALLY_REUSABLE;
  n NUMBER := 1;

  FUNCTION next_n RETURN NUMBER;
END pkg;

Because:

- You cannot even use that package from SQL, now (see ORA-06534).
- Mixing this PRAGMA with regular package state from other packages just makes things a lot more complex.

So, don't.

Not The Cure: I know. PL/SQL can be a beast. It often seems like such a quirky language. But face it. Many things run much, much faster when written in PL/SQL, so don't give up just yet. Dropping PL/SQL is not the solution either.

The Cure: At all costs, try to avoid package state in PL/SQL. Think of package state as you would of static variables in Java. While they might be useful for caches (and constants, of course) every now and then, you might not actually access the state that you wanted. Think about load-balancers, suddenly transferring you to another JVM. Think about class loaders, that might have loaded the same class twice, for some reason. Instead, pass state as arguments through procedures and functions. This will avoid side-effects and make your code much cleaner and more predictable. Or, obviously, persist state to some table.

4. Running the same query all the time

Master data is boring. You probably wrote some utility to get the latest version of your master data (e.g. language, locale, translations, tenant, system settings), and you can query it every time, once it is available. At all costs, don't do that. You don't have to cache many things in your application, as modern databases have grown to be extremely fast when it comes to caching:

- Table / column content
- Index content
- Query / materialized view results
- Procedure results (if they're deterministic)
- Cursors
- Execution plans

So, for your average query, there's virtually no need for an ORM second-level cache, at least from a performance perspective (ORM caches mainly fulfil other purposes, of course). But when you query master data, i.e. data that never changes, then network latency, traffic and many other factors will impair your database experience.

The Cure: Please do take 10 minutes, download Guava, and use its excellent and easy to set up cache, which ships with various built-in invalidation strategies. Choose time-based invalidation (i.e. polling), choose Oracle AQ or Streams, or PostgreSQL's NOTIFY for event-based invalidation, or just make your cache permanent, if it doesn't matter. But don't issue an identical master data query all the time. … This obviously brings us to:

5. Not knowing about the N+1 problem

You had a choice. At the beginning of your software product, you had to choose between:

- An ORM (e.g. Hibernate, EclipseLink)
- SQL (e.g. through JDBC, MyBatis, or jOOQ)
- Both

So, obviously, you chose an ORM, because otherwise you wouldn't be suffering from "N+1". What does "N+1" mean? The accepted answer on this Stack Overflow question explains it nicely. Essentially, you're running:

SELECT * FROM book

-- And then, for each book:
SELECT * FROM author WHERE id = ?
SELECT * FROM author WHERE id = ?
SELECT * FROM author WHERE id = ?

Of course, you could go and tweak your hundreds of annotations to correctly prefetch or eager fetch each book's associated author information to produce something along the lines of:

SELECT *
FROM book
JOIN author ON book.author_id = author.id

But that would be an awful lot of work, and you'll risk eager-fetching too many things that you didn't want, resulting in another performance issue. Maybe, you could upgrade to JPA 2.1 and use the new @NamedEntityGraph to express beautiful annotation trees like this one:

@NamedEntityGraph(
    name = "post",
    attributeNodes = {
        @NamedAttributeNode("title"),
        @NamedAttributeNode(
            value = "comments",
            subgraph = "comments"
        )
    },
    subgraphs = {
        @NamedSubgraph(
            name = "comments",
            attributeNodes = {
                @NamedAttributeNode("content")
            }
        )
    }
)

The example was taken from this blog post by Hantsy Bai. Hantsy then goes on to explain that you can use the above beauty through the following statement:

em.createQuery("select p from Post p where p.id=:id", Post.class)
  .setHint("javax.persistence.fetchgraph", postGraph)
  .setParameter("id", this.id)
  .getResultList()
  .get(0);

Let us all appreciate the above application of JEE standards with all due respect, and then consider…

The Cure: You just listen to the wise words at the beginning of this article and replace thousands of lines of tedious Java / Annotatiomania™ code with a couple of lines of SQL. Because that will also likely help you prevent another issue that we haven't even touched yet, namely selecting too many columns, as you can see in these posts:

- Our previous listing of common mistakes
- Myth: SELECT * is bad

Since you're already using an ORM, this might just mean resorting to native SQL – or maybe you manage to express your query with JPQL. Of course, we agree with Alessio Harri in believing that you should use jOOQ together with JPA:

"Loved the type safety of @JavaOOQ today. OpenJPA is the workhorse and @JavaOOQ is the artist :) #80/20" — Alessio Harri (@alessioh) May 23, 2014

The Takeaway: While the above will certainly help you work around some real world issues that you may have with your favourite ORM, you could also take it one step further and think about it this way. After all these years of pain and suffering from the object-relational impedance mismatch, the JPA 2.1 expert group is now trying to tweak their way out of this annotation madness by adding more declarative, annotation-based fetch graph hints to JPQL queries, that no one can debug, let alone maintain. The alternative is simple and straightforward SQL. And with Java 8, we'll add functional transformation through the Streams API. That's hard to beat. But obviously, your views and experiences on that subject may differ from ours, so let's head on to a more objective discussion about…

6. Not using Common Table Expressions

While common table expressions obviously offer readability improvements, they may also offer performance improvements. Consider the following query that I recently encountered in a customer's PL/SQL package (not the actual query):

SELECT round (
  (SELECT amount FROM payments WHERE id = :p_id)
    *
  (
    SELECT e.bid
    FROM currencies c, exchange_rates e
    WHERE c.id     = (SELECT cur_id FROM payments WHERE id = :p_id)
    AND   e.cur_id = (SELECT cur_id FROM payments WHERE id = :p_id)
    AND   e.org_id = (SELECT org_id FROM payments WHERE id = :p_id)
  )
    /
  (
    SELECT c.factor
    FROM currencies c, exchange_rates e
    WHERE c.id     = (SELECT cur_id FROM payments WHERE id = :p_id)
    AND   e.cur_id = (SELECT cur_id FROM payments WHERE id = :p_id)
    AND   e.org_id = (SELECT org_id FROM payments WHERE id = :p_id)
  ), 0
)
INTO amount
FROM dual;

So what does this do? This essentially converts a payment's amount from one currency into another. Let's not delve into the business logic too much, let's head straight to the technical problem. The above query results in the following execution plan (on Oracle):

------------------------------------------------------
| Operation                     | Name               |
------------------------------------------------------
| SELECT STATEMENT              |                    |
| TABLE ACCESS BY INDEX ROWID   | PAYMENTS           |
| INDEX UNIQUE SCAN             | PAYM_PK            |
| NESTED LOOPS                  |                    |
| INDEX UNIQUE SCAN             | CURR_PK            |
| TABLE ACCESS BY INDEX ROWID   | PAYMENTS           |
| INDEX UNIQUE SCAN             | PAYM_PK            |
| TABLE ACCESS BY INDEX ROWID   | EXCHANGE_RATES     |
| INDEX UNIQUE SCAN             | EXCH_PK            |
| TABLE ACCESS BY INDEX ROWID   | PAYMENTS           |
| INDEX UNIQUE SCAN             | PAYM_PK            |
| TABLE ACCESS BY INDEX ROWID   | PAYMENTS           |
| INDEX UNIQUE SCAN             | PAYM_PK            |
| NESTED LOOPS                  |                    |
| TABLE ACCESS BY INDEX ROWID   | CURRENCIES         |
| INDEX UNIQUE SCAN             | CURR_PK            |
| TABLE ACCESS BY INDEX ROWID   | PAYMENTS           |
| INDEX UNIQUE SCAN             | PAYM_PK            |
| INDEX UNIQUE SCAN             | EXCH_PK            |
| TABLE ACCESS BY INDEX ROWID   | PAYMENTS           |
| INDEX UNIQUE SCAN             | PAYM_PK            |
| TABLE ACCESS BY INDEX ROWID   | PAYMENTS           |
| INDEX UNIQUE SCAN             | PAYM_PK            |
| FAST DUAL                     |                    |
------------------------------------------------------

The actual execution time is negligible in this case, but as you can see, the same objects are accessed again and again within the query. This is a violation of Common Mistake #4: Running the same query all the time. The whole thing would be so much easier to read, maintain, and for Oracle to execute, if we had used a common table expression. From the original source code, observe the following thing:

-- We're always accessing a single payment:
FROM payments WHERE id = :p_id

-- Joining currencies and exchange_rates twice:
FROM currencies c, exchange_rates e

So, let's factor out the payment first:

-- "payment" contains only a single payment
-- But it contains all the columns that we'll need
-- afterwards
WITH payment AS (
  SELECT cur_id, org_id, amount
  FROM payments
  WHERE id = :p_id
)
SELECT round(p.amount * e.bid / c.factor, 0)

-- Then, we simply don't need to repeat the
-- currencies / exchange_rates joins twice
FROM payment p
JOIN currencies c     ON p.cur_id = c.id
JOIN exchange_rates e ON e.cur_id = p.cur_id
                     AND e.org_id = p.org_id

Note that we've also replaced table lists with ANSI JOINs as suggested in our previous list. You wouldn't believe it's the same query, would you? And what about the execution plan? Here it is!

---------------------------------------------------
| Operation                     | Name             |
---------------------------------------------------
| SELECT STATEMENT              |                  |
| NESTED LOOPS                  |                  |
| NESTED LOOPS                  |                  |
| NESTED LOOPS                  |                  |
| FAST DUAL                     |                  |
| TABLE ACCESS BY INDEX ROWID   | PAYMENTS         |
| INDEX UNIQUE SCAN             | PAYM_PK          |
| TABLE ACCESS BY INDEX ROWID   | EXCHANGE_RATES   |
| INDEX UNIQUE SCAN             | EXCH_PK          |
| TABLE ACCESS BY INDEX ROWID   | CURRENCIES       |
| INDEX UNIQUE SCAN             | CURR_PK          |
---------------------------------------------------

No doubt that this is much, much better.

The Cure: If you're lucky enough and you're using one of those databases that supports window functions, chances are incredibly high (100%) that you also have common table expression support. This is another reason for you to migrate from MySQL to PostgreSQL, or appreciate the fact that you can work on an awesome commercial database. Common table expressions are like local variables in SQL. In every large statement, you should consider using them as soon as you feel that you've written something before.

The Takeaway: Some databases (e.g. PostgreSQL, or SQL Server) also support common table expressions for DML statements. In other words, you can write:

WITH ...
UPDATE ...

This makes DML incredibly more powerful.

7. Not using row value expressions for UPDATEs

We've advertised the use of row value expressions in our previous listing. They're very readable and intuitive, and often also promote using certain indexes, e.g. in PostgreSQL. But few people know that they can also be used in an UPDATE statement, in most databases. Check out the following query, which I again found in a customer's PL/SQL package (simplified again, of course):

UPDATE u
SET n = (SELECT n + 1    FROM t WHERE u.n = t.n),
    s = (SELECT 'x' || s FROM t WHERE u.n = t.n),
    x = 3;

So this query takes a subquery as a data source for updating two columns, and the third column is updated "regularly". How does it perform? Moderately:

-----------------------------
| Operation          | Name |
-----------------------------
| UPDATE STATEMENT   |      |
| UPDATE             | U    |
| TABLE ACCESS FULL  | U    |
| TABLE ACCESS FULL  | T    |
| TABLE ACCESS FULL  | T    |
-----------------------------

Let's ignore the full table scans, as this query is constructed. The actual query could leverage indexes. But T is accessed twice, i.e. in both subqueries. Oracle didn't seem to be able to apply scalar subquery caching in this case. To the rescue: row value expressions. Let's simply rephrase our UPDATE to this:

UPDATE u
SET (n, s) = ((
  SELECT n + 1, 'x' || s
  FROM t
  WHERE u.n = t.n
)),
x = 3;

Let's ignore the funny, Oracle-specific double-parentheses syntax for the right hand side of such a row value expression assignment, but let's appreciate the fact that we can easily assign a new value to the tuple (n, s) in one go! Note, we could have also written this instead, and assign x as well:

UPDATE u
SET (n, s, x) = ((
  SELECT n + 1, 'x' || s, 3
  FROM t
  WHERE u.n = t.n
));

As you will have expected, the execution plan has also improved, and T is accessed only once:

-----------------------------
| Operation          | Name |
-----------------------------
| UPDATE STATEMENT   |      |
| UPDATE             | U    |
| TABLE ACCESS FULL  | U    |
| TABLE ACCESS FULL  | T    |
-----------------------------

The Cure: Use row value expressions. Wherever you can. They make your SQL code incredibly more expressive, and chances are, they make it faster as well. Note that the above is supported by jOOQ's UPDATE statement. This is the moment we would like to make you aware of this cheap, in-article advertisement:

8. Using MySQL when you could use PostgreSQL

To some, this may appear to be a bit of a hipster discussion. But let's consider the facts:

- MySQL claims to be the "most popular Open Source database".
- PostgreSQL claims to be the "most advanced Open Source database".

Let's consider a bit of history. MySQL has always been very easy to install and maintain, and it has had a great and active community. This has led to MySQL still being the RDBMS of choice with virtually every web hoster on this planet. Those hosters also host PHP, which was equally easy to install and maintain. BUT! We Java developers tend to have an opinion about PHP, right? It's summarised by this image here: Well, it works, but how does it work? The same can be said about MySQL. MySQL has always worked somehow, but while commercial databases like Oracle have made tremendous progress both in terms of query optimisation and feature scope, MySQL has hardly moved in the last decade. Many people choose MySQL primarily because of its price (USD $0.00). But often, the same people have found MySQL to be slow and quickly concluded that SQL is slow per se – without evaluating the options. This is also why all NoSQL stores compare themselves with MySQL, not with Oracle, the database that has been winning the Transaction Processing Performance Council's (TPC) benchmarks almost forever. Some examples:

- Comparing Cassandra, MongoDB, MySQL
- Switching from MySQL to Cassandra. Pros / Cons
- MySQL to Cassandra migrations
- When to use MongoDB rather than MySQL

While the last article bluntly adds "(and other RDBMS)", it doesn't go into any sort of detail whatsoever about what those "other RDBMS" do wrong. It really only compares MongoDB with MySQL.

The Cure: We say: stop complaining about SQL, when in fact you're really complaining about MySQL. There are at least four very popular databases out there that are incredibly good, and millions of times better than MySQL. These are:

- Oracle Database
- SQL Server
- PostgreSQL
- MS Access

(just kidding about the last one, of course)

The Takeaway: Don't fall for aggressive NoSQL marketing. 10gen is an extremely well-funded company, even if MongoDB continues to disappoint, technically. The same is true for Datastax. Both companies are solving a problem that few people have. They're selling us niche products as commodity, making us think that our real commodity databases (the RDBMS) no longer fulfil our needs. They are well-funded and have big marketing teams to throw around with blunt claims. In the meantime, PostgreSQL just got even better, and you, as a reader of this blog / post, are about to bet on the winning team! … just to cite Mark Madsen once more:

"History of NoSQL according to @markmadsen #strataconf" — Edd Dumbill (@edd) November 12, 2013

The Disclaimer: This article has been quite strongly against MySQL. We don't mean to talk badly about a database that perfectly fulfils its purpose, as this isn't a black and white world. Heck, you can get happy with SQLite in some situations. MySQL serves its purpose as the cheap, easy-to-use, easy-to-install commodity database. We just wanted to make you aware of the fact that you're expressly choosing the cheap, not-so-good database, rather than the cheap, awesome one.

9. Forgetting about UNDO / REDO logs

We have claimed that MERGE statements or bulk / batch updates are good. That's correct, but nonetheless, you should be wary when updating huge data sets in transactional contexts. If your transaction "takes too long", i.e. if you're updating 10 million records at a time, you will run into two problems:

- You increase the risk of race conditions, if another process is also writing to the same table. This may cause a rollback on their or on your transaction, possibly making you roll out the huge update again.
- You cause a lot of concurrency on your system, because every other transaction / session that wants to see the data that you're about to update will have to temporarily roll back all of your updates first, before they reach the state on disk that was there before your huge update. That's the price of ACID.

One way to work around this issue is to allow other sessions to read uncommitted data. Another way to work around this issue is to frequently commit your own work, e.g. after 1000 inserts / updates. In any case, due to the CAP theorem, you will have to make a compromise. Frequent commits will produce the risk of an inconsistent database in the event of the multi-million update going wrong after 5 million (committed) records. A rollback would then mean reverting all database changes towards a backup.

The Cure: There is no definitive cure to this issue. But beware that you are very, very rarely in a situation where it is OK to simply update 10 million records of a live and online table outside of an actual scheduled maintenance window. The simplest acceptable workaround is indeed to commit your work after N inserts / updates.

The Takeaway: By this time, NoSQL aficionados will claim (again due to excessive marketing by aforementioned companies) that NoSQL has solved this by dropping schemas and typesafety. "Don't update, just add another property!" – they said. But that's not true! First off, I can add columns to my database without any issue at all. An ALTER TABLE ADD statement is executed instantly on live databases. Filling the column with data doesn't bother anyone either, because no one reads the column yet (remember, don't SELECT *!). So adding columns in an RDBMS is as cheap as adding JSON properties to a MongoDB document. But what about altering columns? Removing them? Merging them? It is simply not true that denormalisation takes you anywhere far. Denormalisation is always a short-term win for the developer. Hardly a long-term win for the operations teams. Having redundant data in your database for the sake of speeding up an ALTER TABLE statement is like sweeping dirt under the carpet. Don't believe the marketers. And while you're at it, perform some doublethink and forget that we're SQL tool vendors ourselves! Here's again the "correct" message:

10. Not using the BOOLEAN type correctly

This is not really a mistake per se. It's just again something that hardly anyone knows. When the SQL:1999 standard introduced the new BOOLEAN data type, they really did it right. Because before, we already had something like booleans in SQL. We've had <search condition> in SQL-92, which are essentially predicates for use in WHERE, ON, and HAVING clauses, as well as in CASE expressions. SQL:1999, however, simply defined the new <boolean value expression> as a regular <value expression>, and redefined the <search condition> as such:

<search condition> ::= <boolean value expression>

Done! Now, for most of us Java / Scala / etc. developers, this doesn't seem like such an innovation. Heck, it's a boolean. Obviously it can be used interchangeably as predicate and as variable. But in the mind-set of the keyword-heavy SQL folks who took inspiration from COBOL when designing the language, this was quite a step forward. Now, what does this mean? This means that you can use any predicate also as a column! For instance:

SELECT a, b, c
FROM (
  SELECT
    EXISTS (SELECT ...) a,
    MY_COL IN (1, 2, 3) b,
    3 BETWEEN 4 AND 5   c
  FROM MY_TABLE
) t
WHERE a AND b AND NOT(c)

This is a bit of a dummy query, agreed, but are you aware of how powerful this is? Luckily, again, PostgreSQL fully supports this (unlike Oracle, which still doesn't have any BOOLEAN data type in SQL).

The Cure: Every now and then, using BOOLEAN types feels very right, so do it! You can transform boolean value expressions into predicates and predicates into boolean value expressions. They're the same. This makes SQL all so powerful.

Conclusion

SQL has evolved steadily over the past years through great standards like SQL:1999, SQL:2003, SQL:2008 and now SQL:2011. It is the only surviving mainstream declarative language, now that XQuery can be considered pretty dead for the mainstream. It can be easily mixed with procedural languages, as PL/SQL and T-SQL (and other procedural dialects) have shown. It can be easily mixed with object-oriented or functional languages, as jOOQ has shown. At Data Geekery, we believe that SQL is the best way to query data. You don't agree with any of the above? That's fine, you don't have to. Sometimes, even we agree with Winston Churchill, who is known to have said: "SQL is the worst form of database querying, except for all the other forms." But as Yakov Fain has recently put it: "You can run from SQL, but you can't hide." So, let's better get back to work and learn this beast! Thanks for reading. Reference: Yet Another 10 Common Mistakes Java Developers Make When Writing SQL (You Won't BELIEVE the Last One) from our JCG partner Lukas Eder at the JAVA, SQL, AND JOOQ blog....

The Absolute Basics of Indexing Data

Ever wondered how a search engine works? In this post I would like to show you a high-level view of the internal workings of a search engine and how it can be used to give fast access to your data. I won't go into any technical details; what I am describing here holds true for any Lucene based search engine, be it Lucene itself, Solr or Elasticsearch.
Input
Normally a search engine is agnostic to the real source of the data it indexes. Most often you push data into it via an API, and that data already needs to be in the expected format, mostly Strings and simple data types like integers. It doesn't matter if the data originally resides in a document in the filesystem, on a website or in a database. Search engines work with documents that consist of fields and values. Though they are not always used directly, you can think of documents as JSON documents. For this post imagine we are building a book database. In our simplified world a book just consists of a title and one or more authors. These would be two example documents:
{ "title" : "Search Patterns", "authors" : [ "Morville", "Callender" ] }
{ "title" : "Apache Solr Enterprise Search Server", "authors" : [ "Smiley", "Pugh" ] }
Even though the structure of both documents is the same in our case, the format of the documents doesn't need to be fixed. Both documents could have totally different attributes; nevertheless both could be stored in the same index. In reality you will try to keep the documents similar; after all, you need a way to handle the documents in your application. Lucene itself doesn't even have the concept of a key. But of course you need a key to identify your documents when updating them. Both Solr and Elasticsearch have ids that can either be chosen by the application or be autogenerated.
Analyzing
For every field that is indexed a special process called analyzing is employed. What it does can differ from field to field. For example, in a simple case it might just split the terms on whitespace and remove any punctuation, so Search Patterns would become two terms, Search and Patterns.
Index Structure
An inverted index, the structure search engines use, is similar to a map that contains search terms as keys and references to documents as values. This way the process of searching is just a lookup of the term in the index, a very fast process. These might be the terms that are indexed for our example documents:
Field    Term        Document Id
title    Apache      2
         Enterprise  2
         Patterns    1
         Search      1,2
         Server      2
         Solr        2
author   Callender   1
         Morville    1
         Pugh        2
         Smiley      2
A real index contains more information, like position information to enable phrase queries and frequencies for calculating the relevancy of a document for a certain search term. As we can see, the index holds a reference to the document. This document, which is also stored with the search index, doesn't necessarily have to be the same as our input document. You can determine for each field if you would like to keep the original content, which is normally controlled via an attribute named stored. As a general rule, you should store all the fields that you would like to display with the search results. When you are indexing lots of complete books and you don't need to display the full text on a results page, it might be better not to store it at all. You can still search it, as the terms are available in the index, but you can't access the original content.
More on Analyzing
Looking at the index structure above we can already imagine how the search process for a book might work; a toy sketch of this structure in plain Java follows below.
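To make the idea concrete, here is a toy inverted index in plain Java. It is not how Lucene is implemented (a real index also stores positions, frequencies and much more, as described above); it only illustrates the map-like lookup structure and a naive whitespace-plus-lowercase analyzing step:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class ToyInvertedIndex {

    // term -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // A very naive analyzer: lowercase the input and split it on whitespace.
    private String[] analyze(String text) {
        return text.toLowerCase().split("\\s+");
    }

    public void add(int documentId, String fieldValue) {
        for (String term : analyze(fieldValue)) {
            index.computeIfAbsent(term, key -> new TreeSet<Integer>()).add(documentId);
        }
    }

    // Searching is just a lookup of the analyzed term, which is what makes it fast.
    // For simplicity this assumes a single-term query.
    public Set<Integer> search(String userInput) {
        String term = analyze(userInput)[0];
        return index.getOrDefault(term, Collections.<Integer>emptySet());
    }
}

Indexing the two example titles with add(1, "Search Patterns") and add(2, "Apache Solr Enterprise Search Server") and then calling search("Search") returns the document ids 1 and 2, just like the title rows of the term dictionary above.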
The user enters a term, e.g. Solr; this term is then used to look up the documents that contain it. This works fine for cases when the user types the term correctly. A search for solr won't match in our current example. To mitigate those difficulties we can use the analyzing process already mentioned above. Besides the tokenization that splits the field value into tokens, we can do further preprocessing like removing tokens, adding tokens or modifying tokens (TokenFilter). For our book case it might at first be enough to do lowercasing on the incoming data. A field value Solr will then be stored as solr in the index. To enable the user to also search for Solr with an uppercase letter, we need to do analyzing for the query as well. Often it is the same process that is used for indexing, but there are also cases for different analyzers. The analyzing process not only depends on the content of the documents (field types, language of text fields) but also on your application. Take one common scenario: adding synonyms for terms to the index. You might think that you can just take a huge list of synonyms like WordNet and add those to your application. This might in fact degrade the search experience of your users, as there are too many false positives. Also, for certain terms of your users' domain, WordNet might not contain the correct synonyms at all.
Duplication
When designing the index structure there are two competing forces: often you either optimize for query speed or for index size. If you have a lot of data you probably need to take care that you only store data that you really need and only put terms in the index that are necessary for lookups. Oftentimes, for smaller datasets the index size doesn't matter that much and you can design your index for query performance. Let's look at an example that can make sense for both cases. In our book information system we would like to display an alphabetic navigation for the last name of the author. If the user clicks on A, all the books of authors starting with the letter A should be displayed. When using the Lucene query syntax you can do something like this with its wildcard support: just issue a query that contains the letter the user clicked and a trailing *, e.g. a*. Wildcard queries have become very fast with recent Lucene versions; nevertheless there still is a query-time impact. You can also choose another way. When indexing the data you can add another field that just stores the first letter of the name. This is what the relevant configuration might look like in Elasticsearch, but the concept is the same for Lucene and Solr:
"author": { "type": "multi_field", "fields": { "author" : { "type": "string" }, "letter" : { "type": "string", "analyzer": "single_char_analyzer" } } }
Under the hood, another term dictionary for the field author.letter will be created. For our example it will look like this:
Field          Term  Document Id
author.letter  C     1
               M     1
               P     2
               S     2
Now, instead of issuing a wildcard query on the author field, we can directly query the author.letter field with the letter. You can even build the navigation from all the terms in the index using techniques like faceting to extract all the available terms for a field from the index.
Conclusion
These are the basics of indexing data for a search engine. The inverted index structure makes searching really fast by moving some processing to the indexing phase. When we are not bound by any index size concerns, we can design our index for query performance and add additional fields that duplicate some of the data; a small sketch of that idea follows below.
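Here is a small plain-Java sketch of the first-letter trick, done on the client side before the document is sent to the search engine. The extra field name author_letter is made up for this example; with the Elasticsearch mapping shown above, the single_char_analyzer would do this work for you inside the engine instead:

import java.util.HashMap;
import java.util.Map;

public class BookDocumentBuilder {

    // Builds the document that is sent to the search engine, duplicating the
    // first letter of the author's last name into an extra navigation field.
    public static Map<String, Object> buildDocument(String title, String authorLastName) {
        Map<String, Object> document = new HashMap<>();
        document.put("title", title);
        document.put("author", authorLastName);
        // hypothetical extra field used for the alphabetic navigation
        document.put("author_letter", authorLastName.substring(0, 1).toLowerCase());
        return document;
    }
}

Querying author_letter for a is then a plain term lookup instead of a wildcard query, at the cost of a slightly bigger index, which is exactly the trade-off described above.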
This design for queries is what makes search engines similar to how lots of the NoSQL solutions are used. If you’d like to go deeper in the topics I recommend watching the talk What is in a Lucene index by Adrien Grand. He shows some of the concepts I have mentioned here (and a lot more) but also how those are implemented in Lucene.Reference: The Absolute Basics of Indexing Data from our JCG partner Florian Hopf at the Dev Time blog....

What Is Special About This? Significant Terms in Elasticsearch

I have been using Elasticsearch a few times now for doing analytics of twitter data for conferences. Popular hashtags and mentions that can be extraced using facets can show what is hot at a conference. But you can go even further and see what makes each hashtag special. In this post I would like to show you the significant terms aggregation that is available with Elasticsearch 1.1. I am using the tweets of last years Devoxx as those contain enough documents to play around. Aggregations Elasticsearch 1.0 introduced aggregations, that can be used similar to facets but are far more powerful. To see why those are useful let’s take a step back and look at facets, that are often used to extract statistical values and distributions. One useful example for facets is the total count of a hashtag: curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d' { "size": 0, "facets": { "hashtags": { "terms": { "field": "hashtag.text", "size": 10, "exclude": [ "devoxx", "dv13" ] } } } }' We request a facet called hashtags that uses the terms of hashtag.text and returns the 10 top values with the counts. We are excluding the hashtags devoxx and dv13 as those are very frequent. This is an excerpt of the result with the popular hashtags: "facets": { "hashtags": { "_type": "terms", "missing": 0, "total": 19219, "other": 17908, "terms": [ { "term": "dartlang", "count": 229 }, { "term": "java", "count": 216 }, { "term": "android", "count": 139 }, [...] Besides the statistical information we are retrieving here facets are often used for offering a refinement on search results. A common use is to display categories or features of products on eCommerce sites for example. Starting with Elasticsearch 1.0 you can have the same behaviour by using one of the new aggregations, in this case a terms aggregation: curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d' { "size" : 0, "aggs" : { "hashtags" : { "terms" : { "field" : "hashtag.text", "exclude" : "devoxx|dv13" } } } }' Instead of requesting facets we are now requesting a terms aggregation for the field hashtag.text. The exclusion is now based on a regular expression instead of a list. The result looks similar to the facet return values: "aggregations": { "hashtags": { "buckets": [ { "key": "dartlang", "doc_count": 229 }, { "key": "java", "doc_count": 216 }, { "key": "android", "doc_count": 139 }, [...] Each value forms a so called bucket that contains a key and a doc_count. But aggregations not only are a replacement for facets. Multiple aggregations can be combined to give more information on the distribution of different fields. For example we can see the users that used a certain hashtag by adding a second terms aggregation for the field user.screen_name: curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d' { "size" : 0, "aggs" : { "hashtags" : { "terms" : { "field" : "hashtag.text", "exclude" : "devoxx|dv13" }, "aggs" : { "hashtagusers" : { "terms" : { "field" : "user.screen_name" } } } } } }' Using this nested aggregation we now get a list of buckets for each hashtag. This list contains the users that used the hashtag. This is a short excerpt for the #scala hashtag: "key": "scala", "doc_count": 130, "hashtagusers": { "buckets": [ { "key": "jaceklaskowski", "doc_count": 74 }, { "key": "ManningBooks", "doc_count": 3 }, [...] We can see that there is one user that is responsible for half of the hashtags. A very dedicated user. Using aggregations we can get information that we were not able to get with facets alone. 
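For readers who drive Elasticsearch from Java rather than curl, here is a rough sketch of the first terms aggregation using the Java client. This is my own illustration, assuming the 1.x Java API that matches the Elasticsearch versions discussed here; builder and getter names have shifted between client versions (and the exclude pattern from the curl example is omitted for brevity), so treat it as a sketch rather than copy-and-paste code:

import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.client.Client;
import org.elasticsearch.search.aggregations.AggregationBuilders;
import org.elasticsearch.search.aggregations.bucket.terms.Terms;

public class HashtagAggregation {

    // Runs roughly the same terms aggregation as the curl example above.
    public static void printTopHashtags(Client client) {
        SearchResponse response = client.prepareSearch("devoxx")
                .setTypes("tweet")
                .setSize(0)
                .addAggregation(AggregationBuilders.terms("hashtags").field("hashtag.text"))
                .execute().actionGet();

        Terms hashtags = response.getAggregations().get("hashtags");
        for (Terms.Bucket bucket : hashtags.getBuckets()) {
            System.out.println(bucket.getKey() + ": " + bucket.getDocCount());
        }
    }
}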
If you are interested in more details about aggregations in general or the metrics aggregations I haven’t touched here, Chris Simpson has written a nice post on the feature, there is a nice visual one at the Found blog, another one here and of course there is the official documentation on the Elasticsearch website. Significant Terms Elasticsearch 1.1 contains a new aggregation, the significant terms aggregation. It allows you to do something very useful: For each bucket that is created you can see the terms that make this bucket special. Significant terms are calculated by comparing a foreground frequency (which is the frequency of the bucket you are interested in) with a background frequency (which for Elasticsearch 1.1 always is the frequency of the complete index). This means it will collect any results that have a high frequency for the current bucket but not for the complete index. For our example we can now check for the hashtags that are often used with a certain mention. This is not the same that can be done with the terms aggregation. The significant terms will only return those terms that are occuring often for a certain user but not as frequently for all users. This is what Mark Harwood calls the uncommonly common. curl -XGET "http://localhost:9200/devoxx/tweet/_search" -d' { "size" : 0, "aggs" : { "mentions" : { "terms" : { "field" : "mention.screen_name" }, "aggs" : { "uncommonhashtags" : { "significant_terms" : { "field" : "hashtag.text" } } } } } }' We request a normal terms aggregation for the mentioned users. Using a nested significant_terms aggregation we can see any hashtags that are often used with the mentioned user but not so often in the whole index. This is a snippet for the account of Brian Goetz: { "key": "BrianGoetz", "doc_count": 173, "uncommonhashtags": { "doc_count": 173, "buckets": [ { "key": "lambda", "doc_count": 13, "score": 1.8852860861614915, "bg_count": 33 }, { "key": "jdk8", "doc_count": 8, "score": 0.7193691737111163, "bg_count": 32 }, { "key": "java", "doc_count": 21, "score": 0.6601749139630457, "bg_count": 216 }, { "key": "performance", "doc_count": 4, "score": 0.6574225667412876, "bg_count": 9 }, { "key": "keynote", "doc_count": 9, "score": 0.5442707998673785, "bg_count": 52 }, [...] You can see that there are some tags that are targeted a lot at the keynote by Brian Goetz and are not that common for the whole index. Some more ideas what we could look at with the significant terms aggregation:Find users that are using a hashtag a lot. Find terms that are often used with a certain hashtag. Find terms that are used by a certain user. …Besides these impressive analytics feature significant terms can also be used for search applications. A useful example is given in the Elasticsearch documentation itself: If a user searches for “bird flu” automatically display a link to a search to H5N1 which should be very common in the result documents but not in the whole of the corpus. Conclusion With significant terms Elasticsearch has again added a feature that might very well offer surprising new applications and use cases for search. Not only is it important for analytics but it can also be used to improve classic search applications. Mark Harwood has collected some really interesting use cases on the Elasticsearch blog. If you’d like to read another post on the topic you can see this post at QBox-Blog that introduces significant terms as well as the percentile and cardinality aggregations.Reference: What Is Special About This? 
Significant Terms in Elasticsearch from our JCG partner Florian Hopf at the Dev Time blog....

JPA 2.1 Entity Graph – Part 1: Named entity graphs

Lazy loading was often an issue with JPA 2.0. You have to define at the entity if you want to use FetchType.LAZY (default) or FetchType.EAGER to load the relation and this mode is always used. FetchType.EAGER is only used if we want to always load the relation. FetchType.LAZY is used in almost all of the cases to get a well performing and scalable application. But this is not without drawbacks. If you have to use an element of the relation, you need to make sure, that the relation gets initialized within the transaction that load the entity from the database. This can be done by using a specific query that reads the entity and the required relations from the database. But this will result in use case specific queries. Another option is to access the relation within your business code which will result in an additional query for each relation. Both approaches are far from perfect. JPA 2.1 entity graphs are a better solution for it. The definition of an entity graph is independent of the query and defines which attributes to fetch from the database. An entity graph can be used as a fetch or a load graph. If a fetch graph is used, only the attributes specified by the entity graph will be treated as FetchType.EAGER. All other attributes will be lazy. If a load graph is used, all attributes that are not specified by the entity graph will keep their default fetch type. Lets have a look how to define and use an entity graph. The example entities For this example we will use an order with a list of items and each item has a product. All relations are lazy. The Order entity: @Entity @Table(name = "purchaseOrder") @NamedEntityGraph(name = "graph.Order.items", attributeNodes = @NamedAttributeNode(value = "items", subgraph = "items"), subgraphs = @NamedSubgraph(name = "items", attributeNodes = @NamedAttributeNode("product"))) public class Order implements Serializable {@Id @GeneratedValue(strategy = GenerationType.AUTO) @Column(name = "id", updatable = false, nullable = false) private Long id = null; @Version @Column(name = "version") private int version = 0;@Column private String orderNumber;@OneToMany(mappedBy = "order", fetch = FetchType.LAZY) private Set<OrderItem> items = new HashSet<OrderItem>();... The OrderItem entity: @Entity public class OrderItem implements Serializable {@Id @GeneratedValue(strategy = GenerationType.AUTO) @Column(name = "id", updatable = false, nullable = false) private Long id = null; @Version @Column(name = "version") private int version = 0;@Column private int quantity;@ManyToOne private Order order;@ManyToOne(fetch = FetchType.LAZY) private Product product; The Product entity: @Entity public class Product implements Serializable {@Id @GeneratedValue(strategy = GenerationType.AUTO) @Column(name = "id", updatable = false, nullable = false) private Long id = null; @Version @Column(name = "version") private int version = 0;@Column private String name; Named entity graph The definition of a named entity graph is done by the @NamedEntityGraph annotation at the entity. It defines a unique name and a list of attributes (the attributeNodes) that have be loaded. The following example shows the definition of the entity graph “graph.Order.items” which will load the list of OrderItem of an Order. @Entity @Table(name = "purchase_order") @NamedEntityGraph(name = "graph.Order.items", attributeNodes = @NamedAttributeNode("items")) public class Order implements Serializable {... Now that we have defined the entity graph, we can use it in a query. 
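Purely as an illustration, here is a minimal sketch of handing the same named graph to EntityManager.find() as a load graph, using the javax.persistence.loadgraph hint: attributes that are not part of the graph keep their mapped fetch type, as described at the beginning of this post. The repository wrapper class is made up for the example; the fetch graph variant, which this article focuses on, follows next:

import java.util.HashMap;
import java.util.Map;
import javax.persistence.EntityGraph;
import javax.persistence.EntityManager;

public class OrderRepository {

    private final EntityManager em;

    public OrderRepository(EntityManager em) {
        this.em = em;
    }

    // Load-graph semantics: the items (and their products) are fetched eagerly,
    // everything else keeps its default fetch type.
    public Order loadOrderWithItems(Long orderId) {
        EntityGraph<?> graph = em.getEntityGraph("graph.Order.items");
        Map<String, Object> hints = new HashMap<>();
        hints.put("javax.persistence.loadgraph", graph);
        return em.find(Order.class, orderId, hints);
    }
}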
Therefore we need to create a Map with query hints and set it as an additional parameter on a find or query method call. The following code snippet shows how to use a named entity graph as a fetch graph in a find statement. EntityGraph graph = this.em.getEntityGraph("graph.Order.items");Map hints = new HashMap(); hints.put("javax.persistence.fetchgraph", graph);return this.em.find(Order.class, orderId, hints); Named sub graph We used the entity graph to define the fetch operation of the Order entity. If we want to do the same for the OrderItem entity, we can do this with an entity sub graph. The definition of a named sub graph is similar to the definition of an named entity graph and can be referenced as an attributeNode. The following code snippets shows the definition of a sub graph to load the Product of each OrderItem. The defined entity graph will fetch an Order with all OrderItems and their Products. @Entity @Table(name = "purchase_order") @NamedEntityGraph(name = "graph.Order.items", attributeNodes = @NamedAttributeNode(value = "items", subgraph = "items"), subgraphs = @NamedSubgraph(name = "items", attributeNodes = @NamedAttributeNode("product"))) public class Order implements Serializable { What’s happening inside? OK, from a development point of view entity graphs are great. They are easy to use and we do not need to write additional code to avoid lazy loading issues. But what is happening inside? How many queries are send to the database? Lets have a look at the hibernate debug log. 2014-03-22 21:56:08,285 DEBUG [org.hibernate.loader.plan.build.spi.LoadPlanTreePrinter] (pool-2-thread-1) LoadPlan(entity=blog.thoughts.on.java.jpa21.entity.graph.model.Order) - Returns - EntityReturnImpl(entity=blog.thoughts.on.java.jpa21.entity.graph.model.Order, querySpaceUid=<gen:0>, path=blog.thoughts.on.java.jpa21.entity.graph.model.Order) - CollectionAttributeFetchImpl(collection=blog.thoughts.on.java.jpa21.entity.graph.model.Order.items, querySpaceUid=<gen:1>, path=blog.thoughts.on.java.jpa21.entity.graph.model.Order.items) - (collection element) CollectionFetchableElementEntityGraph(entity=blog.thoughts.on.java.jpa21.entity.graph.model.OrderItem, querySpaceUid=<gen:2>, path=blog.thoughts.on.java.jpa21.entity.graph.model.Order.items.<elements>) - EntityAttributeFetchImpl(entity=blog.thoughts.on.java.jpa21.entity.graph.model.Product, querySpaceUid=<gen:3>, path=blog.thoughts.on.java.jpa21.entity.graph.model.Order.items.<elements>.product) - QuerySpaces - EntityQuerySpaceImpl(uid=<gen:0>, entity=blog.thoughts.on.java.jpa21.entity.graph.model.Order) - SQL table alias mapping - order0_ - alias suffix - 0_ - suffixed key columns - {id1_2_0_} - JOIN (JoinDefinedByMetadata(items)) : <gen:0> -> <gen:1> - CollectionQuerySpaceImpl(uid=<gen:1>, collection=blog.thoughts.on.java.jpa21.entity.graph.model.Order.items) - SQL table alias mapping - items1_ - alias suffix - 1_ - suffixed key columns - {order_id4_2_1_} - entity-element alias suffix - 2_ - 2_entity-element suffixed key columns - id1_0_2_ - JOIN (JoinDefinedByMetadata(elements)) : <gen:1> -> <gen:2> - EntityQuerySpaceImpl(uid=<gen:2>, entity=blog.thoughts.on.java.jpa21.entity.graph.model.OrderItem) - SQL table alias mapping - items1_ - alias suffix - 2_ - suffixed key columns - {id1_0_2_} - JOIN (JoinDefinedByMetadata(product)) : <gen:2> -> <gen:3> - EntityQuerySpaceImpl(uid=<gen:3>, entity=blog.thoughts.on.java.jpa21.entity.graph.model.Product) - SQL table alias mapping - product2_ - alias suffix - 3_ - suffixed key columns - {id1_1_3_}2014-03-22 
21:56:08,285 DEBUG [org.hibernate.loader.entity.plan.EntityLoader] (pool-2-thread-1) Static select for entity blog.thoughts.on.java.jpa21.entity.graph.model.Order [NONE:-1]: select order0_.id as id1_2_0_, order0_.orderNumber as orderNum2_2_0_, order0_.version as version3_2_0_, items1_.order_id as order_id4_2_1_, items1_.id as id1_0_1_, items1_.id as id1_0_2_, items1_.order_id as order_id4_0_2_, items1_.product_id as product_5_0_2_, items1_.quantity as quantity2_0_2_, items1_.version as version3_0_2_, product2_.id as id1_1_3_, product2_.name as name2_1_3_, product2_.version as version3_1_3_ from purchase_order order0_ left outer join OrderItem items1_ on order0_.id=items1_.order_id left outer join Product product2_ on items1_.product_id=product2_.id where order0_.id=? The log shows that only one query is created. Hibernate uses the entity graph to create a load plan with all 3 entities (Order, OrderItem and Product) and load them with one query. Conclusion We defined an entity graph that tells the entity manager to fetch a graph of 3 related entities from the database (Order,OrderItem and Product). The definition and usage of the entity graph is query independent and results in only one select statement. So the main drawbacks of the JPA 2.0 approaches (mentioned in the beginning) are solved. From my point of view, the new entity graph feature is really great and can be a good way to solve lazy loading issues. What do you think about it?Reference: JPA 2.1 Entity Graph – Part 1: Named entity graphs from our JCG partner Thorben Janssen at the Some thoughts on Java (EE) blog....

Prefix and Suffix Matches in Solr

Search engines are all about looking up strings. The user enters a query term that is then retrieved from the inverted index. Sometimes a user is looking for a value that is only a substring of values in the index and the user might be interested in those matches as well. This is especially important for languages like German that contain compound words like Semmelknödel where Knödel means dumpling and Semmel specializes the kind. Wildcards For demoing the approaches I am using a very simple schema. Documents consist of a text field and an id. The configuration as well as a unit test is also vailable on Github. <fields> <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> <field name="text" type="text_general" indexed="true" stored="false"/> </fields> <uniqueKey>id</uniqueKey> <types> <fieldType name="string" class="solr.StrField" sortMissingLast="true" /><fieldType name="text_general" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> </types> One approach that is quite popular when doing prefix or suffix matches is to use wildcards when querying. This can be done programmatically but you need to take care that any user input is then escaped correctly. Suppose you have the term dumpling in the index and a user enters the term dump. If you want to make sure that the query term matches the document in the index you can just add a wildcard to the user query in the code of your application so the resulting query then would be dump*. Generally you should be careful when doing too much magic like this: if a user is in fact looking for documents containing the word dump she might not be interested in documents containing dumpling. You need to decide for yourself if you would like to have only matches the user is interested in (precision) or show the user as many probable matches as possible (recall). This heavily depends on the use cases for your application. You can increase the user experience a bit by boosting exact matches for your term. You need to create a more complicated query but this way documents containing an exact match will score higher: dump^2 OR dump* When creating a query like this you should also take care that the user can’t add terms that will make the query invalid. The SolrJ method escapeQueryChars of the class ClientUtils can be used to escape the user input. If you are now taking suffix matches into account the query can get quite complicated and creating a query like this on the client side is not for everyone. Depending on your application another approach can be the better solution: You can create another field containing NGrams during indexing. Prefix Matches with NGrams NGrams are substrings of your indexed terms that you can put in an additional field. Those substrings can then be used for lookups so there is no need for any wildcards. Using the (e)dismax handler you can automatically set a boost on your field that is used for exact matches so you get the same behaviour we have seen above. For prefix matches we can use the EdgeNGramFilter that is configured for an additional field: ... <field name="text_prefix" type="text_prefix" indexed="true" stored="false"/> ... <copyField source="text" dest="text_prefix"/> ... 
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.LowerCaseTokenizerFactory"/> <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.LowerCaseTokenizerFactory"/> </analyzer> </fieldType> During indexing time the text field value is copied to the text_prefix field and analyzed using the EdgeNGramFilter. Grams are created for any length between 3 and 15, starting from the front of the string. When indexing the term dumpling this would be:dum dump dumpl dumpli dumplin dumplingDuring query time the term is not split again so that the exact match for the substring can be used. As usual, the analyze view of the Solr admin backend can be a great help for seeing the analyzing process in action.Using the dismax handler you can now pass in the user query as it is and just advice it to search on your fields by adding the parameter qf=text^2,text_prefix. Suffix Matches With languages that have compound words it’s a common requirement to also do suffix matches. If a user queries for the term Knödel (dumpling) it is expected that documents that contain the termSemmelknödel also match. Using Solr versions up to 4.3 this is no problem. You can use the EdgeNGramFilterFactory to create grams starting from the back of the string. ... <field name="text_suffix" type="text_suffix" indexed="true" stored="false"/> ... <copyField source="text" dest="text_suffix"/> ... <fieldType name="text_suffix" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.StandardTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="back"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.KeywordTokenizerFactory"/> <filter class="solr.LowerCaseFilterFactory"/> </analyzer> </fieldType> ... This creates suffixes of the indexed term that also contains the term knödel so our query works. But, using more recent versions of Solr you will encounter a problem during indexing time: java.lang.IllegalArgumentException: Side.BACK is not supported anymore as of Lucene 4.4, use ReverseStringFilter up-front and afterward at org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter.(EdgeNGramTokenFilter.java:114) at org.apache.lucene.analysis.ngram.EdgeNGramTokenFilter.(EdgeNGramTokenFilter.java:149) at org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory.create(EdgeNGramFilterFactory.java:52) at org.apache.lucene.analysis.ngram.EdgeNGramFilterFactory.create(EdgeNGramFilterFactory.java:34) You can’t use the EdgeNGramFilterFactory anymore for suffix ngrams. But fortunately the stack trace also advices us how to fix the problem. We have to combine it with ReverseStringFilter: <fieldType name="text_suffix" class="solr.TextField" positionIncrementGap="100"> <analyzer type="index"> <tokenizer class="solr.LowerCaseTokenizerFactory"/> <filter class="solr.ReverseStringFilterFactory"/> <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/> <filter class="solr.ReverseStringFilterFactory"/> </analyzer> <analyzer type="query"> <tokenizer class="solr.LowerCaseTokenizerFactory"/> </analyzer> </fieldType> This will now yield the same results as before. 
Conclusion Whether you are going for manipulating your query by adding wildcards or if you should be using the NGram approach heavily depends on your use case and is also a matter of taste. Personally I am using NGrams most of the time as disk space normally isn’t a concern for the kind of projects I am working on. Wildcard search has become a lot faster in Lucene 4 so I doubt there is a real benefit there anymore. Nevertheless I tend to do as much processing I can during indexing time.Reference: Prefix and Suffix Matches in Solr from our JCG partner Florian Hopf at the Dev Time blog....

Generate your JAXB classes in a second with xjc

Since JAXB is part of the JDK, it is one of the most often used frameworks to process XML documents. It provides a comfortable way to retrieve and store data from XML documents to Java classes. As nearly every Java developer has already used JAXB, I will not explain the different JAXB annotations. Instead I will focus on a little command line tool called xjc and show you how to generate your binding classes based on an existing XSD schema description. Implementing all binding classes for an existing XML interface can be a time consuming and tedious task. But the good news is, you do not need to do it. If you have a XSD schema description, you can use the xjc binding compiler to create the required classes. And even better, xjc is part of the JDK. So there is no need for external tools and you should always have it at hand if required. Using xjc As you can see in the snippet below, xjc support lots of options. The most important are:-d to define where the generated classes shall be stored in the file system, -p to define the package to be used and of course -help if you need anything else.Usage: xjc [-options ...] <schema file/URL/dir/jar> ... [-b <bindinfo>] ... If dir is specified, all schema files in it will be compiled. If jar is specified, /META-INF/sun-jaxb.episode binding file will be compiled. Options: -nv : do not perform strict validation of the input schema(s) -extension : allow vendor extensions - do not strictly follow the Compatibility Rules and App E.2 from the JAXB Spec -b <file/dir> : specify external bindings files (each <file> must have its own -b) If a directory is given, **/*.xjb is searched -d <dir> : generated files will go into this directory -p <pkg> : specifies the target package -httpproxy <proxy> : set HTTP/HTTPS proxy. Format is [user[:password]@]proxyHost:proxyPort -httpproxyfile <f> : Works like -httpproxy but takes the argument in a file to protect password -classpath <arg> : specify where to find user class files -catalog <file> : specify catalog files to resolve external entity references support TR9401, XCatalog, and OASIS XML Catalog format. -readOnly : generated files will be in read-only mode -npa : suppress generation of package level annotations (**/package-info.java) -no-header : suppress generation of a file header with timestamp -target (2.0|2.1) : behave like XJC 2.0 or 2.1 and generate code that doesnt use any 2.2 features. 
-encoding <encoding> : specify character encoding for generated source files -enableIntrospection : enable correct generation of Boolean getters/setters to enable Bean Introspection apis -contentForWildcard : generates content property for types with multiple xs:any derived elements -xmlschema : treat input as W3C XML Schema (default) -relaxng : treat input as RELAX NG (experimental,unsupported) -relaxng-compact : treat input as RELAX NG compact syntax (experimental,unsupported) -dtd : treat input as XML DTD (experimental,unsupported) -wsdl : treat input as WSDL and compile schemas inside it (experimental,unsupported) -verbose : be extra verbose -quiet : suppress compiler output -help : display this help message -version : display version information -fullversion : display full version informationExtensions: -Xinject-code : inject specified Java code fragments into the generated code -Xlocator : enable source location support for generated code -Xsync-methods : generate accessor methods with the 'synchronized' keyword -mark-generated : mark the generated code as @javax.annotation.Generated -episode <FILE> : generate the episode file for separate compilation Example OK, so let’s have a look at an example. We will use the following XSD schema definition and xjc to generate the classes Author and Book with the described properties and required JAXB annotations. <?xml version="1.0" encoding="UTF-8" standalone="yes"?> <xs:schema version="1.0" xmlns:xs="http://www.w3.org/2001/XMLSchema"><xs:element name="author" type="author"/><xs:element name="book" type="book"/><xs:complexType name="author"> <xs:sequence> <xs:element name="firstName" type="xs:string" minOccurs="0"/> <xs:element name="lastName" type="xs:string" minOccurs="0"/> </xs:sequence> </xs:complexType><xs:complexType name="book"> <xs:sequence> <xs:element ref="author" minOccurs="0"/> <xs:element name="pages" type="xs:int"/> <xs:element name="publicationDate" type="xs:dateTime" minOccurs="0"/> <xs:element name="title" type="xs:string" minOccurs="0"/> </xs:sequence> </xs:complexType> </xs:schema> The following command calls xjc and provides the target directory for the generated classes, the package and the XSD schema file. xjc -d src -p blog.thoughts.on.java schema.xsdparsing a schema... compiling a schema... blog\thoughts\on\java\Author.java blog\thoughts\on\java\Book.java blog\thoughts\on\java\ObjectFactory.java OK, the operation completed successfully and we now have 3 generated classes in our src directory. That might be one more than some have expected. So lets have a look at each of them. The classes Author and Book look like expected. They contain the properties described in the XSD schema and the required JAXB annotations. // // This file was generated by the JavaTM Architecture for XML Binding(JAXB) Reference Implementation, v2.2.4-2 // See <a href="http://java.sun.com/xml/jaxb">http://java.sun.com/xml/jaxb</a> // Any modifications to this file will be lost upon recompilation of the source schema. // Generated on: 2014.01.13 at 07:38:24 PM CET //package blog.thoughts.on.java;import javax.xml.bind.annotation.XmlAccessType; import javax.xml.bind.annotation.XmlAccessorType; import javax.xml.bind.annotation.XmlType;/**  * <p>Java class for author complex type.  *  * <p>The following schema fragment specifies the expected content contained within this class.  
*  * <pre>  * <complexType name="author">  * <complexContent>  * <restriction base="{http://www.w3.org/2001/XMLSchema}anyType">  * <sequence>  * <element name="firstName" type="{http://www.w3.org/2001/XMLSchema}string" minOccurs="0"/>  * <element name="lastName" type="{http://www.w3.org/2001/XMLSchema}string" minOccurs="0"/>  * </sequence>  * </restriction>  * </complexContent>  * </complexType>  * </pre>  *  *  */ @XmlAccessorType(XmlAccessType.FIELD) @XmlType(name = "author", propOrder = { "firstName", "lastName" }) public class Author {protected String firstName; protected String lastName;/**      * Gets the value of the firstName property.      *      * @return      * possible object is      * {@link String }      *      */ public String getFirstName() { return firstName; }/**      * Sets the value of the firstName property.      *      * @param value      * allowed object is      * {@link String }      *      */ public void setFirstName(String value) { this.firstName = value; }/**      * Gets the value of the lastName property.      *      * @return      * possible object is      * {@link String }      *      */ public String getLastName() { return lastName; }/**      * Sets the value of the lastName property.      *      * @param value      * allowed object is      * {@link String }      *      */ public void setLastName(String value) { this.lastName = value; }} // // This file was generated by the JavaTM Architecture for XML Binding(JAXB) Reference Implementation, v2.2.4-2 // See <a href="http://java.sun.com/xml/jaxb">http://java.sun.com/xml/jaxb</a> // Any modifications to this file will be lost upon recompilation of the source schema. // Generated on: 2014.01.13 at 07:38:24 PM CET //package blog.thoughts.on.java;import javax.xml.bind.annotation.XmlAccessType; import javax.xml.bind.annotation.XmlAccessorType; import javax.xml.bind.annotation.XmlSchemaType; import javax.xml.bind.annotation.XmlType; import javax.xml.datatype.XMLGregorianCalendar;/**  * <p>Java class for book complex type.  *  * <p>The following schema fragment specifies the expected content contained within this class.  *  * <pre>  * <complexType name="book">  * <complexContent>  * <restriction base="{http://www.w3.org/2001/XMLSchema}anyType">  * <sequence>  * <element ref="{}author" minOccurs="0"/>  * <element name="pages" type="{http://www.w3.org/2001/XMLSchema}int"/>  * <element name="publicationDate" type="{http://www.w3.org/2001/XMLSchema}dateTime" minOccurs="0"/>  * <element name="title" type="{http://www.w3.org/2001/XMLSchema}string" minOccurs="0"/>  * </sequence>  * </restriction>  * </complexContent>  * </complexType>  * </pre>  *  *  */ @XmlAccessorType(XmlAccessType.FIELD) @XmlType(name = "book", propOrder = { "author", "pages", "publicationDate", "title" }) public class Book {protected Author author; protected int pages; @XmlSchemaType(name = "dateTime") protected XMLGregorianCalendar publicationDate; protected String title;/**      * Gets the value of the author property.      *      * @return      * possible object is      * {@link Author }      *      */ public Author getAuthor() { return author; }/**      * Sets the value of the author property.      *      * @param value      * allowed object is      * {@link Author }      *      */ public void setAuthor(Author value) { this.author = value; }/**      * Gets the value of the pages property.      *      */ public int getPages() { return pages; }/**      * Sets the value of the pages property.      
*      */ public void setPages(int value) { this.pages = value; }/**      * Gets the value of the publicationDate property.      *      * @return      * possible object is      * {@link XMLGregorianCalendar }      *      */ public XMLGregorianCalendar getPublicationDate() { return publicationDate; }/**      * Sets the value of the publicationDate property.      *      * @param value      * allowed object is      * {@link XMLGregorianCalendar }      *      */ public void setPublicationDate(XMLGregorianCalendar value) { this.publicationDate = value; }/**      * Gets the value of the title property.      *      * @return      * possible object is      * {@link String }      *      */ public String getTitle() { return title; }/**      * Sets the value of the title property.      *      * @param value      * allowed object is      * {@link String }      *      */ public void setTitle(String value) { this.title = value; }} The third and maybe unexpected class is the class ObjectFactory. It contains factory methods for each generated class or interface. This can be really useful if you need to create JAXBElement representations of your objects. // // This file was generated by the JavaTM Architecture for XML Binding(JAXB) Reference Implementation, v2.2.4-2 // See <a href="http://java.sun.com/xml/jaxb">http://java.sun.com/xml/jaxb</a> // Any modifications to this file will be lost upon recompilation of the source schema. // Generated on: 2014.01.13 at 07:38:24 PM CET //package blog.thoughts.on.java;import javax.xml.bind.JAXBElement; import javax.xml.bind.annotation.XmlElementDecl; import javax.xml.bind.annotation.XmlRegistry; import javax.xml.namespace.QName;/**  * This object contains factory methods for each  * Java content interface and Java element interface  * generated in the blog.thoughts.on.java package.  * <p>An ObjectFactory allows you to programatically  * construct new instances of the Java representation  * for XML content. The Java representation of XML  * content can consist of schema derived interfaces  * and classes representing the binding of schema  * type definitions, element declarations and model  * groups. Factory methods for each of these are  * provided in this class.  *  */ @XmlRegistry public class ObjectFactory {private final static QName _Author_QNAME = new QName("", "author"); private final static QName _Book_QNAME = new QName("", "book");/**      * Create a new ObjectFactory that can be used to create new instances of schema derived classes for package: blog.thoughts.on.java      *      */ public ObjectFactory() { }/**      * Create an instance of {@link Author }      *      */ public Author createAuthor() { return new Author(); }/**      * Create an instance of {@link Book }      *      */ public Book createBook() { return new Book(); }/**      * Create an instance of {@link JAXBElement }{@code <}{@link Author }{@code >}}      *      */ @XmlElementDecl(namespace = "", name = "author") public JAXBElement<Author> createAuthor(Author value) { return new JAXBElement<Author>(_Author_QNAME, Author.class, null, value); }/**      * Create an instance of {@link JAXBElement }{@code <}{@link Book }{@code >}}      *      */ @XmlElementDecl(namespace = "", name = "book") public JAXBElement<Book> createBook(Book value) { return new JAXBElement<Book>(_Book_QNAME, Book.class, null, value); }} Conclusion We had a look at xjc and used it to generated the required binding classes for an existing XSD schema definition. 
xjc generated a class for each complex type and an additional factory class to ease the creation of JAXBElement representations. What do you think about xjc and the generated code? Please leave me a comment and tell me about it. I think this tool generates very clean code and saves a lot of time. In most of the cases the generated code can be directly added to a project. But even if this is not the case, it is much faster to do some refactoring based on the generated code than doing everything myself.Reference: Generate your JAXB classes in a second with xjc from our JCG partner Thorben Janssen at the Some thoughts on Java (EE) blog....

10 ideas to improve Eclipse IDE usability

A few years ago, we had a mini IDE war inside our office. It happened between Eclipse and Netbeans supporters. Fortunately, we did not have an IntelliJ supporter. Each side tried their best to convince people from the other side to use their favourite IDE. In that war, I was the hardcore Eclipse supporter and I had a hard time fighting the Netbeans team. Not as I expected, we ended up on the defensive more often than on the attack. Looking at what Netbeans offers, it is quite interesting for me to see how far Netbeans has improved and how Eclipse is getting slower and more difficult to use nowadays than in the past. Let me share my experience from that mini war and my point of view on how Eclipse should be improved to keep its competitive edge.
What is the benefit of using Netbeans
For a long time, and even up to now, Eclipse has been the dominant IDE in the market. But this was not the case before Eclipse 3.0, which was released in 2004. From there, Eclipse simply dominated the Java IDE market for the next decade. Even the C/C++ and PHP folks built their IDEs as plugins on top of Eclipse. However, things are getting less rosy now. Eclipse is still good, but not that much better than its competitors any more. IntelliJ is a commercial IDE and we will not compare it to Eclipse in this article. The other, more serious competitor is Netbeans. I myself tried Netbeans, compared it to Eclipse 3.0 and never went back. But the Netbeans that Eclipse is fighting now and the Netbeans that I tried back then are simply too different. It is much faster, more stable, more configurable and easier to use than the one I knew. The key points in favour of Netbeans are its usability and the first-class support from Sun/Oracle for new Java features. That may not be very appealing to an Eclipse veteran like myself, but for a starter it is a great advantage. Like any other war in the technology world, Eclipse and Netbeans have been copying each other's features for so long that it is very hard to find something that one IDE can do and the other cannot. When choosing a preferred IDE, what really matters is how things are done rather than what can be done. Regarding usability, I feel Eclipse has failed to keep the competitive edge it once had over Netbeans. The Eclipse interface is still very flexible and easy to customize, but the recent plugins are not so well implemented and are error prone (I am thinking of the Maven and Git support). The Eclipse marketplace is still great, but lots of plugins are not so well tested and may create performance or stability issues. Moreover, a careless release (Juno 4.0) made Eclipse slow and caused it to hang often. I do not recall restarting Eclipse in the past, but that happens to me once or twice a month now (I am using Eclipse Kepler 4.3). Plus, Eclipse has not fixed some of the discomforts that I have encountered from the early days, and I still need to bring along all my favourite plugins to help ease the pain.
What I expect from Eclipse
There are lots of things I want Eclipse to have but never see in the release notes. Let me share some thoughts:
Warn me before I open a big file rather than hanging up
I guess this has happened to most of us. My preferred view is the Package Explorer rather than the Project Explorer or Navigator, but it does not matter. When I search for a file with Ctrl + Shift + R or left-click on a file in the Explorer, Eclipse will just open the file in the Editor. What if the file is a huge XML file? Eclipse hangs and shows me the content one minute later, or I get frustrated and kill the process.
Both are bad outcomes.
Have a single Import/Export configuration endpoint
For those who do not know, Eclipse allows you to import/export the Eclipse configuration to a file. When I first download a new release of Eclipse, there are a few steps that I always do:
- Import -> Install -> From Existing Installation: this step helps me to copy all my favourite features and plugins from the old Eclipse to the new Eclipse.
- Modify Xms, Xmx in eclipse.ini
- Import Formatter (from exported file)
- Import Shortkey (from exported file)
- Configure Installed JREs to point to the local JDK
- Configure Server Runtime and create Server
- Disable useless Validators
- Register svn repository
- And some other minor tasks that I cannot remember now…
Why not make it as simple as a Chrome installation, where the new Eclipse can copy whatever settings I have made in the old Eclipse?
Stop building or refreshing the whole workspace
It has happened to me and some of the folks here that I have a hundred projects in my workspace. The common practice in our workplace is one workspace per repository. To manage things, we create more than 10 Working Sets and constantly switch among them when moving to a new task. For us, having Eclipse building, refreshing and scanning the whole workspace is so painful that we either keep closing projects or sometimes create a smaller workspace. But could Eclipse allow me to configure scanning of a Working Set rather than the whole Workspace? The Working Set is all I care about. Plus, sometimes Ctrl + Shift + R and Ctrl + Shift + T do not reflect my active Working Set, and not many people notice the small arrow at the top right of the dialogue to select this.
Stop indexing Git and Maven repositories by default
Eclipse is nice; it helps us by indexing Maven and Git repositories so that we can work faster later. But I do not open Eclipse to work with Maven or Git all the time. Could these plugins be less resource consuming and let me trigger the indexing process when I want?
Give me the process id for any server or application that I have launched
This must be a very simple task, but I do not know why Eclipse doesn't do it. It would be even more helpful if Eclipse could show the memory usage of each process and of Eclipse itself. I would like to have a new view that tracks all running processes (similar to the Debug View) but with process id and memory usage.
Implement Open File Explorer and Console here
I bet that most of us use the console often when we are coding, whether for Vi, Maven or Git commands. However, Eclipse does not give us this feature and we need to install an additional plugin to get it.
Improve the Editor
I often install the AnyEdit plugin because it offers many important features that I find hard to live without, like converting, sorting,… These features are so crucial that they should be packaged with the Eclipse distribution rather than in a plugin.
Stop showing nonsense warnings and suggestions
Have any of you built a project without a single yellow warning? I did that in the past, but less often now. For example, Eclipse asked me to introduce a serialVersionUID because my Exception implements the Serializable interface. But seriously, how many Java classes implement Serializable? Do we need to do this for every one of them?
Provide me with shortcut keys for the refactoring tools that I always use
Some folks like to type and see IDE dependence as a sin. I am on the opposite side. Whatever can be done by the IDE should be done by the IDE. The developer is there to think rather than to type.
It means that I use lots of Eclipse shortcut keys and refactoring tools like:
- Right click -> Surround With
- Right click -> Refactor
- Right click -> Source
Some of the most common shortcut keys I use every day are Ctrl + O, Alt + Shift + L, Alt + Shift + M, Ctrl + Shift + F,… and I would like to have more. Eclipse allows me to define my own shortcut keys, but I would like them to be part of the Eclipse distribution so that I can use them on other boxes as well. From my personal experience, some tools that are worth having a shortcut key are:
- Generate Getters and Setters
- Generate Constructor using Fields
- Generate toString()
- Extract Interface
- …
I also want Eclipse to be more aggressive in defining Templates for the Java Editor. Most of us are familiar with well-known templates like sysout, syserr and switch; why don't we have more for log, toString(), hashCode(), while true,…
Make the error messages easier for beginners to read
I have answered many Eclipse questions regarding some common errors because developers cannot figure out what the error message means. Let me give a few examples. A developer uses the command “mvn eclipse:eclipse”. This command generates the project classpath file and effectively disables Workspace Resolution. Later, he wants to fix things with Update Project Configuration and encounters an error like the one below (if you want to understand this further, take a look at the last part of my Maven series). Who understands that? The real problem is that the m2e plugin fails to recognize some entries populated by Maven, and the solution is to delete all the Eclipse files and import the Maven project again. Another well-known issue is the error message in the pom editor when m2e does not recognize a Maven plugin. It is very confusing for a newbie to see this kind of error.
Conclusion
These are my thoughts and I wish Eclipse would grant my wishes some day. Do you have anything to share with us about how you want Eclipse to improve? Reference: 10 ideas to improve Eclipse IDE usability from our JCG partner Tony Nguyen at the Developers Corner blog....