At least not always... this should be a pretty straightforward fact for any software developer, but I must confess that it bit me as I was implementing the Oracle MOOC homework in Scala for Week 3. Maybe I got too used to a particular implementation of a List, the ArrayList, which access any of its elements by an index in constant time. But it can only do this due to its particular implementation, we have to be careful when choosing our data structures.
Basically, I realised that there was a problem with my Scala implementation, but I couldn’t see what was it. This is what was going on…
On these two code snippets, we can see the Java version of the initialisation of sourceWords, which will hold a list of words read from a file, and the creation of another list of words with a given size, where all the words come from randomly picked words in sourceWords.
The same had to be implemented in Scala, and I did it like this:
The Java and Scala implementations read pretty much the same, but there was a problem when building the random list with 2.000 elements from sourceWords in Scala, when comparing the time it takes with its Java counterpart:
Scala -> Creation of the list of words took 4146 ms
Java -> Creation of the list of words took 16 ms
Obviously, this didn’t make me very happy. It was Chris Loy who pointed out the possible cause; I just couldn’t see it. At this point, it’s probably a useful advice to recommend you having a look at this website (from which I stole the next image), in which the complexity of some of the classic operations we can do on different data structures is summarised, using Big O notation. If you don’t know what’s that, apart from reading this Wikipedia link you should probably do one of these two algorithms courses.
As we can see in the previous image, lists have a linear access time, rather than a constant one. Retrieving 2000 elements from a list, referencing 2000 random indexes from it is slowing things down quite a bit. If that was the case, though, then the Scala and the Java implementations should be slow. But it turns out that (even if “There are no guarantees on the type, mutability, serializability, or thread-safety of the List returned”, as stated in the Java 8 Collectors#toList javadoc) my Java SDK implementation is actually using an ArrayList when collecting the elements of the Stream. And, checking the Javadoc for the ArrayList, it shouldn’t come as a surprise to read that “The size, isEmpty, get, set, iterator, and listIterator operations run in constant time.” This explains why the Java implementation is so quick. In order to fix the Scala one, I just changed the data structure in which we keep the sample words from being a List to be an Array.
The client only cares about getting a List when calling createList, and that method is still returning a List:
The results of this little tweak:
Scala -> Creation of the list of words took 1 ms
Java -> Creation of the list of words took 8 ms
It’s a bit unsettling to think that I may have made plenty of mistakes like this in the past, but I guess they’re only mistakes when they lead to performance issues.
Know your Big O, gentlemen, know your Big O.