Oracle MOOC on Lambdas and Streams: Week 3 homework in Java and Scala

Part 1: A file called “words” is provided containing a large collection of words. A template file, RandomWords.java, is also provided. In the constructor you need to read all the words (which are one per line) from the source file into a list (remember to use a stream to do this). You also need to write the body of the createList() method. This generates a list of the size specified as a parameter selecting words at random from the list read in the constructor. HINT: You can use the ints() method of the Random class, which returns a stream of random integers. You can specify the size of the stream using a parameter.

Nothing really new in the constructor, we’ve seen this in previous posts.

We’re making use of one of the Random#ints versions, which allows us to generate an specialized type of Stream, an IntStream, with listSize elements with random values in between 0 (inclusive) and the size of the sourceWords list  (exclusive). These will be the indexes of the elements of the sourceWords list which will serve as the random words requested for this exercise. When we map from an IntStream, though, if we’re transforming its elements into objects, we can’t use the map method, but the mapToObj one. Extracted the random words, we collect them into a List as usual.

Nothing new in the Scala version of the constructor, which will generate the list of words.

Through the GenTraversableFactory#fill method, we generate a sequence of listSize elements, with values being generated by Random.nextInt(sourceWords.size), which will provide an integer between 0 (inclusive) and the size of the sourceWords list (exclusive) every time it’s invoked. Fill will invoke it listSize time to create the Sequence with the random integers to serve as indexes of the random words from the sourceWords list. Mapping and retrieving a list is done as we saw before.

Part 2: In order to provide a relatively compute-intensive task we will calculate the Levenshtein distance between two strings… a source file, Levenshtein.java containing a lev() function is provided that will calculate the distance for you. A second template file, Lesson3.java, is provided. This contains the code necessary to measure the time taken to execute the code of the get() method of a Supplier (as you will see in the main() method this is simple to do with a lambda expression). Your task is to write the necessary code in the computeLevenshtein method to calculate the distances between each pair of strings in the wordList using the streams API. You will need to process this sequentially or in parallel based on the flag passed as a parameter. Try modifying the size of the wordList to see what impact this has on the performance, and in particular, how the difference between sequential and parallel performance is affected by the input size.

There are many possible ways to solve this exercise. This is just one of the possible solutions, inspired in one of the proposed in the @SVQJUG mailing list. Firstly we create a Supplier which will generate a Stream of Strings every time we invoked its get method. Being Supplier a functional interface, a lambda expression can be used to fill its body, which we do, although wrapped in a ternary operator, depending on the flag which will mark if the stream to be generated should be a parallel one or not.
We generated two streams around the same wordList list, meaning we will iterate over the elements of the list twice. Since we’re nesting these iterations, we make sure we process every element with every element of the list, including itself. For every couple of words, then, we get their Levenshtein distance and store it in the distances array position referencing the indexes of the elements in the original list. The second map, the one transforming the inner Stream, needs to mapToInt, since we’ll be generating an IntStream, which will allow us to obtain an array of integers directly when collecting its elements through the terminal operation toArray(). The outer Stream, since it will contain arrays of integers, has to go though the typical collection of its elements into a List before this list can be converted into an array (bidimensional, since it will contain arrays) which turns up to be what we wanted to obtained in the first place, being able to be assigned into our already declared variable distances to capture the final result.

Levenshtein for 1000 elements:

Java Sequential Levenshtein
Sequential took 1585ms
Sequential took 814ms
Sequential took 781ms
Sequential took 652ms
Sequential took 638ms
Java Parallel Levenshtein
Parallel took 347ms
Parallel took 206ms
Parallel took 186ms
Parallel took 167ms
Parallel took 173ms

We can clearly see an increase in performance when using parallel streams.

We will make our List to be treated in parallel though the invocation of the Parallelizable#par method, which belongs to the Parallelizable trait, inherited by the List class. We’ve extracted this operation into a different method.
The rest of the process is exactly like the Java version, although there is a lot less noise, cause by the simpler Scala API’s. We about the collectors and the mapToInt Stream related operations, since Streams are not needed, nor exist, in Scala.

Scala for 1000 elements:

Scala Sequential Levenshtein
Sequential took 621ms
Sequential took 650ms
Sequential took 602ms
Sequential took 525ms
Sequential took 682ms
Scala Parallel Levenshtein
Parallel took 189ms
Parallel took 183ms
Parallel took 221ms
Parallel took 144ms
Parallel took 152ms

In Scala, parallel collections proof to be way quicker also. Times are very similar to the Java version.

Part 3: As another demonstration of the differences in sequential and parallel stream processing there is a second method, processWords() for you to implement. Take the list of strings passed and process these using a sequential or parallel stream as required to generate a new list. Start by simply sorting the strings then experiment with adding things like mapping to lower or upper case, filtering out certain words (such as those beginning with a certain letter). See what impact adding distinct() to the stream has. For each of these vary the size f the input list to see how this affects the performance. Find the threshold below which sequential will give a faster answer than parallel.

In my case, I decided to do the next operations with the words in the Stream: sort the stream first, by natural order, then make all its elements to go lower case, then upper case, and then removing all the elements which start with A, M and Z or even the ones would start (none, since we made the elements upper case) with a, m or z.
The difference between having distinct() or not is very well explained on this API note:

Preserving stability for distinct() in parallel pipelines is relatively expensive (requires that the operation act as a full barrier, with substantial buffering overhead), and stability is often not needed. Using an unordered stream source (such as generate(Supplier)) or removing the ordering constraint with BaseStream.unordered() may result in significantly more efficient execution for distinct() in parallel pipelines, if the semantics of your situation permit. If consistency with encounter order is required, and you are experiencing poor performance or memory utilization with distinct() in parallel pipelines, switching to sequential execution with BaseStream.sequential() may improve performance.

In our case, since we’re asked to start ordering the strings, we can infer how using a parallel Stream won’t give us much if distinct() is present. Let’s see numbers for a 30000 elements wordList:

Java Sequential process words with Distinct
Sequential took 115ms
Sequential took 70ms
Sequential took 70ms
Sequential took 49ms
Sequential took 72ms
Java Parallel process words with distinct
Parallel took 141ms
Parallel took 22ms
Parallel took 18ms
Parallel took 15ms
Parallel took 97ms
Java Sequential process words without Distinct
Sequential took 28ms
Sequential took 15ms
Sequential took 15ms
Sequential took 16ms
Sequential took 14ms
Java Parallel process words without distinct
Parallel took 8ms
Parallel took 6ms
Parallel took 6ms
Parallel took 6ms
Parallel took 8ms

Clearly, a parallel Stream is quicker than a sequential, but the difference is bigger when we’re not doing distinct.

By trying different sizes for the wordList, I found the threshold to be 3400 elements. Below that number, sequential implementation of the Stream will almost always be quicker than parallel, so it’s not worth making the Stream parallel unless you need to do this particular processing on collections with less than 3400 elements. At least on my hardware.

The Scala version is straightforward, really. Resuls for a 30000 elements wordList size, with and without distinct():

Scala Sequential process words with distinct
Sequential took 196ms
Sequential took 31ms
Sequential took 17ms
Sequential took 17ms
Sequential took 65ms
Scala Parallel process words with distinct
Parallel took 44ms
Parallel took 29ms
Parallel took 34ms
Parallel took 33ms
Parallel took 25ms
Scala Sequential process words without distinct
Sequential took 14ms
Sequential took 14ms
Sequential took 55ms
Sequential took 15ms
Sequential took 16ms
Scala Parallel process words without distinct
Parallel took 22ms
Parallel took 20ms
Parallel took 26ms
Parallel took 53ms
Parallel took 80ms

As in the Java implementation, the Scala parallel implementation is quicker than the sequential.

Published by

juanmacias

Software developer with a passion for Scala, football, tennis, photography, travelling, playing the guitar, learning languages (real ones), and hacking around.

One thought on “Oracle MOOC on Lambdas and Streams: Week 3 homework in Java and Scala”

Leave a Reply

Your email address will not be published. Required fields are marked *