Part 1: A file called “words” is provided containing a large collection of words. A template file, RandomWords.java, is also provided. In the constructor you need to read all the words (which are one per line) from the source file into a list (remember to use a stream to do this). You also need to write the body of the createList() method. This generates a list of the size specified as a parameter selecting words at random from the list read in the constructor. HINT: You can use the ints() method of the Random class, which returns a stream of random integers. You can specify the size of the stream using a parameter.
Nothing really new in the constructor, we’ve seen this in previous posts.
We’re making use of one of the Random#ints versions, which allows us to generate an specialized type of Stream, an IntStream, with listSize elements with random values in between 0 (inclusive) and the size of the sourceWords list (exclusive). These will be the indexes of the elements of the sourceWords list which will serve as the random words requested for this exercise. When we map from an IntStream, though, if we’re transforming its elements into objects, we can’t use the map method, but the mapToObj one. Extracted the random words, we collect them into a List as usual.
Nothing new in the Scala version of the constructor, which will generate the list of words.
Through the GenTraversableFactory#fill method, we generate a sequence of listSize elements, with values being generated by Random.nextInt(sourceWords.size), which will provide an integer between 0 (inclusive) and the size of the sourceWords list (exclusive) every time it’s invoked. Fill will invoke it listSize time to create the Sequence with the random integers to serve as indexes of the random words from the sourceWords list. Mapping and retrieving a list is done as we saw before.
Part 2: In order to provide a relatively compute-intensive task we will calculate the Levenshtein distance between two strings… a source file, Levenshtein.java containing a lev() function is provided that will calculate the distance for you. A second template file, Lesson3.java, is provided. This contains the code necessary to measure the time taken to execute the code of the get() method of a Supplier (as you will see in the main() method this is simple to do with a lambda expression). Your task is to write the necessary code in the computeLevenshtein method to calculate the distances between each pair of strings in the wordList using the streams API. You will need to process this sequentially or in parallel based on the flag passed as a parameter. Try modifying the size of the wordList to see what impact this has on the performance, and in particular, how the difference between sequential and parallel performance is affected by the input size.
There are many possible ways to solve this exercise. This is just one of the possible solutions, inspired in one of the proposed in the @SVQJUG mailing list. Firstly we create a Supplier which will generate a Stream of Strings every time we invoked its get method. Being Supplier a functional interface, a lambda expression can be used to fill its body, which we do, although wrapped in a ternary operator, depending on the flag which will mark if the stream to be generated should be a parallel one or not.
We generated two streams around the same wordList list, meaning we will iterate over the elements of the list twice. Since we’re nesting these iterations, we make sure we process every element with every element of the list, including itself. For every couple of words, then, we get their Levenshtein distance and store it in the distances array position referencing the indexes of the elements in the original list. The second map, the one transforming the inner Stream, needs to mapToInt, since we’ll be generating an IntStream, which will allow us to obtain an array of integers directly when collecting its elements through the terminal operation toArray(). The outer Stream, since it will contain arrays of integers, has to go though the typical collection of its elements into a List before this list can be converted into an array (bidimensional, since it will contain arrays) which turns up to be what we wanted to obtained in the first place, being able to be assigned into our already declared variable distances to capture the final result.
Levenshtein for 1000 elements:
Java Sequential Levenshtein Sequential took 1585ms Sequential took 814ms Sequential took 781ms Sequential took 652ms Sequential took 638ms Java Parallel Levenshtein Parallel took 347ms Parallel took 206ms Parallel took 186ms Parallel took 167ms Parallel took 173ms
We can clearly see an increase in performance when using parallel streams.
We will make our List to be treated in parallel though the invocation of the Parallelizable#par method, which belongs to the Parallelizable trait, inherited by the List class. We’ve extracted this operation into a different method.
The rest of the process is exactly like the Java version, although there is a lot less noise, cause by the simpler Scala API’s. We about the collectors and the mapToInt Stream related operations, since Streams are not needed, nor exist, in Scala.
Scala for 1000 elements:
Scala Sequential Levenshtein Sequential took 621ms Sequential took 650ms Sequential took 602ms Sequential took 525ms Sequential took 682ms Scala Parallel Levenshtein Parallel took 189ms Parallel took 183ms Parallel took 221ms Parallel took 144ms Parallel took 152ms
In Scala, parallel collections proof to be way quicker also. Times are very similar to the Java version.
Part 3: As another demonstration of the differences in sequential and parallel stream processing there is a second method, processWords() for you to implement. Take the list of strings passed and process these using a sequential or parallel stream as required to generate a new list. Start by simply sorting the strings then experiment with adding things like mapping to lower or upper case, filtering out certain words (such as those beginning with a certain letter). See what impact adding distinct() to the stream has. For each of these vary the size f the input list to see how this affects the performance. Find the threshold below which sequential will give a faster answer than parallel.
In my case, I decided to do the next operations with the words in the Stream: sort the stream first, by natural order, then make all its elements to go lower case, then upper case, and then removing all the elements which start with A, M and Z or even the ones would start (none, since we made the elements upper case) with a, m or z.
The difference between having distinct() or not is very well explained on this API note:
Preserving stability for distinct() in parallel pipelines is relatively expensive (requires that the operation act as a full barrier, with substantial buffering overhead), and stability is often not needed. Using an unordered stream source (such as generate(Supplier)) or removing the ordering constraint with BaseStream.unordered() may result in significantly more efficient execution for distinct() in parallel pipelines, if the semantics of your situation permit. If consistency with encounter order is required, and you are experiencing poor performance or memory utilization with distinct() in parallel pipelines, switching to sequential execution with BaseStream.sequential() may improve performance.
In our case, since we’re asked to start ordering the strings, we can infer how using a parallel Stream won’t give us much if distinct() is present. Let’s see numbers for a 30000 elements wordList:
Java Sequential process words with Distinct Sequential took 115ms Sequential took 70ms Sequential took 70ms Sequential took 49ms Sequential took 72ms Java Parallel process words with distinct Parallel took 141ms Parallel took 22ms Parallel took 18ms Parallel took 15ms Parallel took 97ms Java Sequential process words without Distinct Sequential took 28ms Sequential took 15ms Sequential took 15ms Sequential took 16ms Sequential took 14ms Java Parallel process words without distinct Parallel took 8ms Parallel took 6ms Parallel took 6ms Parallel took 6ms Parallel took 8ms
Clearly, a parallel Stream is quicker than a sequential, but the difference is bigger when we’re not doing distinct.
By trying different sizes for the wordList, I found the threshold to be 3400 elements. Below that number, sequential implementation of the Stream will almost always be quicker than parallel, so it’s not worth making the Stream parallel unless you need to do this particular processing on collections with less than 3400 elements. At least on my hardware.
The Scala version is straightforward, really. Resuls for a 30000 elements wordList size, with and without distinct():
Scala Sequential process words with distinct Sequential took 196ms Sequential took 31ms Sequential took 17ms Sequential took 17ms Sequential took 65ms Scala Parallel process words with distinct Parallel took 44ms Parallel took 29ms Parallel took 34ms Parallel took 33ms Parallel took 25ms Scala Sequential process words without distinct Sequential took 14ms Sequential took 14ms Sequential took 55ms Sequential took 15ms Sequential took 16ms Scala Parallel process words without distinct Parallel took 22ms Parallel took 20ms Parallel took 26ms Parallel took 53ms Parallel took 80ms
As in the Java implementation, the Scala parallel implementation is quicker than the sequential.