Friday, March 11, 2016

AM: BMR Book Errata


Section 4.11.1

The `rowSums` formula should read
and the `colSums` formula should read
Note: it is a little bit confusing. Following R semantics, the rowSums() method means "row-wise sum" which is the same as "sum of columns"; and vice versa, the colSums() method means "column-wise sum", which can be computed as the sum of matrix rows.

Section 6.1

The dimensions of matrix $$\mathbf{V}$$ are $$\mathbf{V}\in\mathbb{R}^{n\times k}$$.
Formula (6.1) should read
Formula (6.2) - (6.3) should read
\boldsymbol{a} & = & \left(\mathbf{V}^{\top}\right)^{\text{-}1}\boldsymbol{a}_{pca}+\boldsymbol{\mu} \\
& = & \mathbf{V}\boldsymbol{a}_{pca}+\boldsymbol{\mu}.\label{eq:from-pca}


Section 8.3

In Step (1): Setup working directories and acquire data, the URL of the Wikipedia XML dump has since changed. The third command of Step (1) should read:

curl -o $WORK_DIR/wikixml/enwiki-latest-pages-articles.xml.bz2

We keep the Kindle edition updated with the Errata. If you have ever bought the print version from Amazon, the Kindle version is free via Amazon MatchBook. If you already have a Kindle version, you should be able just to reload it with the updated one.

Saturday, March 5, 2016

Mahout "Samsara" book prerequisites

Q: "... Any tips on prerequisite reading?"

A: To quote the preface of the book:

"Some material assumes an undergraduate level understanding of calculus and on occasion multivariate calculus. The examples are given in Scala and assume familiarity with basic Scala terminology and a minimal notion of functional programming."

Let's elaborate.

There are two sides: the technology side and the math side.

For the technology side, the reader would benefit from some Scala fluency and functional programming. Public Scala material presented on gets one a long way, plus of course there's always a longer book by Martin Odersky, et. al., "Programming in Scala".

For the math side, note that we do not explain math. We do not ask ourselves the question "Can I understand it?", but rather we ask the question "Can I try it?". A lot of practical research is working exactly just like that -- by asking ourselves the question "What if I try this or that?". So it is not so much about math but rather about math notations, which are for our purposes is pseudocode that we turn into code. Math notations are sufficiently explained in the section A.2; if you have a concrete question beyond what is there, please ask -- we will try to answer.

Beyond notations, it would help to be familiar with the undergraduate level of linear algebra and calculus (after all, we are trying to deal with applied machine learning).

For linear algebra, a good reference is the textbook by Gilbert Strang, "Introduction to Linear Algebra". Algorithms we illustrate at times rely on working knowledge of singular value decomposition (SVD), eigendecomposition, Cholesky decomposition and QR decomposition, as well as Four Fundamental subspaces.

It has been some time since I studied multivariate calculus, so I am not quite sure what the best text is on it these days. I have the "Multivariate Calculus" book by Larson Edwards, which is pretty thorough in my opinion. We do not need all of it though; our book examples touch on very few notions of multivariate calculus -- partial derivatives, gradient, Hessian matrix. As a refresher, perhaps even reading Wikipedia articles on these issues is enough.

Of course when one works on a particular algorithm, she or he needs to read the publication containing the algorithm formulation. That is one of the things that we are trying to demonstrate: how to read and implement algorithm formulations using Mahout "Samsara". We give references to the algorithm publications as appropriate, throughout the book.


Tuesday, February 23, 2016

Mahout "Samsara" Book Is Out

After many months of work, our book is finally out.

Here is our formal announcement:

We are happy to announce that Mahout Samsara is finally documented in print. The book title is “Apache Mahout: Beyond MapReduce.”

Similar to other books on computer science or languages such as R, Python, or C++, the authors are hoping to put the reader on the path of designing and creating his or her own algorithms. Also included are tutorials and practical usage information about newer Mahout algorithms.

The emphasis is on Machine Learning algorithm design aspects in the context of Apache Mahout “Samsara” releases (0.10, 0.11). If there were another suitable name for this book, it might be “Beyond the Black Box”: the discussions go beyond the scope of just the Mahout environment or Mahout algorithms and touch on more general concepts for devising algorithms in the context of massive inputs and computations, using Mahout Samsara as a semantical implementation medium for the solutions.

Mahout “Samsara” currently targets H2O, Apache Flink and Apache Spark as backend execution options. As work on the Apache Flink backend is still in progress, all examples, while being execution engine agnostic (with one intended exception), are set up to run with the Apache Spark backend.

This work has been greatly helped by valuable reviews and ideas from other Mahout committers, contributors, and industry professionals, as indicated in the “Acknowledgments” section of the preface (Thank you!).

This book does not discuss legacy MapReduce-based algorithms.

Thank you for using Mahout!

Dmitriy Lyubimov (@dlieuOfTwit)
Andrew Palumbo (@andy_palumbo)

Technical info:

There are two editions of the book: a black and white paperback and a full color Kindle textbook.

The Kindle textbook completely preserves the layout of the paperback edition. It is enrolled in the Amazon “Matchbook” program (free with the purchase of a print copy when ordered via

The paperback edition has been optimized specifically for a black-and-white print. The format is 7x10in. (a common textbook size).

Code examples are available on GitHub.



In the US a paperback-only copy can also be purchased here with a 25% off code QLZ8DLPL.

Post Scriptum 

Thank you for reading!

See also: Book prerequisitesErrata updates

Thursday, April 9, 2015

Mahout 0.10.x: first Mahout release as a programming environment

Mahout 0.10.x is coming. It is a new generation of Mahout that entirely re-thinks philosophy of previous Mahout line of releases. 

First, and foremost, to state the obvious, Mahout abandons MapReduce based algorithms. We still support them (and they are now moved to mahout-mr artifact). But we do not create any new ones. 

Second, we shift the focus from being a collection of things to being a programming environment, as it stands, for Scala.

Third, we support different distributed engines, aka "backs". As it stands, we run on Spark and H20, but hope to add Apache Flink translation as well, provided we get help there.

Many people are coming with obvious question: so how it is different from stuff like MLLib? Here, I'll try to emphasize key philosophical points of this work. 

1. We want to help people to create their own math, as opposed to just taking and plugging in an off-the-shelf black-box solution. 

Indeed, having a fairly well-adjusted and complicated black box solutions is a good thing. People spent quite a bit of time researching and optimizing them. 

Problem is, very often off-the-shelf is just not enough. We want to customize and build our own models, our own features, our own specific rules and ensembles. And we want to do it quickly and try it now.

Let's look at some intentionally simple examples. Suppose we want to compute a column-wise variances on a matrix. For that we are going to use simple formula


applied column-wise on our big, distributed matrix $\mathbf{X}$. Here is how it is going to look in new Mahout enviornment: 

val mu = X colMeans
val variance = (X * X colMeans) -= mu * mu

That's it. All in fully distributed fashion. On Spark or H20. 

Let's take on a little bit more complicated case. What if we want to compute n-dimensional column-wise covariance matrix of the dataset X? 

Assuming that


i.e. every row is a point in the dataset, we can use multivariate generalization of the previous formula:




Here is the Mahout code for that formula: 

val mu = X.colMeans()
val mxCov = (X.t %*% X).collect /= X.nrow -= (mu cross mu)
This code is actually a so called "thin" procedure, i.e. the one that assumes that while the  $m\times n$ input matrix $\mathbf{X}$ is too big to be an in-core matrix, the $n\times n$ covariance matrix   will fit into memory in one chunk. In other words, it assumes $n\ll m$. The code for "wide" distributed covariance computation needs couple more lines but is just as readable.

So, what is the difference between Mahout 0.10.x and MLLib? Well, what is the difference between R package and R? What is the difference between libsvm and Julia? What is difference between Weka and Python? And how to think about it -- R for Spark? Hive for math? Choose your poison.

Bottom line, we want to help you create your own math at scale. (And hopefully, come back to us and share it with us).

2. We want to simplify dialect learning. We actually build the environment in image of R. Initially, we also had DSL dialect (enabled via separate import) for Matlab, but unfortunately Scala operator limits do not make it possible to implement entire Matlab operator set verbatim, so this work received much less attention, and instead, we focused on R side of things.

What it means is it should be easily readable by R programmers. E.g. A %*% B is matrix multiplication, A * B is elementwise Hadamard product, methods like colMeans, colSums are following R naming for those.

Among other things, arguably, math written in R-like fashion is easier to understand and maintain than the same things written in other basic procedural or functional environments.

3. Environment is backend-agnostic. Indeed, Mahout is not positioning itself as Spark-specific. You can think of it that way if you use Spark, but if you use H20, you could think of it as H2o-specifc (or, hopefully, "Apache Flink-specifc" in the future) just as easily.

All the examples above are not containing a single Spark (or h20) imported dependency. They are written once but run on any of supported backs.

Not all things can be written with the subset of back-independent techniques of course, more on that below. But quite a few can -- and the majority can leverage at least some as the backbone. E.g. imagine that dataset $\mathbf{X}$ above is result of embarrassingly parallel statistical Monte Carlo technique (which is also backend-independent). And just like that we perhaps get a backend-agnostic Gibbs sampler.

4. "Vs. mllib" is a false dilemma. Mahout is not taking away any capabilities of the backend. Instead, one can think of it as an "add-on" over e.g. Spark and all its technologies. (same for h2o).

Indeed, it is quite common to expect that just Algebra and stats are not enough to make the ends meet. In Spark's case, one can embed algebraic pipelines by importing Spark-specific capabilities. Import MLLib and all the goodies are here. Import GraphX and all the goodies are here as well. Import DataFrame(or SchemaRDD) and use language-integrated QL. And so on.

I personally use one mllib method as a part of my program along with Mahout algebra. On the same Spark context session as the Mahout Algebra session. No problem here at all.

5. Mahout environment is more than just a Scala DSL. Behind the scenes, there is an algebraic optimizer that creates logical and physical execution plans, just like in a database.

Algebraic Optimizer Stack

For example, we can understand that a thing like 

dlog(X * X + 1).t %*% dlog(X * X + 1) 

 is in fact (in simplified physical operator pseudo code)

self_square( => log(x * x + 1)).

6. Off-the-shelf methods. Of course we want to provide off-the-shelf experience too. Logical thing here is to go with stuff that complements, but not repeats other off-the-shelf things available. Refer to the Mahout site what's in and what's in progress.

But even use of off-the-shelf methods should be easily composable via programming environment and scripting.

Future plans. Work is by no means complete. In fact, it probably just started.

In more or less near future we plan to extend statistical side of things of the environment quite a bit.

We plan to work on performance issues, especially for in-core performance of the algebraic API.

We want to port most of MR stuff to run in the new environment, preferably in a completely backend-agnostic way (and where it is not plausible, we would isolate and add strategies for individual backs for the pieces that are not agnostic).

We plan to add some must-have off-the-shelf techniques such as hyperparameter search.

And of course, there are always bugs and bug fixes.

Longer term, we think it'd be awesome to add some visualization techniques similar to ggplot2 in R, but we don't have capacity to develop anything similar from the ground up, and we are not yet sure what might be integrated in easily and which also works well with JVM based code. Any ideas are appreciated.

Thursday, July 25, 2013

Scala DSL for Mahout in-core linear algebra (proposal)

I've been working on a Scala DSL for Mahout in-core linear algebra.

the code is here:

 Here's what i got so far

Current In-Core operations (examples)

Inline initialization

dense vectors 

val denseVec1: Vector = (1.0, 1.1, 1.2)
val denseVec2 = dvec(1, 0, 1.1, 1.2)

sparse vectors 

val sparseVec = svec((5 -> 1) :: (10 -> 2.0) :: Nil)
val sparseVec2: Vector = (5 -> 1.0) :: (10 -> 2.0) :: Nil

matrix inline initialization, either dense or sparse, is always row-wise:

dense matrices  : 

val a = dense((1, 2, 3), (3, 4, 5))

sparse matrices 

   val a = sparse(
      (1, 3) :: Nil,
      (0, 2) :: (1, 2.5) :: Nil

diagonal matrix with constant diagonal elements

diag(10, 3.5) 

diagonal matrix with main diagonal backed by a vector 

diagv((1, 2, 3, 4, 5))

Identity matrix 


element wise operations 

geting vector element 

val d = vec(5)

setting vector element 

vec(5) = 3.0

getting matrix element 

val d = m(3,5)

setting matrix element (setQuick() behind the scenes) 

m(3,5) = 3.0

Getting matrix row or column

val rowVec = m(3, ::) 
val colVec = m(::, 3) 

Setting matrix row or column 

m(3, ::) = (1, 2, 3)
m(::, 3) = (1, 2, 3)

thru vector assignment also works 

m(3, ::) := (1, 2, 3)
m(::, 3) := (1, 2, 3)

subslices of row or vector work too 

a(0, 0 to 1) = (3, 5)

or with vector assignment

a(0, 0 to 1) := (3, 5)

matrix contiguous block as matrix, with assignment 

// block
val b = a(2 to 3, 3 to 4) 

// asignment to a block
a(0 to 1, 1 to 2) = dense((3, 2), (2, 3))

or thru the matrix assignment operator 

a(0 to 1, 1 to 2) := dense((3, 2), (2, 3))

Assignment operator by copying between vectors or matrix 

vec1 := vec2 
m1 := m2 

also works for matrix subindexing notations as per above

Assignment thru a function literal (matrix)

m := ((row, col, x) => if (row == col) 1 else 0)

for a vector the same

vec := ((index, x) => sqrt(x))

BLAS-like ops

plus/minus, either vector or matrix or numeric, with assignment or not

a + b
a - b

a + 5.0
a - 5.0

Hadamard (elementwise) product, either vector or matrix or numeric operands

a * b
a * 5

same things with assignment, matrix, vector or numeric operands 

a += b
a -= b

a += 5.0
a -= 5.0 

a *= b
a *= 5

One nuance here is associativity rules in scala. E.g. 1/x in R (where x is vector or matrix) is elementwise inverse operation and in scala realm would be expressed as

val xInv = 1 /: x

and R's 5.0 - x would be

val x1 = 5.0 -: x

Even trickier and really probably not so obvious stuff :

a -=: b

assigns a - b to b (in-place) and returns b. Similarly for a /=: b or 1 /=: v.

(all assignment operations, including :=,  return the assignee argument just like in C++)

dot product (vector operands) 

a dot b 

matrix /vector equivalency (or non-equivalency). Dangerous, exact equivalence is rarely useful, better use norm comparisons with admission of small errors 

a === b
a !== b

Matrix multiplication (matrix operands)

a %*% b

for matrices that explicitly support optimized right and left muliply (currently, diagonal matrices)

right-multiply (for symmetry, in fact same as %*%) 

diag(5,5)  :%*% b 

optimized left multiply with a diagonal matrix:

a %*%: diag(5,5) # i.e. same as (diag(5,5) :%*% a.t) t

Second norm, vector or matrix argument:



val mt = m.t

Decompositions(matrix arguments)

Cholesky decompositon (as an object of a CholeskyDecomposition class with all its operations)

val ch = chol(m)


val (u, v, s) = svd(m)


val (v, d) = eigen(m)


val (q, r) = qr(m)

Check for rank deficiency (runs rank-revealing QR)



vector cardinality 


matrix cardinality 



a copy-by-value (vector or matrix ) 

val b = a cloned 


This one the people who are accustomed to writing R linear algebra will probably find quite easy to fall into. R has a nice property, a copy-on-write, so all variables actually appear to act as no-side-effects scalar-like values and all assignment appear to be by value. Since scala always assigns by reference (except for AnyVal types which are assigned by value), it is easy to fall prey to the following side effect modifications

val m1 = m 

m1 += 5.0 // modifies m as well

A fix is as follows: 

val m1 = m cloned 
m1 += 5.0 // now m is intact

Putting it all together: In-core SSVD with power iterations. 

This is a port of my R-based SSVD prototype. To give you an idea, here is how it is going to look in Scala DSL for Mahout (not quite tuned for performance yet, but enough to give you an idea):

   * In-core SSVD algorithm.
   * @param a input matrix A
   * @param k request SSVD rank
   * @param p oversampling parameter
   * @param q number of power iterations
   * @return (U,V,s)
  def ssvd(a: Matrix, k: Int, p: Int = 15, q: Int = 0) = {
    val m = a.nrow
    val n = a.ncol
    if (k > min(m, n))
      throw new IllegalArgumentException(
        "k cannot be greater than smaller of m,n")
    val pfxed = min(p, min(m, n) - k)

    // actual decomposition rank
    val r = k + pfxed

    // we actually fill the random matrix here
    // just like in our R prototype, although technically
    // that would not be necessary if we implemented specific random
    // matrix view. But ok, this should do for now.
    // it is actually the distributed version we are after -- we
    // certainly would try to be efficient there.

    val rnd = new Random()
    val omega = new DenseMatrix(n, r) := ((r, c, v) => rnd.nextGaussian)

    var y = a %*% omega
    var yty = y.t %*% y
    val at = a.t
    var ch = chol(yty)
    var bt = ch.solveRight(at %*% y)

    // power iterations
    for (i <- 0 until q) {
      y = a %*% bt
      yty = y.t %*% y
      ch = chol(yty)
      bt = ch.solveRight(at %*% y)

    val bbt = bt.t %*% bt
    val (uhat, d) = eigen(bbt)

    val s = d.sqrt
    val u = ch.solveRight(y) %*% uhat
    val v = bt %*% (uhat %*%: diagv(1 /: s))

    (u(::, 0 until k), v(::, 0 until k), s(0 until k))


And a very-very naive test for it:

  test("SSVD") {

    val a = dense((1, 2, 3), (3, 4, 5))

    val (u, v, s) = ssvd(a, 2, q=1)

    printf("U:\n%s\n", u)
    printf("V:\n%s\n", v)
    printf("Sigma:\n%s\n", s)

    val aBar = u %*% diagv(s) %*% v.t

    val amab = a - aBar

    printf("A-USV'=\n%s\n", amab)

    assert(amab.norm < 1e-10)



A follow-up

Many people have come and asked, "why not Matlab syntax? in particular, why not reserve single-symbol '*' for matrix-matrix multiply?". There are some thoughts:

  • There's no preference to use either syntax. The only preference is to use some of the existing grammars to make things look familiar. R or Matlab-like syntax makes no difference to me in this case, and I think the adoption-wise they are about the same as well. So we could actually create support for both, that much is easy. We'd have to basically call additionally "import RLikeOps._" or "import MatlabLikeOps._" to enable one or the other. But, see further. 
  • Element-wise *,+,-,/ are defined on the same geometry and follow the same pattern. They also all define in-place operators which are very popular since they don't involve creation and cloning of a new matrix as a result ("op"= stuff).  Matrix-matrix multiplication seems to be a special case. In that sense, R approach looks more consistent to me. 
  • Matlab elementwise operation syntax ".*" seems to run into problems with Scala. Scala does not suport '.' symbol as part of operators. Indeed, that would have created ambiguities in Scala.
  • In-place elementwise operations would look really more complicated then. I.e. they would have to be '.*=', '.+=' etc. But there would be no such simple looking thing as *=  or /= (since matrix-matrix in-place multiplication is not possible due to cardinality mismatch in the result). 
  • The rabbit hole goes even further when we consider right-associative operations such as elementwise in-place inversion 1/x (R style), which is pretty popular with me too. That thing would have to look like '1 ./=: x ' in this case which is even more cryptic. '1 /= x' or its in-place version '1 /=: x' looks much more bearable to common sense and scala styling.
  • %*%, like many things in R, looks quirky to me, true. But in many ways, matrix-matrix multiplication is also "quirky" and, most notably, computationally heavy. It is asymptotically significantly more expensive and always requires a new matrix to be created (no in-place opportunities here). So, subjectively, asking people to type this pretty inconvenient %*% thing also reflects that heavy nature of the operation, and IMO puts a subconscious warning -- "are you sure this is the best route?  This is expensive!". 

Friday, July 13, 2012

#Mahout 0.7 has more usable PCA workflow now

Now that Mahout 0.7 has been released, it is probably appropriate to say a word with respect to my contribution, MAHOUT-817, that enables (afaik, for the first time) for Mahout to have a PCA workflow. (Well, in all fairness, I wasn't alone contributor there. Many thanks to Raphael Cendrillon for contributing the mean computation map-reduce step).

It is based on SSVD method that computes mathematical SVD in a distributed way. Generally, there's no problem using brute-force PCA approach that involves mean subtraction and then running SVD on the result.

The only time when the wrench is thrown into that engine is when original PCA input matrix is sufficiently sparse. Making mean subtraction is going to transform input matrix into a dense matrix, potentially increasing number of non-zero elements (and hence, flops needed to complete matrix operations during SVD computation) many times.

As it is true for many Mahout methods, SSVD actually takes a big view of the world with all its details and produces a more condensed analytic view of the world. So the natural thinking to deal with that is, ok, can we just run an SVD first on the sparse input and fix the smaller condensed view at its smallest incarnations later so that creates summary equivalent to that with mean subtraction?

Here is the quick write-up that shows how the math for that neat trick has worked out in Mahout.

Also, a high-level overview and practical usage of this approach is given :

Thursday, September 8, 2011

#MAHOUT -796 power iterations doc here

Here it goes, so i don't lose it. Also attached to the issue.

Still, more work to come in MAHOUT-797, although not tremendously a lot...