Anti-Phishing Measures in Tellody

(Cross-posting of https://www.tellody.com/blog/166-anti-phishing-measures-in-tellody)

Sending bulk emails is a risky business because, on the one hand, for Tellody that is running it, it’s prone to be abused by scammers and, on the other, for our users, it is easy to create emails in good faith but inadvertently contain indicators that will taint them.

Protecting ourselves and our users is an open-ended issue without easy solutions, but we decided that something is better than nothing and some action is better than no action. We tried to implement some kind of spoofing detection, as a first step. In this article, I’ll briefly explain what we have implemented, what you have to look out for and how you can fix your email if it gets flagged. Since we target phishers, the component is called “CoastGuard”.

I will use the following sample screen that contains all current notifications, as my reference.

phishing-indicators-2

 

First of all, you should not mix scripts in your words. Mixing scripts is a prime indicator of spoofing because words can appear as meaning one thing without actually writing them the way they are supposed to be. For example, the “Spoofed” notification appears to refer to the word “Paypal” but it mixes two scripts, “Basic Latin” (what “Paypal” is supposed to be written in) and “Greek and Coptic”. In this case it’s the uppercase Greek Rho, indistinguishable in all manners from the Latin uppercase P, that triggers this notification. There’s no legitimate reason to use more than one script within a word, and there are plenty of illegitimate ones so, if you find yourself needing to use a particularly clever brand name, that you just thought of, which uses more than one script, Tellody will not allow you to send it in an email. Get in touch if this is something essential to you.

Second, Punycodes have been shown to present the same spoofing opportunities, with little added value. The Punycode that is rejected above looks the same as https://www.epic.com but is not the same URL. Tellody rejects punycodes altogether, at least until major browsers start presenting adequate warnings about them. Again, if using a punycode is essential to you, get in touch with us.

The third category of spoofing indicator is when we think that someone tries to masquerade link text to make it appear as if it’s a URL. Maybe not many people would be fooled by “http:\\”, but would you be fooled by “http:⁄⁄”, containing a character that looks a lot like a forward slash? I know I would.

The next two indicators pertain to using the plethora of Unicode characters to make a string look as if it’s contiguous but is actually broken by characters with no width or scarcely any width. Adding such a pseudo-space character can create a string that looks to the eye like “Viagra” but fails to match a filter that looks for an exact match.

Finally, the last indicator is shown when the text of a link is something that looks as if it is a URL but it is different than the actual link target. A URL that appears to go to http://www.apple.com, while it goes to some shady site, is rejected by Tellody. Note that “appears” means how we think the text appears to a human and, the more tricks we detect to make something look like a URL while failing the straightforward syntactic check for one, the worse off we deem it to be.

This is a first version of those checks and they might evolve over time. If you find yourself having a legitimate business need that is hampered, please get in touch with info@tellody.com to discuss it.

Gender Detection, Vocatives and Namedays for Greek First Names in Tellody

(Cross posting of https://www.tellody.com/blog/164-gender-detection-and-vocatives-for-greek-first-names-in-tellody)

When Tellody was selected by Wind to run the MarketApp service, a big market opened in Greece. Even though Greek is one of the supported UI languages (and currently only EN and EL are usable in the language selector), we felt that the Greek audience deserved some more functionality, enabled by the inherent structure of the Greek language.

You might have noticed that, when the UI language is Greek, you get two new buttons for “message tags”. These enable you to use vocatives for the first and last names.vocative_buttons

We made a decision that we’d limit ourselves to traditional birth names and usual diminutives. That ruled out names of foreign origin like “Μισέλ” or “Νάταλι”. First of all, these names cannot be inflected, so there can be no separate vocative form for them. Secondly, we cannot use them unambiguously for detecting gender. Both previous examples can be male or female names.

This leaves baptismal names which are of ancient origin (ancient Greek names or names from the Old Testament) and modern ones. Both detecting the gender and forming the vocative are operations on the structure of the name, which we’ve implemented in Tellody. Because of that, we can handle a lot more names than what is in the name list we are using for matching names against, which we are currently using to calculate the nameday for the corresponding names.

We test for phonetic equality, so we’ll match various phonemes, no matter how they are spelled. However, we handle diacritic marks in the same phonetic way. We’ll handle omission of the accent mark (Unicode Character ‘GREEK TONOS’ U+0384) and unneeded uses of diaeresis (Unicode Character ‘COMBINING DIAERESIS’ U+0308) that don’t affect syllabication and enunciation, but we won’t match, for example, “Χάιδω” (the first syllable of which is a diphthong of “α” and “ι”) and “Χαιδω” or “ΧΑΙΔΩ”, in uppercase, (the first syllable of which is the composite vowel “αι”, same phoneme as “ε”). However, “Χαϊδω”, having two first syllables heard as “α” and “ι”, will be treated as phonetically equal to the diphthong “άι”, since both consist of the same phonemes.

Namedays come in five flavors:

  1. Fixed date
  2. Easter ± (n)
  3. If Easter is before April 23rd then (Fixed date) else Easter ± (n)
  4. First Sunday on or after December 11th
  5. First Sunday on or after February 13th

Tellody, at the time of this writing, includes data for approximately 1800 name forms for 1250 male and female names, with 800 of them having nameday data.

The Universe conspired to bring me precompiled binaries for Idris

Installing it with cabal on Windows proved impossible. A SuSe VM on VirtualBox couldn’t cut it either. Actually installing MinGW, which was the recommended way on WIndows, didn’t give me Idris, either. But still, I wanted it. And, sure enough, as Paulo Coelho predicted, the Universe conspired to help me get it, and one can now download precompiled binaries for Windows!

Now that I had a real dependently-typed programming language, with realistic-looking tutorials and documentation, I thought it was a great excuse to implement type-safe dimensional calculations in it. The resulting code is so terse, compared to the Scala example, it almost feels like cheating. In the beginning, is the dependent type itself.

data DimVal : Int -> Int -> Int -> Int -> Type where
    DV :  (m:Int) -> (k:Int) -> (s:Int) -> (a:Int) -> (v:Float) -> DimVal m k s a

Please compare it with the Scala (and C++) equivalent. The similarities are obvious but this version completely blends the two levels together. DimVal is a type constructor indexed by four integers and DV is its sole constructor.

Calculating with these types needs the usual arithmetic operations. Unfortunately, trying to overload the operators did not work for me, as they were imported already by the Num class in the Prelude.

||| The Num class defines basic numerical arithmetic.
class Num a where
(+) : a -> a -> a
(-) : a -> a -> a
(*) : a -> a -> a
||| Absolute value
abs : a -> a
||| Conversion from Integer.
fromInteger : Integer -> a

DimVal cannot be made an instance of Num, because types need to be closed under all three operations, and this cannot be the case, since the point is to represent each dimension vector by a different type. So I faked four appropriate operators.

infixr 10 /+/
(/+/) : DimVal m k s a -> DimVal m k s a -> DimVal m k s a
(DV m k s a v1) /+/ (DV m k s a v2) = DV m k s a (v1+v2)

infixr 10 /-/
(/-/) : DimVal m k s a -> DimVal m k s a -> DimVal m k s a
(DV m k s a v1) /-/ (DV m k s a v2) = DV m k s a (v1-v2)

infixr 20 /*/
(/*/) : DimVal m1 k1 s1 a1 -> DimVal m2 k2 s2 a2 -> DimVal (m1+m2) (k1+k2) (s1+s2) (a1+a2)
(DV m1 k1 s1 a1 v1) /*/ (DV m2 k2 s2 a2 v2) = DV (m1+m2) (k1+k2) (s1+s2) (a1+a2) (v1*v2)

infixr 20 ///
(///) : DimVal m1 k1 s1 a1 -> DimVal m2 k2 s2 a2 -> DimVal (m1-m2) (k1-k2) (s1-s2) (a1-a2)
(DV m1 k1 s1 a1 v1) /// (DV m2 k2 s2 a2 v2) = DV (m1-m2) (k1-k2) (s1-s2) (a1-a2) (v1/v2)

Next up, printing dimensioned values. For this, I used an intermediate Dimension type.

data Dimension : Type where
    Dim : (m:Int) -> (k:Int) -> (s:Int) -> (a:Int) -> Dimension    

unitToStr : Dimension -> String
unitToStr (Dim m k s a) =
    case (m, k, s, a) of
        (0,0,1,1) => "C"
        (1,1,-2,0) => "N"
        (0,0,-1,0) => "Hz"
        (-1,1,-2,0) => "Pa"
        (2,1,-3,-2) => "ohm"
        (2,1,-3,-1) => "V"
        (-2,-1,3,2) => "S"
        (2,1,-3,0) => "W"
        (2,1,-2,0) => "J"
        (-2,-1,4,2) => "F"
        (2,1,-2,-2) => "H"
        (2,1,-2,-1) => "Wb"
        (0,1,-2,-1) => "T"
        _ => (if m /= 0 then "m" ++ (if m /= 1 then (show m) else "") else "") ++
            (if (m /= 0) && (k /= 0) then "." else "") ++
            (if k /= 0 then "Kg" ++ (if k /= 1 then (show k) else "") else "") ++
            (if (abs(m) + abs(k)) /= 0 && s /= 0 then "." else "") ++
            (if s /= 0 then "s" ++ (if s /= 1 then (show s) else "") else "") ++
            (if (abs(m) + abs(k) + abs(s)) /= 0 && a /= 0 then "." else "") ++
            (if a /= 0 then "A" ++ (if a /= 1 then (show a) else "") else "")
            
toStr : DimVal m k s a -> String
toStr (DV m k s a v) = (show v) ++ " " ++ (unitToStr (Dim m k s a))
    
instance Show (DimVal m k s a) where
    show dv = toStr dv

The Num class was out of the question, but Eq and Ord are readily instantiable.

dimValEq : DimVal m k s a -> DimVal m k s a -> Bool
dimValEq (DV m k s a v1) (DV m k s a v2) = (v1==v2)
    
dimValComp : DimVal m k s a -> DimVal m k s a -> Ordering
dimValComp (DV m k s a v1) (DV m k s a v2) = compare v1 v2

instance Eq (DimVal m k s a) where
    dv1 == dv2 = dimValEq dv1 dv2

instance Ord (DimVal m k s a) where
    compare dv1 dv2 = dimValComp dv1 dv2

Idris has syntax support for creating DSLs so, a native Idris speaker would probably go on to create a better DSL for creating values, arithmetic etc. I’ll just show how type-safe calculations can naturally be encoded in plain Idris. I did not go into the trouble of defining all named physical types but I guess you see how this can be done without any problem.

Mass : Type
Mass = DimVal 1 0 0 0

mass : Float -> Mass
mass = DV 1 0 0 0

main : IO ()
main = 
    let x = DV 1 1 (-2) 0 14.5 in
    let y = DV 1 1 (-2) 0 15.5 in
    let z = DV 2 2 0 0 15.5 in
    let xpy = x /+/ y in
    let xmy = x /-/ y in
    let xtz = x /*/ z in
    let xdz = x /// z in
    let m : Mass = mass 12.6 in
        do
            putStrLn (show m)
            putStrLn (show x)
            putStrLn (show y)
            putStrLn (show z)
            putStrLn (show (x == x))
            putStrLn (show (x == y))
--            putStrLn ((show (y == z))) -- Doesn't even compile!
            putStrLn (show (x < y))
--            putStrLn ((show (z < y))) -- Doesn't even compile!
            putStrLn (show xtz)
            putStrLn (show xdz)

It does feel like cheating, doesn’t it? Calculating with physical types is actually done at two levels: the values themselves are operated upon as if they were dimensionless, while the dimensions are vectors representing exponents and are operated upon using exponent arithmetic. Idris handles both levels in the same way, very naturally. It is said that all high-level languages are equivalent in the computations they can express, but their compilers are certainly not equivalent.

My Frugal Summer of Code, part 2

Here I am, continuing my foray into the realm of array programming, that started in my previous post. Let me start with a disclaimer: I do not claim that what you read here is good, idiomatic J.This is not a J tutorial. I mean it as an exposé of the notions of array programming and how they can enhance the worldview of the traditional programmer. I’d love it if it could also be a bridge between mainstream FP and Array Programming, but I’m afraid I’m not sufficiently proficient in either to do it.

Remember the last five lines of code? I’m going to dissect them now for your benefit.

NB. Locations (mask bits) where digits have lines 
digitmasks =. '         ' ~:"(1) (,"2 digits)

Laid out, this means: concatenate all lines in each digit (keeping them in three lines serves only for displaying them). This is done by , (called append), applied to cells of digits of rank 2 (the "2 modifier, called an conjunction in J). Then, compare lines (cells of rank 1, the "1 thing) of that, using the “not equals” comparator (the ~:) to a literal with 9 spaces.

This gives you a bit (an 1) whenever there’s anything than a space. I can do that because the digits, as defined, when they share a graphic element, it’s always the same in the same location. So I’m getting rid of some excessive information to make my life easier. So, digitmasks is the following.

0 1 0 1 0 1 1 1 1
0 0 0 0 0 1 0 0 1
0 1 0 0 1 1 1 1 0
0 1 0 0 1 1 0 1 1
0 0 0 1 1 1 0 0 1
0 1 0 1 1 0 0 1 1
0 1 0 1 1 0 1 1 1
0 1 0 0 0 1 0 0 1
0 1 0 1 1 1 1 1 1
0 1 0 1 1 1 0 1 1
NB. Locations where digits share mask bits
maskcompmatrix =. (-"1/~ digitmasks) ="0 ] 0

Moving on, first, I make a table where, I subtract each line (i.e. digit bitmask) from each other. The -"1/~ digitmasks thing is shorthand for digitmasks -"1/ digitmasks, i.e. tabulate (the / adverb – another kind of higher-level construct, like conjunctions – for a dyadic operation) the operation “subtract each line” (the -"1). Afterwards, the whole thing is compared, bit by bit (the ="0) to zero. The result is a 10 x 10 x 9 bitmask array. Are you getting the hang of it?

NB. Count of similar mask bits between digits
diffmatrix =. +/"1 maskcompmatrix

Further, I sum: insert, using the / adverb, the operation “plus” (the + verb) at the level of each line (the "1 conjunction) on maskcompmatrix. This gives a 10 x 10 array of the number of same bits in each bitmask (digit).

9 5 6 6 5 6 7 6 8 7
5 9 4 6 7 4 3 8 4 5
6 4 9 7 4 5 6 5 7 6
6 6 7 9 6 7 6 7 7 8
5 7 4 6 9 6 5 6 6 7
6 4 5 7 6 9 8 5 7 8
7 3 6 6 5 8 9 4 8 7
6 8 5 7 6 5 4 9 5 6
8 4 7 7 6 7 8 5 9 8
7 5 6 8 7 8 7 6 8 9

Now comes the easy part.

NB. Outside the diagonal: Digits that share 8 out of 9 mask bits
offbyone =. diffmatrix ="(0) 8

I make a 10 x 10 array of selectors (bits) when digits share 8 out of 9 bits.

offbyones =. (monad : '< y # i. 10')"1 offbyone

Then, I make a vector (list, in J) of which digits share 8 bits for each digit.

┌─┬─┬┬─┬┬───┬───┬─┬─────┬─────┐
│8│7││9││6 9│5 8│1│0 6 9│3 5 8│
└─┴─┴┴─┴┴───┴───┴─┴─────┴─────┘

I wanted to spare you (and me!) the trouble of bringing up “boxing” but, since the shape is irregular, it was really the easiest solution. Boxing (conveniently displayed with actual boxes) creates an atomic value of anything you put into it.

And, in reality, I found out that it makes calculations simpler if the digits themselves are included, so I really need the following.

NB. Digits that share 8 or 9 (i.e. themselves) mask bits
sameandoffbyone =. diffmatrix >:"(0) 8

sameandoffbyones =. (monad : '< y # i. 10')"1 sameandoffbyone
┌───┬───┬─┬───┬─┬─────┬─────┬───┬───────┬───────┐
│0 8│1 7│2│3 9│4│5 6 9│5 6 8│1 7│0 6 8 9│3 5 8 9│
└───┴───┴─┴───┴─┴─────┴─────┴───┴───────┴───────┘

This was sort of easy, wasn’t it? I’m afraid the rest will be a little more complicated. I’ll throw it all in one glob, and focus on the big picture first. The operation I’m trying to achieve, is fix broken account numbers by trying similar digits, where similar is defined by off-by-one bit.

NB. fold operation to produce the combinations
foldop =. dyad : ',/ <"1 (>y) ,"1/ (,. >x)'
identity =. <''

NB. Use if checksum is <> 0: try combinations with digits off by one mask bit and see which (if any) produce checksum of 0
alternatives =. monad define
 alt =. >"0 foldop/ identity ,~ |. y { sameandoffbyones
 ((checksum"1 = 0:) alt) # alt
)

There’s a fold taking place, which is a common thing in FP. The first complication is that, in J, for reasons of uniformity with its right-to-left evaluation order, when you fold a list (remember, a list is a vector of something) you also do it right-to-left. That’s why I reverse (the |. thing) the result of indexing (the {) sameandoffbyones by the account digits (y stands for the argument in a monadic verb). Then I append the identity element, which is a boxed empty list. Folding takes place, and I’ll explain the operation folded next but, for now, let’s stick to what the fold produces, which is a list of boxed candidate account numbers, which are unboxed (the >"0) and then checked with checksum (the checksum"1 = 0: is actually a composite verb!) and only those with zero checksum are kept (operation copy, the #).

Example usage:

alternatives 2 4 4 4 1 2 5 6 7
2 4 4 4 1 2 5 6 1
2 4 4 4 1 2 9 8 7
2 4 4 4 7 2 6 6 1

However, don’t try it on 7 2 3 4 5 6 7 8 9. It produces 109 possible alternatives! In retrospect, this error correction policy is not very robust or, rather, the checksum used in not suitable to how digits can be misread.

Let’s go back to the folded operation now. I won’t go into it symbol-by-symbol but here’s an English description. One operand holds a list of boxes with partial account numbers, and the result is also a list of boxes with longer partial account numbers, formed by appending the initial partial account numbers with all alternatives for the current digit, held by the other operand as a boxed list.

Have I picked your curiosity until now? Have I succeeded in presenting J as more than a curiosity itself? There is real power in its notions and it goes beyond its conciseness. Imagine all these operations using standard Scala, for example. “It’s easy if you try.” I think it is a useful thing to add to standard FP, which lacks the language to talk about arrays at the level that array programming languages do. I have not even found a data type to represent arbitrary-ranked arrays, let alone how to represent ranked operations and implicit loops and stuff. But I think it can be done, and I’m setting it as a wanna-do-it for the future.

My Frugal Summer of Code

The title being a wordplay on “Google Summer of Code” does not mean that it is not entirely true, as there is hardly any time during vacationing with two little children to write much code. And I admit that most of what you’ll see in this post was written pre- and post-vacations. But I did read a lot of stuff on array programming, APL, J, papers on array functional programming. I thought really hard (but vacation-grade hard) about how some of these things can be grafted on a more conventional functional language. And I did make an honest effort to find a Java or Scala library for dense multidimensional arrays that could provide a basic substrate for such an effort. If the job, Flying Donut or life, in general, let me, you’ll hear more in another post.

The current post, however, will use J to tackle a code kata I found, which I thought would be ideal for some array-fu.

Assume you have the three lines that constitute an account number. The first thing you should do is separate the nine “digits”. Conveniently, the same functionality can be used to generate some data useful in the sequel, so let’s start by separating the ten digits 0 – 9.

alldigits =. ] ;._2 (0 : 0)
 _     _  _     _  _  _  _  _
| |  | _| _||_||_ |_   ||_||_|
|_|  ||_  _|  | _||_|  ||_| _|
)

NB. Break in 3x3 cells
chop =. monad : ',/ (3 3 ,: 3 3) ];._3 y'

digits =. chop alldigits

I know, most of it looks like a cartoon character swearing. What you see in the above is a handy way of writing multi-line data and a handy higher-level operation to apply an operation to subarrays of an operand. Let’s chop a sample account number.

NB. A number with no wrong digits
num =. ] ;._2 (0 : 0)
    _  _     _  _  _  _  _ 
  | _| _||_||_ |_   ||_||_|
  ||_  _|  | _||_|  ||_| _|
)
 
nums =. chop num
<"_1 nums

The resulting “digits” are “boxed” so that they are also displayed as boxed.

┌───┬───┬───┬───┬───┬───┬───┬───┬───┐
│   │ _ │ _ │   │ _ │ _ │ _ │ _ │ _ │
│  |│ _|│ _|│|_|│|_ │|_ │  |│|_|│|_|│
│  |│|_ │ _|│  |│ _|│|_|│  |│|_|│ _|│
└───┴───┴───┴───┴───┴───┴───┴───┴───┘

And the magic continues. Let’s write lots of nested loops to recognize the corresponding digits.

NB. Results in character list
digitize =. digits&i.

Ok, I was kidding. Here’s what digitize nums gives you: the vector 1 2 3 4 5 6 7 8 9. Transforming it to a human readable string is equally ridiculously simple.

NB. Takes digitized list, results in character list 
show =. {&'0123456789?'

Non recognized digits result in a past-the-end index, hence the ‘?’ after the last valid digit.

NB. A number with one wrong digit
wrongnum =. ] ;._2 (0 : 0)
_   _  _     _  _  _  _  _ 
  | _| _||_||_ |_   ||_||_|
  ||_  _|  | _||_|  ||_| _|
)  
wrongnums =. chop wrongnum
show digitize wrongnums

The output is ?23456789.

Surely calculating the checksum should be a little more difficult.

NB. (to be used only on correctly parsed numbers)
checksum =. monad : '11 | (1+i. 9) (+/ . *) |. y'

Well, it is a little more difficult, truth be told. But I have to go to bed, so I’ll break off at this point, but only after I compute some data for the next step, which is to try digits to similar ones in order to produce candidate account numbers to recover from checksum errors.

NB. Locations (mask bits) where digits have lines 
digitmasks =. '         ' ~:"(1) (,"2 digits)

NB. Locations where digits share mask bits
maskcompmatrix =. (-"1/~ digitmasks) ="0 ] 0

NB. Count of similar mask bits between digits
diffmatrix =. +/"1 maskcompmatrix

NB. Outside the diagonal: Digits that share 8 out of 9 mask bits
offbyone =. diffmatrix ="(0) 8

offbyones =. (monad : '< y # i. 10')"1 offbyone

I guess this suddenly strikes you as too much code, doesn’t it? Five whole lines! It’s only one line shorter than the code to define a property in Java, for crying out loud!

I’ll get into the details next time!

TO BE CONTINUED

As simple as possible, but not simpler

The simplicity being alluded to in the title, is not some aesthetic measure of code brevity inspired by Martin Odersky’s keynote in ScalaDays. It actually refers to simplifying arithmetic conditions, with the specific goal of proving whether the intersection of two sets, defined intensionally, is empty or not. This problem is actually very hard, but the case I’m interested in is where atomic conditions are linear inequalities involving single variables, which you have to imagine as referring to properties of a type, instances of which you define sets of. I can hear your enthusiasm winding down, because I’ve now simplified the problem a millionfold. One would think that this would be a toy problem for which there would be many solutions and yet, I couldn’t find any that didn’t involve finding concrete solutions and, all I needed was a simplified expression of the condition for the set intersection (for example, I looked at JSR 331). So, I decided to tackle it myself; why not have a little fun, huh?

Digging through the pits of my school-days memory, my first hunch was that some kind of normalization would be in order, and I settled on disjunctive normal form. Please don’t write me that all SAT and SMT solvers use conjunctive normal form and that I probably dug through the armpits of my school-days memory. Hear me out, first.

Each conjunction in DNF can have all its constituent atomic conditions on same variables be merged together, which results in either false, or another atomic condition. Consequently, each conjunction in DNF can be brought to, what can be described as, a simplified form where each variable is mentioned only once. And disjunctions will be fairly uncommon in the use case I was envisioning, which is some query builder interface (but not absent).

I’m going to walk you through my code and thought process. You’ll have to imagine the former being enclosed in a Scala object and the latter enclosed inside that glass display case for brains, that you see in movies. You’ll also have to imagine what role various types and case classes/enums play, because their names are evocative enough.

def toFlatDNF(c: ICriterion) : ICriterion = simplify(toDNF(c))

In a top-down fashion, I’m going first to transform the expression to DNF, then simplify it.

def flatten(c: ICriterion) : List[ICriterion] = {
    c match {
        case Composite(_, li2) => li2
        case _ => List(c)
    }
}

def cartesianProduct[T](xss: List[List[T]]): List[List[T]] = xss match {
   case Nil => List(Nil)
   case h :: t => for(xh <- h; xt <- cartesianProduct(t)) yield xh :: xt
}

def distributeOrOver(ors: List[ICriterion], ands: List[ICriterion]): List[ICriterion] = {
    val andsInOrs = ors.map(flatten)
    val product = cartesianProduct(andsInOrs)
    val allEnds = product.map(_ ++ ands)
    allEnds.map({
        case List(c) => c
        case li => Composite(AND,li)
    })
}

def toDNF(c: ICriterion) : ICriterion = {
    c match {
        case Composite(op,li) => {
            val li2 = li.map(toDNF).toList 
// each might be an OR of up to depth 2 or an AND of depth up to 1
            val (ands, ors) = li2.partition {
                case Composite(AND, c) => true
                case Composite(OR, c) => false
                case _ => true
            }
            op match {
                case OR => {
                    val flators = ors.flatMap(flatten)
                    Composite(OR, flators ++ ands)
                }
                case AND => {
                    val flatands = ands.flatMap(flatten)
                    if (ors.length == 0)
                         Composite(AND, flatands)
                    else {
                        val over = distributeOrOver(ors, flatands)
                        Composite(OR, over)
                    }
                }
            }

        }
        case _ => c
    }
}

What this (supposedly) does is to distribute OR over AND in order to arrive to DNF.

Now, the simplification that follows is based on extending the domain of all values to add a global MIN and a global MAX value, and on rephrasing all equalities and inequalities as interval conditions, whose either limit can be open or closed (methods left() and right() return the corresponding comparisons to the corresponding limit). A special comparison function, cmp arranges for MIN/MAX-aware comparisons.

The expression is normalized, in that way, then merged, then denormalized again to simple equalities and inequalities.

def groupByParameterEvaluable(liFinal: List[ICriterion]): Map[ParameterEvaluable,List[ICriterion]] = {
    liFinal.groupBy({case ParamOpVal(p,_,_) => p})
}

def normalize(c: ICriterion) : List[ICriterion] = c match {
    case ParamOpVal(p, op, values) =>
        op match {
            case EQ => List(ParamOpVal(p, BTW_E, List(values(0), values(0))))
            case LT => List(ParamOpVal(p, BTW_NE, List(MIN, values(0))))
            case GT => List(ParamOpVal(p, BTW_NE, List(values(0), MAX)))
            case LE => List(ParamOpVal(p, BTW_GE, List(MIN, values(0))))
            case GE => List(ParamOpVal(p, BTW_LE, List(values(0), MAX)))
            case NE => List(
                ParamOpVal(p, BTW_NE, List(MIN, values(0))),
                ParamOpVal(p, BTW_NE, List(values(0), MAX))
            )
            case _ => List(c)
        }
}

def denormalize(c: ICriterion) : List[ICriterion] = c match {
    case ParamOpVal(p, BTW_E, List(x, y)) if x == y => List(ParamOpVal(p, EQ, List(x)))
    case ParamOpVal(p, BTW_NE, List(MIN, x)) => List(ParamOpVal(p, LT, List(x)))
    case ParamOpVal(p, BTW_GE, List(MIN, x)) => List(ParamOpVal(p, LE, List(x)))
    case ParamOpVal(p, BTW_NE, List(x, MAX)) => List(ParamOpVal(p, GT, List(x)))
    case ParamOpVal(p, BTW_LE, List(x, MAX)) => List(ParamOpVal(p, GE, List(x)))
    case _ => List(c)
}

def merge(li: List[ICriterion]) : ICriterion = li.reduceLeft(combine)

def simplifyCriteria(criteria: List[ICriterion]): List[ICriterion] = {
    var normalized = criteria.flatMap(normalize)
    var merged = merge(normalized)
    denormalize(merged)
}

def simplify(cDnf: ICriterion): ICriterion = {
    cDnf match {
        case Composite(OR, liConj) => Composite(OR, liConj.map(simplify))
        case Composite(AND, liFinal) => {
            val grouped = groupByParameterEvaluable(liFinal)
            val groupedMerged = grouped.iterator.map({
                case (_, v) => simplifyCriteria(v)
            })
            val merged = groupedMerged.flatten.toList
            Composite(AND, merged)
        }
        case _ => cDnf
    }
}

The most complicated operation in all that is how to combine too intervals (one step of the merging process). Notice that, in order for two intervals to stand a chance of being able to combine, their closed-closed analogs should intersect and the easiest way to express that is as a negation of the condition that they not intersect: either one is before the other, or vice versa.

def fuseBetweenOperator(op1: CriteriaOperator, op2: CriteriaOperator) : CriteriaOperator =
    (op1, op2) match {
        case (GE, LE) => BTW_E
        case (GT, LE) => BTW_GE
        case (GE, LT) => BTW_LE
        case (GT, LT) => BTW_NE
        case _ => throw new Error(s"fuseBetweenOperator cannot be used on $op1 and $op2")
    }

//This combines two comparisons with conjunction
def combine(ic1: ICriterion, ic2: ICriterion) : ICriterion = {
    val FALSE = new FalseCriterion
    if(ic1.isInstanceOf[FalseCriterion]) return FALSE
    val c1 = ic1.asInstanceOf[SimpleCriterion]
    val c2 = ic2.asInstanceOf[SimpleCriterion]
    (c1,c2) match {
        case (ParamOpVal(p,op1,List(v1,v2)),ParamOpVal(_,op2,List(w1,w2))) if cmp(w1,v2)<=0 =>
            if(cmp(w1,v2)==0 && (op1.right()==LT || op2.left()==GT))
                FALSE
            else {
                val (lowerBound, leftop) = if(cmp(w1,v1)<=0) (v1,op1.left()) else (w1,op2.left())
                val (upperBound, rightop) = if(cmp(v2,w2)<=0) (v2,op1.right()) else (w2,op2.right())
                val operator = fuseBetweenOperator(leftop, rightop)
                if(cmp(lowerBound,upperBound)==0 && operator!=BTW_E)
                    FALSE
                else
                    ParamOpVal(p, operator, List(lowerBound, upperBound))
            }
        case (ParamOpVal(p,op1,List(v1,v2)),ParamOpVal(_,op2,List(w1,w2))) if cmp(w1,v2)<=0 =>
            if(cmp(v1,w2)==0 && (op2.right()==LT || op1.left()==GT))
                FALSE
            else {
                val (lowerBound, leftop) = if(cmp(w1,v1)<=0) (v1,op1.left()) else (w1,op2.left())
                val (upperBound, rightop) = if(cmp(v2,w2)<=0) (v2,op1.right()) else (w2,op2.right())
                val operator = fuseBetweenOperator(leftop, rightop)
                if(cmp(lowerBound,upperBound)==0 && operator!=BTW_E)
                    FALSE
                else
                    ParamOpVal(p, operator, List(lowerBound, upperBound))
            }
        case _ => FALSE
    }
}

Having all that implemented, testing for intersection is as simple as forming the conjunction of the conditions that define the sets involved, performing toFlatDNF, and seeing if the result is the false expression. If it isn’t, you have it as an expression to inspect and maybe think about how to change the set definitions so that they not intersect.

Java 8: too little, too late?

Lately, I attended an event that Oracle organized in Athens, which also served as the inaugural event for Java 8. I consider Java 8 a major release (unlike prior “major” releases) which does merit discussion. And, as I usually play the Devil’s Advocate, I’m going to share some thoughts on it.

I’ll focus on certain new features: the most prominent are Streams and lambdas, but Optional and default methods are also worth discussing. I won’t discuss new time support, or any other library improvement.

Streams is an attempt to give Java a real collection library; only, the library consists of a single interface, the Stream, and you have to convert back and forth between it and the real data structures. Still, it’s exciting to go from almost zero to being able to do map, filter, flatMap, fold, reduce etc in Java.

Lambdas will help a lot to bring some long-needed conciseness to Java. But they’re not really lambda expressions, meaning, anonymous functions. The only place where they can exist is where a Functional interface is expected (and a functional interface, in short, is any interface representing a single concern – more on that later). Combined with some limited type inference, this does give a lot of leverage without doing real changes to the language. If you were planning to do heavy higher-order functional stuff in Java, don’t hold high hopes, because the type signatures you have to write will likely dissuade you. There’s no type inference to ease the burden. And no predefined function interface for three or more arguments. But it’s nice for collection operations and such.

Since you know that I don’t have time for lengthy posts, here is a Java 8 sample which demonstrates, in a single shot, various things about Java 8.

    public static void main(String[] args) {
        final Optional<Integer> integerOptional =
                Arrays.stream(args)
                .findFirst()
                .flatMap(a -> getEl(getTheMap(), a))
                .flatMap(SampleUseOptional::tryParseInt)
                .map(i -> i*2);
    }

    public static <K,V> Optional<V> getEl(Map<K,V> map, K key) {
        return map.containsKey(key) ? Optional.of(map.get(key)) : Optional.empty();
    }

    public static Optional<Integer> tryParseInt(String s) {
        try {
            return Optional.of(Integer.valueOf(s));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    private static HashMap<String, String> getTheMap() { /* ... */ }

The meat, of course, is the monadic workflow in main. But notice how artfully I crammed in there

  1. the fact that, even arrays can be made into streams, even though they were second-class collections up to Java 8
  2. the fact that, using defender (default) methods to enhance existing interfaces can do nothing for Java arrays, hence the need for Arrays.stream
  3. the fact that Optional is also a monad in Java 8, as are collections (by virtue of Streams)
  4. the fact that Optional is not integrated where it makes sense, like Map.get or Integer.parse
  5. the fact that there’s shorthand for referring to a method
  6. the fact that there’s no shorthand to refer to a curried method (e.g. getEl(getTheMap())

Let’s work on that lack of currying for a moment. Surely there’s some (verbose) way to define what currying means. Here is an appropriate definition for a two-argument, may I dare say?, function.

    private static <T1,T2,R> Function<T2,R> curryFirstOfTwo(BiFunction<T1,T2,R> f, T1 x) {
        return y -> f.apply(x,y);
    }

Ok, that was not so bad. This is still Java, after all. Let’s try to use it.

        final Optional<Integer> integerOptional3 =
                Arrays.stream(args)
                .findFirst()
                .flatMap(curryFirstOfTwo(SampleUseOptional::getEl, getTheMap()))
                .flatMap(SampleUseOptional::tryParseInt)
                .map(i -> i * 2);

Hmm, IDEA chokes on this, and stops complaining only when the first argument to curryFirstOfTwo is rewritten as a lambda with typed arguments, but the actual compiler fares better. So this amount of boilerplate gives you an idea of what you can expect when you want to go a little further than the design space of the new features. Still, it’s feasible. And don’t forget all the fun, implementing interfaces for functions of arity N > 2!

I didn’t cover what Optional is, because it is what its name implies: it’s what Scala and F# call Option, Haskell calls Maybe and Guava already had with the same name. It’s your way out of NPE Hell, and how you’ll avoid being ridiculed by your C/C++ programmer friends to which you dared boast one day that programming in Java is safer. They’re still laughing, but not for long. The simplicity of this idea makes it be misunderstood. How can anything so simple have so much value? Aren’t nullity declarations, non-nullable types, nullity inference, elvis operators and such (found in Kotlin, Gosu, Fantom), similar, if not better? No, in my opinion, they aren’t. But this is a discussion for another day. The important thing is that Optional is now part of Java 8, and yet it isn’t, since the whole behemoth of its library is still exception-waiting-to-happen-ridden. But it can be part of your codebase.

Default methods, which were added to allow Oracle to extend interfaces, but now allow anyone to do as much, bring multiple (implementation) inheritance in Java. It’s that simple. No diamond problem, according to a presentation from Oracle I watched. At least, no diamond problem if you restrict the problem to how state is initialized in multiple inheritance, not how the rest of the world defines it, in a more general sense. With great power comes great responsibility, and Java, breaking out of its cocoon, starts having grown-up-language problems. Welcome, Java!

So, is Java 8 too little, too late? In my opinion, no and no. It doesn’t come up ahead. It just plays catch with everyone else. But it’s the first time it did anything to approach everyone else in a long time.

Already, Java 8 supersedes Guava for functional-ish programming (my up-to-now escape to sanity while doing Java), which had opted for a more limited set of features. Will it clash with Scala? Certainly not. If anything, it will pull more developers out of the stagnating waters older versions of Java had left them in. Thousands of people had stopped thinking about where programming can go to. Out of some twisted notion of devotion to Java? Out of safety in doing what thousands of others do? Out of arrogance for belonging to the biggest programming community? Something, in any case. And now, it is these people that are the least qualified to make the move to Java 8, while anyone with experience in any other JVM, .Net or any other language should feel right at home.

However, I am pleased to say that many more thousands of people are thirsty for more, which is a sign of sanity in the community. Adoption of Java 8 has a chance of being a lot faster than that of previous releases, and any JVM language will benefit from that happening. Not to mention that, increasing the momentum of the JVM might trigger faster evolution of over-the-horizon features like value classes and real arrays.