Monday, August 24, 2015

UML and top down design

Since my posting, my good friend Mehran Sahami, has pointed out that the syntax of language also lend themselves well to Top Down Design.  He is absolutely correct.
At the risk of sounding passive aggressive, most programmers I know, do a combination of top down and bottom up coding/design.  Also, design and writing the programming syntax may be separated, with coding adding to or adjusting the original design.
But Mehran is absolutely correct.  UML is a perfect example of how Top Down Design results in good architecture and good programming.

Thursday, August 14, 2014

Programming as We Think and Speak, by Changing the Reverse Syntax Notation of Today's Computer Languages

Wow! It has been a while since I last posted (solving the eternal question: "who came first? the chicken or the egg", based on evolution).

The main idea:  The current syntax of C, also inherited by later Structured and OO Languages, maps how a computer reads and treats a structure or a container Class, and not how the human's think and speak in English.  Much like Reverse Polish Notation for early calculators, this made sense initially.  But thankfully, a simple change can make writing and understanding programs faster, easier, more intuitive, less error prone, and more natural for human programmers and operators.  

This goes hand in hand with the concept of with the concepts of inheritance (is-a) and containment (has-a) in OO programming.  In a way, you may say that we often think and speak hierarchically from bottom up and from specific to more general.  But in computer languages, objects and their fields are accessed top down and from the container classes hierarchically down to their elements and elements of elements.  Linguistically, and naturally this makes sense.  What we are interested in is the color of the nail on the pinky of  the left hand of Jill and making it setting it to bright_red.  But what we write in today's computer languages is:
jill.hand.left.pinky.nail.color = bright_red;
rather than
color.(of).nail.(on).pinky.(of).left..hand.(of).jill = bright_red;

Abstract and Summary
This blog discusses the design of Computer Languages rooted in the historic design of C by Dennis Ritchie.  It claims, and in fact shows, that the traditional syntax used in all subsequent structured and Object Oriented languages are indeed "backward" or "in reverse order of" how we think, and we speak or write in English.  Ritchie's design of C, much like the Reverse Polish Notation first used in calculates, was motivated by efficiency of computer programs and specifically ease of writing compilers.  But the way we speak and think in English, for the most part, is in reverse order of how we write programs using the today's modern computer languages.
It is trivial to map the way we speak in simple sentences directly to programming language statements.  The immediate results will be to make programming more intuitive, faster, easier and less error prone.  Teaching programming languages will  be easier, as large nested objects become easier to track and understand.  Automatic document generation from code, or code generation from requirements will also be much easier.

As a simple example take printing a line in Java to the output stream -- of the System.  What we want to say naturally is:  println ("this a line!") to out of System;
What we write instead is:  System.out.println ("this is a line");

Another simple example of how nested containment makes programming more complicated is to consider:  "11th word, of 22nd sentence, of 303rd paragraph, of 4019th document", where word, sentence, paragraphs and documents are appropriately defined objects and classes each containing the correct lower level element.  
In Java, or C/C++ we write:  doc[4018].parag[302].sent[21].word[10] while in reality what we want to write is word[10] of sent[21] of parag[302] of doc[4018].  

I then introduce a new "dot dot" notation where we replace all the "of"  or "in" or "to" or "from" etc with ".."  The above statement then becomes.
word[10]..sent[21]..parag[302]..doc[4018] 
For consistency, we can actually take this one step further when it comes to arrays.  We can  say that slate 10th word in 21st sentence of 302nd parag for 4018th doc := [10]..word..[21]..sent.. etc etc.  But there is no reason not to use a mixed notation, as it acts like using parentheses, and makes statements shorter and more decipherable.

This is not a first time that a less than intuitive notation was first introduced, used, and popularized for the sake of computational efficiency.  In the 50s and 60s Reverse Polish Notation was developed and used in many calculators including HP.  It minimized the computer memory access and instead used stack to evaluate expressions faster. By the same analogy, today's C syntax and its derivative languages come from an era where the notation was developed for interpretation by compilers and not humans.  The dot and -> notation were logical to use, as they made parsing statements easier for the compilers, and writing an optimizer easier.

The new dot dot notation turns out to be better understood, and faster to compose and code for humans.  It also has direct implications for automatic program generation from Spec, or automatic commenting from raw code.  Moreover, it can augment and be inter mixed with, and not completely replace the traditional dot and arrow syntax notation.

A bit of history (you can skip -- in fact, please do skip!).
Couple of weeks ago, I had lunch with two good friends, Thierry David Donneau-Golencer and Steve Omohundro, in Palo Alto. I met Thierry when he was still a scientist at SRI International and who is now the Co-founder and product manager at Tempo-AI.  I met Steve later, and was told that he is the nicest brilliant man in Palo Alto.  I ran into Steve again at an ACM talk given by Bill MacCartney of Google Research and Stanford University and invited him to meet Thierry and to have lunch with us.
Our conversation soon turned in to various things we are working on, including Computer Languages and automatic programming and understanding of programs.  I was first exposed to this topic when finishing my PhD.   I met a new student from Germany who was doing her thesis on Nat Lang Understanding and writing programs directly from Requirement Specs!  I remember hearing about it and not liking the solution of creating a form based approach to writing specifications at all.  I thought, and still believe that it would be too restrictive.
I heard about this topic again, when I was working at NASA and for Research Institute for Advance Computer Scienec.  Through chance I met a research scientist from France at a party who was working at IBM Almaden Research Center on Computer Languages.  During the party I told him, that I think Computer Languages are designed backward and quiet inefficiently.  Something I had thought about when I got to learn and later teach C and C++.  
I explained my reasoning to him, which he agreed with completely, and thought it was a great idea to design and change IDEs to translate simple English to structured languages like Java or C/C++. But unfortunately he was headed back to France and we never got to work on this together.   So here I am years later finally writing about it, thanks to Thierry and Steve who rekindled that idea in me. (I even tried to think where my notes from my NASA days on this would be, but I quickly gave up;  they are probably back in Canada).

The Main Idea and the "Reverse Syntax Notation"
One of the first and best "structured" computer languages was C programming language, designed by Dennis Ritchie.  It was "structured" because, unlike Fortran and Lisp, it introduced the concept of structures -- i.e. the keyword struct-- which was the predecessor to concepts of Class in Object Oriented Programming (which started with the programming language Simula).  

C was used by Ken Thompson to develop the Unix operating system, and it was the language I thought myself when I first arrived at CMU to pursue a PhD.  C was not only one of the first languages (together with Fortran and Lisp), it became so popular that it set the groundwork for the next languages to come such as Pascal, Modula, Ada, C++, Eiffle, Java, C#, and probably a lot more.  As a result, its syntax got carried over to these languages too.  The Reverse Syntactical Notation prevailed as a form of backward compatibility, if you may.  Kinda sorta like Windows 3.2 trying to be backward compatible with DOS!  But certainly not as bad. :-)

With struct C also had another beautiful notion -- the concept of pointers, which allowed programmers to directly access and modify any part of the memory.  While this was a huge step toward writing efficient programs, it could potentially get you in trouble (for instance, the dreaded null pointer exception).  Later versions of C introduced constants -- or const -- which led to the concept of references.  References replaced pointers in later languages such as Java not just for simplicity but, among other things, for security.   The two concepts of structures and pointers in C were central to its popularity and its power.  

As much as I nostalgically love and appreciate C for its power and beauty, I believe its syntax today is unnatural, linguistically backward and carried over to later languages unfortunately!  But more importantly that it can be easily corrected or augmented.

Let me explain.  The current syntax of C, mapped how a computer would treat a structure, and not how the human's think and speak in English.

You see struct and pointers were designed around each other and to augment arrays.   Here is why.  The syntax to access the 11th element of a character array, ca, is as ca[10].  (NB unlike Fortran or Matlab the array indexing in C and many of its descendants begins at 0).  
For the compiler this is easy to translate.  Go to the location or pointer where ca starts -- i.e., ca or  &ca[0] -- and add 10 times the size of a character -- i.e., 10 bytes -- now you are at the 11th character.   If we had another array, this time of integers -- lets call it ia -- to get access to the location of the 11th integer in the array, we would first go to ia (or &ia[0]) which the location of the first integer, then add 10 time size of integer, and volia we are at the beginning of the 11th element of ia -- that is &ia[10].

struct works exactly the same way.  To access an element within a struct  you first go to the pointer at the beginning of the object that struct represents, then you add the appropriate number of bytes depending on the size of the elements declared in struct to get to the appropriate element in the object. That makes life really easy for the compiler and the compiler writer, also for people who want to write efficient programs through manipulating pointers.  i.e., pointer arithmetic.

But the end result is that we create a "reversed notation" compared to how we think and how we speak in English.  This may sound like a stretch, but I feel it is a result of abstracting out how computers work, rule based thinking, a misplaced traditional need for historical compatibility that has made this reverse computational notation persist.

Ironically, this is not that different from the adoption of the Reverse Polish Notation that were popular in 70s and used in top end calculators such as HP.  While today, no calculator use the RPN, it was originally developed "to reduce computer memory access and utilized the stack to evaluate expressions".  What made today's computer languages syntax more wide spread, was newer languages such as Pascal that came after C, adopted the same syntax (I suspect for) backward compatibility and synergy with C.  It also helped that Nat Lang Processing, was not an area of active research and development.

A Concrete Example and  the Dot Dot Notational Solution
I haver already given a few examples of the double dot vs the traditional single dot notation.  But it is important to note that there is no reason to be strict and not to mix the two notations.

So what do you do if you have 10000 documents each having exactly  p>1000 paragraphs, and each paragraph having s>100 sentences, and each sentence having 20 words?  Well you write a struct for sentences that contains 20 words (i.e., null terminated character arrays), and a struct for paragraph that contains sentences, and a struct for documents that contains paragraphs.

The problem though is accessing them.  To get the "11th word, of 21st sentence, of 301st paragraph, of 4001st document" the C syntax is:
doc[4000].parag[300].sent[20].word[10]
But this is exactly backward from the way we speak.  Look at the words and the numbers.  As we will see later, this reverse index notation is even more unnatural to use when we add methods and operators.

Instead lets use the dot dot notation and we can encode the correct sequence without touching the old syntax.  Here the appropriate syntax becomes:
[10]..word..[20]..sent..[300]..parag..[4000]..doc
word[10]..sent[20]..parag[300]..doc[4000]
This notation is a lot more intuitive to read and to program, because the real focus is the 11th word and not the 4001st document. In essence you program the way you think, write and speak.  It simply reads 11th word of 21st sentence of 301st parag of 4001st doc.

What about functions and methods.  Again, the new syntax is consistent with the English language. Instead of saying or writing:  
doc[4000].parag[300].sent[20].word[10].compare("syntax")
we can say either "compare 'syntax' to 11th word of 21st sentence of 301st paragraph of 401st document"
compare ("syntax")..[10]..word..[20]..sent..[300]..parag..[4000]..doc
or alternatively, "compare 'syntax' to word 10 of sentence 20 of the 300th paragraph in document 4000"
compare ("syntax")..word[10]..sent[20]..[300]..parag..doc[4000]
Please note how the red words and the .. correspond directly to one another.  

A quick personal historical note.  At one point, I did want to write Emacs functions and macros that would translate between the double dot and single dot notion when writing C++ programs.  I was going to call the files using the double dot notation with a .ce extension, and then just translate them to .c or .cpp before passing them to the compiler.  But then I had to either come up with a new notation for -> which replaces the single dot when we are using a pointer to a structure in C and C++, or to de-reference every pointer to structure with the appropriate number of parentheses and a *.   A much better solution was to use <> in front of every pointer.  Then .. would translate to . in the reverse notation and <> would translate into ->. 

Emacs was starting to be replaced with modern IDE's like Visual C++ which was a closed system.  So you could not really extensions to do such a thing.  But today in Eclipse and Netbeans -- among other IDEs -- it is easy to write extensions in Java.  Statement completion (usually with Ctrl-Space) and examining elements of an object will still be possible and in fact more natural in the dot dot notation.  Because if an element or a method is missing in a class or object the list generated by Ctrl Space, which flags the programmer to go add the element or the method, potentially in real time.

Given the size of large system today, and the dominance of OO programming, these changes are even more appropriate that before.  In Modern languages like Java and C# all objects are references, and there is no notion of ->.  This means single dot notation can easily be translated into double dot notation and vice versa and they can live together in the same file.  The double dot is a simply an additional notation that adds flexibility to design and implementation.  

Summary and Concluding Remarks
This new dot dot notation is not only simple to use and understand, it is also much more intuitive, easier to teach, easier to understand and to remember.   It will increase the speed and efficiency with which we will program, and it will also reduce mistakes.  One will codes the way one speaks, and deciphering code and understanding it becomes trivial, simply because the code reads the way one writes it in an English sentence.  Later, we can take simple English sentences, and translate them directly into code.

I also strongly suspect that the new double dot notation will make program design and writing better specs faster.  Now as you write the specs, you think of the objects and their respective parts, their attributes, their methods and how your written words can be automatically translated into a program.  I may not have fully convinced myself that statement completion is easier with dot dot notation, even though I can see how.  But I am also fairly cerebral, so give me some time and if there is a problem, I will solve that too. :-)  

Cheers, E