16 May 2010

Yay Internship Paper

Good morning. We are here to tell you about our eight months of internship experience in approximately twenty minutes.

As two of the top programmers at the Academies, we set sights on a computer science related internship. We chose to intern with AltLaw because we wanted to work for a small scale open source project to get firsthand experience on working with a team of people all over the country and to be recognized as not just an intern, but also a contributor of the project. We also liked that it was going to be on-campus, so we did not have to arrange transportation to the city or elsewhere.

AltLaw is a free and open source legal search engine. The project was started by Stuart Sierra, Paul Ohm, Tom Bruce, Carl Malamud, and Tim Stanley in hopes of creating a searchable database of law cases and such that is accessible to everyone, not just paid subscribers. However, in November, Google Scholar added legal cases to its vast collection of scholarly documents, so AltLaw is no longer needed. The developers did not see this as a complete disappointment; they were happy to see that a large corporation has taken on this initiative to provide the general public with a searchable collection of legal documents.

We interacted with Stuart for most part of the project. He has a bachelors of arts in theatre from the New York University and a masters of sciences in computer science from Columbia University. He works for the Program on Law & Technology at the Columbia Law School, where he spends his day hacking on AltLaw.

The branch of computer science in which AltLaw belonged is known as Information Retrieval. It is the science of searching for documents, for information within documents, and for metadata about documents, as well as that of searching relational databases and the World Wide Web. The three branches on which we focused are scraping or getting data, processing data (usually with mathematical or algorithmic means), and displaying data, usually with a web interface. At AltLaw, this was done with a variety of tools and languages.

Working for AltLaw required us to learn a Lisp-like programming language called Clojure, among other things. It is radically different from languages one might be accustomed to; it forces one to think recursively instead of imperatively. The biggest omission that newcomers will notice and lament is the lack of `for' and `while' loops. This is a bigger part to data immutability, or the inability to modify the value of a variable once it has been set. Immutability allows for much simpler concurrency implementation because there will not be any discrepancies in variable values in multithreaded applications and such.

To practice coding with a functional programming mindset and to learn the Clojure API, we started out with little programs, such as factorial (which, surprisingly, can be written a bunch of different ways). There is the standard tail recursion method, loop-recur method, and a clean, one-liner map-reduce method. We were surprised that there was a rather large time discrepancy when we tested the different ways with a large test case.

In the meantime, the ATCS parent organization, Computer Science Families, was generous enough to donate two Sun Fire rack servers for us to use. Unfortunately, we were not allowed to keep the servers in school without handing them over. We tried to find another permanent home for them, but that proved to be difficult as well. They, now along with nine other machines, are sitting in the basement of the house of the CSFamilies' president.

We wrote a program similar to the UNIX wc (wordcount) program to test Clojure's performance. The test results showed that the Clojure program was consistently more than twice as slow as wc, which reinforced the notion that interpreted programs are slow (augh this wording is terribad -Sherry Wu 5/16/10 3:11 PM)

After we were relatively fluent with the basic syntax, Stuart taught us about how Clojure and any functional language was deeply intertwined with lists and its siblings, such as maps, sequences, and vectors. To practice with these, we wrote a sentence validator. This program takes a grammar and a collection of sentences and returns whether each sentence is valid with respect to the grammar. (darn i don't think this wording is good either -Sherry Wu 5/16/10 3:16 PM)

As an aside, we were asked to write a regex (regular expression) to parse Java-style floating point numbers. Regexes were quite easy to pick up; and this assignment showed us an application of using regexes. (lol this paragraph seems relatively useless -Sherry Wu 5/16/10 3:20 PM)

Now that we had a hang of working with the data structures, Stuart taught us about the basic infrastructure of search engines. They stored keywords into a table-like data structures, such as a Document Term Matrix or an Inverted Index. The inverted index is essentially a map used to count occurrences of words in a set of sentences and to keep track of which sentences had what words. One of our assignments, which turned out to be a mini-project, was to write an inverted index.

Stuart decided to shift from programming to learning about Hadoop. Unfortunately, our efforts were impeded by a variety of things, including a lack of 32-bit J2SE 6 for Mac OS X, restricted user rights on school computers, and a terribadly craptastic implementation of the school's network. Therefore, we were left with only three laptops to work with. It took quite a bit to get the software configured, and finally, when all three machines were up and running, we got the sample distributed wordcount program to run! Unfortunately, that was all for Hadoop.

On the bright side, we still continued discussing concurrent programming. Stuart introduced us to the meat of Clojure, starting with agents and refs and reinforcing immutability. It was more of a watch-and-learn session instead of hands-on coding. Stuart wrote a program to test the fairness of the random boolean generator, except this time, we could see the proportion of true booleans in real time. As a final example, we played with a prewritten ant farm simulator as a demonstration of concurrent programming.

Stuart also showed us some testing techniques, such as unit testing and applications of agents. He showed us his rapid testing library, in which a program watches several files and runs tests against them every time a change is made. This is done by having the operating system push changes to the program (instead of it asking the file system if the files in question have been changed). A question that still remains is whether it can be used to develop itself. Yay metatesting?

As a final `unit', Stuart introduced us to "better object-oriented programming" in Clojure, which involved message passing, similar to a practice in Objective C. This is radically different from object oriented programming in Java, where every class has a predefined set of messages. We messed with some Protocols (Interfaces in Java speak) and Structures (Classes in Java).

In the middle of working with Stuart, we decided to do some basic Monte Carlo. This is a technique for numerical integration and heavily involves randomness. The canonical example is computing the area of pi by throwing darts in a circle inscribed in a square and finding the proportion of darts that land within the circle. We just implemented this in Clojure for fun and ran increasing numbers of trials in each test to see how the value in question would converge.

To wrap up the end of the year, we did discussions of current technology, the industry, and the future. Our most interesting discussion was about cloud computing, in which we debated the effect of the iPad and Android devices and current development tools. We talked about HTML5 and its open source buddies against the rest of interactive development tools, such as Flash, Silverlight, and even Cocoa. One of the big points in discussion was whether many people would continue to support native iPhone development now that Apple has banned using all but Apple-approved languages.

As a last activity, Stuart told us about different uses of programming languages in industry and upcoming revolutions. Although he lamented that he wasn't part of the PC revolution, he is happy that he is able to see functional programming, massive data processing, and virtualization go mainstream. He showed us the enterprise hardware Java accelerator (essentially a massive server powered by a custom chip with fifty-some cores and oodles of gigabytes of memory). We did an exercise in which we had to write the GCD function in Clojure, Python, and Ruby and benchmark all three. We were surprised that Clojure was fastest (by quite a wide margin), followed by Python and then Ruby.

Overall, we highly recommend this internship to anyone up for a challenge. Picking up Clojure is not recommended for the incompetent, as it requires thinking and persistence to master it. We got a full exposure to a new (and very different!) language and were able to work with someone from the open source community. It is also a big advantage if you are able to push yourself to work in an unsupervised environment or have a superb problem solving skill; if you have either, you will be able to accomplish the assignments easily.

1 comment:

  1. So my favorite parts of this post are your comments. (augh this wording is terribad -Sherry Wu 5/16/10 3:11 PM)

    XDDDDDD.

    ReplyDelete