I just watched the TED talk by Kevin Slavin, titled ‘How Algorithms Shape Our World’. A really mesmerizing talk on the way our world is literally being transformed by algorithms. Besides the content and the pleasant, funny way it is delivered, the imagery is fascinating; especially the flock of birds.
The day started with the keynote by Ted Dunning. He opened with a remark about the flow of time and an invitation to submit questions via Twitter, so that the microphone needn’t be passed around.
He defined a measure called ‘sucktitude’ and applied it to the current Hadoop platform: it is several orders of magnitude away from being as efficient as it could be. Especially for data-flow applications, the potential data volumes far exceed what current systems process.
After an anecdote about ‘free beer on sale’ on a British train Ted was on, he looked at Hadoop’s social problems: multiple agendas, different sets of values, conflicting goals, competition etc. The community consists of a patchwork of commercial and non-commercial organizations as well as individuals, and to prevent a big crush on the one hand or a drifting apart on the other, common ground within the overall ecosystem needs to be found.
I found it a bit too community-planning-and-controlling-y, but then again, that is what keynotes are for.
‘Oh Leonhard, Where art thou?’ by Jim Webber was a great session: Jim is an energetic speaker, and the talk moved at a high pace (originally intended for 20 minutes, now filled out with some extra examples and a somewhat longer Q&A) and was full of jokes. The talk was on graph stores, a data model that is much older (a few hundred years) than the relational model that enterprises are used to.
The whole talk was a big rant on why one would use graph data models: document models are like sausage machines, and relational data models ought not to be called relational (there are hardly any relations, and any inferences from the relations are left to the developers). There was also a good amount of Apple bashing.
I need to have a look at the Koans, a Neo4j tutorial.
‘Making Hadoop Secure’ was a bit disappointing. Between the many ‘you know’s and ‘OK’s there was a look at the security measures in Hadoop (it remained unclear whether these are in the currently available version or in a special secure version). In the beginning there was no security, as users were trusted. The next iteration added some security, but primarily to prevent accidents, as it was easy to circumvent.
Now, authentication is supported via SASL for RPC. Authorization was already in HDFS, but for map-reduce, access control lists were added. Group membership is pluggable and checked on the masters. Auditing is supported in both HDFS (who accessed which file) and map-reduce (who ran each job). Fine-grained logs are provided, and together with strong authentication they provide audit trails.
Using Kerberos and ticket-granting tickets creates a single-sign-on situation. Delegation tokens allow credentials to be passed to all tasks of a job, and the tokens are cancelled when the job finishes (tokens are also valid for a limited time and have to be renewed; this is done automatically by the job tracker). Different types of tokens are used to allow access to certain facilities for a limited duration. Tasks run isolated, under the uid of the submitting user, and they cannot signal other users’ tasks or access data from other tasks.
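To make the Kerberos-plus-delegation-token story a bit more concrete, here is a minimal client-side sketch; it assumes a secured Hadoop build where FileSystem exposes getDelegationToken(String), and the principal, keytab path and renewer are made-up placeholders:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.security.token.Token;

public class SecureHdfsAccess {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell the client to use Kerberos instead of the default 'simple' auth.
        conf.set("hadoop.security.authentication", "kerberos");
        UserGroupInformation.setConfiguration(conf);

        // Log in from a keytab; principal and keytab path are placeholders.
        UserGroupInformation.loginUserFromKeytab(
                "alice@EXAMPLE.COM", "/etc/security/alice.keytab");

        FileSystem fs = FileSystem.get(conf);
        // Obtain a delegation token that tasks can use on the user's behalf;
        // the renewer (here a made-up job tracker principal) keeps it alive.
        Token<?> token = fs.getDelegationToken("mapred/jt@EXAMPLE.COM");
        System.out.println("Got token of kind: " + token.getKind());
    }
}
```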
What I missed were the practicalities of implementing this on one’s own cluster: the question came up during the Q&A, for which there was plenty of time, and there apparently was a slide on it among the reserve slides, but the answer was vague.
‘CouchDB + Membase = Couchbase’: as the title indicates, Couchbase is the combination of CouchOne and Membase. Membase is a distributed database on top of memcached. Keys are mapped to buckets using consistent hashing, and each bucket is managed by a server (with some replicas).
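As an aside, consistent hashing itself is easy to sketch; the following toy ring (my own illustration, not Membase code) shows how keys land on servers and how virtual nodes smooth out the distribution:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// A minimal consistent-hashing ring mapping keys to servers.
public class ConsistentHashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<Integer, String>();
    private final int virtualNodes;

    public ConsistentHashRing(int virtualNodes) {
        this.virtualNodes = virtualNodes;
    }

    public void addServer(String server) {
        // Place several virtual nodes per server to even out key distribution.
        for (int i = 0; i < virtualNodes; i++) {
            ring.put(hash(server + "#" + i), server);
        }
    }

    public String serverFor(String key) {
        // A key belongs to the first node at or after its position on the ring.
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        // Wrap around the ring if we ran past the last node.
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private int hash(String s) {
        // Placeholder hash; a real ring uses a stronger, well-mixed hash.
        return s.hashCode() & 0x7fffffff;
    }
}
```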
CouchDB stores documents as JSON and is accessed via (a wrapper over) its REST interface. Its storage is append-only.
Couchbase also adds GeoCouch.
The talk was very fast and ended well before the scheduled end time.
‘Newer Developments in Large Data Techniques’ posed the question ‘how to get a competitive advantage with data?’. The short answer: more data, or, preferably, better algorithms. Caveat: only use better algos if they improve the results qualitatively. The talk took a dive into three subjects:
Deep learning: machine learning on large amounts of data. A deep learning architecture starts by extracting simple features from the raw input, and successive layers then work on more abstract data. This is contrasted with shallow architectures (neural nets with just a few layers). In 2006 a big breakthrough happened in how to train these networks. A specific use is semantic hashing, which provides fast semantic search and is compact and general-purpose.
Graph parallelism allows for many machine learning algorithms to scale. Map-reduce doesn’t really work for many (sophisticated) ML algorithms.
Unsupervised semantic parsing: the goal is natural language understanding. The manual approach is costly, ineffective and inflexible. An approach by Poon and Domingos brings the state of the art a step further. Clusters are formed from groups of words in different texts that relate to other groups of words.
A cool and inspiring presentation.
‘Building search app for public mail lists in 15 minutes with elasticsearch’ showed how elasticsearch was used to implement search functionality at the JBoss Community site.
It didn’t really follow the abstract, but dove mostly into the queries and facets; even so, it was interesting to see.
‘Composing Mahout clustering jobs’ introduced clustering. The input to the various algorithms is a feature vector. The presentation focused on K-Means, as it is one of the basic clustering algorithms and easy to explain.
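Since K-Means is that easy to explain, a toy version in plain Java (my own sketch, not Mahout’s implementation) makes the assign-then-update loop concrete:

```java
import java.util.ArrayList;
import java.util.List;

public class KMeansSketch {
    public static void main(String[] args) {
        double[][] points = {{1, 1}, {1.5, 2}, {8, 8}, {8.5, 9}, {9, 8}};
        int k = 2;
        double[][] centroids = {points[0], points[2]}; // naive seeding

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: each point joins its nearest centroid.
            List<List<double[]>> clusters = new ArrayList<List<double[]>>();
            for (int c = 0; c < k; c++) clusters.add(new ArrayList<double[]>());
            for (double[] p : points) clusters.get(nearest(p, centroids)).add(p);

            // Update step: move each centroid to the mean of its cluster.
            for (int c = 0; c < k; c++) {
                double sx = 0, sy = 0;
                for (double[] p : clusters.get(c)) { sx += p[0]; sy += p[1]; }
                int n = Math.max(clusters.get(c).size(), 1);
                centroids[c] = new double[] {sx / n, sy / n};
            }
        }
        for (double[] c : centroids) System.out.println(c[0] + ", " + c[1]);
    }

    static int nearest(double[] p, double[][] centroids) {
        int best = 0;
        double bestDist = Double.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            double dx = p[0] - centroids[c][0], dy = p[1] - centroids[c][1];
            double d = dx * dx + dy * dy; // squared Euclidean distance
            if (d < bestDist) { bestDist = d; best = c; }
        }
        return best;
    }
}
```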
Mahout is mostly built on top of Hadoop and has three primary features: collaborative filtering, classification and clustering. An important mode of operation is invoking the various preparation and execution commands from the command line.
A Stack Overflow dataset was used to extract a tag cloud from the data. It used some techniques to find more or less valid collocations (bigrams of words), and Bloom filters to determine later on whether two words formed an interesting collocation when encountered by Lucene. The talk was a bit shallow on the actual clustering process, focusing more on the preparation steps. As someone in the audience put it: ‘a nice basic introduction to using Mahout’.
‘Improve Relevance by Using Morphology and Named Entity Recognition’ started with a big blurb of info on the company, followed by info on inverted indices and the tokenization process (normalization, stop words, stemming, lemmatizers/decomposers, part-of-speech taggers and information extraction). Morphological analysis (lemmatizing and decomposing) has an edge over simple stemming, especially for European languages where words are compounded from other words without spacing, or where the base form cannot be obtained by just removing the last letters.
The lemmatizer maps inflected forms to base forms, using a big (hand-crafted) lexicon. The index stores multiple forms, which can be used in different types of searches and in boosting.
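A lexicon-based lemmatizer boils down to a lookup; a minimal sketch of the idea (the entries here are made up, a real lexicon holds many thousands of forms):

```java
import java.util.HashMap;
import java.util.Map;

// A toy lemmatizer: a hand-crafted lexicon mapping inflected forms to base forms.
public class LexiconLemmatizer {
    private final Map<String, String> lexicon = new HashMap<String, String>();

    public LexiconLemmatizer() {
        lexicon.put("ran", "run");
        lexicon.put("running", "run");
        lexicon.put("mice", "mouse");  // stemming could never find this base form
        lexicon.put("better", "good");
    }

    public String lemma(String token) {
        String base = lexicon.get(token.toLowerCase());
        return base != null ? base : token; // fall back to the surface form
    }
}
```

Both the surface form and the lemma would then be stored in the index, so different query types can match and boost on either.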
Interesting ideas in this talk and it really showed that the changes have a big impact on the quality of the search results.
‘VoltDB – In-memory SQL Database for High Throughput Applications’ was the last actual session. Some sources emit enormous amounts of data (financial ticks, for instance) that have to be analyzed as they come in. The challenges lie in the need to validate, count, aggregate and enrich in real time, as well as to scale on demand and to learn and adapt.
VoltDB’s distinguishing feature is that it is designed for throughput. It uses transactions, and each transaction is a stored procedure. There are no disk waits in a transaction, and no server-to-client chatter. Concurrency is handled by scheduling: work is ordered and then executed without locking.
On the inner workings: tables are partitioned over multiple servers by the value of a designated column. Procedures are routed to the correct server and executed there.
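A stored procedure in VoltDB is a Java class; a minimal sketch of the idea (the table and columns are made up, but the SQLStmt/voltQueueSQL/voltExecuteSQL shape is, as far as I can tell, the actual API):

```java
import org.voltdb.SQLStmt;
import org.voltdb.VoltProcedure;
import org.voltdb.VoltTable;

// Each invocation of run() is one transaction, executed to completion
// on the partition that owns the 'symbol' value, without locks.
public class RecordTick extends VoltProcedure {
    public final SQLStmt insertTick = new SQLStmt(
            "INSERT INTO ticks (symbol, price, ts) VALUES (?, ?, ?);");

    public VoltTable[] run(String symbol, double price, long ts) {
        voltQueueSQL(insertTick, symbol, price, ts);
        return voltExecuteSQL(true); // 'true' marks this as the final batch
    }
}
```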
I felt the session assumed a little too much prior knowledge of VoltDB.
The two days were packed with, overall, good quality talks; time flew by.
I’m in Berlin, and instead of a regular Java conference I’m attending Berlin Buzzwords, a conference focusing on big data/scalability, primarily Hadoop, Solr and related technologies. It’s a smaller-scale conference, but with lots of good sessions and interesting people. This post summarizes the sessions I visited.
After the welcome, Doug Cutting presented the keynote. The session was a mixture of introducing the problem space, introducing Hadoop and the story of the projects Doug worked on (Lucene, Nutch, Hadoop). Some of the insights put forward:
- seeks are slow, so avoid them as much as possible (sequential batch processing),
- B-trees do not perform well when faced with many updates (because of seeks); solve this by sorting and batch processing,
- open source is good, as the price is right, etc.
There are many interesting projects within the Hadoop space. A couple were discussed: Avro (a data format that should facilitate the migration of data between different projects within the Hadoop ecosystem), Mahout (a machine learning library that has some of its algorithms running as map-reduce jobs) and Flume (a data collection framework).
The session ended with a good Q&A session.
‘Search Analytics: What? Why? How?’ had a chaotic start (the speaker was late and didn’t get the slides properly on screen). Overall the session felt a bit commercial, presenting the reports one might look at. The essence of search analytics is to improve the search experience by measuring and monitoring everything, and so helping users find content.
Common indicators of a badly performing search engine are zero-hit searches, a low click-through rate, a high search exit rate, irrelevant results and many refinement searches.
Solutions to zero hits: ‘did you mean’ functionality, auto-complete, having a clear ‘no results’ page.
There were lots of ideas on reporting, but the session was light on the interpretation of the reports and the best practices for following up on them.
‘Distributed search of heterogeneous collections with Solr’ was a good session on distributing search indices. It is recommended to scale vertically as long as possible, because once that is no longer an option you’ll run into a whole new set of issues.
Added costs are in maintenance, operations, increased complexity, increased latency, and ‘many interesting issues’.
The common way is to distribute by document: send each document to a shard according to some rule, process the document at the shard, then query the shards simultaneously and integrate the results.
Solr supports distributed search out of the box, but it is not easy for administrators. A positive feature is that any random node can function as query integrator, so the whole system balances the load. Query integration makes two trips to the other nodes: first retrieving and merging document ids, filtering duplicates and selecting the top-N results, then retrieving the document properties from the nodes that supplied those ids.
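A minimal SolrJ sketch of such a distributed query, assuming the SolrJ 3.x client and made-up host names; the node receiving the request acts as the integrator:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedQuery {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server =
                new CommonsHttpSolrServer("http://host1:8983/solr");
        SolrQuery query = new SolrQuery("title:hadoop");
        // Any node can fan the query out to all shards and merge the results.
        query.set("shards", "host1:8983/solr,host2:8983/solr");
        QueryResponse response = server.query(query);
        System.out.println("hits: " + response.getResults().getNumFound());
    }
}
```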
An example compared single-machine versus distributed search using different metrics (some classic, such as recall and precision, but also Spearman’s rank correlation, as that one takes the order of the results into account). A problem turns out to be that different shards produce incompatible scores, which makes the integration more difficult, even when the distribution of documents over shards is more or less random. There are different ways to deal with this: using a global IDF, which requires more round trips, or an estimated global IDF, where the term frequencies are periodically distributed over the nodes.
Solr needs some improvements in the way it deals with errors: at the moment a single failing node results in the search returning no results.
‘Wrap Your SQL Head Around Riak & MapReduce’ started with an introduction to Riak, a scalable, highly available key/value store. It provides get/put/delete semantics over HTTP or protocol buffers. Metadata (content type, last modified, links, user-specific metadata) can be attached to keys. One of the two primary means of querying is map-reduce.
The combination of Riak and map-reduce is not batch processing (yet); it is not for building indices, nor for crunching the entire dataset, and it is not limited to one of each function.
Quirks: values are opaque (you need to know the structure of the value yourself), etc.
Different SQL constructs were mapped to their map-reduce equivalents at a fast pace, and then clarified somewhat with an example. A session to review the slides of.
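To capture the flavour of that mapping, here is my own plain-Java sketch (not Riak’s API, which typically takes JavaScript functions) of how a GROUP BY with COUNT(*) decomposes into a map phase and a reduce phase:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// SELECT category, COUNT(*) FROM orders GROUP BY category,
// expressed as a map phase followed by a reduce phase.
public class SqlAsMapReduce {
    public static void main(String[] args) {
        String[] categories = {"books", "music", "books", "games", "books"};

        // Map phase: emit (category, 1) for every row.
        List<Map.Entry<String, Integer>> emitted =
                new ArrayList<Map.Entry<String, Integer>>();
        for (String c : categories) {
            emitted.add(new AbstractMap.SimpleEntry<String, Integer>(c, 1));
        }

        // Reduce phase: sum the counts per key, i.e. the GROUP BY / COUNT(*).
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Map.Entry<String, Integer> e : emitted) {
            Integer sum = counts.get(e.getKey());
            counts.put(e.getKey(), (sum == null ? 0 : sum) + e.getValue());
        }
        System.out.println(counts); // {books=3, music=1, games=1}
    }
}
```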
‘Integrating Solr with JEE Applications’ was a more practical session. The focus in enterprise apps is on clear interfaces and layering, and integrating Solr should follow the same practices. The official API to use is SolrJ; in the basic use case it is easy to use and efficient, but it is also heavy on strings (untyped), not domain-related, and essentially the API is an implementation detail. The talk therefore discussed how to go about writing a sort of ORM layer for your application, using typed fields, queries and typed responses.
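SolrJ’s bean binding already gets you part of the way there; a sketch along those lines (the Book domain class and repository are my own example, not the speaker’s code):

```java
import java.util.List;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.beans.Field;

// SolrJ can bind documents to annotated beans, so the rest of the
// application never touches raw, stringly-typed fields.
public class BookRepository {
    public static class Book {
        @Field public String id;
        @Field public String title;
    }

    private final SolrServer server;

    public BookRepository(SolrServer server) {
        this.server = server;
    }

    public void save(Book book) throws Exception {
        server.addBean(book); // converts the annotated bean to a SolrInputDocument
        server.commit();
    }

    public List<Book> byTitle(String title) throws Exception {
        SolrQuery query = new SolrQuery("title:" + title);
        return server.query(query).getBeans(Book.class);
    }
}
```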
‘Scaling Big Data Search with Solr and HBase’ and the following sessions were all in the ‘Loft’, where the temperature made concentrating hard. Here in Berlin temperatures today reached into the lower half of the thirties, and it’s quite humid, making it feel like the tropics.
OpenLogic maintains a database with information on all open source projects (terabytes of info), which is used to scan software to see which projects it uses, for instance for legal purposes.
The first part of the talk was mostly business speak (on the cost of resources). It did shed some light on their hardware and their dos and don’ts (do use ECC memory, do RAID-1 the OS disk, don’t RAID the HDFS storage, do try to keep the network from saturating).
‘Common MapReduce Patterns’ attracted much attention, pushing the temperature higher and the level of oxygen lower. The talk started with MapReduce basics to get everyone on the same page. Then it went over patterns to retrieve unique values, perform a secondary sort, do joins (reduce-side, map-side), use combiners etc. A lot of information, another one to revisit. The conclusion, however, was that one does better focusing on a higher level of abstraction using one of the many existing tools, than stringing boring map-reduce jobs together (but I must admit that the graphs showing the flow over 50+ map and reduce tasks do look impressive).
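The unique-values pattern is the simplest of these to show; a sketch against Hadoop’s ‘new’ (org.apache.hadoop.mapreduce) API, where the shuffle does the deduplication work:

```java
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The mapper emits each value as a key; the shuffle groups duplicates,
// so the reducer sees every distinct value exactly once.
public class UniqueValues {
    public static class UniqueMapper
            extends Mapper<LongWritable, Text, Text, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            context.write(value, NullWritable.get());
        }
    }

    public static class UniqueReducer
            extends Reducer<Text, NullWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<NullWritable> values, Context context)
                throws IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }
}
```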
The last session, ‘Hadoop: A Reality Check’, tried to keep people concentrated by offering cold drinks to those who participated most actively. Too bad he kept coming back to that point.
Some of the take-aways are that Hadoop is not fast, but it scales very well. When inserting data, do not perform transformations (ETL): just store the raw data.
Lessons learned: store raw data, keep it simple, small companies/startups should probably just use SQL, and for very small datasets (a few megabytes) one can run Hadoop within a single JVM (eliminating the overhead of starting a new JVM every time). Consider a pull model, as it is faster and more reliable than pushing the data into HDFS.
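Running within a single JVM is a configuration matter; a sketch assuming the classic pre-YARN configuration keys of that era:

```java
import org.apache.hadoop.conf.Configuration;

public class LocalModeConfig {
    public static Configuration localConf() {
        Configuration conf = new Configuration();
        conf.set("mapred.job.tracker", "local"); // run all tasks in-process
        conf.set("fs.default.name", "file:///"); // local filesystem instead of HDFS
        return conf;
    }
}
```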
The day was concluded with a sponsored barbecue at a different location: C-Base, a place with many references to Star Trek, The Hitchhiker’s Guide to the Galaxy and other SF.
The Android platform is great for tracking all kinds of stuff. It makes it easy to write programs to enter data manually, its sensors allow for some automatic data collection and it also exposes data on calls that were made or messages that were sent or received.
The following are some snippets on extracting phone calls and SMS messages. The genericDump method takes a log and dumps all fields, while exportCalls and exportSMS create somewhat more readable output. Note that the methods were extracted from a working application I created, but adapted for use in this post (without compiling the code). Oh and, yes, the String concatenation could be optimised by using a StringBuilder, and the println statements should be replaced…
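To give a flavour of what such a method involves (this is my own minimal sketch of querying the standard CallLog provider, not the exportCalls from the app; it needs the call-log read permission appropriate to the platform version):

```java
import android.content.Context;
import android.database.Cursor;
import android.provider.CallLog;

public class CallExporter {
    public static String dumpCalls(Context context) {
        StringBuilder out = new StringBuilder();
        // Query the call log, ordered by call date.
        Cursor cursor = context.getContentResolver().query(
                CallLog.Calls.CONTENT_URI, null, null, null,
                CallLog.Calls.DATE + " ASC");
        try {
            int number = cursor.getColumnIndex(CallLog.Calls.NUMBER);
            int duration = cursor.getColumnIndex(CallLog.Calls.DURATION);
            while (cursor.moveToNext()) {
                out.append(cursor.getString(number)).append(';')
                   .append(cursor.getString(duration)).append('\n');
            }
        } finally {
            cursor.close();
        }
        return out.toString();
    }
}
```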
Final day of Devoxx. It went by too fast.
The keynote was a panel discussion, hosted by Joe Nuxoll and Dick Wall, featuring Joshua Bloch, Mark Reinhold, Stephen Colebourne, Antonio Goncalves, Juergen Hoeller and Bill Venners. Tin-foil hats were present to avoid having to answer questions that could not be answered for business reasons, and Dick Wall acted as the enforcer; good fun. Various more or less controversial questions were handled (the first few: whether Java should remain backwards compatible, and whether the Java community was affected by the acquisition of Sun by Oracle). Overall enjoyable, though some answers were very polite or correct. The hats remained off most of the time though, which is a good thing.
‘Creating Lightweight Applications With Nothing But Vanilla Java EE 6’ by Adam Bien was a live coding demo, walking through the creation of a web application. The guy presents with a lot of very dry humour. The format of the talk presented the various features of JEE 6 really well: one to look up on Parleys!
ElasticSearch is a search engine that you can interact with via REST: you push a document with data (for instance a JSON structure of a book and its author), which can then be queried. In queries, the original keys in the documents can be used. Lucene is used internally, and all of Lucene’s power is exposed. ElasticSearch provides quite a few features, including distribution of indices.
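Indexing a document really is just an HTTP call; a minimal sketch with plain java.net (the host, index name, id and document are made-up examples):

```java
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// PUT a JSON document into index 'books' under type 'book' with id '1'.
public class IndexBook {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:9200/books/book/1");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        String json = "{\"title\":\"Moby-Dick\",\"author\":\"Herman Melville\"}";
        OutputStream out = conn.getOutputStream();
        out.write(json.getBytes("UTF-8"));
        out.close();
        System.out.println("HTTP " + conn.getResponseCode());
    }
}
```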
That was the end of three information-packed days. Devoxx remains a great conference, especially if you’re not able to make JavaOne (and from what I read JavaOne ain’t what it used to be, so this may be the best option at the moment). What is lacking is some more crowd-control: at JavaOne everyone MUST leave the room after a session, and there are no clashes of people trying to exit while others try to enter.