Divolte Collector

Scalable clickstream collection for Hadoop and Kafka

Divolte Collector is a scalable, high-performance server for collecting clickstream data in HDFS and on Kafka topics. It uses a JavaScript tag on the client side to gather user interaction data, similar to many other web tracking solutions. Divolte Collector can serve as the foundation for anything from basic web analytics dashboards to real-time recommender engines or banner optimization systems.


single tag site integration

Including Divolte Collector is an HTML one-liner. Just load the JavaScript at the end of your document body.

built for big data

All data is collected directly in HDFS and on Kafka topics. Divolte Collector is both an HDFS client and a Kafka producer. No ETL or intermediate storage is required.

structured data collection

All data is captured in Apache Avro records using your own schema definition. Divolte Collector does not enforce a particular structure on your data.
 

user agent parsing

It's not just a string: add rich user-agent information to your click event records on the fly.

ip2geo lookup

Attach geo-coordinates to requests on the fly. (This requires a third-party database; a free version is available.)

fast

Handle many thousands of requests per second on a single node. Scale out as you need.
 

custom events

Just like any web analytics solution, you can log any event. Supply custom parameters in your page or JavaScript and map them onto your Avro schema.

integrate with anything

Work with anything that understands Avro and either HDFS or Kafka: Hive, Impala, Spark, Spark Streaming, Storm, and so on. No more log file parsing.

open source

Divolte Collector is hosted on GitHub and released under the Apache License, Version 2.0.

Site integration

You integrate Divolte Collector into your site by including a small piece of JavaScript.

This takes care of logging all pageviews and exposes a JavaScript module that you can use to interact with Divolte Collector from the browser, for example to log custom events.

<body>
<!--
  Your page content here.
-->

<!--
  Include Divolte Collector
  just before the closing
  body tag
-->
<script src="//example.com/divolte.js"
        defer async>
</script>
</body>

Scalable

Divolte Collector pushes data to Hadoop HDFS and Kafka topics. Data is written to HDFS as complete Avro files, while Kafka messages contain serialized Avro records.

Divolte Collector itself is effectively stateless; you can deploy multiple collectors behind a load balancer for availability and scalability.
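
As an illustration, both sinks can be switched on in the collector's configuration file. This is a minimal sketch, assuming the flusher property names from the 0.x configuration reference (consult the reference documentation for your version; the URI and topic are placeholders):

divolte {
  hdfs_flusher {
    enabled = true
    hdfs {
      // Where complete Avro files are written.
      uri = "hdfs://namenode:8020"
    }
  }
  kafka_flusher {
    enabled = true
    // Topic that receives the serialized Avro records.
    topic = "divolte"
  }
}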

Structured data in Avro

To preserve the sanity of developers and data scientists alike, all data should come with a schema. CSV is not a schema. The common log format is not a schema. JSON is not a schema. A schema defines which fields exist and what their types are. Using a schema allows you to inspect data without making assumptions about which fields are available.

Divolte Collector uses Apache Avro for storing data. Avro requires a schema for all data, yet it allows for full flexibility through schema evolution.

Through a special feature of Divolte Collector called mapping, you can map any part of incoming events onto any field in your schema. Mapping also allows for complex constructs such as mapping fields conditionally or setting values based on URL patterns or other incoming event data.

Avro schema

{
  "namespace": "com.example.record",
  "type": "record",
  "name": "MyEventRecord",
  "fields": [
    { "name": "location", "type": "string" },
    { "name": "pageType", "type": "string" },
    { "name": "timestamp", "type": "long" }
  ]
}

Divolte Collector mapping

mapping {
  map clientTimestamp() onto 'timestamp'
  map location() onto 'location'

  def u = parse location() to uri
  section {
    when u.path().equalTo('/checkout') apply {
      map 'checkout' onto 'pageType'
      exit()
    }
    map 'normal' onto 'pageType'
  }
}

Map user agent information onto Avro fields

map userAgent().family() onto 'browserName'
map userAgent().osFamily() onto 'operatingSystemName'
map userAgent().osVersion() onto 'operatingSystemVersion'

// Etc... More fields available

User agent parsing

Know what this means?

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36

User agents are parsed into readable fields such as operating system, device type, and browser name. User agents are parsed using a database of known user agent strings, which Divolte Collector can update on the fly without requiring a restart.
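
As a sketch, the parser is tuned through the collector configuration; the property names below follow the 0.x configuration reference and may differ between versions:

divolte.tracking.ua_parser {
  // "caching_and_updating" caches parse results and picks up new user
  // agent definitions at runtime without a restart.
  type = caching_and_updating
  cache_size = 1000
}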

IP to geolocation lookup

If enabled, Divolte Collector will perform on-the-fly ip2geo lookups using databases provided by MaxMind. You can use either the free, downloadable lite version of their database or the more accurate subscription version, which comes with a commercial license.

Note that this feature cannot be enabled by default, because redistribution of the MaxMind database is restricted. Configuration is simple, however: just set the path to the database file in the Divolte Collector configuration.
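
A minimal configuration sketch (the property name follows the 0.x configuration reference; the path is illustrative):

divolte.tracking {
  // Path to a MaxMind database file, e.g. the free GeoLite2 City database.
  ip2geo_database = "/etc/divolte/GeoLite2-City.mmdb"
}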

If you have a subscription license to the MaxMind database, Divolte Collector will reload the database as updates appear, without requiring a restart.

Requests per second:    14010.80 [#/sec] (mean)
Time per request:       0.571 [ms] (mean)
Time per request:       0.071 [ms] (mean, across all concurrent requests)
Transfer rate:          4516.55 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     0    0   0.2      0       3
Waiting:        0    0   0.2      0       3
Total:          0    1   0.2      1       3

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      1
 100%      3 (longest request)

Test run on a laptop-hosted virtual machine with two virtual cores, writing to a pseudo-distributed Hadoop HDFS running within the same virtual machine.

Fast

Tracking code shouldn't keep the browser spinning any longer than necessary.

Divolte Collector was built with performance in mind. It relies on the high-performance Undertow HTTP server and has a clean internal threading model with zero shared state and a high level of immutability. Everything is non-blocking, which results in little contention under normal operation.

Custom events

Log anything from the browser. As with other web tracking tools, you can fire custom events from your pages using JavaScript. Whether it's an add-to-basket, a checkout or a product image zoom, just add a custom event if you want to track it.

Custom events can carry parameters in the form of arbitrary JavaScript objects, and these are easily mapped onto fields in your own Avro schema. You can extract top-level object members directly by name, or use JSONPath expressions to extract values, arrays or complete objects from the event payload.

In JavaScript

divolte.signal('searchResultClick', {
    productId: 309125,
    searchPhrase: 'sneakers',
    filters: [
        { name: 'size', value: 10 },
        { name: 'color', value: 'red' }
    ]
})

In Divolte Collector

// Using direct values
map eventParameters().value('searchPhrase') onto 'searchPhrase'
map eventParameters().value('productId') onto 'product'

// Using JSONPath extractions
map eventParameters().path('$.filters[*].name') onto 'filterKeys'
map eventParameters().path('$.filters[*].value') onto 'filterValues'

Hadoop ecosystem

Divolte Collector is not opinionated about the best way to process or use your data. Because it writes data as Avro records, you are free to work with it using any framework of your choice.

Perform offline processing of the clickstream data using Cloudera Impala, Apache Hive, Apache Flink, Apache Spark, Apache Pig or plain old MapReduce. Anything that understands Avro will work.

For near real-time processing, you can consume Divolte Collector's messages from Kafka using plain Kafka consumers, Spark Streaming or Storm.
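
For example, a plain Java consumer could decode the messages roughly like this. This is a sketch, not the definitive API: the broker address, topic name and schema file are assumptions, and it presumes each Kafka message value is a binary-encoded Avro record written with your schema:

import java.io.File;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericRecord;
import org.apache.avro.io.DecoderFactory;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ClickstreamConsumer {
    public static void main(String[] args) throws Exception {
        // The same Avro schema the collector writes with; the path is illustrative.
        Schema schema = new Schema.Parser().parse(new File("MyEventRecord.avsc"));
        GenericDatumReader<GenericRecord> reader = new GenericDatumReader<>(schema);

        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker
        props.put("group.id", "clickstream-example");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("divolte")); // topic name is an assumption
            while (true) {
                for (ConsumerRecord<byte[], byte[]> record : consumer.poll(Duration.ofSeconds(1))) {
                    // Each message value is assumed to be a binary-encoded Avro record.
                    GenericRecord event = reader.read(null,
                            DecoderFactory.get().binaryDecoder(record.value(), null));
                    System.out.println(event.get("location"));
                }
            }
        }
    }
}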

Open source

Divolte Collector is released under the Apache License, Version 2.0.

It's never a good idea to be locked into a vendor for your data collection. Similarly, sending your clickstream data to cloud providers can present issues. Better to take control and free yourself from data ownership issues, closed formats, and license or service fees for access to your own data.