Divolte Collector
Scalable clickstream collection for Hadoop and Kafka
You integrate Divolte Collector into your site by including a small piece of JavaScript. This takes care of logging all pageviews and exposes a JavaScript module that you can use to interact with Divolte Collector in the browser and log custom events.
<body>
  <!-- Your page content here. -->

  <!-- Include Divolte Collector just before the closing body tag. -->
  <script src="//example.com/divolte.js" defer async></script>
</body>
Divolte Collector pushes data to Hadoop HDFS and Kafka topics. Data is written to HDFS as complete Avro files, while Kafka messages contain serialized Avro records.
Divolte Collector itself is effectively stateless; you can deploy multiple collectors behind a load balancer for availability and scalability.
To preserve the sanity of developers and data scientists alike, all data should come with a schema. CSV is not a schema. The common log format is not a schema. JSON is not a schema. A schema defines which fields exist and what their types are. Using a schema allows you to inspect data without making assumptions about which fields are available.
Divolte Collector uses Apache Avro for storing data. Avro requires a schema for all data, yet it allows for full flexibility through schema evolution.
Through a special feature of Divolte Collector called mapping, you can map any part of incoming events onto any field in your schema. Mapping also allows for complex constructs such as mapping fields conditionally or setting values based on URL patterns or other incoming event data.
{ "namespace": "com.example.record", "type": "record", "name": "MyEventRecord", "fields": [ { "name": "location", "type": "string" }, { "name": "pageType", "type": "string" }, { "name": "timestamp", "type": "long" } ] }
mapping {
  map clientTimestamp() onto 'timestamp'
  map location() onto 'location'

  def u = parse location() to uri
  section {
    when u.path().equalTo('/checkout') apply {
      map 'checkout' onto 'pageType'
      exit()
    }
    map 'normal' onto 'pageType'
  }
}
map userAgent().family() onto 'browserName'
map userAgent().osFamily() onto 'operatingSystemName'
map userAgent().osVersion() onto 'operatingSystemVersion'
// Etc... more fields are available.
Know what this means?

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/40.0.2214.93 Safari/537.36
User agents are parsed into readable fields such as operating system, device type, and browser name, using a database of known user agent strings that Divolte Collector can update on the fly without requiring a restart.
If enabled, Divolte Collector performs on-the-fly ip2geo lookups using databases provided by MaxMind. You can use either the lite version of their database, which is free to download, or their more accurate subscription database, which comes with a commercial license.
Note that it isn't possible for us to enable this feature by default, as redistribution of the MaxMind database is restricted. Configuration, however, is simple: just set the path to the database file in the Divolte Collector configuration.
If you have a subscription license to the MaxMind database, Divolte Collector will reload the database as updates appear without a restart.
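To give an idea of what such a lookup yields, here is a minimal sketch using MaxMind's geoip2 Python library against a GeoLite2 City database. It illustrates the kind of lookup Divolte Collector performs internally; it is not Divolte Collector's own code, and the database path and IP address are placeholders.

# Minimal sketch of a MaxMind ip2geo lookup, similar to what Divolte
# Collector does internally. The database path and IP are placeholders.
import geoip2.database

with geoip2.database.Reader('/path/to/GeoLite2-City.mmdb') as reader:
    response = reader.city('128.101.101.101')
    print(response.country.name)       # e.g. 'United States'
    print(response.city.name)          # e.g. 'Minneapolis'
    print(response.location.latitude)  # e.g. 44.9759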
Requests per second:    14010.80 [#/sec] (mean)
Time per request:       0.571 [ms] (mean)
Time per request:       0.071 [ms] (mean, across all concurrent requests)
Transfer rate:          4516.55 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     0    0   0.2      0       3
Waiting:        0    0   0.2      0       3
Total:          0    1   0.2      1       3

Percentage of the requests served within a certain time (ms)
  50%      1
  66%      1
  75%      1
  80%      1
  90%      1
  95%      1
  98%      1
  99%      1
 100%      3 (longest request)
Test run on a laptop-hosted virtual machine with two virtual cores, writing to a pseudo-distributed Hadoop HDFS running within the same virtual machine.
Tracking code shouldn't keep the browser spinning any longer than necessary.
Divolte Collector was built with performance in mind. It relies on the high-performance Undertow HTTP server and has a clean internal threading model with zero shared state and a high level of immutability. Everything is non-blocking, which results in little contention under normal operation.
Log anything from the browser. As with other web tracking tools you can fire custom events from your pages using JavaScript. Whether it's an add-to-basket, checkout or product image zoom, just add a custom event if you want to track it.
Custom events can carry parameters in the form of arbitrary JavaScript objects, and these are easily mapped onto fields in your own Avro schema. You can extract top-level object members directly by name, or use JSONPath expressions to extract values, arrays or complete objects from the event payload.
divolte.signal('searchResultClick', {
  productId: 309125,
  searchPhrase: 'sneakers',
  filters: [
    { name: 'size', value: 10 },
    { name: 'color', value: 'red' }
  ]
});
// Using direct values
map eventParameters().value('searchPhrase') onto 'searchPhrase'
map eventParameters().value('productId') onto 'product'

// Using JSONPath extractions
map eventParameters().path('$.filters[*].name') onto 'filterKeys'
map eventParameters().path('$.filters[*].value') onto 'filterValues'
Divolte Collector is not opinionated about the best way to process or use your data. By writing data as Avro records, you are free to use any framework of your choice for working with your data.
Perform offline processing of the clickstream data using Cloudera Impala, Apache Hive, Apache Flink, Apache Spark, Apache Pig or plain old MapReduce. Anything that understands Avro will work.
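As a minimal illustration of the offline path, the sketch below reads records from one of Divolte Collector's Avro files using the fastavro Python library. The file name is a placeholder (assuming a local copy of a file fetched from HDFS), and the field names come from the MyEventRecord schema shown earlier.

# Minimal sketch: read Divolte's Avro output with fastavro.
# 'divolte-events.avro' is a placeholder for a file copied from HDFS.
from fastavro import reader

with open('divolte-events.avro', 'rb') as f:
    for record in reader(f):
        print(record['location'], record['pageType'])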
For near real-time processing, you can consume Divolte Collector's messages from Kafka using plain Kafka consumers, Spark Streaming or Storm.
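As a sketch of that near real-time path: each Kafka message from Divolte Collector carries a single Avro record serialized without a wrapping container, so the consumer needs the write schema at hand to decode it. The topic name, broker address and schema file below are assumptions for illustration, using the kafka-python and fastavro libraries.

# Minimal sketch: consume Divolte events from Kafka and decode the Avro
# payload. Topic name, broker address and schema file are placeholders.
from io import BytesIO

from fastavro import schemaless_reader
from fastavro.schema import load_schema
from kafka import KafkaConsumer

schema = load_schema('MyEventRecord.avsc')
consumer = KafkaConsumer('divolte', bootstrap_servers='localhost:9092')

for message in consumer:
    # Each message value is one Avro record without a container file
    # wrapper; decode it against the schema it was written with.
    event = schemaless_reader(BytesIO(message.value), schema)
    print(event['location'], event['pageType'])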
Divolte Collector is released under the Apache License, Version 2.0.
It's never a good idea to be locked into a vendor for your data collection. Similarly, sending your clickstream data to cloud providers can present issues. Better to take control and free yourself from data ownership issues, closed formats, and license or service fees for access to your own data.