JSON parsing in Python and Rust

Posted by Alan Barr on Tue 11 October 2016

I have a file of Mixpanel data generated from a month of tracking information for various website events. The file is about 1.9GB and contains 1,780,851 lines of newline-delimited JSON strings. At first I was rewriting this file into one big array of JSON objects, but loading that into memory became a problem, even though the system handled it most of the time. I started out in Python, using the ijson library to stream the array instead of loading it whole. While that worked and was fairly fast, I went down a rabbit hole of trying to get YAJL (Yet Another JSON Library), which ijson can use as a faster C backend, to compile for Windows. Ultimately that was too much work to figure out, and I tried another path.
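For reference, a minimal sketch of that ijson approach, assuming the records had been rewritten into one big top-level array in a file I'll call dataArray.json (a placeholder name):

import ijson

# Stream records out of one large top-level JSON array without
# materializing the whole document in memory; ijson yields each
# element of the array under the "item" prefix.
with open("dataArray.json", "rb") as f:
    for record in ijson.items(f, "item"):
        print(record["event"])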

Newline-delimited JSON is a useful pattern because it is much faster to load one small JSON object at a time than to parse the whole array at once. To see whether I could find a speed increase, I used the time command in Linux/MSYS2 to measure how fast my programs ran with different libraries.
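To make the difference concrete, here is a minimal sketch of the two layouts using the standard json module (dataArray.json is again a placeholder):

import json

# One big array: json.load has to parse and materialize every record
# before the first one can be used.
# with open("dataArray.json", encoding="utf8") as f:
#     records = json.load(f)

# Newline-delimited: each line is a small, independent JSON document,
# so only one record is in play at a time.
with open("dataAll.json", encoding="utf8") as f:
    for line in f:
        record = json.loads(line)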

Mixpanel is an analytics service that lets you trigger an event and record information about the user who triggered it. Since I don't manage this service, one task I wanted to do was simply get the unique events: loop through the whole file, record each event, then take the unique results. Comparing the built-in json module with ujson, my runtimes went from 29 seconds down to 22 seconds.

import os
import ujson

def getEvents():
    """Collect every event name, then reduce to the unique set."""
    events = []
    if os.path.isfile("dataAll.json"):
        with open("dataAll.json", encoding="utf8") as file:
            for line in file:  # one JSON object per line
                data = ujson.loads(line)
                events.append(data['event'])
    return set(events)

def countEventOccurrences(events):
    """Tally how many times each known event appears in the file."""
    eventCount = dict.fromkeys(events, 0)
    if os.path.isfile("dataAll.json"):
        with open("dataAll.json", encoding="utf8") as file:
            for line in file:
                data = ujson.loads(line)
                eventCount[data['event']] += 1
    return eventCount


if __name__ == '__main__':
    events = getEvents()  # read the file once and reuse the result
    print(events)
    print(countEventOccurrences(events))
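As an aside, Python's collections.Counter could fold both passes into one; a minimal sketch under the same assumptions, where the keys give the unique events and the values give the counts:

from collections import Counter
import os
import ujson

def countEvents(path="dataAll.json"):
    # One pass over the file: Counter's keys are the unique events,
    # its values are the occurrence counts.
    counts = Counter()
    if os.path.isfile(path):
        with open(path, encoding="utf8") as file:
            for line in file:
                counts[ujson.loads(line)['event']] += 1
    return counts

counts = countEvents()
print(set(counts))    # unique events
print(dict(counts))   # occurrences per event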

I also attempted similar logic in Rust. It took me much longer to figure out how to do it, and once I did, I compared the debug build to Python and was a bit disappointed; I was hoping for some kind of gain. Little did I realize that when calling cargo build to compile a Rust program, you need to pass the --release flag to enable optimizations. The Rust code below processed my file in 18 seconds, a four-second improvement over ujson. The only issue with this code is that the events come out still JSON-encoded, wrapped in quotes, and I haven't figured out what I need to do in Rust to remove that. This took a while to figure out, but the increase in performance was surprising.

use std::io;
use std::io::BufReader;
use std::io::BufRead;
use std::fs::File;
extern crate serde_json;
use serde_json::Value;

fn main() {

    fn get_unique_events() -> io::Result<()> {
        use std::collections::BTreeSet;
        let mut events = BTreeSet::new();
        let f = try!(File::open("C:/Users/Alan.Barr/Source/dev/mixpanel/dataAll.json"));
        let f = BufReader::new(f);

        // lines() yields io::Result<String>, one JSON document per line
        for line in f.lines() {
            let s = line.unwrap();
            let data: Value = serde_json::from_str(&s).unwrap();
            let event = data.find("event").unwrap();
            // to_string() re-serializes the Value, which is why the
            // collected events keep their surrounding quotes
            events.insert(event.to_string());
        }

        println!("{:?}", events);

        Ok(())
    }

    // unwrap so an I/O failure is reported instead of silently ignored
    get_unique_events().unwrap();
}
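On the quoting issue: serde_json's Value has an accessor that borrows the underlying string rather than re-serializing it. If the serde_json version in use provides as_str() (older releases called it as_string()), it should return the bare string, so treat this as a hedged sketch rather than a confirmed fix:

extern crate serde_json;
use serde_json::Value;

fn main() {
    // "Page View" is a made-up event name for illustration
    let data: Value = serde_json::from_str(r#"{"event": "Page View"}"#).unwrap();
    let event = data.find("event").unwrap();
    // as_str() borrows the string inside the Value, so no quotes are added
    match event.as_str() {
        Some(name) => println!("{}", name), // prints: Page View
        None => println!("event was not a string"),
    }
}

For the timings above, note that the optimized binary comes from cargo build --release; a bare cargo build produces the slower debug build that made my first comparison look disappointing.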