After the launch of our flagship data product Activate, Jornaya ran into an interesting technology problem: file formatting. We had built a big data processing pipeline that let us query the billions of consumer behavior events we witness, filtered through the lens of a really powerful consumer graph, so our clients could get alerted when one of their customers started exhibiting shopping behaviors in various markets. We had the data, the scalable pipelines, the delivery mechanism, and a really compelling data science analysis showing we could reliably detect shopping behaviors long before anybody else could. The problem we were facing was an unexpected one: our customers didn't know how to give us their data in a way we could safely and anonymously join to our data to give them value.
The first few customers worked directly with our product team, and we worked with their IT departments to pull the necessary data, hash the PII, and format it in a way our pipeline could easily ingest. But hand-holding people through the file-building process is not a scalable solution.
The ingestion process didn't make things easier. Clients had to log into an SFTP server and upload their prospective file to a directory. After some period of time, a script would copy that file into our pipeline for validation; if the pipeline encountered any errors, it would spit a CSV of those errors back into the SFTP directory. This could happen close to an hour later, or even overnight, so clients would have to wait until the next day to try again.
This implementation and time-to-value problem came up as the theme of one of our company-wide hackathons last year. Being a company-wide hackathon, I ended up being the only engineer on a team of sales and support people, who had plenty of stories of customers struggling to get value out of the product because of the hard-to-meet (at least for marketing professionals) technical specification of the file they needed to send us. At the time, I felt the easiest way to fix this was to build a browser-based wizard that people could upload their file to, and that would walk them through the steps of mapping and hashing the file to meet the specified format.
So in the 6 or so hours we had for the hackathon, I banged out a prototype in Elm that could be embedded in our WordPress site. It took in a CSV file, parsed it, walked the user through the mapping process, and let them download the properly formatted file. The project ended up winning the company hackathon through a good mix of technology and a really flashy presentation (the benefit of having a team of mostly salespeople). During the presentation, we took a mock consumer file of 30,000 lines of un-hashed emails and phones and generated a compliant file in 5-6 seconds.
Due to resource issues and competing priorities, a hackathon win was all that came out of the project at the time. We ended up borrowing elements from it to enable a self-service demo of the product on limited data sets, but never readdressed the customer problem of file building.
Earlier this year, I was put in charge of a team focused on making our products easier to use and the value they provide easier to discover. The problem of file building was once again on the table, so I took some time to research our options and build out a few prototypes. What resulted was a trip down the rabbit hole of web technologies and performance tuning, and a lot of lessons learned. Here are my findings.
Before we get into implementations and performance implications, the problem I set out to solve was taking a CSV file of any format and reformatting it into a specified format. The Activate file format requires a record identifier and allows for up to 3 hashed email identifiers, up to 4 hashed phone identifiers, and up to 3 LeadiDs. Most of our clients had issues with cleaning emails and phones and hashing the data properly, so a strong nice-to-have was the ability to clean and hash unhashed data for the customer.
The original hackathon project was written in Elm and used papaparse for CSV processing through a port. The parsed file was passed into Elm, where the user could select headers in their file and map them to actions such as "Hash as Email", "Prehashed Phone", "Mark as RecordID", or "Ignore column". After the user mapped actions to columns, Elm would apply that list of actions, in order, to each row of the CSV and dump the data back out as a string that could be downloaded.
Overall the project was decent for a 6-hour hackathon, and most importantly it worked. On reinvestigating the problem, I borrowed a lot of the user flow and strategy from this original implementation. However, it had plenty of issues.
The first problem was speed. 30,000 rows in 6-ish seconds sounds fast and demos really well; however, effectively 5,000 rows per second is not a good speed when customers need to process files that could potentially be 30 million rows, and when our average large file is between 6 and 10 million. At 5,000 rows per second, a 10-million-line file would take more than half an hour to finish processing.
While that is already much faster than the hours or days a customer would have to wait for server-side validation, it is not exactly what we would call a good user experience.
The next problem we ran into was memory. Elm stores list structures as linked lists, and loading 10 million lines into Elm through a port was an absolute non-starter. The browser would crash long before the file was done parsing.
So while the original project proved out the idea and the user flow, its implementation was far from acceptable for a final solution.
The original project used papaparse to do the CSV parsing before loading it into Elm, mostly because of how easy the library makes parsing CSVs and because of its raw speed. The next attempt tried to capitalize on that raw speed.
As a side note, there are a lot of configuration and tuning options for papaparse, and in the course of testing this option I tried most of them. For larger files I absolutely needed the streaming option and web workers, even though they had a marginal impact on performance. That combination, while 2-5% slower, was the only one that allowed me to process files all the way up to 30 million lines. All other combinations of settings ran out of memory or failed to load the file at all.
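As a sketch, the winning combination looked roughly like this. The papaparse options shown are real library options; the handler names (`processChunk`, `finish`) are hypothetical placeholders for the downstream logic.

```javascript
// Config sketch: streaming (chunk callback) plus a web worker — the only
// combination that survived 30-million-line files in my testing.
// processChunk and finish are hypothetical downstream handlers.
Papa.parse(file, {
  header: true,
  skipEmptyLines: true,
  worker: true, // parse off the main thread
  chunk: function (results) {
    processChunk(results.data); // handle rows as they stream in
  },
  complete: function () {
    finish();
  },
});
```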
With papaparse doing all of the heavy lifting, performance ended up averaging 35,000 lines per second. At this speed, a 10-million-line file would take just under 5 minutes to load. At this point, being able to process a large file in a reasonable amount of time, I technically had enough to move forward with a solution; however, after all of this performance tuning and testing, I had to know if I could go faster.
The next iteration handed the heavy lifting off to WebAssembly: JavaScript passed the file Blob loaded by a FileReader object, along with the list of column actions, into a WASM module compiled from Rust, which handled the parsing, cleaning, and hashing.
I was really impressed by the speed and was excited to finally use a really cool technology that I had heard so much about but never seen demonstrated.
To solve the out-of-memory problems, and the results coming back in a value too large for a string, I took a look at chunking the file. Papaparse has a chunking mechanism that loads so many bytes at a time and parses each chunk. The main problem with this method is that it ends up splitting a CSV row into two half rows, both of which are invalid. Papaparse solves this by prepending the last unparsed row of one chunk to the next chunk, making the row whole again. Taking inspiration from this, I decided to take a slightly different route.
I stripped off the header row and set it aside. I would then seek some predetermined number of bytes into the file and search backwards for the first newline character. I would take the chunk from the start to that position as a new blob with the header prepended, then rinse and repeat until the entire file was consumed. This let me break one large CSV file into many smaller CSV files, all with valid rows and headers. I could then pass each of these to a WASM module running in a web worker, allowing them to be processed in parallel.
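The scheme above can be sketched as follows, working on an in-memory string for simplicity (the real version worked on byte offsets in a Blob):

```javascript
// Simplified sketch of the chunking scheme: set the header aside, cut
// the body at newline boundaries near a target chunk size, and prepend
// the header so every chunk is a valid standalone CSV.
function chunkCsv(text, targetSize) {
  const headerEnd = text.indexOf("\n") + 1;
  const header = text.slice(0, headerEnd); // header row, set aside
  const body = text.slice(headerEnd);

  const chunks = [];
  let start = 0;
  while (start < body.length) {
    let end = Math.min(start + targetSize, body.length);
    // Search backwards from the target position for a newline so that
    // no row is split in half.
    while (end < body.length && end > start && body[end - 1] !== "\n") {
      end--;
    }
    if (end === start) {
      // A single row longer than targetSize: extend to the next newline.
      const nl = body.indexOf("\n", start);
      end = nl === -1 ? body.length : nl + 1;
    }
    chunks.push(header + body.slice(start, end));
    start = end;
  }
  return chunks;
}
```

Each returned chunk can then be wrapped in a Blob and handed to a worker independently, since every one carries its own header and only whole rows.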
After some tuning, I found the balance between how long the chunking algorithm took to complete and how fast the results came back, without creating too many web workers. The optimal size ended up being around 1-megabyte chunks until that meant more than 80 chunks, at which point the file was divided into at most 80 chunks of equal size.
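That heuristic boils down to a few lines. The 1 MB and 80-chunk figures come from the tuning described above; treat them as tunable constants rather than universal values.

```javascript
// Chunk-size heuristic: ~1 MB chunks until that would mean more than
// 80 chunks, then at most 80 equal-size chunks.
const CHUNK_BYTES = 1024 * 1024; // ~1 MB per chunk
const MAX_CHUNKS = 80; // also bounds the size of the worker pool

function chunkSizeFor(fileBytes) {
  const chunkCount = Math.ceil(fileBytes / CHUNK_BYTES);
  return chunkCount <= MAX_CHUNKS
    ? CHUNK_BYTES
    : Math.ceil(fileBytes / MAX_CHUNKS);
}
```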
A parent worker did the chunking and coordinated all of the child workers, as well as reporting progress back to a nice little Elm-powered progress bar. On very large files the workers did take some time to boot and initialize the WASM module; however, by starting the parent worker as soon as the file was uploaded, everything was almost always loaded before the user was done mapping column actions.
The chunked processing allowed the system to handle files up to 30 million lines and ended up averaging 240,000 lines per second. The sample 10-million-line file completes processing in about 41 seconds, and most of the files our customers load are nowhere near that big and take less than a second.
I went back and tested this iteration with a pre-hashed file (one where the WASM module didn't have to clean and hash the values), and the processing speed for reformatting the file was just shy of 1 million lines per second.
Overall, the project was a ton of fun, and I am super glad I doubled down and tried for more speed after arriving at an acceptable solution. Closing our customers' feedback loop from hours or days down to seconds or minutes will vastly improve their ability to use our data products.
Relearning Rust and seeing the raw speed that comes out of WebAssembly also has me excited for a number of other projects where the lack of browser-side power has been an issue. I certainly look forward to playing more and seeing what capabilities can be unlocked by bringing a lot more processing power to the browser.