Histograms and High-Level Languages at StrangeLoop

This year’s StrangeLoop conference is less than a week away and I’m psyched. This meeting with an odd name lies at the intersection of an odd blend of topics, including distributed systems, languages, and data science. It would be a natural place for me to talk about PFA, which covers all three, but instead I decided to talk about something new: a language of histogram aggregation called Histo·grammar.

Histo·grammar arose from trying to fit together two conflicting philosophies of how to aggregate data. Histograms are the bread and butter of my first field of study, high energy physics, and high energy physics software views histograms as objects to be filled, like lists in LISP or dictionaries in Python. Non-physics analysis software views histograms as the statistical abstractions they technically are: approximations of a dataset’s distribution. Physics code scales to arbitrarily large datasets because histograms can keep accumulating data in place, but it is cumbersome to use in a functional paradigm like Apache Spark. Non-physics histogram APIs are restrictive in how they let you add or access the aggregated data. The key to getting the best of both is to keep the idea of a histogram as a container, but make it a functional container that knows how to fill itself.
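To make "a functional container that knows how to fill itself" concrete, here is a minimal Python sketch. It is not the Histo·grammar API: the `Hist` class, its fields, and the `pt` record key are all hypothetical names chosen for illustration. The fill rule travels with the container, filling returns a new value instead of mutating, and partial results merge with `+`, which is roughly the shape a Spark-style aggregation expects.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, Tuple


@dataclass(frozen=True)
class Hist:
    """Immutable fixed-width histogram that carries its own fill rule (hypothetical)."""
    num_bins: int
    low: float
    high: float
    quantity: Callable[[dict], float]  # how to pull the binned value out of a record
    counts: Tuple[int, ...] = ()

    def __post_init__(self):
        # Start with empty bins unless counts were passed in explicitly.
        if not self.counts:
            object.__setattr__(self, "counts", (0,) * self.num_bins)

    def fill(self, datum: dict) -> "Hist":
        """Return a new histogram with one more record; self is unchanged."""
        x = self.quantity(datum)
        if not (self.low <= x < self.high):
            return self  # this sketch simply drops out-of-range values
        width = (self.high - self.low) / self.num_bins
        i = min(int((x - self.low) / width), self.num_bins - 1)
        counts = self.counts[:i] + (self.counts[i] + 1,) + self.counts[i + 1:]
        return replace(self, counts=counts)

    def __add__(self, other: "Hist") -> "Hist":
        """Merge two partial histograms bin by bin."""
        return replace(self, counts=tuple(a + b for a, b in zip(self.counts, other.counts)))


records = [{"pt": 1.5}, {"pt": 2.5}, {"pt": 2.7}, {"pt": 99.0}]
empty = Hist(4, 0.0, 4.0, lambda rec: rec["pt"])

# Fill everything in one pass...
serial = reduce(Hist.fill, records, empty)

# ...or fill two partitions independently and merge the partial results.
part1 = reduce(Hist.fill, records[:2], empty)
part2 = reduce(Hist.fill, records[2:], empty)
merged = part1 + part2

print(serial.counts, merged.counts)
```

Both routes give the same bin counts, which is the property that lets the filling be spread across partitions and combined afterward.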

To non-physicists, my focus on histograms might seem narrow: after all, isn’t a histogram just one type of chart? According to the statistician’s definition, yes, but the ways physicists have used (abused?) histogram-filling software over the past forty years have led to much, much more. Histo·grammar makes this generality explicit by splitting the histogram into its constituent atoms: composable primitives of data aggregation that can be used to build a statistician’s histogram and many other aggregate structures.
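As a rough illustration of what "composable primitives" could look like, the sketch below builds two different aggregates from the same three pieces. The class names `Count`, `Average`, and `Bin`, their constructors, and the `x`/`y` record keys are assumptions made for this example, not the library's documented interface. For brevity these primitives mutate in place.

```python
class Count:
    """Primitive: counts the records it sees."""
    def __init__(self):
        self.entries = 0

    def fill(self, datum):
        self.entries += 1


class Average:
    """Primitive: running mean of some quantity of each record."""
    def __init__(self, quantity):
        self.quantity = quantity
        self.entries = 0
        self.total = 0.0

    def fill(self, datum):
        self.entries += 1
        self.total += self.quantity(datum)

    @property
    def mean(self):
        return self.total / self.entries if self.entries else float("nan")


class Bin:
    """Primitive: routes each record to one of several sub-aggregators."""
    def __init__(self, num, low, high, quantity, make_value):
        self.num, self.low, self.high = num, low, high
        self.quantity = quantity
        self.values = [make_value() for _ in range(num)]  # one sub-aggregator per bin

    def fill(self, datum):
        x = self.quantity(datum)
        if self.low <= x < self.high:
            width = (self.high - self.low) / self.num
            i = min(int((x - self.low) / width), self.num - 1)
            self.values[i].fill(datum)


events = [{"x": 0.5, "y": 10.0}, {"x": 0.7, "y": 20.0}, {"x": 1.5, "y": 4.0}]

# A statistician's histogram: bins that each hold a count.
histogram = Bin(2, 0.0, 2.0, lambda e: e["x"], Count)

# Same binning, different atom: the average of y in each bin of x.
profile = Bin(2, 0.0, 2.0, lambda e: e["x"], lambda: Average(lambda e: e["y"]))

for event in events:
    histogram.fill(event)
    profile.fill(event)

print([v.entries for v in histogram.values])
print([v.mean for v in profile.values])
```

Swapping the sub-aggregator changes what each bin holds without touching the binning logic, and because `Bin` only requires a `fill` method of its contents, a `Bin` of `Bin`s would give a two-dimensional structure in the same way.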

As datasets get larger in all fields, having a way to summarize them with complex aggregations will be increasingly important. I’ll show how the same declarative language can slice and dice data in HDFS, can be JIT-compiled for blazing speed, and can even be parallelized across vector devices like GPUs.

Around the time I was developing PFA, someone asked me if it was a big transition from particle physics to data science. I said no, because particle physics is the most industrial field in academia and data science is the most academic field in industry. Conferences like StrangeLoop prove this point, in that philosophical musings on some esoteric language can be followed by the next big software stack. If you’ll be there, I’m the guy with the long, scraggly beard (non-unique identifier?) and would love to hear your latest great idea.

A link to an overview of my talk can be found here.

Written by Jim Pivarski
