Interview with Marco Bonvini

Author: Daniel McQuillen
Published: January 25, 2016
Photo of Marco Bonvini, Data Scientist (used with permission by M. Bonvini)

This month’s interview is with Marco Bonvini, an accomplished data scientist working in the area of building energy analysis. As in all Tech+Earth interviews, we encouraged Marco to share a practical tip or code example drawn from his daily work. Read on to learn more about how Marco uses Python and Pandas in his efforts to make buildings more efficient. (Don’t forget to check out the sample code files on GitHub.)

How important is energy efficiency in buildings to sustainability and mitigating climate change?

I think that energy efficiency in buildings, cities, and transportation are areas where governments and researchers should focus for a more sustainable future. In the US and other developed countries, about 70% of overall electricity consumption is used by buildings. Unfortunately we don’t yet have access to energy generated by 100% renewables – a reasonable goal according to Elon Musk (https://www.teslamotors.com/powerwall). While we get there, every chunk of energy we save in our buildings reduces the consumption of fossil fuels and consequently reduces the CO2 that goes into the atmosphere.

It is obvious that in order to reduce our impact on the environment we need a shift. Improving how buildings work and enhancing the way we interact with them is a step in the right direction. It is the role of researchers, companies, and the government to help people transition in a seamless way from where we stand now to the future we all envision.

We also have to accept that not everyone cares about energy; in fact, few do. We must find innovative ways to “sell” energy efficiency as a byproduct of more appealing and concrete solutions that touch everyday life, such as comfort, health, productivity, and happiness. If we keep forgetting this, we’ll never convince people.

Where has most of your work been in this area?

In the last few years I spent a lot of time developing modeling and simulation tools for designing better control systems. Better control systems improve energy efficiency. I modeled a lot of different “objects” that have an impact on energy efficiency: appliances, buildings, Heating Ventilation and Air Conditioning (HVAC) systems, electrical networks, district cooling systems, and so on.

While at Lawrence Berkeley National Laboratory, I focused on different ways to leverage simulation models to improve the operation of buildings and district energy systems. The projects I worked on required a mix of model-based optimization, model predictive control, and fault detection and diagnostics. The ideas were all pretty simple: gain knowledge about the system via a model and leverage that knowledge to (1) control it better, and (2) detect if something affects its performance.

Recently I joined a startup in Oakland, Whisker Labs, where we focus on energy monitoring solutions. Here we’re trying to simplify people’s lives and as a byproduct improve their energy consumption. My job at Whisker Labs is related to signal processing and data science.

Which programming tools do you most use in your work? Which do you feel are crucial and how long did it take to get comfortable with them?

Most of the time I use Python. Being a data scientist I have to rapidly generate data sets from multiple data sources, and design algorithms that make sense of the data (e.g. capturing interesting patterns). I like Python’s flexibility and the incredible ecosystem around it. You have libraries and packages for basically anything. In particular I use Numpy, SciPy, Matplotlib, Pandas and scikit-learn.

I have also used related Python frameworks like Django and Celery to implement web apps with analytical capabilities, for example running a simulation of a dynamic system on the backend and processing the results quickly. I find Python simple and pretty easy to learn. As always, the libraries and frameworks are the difficult part to learn. However, all the libraries and frameworks I’ve just mentioned are very well supported and documented. I think one can safely call them a de facto standard.
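As a rough sketch of that pattern (the names here are hypothetical, and it assumes a Redis broker is available), a long-running simulation can be handed off to a Celery worker so the web request returns immediately:

from celery import Celery

# Hypothetical Celery app; assumes a Redis broker running locally
app = Celery('simulations', broker='redis://localhost:6379/0')

@app.task
def run_simulation(params):
    # ... run the dynamic-system simulation here and return a summary ...
    return {'status': 'done', 'params': params}

# From a Django view: enqueue the job and respond right away
# result = run_simulation.delay({'final_time': 3600})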

A technology I use (though not on a regular basis) and never expected to need a few years ago is D3.js. D3 opened new possibilities for visualizing data and playing with it in interactive ways. Another technology I recently used and never expected to is Objective-C, but that’s another story!

Here’s a non-exhaustive list of tools/languages I use for various tasks:

  • Python, for any type of data manipulation, data processing/cleaning, visualization, quick-and-dirty web servers, and web applications (Django and Flask).
  • Pandas, when working with time series and data sets up to a few GB.
  • D3, for fancy visualizations of the data.
  • Modelica, an object-oriented, modular modeling language for dynamic systems described by ordinary differential equations, differential algebraic equations, or even hybrid discrete/continuous-time systems. With Modelica you can model pretty much anything.
  • FMI, a standard to export simulation models and integrate them with other tools for co-simulation. With FMI, for example, you can export the Modelica model of a control system that was tested in simulation and deploy it on the hardware of a building automation system.

A few words on something that most people are not familiar with and that I am passionate about: Modelica. Modelica is an object-oriented modeling language that can be used to model and simulate complex physical systems where multiple physical domains interact. Examples of such systems are the so-called “cyber-physical systems.” It turns out that buildings and energy systems fit perfectly into this category because of the interaction of thermal dynamics, the electrical grid, control systems, etc. Imagine the typical house of the “not-so-far future” with controllable windows and/or shades, an AC system, a smart thermostat, an array of PV panels, and the batteries of your electric car.

With respect to this technology, I’d like to mention an open source project called JModelica. JModelica allows you to both simulate Modelica models and solve optimization problems based on them, all wrapped within a Python interface. Here are a couple of works where I used it to solve optimization problems for optimal control of a small district of buildings (http://www.iea-annex60.org/downloads/p2270.pdf), and why using Modelica is potentially better than other approaches (http://www.sciencedirect.com/science/article/pii/S0378778815303315).
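To give a flavor of the workflow, here is a minimal sketch that assumes a standard JModelica installation (with its pymodelica and pyfmi packages); the model name, file, and result variable are hypothetical:

from pymodelica import compile_fmu
from pyfmi import load_fmu

# Compile a (hypothetical) Modelica model into an FMU
fmu_path = compile_fmu('MyPackage.SmallOffice', 'small_office.mo')

# Load the FMU and simulate one day of operation
model = load_fmu(fmu_path)
res = model.simulate(start_time=0.0, final_time=24*3600.0)

# Results are accessed by variable name, e.g. a (hypothetical) zone temperature
print(res['zone.T'])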

Can you show an example use of Pandas?

I’d like to show something that can help in many situations: making sense of a dataset with multiple time series of the same type using a data structure provided by Pandas, the Panel. Before talking about the Panel, however, I’d like to take a few steps back and briefly describe the other common data structures we find in Pandas and that are leveraged by the Panel.

First things first: because we deal with time series, we have the Series. A Series is an object that contains an array of values with an index associated with them. In the case of a time series the index contains date and time values, but it could be anything.
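For example, a minimal time-indexed Series looks like this (the values are made up just for illustration):

import pandas as pd

# Three power readings, two minutes apart
index = pd.date_range('2016-01-09 00:00:00', periods=3, freq='2min')
power = pd.Series([92.2, 95.1, 90.8], index=index, name='Power')
print(power)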

Second, we have the DataFrame. If you think of the Series as a column, the DataFrame is a table. The DataFrame is an object that groups multiple Series that share the same index. It is one of the most used data structures in Pandas. DataFrames let you work as if you had a very fast in-memory database that can process data sets up to a few GB.
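Continuing the toy example above, a DataFrame groups several Series that share the time index:

# Voltage, current, and power on the same time index
df = pd.DataFrame({'Vrms': [122.9, 122.7, 123.1],
                   'Irms': [0.75, 0.77, 0.74],
                   'Power': [92.2, 94.5, 91.1]},
                  index=index)
print(df['Power'])  # each column is a Series
print(df.mean())    # statistics over all columns at once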

Now we’re ready for the Panel. The Panel is the natural extension of the DataFrame and can be seen as a 3D table, or a collection of multiple DataFrames. I’d like to show how to use a Panel in order to quickly visualize data and explore a data set. Given my recent involvement with energy metering solutions, I’ll take an example from the electrical domain.
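Before we touch the real data, here is a toy Panel filled with random numbers, just to show the three axes (items, major_axis, and minor_axis):

import numpy as np

# Two houses x three timestamps x three variables
toy_pnl = pd.Panel(np.random.randn(2, 3, 3),
                   items=['house1', 'house2'],
                   major_axis=index,
                   minor_axis=['Vrms', 'Irms', 'Power'])
print(toy_pnl['house1'])  # each item is a DataFrame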

Imagine we’re monitoring five different houses, and for every house we have a data set that contains voltage, current, and power with a resolution of two minutes. We’re given five different CSV files containing the data (see attached). Every file looks like this

date,Vrms,Irms,Power
Sat Jan 09 2016 00:00:00 GMT-0800 (PST),122.89474233,0.751786349902,92.2478478392
Sat Jan 09 2016 00:02:00 GMT-0800 (PST),122.89474233,0.751786349902,92.2478478392
Sat Jan 09 2016 00:04:00 GMT-0800 (PST),122.89474233,0.751786349902,92.2478478392
Sat Jan 09 2016 00:06:00 GMT-0800 (PST),122.89474233,0.751786349902,92.2478478392
...

and is named like house<#house>.csv.

This is a quite common situation: we have N data sets containing time series data with a homogeneous structure. If you think about it, we have to deal with three dimensions, and the Panel seems just the right data structure for this kind of job. The dimensions we’ll consider are

  1. the different houses (items in Pandas-Panel-lingo)
  2. time index (major-axis in Pandas-Panel-lingo)
  3. the measured values: voltage, current and power (the minor-axis in Pandas-Panel-lingo)

Organizing the data in a Panel makes it easy to look at different variables for individual houses or compare the same variables across different houses. The most noticeable thing is that we’ll write just a few lines of code to do this. This is one of the advantages of using Pandas!

A few “Pythonic” notes before diving into the script. I’ll make use of two concepts that come in handy: lambda functions and the map operator.

Let’s start with an example of a lambda function:

f = lambda x: x+1

f is a lambda function that is equivalent to

def f(x):
    return x+1

So lambda functions are just functions; they’re simply more convenient to declare without writing too many lines of code, and they can be easily passed as parameters to other functions.

Now let’s see the map operator. The map operator maps the elements of an iterable object (e.g., an array or a list) to another. Imagine you have a list that contains power measurements in W and you want to convert them to kW; you can do it with the map operator:

values_W = [1000.0, 1200.0, 3050.0]
values_kW = map(lambda x: x/1000.0, values_W)

The result is

values_kW = [1.0, 1.2, 3.05]

Basically, every value of the original list has been mapped via the lambda function to a new value (watts to kilowatts).

Given these two concepts, let’s dive in…

from matplotlib import pyplot as plt
from datetime import datetime
import pytz
import os
import pandas as pd
import numpy as np
 
# Get the name of the directory containing the files
dir_path = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'raw_data')
 
# Create a list of file names (using their absolute paths)
file_names = map(lambda x: os.path.join(dir_path, x), os.listdir(dir_path))
 
# Load each file as a pandas DataFrame and put them in a list.
# Use the first column as the index, and convert it from string to a datetime object
dfs = map(lambda x: pd.read_csv(x, index_col=0, parse_dates=True), file_names)
 
# We could start working with the data in separate DataFrames, but that would be
# less convenient and less flexible: we'd end up writing a bunch of for loops to
# iterate over the DataFrames.
# Let's put them into a Panel, i.e. a 3D DataFrame
 
# Retrieve the names of the houses from the file names
# './raw_data/house1.csv' becomes 'house1' with the following steps:
# 1) './raw_data/house1.csv' is split into ('./raw_data', 'house1.csv')
# 2) take the last element, that is 'house1.csv'
# 3) split the file name 'house1.csv' on '.csv' to get the list ['house1', '']
# 4) take the first element, that is 'house1'
#
# The names will become the way we'll reference the different houses in the Panel,
# making it more convenient than using integer indexes.
house_names = map(lambda x: os.path.split(x)[-1].split('.csv')[0], file_names)
 
# Create a Panel that aggregates all the data
# - first dimension are the house names,
# - second dimension (major axis) is the time index, and
# - third dimension are the variables P, Vrms, and Irms
pnl = pd.Panel(map(lambda x: x.values, dfs),
               items=house_names,
               major_axis=dfs[0].index.tolist(),
               minor_axis=dfs[0].columns.tolist())
 
# Let's start with something simple: the mean, max, and min values for all the variables
# across the different houses
print pnl.mean()
print pnl.max()
print pnl.min()
 
# Let's try to plot the Power consumption across all the houses
pnl.ix[:,:,'Power'].plot()
plt.legend(loc='upper center', ncol=5)
plt.ylabel('Power [W]')
plt.xlabel('Time')
plt.show()

The plot shows the power used by the houses in the time period being analyzed.

# Compute the energy in kWh.
# The energy is the sum of the energy used in each time period, which in our case
# is 2 minutes long and represented by the variable `dt`. The energy, computed in
# Joules, is then converted into kWh.
# We do not use any fancy methods to compute the integral, just a simple cumulative sum.
dt = 60*2
J_2_kWh = 1.0/3600.0/1000.0
pnl.ix[:,:,'Energy'] = (pnl.ix[:,:,'Power']*dt).apply(np.cumsum)*J_2_kWh
# Plot the energy consumption
pnl.ix[:,:,'Energy'].plot()
plt.legend(loc='upper left', ncol=5)
plt.ylabel('Energy [kWh]')
plt.xlabel('Time')
plt.show()

The plot shows the energy used by the houses in the time period being analyzed.

# Let's try to compute the average power consumption per house for hourly intervals
agg_hourly_power = pnl.ix[:,:,'Power'].groupby(by=[pnl.ix[:,:,'Power'].index.map(lambda x : x.hour)])
 
# Plot it with multiple bars
agg_hourly_power.mean().plot(kind='bar')
plt.legend(loc='upper center', ncol=5)
plt.ylabel('Power [W]')
plt.xlabel('Hour of day [hh]')
plt.show()

The plot shows how the different houses consume power at different hours of the day. Such a plot can be used to detect trends and/or recurring patterns of hourly power consumption.

# Compute the load duration curve.
# The load duration curve shows which percentage of a load falls into a certain
# range. This curve is useful to identify ranges where the operation should be optimized 
# in order to deliver energy savings.
bin_size = 200
loads_aggregated = map(
    lambda n: pnl.ix[n, :, 'Power'].groupby(
        by=[pnl.ix[n, :, 'Power'].apply(lambda x: bin_size*(x//bin_size))]
    ).count(),
    pnl.items.tolist())
load_duration = pd.concat(loads_aggregated, axis=1)
load_duration.columns = pnl.items.tolist()
load_duration.fillna(0, inplace=True)
(load_duration*100.0/load_duration.sum()).plot(kind='bar')
plt.legend(loc='upper right', ncol=5)
plt.xlabel('Power [W]')
plt.ylabel('Percentage of time [%]')
plt.show()

This plot, also called a load duration curve, shows how the power consumption is distributed across different operating conditions. For example, it is evident that most of the time the houses consume a small amount of power compared to their peak consumption. Such a plot can be used to spot areas (or operational regimes) where it makes sense to focus in order to improve the overall energy consumption.

Interested readers can find the code and the data here on GitHub.

Do you recommend any free online courses for learning Python or Pandas?

I’m sure there are plenty of courses and tutorials around, but I’ve never followed them. What I’d suggest instead is to enter some competitions on Kaggle or similar platforms. They have all kinds of competitions, from purely educational ones to ones for serious data scientists, where the winner takes home a considerable amount of money. In the end, what really matters is to use the tools and get comfortable with them.

What kind of background work does somebody need to do if they want to work in the technical side of energy efficiency in buildings?

I think anyone with a STEM degree, a good understanding of the basics of physics (heat and mass transfer, thermodynamics), and basic coding abilities, or at least the ability to find algorithmic solutions to problems, can contribute to the field. Of course, if you really want to make substantial contributions and would like to become a “rockstar,” you’ll need a good understanding of the science/math behind the algorithms as well as the necessary knowledge to implement them.

Here’s a list of interesting technologies and/or tools that I think are worth looking at, depending on your focus.

Building energy modelers:

I think you should seriously look at OpenStudio because it will gain even more popularity and will eventually be the interface for working with EnergyPlus. It’s pretty powerful, has a nice-looking UI, many tools built around it, and an API accessible from a scripting environment. Here’s an example of what I was able to generate using the OpenStudio Ruby API. (See attachment report_small_office_new.zip)

At-large energy modelers (from buildings to cities):

You should definitely look at Modelica, no question! For more info, see the works and reports made available by this project funded by the IEA: www.iea-annex60.org.

Everyone:

I think Python will continue to shine, at least for rapid prototyping and pre/post-processing of data. Julia might take some of Python’s user base, but it has a long way to go before it can compete with all the packages currently available in Python, such as Numpy, SciPy, Pandas, and scikit-learn (not to mention all the others…).

Data Scientists:

People who work with a lot of data and are interested in machine learning should definitely look at Spark, and at Scala with its ML library. And how could I not mention TensorFlow?

Interested in development/deployment:

  • Get familiar with AWS or any other infrastructure service like it.
  • Get familiar with technologies such as Docker/Ansible/Vagrant (you might not need to go all the way to Chef, Puppet, ZooKeeper, etc.)
  • Linux, learn how to use it, it’s never getting old!

And that’s it for our interview with Marco! You can reach him through his website or on LinkedIn.

Resources

Here’s a quick list of links for projects mentioned above…