CancerBase – first lessons learned

CancerBase.org has been up and running for a few months now. About 1000 people have signed up, and most of them suggested at least one new feature. We are preparing to launch CancerBase 2.0 in April, based on everything we have learned so far. Here is one complete surprise – many people complained (nicely) that we (deliberately) downsampled their physical location on the map. We did this to ensure anonymity – BUT – it’s become clear that some patients have a completely different perspective. Their point is that they are a real person with a real name and a real location, and they want everyone to know about them and their disease, and they want their dot on the CancerBase map to be right on their house. So we quickly added a checkbox to CancerBase so that people could tell us their mapping/geolocation preference.

It’s been amazing to work with everyone on CancerBase and we are growing quickly. Stay tuned for CancerBase 2.0 this April and May.

Fast Global Sharing of Medical Data?

When I started out in cancer biology, I was surprised by the difficulty of accessing medical data for science. This was puzzling to me because all the cancer patients I met were very open and extremely helpful. When I spoke to patients about this problem, they were surprised, too – many of them assumed that the data they shared with medical centers were broadly shared and accessible to the global research community.

In one study I was involved in, it took several years to work through the legal paperwork to access stored medical images, and even then, the images were subject to myriad constraints. If people can go to the moon, and 2.08 billion people on earth are active smartphone users, why are medical data frequently still stuck in, figuratively speaking, local libraries with only a limited selection of books?

The strange thing of course is that, fundamentally, medical data belong to the patient, and therefore, if a patient wants to share his or her information, they should find it easy to do so. The most telling conversation for me was with a father of two kids. When asked about data sharing, he said he could not care less about who saw his medical records; rather, it was much more important to him that as many scientists as possible had access to his data, so that his data would make the largest difference and hopefully reduce the chance of his kids having brain cancer, like he did, at some point in their lives. That made a lot of sense to me.

I still do not fully understand all the barriers to efficient data sharing in cancer biology, but I’m curious about standard web technologies that can help patients share what they want, when they want, and to whom they want. If a patient can share a movie, a picture, or a book within several seconds around the world, why is it sometimes still difficult for them to share their medical information?

For a while I thought the major problems had to do with the rules and regulations surrounding medical data, but that turns out not to be the case. The simplest way to start thinking about crowd-sharing of medical information is that millions of people around the world already crowd-share medical information. For example, women with breast cancer sometimes wear pink t-shirts to raise awareness, and they then circulate these pictures on social networks. That’s an example of someone sharing medical information – namely, their cancer diagnosis – in the form of a picture.

A few months ago, I started to look into web technologies that could potentially be used to help people share some of their medically-relevant information within 1 second. I chose the 1 second standard arbitrarily – it seemed like a reasonable number. Much below one second you run into various technical problems, but if you are willing to wait a few hundred milliseconds, the technologies are all already there: inexpensive, massively scalable, and globally deployed.

What if each cancer patient on earth had the ability to broadcast key pieces of information about their cancers around the world, in one second? 

If you are curious, here is the White House fact sheet announcing CancerBase, and here is a little bit more information about how we started out. The actual site is at CancerBase.org. It’s an experiment run by volunteers, many of whom are cancer patients, so bear with us, and if you can, help out!

Prediction of Overall Mortality from Fitbit heart rate data

From what I can tell, the Fitbit API returns heart rate data at an effective temporal resolution of 9.98 seconds (min: 5 s, median: 10 s, max: 15 s). Curiously, you are more likely to get either a 5 or 15 s interval than a 10 s interval. Using Mathematica, as before, we can plot the distribution of times between samples returned by the Fitbit API:


[Figure fitbitHR: distribution of times between heart rate samples returned by the Fitbit API]
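
If you don’t have Mathematica handy, here is a rough Python sketch of the same calculation; it assumes the dataset is the list of {"time": "HH:MM:SS", "value": bpm} dicts that the intraday endpoint returns (see the posts below for the client setup):

    # Distribution of time intervals between consecutive heart rate samples.
    from datetime import datetime
    from collections import Counter

    def interval_histogram(dataset):
        """dataset: response["activities-heart-intraday"]["dataset"]."""
        times = [datetime.strptime(d["time"], "%H:%M:%S") for d in dataset]
        gaps = [(t2 - t1).total_seconds() for t1, t2 in zip(times, times[1:])]
        return Counter(gaps)  # maps gap length in seconds to how often it occurs

    # counts = interval_histogram(hr["activities-heart-intraday"]["dataset"])
    # print(sorted(counts.items()))
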
That is still (although just barely) usable for measuring heart rate recovery, the change in your heart rate some time t after you stop your exercise. For most things you can measure on a wearable, any one datapoint is next to useless; the key is to look at first and second derivatives, such as gradual trends in how your heart rate drops following a few minutes on the treadmill. The key medical study is probably the October 1999 article in NEJM, Heart-rate recovery immediately after exercise as a predictor of mortality. The conclusion of that paper is that “A delayed decrease in the heart rate during the first minute after graded exercise, which may be a reflection of decreased vagal activity, is a powerful predictor of overall mortality”. Their standard for a ‘delayed’ decrease was a drop of ≤ 12 beats per minute from the heart rate at peak exercise, measured 1 minute after cessation of exercise. Since Fitbit is probably not in the “mortality prediction” market, ~10 s temporal resolution is fine; for medical researchers, however, it would be nice to have slightly higher temporal resolution data.
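
As a concrete (and simplified) illustration of that criterion, here is a minimal Python sketch; the 12 bpm cutoff is the one from the NEJM paper, while the helper names and the nearest-sample matching are my own simplifications:

    from datetime import datetime, timedelta

    DELAYED_HRR_CUTOFF = 12  # bpm; the NEJM 1999 threshold for a 'delayed' decrease

    def heart_rate_recovery(dataset, exercise_end, window=timedelta(seconds=10)):
        """Drop in heart rate from the end of exercise to one minute later.

        dataset: list of {"time": "HH:MM:SS", "value": bpm} dicts from the intraday API.
        exercise_end: "HH:MM:SS" marking the end of (peak) exercise.
        Samples are matched to within +/- window, since the API only returns
        a point every 5 to 15 seconds.
        """
        samples = [(datetime.strptime(d["time"], "%H:%M:%S"), d["value"]) for d in dataset]
        t_end = datetime.strptime(exercise_end, "%H:%M:%S")

        def nearest(t):
            t_sample, value = min(samples, key=lambda s: abs(s[0] - t))
            if abs(t_sample - t) > window:
                raise ValueError("no heart rate sample near %s" % t.time())
            return value

        drop = nearest(t_end) - nearest(t_end + timedelta(minutes=1))
        return drop, drop <= DELAYED_HRR_CUTOFF  # (bpm drop, 'delayed recovery' flag)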

Fitbit API and High Resolution Heart Rate Data

After trying the Jawbone UP3 for a few days and quickly returning it due to multiple limitations, I’m now testing a Fitbit Charge HR. I’m mostly interested in heart rate data, so I had to update my code to OAuth 2.0. Fortunately orcasgit/python-fitbit is completely on top of things and their new gather_keys_oauth2.py works perfectly. All I had to do was to set my callback URL in the ‘manage my apps’ tab at dev.fitbit.com to http://127.0.0.1:8080/, and then gather_keys_oauth2.py returned my OAuth 2.0 access and refresh tokens. I dropped those into a text file (‘config.ini’) and used them to set up my client:
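
The setup looks roughly like this; the [fitbit] section and key names in config.ini are my own choices, and the exact Fitbit() keyword arguments may differ a bit between python-fitbit versions:

    import configparser
    import fitbit

    config = configparser.ConfigParser()
    config.read("config.ini")
    keys = config["fitbit"]  # client_id, client_secret, access_token, refresh_token

    client = fitbit.Fitbit(
        keys["client_id"],
        keys["client_secret"],
        access_token=keys["access_token"],
        refresh_token=keys["refresh_token"],
    )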

There was a small issue with orcasgit/python-fitbit, which had to do with the new ‘1sec’ detail level for the heart rate data, but I made that change and the merge is pending on GitHub. Now, the data are flowing.
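
The call that pulls them looks roughly like this (client is the object set up above, the resource and detail_level strings are the documented ones, and the date is just an example):

    # One day of intraday heart rate data at the '1sec' detail level.
    hr = client.intraday_time_series(
        "activities/heart",
        base_date="2016-01-15",  # example date
        detail_level="1sec",
    )

    # Each sample is a {"time": "HH:MM:SS", "value": bpm} dict.
    dataset = hr["activities-heart-intraday"]["dataset"]
    print(dataset[:3])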

I was expecting 1 sec resolution data (based on the ‘1sec’ parameter), but the timestamps are actually spaced 5 to 15 seconds apart. It would not surprise me if ‘1sec’ is a request rather than a guaranteed minimal temporal resolution; perhaps the device does some kind of (sensible) compression and concatenates runs of identical rates, e.g. if your heart rate is precisely 59 bpm for a while, it is probably silly to continuously report a sequence of {“value”: 59}, over and over again. If this is true, are we (basically) dealing with a lossless run length encoded (RLE) data stream? Any ideas? It’s not a simple RLE, as this data 4-tuple demonstrates: {“value”: 60, “time”: “00:05:00”}, {“value”: 60, “time”: “00:05:15”}, {“value”: 60, “time”: “00:05:30”}, {“value”: 62, “time”: “00:05:40”}. If it were a simple RLE-ish encoding, then this sequence would be {“value”: 60, “time”: “00:05:00”}, {“value”: 62, “time”: “00:05:40”}, with the recipient code then assuming 40 seconds of 60 +/- 0.5 bpm or something similar. My guess right now is RLE modified to provide at least one datapoint every 15 seconds, and updating more quickly when something is changing, yielding the observed {“value”: 60, “time”: “00:05:00”}, {“value”: 60, “time”: “00:05:15”}, {“value”: 60, “time”: “00:05:30”}, {“value”: 62, “time”: “00:05:40”}.
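
To make that guess concrete, here is what the decoding side would look like if it were right; this is purely an illustration of my hypothesis, not anything Fitbit has documented:

    from datetime import datetime, timedelta

    def expand_sparse_hr(dataset):
        """Expand the sparse stream to 1 Hz under the modified-RLE guess:
        hold the last reported value until the next sample arrives."""
        parsed = [(datetime.strptime(d["time"], "%H:%M:%S"), d["value"]) for d in dataset]
        dense = []
        for (t0, v0), (t1, _) in zip(parsed, parsed[1:]):
            seconds = int((t1 - t0).total_seconds())
            dense.extend((t0 + timedelta(seconds=s), v0) for s in range(seconds))
        dense.append(parsed[-1])
        return dense

    sparse = [
        {"value": 60, "time": "00:05:00"},
        {"value": 60, "time": "00:05:15"},
        {"value": 60, "time": "00:05:30"},
        {"value": 62, "time": "00:05:40"},
    ]
    print(len(expand_sparse_hr(sparse)))  # 41 one-second samples, 00:05:00 through 00:05:40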

Directional Quantile Envelopes – making sense of 2D and 3D point clouds

Imagine some large multidimensional dataset; one of the things you might wish to do is to find outliers, and more generally, say something statistically-defined about the structure of clusters of points within that space. One of my favorite techniques for doing that is to use directional quantile envelopes, developed and implemented by Anton Antonov and described here and here. In those posts, Antonov considers a set of uniformly distributed directions and constructs the lines (or planes) that separate the points into quantiles; if you consider enough directions, and do this a few times, you are left with lines (or planes) that define a curve (or surface) that envelops some quantile q of your data. The figures show a cloud of points with some interesting structure and the surface for q = 0.7, with and without the data.
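
If you want to play with the basic 2D construction without Mathematica, here is a minimal numpy sketch; this is my stripped-down reading of the idea, not Antonov’s implementation:

    import numpy as np

    def directional_quantile_envelope(points, q=0.7, n_directions=60):
        """Half-plane description of the directional quantile envelope.

        For each direction d_k, take the q-quantile t_k of the projections
        points . d_k; the envelope is {x : x . d_k <= t_k for all k}.
        """
        angles = np.linspace(0.0, 2 * np.pi, n_directions, endpoint=False)
        directions = np.column_stack([np.cos(angles), np.sin(angles)])  # (K, 2)
        projections = points @ directions.T                             # (N, K)
        thresholds = np.quantile(projections, q, axis=0)                # (K,)
        return directions, thresholds

    def inside_envelope(points, directions, thresholds):
        """Boolean mask: True where a point satisfies every directional constraint."""
        return np.all(points @ directions.T <= thresholds, axis=1)

    # Flag outliers that fall outside the q = 0.7 envelope of a 2D point cloud.
    pts = np.random.randn(2000, 2)
    dirs, ts = directional_quantile_envelope(pts, q=0.7)
    outliers = pts[~inside_envelope(pts, dirs, ts)]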

Beyond general data analytics, the directional quantile envelope approach has at least one more application, which is in image processing and segmentation. Imagine taking a picture of a locally smooth blob-like object in the presence of various (complicated) artifacts and noise. You could throw the usual approaches at this problem (gradient filter, distance transform, morphological operations, watershed, …), but in many of those approaches you end up having to empirically play with dozens of parameters until things “look nice”, which is unsettling. What you would really like to do is to detect/localize/reconstruct the emitting object in a statistically-defined, principled manner, and this is what Antonov’s Directional Quantile Envelopes allow you to do.

[Figure segmentation_7: cell nucleus imaged with a confocal microscope]

A quantile envelope is well defined and you can compactly communicate what you did to the raw imaging data to get some final picture of a cell or organoid, rather than reporting an inscrutable succession of filters, convolutions, and adaptive nonlinear thresholding steps. The figure shows a cell nucleus imaged with a confocal microscope; in reality, the cell nucleus is quite smooth, but various imaging artifacts result in the appearance of “ears”, which can be detected as outliers via directional quantile envelopes.
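
Applied to an image, the same machinery is only a few lines. The sketch below reuses the two helpers from the snippet above; the intensity threshold is deliberately crude and only there to pick out the bright pixels:

    import numpy as np
    # reuses directional_quantile_envelope / inside_envelope from the sketch above

    def envelope_mask(image, q=0.95, n_directions=90, threshold=None):
        """Binary mask of the q-quantile envelope of an image's bright pixels."""
        if threshold is None:
            threshold = image.mean() + image.std()  # crude, illustrative cutoff
        coords = np.argwhere(image > threshold).astype(float)  # (N, 2) row/col pairs
        dirs, ts = directional_quantile_envelope(coords, q=q, n_directions=n_directions)
        rows, cols = np.indices(image.shape)
        pixels = np.column_stack([rows.ravel(), cols.ravel()]).astype(float)
        return inside_envelope(pixels, dirs, ts).reshape(image.shape)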

The Fitbit API – Mathematica vs. Python

I’m teaching a class later this year and part of what we will cover is how to explore data coming from wearables. I’ve had a Fitbit Zip for a while, and my plan is to collect data over the summer and then to use those data for class and for problem sets, so the students have real data to look at.

I thought I would use the Mathematica Connector (via its ServiceConnect[“Fitbit”] call) to get the data from the Fitbit API, but I quickly ran into various problems. The ServiceConnect functionality at present seems somewhat rudimentary. After spending a few hours on the internals of Mathematica’s OAuth.m and trying to get valid tokens from the undocumented HTTPClient`OAuthAuthentication call (can anyone tell me how to pass nontrivial OAuth 2.0 scopes into this function?), I gave up and just used the Python Fitbit client API, which all worked right away, since, among other reasons, there is actually documentation. I followed the instructions at first-steps-into-the-quantified-self-getting-to-know-the-fitbit-api. Once you have the 4 keys you need, just place them in a config.ini file and use something like this:
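
For example (the [fitbit] section and key names in config.ini are my own choices, and the keyword argument names are those of the OAuth 1.0 python-fitbit that was current at the time, so check them against your version):

    import configparser
    import json
    import fitbit

    config = configparser.ConfigParser()
    config.read("config.ini")
    keys = config["fitbit"]  # the 4 keys from dev.fitbit.com

    client = fitbit.Fitbit(
        keys["consumer_key"],
        keys["consumer_secret"],
        resource_owner_key=keys["user_key"],
        resource_owner_secret=keys["user_secret"],
    )

    # Intraday steps for one (example) day, in 15 minute bins.
    steps = client.intraday_time_series("activities/steps",
                                        base_date="2015-06-01",
                                        detail_level="15min")

    # Save the parsed JSON response so it can be pulled into Mathematica.
    with open("steps.json", "w") as f:  # filename is arbitrary
        json.dump(steps, f)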

This gives you a JSON dump, which can then be manipulated in Mathematica.
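
If you would rather stay in Python, the same manipulation is only a few lines; the two helpers below are hypothetical names of mine, not part of the Fitbit API, and they assume the steps.json dump from the snippet above:

    import json
    from datetime import datetime

    def steps_per_interval(path="steps.json"):
        """Steps in each 15 minute bin, as (time, steps) tuples."""
        with open(path) as f:
            data = json.load(f)["activities-steps-intraday"]["dataset"]
        return [(datetime.strptime(d["time"], "%H:%M:%S").time(), d["value"]) for d in data]

    def cumulative_steps(path="steps.json"):
        """Cumulative steps vs. time over the day (a CDF-like curve)."""
        total, curve = 0, []
        for t, v in steps_per_interval(path):
            total += v
            curve.append((t, total))
        return curve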

The latter function gives the cumulative steps vs. time (I like CDFs!), which is a nice way of seeing when you were moving (slope of line > 0). This requires partner access to the Fitbit API; I was impressed with their help (emails answered within minutes) and their enthusiastic support for education and our upcoming class.