We have measured how accurate Sleep as Android is compared to a clinical sleep lab, and the results look very good! The smart wake-up has a 96% chance of not ringing during PSG-measured N3 deep sleep. Any lucid cue we trigger has a 50% chance of hitting the REM phase. We can detect 30% of awakes from movement intensity alone. Light and deep sleep measured by our app correspond strongly with sleep phases measured on PSG.
For years we have striven to find a reliable way to compare actigraphy, used for sleep phase detection in our app, with polysomnography – the de facto clinical gold standard for sleep measurement. Finally, we came across a public dataset [2, 3, 4] containing 31 nights with both actigraphy (activity measured with the Apple Watch accelerometer) and expert-annotated PSG records (hypnograms). This allowed us to test our algorithms on previously unseen data and compare the output against PSG, the gold standard of sleep tracking.
One of the measured nights of sleep from the dataset is shown in the picture above. The light-gray bars represent the amount of physical activity at each moment, as recorded by the Apple Watch. The red line shows sleep phases annotated by a human expert on the corresponding PSG record (let's call them PSG-phases in the rest of this text). Finally, the blue line displays sleep phases generated by our algorithm from the activity data (ACT-phases).
PSG vs. Actigraphy
Please refer to our earlier post, How we measure your dreams [1], where we explain in detail our activity-based approach to sleep phases.
Polysomnography (PSG) is today's gold standard for clinical sleep monitoring. The patient spends one night in a laboratory, trying to sleep as normally as possible, with dozens of electrodes (EEG, EOG, EMG, etc.) attached to their body.
One output of such a session is a record of sleep phases (N1, N2, N3, REM, AWAKE), compiled by a human specialist. The methodology is based on spotting typical patterns in the sensor readings, and depends heavily on human judgment, as the patterns are not always clear.
In fact, when two different experts evaluate the same sleep record, the resulting hypnograms agree only 75% of the time on average [5, 6]. Any comparison with a PSG hypnogram is therefore burdened with this kind of error.
In our app, Sleep As Android, we monitor physical activity during sleep, as it is easy to measure on a wide range of consumer devices. We mark periods with relatively low activity as deep sleep and periods with somewhat higher activity as light sleep. Then we mark parts of the light sleep periods as REM candidates, based on their typical “text-book” occurrence within a hypnogram. Finally, we mark periods of extremely high activity as a potential awake.
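The thresholding logic described above can be sketched roughly as follows. The threshold values and the normalization of activity to [0, 1] are illustrative assumptions for this sketch, not the app's actual parameters:

```python
# Rough sketch of activity-based phase labeling, as described above.
# Threshold values are illustrative assumptions, not the app's real ones.

def classify_epochs(activity, deep_thr=0.2, awake_thr=0.8):
    """Label each epoch by its activity level (normalized to [0, 1])."""
    phases = []
    for a in activity:
        if a >= awake_thr:       # extremely high movement -> potential awake
            phases.append("AWAKE")
        elif a <= deep_thr:      # very little movement -> deep sleep
            phases.append("DEEP")
        else:                    # everything in between -> light sleep
            phases.append("LIGHT")
    return phases

print(classify_epochs([0.05, 0.35, 0.9, 0.1]))  # ['DEEP', 'LIGHT', 'AWAKE', 'DEEP']
```

In the real app the activity signal is aggregated over longer windows, so the thresholds are relative to each sleeper's own night rather than fixed constants.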
Several studies [7–10] have demonstrated that there is a relationship between PSG-phases and the average amount of sleeper’s movement. Our app attempts to utilize this relationship in concrete and useful ways.
The picture above shows frequencies of the individual sleep phases across the entire dataset, and the expected relationship between PSG and ACT phases.
N3 is the deepest sleep phase, in which the body is completely relaxed and the EEG shows its typical slow waves. We would certainly like to recognize it as deep sleep.
On the other hand, N1 and REM (and awake) are characterized by relatively high physical activity. We expect to mark at least some awake and REM correctly, and the rest should fall into the generic light sleep category.
Then there is the N2 phase, which occupies about half of the night. A typical "text-book" sequence of sleep stages throughout the night may be something like N1 → N2 → N3 → N2 → REM → N2 → N3 → N2 → REM → N2 → REM → N1. N2 is a transitional phase between lighter and deeper sleep, so there can be either high or low physical activity in it: activity tends to be low as sleep descends towards N3, and higher as it approaches REM or awake. Naturally, we assign the low-activity parts of N2 to deep sleep and the higher-activity parts to light sleep.
So much for theory. How does our app perform in real use?
One of the core features of our app is the smart alarm, which wakes the sleeper in a light phase, indicated by higher activity readings. According to our analysis of user data, people rate their sleep better when they are woken up by the smart alarm [11].
We simulated our smart alarm algorithm on the aforementioned dataset. At every moment of the night, we calculated whether the alarm would have been triggered (had the user set their smart period around that moment), and we counted how often the alarm rings in each PSG-phase.
There is only a 4% chance that the alarm rings in the deepest phase (N3), and a 60% chance that it rings in a light phase (N1, REM, or the user is already awake), which is a substantial improvement over a random benchmark. In the remaining 36% of cases, the alarm rings in N2. However, we trigger it only when we measure relatively high activity, so it is likely to be a relatively light interval, possibly just preceding N1 or REM.
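The counting behind this simulation can be sketched as follows. The simple activity threshold used here as the trigger condition is a stand-in assumption for the app's real trigger logic:

```python
from collections import Counter

def alarm_phase_distribution(activity, psg_phases, trigger_thr=0.5):
    """At each epoch where the simulated alarm would fire (activity above
    the trigger threshold), record the concurrent PSG-annotated phase,
    then normalize the counts into a probability distribution."""
    counts = Counter(p for a, p in zip(activity, psg_phases) if a > trigger_thr)
    total = sum(counts.values())
    return {phase: n / total for phase, n in counts.items()}

dist = alarm_phase_distribution(
    [0.1, 0.6, 0.7, 0.2, 0.9],
    ["N3", "N2", "REM", "N3", "AWAKE"],
)
# the alarm would fire in epochs 1, 2, and 4 -> N2, REM, AWAKE, 1/3 each
```

Running this over all 31 nights and all possible smart periods yields the per-phase percentages reported above.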
These are the results with default alarm sensitivity settings. Users can fine-tune the alarm trigger sensitivity in the app, according to their own sleep activity patterns.
The lucid cue is a favorite feature of dreaming enthusiasts. It plays a gentle sound when the sleeper is supposed to be in the REM phase. Users train themselves to react to these cues in order to participate consciously in their dreams.
It is not possible to directly identify REM phase from activity data alone. REM is just one of the phases with relatively high activity. It is characterized by rapid eye movements and we simply do not have this input from a smartphone or smartwatch.
However, there are patterns in when REM typically occurs. It usually begins shortly after a light-sleep ACT-phase starts, and covers a large part of the light phase.
Our algorithm triggers lucid cues 10, 20, and (optionally) 30 minutes after a light phase start. And the simulation shows that there is a 50% chance that the cue really is triggered in a REM phase. REM phases occupy about 20% of the time in the dataset, so our accuracy is 2.5 times higher than if we fired the cues just randomly.
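The cue schedule itself is simple enough to sketch directly. This mirrors the timing rule described above, not the app's actual code; minutes are counted from the start of sleep:

```python
def lucid_cue_times(light_phase_starts, offsets=(10, 20, 30)):
    """Return the minutes (from sleep start) at which lucid cues would fire:
    10, 20, and optionally 30 minutes after each detected light-phase start."""
    return sorted(start + off for start in light_phase_starts for off in offsets)

print(lucid_cue_times([90, 180]))  # [100, 110, 120, 190, 200, 210]
```

Scoring a cue then just means checking which PSG-phase was annotated at each of these minutes, which is how the 50% REM hit rate was measured.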
When the activity is extremely high, we mark the period as likely awake. Simulation on the dataset shows that we detect only about a third of actual awakes this way, and about half of the detected awakes are false positives. This is much better than random, but there are huge differences between people, so the error is rather large. Some people lie still in bed when they wake up, while others toss and turn heavily even when they are asleep.
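In evaluation terms, "detecting a third of actual awakes" is recall and "half the detections are false positives" means 50% precision. Both are straightforward to compute from per-epoch labels; the data below is a toy example mirroring those numbers, not the real dataset:

```python
def awake_detection_stats(detected, actual):
    """detected/actual: per-epoch booleans (True = awake).
    Returns (recall, precision) of the movement-based awake detector."""
    true_pos = sum(d and a for d, a in zip(detected, actual))
    recall = true_pos / sum(actual)        # share of real awakes we caught
    precision = true_pos / sum(detected)   # share of detections that were real
    return recall, precision

# toy example mirroring the numbers above: 1/3 recall, 1/2 precision
stats = awake_detection_stats(
    detected=[True, False, False, True, False, False],
    actual=[True, True, True, False, False, False],
)
```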
On the other hand, our app offers a range of additional criteria for awake detection, such as the phone being used, talking, or light in the room after sunset. Users can choose only the criteria that work well for them and thus likely achieve much better accuracy in practice.
Light and deep sleep
Well, as light and deep sleep (ACT-phases) are defined purely by the amount of physical activity, it makes no sense to measure how accurate they are, compared to the PSG benchmark. They simply mark periods of relatively high and low physical activity and are correct by definition.
But we can explore how PSG-phases correspond to light/deep sleep. The relationship is summarized in the picture below.
For simplicity, we split the data into two parts – the activity-based light and deep phases. The light part also includes intervals that we mark as potential REM or awake on the hypnogram, as our REMs and awakes are just additional hints on top of the light phase, based on heuristics and burdened with error, rather than ACT-phases proper.
Blue and orange columns show the portion of actual PSG-phases that fall into our deep and light phases, respectively.
All awakes and most N1s and REMs were classified as light, N2 was split into two similarly sized parts, and N3 was more often classified as deep than light, as intended.
Open problems and future work
Apparently, classification of this particular dataset is somewhat skewed towards light sleep: N1, REM, and AWAKE are identified much more precisely than N3. We can only speculate why. It could be a bias in our algorithm, a systematic error in the dataset labeling (remember that two PSG experts agree only 75% of the time), or perhaps a characteristic of the Apple Watch sensor.
However, this is just the beginning. This kind of data not only allows us to evaluate the success rates of our current algorithms; it also offers a path to improving them. We have to be very careful not to overfit our models, but with a larger dataset, we may be able to optimize them and get even closer to the PSG results.
Data for this study were acquired with an Apple Watch. As all consumer devices use similar activity sensors today, we believe the results would be similar with any smartwatch or wristband. Our experiments have shown that activity inputs from our contactless sensors (like Sonar or SleepPhaser) and the in-phone accelerometer strongly correlate with smartwatch readings, so the results are relevant for them too.
However, there is a fundamental question: to what extent can PSG-phases be compared with activity measurements at all? A night on PSG is hardly natural. Even though the lab operators try to make things as comfortable as possible, the test subject still has dozens of electrodes attached to their body, cannot move freely in the bed, and sleeps in an unfamiliar room.
How does this affect the activity readings? Maybe the subject tends to move more because the environment is strange and their sleep is lighter than usual. Maybe they move less, because of the bunch of wires connected to their body, and remain stiff during sleep. This may cast doubt on any comparative study of activity and PSG until the phenomenon is explored and clarified.
In the meantime, ACT-phases provide an objective measurement of certain aspects of sleep that can be acquired easily at home. On their own, and at nearly no cost, they can provide useful insight into one's sleep habits.
1. Urbandroid Team (2019). How we measure your dreams.
2. Walch, O. (2019). Motion and heart rate from a wrist-worn wearable and labeled sleep from polysomnography. PhysioNet. doi:10.13026/hmhs-py35
3. Walch, O., Huang, Y., Forger, D., Goldstein, C. Sleep stage prediction with raw acceleration and photoplethysmography heart rate data derived from a consumer wearable device.
4. Goldberger, A. L., Amaral, L. A. N., Glass, L., Hausdorff, J. M., Ivanov, P. Ch., Mark, R. G., Mietus, J. E., Moody, G. B., Peng, C.-K., Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals. Circulation. 101(23):e215–e220.
5. Younes, M., Raneri, J., Hanly, P. (2016). Staging Sleep in Polysomnograms: Analysis of Inter-Scorer Variability. J Clin Sleep Med. 12(6):885–894. doi:10.5664/jcsm.5894
6. Norman, R. G., Pal, I., Stewart, C., Walsleben, J. A., Rapoport, D. M. (2000). Interobserver agreement among sleep scorers from different centers in a large dataset. Sleep. 23:901–908.
7. Wilde-Frenz, J., Schulz, H. (1983). Rate and distribution of body movements during sleep in humans. Perceptual and Motor Skills. 56:275–283.
8. Muzet, A., Naitoh, P., Townsend, R. E., et al. (1972). Psychonomic Science. 29:7. doi:10.3758/BF03336549
9. Stefani, A., Gabelia, D., Mitterling, T., Poewe, W., Högl, B., Frauscher, B. (2015). A Prospective Video-Polysomnographic Analysis of Movements during Physiological Sleep in 100 Healthy Sleepers. Sleep. 38(9):1479–1487. doi:10.5665/sleep.4994
10. Middelkoop, H. A., Van Hilten, B. J., Kramer, C. G., Kamphuisen, H. A. (1993). Actigraphically recorded motor activity and immobility across sleep cycles and stages in healthy male subjects. J Sleep Res. 2(1):28–33.
11. Urbandroid Team (2017). A case for the smart alarm.