How does Sleep as Android compare to the Sleep lab

We have measured how accurate Sleep as Android is compared to a clinical sleep lab, and the results look very good! There is a 96% chance that smart wake-up does not ring during PSG-measured N3 deep sleep. We are 2.5 times more accurate at spotting the REM phase than blind chance. We can detect about 30% of awakes from movement intensity alone, and more using movement frequency, ambient light level, talk detection and phone use. Light and deep sleep measured by our app correspond strongly with sleep phases measured on PSG.

For years we have striven to find a reliable way to compare actigraphy, which our app uses for sleep phase detection, with polysomnography (PSG), the de facto clinical gold standard for sleep measurement. Finally, we came across a public dataset [2, 3, 4] containing 31 nights with both actigraphy (activity measured with the accelerometer of an Apple Watch) and expert-annotated PSG records (hypnograms). This allowed us to test our algorithms on previously unseen data and compare the output against PSG, the gold standard of sleep tracking.

One of the measured nights from the dataset is shown in the picture above. The light-gray bars represent the amount of physical activity at each moment, as recorded by the Apple Watch. The red line shows the sleep phases annotated by a human expert on the corresponding PSG record (we call them PSG-phases in the rest of this text). Finally, the blue line displays the sleep phases generated by our algorithm from the activity data (ACT-phases).

PSG vs. Actigraphy

Please refer to our earlier post, How we measure your dreams [1], where we explain in detail our activity-based approach to sleep phases.

Polysomnography (PSG) is today’s gold standard for clinical sleep monitoring. The patient spends one night in a laboratory, trying to sleep as normally as possible, with dozens of electrodes (EEG, EOG, EMG, etc.) attached to their body.

One output of such a session is a record of sleep phases (N1, N2, N3, REM, AWAKE), compiled by a human specialist. The methodology is based on spotting typical patterns in the sensor readings, and depends heavily on human judgment, as the patterns are not always clear.

In fact, when two different experts evaluate the same sleep record, the resulting hypnograms agree only 75% of the time on average [5, 6]. Any comparison with a PSG hypnogram therefore carries this level of uncertainty.
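To make the 75% figure concrete, here is a minimal sketch of how agreement between two scorers can be computed epoch by epoch. The function name `epoch_agreement` and the two short hypnograms are made up for illustration; the cited studies may use more refined statistics.

```python
def epoch_agreement(scorer_a, scorer_b):
    """Fraction of epochs on which two hypnograms carry the same label."""
    if len(scorer_a) != len(scorer_b):
        raise ValueError("hypnograms must cover the same epochs")
    if not scorer_a:
        raise ValueError("hypnograms must not be empty")
    matches = sum(a == b for a, b in zip(scorer_a, scorer_b))
    return matches / len(scorer_a)


# Two invented scorings of the same eight epochs (not real data).
expert_1 = ["N1", "N2", "N2", "N3", "N3", "N2", "REM", "REM"]
expert_2 = ["N1", "N2", "N3", "N3", "N3", "N2", "N2", "REM"]

agreement = epoch_agreement(expert_1, expert_2)
print(f"agreement: {agreement:.0%}")
```

In this toy example the two scorers disagree on two of eight epochs, both at the border between neighbouring phases, which is where real disagreements also tend to cluster.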

In our app, Sleep as Android, we monitor physical activity during sleep, as it is easy to measure on a wide range of consumer devices. We mark periods with relatively low activity as deep sleep and periods with somewhat higher activity as light sleep. Then we mark parts of the light sleep periods as REM candidates, based on their typical “text-book” occurrence within a hypnogram. Finally, we mark periods of extremely high activity as potential awakes.
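The basic idea can be sketched as a simple thresholding of activity values. This is only an illustration: the function names and both thresholds (`deep_max`, `awake_min`) are hypothetical and are not the values used in the app, and the REM-candidate step is left out.

```python
def act_phase(activity, deep_max=2.0, awake_min=20.0):
    """Map one activity reading to an ACT-phase.

    deep_max and awake_min are illustrative thresholds, not the app's values.
    """
    if activity >= awake_min:
        return "awake"   # extremely high activity: potential awake
    if activity <= deep_max:
        return "deep"    # relatively low activity
    return "light"       # somewhat higher activity


def label_night(activity_per_epoch):
    """Label every epoch of a night independently."""
    return [act_phase(a) for a in activity_per_epoch]


labels = label_night([0.5, 1.0, 5.0, 30.0, 1.5])
print(labels)
```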

Several studies [7-10] have demonstrated that there is a relationship between PSG-phases and the average amount of the sleeper’s movement. We replicated the experiment on the dataset, and the results agree with the studies.

Sleep phase | Average number of moves / 5 min | Standard deviation


The table above shows the average number of significant moves per five-minute interval for individual sleep phases. There was some variability among the individual subjects, but even such simple statistics can provide a strong hint about the current sleep phase. Our app attempts to utilize this relationship in concrete and useful ways.
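A statistic of this kind can be computed along the following lines. The sketch assumes the night has already been cut into five-minute intervals, each labelled with its PSG-phase and a count of significant moves; the sample values are invented, and the use of the population standard deviation is an assumption.

```python
from collections import defaultdict
from statistics import mean, pstdev


def moves_by_phase(intervals):
    """Return {phase: (mean moves, standard deviation)} for (phase, moves) pairs."""
    grouped = defaultdict(list)
    for phase, moves in intervals:
        grouped[phase].append(moves)
    return {phase: (mean(v), pstdev(v)) for phase, v in grouped.items()}


# Invented five-minute intervals: (PSG-phase, number of significant moves).
sample = [("N3", 0), ("N3", 2), ("REM", 3), ("REM", 5), ("N2", 1), ("N2", 3)]

stats = moves_by_phase(sample)
for phase, (avg, sd) in stats.items():
    print(f"{phase}: {avg:.1f} moves / 5 min (sd {sd:.1f})")
```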

Figure 2. Expected mapping between PSG and ACT phases.

The picture above shows frequencies of the individual sleep phases across the entire dataset, and the expected relationship between PSG and ACT phases.

N3 is the deepest sleep phase, in which the body is completely relaxed and the EEG shows typical slow waves. We would certainly like to recognize it as deep sleep.

On the other hand, N1 and REM (and awake) are characterized by relatively high physical activity. We expect to mark at least some awake and REM correctly, and the rest should fall into the generic light sleep category.

Then there is the N2 phase, which occupies about half of the night. A typical “text-book” sequence of sleep stages throughout the night may be something like N1 → N2 → N3 → N2 → REM → N2 → N3 → N2 → REM → N2 → REM → N1. N2 is a transitional phase between lighter and deeper sleep. Physical activity in this phase can be either high or low. It tends to be low as sleep descends towards N3 and higher as it approaches REM or awake. Naturally, we assign the low-activity parts of N2 to deep sleep and the higher-activity parts to light sleep.

The results

So much for theory. Now, how does our app perform in real use cases?

Smart alarm

One of the core features of our app is the smart alarm, which wakes the sleeper in a light phase, indicated by higher activity readings. According to our analysis of user data, people rate their sleep better when they are woken up by the smart alarm.

We simulated our smart alarm algorithm on the aforementioned dataset. At every moment of the night, we calculated whether the alarm would have been triggered (had the user set their smart period around this moment), and we estimated the likelihood that the alarm rings in a particular PSG-phase.

There is only a 4% chance that the alarm rings in the deepest phase (N3), and a 60% chance that it rings in a light phase (N1, REM, or the user is already awake), which is a substantial improvement over a random benchmark. In the remaining 36% of cases, the alarm rings in N2. However, we trigger it only when we measure relatively high activity, so it is likely to be a relatively light interval, possibly just preceding N1 or REM.
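The simulation can be pictured roughly as follows. This is a sketch under a strong simplifying assumption: the alarm is taken to fire whenever activity reaches a fixed `trigger_threshold`, which is a hypothetical stand-in for the app's real trigger rule. The sample night is invented.

```python
from collections import Counter


def alarm_phase_distribution(psg_phases, activity, trigger_threshold=5.0):
    """Share of would-be alarm triggers that fall into each PSG-phase.

    trigger_threshold is an illustrative value, not the app's setting.
    """
    hits = Counter(
        phase
        for phase, act in zip(psg_phases, activity)
        if act >= trigger_threshold
    )
    total = sum(hits.values())
    if total == 0:
        return {}
    return {phase: n / total for phase, n in hits.items()}


# Invented night: one PSG label and one activity reading per epoch.
psg = ["N3", "N2", "N2", "REM", "N1", "AWAKE", "N3", "REM"]
act = [0, 6, 1, 8, 7, 12, 1, 2]

dist = alarm_phase_distribution(psg, act)
print(dist)
```

In this toy night no trigger falls into N3, because both N3 epochs have low activity; the real figures above come from running the actual algorithm over the whole dataset.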

These are the results with default alarm sensitivity settings. Users can fine-tune the alarm trigger sensitivity in the app, according to their own sleep activity patterns.

Lucid cues

Lucid cues are a favorite feature of dreaming enthusiasts. The app plays a gentle alarm when the sleeper is expected to be in the REM phase. Users train themselves to react to these cues in order to participate consciously in their dreams.

It is not possible to identify the REM phase directly from activity data alone. REM is just one of the phases with relatively high activity. It is characterized by rapid eye movements, and we simply do not have this input from a smartphone or smartwatch.

However, there are patterns as to when REM typically occurs. It usually begins shortly after the start of a light sleep ACT-phase and covers a large part of the light phase.

Our algorithm triggers lucid cues 10, 20, and (optionally) 30 minutes after a light phase start. And the simulation shows that there is a 50% chance that the cue really is triggered in a REM phase. REM phases occupy about 20% of the time in the dataset, so our accuracy is 2.5 times higher than if we fired the cues just randomly.
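The cue timing and the "2.5 times" figure can be reproduced with a few lines. The function names are made up for illustration; the offsets and the two rates are the ones quoted above.

```python
def cue_minutes(light_phase_start, third_cue=False):
    """Minutes at which cues fire: 10, 20 and optionally 30 min after light phase start."""
    offsets = [10, 20, 30] if third_cue else [10, 20]
    return [light_phase_start + offset for offset in offsets]


def lift_over_chance(hit_rate, base_rate):
    """How many times better than firing at random moments."""
    if not 0 < base_rate <= 1:
        raise ValueError("base_rate must be in (0, 1]")
    return hit_rate / base_rate


cues = cue_minutes(95, third_cue=True)      # light phase starts at minute 95
rem_lift = lift_over_chance(0.50, 0.20)     # 50% hit rate vs. 20% REM share
print(cues, round(rem_lift, 2))
```

A cue fired at a random moment would land in REM about 20% of the time, simply because REM takes up about 20% of the night, so a 50% hit rate corresponds to a lift of 0.50 / 0.20 = 2.5.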


Awake detection

When the activity is extremely high, we mark the period as a likely awake. Simulation on the dataset shows that we detect only about 1/3 of actual awakes this way, and about half of the detected awakes are false positives. This is much better than random, but there are huge differences among people, and hence the error is rather big. Some people lie still in bed when they wake up, while others toss and turn heavily even when they are asleep.
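In classification terms, "about 1/3 of actual awakes detected" is the recall and "about half of detections are false positives" corresponds to a precision of about 50%. A minimal sketch, with invented per-epoch flags chosen to reproduce those two figures:

```python
def precision_recall(predicted, actual):
    """Precision and recall of per-epoch awake flags (True = awake)."""
    pairs = list(zip(predicted, actual))
    tp = sum(p and a for p, a in pairs)          # detected and really awake
    fp = sum(p and not a for p, a in pairs)      # detected but asleep
    fn = sum(a and not p for p, a in pairs)      # awake but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented example: 6 truly awake epochs, 4 detections, 2 of them correct.
actual = [True] * 6 + [False] * 6
predicted = [True, True, False, False, False, False,
             True, True, False, False, False, False]

precision, recall = precision_recall(predicted, actual)
print(f"precision {precision:.2f}, recall {recall:.2f}")
```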

On the other hand, our app offers a range of additional criteria for awake detection, such as the phone being used, talking, or light in the room after sunset. Users can choose to use only the criteria that work well for them and thus presumably achieve much better accuracy in reality.

Based on insights gathered from the dataset, we designed a new awake detection algorithm, which operates on movement frequency rather than absolute movement intensity. This approach seems to work much better and will be released very soon in our app. We do not provide any concrete performance figures at this point, as it has not yet been validated on previously unseen data.

Light and deep sleep

Well, as light and deep sleep (ACT-phases) are defined purely by the amount of physical activity, it makes no sense to measure how accurate they are, compared to the PSG benchmark. They simply mark periods of relatively high and low physical activity and are correct by definition.

But we can explore how PSG-phases correspond to light/deep sleep. The relationship is summarized in the picture below.

Figure 3. PSG phases in light vs deep sleep.

For simplicity, we split the data into two parts – activity-based light and deep phase. The light part also includes intervals that we mark as potential REM or awake on the hypnogram, as our REMs and awakes are just additional hints on top of the light phase, based on heuristics and burdened with error, rather than ACT-phases proper.

Blue and orange columns show the portion of actual PSG-phases that fall into our deep and light phases, respectively.

All awakes and most N1s and REMs were classified as light, N2 was split into two parts of similar size, and N3 was more often classified as deep than light, as intended.
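The split shown in the figure is a simple cross-tabulation: for every PSG-phase, count how many of its epochs our algorithm labelled deep and how many light. A sketch with invented labels (the name `phase_split` is hypothetical):

```python
from collections import Counter, defaultdict


def phase_split(psg_phases, act_phases):
    """For each PSG-phase, the share of its epochs labelled deep vs. light."""
    counts = defaultdict(Counter)
    for psg, act in zip(psg_phases, act_phases):
        counts[psg][act] += 1
    return {
        psg: {act: n / sum(c.values()) for act, n in c.items()}
        for psg, c in counts.items()
    }


# Invented epochs: PSG label and the matching activity-based label.
psg = ["N3", "N3", "N3", "N2", "N2", "REM", "REM", "AWAKE"]
act = ["deep", "deep", "light", "deep", "light", "light", "light", "light"]

split = phase_split(psg, act)
for phase, shares in split.items():
    print(phase, shares)
```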

Apparently, the results are skewed towards light sleep: N1, REM, and AWAKE are identified much more precisely than N3. This can be partly explained by the very nature of the N2 phase, which contains long periods of inactivity that are indistinguishable from N3 using activity data alone. There may also be a bias in our algorithm, or a systematic error in the dataset labeling (remember that two PSG experts agree only 75% of the time), or it may be a characteristic of the Apple Watch sensor. We have to be very careful not to overfit our models, but with a larger dataset, we may be able to optimize and get even closer to the PSG results.

Activity patterns – sleep lab vs. home

The results presented in this post are based on data recorded in a sleep lab on a polysomnograph. One cannot avoid a fundamental question: to what extent are the conclusions applicable to the home environment?

A night on PSG is hardly natural. Even though the lab operators try to make things as comfortable as possible, the test subject still has dozens of electrodes attached to their body, cannot move freely in bed, and sleeps in an unfamiliar room. How does this affect their activity readings? It may cast doubt on any comparative study of activity and PSG, and very little is known on this topic.

We acquired data from another sleep study [12], where the subjects wore a wristband for a week of their normal life, recording their activity and manually reporting when they were asleep. Then they spent a night in a sleep lab, still wearing the wristband, so we were able to compare their sleep movement characteristics between the two environments.

Unsurprisingly, most people move more when they sleep at home. They are not wired to the sensors, they can toss and turn and jump freely on the bed. However, the difference is mainly in the intensity of their moves. In terms of frequency, the difference was very small. And when we simulated our algorithms on the data, the amount of light and deep sleep periods was about the same at home and in the sleep lab.
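The difference between movement frequency and movement intensity can be illustrated with a small sketch. The definitions used here (share of epochs with a move above `move_threshold`, and the mean size of those moves) and the two sample nights are assumptions made for illustration, not the exact measures used in our analysis.

```python
def movement_summary(activity, move_threshold=1.0):
    """Return (frequency, intensity) of moves in a series of activity readings.

    frequency: share of epochs with a move above move_threshold (illustrative).
    intensity: mean activity of those epochs.
    """
    if not activity:
        raise ValueError("activity must not be empty")
    moves = [a for a in activity if a > move_threshold]
    frequency = len(moves) / len(activity)
    intensity = sum(moves) / len(moves) if moves else 0.0
    return frequency, intensity


# Invented readings: same timing of moves, but stronger moves at home.
lab_night = [0, 2, 0, 0, 3, 0, 0, 2]
home_night = [0, 6, 0, 0, 9, 0, 0, 6]

lab_freq, lab_intensity = movement_summary(lab_night)
home_freq, home_intensity = movement_summary(home_night)
print(lab_freq, home_freq, lab_intensity, home_intensity)
```

An algorithm that relies on when moves occur rather than on how strong they are is therefore less sensitive to the change of environment.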

This simple experiment suggests that typical movement patterns at home are very similar to those in a sleep lab, so the results make sense in the home environment too.


Conclusion

We have demonstrated how actigraphic data, acquired with consumer wearables, can be used for sleep analysis.

ACT-phases provide an objective measurement of certain aspects of sleep that can be acquired easily at home. At nearly no cost, they can provide useful insight into one’s sleep habits on their own.

Furthermore, they can be used to estimate PSG-phases proper, such as REM or N3, with reasonable accuracy, and applied to improve our users’ sleep.

The results are not perfect, but this is just the beginning. This kind of data not only allows us to evaluate the success rates of our current algorithms; we also use the feedback to adjust and improve them. Then we go out to get more data, test, improve, and repeat ad infinitum.


References

  1. Urbandroid Team (2019). How we measure your dreams.
  2. Walch, O. (2019). Motion and heart rate from a wrist-worn wearable and labeled sleep from polysomnography. PhysioNet. doi:10.13026/hmhs-py35
  3. Olivia Walch, Yitong Huang, Daniel Forger, Cathy Goldstein, Sleep stage prediction with raw acceleration and photoplethysmography heart rate data derived from a consumer wearable device
  4. Goldberger AL, Amaral LAN, Glass L, Hausdorff JM, Ivanov PCh, Mark RG, Mietus JE, Moody GB, Peng C-K, Stanley HE. PhysioBank, PhysioToolkit, and PhysioNet: Components of a New Research Resource for Complex Physiologic Signals (2003). Circulation. 101(23):e215-e220
  5. Magdy Younes, Jill Raneri, Patrick Hanly, Staging Sleep in Polysomnograms: Analysis of Inter-Scorer Variability, J Clin Sleep Med. 2016 Jun 15; 12(6): 885–894. Published online 2016 Jun 15. doi: 10.5664/jcsm.5894
  6. Norman RG, Pal I, Stewart C, Walsleben JA, Rapoport DM. Interobserver agreement among sleep scorers from different centers in a large dataset. Sleep. 2000;23:901–8
  7. Wilde-Frenz, J. & Schulz, H. (1983). Rate and distribution of body movements during sleep in humans. Perceptual and Motor Skills. 56. 275-83. doi: 10.2466/pms.1983.56.1.275.
  8. Muzet, A., Naitoh, P., Townsend, R.E. et al. Psychon Sci (1972) 29: 7. https://doi.org/10.3758/BF03336549
  9. Stefani A, Gabelia D, Mitterling T, Poewe W, Högl B, Frauscher B. A Prospective Video-Polysomnographic Analysis of Movements during Physiological Sleep in 100 Healthy Sleepers. Sleep. 2015 Sep 1;38(9):1479-87. doi: 10.5665/sleep.4994.
  10. Middelkoop HA, Van Hilten BJ, Kramer CG, Kamphuisen HA. Actigraphically recorded motor activity and immobility across sleep cycles and stages in healthy male subjects. J Sleep Res. 1993 Mar;2(1):28-33.
  11. Urbandroid Team (2017), A case for the smart alarm.
  12. Multi-Ethnic Study of Atherosclerosis.

9 thoughts on “How does Sleep as Android compare to the Sleep lab”

  1. “containing 31 nights with both actigraphy”
    31 nights is not very many.
    Why do you not cooperate with a sleep laboratory, where you could analyze many more nights?

  2. Hello Germo,

    Thank you for your comment.

    You are right, 31 nights is not very many. On the other hand, as measurements in a sleep lab are quite expensive, it is common practice to publish sleep studies, even in proper scientific journals, that are based on even fewer observations.

    We have tried to establish such a partnership in the past and there are some negotiations going on at the moment. We hope that we will get the right data eventually and publish more precise results.

    Best Regards


  3. At the last class reunion I talked to a fellow student who runs a sleep lab (www.schlafdoktor.de). He views such sleep tracking apps positively, because they raise awareness. So I can imagine that cooperation with sleep labs is possible, in your country as well.

  4. This sentence is inaccurate: “On the other hand, N1 and REM (and awake) are characterized by relatively high physical activity.”
    There should be no physical activity during REM sleep (“REM atonia”). In fact, physical activity during REM sleep is a sign of REM-behavioral disorder.

    1. Hello Vicky, thank you for the comment. I believe that the sentence is correct. We drew the information from numerous scientific papers (see references 7-10, and you can find many more similar papers), and we verified it on the aforementioned dataset (by comparing actigraph readings with manually annotated sleep phases). My understanding is that the muscles are indeed inhibited during the REM phase, so that the sleeper does not jump on the bed or walk away, acting out their dreams, but the inhibition is not complete. The limbs and torso often jerk and twitch, and hence the actigraph readings are still relatively high for a healthy sleeper in REM. The disorder you are referring to (RBD) occurs when the muscles are not paralyzed sufficiently (or not at all) and people move excessively, fall out of bed, walk away, etc.
