In this work, we examine the task of wide-area indoor people detection in a network of depth sensors. In particular, we investigate how redundant and complementary multi-view information, including temporal context, can be jointly leveraged to improve detection performance. We recast multi-view people detection in overlapping depth images as an inverse problem and present a generative probabilistic framework that jointly exploits the temporal multi-view image evidence.