This work focuses on the recognition of complex human activities in video data. It combines new features with techniques from speech recognition to recognize action units and their combinations in video sequences. The presented approach shows how motion information extracted from video data can be used to interpret the underlying structure of actions, and how higher-level models allow abstraction over different motion categories beyond simple classification.