I. Introduction

Over the past few years, cheap and effective sensors have become widely available. Many systems have been built from them: security cameras along a building perimeter, smoke detectors near hazardous locations, motion sensors across restricted areas, etc. They provide an ad hoc way to secure and monitor areas such as office complexes, houses and warehouses. However, these systems rely on human operators, who are left to do all of the analysis. They suffer from drawbacks such as unimodality (i.e., being able to handle only one kind of data), growing deployment sizes, and the need for more manpower to analyse their output. Two developments are crucial to evolving such systems: (i) combining several sensors of different modalities in one system, and (ii) automating the work of, or at least assisting, the people who use these systems. The community has addressed the former through the Internet of Things (IoT) [4] concept, in which several objects are connected in a network, allowing smart analysis and remote access. The latter can be achieved with systems designed to interpret and combine the sensor data streams in order to detect events generated by the interactions of various agents (humans, natural occurrences, etc.) with the system. In this document, we provide a generalized framework for event detection using data from sensors of multiple modalities.

In a large deployment of sensors, such as a residential complex or a factory, multiple data streams are generated by wearable sensors, microphones, mobile phones, surveillance cameras, smoke detectors, infrared/RFID presence detectors, etc. These data streams can be used to detect events, which are then consolidated and organised at a central place (e.g., by using ibeyonde’s technology for cloud computing [5]), enabling the detection of higher-order events from data streams coming from a wide range of sensors. In addition, the sensor outputs can be tagged according to the topology of the deployment, distinguishing high-security areas from relaxed zones and the like.

In the following sections of the document, we expand upon the concepts around event detection using several multi-modal sensors. The definitions section explains the key terms and concepts and standardizes them throughout the document. The objectives section states the goals and key challenges for such a system. We then delve into the various components of such a system, generalizing to all kinds of modalities for sensor output. The properties of such components are also discussed with appropriate diagrams and flowcharts. The last section discusses the sensors and how they can be utilized in such a system. Throughout all the sections, we also discuss the use case of a face recognition pipeline that matches a face with access IDs detected using an RFID scanner.


Typical system architecture for event detection

II. Definitions

In this section, we define and standardize the terminology for the reader’s benefit. The terms are used throughout the document without any change in their definitions.

  • Event – A salient occurrence at a particular instant in time space, caused by the interaction of an agent with the system. An agent can be a human or a natural phenomenon.

In the case of a face recognition system, ‘face detected’ can be an event in which a face is located in a video stream or image. ‘Face recognized’ can be an event in which a face is identified as a known person whose information is already present in the system’s database.

  • Event detection – The real time identification of the occurrence of an event in time space. Event detection can operate on discrete signals, as with an access ID swipe or a binary on/off smoke detector, where no additional computation may be required. It can also operate on continuous signals, as with face detection or voice recognition, where real time computation on the data stream is required.
  • Event Model (EM) – A set of events governed by causality, with a boundary in time space, that defines a scenario. The set of events is ordered and can have multiple triggers, from the same or different sensors, that mark its start state in time space. The events form a state diagram in which detections cause transitions from one state to another. The end state can lead to various actions, which are executed depending on the model. EMs can be classified by these characteristics:
      • Recognizability – States whether the EM is known to the system upon interpretation of the user’s requirements
      • Desirability – States whether the EM’s occurrence is safe or not, as defined by the user’s requirements
      • Nature – States whether the EM occurred naturally or had human initiation and/or involvement

In the case of a face recognition system, an ‘authorized entry by a known person’ EM can have a state diagram,

Similarly, an unauthorized entry by a known person EM can have a state diagram,

  • EM detection – Relating a set of events to a predefined EM in the system in real time. The set of events arise from continuous outputs from the various sensors and must be associated with a particular EM in real time by the system’s control logic.

For example, in the case of the face recognition system, if the ‘face detected’ and ‘access ID detected’ events happen in parallel or at different times, the control logic must be able to associate them with one of the two EMs stated above.
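
As a minimal sketch of EM detection, assume each EM is stored as a table of state transitions and the control logic feeds every incoming event to all candidate EMs. The state names, the `access_id_valid` and `access_id_invalid` events, and the helpers `make_models` and `detect_em` are hypothetical and chosen only for illustration; the sketch also assumes a fixed event order per EM, whereas a real deployment may need extra transitions for events that arrive in a different order.

```python
from dataclasses import dataclass


@dataclass
class EventModel:
    name: str
    transitions: dict        # (state, event) -> next state
    end_state: str = "end"
    state: str = "start"

    def feed(self, event):
        """Advance on a matching event; events with no transition are ignored."""
        self.state = self.transitions.get((self.state, event), self.state)
        return self.state == self.end_state


def make_models():
    """Build fresh instances of the two hypothetical entry EMs."""
    shared = {
        ("start", "face_detected"): "face_seen",
        ("face_seen", "face_recognized"): "person_known",
    }
    authorized = dict(shared)
    authorized[("person_known", "access_id_valid")] = "end"
    unauthorized = dict(shared)
    unauthorized[("person_known", "access_id_invalid")] = "end"
    return [
        EventModel("authorized_entry", authorized),
        EventModel("unauthorized_entry", unauthorized),
    ]


def detect_em(events, models):
    """Feed each event to every candidate EM; return the first EM to finish."""
    for event in events:
        for model in models:
            if model.feed(event):
                return model.name
    return None
```

Calling `detect_em(["face_detected", "face_recognized", "access_id_valid"], make_models())` walks both state machines in parallel and reports the EM whose end state is reached first. Events from unrelated sensors simply leave each machine in its current state.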

  • Anomaly – A set of events close to a known EM but not the same. Such EMs can be flagged, and various statistics regarding their occurrence, trigger conditions, etc. can be logged.
  • Intrusion – A particular kind of known EM initiated through deliberate human interaction to break the system/security. Such EMs most likely lead to end states that raise alarms, record sensor data as evidence, notify the administrators, etc. as per requirements.
  • Accident – A particular kind of known EM, without deliberate human interaction, which breaks the system/security. These EMs may also lead to end states similar to those of an intrusion.
  • EM space – All possible sets of events can be summarized in the following Venn diagram,
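
As a rough illustration of the Anomaly definition above, one way to make "close to a known EM" concrete is an edit distance between the observed event sequence and the sequence of each known EM. This is only a sketch under stated assumptions: the framework does not prescribe a closeness measure, and the Levenshtein metric, the `max_distance` threshold and the function names are illustrative choices. It also assumes at least one known EM is supplied.

```python
def edit_distance(a, b):
    """Levenshtein distance between two event sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def classify(observed, known_ems, max_distance=1):
    """Return ("known", name), ("anomaly", nearest name) or ("unknown", None)."""
    best_name, best_dist = None, None
    for name, sequence in known_ems.items():
        dist = edit_distance(observed, sequence)
        if best_dist is None or dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist == 0:
        return ("known", best_name)
    if best_dist is not None and best_dist <= max_distance:
        return ("anomaly", best_name)
    return ("unknown", None)
```

With a single known EM of `["face_detected", "face_recognized", "access_id_valid"]`, an observation that skips the recognition step is one edit away and would be flagged as an anomaly of that EM, while an unrelated sequence falls outside the threshold.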

III. Objectives

A multi-sensor, multimodal event detection system must be capable of the following,

  • Evidence accumulation – The system must be able to gather information from various sensors and identify events and EMs in real time. EMs must be designed with the data from various modalities and multiple sensors in mind, so that they relate to the scenarios specified in the requirements.
  • EM classification – The behaviour of the agents in the environment must be recognized as normal, unusual or any other user-defined category based on the events.
  • Error handling – The system must be capable of handling errors such as mis-detections of events or EMs, sensor errors, etc. generated by the system.
  • Parallel EMs – The system must be capable of handling multiple EMs in parallel, with or without overlapping events. Each EM has its own instance but can intersect with, conflict with or remain separate from other EMs. The system’s control logic must handle the instantiation, validation and termination of multiple EMs.
  • Salient activity logging – All important or salient activities throughout the day must be logged. This service provides a compact summary of daily activities in the form of a set of events and EMs with their timestamps. The summary should trade off storage size against the percentage of truly salient activities captured during the day.
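
The parallel EM and logging objectives above can be sketched together. The example below assumes each EM is a non-empty ordered list of event names whose first entry is its trigger, and that an EM's boundary in time is a simple timeout. `EMInstance`, `EMManager` and the `timeout` parameter are hypothetical names introduced for illustration, not part of the framework.

```python
from dataclasses import dataclass


@dataclass
class EMInstance:
    name: str
    sequence: list        # ordered event names that make up the EM
    started_at: float
    position: int = 1     # the trigger event has already been consumed


class EMManager:
    def __init__(self, templates, timeout):
        self.templates = templates   # EM name -> ordered event names
        self.timeout = timeout       # seconds an instance may stay open
        self.active = []
        self.log = []                # (timestamp, EM name, outcome)

    def on_event(self, event, now):
        # Termination: close instances that exceeded their time boundary.
        for inst in [i for i in self.active if now - i.started_at > self.timeout]:
            self.active.remove(inst)
            self.log.append((now, inst.name, "timed_out"))
        # Validation: advance every instance that is waiting for this event.
        for inst in list(self.active):
            if inst.sequence[inst.position] == event:
                inst.position += 1
                if inst.position == len(inst.sequence):
                    self.active.remove(inst)
                    self.log.append((now, inst.name, "completed"))
        # Instantiation: the event may trigger one or more new EM instances.
        for name, sequence in self.templates.items():
            if sequence[0] == event:
                if len(sequence) == 1:
                    self.log.append((now, name, "completed"))
                else:
                    self.active.append(EMInstance(name, sequence, now))
```

Two EMs that share a trigger are instantiated side by side and advance on the same overlapping events; the `log` list then doubles as a compact, timestamped summary of completed and abandoned EMs.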

These systems are challenging to build [1] and must consider several factors in their design,

  • Event and EM boundaries in time space need to be identified for real time detection
  • Magnitude of events may vary affecting detection performance
  • Sensor readings may be sparse and error prone, requiring appropriate design changes
  • Ratio of event durations in a particular EM may vary which must be addressed in their state diagrams

IV. Components

A multiple sensor EM detection system [2] can have several components, depending on the nature of the task they undertake. They can be broadly classified based on their number of inputs and outputs. An EM can involve several such components, each working on one or more states of the EM. Every component in such a system falls into one of these five categories,




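Component                | Function                                  | # Inputs | # Outputs
-------------------------+-------------------------------------------+----------+----------
                         | Senses external information               |          | >= 1
                         | Provide results to the user               | >= 1     |
                         | Performs analysis on single data stream   |          | >= 1
                         | Synchronize inputs and combine them       | >= 1     |
Control / Control-panels | Monitors function of the above components |          |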



IV. (a) Generalized Component Diagram

The interactions of the agents with the components, and of the components with one another, can be shown in the following diagram, which is generalized to all modalities of sensor data,

IV. (b) Face Recognition Component Diagram

As an example, we modify the generalized component diagram for our face recognition task. The component diagram handles two EMs, one for authorized entry by a known person and the other for unauthorized entry by an unknown person. In this example, the latter EM leads to an alarm.
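
The step that synchronizes and combines inputs in this example can be sketched as pairing each face event with an access ID event that occurs within a short time window. This is a hedged illustration only: the `fuse` function, the `(timestamp, person)` tuple layout, the 5-second window and the outcome labels are assumptions, not the actual pipeline.

```python
def fuse(face_events, id_events, window=5.0):
    """Pair face and access ID events that occur within `window` seconds.

    face_events: list of (timestamp, person) from the face recognition step,
                 where person is None for an unknown face.
    id_events:   list of (timestamp, person) from the RFID scanner.
    Returns a list of (timestamp, outcome) tuples, one per face event.
    """
    results = []
    unused_ids = sorted(id_events, key=lambda e: e[0])
    for t_face, person in sorted(face_events, key=lambda e: e[0]):
        # Synchronize: take the earliest unused ID swipe inside the window.
        match = next((e for e in unused_ids if abs(e[0] - t_face) <= window), None)
        if match is not None:
            unused_ids.remove(match)
        # Combine: entry is authorized only if a known face matches the ID.
        if person is not None and match is not None and match[1] == person:
            results.append((t_face, "authorized_entry"))
        else:
            results.append((t_face, "alarm"))
    return results
```

A known face with a matching swipe one second later yields `authorized_entry`; an unknown face, a mismatched ID or a swipe outside the window yields `alarm` in this sketch.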

V. Properties

A multi-sensor, multi-modal detection system has several properties [3] that bridge the system components and the EMs. The properties refer to the overall aspects of the system rather than to those of a particular component. They are listed as follows,

  • State – The values of all system components at a particular time instance
  • Focus – The set of EMs pursued by the system at a particular time instance
  • Priority – Preference of detection of a particular EM at a particular time
  • Constraints – Set of spatial, temporal or logical rules on various components in a specific state

We show these properties in the pipeline diagram below, with components on the Y-axis and time on the X-axis. Multiple events are detected in parallel, with each component processing EMs based on availability and priority. EMs can be of various lengths and can wait for components to free up or for other events to occur.
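
The Focus and Priority properties can be illustrated with a small queue for a single component. This is a sketch under stated assumptions: the `ComponentQueue` class, the numeric priorities and the EM names used with it are hypothetical, and real scheduling would also have to respect the Constraints property.

```python
import heapq
import itertools


class ComponentQueue:
    """Pending EM work for one component, served highest priority first."""

    def __init__(self):
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order

    def submit(self, em_name, priority):
        # heapq is a min-heap, so negate the priority to pop the largest first.
        heapq.heappush(self._heap, (-priority, next(self._order), em_name))

    def next_em(self):
        """Return the next EM to process, or None when the component is idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

    def focus(self):
        """The set of EMs currently waiting on this component."""
        return {name for _, _, name in self._heap}
```

When the component frees up, `next_em()` hands it the waiting EM with the highest priority, while lower-priority EMs remain in the focus set until their turn.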

VI. Sensors

The sensor components can be categorized by the amount of computation required to analyse the data stream for events. As mentioned earlier, discrete signals such as RFID swipes require little to no computation, while continuous signals such as voice and video require significant computational resources for event detection. The following is an illustrative list of sensors, ordered by data stream complexity,


Sensor Type           | Data Stream        | Event Detection
----------------------+--------------------+----------------------------------------------------------------------------------
                      | video stream       | detection and recognition of faces, persons, objects
                      | audio stream       | voice recognition, unusual sound detection e.g., gunshots, glass break, etc.
                      |                    | fingerprint recognition
Motion (pyroelectric) | IR stream          | person detection, unusual activity detection e.g., earthquake, forced entry, etc.
                      | temperature stream | fire detection
RFID scanner          | RFID info          | authorized person detection
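
The difference in computation between discrete and continuous signals can be sketched as follows. The first function only watches for an off-to-on transition, as with a binary smoke detector or an ID swipe; the second has to process a numeric stream, here with a moving average over a temperature stream. The function names, window size and threshold value are illustrative assumptions, and real video or audio detection would need far heavier processing than this.

```python
from collections import deque


def detect_discrete(samples):
    """Emit the index of every off-to-on transition of a binary signal."""
    events = []
    previous = 0
    for i, value in enumerate(samples):
        if value and not previous:
            events.append(i)
        previous = value
    return events


def detect_continuous(samples, window=3, threshold=60.0):
    """Emit the index at which the moving average rises above `threshold`."""
    events = []
    recent = deque(maxlen=window)   # sliding window over the stream
    above = False
    for i, value in enumerate(samples):
        recent.append(value)
        full = len(recent) == window
        is_above = full and sum(recent) / window > threshold
        if is_above and not above:
            events.append(i)        # report only the upward crossing
        above = is_above
    return events
```

Smoothing over a window is one simple way to tolerate a single noisy reading, at the cost of detecting the event a few samples later.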

VII. Conclusion

In this document, we present a generalized framework for event detection using multiple sensors from different modalities. We define and elaborate important concepts for the framework and support them with appropriate diagrams. We describe the various categories of components which build up the system and provide relevant examples for the face recognition use case.


[1] Hassan, Ehtesham, Gautam Shroff, and Puneet Agarwal. “Multi-sensor event detection using shape histograms.” Proceedings of the Second ACM IKDD Conference on Data Sciences. ACM, 2015.

[2] Long, William J., et al. “Detection of intrusion across multiple sensors.” AeroSense 2003. International Society for Optics and Photonics, 2003.

[3] Crispim-Junior, Carlos F., et al. “Combining Multiple Sensors for Event Detection of Older People.” Health Monitoring and Personalized Feedback using Multimedia Data. Springer International Publishing, 2015. 179-194.

[4] Vermesan, Ovidiu, and Peter Friess, eds. Internet of things: converging technologies for smart environments and integrated ecosystems. River Publishers, 2013.

[5] http://ibeyonde.com/wordpress/ibeyonde-the-cloud-services-powering-iot/