<?xml version="1.0"?> 

<!DOCTYPE rfc SYSTEM "rfc2629.dtd"> 

<?rfc toc="yes" ?> 

<?rfc compact="yes" ?> 

<?rfc sortrefs="no" ?>

<rfc ipr="trust200811" docName="draft-romano-dcon-recording-05">

<front>
	<title abbrev="DCON Session Recording">
		Session Recording for Conferences using SMIL
	</title>

	<author initials="A." surname="Amirante" fullname="Alessandro Amirante">
		<organization>University of Napoli</organization>
		<address>
			<postal>
				<street>Via Claudio 21</street>
				<code>80125</code> 
				<city>Napoli</city> 
				<country>Italy</country>
			</postal>
			<email>alessandro.amirante@unina.it</email>
		</address>
	</author>
	
 <author initials="T." surname="Castaldi" fullname="Tobia Castaldi">
	<organization>Meetecho</organization>
	<address>
		<postal>
			<street>Via Carlo Poerio 89</street>
			<code>80100</code> 
			<city>Napoli</city> 
			<country>Italy</country>
		</postal>
		<email>tcastaldi@meetecho.com</email>
	</address>
	</author>
	
	<author initials="L." surname="Miniero" fullname="Lorenzo Miniero">
	<organization>Meetecho</organization>
	<address>
		<postal>
			<street>Via Carlo Poerio 89</street>
			<code>80100</code> 
			<city>Napoli</city> 
			<country>Italy</country>
		</postal>
		<email>lorenzo@meetecho.com</email>
	</address>
	</author>
	
	<author initials="S P" surname="Romano" fullname="Simon Pietro Romano">
		<organization>University of Napoli</organization>
		<address>
			<postal>
				<street>Via Claudio 21</street>
				<code>80125</code> 
				<city>Napoli</city> 
				<country>Italy</country>
			</postal>
			<email>spromano@unina.it</email>
		</address>
	</author>
	
	<date month="December" year="2011"/>
	<area>RAI</area>
	<workgroup>DISPATCH</workgroup>
	<keyword>XCON</keyword>
	<keyword>DCON</keyword>
	<keyword>MEDIACTRL</keyword>
	<keyword>Conferencing</keyword>
	<keyword>Session Recording</keyword>
	
	
	<abstract>
		<t>
			This document deals with session recording, specifically the
			recording of multimedia conferences, both centralized and
			distributed. Each medium involved is recorded separately and
			then properly tagged. SMIL <xref target="W3C.CR-SMIL3-20080115"/>
			metadata is used to put all the separate recordings together and
			handle their synchronization, as well as the possibly asynchronous
			opening and closure of media within the context of a conference.
			This SMIL metadata can subsequently be used by an interested
			user, by means of a compliant player, to passively receive
			a playout of the whole multimedia conference session. The motivation
			for this document comes from our experience with our conferencing
			framework, Meetecho, for which we implemented a recording functionality.
		</t>
	</abstract>
	
</front>

<middle>
	
	<!-- Intro -->
	<section title="Introduction" anchor="sec-intro">
		<t>
			This document deals with session recording, specifically the
			recording of multimedia conferences, both centralized and
			distributed. Each medium involved is recorded separately and
			then properly tagged. Such functionality is required
			in many conferencing systems, and is of great interest to the
			XCON <xref target="RFC5239"/> Working Group. The motivation
			for this document comes from our experience with our conferencing
			framework, Meetecho, for which we implemented a recording functionality.
			Meetecho is a standards-based conferencing framework, so we
			tried our best to implement recording in a standard fashion as well.
		</t>
		<t>
			In the approach presented in this document, SMIL <xref target="W3C.CR-SMIL3-20080115"/>
			metadata is used to put all the separate recordings together and
			handle their synchronization, as well as the possibly asynchronous
			opening and closure of media within the context of a conference.
			This SMIL metadata can subsequently be used by an interested
			user, by means of a compliant player, to passively receive
			a playout of the whole multimedia conference session.
		</t>
		<t>
			The document presents the approach by describing the
			required steps in sequence. First, <xref target="sec-recording"/>
			presents the recording step, with an overview of how each involved
			medium might be recorded and stored for future use. As will be
			explained in the following sections, existing approaches might be
			exploited to achieve this step (e.g. MEDIACTRL <xref target="RFC5567"/>). Then,
			<xref target="sec-tagging"/> describes the tagging process,
			showing how each medium can be addressed in a SMIL metadata file,
			with specific focus upon the timing and inter-media synchronization
			aspects. Finally, <xref target="sec-playout"/> is devoted to describing
			how a potential player for the recorded session can be implemented
			and what it is supposed to achieve.
		</t>
	</section>
		
	<!-- Conventions -->
	<section title="Conventions" anchor="sec-conv">
		<t>
			In this document, the key words "MUST", "MUST NOT", "REQUIRED",
			"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "NOT
			RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as
			described in BCP 14, RFC 2119 <xref target="RFC2119"/> and
			indicate requirement levels for compliant implementations.
		</t>
	</section>
	
	<!-- Terminology -->
	<section title="Terminology" anchor="sec-teminology">
		<t>
			TBD.
		</t>
	</section>
		
	<!-- Recording -->
	<section title="Recording" anchor="sec-recording">
		<t>
			When a multimedia conference is realized over the Internet,
			several media might be involved at the same time. Moreover,
			these media might come and go asynchronously during the
			lifetime of the same conference. If such a conference needs
			to be recorded in order to allow a subsequent, possibly offline,
			playout, these media therefore need to be recorded in a format
			that is aware of all the timing-related aspects. A typical
			example is a videoconference with slide sharing. While audio and
			video have a life of their own, slide changes might be triggered
			at a completely different pace, and the start of a slideshow might
			occur much later than the start of the audio/video session. All
			these requirements must be taken into account when dealing with
			session recording in a conference. It is also important that all
			the individual recordings be taken in a standard fashion, in order
			to achieve the maximum compatibility among different solutions and
			avoid any proprietary mechanism or approach that could prevent a
			successful playout later on.
		</t>
		<t>
			In this document, we present our approach towards media recording
			in a conference. Specifically, we will deal with the recording
			of the following media:
		</t>
		<t>
		<list style="symbols">
			<t>audio and video streams (in <xref target="sec-rec-av"/>);</t>
			<t>text chats (in <xref target="sec-rec-chat"/>);</t>
			<t>slide presentations (in <xref target="sec-rec-slides"/>);</t>
			<t>whiteboards (in <xref target="sec-rec-wb"/>).</t>
		</list>
		</t>
		<t>
			Additional media that might be involved in a conference (e.g. desktop
			or application sharing) are not presented in this document, and their
			description is left to future extensions.
		</t>

		<!-- Audio/Video Recording -->
		<section title="Audio/Video" anchor="sec-rec-av">
		<t>
			In a conferencing system compliant with <xref target="RFC5239"/>,
			audio and video streams contributed by participants are carried in RTP
			channels <xref target="RFC3550"/>. These RTP channels may or may not be secured
			(e.g. by means of SRTP/ZRTP). Whether or not these channels are secured
			is not an issue in this case. In fact, as is usually the case, all the participants
			terminate their media streams at a central point (a mixer entity), with which they
			would have a secured connection. This means that the mixer would get access
			to the unencrypted payloads, and would be able to mix and/or store them accordingly.
		</t>
		<t>
			From a high-level topology point of view, this is how a recorder for
			audio and video streams could be envisaged:
		</t>
		<t>
			<figure anchor="fig-rec-av" title="Audio/Video Recorder">
				<artwork>
					<![CDATA[
           SIP   +------------+ SIP
      /----------|   XCON AS  |--------
     /           +------------+        \
    /                   |MEDIACTRL      \
   /                    |                \
+-----+              +-----+              +-----+
|     |     RTP      |     |   RTP        |     |
|UA-A +<------------>+Mixer+<------------>+UA-B |
|     |              |     |              |     |
+-----+              +-++--+              +-----+
                      |   |
           RTP UA-A   |   | RTP UA-B (Rx+Tx)
           (Rx+Tx)    V   V
                   +----------+
                   |          |
                   | Recorder |
                   |          |
                   +----------+
						]]>
				</artwork>
			</figure>
		</t>
		<t>
		<list style="hanging">
		<t>[Editors' Note: this is a slightly modified version of the topology proposed
		on the DISPATCH mailing list,</t>
		<t>http://www.ietf.org/mail-archive/web/dispatch/current/msg00256.html</t>
		<t>where the Application Server has been specialized in an XCON-aware AS,
		and the AS&lt;-&gt;Mixer protocol is the Media Control Channel Framework
		protocol (CFW) specified in <xref target="RFC6230"/>.]</t>
		</list>
		</t>
		<t>
			That said, actually recording audio and video streams in a conference may be accomplished
			in several ways. Two different approaches might be highlighted:
		</t>
		<t>
		<list style="numbers">
			<t>recording each contribution from/to each participant in a separate file (<xref target="fig-rec-sep"/>);</t>
			<t>recording the overall mix (one for audio and one for video, or more if several
			mixes for the same media type are available) in a dedicated file (<xref target="fig-rec-mix"/>).</t>
		</list>
		</t>
		<t>
			<figure anchor="fig-rec-sep" title="Recording individual streams">
				<artwork>
					<![CDATA[
                              +-------+
                              | UAC-C |
                              +-------+
                                  "
                          C (RTP) "
                                  "
                                  "
                                  v
+-------+  A (RTP)           +----------+           B (RTP)  +-------+
| UAC-A |===================>| Recorder |<===================| UAC-B |
+-------+                    +----------+                    +-------+
                                  *
                                  *
                                  *
                                  ****> A.gsm, A.h263
                                  ****> B.g711, B.h264
                                  ****> C.amr
						 ]]>
				</artwork>
			</figure>
		</t>
		<t>
			<figure anchor="fig-rec-mix" title="Recording mixed streams">
				<artwork>
					<![CDATA[
                              +-------+
                              | UAC-C |
                              +-------+
                                  "
                          C (RTP) "
                                  "
                                  "
                                  v
+-------+  A (RTP)           +----------+           B (RTP)  +-------+
| UAC-A |===================>| Recorder |<===================| UAC-B |
+-------+                    +----------+                    +-------+
                                  *
                                  *
                                  *
                                  ****> (A+B+C).wav, (A+B+C).h263
						 ]]>
				</artwork>
			</figure>
		</t>
		<t>
			Of the two, the second is probably more feasible. In fact, the first would require
			a potentially vast number of separate recordings, which would need to be subsequently
			muxed and correlated to each other. Besides, within the context of a multimedia
			conference, most of the time the streams are already mixed for all the
			participants, and so recording the mix directly would be a clear advantage.
			Such an approach, of course, assumes that all the streams pass through a
			central point where the mixing occurs: this is the case depicted in <xref target="fig-rec-av"/>.
			The recording would take place at that point. Such a central point, the mixer
			(which in this case would also act as the recorder, or a frontend to it), might be a MEDIACTRL-based
			<xref target="RFC5567"/> Media Server. Considering the similar nature of
			audio and video (both being RTP-based and probably mixed by the same entity),
			they are analysed in the same section of this document. The same applies to
			tagging and playout as well. It is important to note that in case any
			policy is involved (e.g. moderation by means of BFCP <xref target="RFC4582"/>)
			the mixer would take it into account when recording. In fact, the same policies
			applied to the actual conference with respect to the delivery of audio and video
			to the participants need to be enforced for the recording as well.
		</t>
		<t>
			In a more general way, if the mixer does not support a direct recording of the mixes it
			prepares, recording a mix can be achieved by attaching the recorder entity
			(whatever it is) as a passive participant to the conference. This would allow the
			recorder to receive all the involved audio and video streams already properly mixed,
			with policies already taken into consideration. This approach is depicted in
			<xref target="fig-rec-passive"/>.
		</t>
		<t>
			<figure anchor="fig-rec-passive" title="Recorder as a passive participant">
				<artwork>
					<![CDATA[
                             +-------+
                             |  UAC  |
                             |   C   |
                             +-------+
                                " ^
                        C (RTP) " "
                                " "
                                " " A+B (RTP)
                                v "
+-------+  A (RTP)           +--------+  A+C (RTP)         +-------+
|  UAC  |===================>| Media  |===================>|  UAC  |
|   A   |<===================| Server |<===================|   B   |
+-------+         B+C (RTP)  +--------+           B (RTP)  +-------+
                                 "
                                 "
                                 " A+B+C (RTP)
                                 "
                                 v
                           +----------+
                           | Recorder |
                           +----------+
                                 *
                                 ****> (A+B+C).wav, (A+B+C).h263
						 ]]>
				</artwork>
			</figure>
		</t>
		<t>
			Whether or not the mixer is MEDIACTRL-based, it is quite likely that the AS handling
			the multimedia conference business logic has some control over the mixing involved.
			This means it can request the recording of each available audio and/or video mix
			in a conference, if only by adding the passive participant as mentioned above.
			Besides, events occurring at the media level or business logic
			in the AS itself allow the AS to take note of timing information for each of the
			recorded media. For instance, the AS may take note of when the video mixing started,
			in order to properly tag the video recording in the tagging phase. Both the recordings
			and the timing events list would subsequently be used in order to prepare the metadata
			information of the audio and video in the overall session recording description. Such
			a phase is described in <xref target="sec-tag-av"/>.
		</t>
		<t>
			In a MEDIACTRL Media Server, such a functionality might be accomplished by means of the
			Mixer Control Package <xref target="I-D.ietf-mediactrl-mixer-control-package"/>. At the
			end of the conference, URLs to the actual recordings would be made available for the AS
			to use. The AS might then subsequently access those recordings according to its
			business logic, e.g. to store them somewhere else (the MS storage might be temporary)
			or to implement an offline transcoding and/or mixing of all the recordings in order
			to obtain a single file representative of the whole audio/video participation in
			the conference. Practical examples of these scenarios are presented in
			<xref target="I-D.ietf-mediactrl-call-flows"/>.
		</t>
		<t>
			Of course, if the recording of a mix is not possible or desired, one could still
			fall back to the first approach, that is, individually recording all the incoming contributions.
			This is the case, for instance, of conferencing systems which do not implement video mixing, but
			rely instead on switching/forwarding the potentially several video streams to each participant.
			This functionality can also be achieved by means of the same control package previously
			introduced, since it allows for the recording of both mixes and individual connections.
			Once the conference ends, the AS can then decide what to do with the recordings,
			e.g. mixing them all together offline (thus obtaining an overall mix) or leaving them as they are.
			The tagging process would then take the decision into account, and address
			the resulting media accordingly.
		</t>
		</section>

		<!-- Chat Recording -->
		<section title="Chat" anchor="sec-rec-chat">
		<t>
			What has been said about audio and video partially applies to text chats as well. In fact,
			just as a central mixer is usually involved for audio and video, for instant messaging
			the contributions by all participants most often pass through a central node
			from where they are forwarded to the other participants. This is the case, for instance,
			of XMPP <xref target="RFC3920"/> and MSRP <xref target="RFC4975"/> based text conferences.
			If so, recording of the text part of a conference is not hard to achieve either.
			The AS just needs to implement some form of logging, in order to store all the messages
			flowing through the text conference central node, together with information on the senders
			of these messages and timing-related information. Of course, the AS may not directly
			be the text conference mixer: the same considerations apply, however, in the sense that
			the remote mixer must be able to implement the aforementioned logging, and must be
			able to receive related instructions from the controlling AS. Besides, considering
			the possible protocol-agnostic nature of the conferencing system (as envisaged in
			<xref target="RFC5239"/>), several different instant messaging protocols may be
			involved in the same conference. Just as the conferencing system would act as
			a protocol gateway during the lifetime of the conference (i.e. provide MSRP users
			with the text coming from XMPP participants and vice versa), all the contributions
			coming from the different instant messaging protocols would need to be recorded
			in the same log, and in the same format, to avoid ambiguity later on.
		</t>
		<t>
			An example of a recorder for instant messaging is presented in <xref target="fig-rec-chat"/>.
		</t>	
		<t>
			<figure anchor="fig-rec-chat" title="Recording a text conference">
				<artwork>
					<![CDATA[
                              +-------+
                              | UAC-C |
                              +-------+
                                  ^
                         C (MSRP) " '10.11.24 Hi!'
                                  "
                                  "
                                  v
+-------+  A (XMPP)          +----------+           B (IRC)  +-------+
| UAC-A |<==================>| Recorder |<==================>| UAC-B |
+-------+  '10.11.26 Hey C'  +----------+ '10.11.30 Hey man' +-------+
                                  *
                                  *
                                  *     [..]
                                  ****> 10.11.24 <User C> Hi!
                                  ****> 10.11.26 <User A> Hey C
                                  ****> 10.11.30 <User B> Hey man
                                        [..]
						 ]]>
				</artwork>
			</figure>
		</t>
		<t>
			The same considerations already mentioned
			about optional policies involved apply to text conferences as well: i.e., if a UAC is not
			allowed to contribute text to the chat, this contribution is excluded both from the
			mix the other participants receive and from the ongoing recording.
		</t>
		<t>
			Considerations about the format of the recording are left to <xref target="sec-tag-chat"/>.
			Until then, we just assume the AS has a way to record text conferences somehow in a format
			it is familiar with. This format would subsequently be converted to another, standard,
			format that a player would be able to access.
		</t>
		</section>

		<!-- Slides Recording -->
		<section title="Slides" anchor="sec-rec-slides">
		<t>
			Another medium typically available in a multimedia conference over the Internet is
			the slide presentation. In fact, slides, whatever format they are in, are still
			the most common way of presenting something within a collaboration framework.
			The problem is that, most of the time, these slides are deployed in a proprietary
			way (e.g. Microsoft PowerPoint and the like). This means that, besides the recording
			aspect of the issue, the delivery itself of such slides can be problematic when
			considered in a standards-based conferencing framework.
		</t>
		<t>
			Considering that no standard way of implementing such a functionality is commonly
			available yet, we assume that a conferencing framework makes such slides available
			to the participants in a conference as a slideshow, that is, a series of static
			images whose appearance might be dictated by a dedicated protocol. For instance,
			a presenter may trigger the change of a slide by means of an instant messaging
			protocol, providing each authorized participant with a URL from where to get
			the current slide with optional metadata to describe its content.
		</t>
		<t>
			An example is presented in <xref target="fig-rec-slide"/>. The presenter has
			previously uploaded a presentation in a proprietary format. The
			presentation has been converted to images, and a description of the new
			format (e.g. XML metadata) has been sent back to the presenter. At this
			point, the presenter makes use of XMPP to inform the other participants
			about the current slide, by providing an HTTP URL to the related image.
		</t>
		<t>
			<figure anchor="fig-rec-slide" title="Presentation sharing via XMPP">
				<artwork>
					<![CDATA[
                             +-----------+
                             | Presenter |
                             +-----------+
                                  "
                          (XMPP)  " Current presentation: f44gf
                                  " Current slide number: 4
                                  " URL: http://example.com/f44gf/4.jpg
                                  "
                                  v
+-------+  (XMPP)            +----------+            (XMPP)  +-------+
| UAC-A |<===================| ConfServ |===================>| UAC-B |
+-------+                    +----------+                    +-------+
    |                                                            |
    | HTTP GET (http://example.com/f44gf/4.jpg)                  |
    v                  HTTP GET (http://example.com/f44gf/4.jpg) |
                                                                 v
						 ]]>
				</artwork>
			</figure>
		</t>
		<t>
			From this assumption, the recording of each slide presentation would be relatively
			trivial to achieve. In fact, the AS would just need to have access to the set
			of images (with the optional metadata involved) of each presentation, and to
			the additional information related to presenters and to when each slide was triggered.
			For instance, the AS may take note of the fact that slide 4 from presentation "f44gf"
			of the example above
			has been presented by UAC "spromano" from second 56 to second 302 of the conference.
			Properly recording all those events would allow for a subsequent tagging, thus
			allowing for the integration of this medium in the whole session recording description
			together with the other media involved. This phase will be described in <xref target="sec-tag-slides"/>.
		</t>
		</section>

		<!-- Whiteboard Recording -->
		<section title="Whiteboard" anchor="sec-rec-wb">
		<t>
			To conclude the overview on the analysed media, we consider a further medium
			which is quite commonly deployed in multimedia conferences: the shared
			whiteboard. There are several ways of implementing such a functionality.
			While some standard solutions exist, they are rarely used within the context of
			commercial conferencing applications, which usually prefer to implement it
			in a proprietary fashion.
		</t>
		<t>
			Without delving into a discussion on this aspect, suffice it to say that
			for a successful recording of a whiteboard session it is most often
			enough to just record the individual contributions of each involved
			participant (together with the usual timing-related information). In fact,
			this would allow for a subsequent replay of the whiteboard
			session in an easy way. Unlike audio and video, whiteboarding is usually
			a very lightweight medium, and so recording the individual contributions
			rather than the resulting mix (as we suggested in <xref target="sec-rec-av"/>)
			is advisable. These contributions may subsequently be mixed together in
			order to obtain a standard recording (e.g. a series of images, animations,
			or even a low framerate video). An example of recording for this medium
			is presented in <xref target="fig-rec-wb"/>.
		</t>
		<t>
			<figure anchor="fig-rec-wb" title="Recording a whiteboard session">
				<artwork>
					<![CDATA[
                               +-------+
                               | UAC-C |
                               +-------+
                                   "
                          C (XMPP) " 10.11.20: line
                                   "
                                   "
                                   v
+-------+  A (XMPP)          +-----------+          B (XMPP)  +-------+
| UAC-A |===================>| WB server |<===================| UAC-B |
+-------+  10.10.56: circle  +-----------+    10.12.30: text  +-------+
                                   *
                                   *
                                   *
                                   ****> 10.10.56: circle (A)
                                   ****> 10.11.20: line (C)
                                   ****> 10.12.30: text (B)
						 ]]>
				</artwork>
			</figure>
		</t>
		<t>
			The recording process may be enriched by the population of a parallel
			event list. For instance, optimizations might include events such as the
			creation of a new whiteboard, the clearing of an existing whiteboard
			or the adding of a background image that replaces the previously existing
			content. Such events would be valuable in a subsequent playout of the
			recorded steps, since they would allow for a more lightweight replication
			in case seeking is involved. For instance, if 70 drawings have been done,
			but at second 560 of the conference the whiteboard has been cleared and
			since then only 5 drawings have been added, a viewer seeking to second 561
			or later would just need the clear and the subsequent drawings to be replicated. Further
			discussion of the tagging process for this medium is presented in <xref target="sec-tag-wb"/>.
		</t>
		</section>

	</section>

	<!-- Tagging -->
	<section title="Tagging" anchor="sec-tagging">
		<t>
			Once the different media have been recorded and stored, and
			their timing related somehow, this information needs to be
			properly tagged in order to allow intra-media and inter-media
			synchronization in case a playout is invoked. Besides, it
			would be desirable to make use of standard means for
			achieving such a functionality. For these reasons, we chose
			to make use of the Synchronized Multimedia Integration
            Language <xref target="W3C.CR-SMIL3-20080115"/>, which
			fulfills all the aforementioned requirements, besides
			being a well-established W3C standard. In fact, timing
			information is very easy to address using this specification,
			and VCR-like controls (start, pause, stop, rewind, fast forward,
			seek and the like) are all easily deployable in a player
			using the format.
		</t>
		<t>
			The SMIL specification provides means to address different
			media by using dedicated elements (e.g. audio, img, textstream and
			so on), and for each of these media the related timing
			can be easily described. The following subsections will
			describe how SMIL metadata could be prepared in order
			to map to the media recorded as described in <xref target="sec-recording"/>.
		</t>
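		<t>
			As a minimal sketch of how such elements carry timing (the
			file names and time values below are illustrative assumptions,
			not taken from an actual recording), an audio recording and a
			slide image might be addressed as follows. Both are grouped in
			a par element so that they are timed against the same start;
			the begin and dur attributes state when each medium starts and
			for how long it is rendered:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<par>
  <!-- hypothetical audio mix: starts immediately, lasts 10 minutes -->
  <audio src="audio-mix.wav" begin="0s" dur="600s"/>
  <!-- hypothetical slide image: appears after 1 minute, for 2 minutes -->
  <img src="slide-1.jpg" begin="60s" dur="120s"/>
</par>
				]]>
				</artwork>
			</figure>
		</t>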
		<t>
			Specifically, considering how a SMIL file is assumed to be
			constructed, the head will be described in <xref target="sec-tag-head"/>,
			while the body (with different focus for each medium) will be
			presented in <xref target="sec-tag-body"/>.
		</t>

		<!-- Header -->
		<section title="SMIL Head" anchor="sec-tag-head">
		<t>
			As specified in <xref target="W3C.CR-SMIL3-20080115"/>, a SMIL
			file is composed of two separate sections: a head and a body.
			The head, among all the needed information, also includes
			details about the allowed layouts for a multimedia presentation.
			Considering the number of media that might have been involved
			in a single conference, properly constructing such a section
			is well worth the effort. In fact, each involved medium
			needs to be placed so that it does not prevent access to other
			concurrent media within the context of the same recording.
		</t>
		<t>
			For instance, this is how a series of different media might be
			placed in a layout according to different screen resolutions:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<?xml version="1.0" encoding="UTF-8"?>
<smil xmlns:xml="http://www.w3.org/XML/1998/namespace">
  <head>
    <switch systemScreenSize="800X600">
      <layout>
        <root-layout width="800" height="600" background-color="black"/>
        <region id="image0" regionName="image" fit="fill" top="310" \
                left="370" width="400" height="350" />
        <region id="video0" regionName="video" top="0" left="370" \
                width="430" height="310" fit="fill" />
        <region id="chat0" regionName="chat" fit="fill" alt="chat" \
                top="410" left="370" width="400" height="-60"/>
        <region id="wb0" regionName="wb" top="0" left="0" width="370" \
                height="520"/>
      </layout>
    </switch>
    <switch systemScreenSize="1024X768">
      <layout>
        <root-layout width="1024" height="768" \
                     background-color="black"/>
        <region id="image1" regionName="image" fit="fill" top="310" \
                left="594" width="400" height="350"/>
        <region id="video1" regionName="video" top="0" left="594" \
                width="430" height="310" fit="fill"/>
        <region id="chat1" regionName="chat" fit="fill" alt="chat" \
                top="578" left="594" width="400" height="108"/>
        <region id="wb1" regionName="wb" top="0" left="0" width="594" \
                height="688"/>
      </layout>
    </switch>
[..]
				]]>
				</artwork>
			</figure>
		</t>
		<t>
			Constructing this section of the SMIL file properly matters
			for a further reason: the layout description also contains
			explicit region identifiers, which are referred to when
			describing media in the body section.
		</t>
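		<t>
			As a minimal sketch of this linkage, a media element in the body
			could point to one of the regions declared in the layout above
			through its 'region' attribute. The media URL below is only a
			placeholder, and the timing values are illustrative assumptions:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<body>
  <par>
    <!-- 'region' refers to a region declared in the head; the -->
    <!-- URL and the timing values are placeholders            -->
    <video src="http://www.example.com/conference45.avi" \
           region="video" begin="0s" end="400s"/>
  </par>
</body>
				]]>
				</artwork>
			</figure>
		</t>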
		<t>
			TBD. (?)
		</t>
		</section>

		<!-- Body -->
		<section title="SMIL Body" anchor="sec-tag-body">
		<t>
			The SMIL head section described previously is very important
			for presentation-related settings, but it does not contain any
			timing-related information. Such information belongs to a
			separate section of the SMIL file, the so-called body. The body
			describes all the media involved in the recorded session, and
			provides timing information for each of them. This timing
			information specifies when each medium appears and when it
			goes away, and therefore its overall lifetime. By correlating
			the timing information of the different media, a SMIL reader
			can infer inter-media synchronization and present the recorded
			session as it was conceived to appear.
		</t>
		<t>
			In addition, the involved media can be grouped in the body in order
			to implement sequential and/or parallel playback involving
			a subset of the available media. This is made possible by
			making use of the &lt;seq&gt; and &lt;par&gt; elements. The
			&lt;par&gt; element in particular is of great interest to
			this document, since in a multimedia conference many media
			are presented to participants at the same time.
		</t>
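		<t>
			The following fragment is a hedged sketch of how the two
			aggregators could be nested: an introductory clip is played on
			its own, followed by a video and a separate audio track played
			in parallel. All file names are hypothetical and are not part
			of the scenario described in this document:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<seq>
   <!-- children of <seq> are played one after the other -->
   <video src="http://www.example.com/intro.avi" region="video"/>
   <par>
      <!-- children of <par> are played at the same time -->
      <video src="http://www.example.com/talk.avi" region="video"/>
      <audio src="http://www.example.com/talk.wav"/>
   </par>
</seq>
				]]>
				</artwork>
			</figure>
		</t>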
		<t>
			That said, it is important to be able to address each involved
			medium separately. To do so, SMIL makes use of well-specified
			elements. For instance, a &lt;video&gt; element is used to
			state the presence of a video stream in the session. Each of
			these elements can be further customized and configured
			by means of dedicated attributes. For instance, the 'src' attribute
			in a &lt;video&gt; element means that the actual video stream source
			can be found at the provided address.
		</t>
		<t>
			The element for each medium is also the place where SMIL adds
			information about when that medium comes into play.
			This is done by means of two attributes, called 'begin' and 'end'
			respectively. As the names suggest, the 'begin' attribute
			gives a temporal reference for the start of the medium, while the 'end' attribute
			specifies when it ends. For instance, an element formatted in the
			following way:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<video src="http://www.example.com/conference45.avi" region="box12" \
       begin="15s" end="400s"/>
				]]>
				</artwork>
			</figure>
		</t>
		<t>
			means that a video stream (whose URL is provided in 'src')
			must start playing 15 seconds after the beginning of the session,
			and must end 400 seconds after the beginning of the session, i.e.,
			385 seconds after it started. This information is also used when
			seeking through a session. For instance, if a user accessing the
			recording seeks to 200 seconds after the beginning, the video is
			shown from its relative time of 200-15=185 seconds.
		</t>
		<t>
			Considering the recorded media presented in <xref target="sec-recording"/>,
			the following sections describe how the related parts of the body are constructed:
		</t>
		<t>
		<list style="symbols">
			<t>audio/video streams (in <xref target="sec-tag-av"/>);</t>
			<t>text chats (in <xref target="sec-tag-chat"/>);</t>
			<t>slide presentations (in <xref target="sec-tag-slides"/>);</t>
			<t>whiteboards (in <xref target="sec-tag-wb"/>).</t>
		</list>
		</t>
		
		<!-- Audio/Video Tagging -->
		<section title="Audio/Video" anchor="sec-tag-av">
		<t>
			In SMIL, the element to describe an audio stream is &lt;audio&gt;,
			while for video the element is &lt;video&gt;. Considering that these
			two stream types are handled in a very similar way, only video will
			be addressed. This is an explicit choice for two reasons: (i) video
			is slightly more complex to address than audio, and so treating
			video makes more sense; (ii) often off-line encoders/muxers will
			place the recorded elementary audio and video streams in a single
			video container, which means both streams can actually be addressed
			in a single media file.
		</t>
		<t>
			That said, &lt;video&gt; is the element used in a SMIL body to state the presence
			of an audio/video stream. Its timing, relative to other media, might be
			implemented by making use of a &lt;par&gt;/&lt;seq&gt; aggregator. In such
			an element, some attributes are of great relevance and should be included:
		</t>
		<t>
		<list style="symbols">
			<t>'src', to address the actual video file to use (usually an HTTP URL);</t>
			<t>'begin' and 'end', for timing information (when the video should appear/disappear in the session);</t>
			<t>'region', to specify where the stream will need to appear in the layout as configured in the head (e.g. place it in the region called box12).</t>
		</list>
		</t>
		<t>
			All this information can easily be derived from the stream as recorded previously (optionally re-encoded and/or re-muxed),
			together with the timing information that is part of the event log. The 'src', in particular, can be
			any video file, which means that encoding the stream in a format suitable for a given player is easy to achieve.
		</t>
		<t>
			Besides, as mentioned in <xref target="sec-rec-av"/>, recordings may be available as already mixed
			streams, or individual streams. In case the recording is already mixed, then the tagging can be
			done as seen in the previous paragraph:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<video src="http://www.example.com/conference45.avi" region="box12" \
       begin="15s" end="400s"/>
				]]>
				</artwork>
			</figure>
		</t>
		<t>
			where this element would state the presence of an audio/video stream, to appear in the
			specified region in the specified range of time. In case several recordings are available,
			instead, the tagging would be a little more complex: in fact, the metadata would need to
			address the parallel playback of the different recordings, which would also need to
			reflect the actual lifetime of the original streams in the conference. For instance,
			if UAC A joined the conference long before UAC B, its contributions would appear in
			the playout accordingly. An example of how this could be achieved in SMIL metadata
			is presented here:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<par>
   [..]
   <video src="http://www.example.com/userA.avi" region="box12" \
          begin="15s" end="400s"/>
   <video src="http://www.example.com/userB.avi" region="box16" \
          begin="230s" end="521s"/>
   [..]
</par>
				]]>
				</artwork>
			</figure>
		</t>
		<t>
			These lines tell an interested player that the two specified
			video streams (whose URLs are provided in the respective 'src'
			attributes) must be played in parallel, and in different regions.
			However, video stream 'userA.avi' starts 15 seconds after the
			beginning of the conference, while 'userB.avi' starts 230 seconds
			after it, reflecting the appearance of these media in the
			conference itself.
		</t>
		</section>

		<!-- Chat Tagging -->
		<section title="Chat" anchor="sec-tag-chat">
		<t>
			Text in SMIL can be addressed in several different ways, the most common ones being the &lt;text&gt;
			and &lt;textstream&gt; elements. &lt;text&gt;, however, usually deals only with static text content,
			that is, text without timing information (e.g. HTML). For this reason, &lt;textstream&gt; should
			be used instead, since it allows text to appear and disappear in real time.
		</t>
		<t>
			The attributes to configure the element are basically the same as the ones presented for &lt;video&gt;
			(src, region, begin, end). The difference lies in the file referred to in the
			'src' attribute. In fact, if timing information is needed, a proper format for timed
			text is needed. The &lt;textstream&gt; element supports RealText Markup, which is a separate
			markup language for dealing with real-time text. It is the format used, for instance, for
			subtitle captioning. An example of RealText is presented in the following lines:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<window width="340" height="160" wordwrap="true" loop="false" \
        bgcolor="white">
   <font color="black" face="Arial" size="+0">
      <Time begin="0:00:02.2"/><br/>&lt;User C&gt;Hi
      <Time begin="0:00:04.5"/><br/>&lt;User A&gt;Hey C
      <Time begin="0:00:08.1"/><br/>&lt;User B&gt;Hey man
      [..]
				]]>
				</artwork>
			</figure>
		</t>
		<t>
			This example recalls <xref target="fig-rec-chat"/>, where the first message (by User C)
			was sent at 10.11.24. Assuming the text conference started at 10.11.22, the log is
			converted to RealText and tagged accordingly (e.g. User C sending his first message two
			seconds after the conference started). The RealText file can then be addressed in
			SMIL using the aforementioned &lt;textstream&gt; element:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<par>
   [..]
   <textstream src="http://example.com/chats/conf45.rt" region="chat" \
               begin="0s" end="500s"/>
   [..]
</par>
				]]>
				</artwork>
			</figure>
		</t>
		<t>
			Once the requirement on the file format is established, the next step follows directly. Whatever format
			the chat in the conference was recorded in, it needs to be converted to a RealText file
			in order to have it addressed in the resulting SMIL file. The conversion is usually straightforward,
			considering that chat logs often contain the same information needed in a RealText file,
			and differ only in the presentation format.
		</t>
		</section>

		<!-- Slides Tagging -->
		<section title="Slides" anchor="sec-tag-slides">
		<t>
			The easiest way to deal with a slideshow and/or a shared slide presentation is to make use
			of the &lt;img&gt; element. In fact, as anticipated in <xref target="sec-rec-slides"/>, slides
			in a presentation most often consist of static content, and can be treated as images.
			This means that addressing a complete presentation in a SMIL file can be achieved by
			following these steps:
		</t>
		<t>
		<list style="numbers">
			<t>preparing a list of images reflecting the original presentation (e.g. 10 images for 10 slides, or more
			if any animation was involved);</t>
			<t>preparing the timing-related information (e.g. when slide 1 appeared, and when it was replaced by
			slide 2);</t>
			<t>placing a series of &lt;img&gt; elements in the SMIL metadata, built from the results of the first two steps.</t>
		</list>
		</t>
		<t>
			An example of this, recalling the scenario depicted in <xref target="fig-rec-slide"/>, is presented here:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<par>
   [..]
   <img src="http://www.example.com/f44gf/1.jpg" region="image" \
        begin="0s" end="10s"/>
   <img src="http://www.example.com/f44gf/2.jpg" region="image" \
        begin="10s" end="18s"/>
   <img src="http://www.example.com/f44gf/3.jpg" region="image" \
        begin="18s" end="30s"/>
   [..]
</par>
				]]>
				</artwork>
			</figure>
		</t>
		<t>
			The slideshow would usually be a sequence, and so a &lt;seq&gt; would
			seem the more apt way to address the presentation sharing. Nevertheless,
			timing information is very important, and it's quite likely that
			several additional media will flow in parallel with the slides (e.g.
			the video stream which includes the presenter talking). That's why
			a &lt;par&gt; element is used instead; for brevity, the example omits
			the other media involved.
		</t>
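		<t>
			As a sketch of an alternative arrangement, and not a
			recommendation of this document, the same three slides could be
			placed in a &lt;seq&gt; nested inside the &lt;par&gt; shared with the
			other media. In this case each slide carries a 'dur' attribute
			(10, 8 and 12 seconds, matching the 'begin'/'end' values of the
			previous example) instead of absolute times. The video URL and
			its timing are placeholders introduced only for illustration:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<par>
   <!-- presenter video (placeholder URL), parallel to the slides -->
   <video src="http://www.example.com/conference45.avi" \
          region="video" begin="0s" end="30s"/>
   <seq>
      <!-- slides follow one another; 'dur' gives each lifetime -->
      <img src="http://www.example.com/f44gf/1.jpg" region="image" \
           dur="10s"/>
      <img src="http://www.example.com/f44gf/2.jpg" region="image" \
           dur="8s"/>
      <img src="http://www.example.com/f44gf/3.jpg" region="image" \
           dur="12s"/>
   </seq>
</par>
				]]>
				</artwork>
			</figure>
		</t>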
		</section>

		<!-- Whiteboard Tagging -->
		<section title="Whiteboard" anchor="sec-tag-wb">
		<t>
			As anticipated in <xref target="sec-rec-wb"/>, no standard solution is
			usually deployed for whiteboarding in a conferencing system.
			For this reason, the recording process suggested in <xref target="sec-rec-wb"/>
			is just a timing-aware dump of all the interactions that occurred at the
			whiteboard level. These interactions might subsequently be converted
			to a more common format such as, for instance, a video or an image slideshow.
			In case of a video, the same considerations of <xref target="sec-tag-av"/>
			would apply, since the whiteboard recording would actually be a video itself.
			In case it is converted to a slideshow, the tagging process would occur
			as explained in <xref target="sec-tag-slides"/>.
		</t>
		<t>
			However, SMIL also allows for custom, non-standard media to be involved
			in its metadata. This can be achieved by means of the standard element
			&lt;ref&gt;, which is a generic media reference. This element allows for the description and addressing of
			non-standard media (or at least media the chosen SMIL specification is not aware
			of), which could be implemented in a custom player. This means that, if
			a whiteboard has been recorded in a proprietary format, and this format needs
			for one reason or another to be preserved, the &lt;ref&gt; element may
			be used to address it: in fact, the same attributes previously introduced
			(including 'src' and the others) are available to this element as well.
			Of course, if this approach is used, only a player able to understand the
			proprietary media extension would be able to replay the recorded whiteboard
			session. To make the player aware of the format employed, a 'type' attribute
			could be added as well.
		</t>
		<t>
			An example of how the recorded whiteboard might be addressed is provided
			here:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<par>
   [..]
      <ref src="http://example.com/wb/wb12.txt" region="wb" \
           type="myFormat"/>
   [..]
</par>
				]]>
				</artwork>
			</figure>
		</t>
		</section>
		</section>

	</section>

	<!-- Playout -->
	<section title="Playout" anchor="sec-playout">
		<t>
			Once the SMIL metadata has been properly prepared, a playout of the
			recorded conference is not difficult to achieve. In fact, an interested
			user just needs a SMIL-aware player supporting the file formats
			involved, namely: (i) audio/video; (ii) images; (iii) RealText;
			(iv) the whiteboarding session, in whatever format it has been recorded.
			Considering the standard nature of SMIL and of almost all the media
			involved, the session is likely to be accessible to many existing
			players. In any case, the 'type' attribute of each involved medium
			can be used by a player to check whether it supports that medium.
		</t>
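		<t>
			As a hedged illustration of such a check, a media element might
			declare its format through the 'type' attribute as sketched
			below. The file name and the MIME type value are assumptions
			made only for this example:
		</t>
		<t>
			<figure>
				<artwork>
					<![CDATA[
<!-- 'type' carries the media type of the resource; a player  -->
<!-- not supporting it can detect this before retrieving it   -->
<video src="http://www.example.com/conference45.mp4" \
       type="video/mp4" region="video" begin="15s" end="400s"/>
				]]>
				</artwork>
			</figure>
		</t>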
		<t>
			Additional information provided in the SMIL head (e.g. the &lt;switch&gt;
			elements and the &lt;layout&gt; they suggest) provides guidance to players
			on how to present the addressed media in the expected way.
		</t>
		<t>
			The sequence of actions needed to access a recorded
			conference session can be summarized in the following simplified steps:
		</t>
		<t>
		<list style="symbols">
			<t>the user retrieves the SMIL file associated with the conference she/he is interested in
			(e.g. by means of HTTP or other out-of-band mechanisms);</t>
			<t>the SMIL file is passed to a compliant media player (which could have been
			the means to get the SMIL file in the first place);</t>
			<t>the player parses the SMIL file and checks whether all the media are supported; apart
			from explicitly non-standard media (e.g. whiteboard), the player might check whether the
			involved media files are encoded in a format it supports (e.g. a video file encoded
			in H.264/MP3);</t>
			<t>the player prepares the presentation screen; it makes use of the information in
			the &lt;head&gt; in order to choose the right layout; the choice may be automatic
			(e.g. according to the screen resolution) or guided by the user;</t>
			<t>the player starts retrieving each involved media file; it may either retrieve
			each file in its completeness, or start downloading and then start the playout
			almost immediately (e.g. buffering); it also listens for user-generated events,
			like the user pausing/resuming the playout, or seeking to a specific time in
			the conference; if any of these events occur, it takes the related action
			(e.g. seeking to the right time for each medium in the conference, taking the
			timing information from the SMIL file as well).</t>
		</list>
		</t>
		<t>
			A general overview of the scenario can be seen in <xref target="fig-play"/>.
		</t>
		<t>
			<figure anchor="fig-play" title="Retrieving and playing a recorded conference session">
				<artwork>
					<![CDATA[
+------+ 1. START    +----------+                          +----------+
| User |------------>|   User   |------------------------->| Sessions |
|      |<------------| (player) |  2. get conf45.smil      | database |
+------+  6. SHOW    +----------+                          +----------+
                       |  |  |
                       |  |  |
                       |  |  |   3. get audios and videos  +-----------+
                       |  |  +---------------------------->| WebServer |
                       |  |                                |  (video)  |
                       |  |    4. get RealText files       +-----------+
                       |  +------------------------------->|  (text)   |
                       |    5. get slide images            +-----------+
                       +---------------------------------->|  (images) |
                                                           +-----------+
						 ]]>
				</artwork>
			</figure>
		</t>
		<t>
			In this deliberately simplified scenario, an interested viewer chooses to
			start viewing a previously recorded conference. She/he knows the address
			of the recorded session (http://example.com/conf45.smil) and passes it
			to her/his player (1.). Starting the playout triggers the retrieval
			of the SMIL description (2.), which may be achieved by means of
			HTTP or any other protocol. Once the player has access to the description,
			it starts retrieving the individual media resources addressed there
			(video in 3., chat in 4., slides in 5.), and, according to the
			implementation of the player, it either waits for all the downloads
			to complete or just buffers a little while before starting the
			presentation to the user (6.).
		</t>
	</section>

	<!-- Security Considerations -->
	<section title="Security Considerations" anchor="sec-security">
		<t>
			TBD.
		</t>
	</section>

	<!-- Acknowledgements -->
	<section title="Acknowledgements" anchor="sec-acks">
		<t>
			The authors would like to thank...
		</t> 
	</section>
	
</middle>

<back>
	<references title="References">
		<!-- BNF -->
		<?rfc include="reference.RFC.2234"?>
		<!-- Terminology -->
		<?rfc include="reference.RFC.2119"?>
		<!-- IANA Guidelines -->
		<?rfc include="reference.RFC.2434"?>
		<!-- SIP -->
		<?rfc include="reference.RFC.3261"?>
		<!-- RTP -->
		<?rfc include="reference.RFC.3550"?>
		<!-- MediaCtrl Architecture -->
		<?rfc include="reference.RFC.5567"?>
		<!-- Media Control Channel Framework -->
		<?rfc include="reference.RFC.6230"?>
		<!-- Conferencing package -->
		<?rfc include="reference.I-D.ietf-mediactrl-mixer-control-package"?>
		<!-- Call Flows -->
		<?rfc include="reference.I-D.ietf-mediactrl-call-flows"?>
		<!-- XCON framework -->
		<?rfc include="reference.RFC.5239"?>
		<!-- BFCP -->
		<?rfc include="reference.RFC.4582"?>
		<!-- SMIL -->
		<?rfc include="reference.W3C.CR-SMIL3-20080115"?>
		<!-- XMPP -->
		<?rfc include="reference.RFC.3920"?>
		<!-- MSRP -->
		<?rfc include="reference.RFC.4975"?>
		
	</references>
</back>

</rfc>


