We analyze the important ingredients of constructing a realistic and challenging dataset for spatio-temporal action detection by proposing three criteria: (1) multi-person scenes and motion-dependent identification, (2) well-defined boundaries, and (3) relatively fine-grained classes of high complexity. Based on these guidelines, we build the MultiSports v1.0 dataset by selecting 4 sports classes, collecting 3,200 video clips, and annotating 37,701 action instances with 902k bounding boxes. Our dataset is characterized by high diversity, dense annotation, and high quality.
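To make the annotation unit concrete, here is a minimal sketch of one spatio-temporal action instance (a class label, a temporal boundary, and one person box per frame); the field names are illustrative, not MultiSports' actual release format.

```python
# A minimal sketch of one spatio-temporal action instance; the field names
# are illustrative, not MultiSports' actual release format.
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels

@dataclass
class ActionInstance:
    label: str          # fine-grained action class
    start_frame: int    # first frame of the instance (temporal boundary)
    end_frame: int      # last frame of the instance (temporal boundary)
    boxes: list[Box]    # one box per frame in [start_frame, end_frame]

    def is_consistent(self) -> bool:
        # every frame inside the boundary carries exactly one box
        return len(self.boxes) == self.end_frame - self.start_frame + 1
```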
We present a new large-scale multi-object tracking dataset in diverse sports scenes, coined SportsMOT, where all players on the court are to be tracked. It consists of 240 video sequences, over 150K frames (almost 15× MOT17), and over 1.6M bounding boxes (3× MOT17) collected from 3 sports categories: basketball, volleyball, and football. Our dataset is characterized by two key properties: (1) fast and variable-speed motion and (2) similar yet distinguishable appearance.
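As an illustration of what a per-frame tracking annotation carries, the sketch below parses one MOTChallenge-style ground-truth line into a track record; whether SportsMOT uses exactly this comma-separated layout is an assumption.

```python
# Hedged sketch: parse one MOTChallenge-style ground-truth line; whether
# SportsMOT uses exactly this layout is an assumption.
def parse_gt_line(line: str) -> dict:
    # conf and cls are parsed but unused in this sketch
    frame, track_id, x, y, w, h, conf, cls, vis = line.strip().split(",")[:9]
    return {
        "frame": int(frame),        # frame index within the sequence
        "track_id": int(track_id),  # identity kept consistent across frames
        "box": (float(x), float(y), float(x) + float(w), float(y) + float(h)),
        "visibility": float(vis),   # fraction of the player that is visible
    }
```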
We propose a new video visual relation detection task, video human-human interaction detection, and build the SportsHHI dataset for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball. In total, 118,075 human bounding boxes and 50,649 interaction instances are annotated on 11,398 keyframes. To benchmark this task, we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector.
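The pairing step of a generic two-stage detector can be sketched as follows: humans are detected first, then every ordered (subject, object) pair is scored by a relation head; `classify_pair` here is a hypothetical stand-in, not the paper's actual model.

```python
# Hedged sketch of the second stage of a two-stage interaction detector;
# `classify_pair` is a hypothetical relation head, not the paper's model.
from itertools import permutations

Box = tuple[float, float, float, float]

def detect_interactions(human_boxes: list[Box], classify_pair, threshold=0.5):
    interactions = []
    # interactions are directed (subject -> object), so enumerate ordered pairs
    for subj, obj in permutations(range(len(human_boxes)), 2):
        label, score = classify_pair(human_boxes[subj], human_boxes[obj])
        if score >= threshold:
            interactions.append((subj, obj, label, score))
    return interactions
```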
We present SportsShot, a dataset for shot analysis in sports videos, consisting of 1,200 videos, over 4M frames, and over 30K shot annotations. SportsShot is characterized by well-defined shot boundaries, fine-grained shot categories of high complexity, and consistent, high-quality annotations, making both shot segmentation and shot boundary detection more challenging. In particular, we group sports shots into seven semantic categories: close-up, close shot, full view, audience, transition, zooming, and others.
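Since shot segmentation is scored on contiguous segments, a minimal sketch of turning per-frame category predictions into (start, end, category) segments may help; the function below is illustrative, not part of any SportsShot toolkit, though the category names come from the text above.

```python
# Illustrative only: convert per-frame shot-category predictions into
# (start, end, category) segments; not part of any SportsShot toolkit.
CATEGORIES = ["close-up", "close shot", "full view", "audience",
              "transition", "zooming", "others"]

def frames_to_shots(frame_labels: list[str]) -> list[tuple[int, int, str]]:
    shots, start = [], 0
    for i in range(1, len(frame_labels) + 1):
        # a shot boundary is wherever the predicted category changes
        if i == len(frame_labels) or frame_labels[i] != frame_labels[start]:
            shots.append((start, i - 1, frame_labels[start]))
            start = i
    return shots
```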
We propose a spatio-temporal video grounding dataset for sports videos, coined SportsGrounding. We analyze the important components of constructing a realistic and challenging dataset for spatio-temporal video grounding by proposing two criteria: (1) grounding in multi-person scenes and motion-dependent contexts, and (2) well-defined boundaries. Based on these guidelines, we build the SportsGrounding v1.0 dataset by collecting 526 basketball video clips and annotating 4,479 instances with 113k bounding boxes.
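Spatio-temporal grounding is commonly evaluated with vIoU, which averages per-frame box IoU over the temporal intersection of the predicted and ground-truth tubes and normalizes by their temporal union; whether SportsGrounding scores with exactly this metric is an assumption, but a sketch clarifies what a grounded instance is matched against.

```python
# Hedged sketch of vIoU, a standard spatio-temporal grounding metric; whether
# SportsGrounding adopts exactly this metric is an assumption.
Box = tuple[float, float, float, float]

def box_iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def viou(gt: dict[int, Box], pred: dict[int, Box]) -> float:
    """gt/pred map frame index -> box of the grounded spatio-temporal tube."""
    union = set(gt) | set(pred)
    inter = set(gt) & set(pred)
    # average per-frame IoU over overlapping frames, over the temporal union
    return sum(box_iou(gt[t], pred[t]) for t in inter) / max(len(union), 1)
```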