Outline
What is new with motion estimation
Four Step Search and Hexagon Search Algorithms
Parallelization strategies
Results and discussions
Design Implementation
Parallelization is possible by dividing the image into small sub-image partitions. Each thread works on one sub-image independently, using the chosen algorithm (i.e., Four Step Search or Hexagon Search). At the end, the minimum SAD of each sub-image is compared with the others to obtain the final minimum SAD and to avoid settling on a local minimum.
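The partition-then-reduce idea can be sketched on the CPU as follows. This is a minimal illustration, not the implementation used for the results: the names `sad`, `best_in_partition`, and `parallel_min_sad` are hypothetical, a thread pool stands in for GPU threads, and an exhaustive scan inside each partition stands in for Four Step Search or Hexagon Search. It also assumes a partition limits where the top-left corner of a candidate block may lie.

```python
from concurrent.futures import ThreadPoolExecutor


def sad(image, window, top, left):
    """Sum of absolute differences between window and the image patch at (top, left)."""
    return sum(
        abs(image[top + r][left + c] - window[r][c])
        for r in range(len(window))
        for c in range(len(window[0]))
    )


def best_in_partition(image, window, rows, cols):
    """Return (sad, top, left) for the best candidate whose corner lies in the partition.

    Assumption: an exhaustive scan replaces the fast search used in the slides.
    """
    max_top = len(image) - len(window)
    max_left = len(image[0]) - len(window[0])
    candidates = [
        (sad(image, window, top, left), top, left)
        for top in rows if top <= max_top
        for left in cols if left <= max_left
    ]
    return min(candidates) if candidates else None


def parallel_min_sad(image, window, part_size):
    """Search each part_size x part_size partition in its own task, then reduce."""
    height, width = len(image), len(image[0])
    partitions = [
        (range(r, min(r + part_size, height)), range(c, min(c + part_size, width)))
        for r in range(0, height, part_size)
        for c in range(0, width, part_size)
    ]
    with ThreadPoolExecutor() as pool:
        local_minima = list(
            pool.map(lambda p: best_in_partition(image, window, *p), partitions)
        )
    # Comparing the per-partition minima yields the global minimum SAD.
    return min(m for m in local_minima if m is not None)
```

Calling `parallel_min_sad(image, window, 4)` on an 8x8 image returns a `(sad, top, left)` tuple. The final `min` is the comparison step that keeps a single partition's local minimum from being reported as the overall result.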
Implementation Notes
Since the number of threads we use is a multiple of 2, the image must be padded with additional rows and columns whenever the number of sub-images is not also a multiple of 2; the results from those extra sub-images are ignored. When comparing runtimes for the performance analysis, we excluded the time it takes to read the text file and store the data into the window and image arrays.
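The padding arithmetic might look like the sketch below. The helper names `padded_grid` and `is_real` are hypothetical, and the rounding target is left as a parameter because the slides only say "multiple of 2"; if a power of 2 is what was meant, the rounding step would differ.

```python
def padded_grid(height, width, part_size, multiple=2):
    """Return (real_rows, real_cols, padded_rows, padded_cols) in sub-image units."""
    def round_up(n, m):
        return -(-n // m) * m  # ceiling to the next multiple of m

    real_rows = -(-height // part_size)  # sub-images needed to cover the image rows
    real_cols = -(-width // part_size)   # sub-images needed to cover the image columns
    return real_rows, real_cols, round_up(real_rows, multiple), round_up(real_cols, multiple)


def is_real(index, real_rows, real_cols, padded_cols):
    """True if the flat sub-image index refers to image data rather than padding."""
    row, col = divmod(index, padded_cols)
    return row < real_rows and col < real_cols
```

For a 48x80 image with 16x16 sub-images, the grid grows from 3x5 to 4x6 sub-images, and `is_real` can be used to skip the results of the padded ones during the final SAD comparison.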
Simulation Results
First, we varied the number of threads per block to find the configuration that gives the best runtime.
[Figure: Runtime (seconds) vs. Image Size]
The runtimes of the serial and parallel versions of the different algorithms were collected and compared to see what performance improvement we achieved.
[Figures: Runtime (seconds) vs. Image Size for FSS_Serial and FSS_Parallel, and for Hexagon_Serial and Hexagon_parallel]
We only see a performance improvement when the image size is 256x256 or bigger. For any smaller image, parallelization actually decreases performance.
[Figure: Runtime (seconds) vs. Image Size]
[Figure: Speed up vs. Image size for Speed_UP_FS, Speed_UP_4SS, and Speed_UP_Hexagon]
Runtime (seconds):
Full_Serial  Full_Parallel  4SS_Serial  4SS_Parallel  Hexagon_Serial  Hexagon_parallel
0            0              0           0.016         0               0.078
0            0.016          0           0.015         0               0.047
0.01         0.016          0.01        0.015         0.01            0.062
0.02         0.016          0.01        0.015         0.01            0.062
0.09         0.031          0.02        0.016         0.02            0.047
0.41         0.078          0.06        0.016         0.06            0.063
1.64         0.265          0.236       0.032         0.22            0.062
6.56         0.922          0.87        0.047         0.85            0.078
26.29        3.719          3.38        0.11          3.3             0.157
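Speed-up is the serial runtime divided by the parallel runtime. As a worked example, the sketch below applies that ratio to the final value listed for each column of the runtime data above, assuming those values belong to the largest image size tested. The dictionary layout and the `speed_up` helper are illustrative only.

```python
# (serial, parallel) runtimes in seconds: the final value listed for each column.
# Assumption: these values correspond to the largest image size tested.
runtimes = {
    "Full": (26.29, 3.719),
    "4SS": (3.38, 0.11),
    "Hexagon": (3.3, 0.157),
}


def speed_up(serial, parallel):
    """Ratio of serial to parallel runtime; values above 1 mean the GPU version wins."""
    return serial / parallel


speed_ups = {name: speed_up(s, p) for name, (s, p) in runtimes.items()}
for name, value in speed_ups.items():
    print(f"{name}: {value:.1f}x")
```

Under that assumption the ratios come out at roughly 7x for full search, 31x for Four Step Search, and 21x for Hexagon Search.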
[Figures: plots against Image Size with series FSS_Serial, FSS_Parallel, 4SS_Serial, 4SS_Parallel, Hexagon_Serial, and Hexagon_parallel]
2. Fast search algorithms outperform the full search algorithm, which is why they are called fast.
3. Parallelized Four Step Search has a slight edge over parallelized Hexagon Search.
4. The distortion of the two fast search algorithms is similar.
Result Conclusions
Based on the data collected from the different algorithms, Four Step Search gives slightly better performance than Hexagon Search, while the distortion is very similar.
Hence, Four Step Search is the better fast search algorithm of the two.
Only run motion estimation algorithms on the GPU if the image size is 256x256 or larger. Smaller images should be run serially on the CPU.
Limitations
Image and window files are random.
The implementation does not make use of shared memory.
References
Deepak Turaga, Mohamed Alkanhal. "Search Algorithms for Block-Matching in Motion Estimation". ECE - CMU. March 06, 2010 <http://www.ece.cmu.edu/~ee899/project/deepak_mid.htm>.
Lai-Man Po, Wing-Chung Ma. "A Novel Four-Step Search Algorithm for Fast Block Motion Estimation". June 1996.
Xuan Jing, Lap-Pui Chau. "An Efficient Three-Step Search Algorithm for Block Motion Estimation". IEEE Transactions on Multimedia, June 2004: 435-437.
Chen Lu, Wang. "Diamond Search Algorithm". ECE, U of Texas. March 06, 2010 <http://users.ece.utexas.edu/~bevans/courses/ee381k/projects/fall98/chen-lu-wang/presentation/sld012.htm>.
Questions?