Surveillance of local places, shopping malls, and traffic signals is essential for safety and security. Although CCTV cameras are available, manually retrieving footage frame by frame is a time-intensive and critical task. Automating tasks such as object detection, person identification, and tracking of suspicious movements can therefore improve the results of video surveillance. However, several factors affect the scene, including environmental variations, illumination, camera variations, and occlusion. Existing deep learning systems can identify objects from a limited set of categories under certain conditions, but the costs of training on data and of the required computing systems are very high. This paper therefore discusses the challenges and aspects of interpreting the scene around an object, and highlights various methods along with their implementations and results. Although deep learning is a powerful tool for video analytics, challenges remain that must be addressed for real-time scene interpretation.