Research Interests
I am interested in Computer Vision, Computer Graphics, and Deep Learning in general. In particular, I am interested in understanding the world around us from 2D and 3D visual data through systems that can effectively reuse knowledge and data acquired from similar tasks and domains, learn from data with limited or no labels, and remain robust in diverse real-world scenarios.
|
|
An End-To-End Framework For Pose Estimation of Occluded Pedestrians
Sudip Das*,
Perla Sai Raj Kishore*,
Ujjwal Bhattacharya
International Conference on Image Processing (ICIP), 2020
Abstract /
BibTex
Pose estimation in the wild is a challenging problem, particularly in situations of (i) occlusions of varying degrees, and (ii) crowded outdoor scenes. Most existing studies of pose estimation do not report performance in such situations. Moreover, pose annotations for occluded parts of human figures are not provided in any of the relevant standard datasets, which creates further difficulties for studies on estimating the entire pose of occluded humans. Well-known pedestrian detection datasets such as CityPersons contain samples of outdoor scenes but do not include pose annotations. Here we propose a novel multi-task framework for end-to-end training towards the entire pose estimation of pedestrians, including in situations of any kind of occlusion. To tackle this problem, we make use of a pose estimation dataset, MS-COCO, and employ unsupervised adversarial instance-level domain adaptation for estimating the entire pose of occluded pedestrians. The experimental studies show that the proposed framework outperforms the SOTA results for pose estimation, instance segmentation and pedestrian detection in cases of heavy occlusions (HO) and reasonable + heavy occlusions (R+HO) on the two benchmark datasets.
@INPROCEEDINGS{9191147,
author={S. {Das} and P. S. R. {Kishore} and U. {Bhattacharya}},
booktitle={2020 IEEE International Conference on Image Processing (ICIP)},
title={An End-To-End Framework For Pose Estimation Of Occluded Pedestrians},
year={2020},
volume={},
number={},
pages={1446-1450},
doi={10.1109/ICIP40778.2020.9191147}
}
|
|
ClueNet: A Deep Framework for Occluded Pedestrian Pose Estimation
Perla Sai Raj Kishore*,
Sudip Das*,
Partha Sarathi Mukherjee,
Ujjwal Bhattacharya
British Machine Vision Conference (BMVC), 2019
Abstract /
BibTex
Pose estimation of a pedestrian helps to gather information about the current activity or the instant behaviour of the subject. Such information is useful for autonomous vehicles, augmented reality, video surveillance, etc. Although a large volume of pedestrian detection studies is available in the literature, detecting pedestrians under significant occlusion remains a challenging task. In this work, we take a step further and propose a novel deep learning framework, called ClueNet, to detect occluded pedestrians and estimate their entire pose in an unsupervised manner. ClueNet is a two-stage framework where the first stage generates visual clues for the second stage to accurately estimate the pose of occluded pedestrians. The first stage employs a multi-task network to segment the visible parts and predict a bounding box enclosing the visible and occluded regions for each pedestrian. The second stage uses these predictions from the first stage for pose estimation. Here we propose a novel strategy, called Mask and Predict, to train our ClueNet to estimate the pose even for occluded regions. Additionally, we make use of various other training strategies to further improve our results. The proposed work is the first of its kind, and the experimental results on the CityPersons and MS COCO datasets show the superior performance of our approach over existing methods.
@article{kishore2019cluenet,
title={ClueNet: A Deep Framework for Occluded Pedestrian Pose Estimation},
author={Kishore, Perla Sai Raj and Das, Sudip and Mukherjee, Partha Sarathi and Bhattacharya, Ujjwal},
year={2019}
}
|
|
Handwriting Recognition in Low-Resource Scripts Using Adversarial Learning
Ayan Kumar Bhunia,
Abhirup Das,
Ankan Kumar Bhunia,
Perla Sai Raj Kishore,
Partha Pratim Roy
Conference on Computer Vision and Pattern Recognition (CVPR), 2019
Abstract /
Code /
arXiv /
BibTex
Handwritten Word Recognition and Spotting is a challenging field dealing with handwritten text possessing irregular and complex shapes. The design of deep neural network models makes it necessary to extend training datasets in order to introduce variations and increase the number of samples; word-retrieval is therefore very difficult in low-resource scripts. Much of the existing literature comprises preprocessing strategies which are seldom sufficient to cover all possible variations. We propose an Adversarial Feature Deformation Module (AFDM) that learns ways to elastically warp extracted features in a scalable manner. The AFDM is inserted between intermediate layers and trained alternatively with the original framework, boosting its capability to better learn highly informative features rather than trivial ones. We test our meta-framework, which is built on top of popular word-spotting and word-recognition frameworks and enhanced by AFDM, not only on extensive Latin word datasets but also on sparser Indic scripts. We record results for varying sizes of training data, and observe that our enhanced network generalizes much better in the low-data regime; the overall word-error rates and mAP scores are observed to improve as well.
@InProceedings{Bhunia_2019_CVPR,
author = {Bhunia, Ayan Kumar and Das, Abhirup and Bhunia, Ankan Kumar and Kishore, Perla Sai Raj and Roy, Partha Pratim},
title = {Handwriting Recognition in Low-Resource Scripts Using Adversarial Learning},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2019}
}
|
|
User Constrained Thumbnail Generation Using Adaptive Convolutions
Perla Sai Raj Kishore,
Ayan Kumar Bhunia,
Shuvozit Ghose,
Partha Pratim Roy
International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019
(Oral)
Abstract /
Code /
arXiv /
BibTex
Thumbnails are widely used all over the world as a preview for digital images. In this work we propose a deep neural framework to generate thumbnails of any size and aspect ratio, even for values unseen during training, with high accuracy and precision. We use Global Context Aggregation (GCA) and a modified Region Proposal Network (RPN) with adaptive convolutions to generate thumbnails in real time. GCA is used to selectively attend to and aggregate the global context information from the entire image, while the RPN is used to generate candidate bounding boxes for the thumbnail image. Adaptive convolution eliminates the difficulty of generating thumbnails of various aspect ratios by using filter weights dynamically generated from the aspect ratio information. The experimental results indicate the superior performance of the proposed model over existing state-of-the-art techniques.
@inproceedings{kishore2019user,
title={User Constrained Thumbnail Generation Using Adaptive Convolutions},
author={Kishore, Perla Sai Raj and Bhunia, Ayan Kumar and Ghose, Shuvozit and Roy, Partha Pratim},
booktitle={ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1677--1681},
year={2019},
organization={IEEE}
}
|
|
Texture Synthesis Guided Deep Hashing for Texture Image Retrieval
Ayan Kumar Bhunia,
Perla Sai Raj Kishore,
Pranay Mukherjee,
Abhirup Das,
Partha Pratim Roy
Winter Conference on Applications of Computer Vision (WACV), 2019
Abstract /
arXiv /
BibTex
With the large scale explosion of images and videos over the internet, efficient hashing methods have been developed to facilitate memory and time efficient retrieval of similar images. However, none of the existing works use hashing to address texture image retrieval mostly because of the lack of sufficiently large texture image databases. Our work addresses this problem by developing a novel deep learning architecture that generates binary hash codes for input texture images. For this, we first pre-train a Texture Synthesis Network (TSN) which takes a texture patch as input and outputs an enlarged view of the texture by injecting newer texture content. Thus it signifies that the TSN encodes the learnt texture specific information in its intermediate layers. In the next stage, a second network gathers the multi-scale feature representations from the TSN’s intermediate layers using channel-wise attention, combines them in a progressive manner to a dense continuous representation which is finally converted into a binary hash code with the help of individual and pairwise label information. The new enlarged texture patches from the TSN also help in data augmentation to alleviate the problem of insufficient texture data and are used to train the second stage of the network. Experiments on three public texture image retrieval datasets indicate the superiority of our texture synthesis guided hashing approach over existing state-of-the-art methods.
@inproceedings{bhunia2019texture,
title={Texture synthesis guided deep hashing for texture image retrieval},
author={Bhunia, Ayan Kumar and Perla, Sai Raj Kishore and Mukherjee, Pranay and Das, Abhirup and Roy, Partha Pratim},
booktitle={2019 IEEE Winter Conference on Applications of Computer Vision (WACV)},
pages={609--618},
year={2019},
organization={IEEE}
}
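The final retrieval step mentioned in the abstract above, comparing binary hash codes, can be sketched in a few lines. This is only an illustration, not the paper's network: the threshold-at-zero binarization, the function names, and the toy database are all assumptions made for the example.

```python
def binarize(features, threshold=0.0):
    """Turn a continuous feature vector into a binary code (assumed: simple thresholding)."""
    return [1 if value > threshold else 0 for value in features]


def hamming(code_a, code_b):
    """Count the positions at which two equal-length binary codes differ."""
    return sum(a != b for a, b in zip(code_a, code_b))


def retrieve(query_code, database, top_k=1):
    """Return the names of the top_k database entries closest to the query in Hamming distance."""
    ranked = sorted(database.items(), key=lambda item: hamming(query_code, item[1]))
    return [name for name, _ in ranked[:top_k]]


# Hypothetical toy database of 4-bit codes.
toy_database = {"brick": [1, 0, 1, 1], "grass": [0, 1, 0, 1]}
query = binarize([0.3, -0.2, 0.9, -0.7])
nearest = retrieve(query, toy_database)
```

Binary codes make retrieval cheap in both memory and time, since comparing two codes only requires counting differing bits.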
|
|
Flatten-T Swish: A thresholded ReLU-Swish-like Activation Function for Deep Learning
Hock Hung Chieng,
Noorhaniza Wahid,
Ong Pauline,
Perla Sai Raj Kishore
International Journal of Advances in Intelligent Informatics (IJAIN), 2018  
(Best Paper Award)
Abstract /
Code /
arXiv /
BibTex
Activation functions are essential for deep learning methods to learn and perform complex tasks such as image classification. The Rectified Linear Unit (ReLU) has been widely used and has become the default activation function across the deep learning community since 2012. Although ReLU is popular, its hard zero property heavily hinders negative values from propagating through the network. Consequently, deep neural networks have not benefited from negative representations. In this work, an activation function called Flatten-T Swish (FTS) that leverages the benefit of negative values is proposed. To verify its performance, this study evaluates FTS against ReLU and several recent activation functions. Each activation function is trained on the MNIST dataset with five different deep fully connected neural networks (DFNNs) whose depths vary from five to eight layers. For a fair evaluation, all DFNNs use the same configuration settings. Based on the experimental results, FTS with a threshold value of T=-0.20 has the best overall performance. Compared with ReLU, FTS (T=-0.20) improves MNIST classification accuracy by 0.13%, 0.70%, 0.67%, 1.07% and 1.15% on the wider 5-layer, slimmer 5-layer, 6-layer, 7-layer and 8-layer DFNNs respectively. The study also observes that FTS converges twice as fast as ReLU. Although other existing activation functions are also evaluated, this study uses ReLU as the baseline activation function.
@article{IJAIN249,
author = {Hock Chieng and Noorhaniza Wahid and Ong Pauline and Sai Raj Kishore Perla},
title = {Flatten-T Swish: a thresholded ReLU-Swish-like activation function for deep learning},
journal = {International Journal of Advances in Intelligent Informatics},
volume = {4}, number = {2},
year = {2018},
pages = {76--86},
doi = {10.26555/ijain.v4i2.249},
url = {http://ijain.org/index.php/IJAIN/article/view/249}
}
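A minimal sketch of the FTS activation for a single scalar input. The abstract above does not spell out the formula, so the piecewise form used here (Swish shifted by T for non-negative inputs, a flat value T for negative inputs) is an assumption based on the function's name and description; the function name and default are chosen for illustration.

```python
import math


def flatten_t_swish(x: float, t: float = -0.20) -> float:
    """Assumed FTS form: x * sigmoid(x) + t for x >= 0, and the flat value t otherwise."""
    if x >= 0:
        return x / (1.0 + math.exp(-x)) + t
    return t


# Negative inputs map to the threshold t rather than to zero as in ReLU.
negative_output = flatten_t_swish(-3.0)
positive_output = flatten_t_swish(2.0)
```

With T = 0 and this form, the negative branch reduces to the hard zero of ReLU; a negative T is what lets a negative value pass through the layer.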
|
|
Saliency Detection: PyTorch implementation of a CVPR 2019 Publication
PyTorch implementation of the paper "Pyramid Feature Attention Network for Saliency Detection", published at CVPR 2019.
Code /
Paper
|
|
Single Image Super Resolution
Image Super Resolution aims to increase the resolution of an image by generating the pixels needed to go from a given low-resolution image to the required high-resolution image. I built a deep learning based model for this purpose and collected a large amount of diverse data to train it. The model was implemented using Keras in Python and comes with an easy-to-use graphical user interface. This was my project as an intern under Prof. A. V. Subramanyam of IIIT Delhi.
Code
|
|
Mixture Density Networks
Mixture Density Networks (MDNs) are an interesting way to address multimodality (where the input and output hold a one-to-many relationship). In such scenarios, instead of directly predicting the output, we model the probability distribution of the output as a weighted mixture of several Gaussians, from which we sample the actual output. In this project, I implemented univariate and bivariate MDNs in Python using TensorFlow.
Code /
Original Paper
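The idea of a weighted mixture of Gaussians can be sketched for the univariate case without any deep learning library. In an MDN the weights, means, and standard deviations would be produced by the network for each input; here they are passed in directly, and all function names are illustrative rather than taken from the project code.

```python
import math
import random


def mixture_pdf(y, weights, means, stds):
    """Density of y under a weighted mixture of univariate Gaussians."""
    return sum(
        w * math.exp(-0.5 * ((y - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for w, m, s in zip(weights, means, stds)
    )


def mdn_nll(y, weights, means, stds):
    """Negative log-likelihood of y, the quantity an MDN is typically trained to minimize."""
    return -math.log(mixture_pdf(y, weights, means, stds))


def sample_mixture(weights, means, stds, rng):
    """Pick a component according to its weight, then sample from that Gaussian."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], stds[k])


# Two well-separated modes: samples land near -5 or +5, never near their average of 0.
rng = random.Random(0)
samples = [sample_mixture([0.5, 0.5], [-5.0, 5.0], [0.1, 0.1], rng) for _ in range(20)]
```

Sampling from the mixture, rather than predicting a single mean, is what allows one input to map to several plausible outputs.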
|
|
Character Level Language Model
Auto-correct and auto-complete, which have now become standard features in almost all virtual keyboards, make use of a language model at their core. In this project, I built an LSTM-based character-level language model that aims to predict the next character from a sequence of input characters. The code for this project was written in Python using TensorFlow.
Code
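The next-character prediction task can be illustrated by how training examples are cut from raw text. This sketch covers only the data preparation, not the LSTM itself, and the function name and window length are assumptions made for the example.

```python
def make_training_pairs(text, seq_len):
    """Slide a window over the text, pairing each seq_len-character input with the character that follows it."""
    return [
        (text[i:i + seq_len], text[i + seq_len])
        for i in range(len(text) - seq_len)
    ]


pairs = make_training_pairs("hello", 3)
# pairs == [("hel", "l"), ("ell", "o")]
```

Each pair gives the model a context window as input and the following character as the prediction target.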
|
|
Lane Detection in NFS: Underground 2
Self-driving cars are one of the most fascinating technologies of the modern world. Though the entire process, from perceiving the surroundings to getting the car to move, is fairly complex, the first step usually begins with the detection of the lanes that guide the vehicle on the road. In this project, I attempt to detect lanes in real time in the popular game "NFS: Underground 2", using OpenCV in Python.
Code
|
|
Machine Learning Algorithms
In this project, I implemented various Machine Learning algorithms from scratch in Python using only NumPy.
Code
|
|