ResearchHub Feeds

Trending Users

Emeka A. Ezewudo
Prateek Jassal
Patrick Joyce
Eric Cuellar
Titus Osikhiana Ogahbrai
Akhil Kunche
ayotune adebayo
Ramees P S
dhanu dhanu
Sebastian Hunte

Trending Papers in Computer Science



29
Published: Apr 20, 2021
Authors: Wilson Yan, Yunzhi Zhang, Pieter Abbeel, Aravind Srinivas
We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of a raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity in formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and generate high-fidelity natural images from UCF-101 and the Tumblr GIF dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html
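
To make the two-stage recipe concrete, here is a minimal NumPy sketch of the quantization step; the shapes and codebook size are illustrative stand-ins, not VideoGPT's actual configuration, and the GPT prior is only indicated in comments.

```python
import numpy as np

# Toy stand-ins: a "downsampled" latent grid for one video clip and a codebook.
# Shapes and codebook size are illustrative, not VideoGPT's configuration.
rng = np.random.default_rng(0)
latents = rng.normal(size=(4, 8, 8, 64))   # (time, height, width, channels) after 3D-conv encoding
codebook = rng.normal(size=(512, 64))      # 512 learned code vectors of dimension 64

# VQ step: replace each latent vector with the index of its nearest codebook entry.
flat = latents.reshape(-1, 64)                                  # (4*8*8, 64)
d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)   # squared distance to every code
codes = d2.argmin(axis=1).reshape(4, 8, 8)                      # discrete latent "tokens"

# Stage two (not shown): flatten the code grid into a sequence and train a
# GPT-style transformer with spatio-temporal position encodings to predict
# each code from the previous ones, i.e. autoregressive modeling of `codes`.
sequence = codes.reshape(-1)
print(sequence.shape)   # (256,) tokens for the prior to model left to right
```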
61
Published: Apr 18, 2021
Authors: Tianyu Gao, Xingcheng Yao, Danqi Chen
This paper presents SimCSE, a simple contrastive learning framework that greatly advances state-of-the-art sentence embeddings. We first describe an unsupervised approach, which takes an input sentence and predicts itself in a contrastive objective, with only standard dropout used as noise. This simple method works surprisingly well, performing on par with previous supervised counterparts. We hypothesize that dropout acts as minimal data augmentation and that removing it leads to a representation collapse. Then, we draw inspiration from the recent success of learning sentence embeddings from natural language inference (NLI) datasets and incorporate annotated pairs from NLI datasets into contrastive learning by using "entailment" pairs as positives and "contradiction" pairs as hard negatives. We evaluate SimCSE on standard semantic textual similarity (STS) tasks, and our unsupervised and supervised models using BERT-base achieve an average of 74.5% and 81.6% Spearman's correlation respectively, a 7.9 and 4.6 point improvement over previous best results. We also show that contrastive learning theoretically regularizes pre-trained embeddings' anisotropic space to be more uniform, and that it better aligns positive pairs when supervised signals are available.
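
A minimal PyTorch sketch of the unsupervised objective, with a toy encoder standing in for BERT: the same batch is encoded twice with dropout active, and the two views form positives in an in-batch contrastive loss.

```python
import torch, torch.nn as nn, torch.nn.functional as F

# Placeholder encoder with dropout; SimCSE uses BERT-base, this is only for illustration.
encoder = nn.Sequential(nn.Linear(32, 32), nn.ReLU(), nn.Dropout(0.1), nn.Linear(32, 32))
encoder.train()                       # dropout must stay active: it is the only "augmentation"

batch = torch.randn(8, 32)            # stand-in for a batch of pooled sentence representations
z1 = F.normalize(encoder(batch), dim=-1)   # first pass
z2 = F.normalize(encoder(batch), dim=-1)   # second pass, different dropout mask

temperature = 0.05                     # temperature used in the paper's unsupervised setup
sim = z1 @ z2.T / temperature          # cosine similarities between all pairs in the batch
labels = torch.arange(sim.size(0))     # positives sit on the diagonal
loss = F.cross_entropy(sim, labels)    # in-batch InfoNCE: pull (i, i) together, push (i, j) apart
loss.backward()
```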
26
Published: Apr 16, 2021
Authors: Elizabeth Salesky, David Etter, Matt Post
Machine translation models have discrete vocabularies and commonly use subword segmentation techniques to achieve an 'open vocabulary.' This approach relies on consistent and correct underlying Unicode sequences, and makes models susceptible to degradation from common types of noise and variation. Motivated by the robustness of human language processing, we propose the use of visual text representations, which dispense with a finite set of text embeddings in favor of continuous vocabularies created by processing visually rendered text. We show that models using visual text representations approach or match the performance of text baselines on clean TED datasets. More importantly, models with visual embeddings demonstrate significant robustness to varied types of noise, achieving, e.g., 25.9 BLEU on a character-permuted German–English task where subword models degrade to 1.9.
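
The input pipeline this implies can be sketched roughly as follows; the font, window size, and stride are arbitrary choices for illustration, not the paper's settings.

```python
import numpy as np
from PIL import Image, ImageDraw, ImageFont

def render_windows(text, height=24, window=24, stride=12):
    """Render `text` to a grayscale strip and cut it into overlapping pixel windows."""
    font = ImageFont.load_default()               # any font works; robustness comes from pixels
    width = max(window, 8 * len(text))            # rough width estimate for this toy example
    img = Image.new("L", (width, height), color=255)
    ImageDraw.Draw(img).text((0, 4), text, fill=0, font=font)
    pixels = np.asarray(img, dtype=np.float32) / 255.0
    # Each window is one "visual token"; a convolutional stem would embed it
    # in place of a subword embedding lookup.
    return [pixels[:, x:x + window] for x in range(0, width - window + 1, stride)]

tokens = render_windows("ein Beispielsatz mit Tippfehlern")
print(len(tokens), tokens[0].shape)
```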
13
Published: Apr 16, 2021
Authors: Shen, Zheyan, et al.
Approaches based on deep neural networks have achieved striking performance when testing data and training data share a similar distribution, but can fail significantly otherwise. Therefore, eliminating the impact of distribution shifts between training and testing data is crucial for building performance-promising deep models. Conventional methods assume either known heterogeneity of the training data (e.g. domain labels) or approximately equal capacities of different domains. In this paper, we consider a more challenging case where neither of the above assumptions holds. We propose to address this problem by removing the dependencies between features via learning weights for training samples, which helps deep models get rid of spurious correlations and, in turn, concentrate more on the true connection between discriminative features and labels. Through extensive experiments on distribution generalization benchmarks including PACS, VLCS, MNIST-M, and NICO, we show the effectiveness of our method compared with state-of-the-art counterparts.
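
The central idea, learning sample weights that decorrelate features before weighted training, can be sketched as a toy optimization; this version directly penalizes weighted covariance between raw features and is only a simplification of the paper's procedure.

```python
import torch

# Toy data: n samples with d features, one of which spuriously tracks another.
torch.manual_seed(0)
X = torch.randn(200, 5)
X[:, 1] = 0.9 * X[:, 0] + 0.1 * torch.randn(200)

# Learn per-sample weights (softmax-parameterized) that shrink cross-feature covariance.
logits = torch.zeros(200, requires_grad=True)
opt = torch.optim.Adam([logits], lr=0.05)
for _ in range(300):
    w = torch.softmax(logits, dim=0)                # weights sum to 1
    mean = (w[:, None] * X).sum(0)
    Xc = X - mean
    cov = (w[:, None] * Xc).T @ Xc                  # weighted covariance matrix
    off_diag = cov - torch.diag(torch.diag(cov))
    loss = (off_diag ** 2).sum()                    # penalize dependence between features
    opt.zero_grad(); loss.backward(); opt.step()

# The learned weights would then reweight each sample's loss when training the deep model,
# so that spuriously correlated features no longer dominate the fitted predictor.
```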
25
Published: Apr 16, 2021
Authors: Dawn Drain, Chen Wu, Alexey Svyatkovskiy, Neel Sundaresan
Detecting and fixing bugs are two of the most important yet frustrating parts of the software development cycle. Existing bug detection tools are based mainly on static analyzers, which rely on mathematical logic and symbolic reasoning about the program execution to detect common types of bugs. Fixing bugs is typically left to the developer. In this work we introduce DeepDebug: a data-driven program repair approach which learns to detect and fix bugs in Java methods mined from real-world GitHub repositories. We frame bug-patching as a sequence-to-sequence learning task consisting of two steps: (i) denoising pretraining, and (ii) supervised finetuning on the target translation task. We show that pretraining on source code programs improves the number of patches found by 33% compared to supervised training from scratch, while domain-adaptive pretraining from natural language to code further improves accuracy by another 32%. We refine the standard accuracy evaluation metric into non-deletion and deletion-only fixes, and show that our best model generates 75% more non-deletion fixes than the previous state of the art. In contrast to prior work, we attain our best results when generating raw code, as opposed to working with abstracted code that tends to only benefit smaller-capacity models. Finally, we observe a subtle improvement from adding syntax embeddings along with the standard positional embeddings, as well as from adding an auxiliary task to predict each token's syntactic class. Despite focusing on Java, our approach is language agnostic, requiring only a general-purpose parser such as tree-sitter.
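
As a rough illustration of step (i), denoising pretraining, the sketch below builds (corrupted, original) pairs from a token stream by dropping short spans; the masking rate, span length, and sentinel token are assumptions for illustration, not the paper's settings.

```python
import random

def corrupt(tokens, mask_rate=0.15, max_span=3, mask_token="<MASK>"):
    """Drop short spans of tokens and mark each gap, producing one denoising example."""
    noisy, i = [], 0
    while i < len(tokens):
        if random.random() < mask_rate:
            noisy.append(mask_token)            # one sentinel per removed span
            i += random.randint(1, max_span)
        else:
            noisy.append(tokens[i])
            i += 1
    return noisy, tokens                        # the seq2seq model learns to reconstruct the original

random.seed(0)
method = "public int add ( int a , int b ) { return a + b ; }".split()
src, tgt = corrupt(method)
print(" ".join(src))
# Supervised finetuning then replaces these pairs with (buggy method, fixed method) pairs.
```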
16
Published: Apr 20, 2021
Authors: Liu, Yunfeng, et al.
Position encoding in the transformer architecture provides supervision for dependency modeling between elements at different positions in the sequence. We investigate various methods to encode positional information in transformer-based language models and propose a novel implementation named Rotary Position Embedding (RoPE). The proposed RoPE encodes absolute positional information with a rotation matrix and naturally incorporates explicit relative position dependency in the self-attention formulation. Notably, RoPE comes with valuable properties, such as the flexibility to expand to any sequence length, decaying inter-token dependency with increasing relative distance, and the capability of equipping linear self-attention with relative position encoding. As a result, the enhanced transformer with rotary position embedding, or RoFormer, achieves superior performance on tasks with long texts. We release a theoretical analysis along with some preliminary experimental results on Chinese data; ongoing experiments on English benchmarks will be added soon.
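
A small NumPy sketch of the rotation idea (the dimension pairing and base frequency follow the common convention; this is illustrative, not RoFormer's released code): each pair of query/key dimensions is rotated by an angle proportional to the token position, so the attention score depends only on the relative offset.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply a rotary position embedding to vectors x at integer positions pos.
    x: (seq, dim) with even dim; pos: (seq,)."""
    dim = x.shape[-1]
    freqs = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per dimension pair
    angles = pos[:, None] * freqs[None, :]          # (seq, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin               # 2D rotation of each dimension pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=(1, 8)), rng.normal(size=(1, 8))
# The same query/key pair at positions (5, 3) and (12, 10) gives the same attention score,
# because both offsets equal 2: the dot product depends only on relative position.
s1 = rope(q, np.array([5])) @ rope(k, np.array([3])).T
s2 = rope(q, np.array([12])) @ rope(k, np.array([10])).T
print(np.allclose(s1, s2))   # True
```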
23
Published: Apr 16, 2021
Authors: Sheikh, Yaser, et al.
This paper presents a generic method for generating full facial 3D animation from speech. Existing approaches to audio-driven facial animation exhibit uncanny or static upper-face animation, fail to produce accurate and plausible co-articulation, or rely on person-specific models that limit their scalability. To improve upon existing models, we propose a generic audio-driven facial animation approach that achieves highly realistic motion synthesis results for the entire face. At the core of our approach is a categorical latent space for facial animation that disentangles audio-correlated and audio-uncorrelated information based on a novel cross-modality loss. Our approach ensures highly accurate lip motion, while also synthesizing plausible animation of the parts of the face that are uncorrelated to the audio signal, such as eye blinks and eyebrow motion. We demonstrate that our approach outperforms several baselines and obtains state-of-the-art quality both qualitatively and quantitatively. A perceptual user study demonstrates that our approach is deemed more realistic than the current state of the art in over 75% of cases. We recommend watching the supplemental video before reading the paper: https://research.fb.com/wp-content/uploads/2021/04/mesh_talk.mp4
11
Published: Apr 16, 2021
Authors: Hung Le, Nancy Chen, Steven Hoi
Neural module networks (NMN) have achieved success in image-grounded tasks such as Visual Question Answering (VQA) on synthetic images. However, NMN has seen very limited study in video-grounded language tasks, which extend the complexity of traditional visual tasks with additional visual temporal variance. Motivated by recent NMN approaches on image-grounded tasks, we introduce the Video-grounded Neural Module Network (VGNMN) to model the information retrieval process in video-grounded language tasks as a pipeline of neural modules. VGNMN first decomposes all language components to explicitly resolve any entity references and detect corresponding action-based inputs from the question. The detected entities and actions are used as parameters to instantiate neural module networks and extract visual cues from the video. Our experiments show that VGNMN can achieve promising performance on two video-grounded language tasks: video QA and video-grounded dialogues.
11
Published: Apr 16, 2021
Authors: Reiter, Austin, et al.
Recent advances in using retrieval components over external knowledge sources have shown impressive results for a variety of downstream tasks in natural language processing. Here, we explore the use of unstructured external knowledge sources of images and their corresponding captions for improving visual question answering (VQA). First, we train a novel alignment model for embedding images and captions in the same space, which achieves a substantial improvement in image-caption retrieval performance relative to similar methods. Second, we show that retrieval-augmented multi-modal transformers using the trained alignment model improve results on VQA over strong baselines. We further conduct extensive experiments to establish the promise of this approach, and examine novel applications for inference time such as hot-swapping indices.
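
Once such an alignment model exists, the retrieval step reduces to nearest-neighbor search in the shared space; the sketch below uses random vectors as stand-ins for the learned image and caption embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
image_emb = rng.normal(size=(4, 128))       # stand-ins for aligned image embeddings
caption_emb = rng.normal(size=(1000, 128))  # stand-ins for an external caption index

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

sims = normalize(image_emb) @ normalize(caption_emb).T   # cosine similarity, shape (4, 1000)
topk = np.argsort(-sims, axis=1)[:, :5]                  # 5 nearest captions per image
print(topk.shape)
# The retrieved captions are appended to the multi-modal transformer's input;
# "hot-swapping indices" amounts to replacing caption_emb without retraining the model.
```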
11
Published: Apr 16, 2021
Authors: Yanai Elazar, Hongming Zhang, Yoav Goldberg, Dan Roth
The Winograd Schema (WS) has been proposed as a test for measuring the commonsense capabilities of models. Recently, pre-trained language model-based approaches have boosted performance on some WS benchmarks, but the source of the improvement is still not clear. We begin by showing that the current evaluation method of WS is sub-optimal and propose a modification that makes use of twin sentences for evaluation. We also propose two new baselines that indicate the existence of biases in WS benchmarks. Finally, we propose a method for evaluating WS-like sentences in a zero-shot setting and observe that popular language models perform randomly in this setting. We conclude that much of the apparent progress on WS may not reflect progress in commonsense reasoning; rather, it largely comes from supervised data, which is not likely to account for all the required commonsense reasoning skills and knowledge.
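
The twin-sentence evaluation can be sketched as below; lm_score is a placeholder for any language-model likelihood function, and the example schema is a classic illustration rather than an item from the benchmarks used in the paper.

```python
def lm_score(sentence):
    """Placeholder: return a language model's log-likelihood of `sentence`."""
    raise NotImplementedError

def pair_correct(twin_a, twin_b, candidates, gold_a, gold_b):
    """Credit the model only if it resolves *both* twin sentences correctly."""
    pick = lambda twin: max(candidates, key=lambda c: lm_score(twin.format(c)))
    return pick(twin_a) == gold_a and pick(twin_b) == gold_b

# Classic schema: changing one word flips which candidate the pronoun refers to.
twin_a = "The trophy doesn't fit in the suitcase because the {} is too big."
twin_b = "The trophy doesn't fit in the suitcase because the {} is too small."
# Example usage (requires a real lm_score):
# pair_correct(twin_a, twin_b, ["trophy", "suitcase"], "trophy", "suitcase")
```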