Epoch 1️⃣3️⃣ : This week in ML (+ Bioinformatics 🧬 and Astronomy 🌌)
CANINE, AMP, HOD and more ...
Abstract: Pipelined NLP systems have largely been superseded by end-to-end neural modeling, yet nearly all commonly-used models still require an explicit tokenization step. While recent tokenization approaches based on data-derived subword lexicons are less brittle than manually engineered tokenizers, these techniques are not equally suited to all languages, and the use of any fixed vocabulary may limit a model's ability to adapt. In this paper, the authors present CANINE, a neural encoder that operates directly on character sequences, without explicit tokenization or vocabulary, and a pre-training strategy that operates either directly on characters or optionally uses subwords as a soft inductive bias.
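To make the "no explicit tokenization or vocabulary" idea concrete, here is a minimal sketch that feeds a model Unicode code points instead of subword IDs. The function names and the toy vocabulary are illustrative assumptions, not CANINE's actual code, and the sketch covers only the input encoding, not the encoder itself.

```python
def encode_chars(text: str) -> list[int]:
    """Map each character to its Unicode code point; no vocabulary is needed."""
    return [ord(ch) for ch in text]


def decode_chars(ids: list[int]) -> str:
    """Invert encode_chars: code points back to text."""
    return "".join(chr(i) for i in ids)


def encode_with_vocab(text: str, vocab: dict[str, int], unk_id: int = 0) -> list[int]:
    """Toy fixed-vocabulary encoder: unseen whitespace-separated tokens collapse to unk_id."""
    return [vocab.get(tok, unk_id) for tok in text.split()]


# Hypothetical two-word vocabulary, for contrast only.
toy_vocab = {"hello": 1, "world": 2}

char_ids = encode_chars("hello wörld")
vocab_ids = encode_with_vocab("hello wörld", toy_vocab)
```

The character-level IDs round-trip losslessly for any input, whereas the fixed vocabulary maps the unseen word "wörld" to the unknown ID and loses it.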
Abstract: Synthesizing graceful and life-like behaviours for physically simulated characters has been a fundamental challenge in computer animation. Data-driven methods that leverage motion tracking are a prominent class of techniques for producing high fidelity motions for a wide range of behaviours. However, the effectiveness of these tracking-based methods often hinges on carefully designed objective functions, and when applied to large and diverse motion datasets, these methods require significant additional machinery to select the appropriate motion for the character to track in a given scenario. In this work, the authors propose to obviate the need to manually design imitation objectives and mechanisms for motion selection by utilizing a fully automated approach based on adversarial imitation learning.
ML + Bioinformatics 🧬
Abstract: The authors develop and rigorously evaluate a deep-learning-based system that can accurately classify skin conditions while detecting rare conditions for which there is not enough data available to train a confident classifier. They frame this task as an out-of-distribution (OOD) detection problem. Their novel approach, hierarchical outlier detection (HOD), assigns multiple abstention classes for each training outlier class and jointly performs a coarse classification of inliers vs. outliers, along with fine-grained classification of the individual classes. They demonstrate the effectiveness of the HOD loss in conjunction with modern representation learning approaches (BiT, SimCLR, MICLe) and explore different ensembling strategies for further improving the results.
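A rough sketch of how a joint coarse plus fine-grained objective of this kind could look for a single example. The class layout (inlier classes first, outlier abstention classes after), the `coarse_weight` value, and the function names are assumptions for illustration; the paper's exact formulation may differ.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def hod_loss(logits: list[float], label: int, num_inlier: int,
             coarse_weight: float = 0.1) -> float:
    """Fine-grained cross-entropy plus a weighted coarse inlier-vs-outlier term.

    Assumed layout: classes [0, num_inlier) are inliers, the remaining
    classes are outlier (abstention) classes.
    """
    probs = softmax(logits)
    # Fine-grained term: cross-entropy over all individual classes.
    fine = -math.log(max(probs[label], 1e-12))
    # Coarse term: total probability mass on the correct side (inlier or outlier).
    if label >= num_inlier:
        p_side = sum(probs[num_inlier:])
    else:
        p_side = sum(probs[:num_inlier])
    coarse = -math.log(max(p_side, 1e-12))
    return fine + coarse_weight * coarse


def outlier_score(logits: list[float], num_inlier: int) -> float:
    """OOD score: summed probability of the outlier abstention classes."""
    return sum(softmax(logits)[num_inlier:])
```

At inference time, a sample could be flagged as out-of-distribution when `outlier_score` exceeds a threshold chosen on validation data.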
Effect of Radiology Report Labeler Quality on Deep Learning Models for Chest X-Ray Interpretation
Abstract: Although deep learning models for chest X-ray interpretation are commonly trained on labels generated by automatic radiology report labelers, the impact of improvements in report labeling on the performance of chest X-ray classification models has not been systematically investigated. In this short paper, the authors compare the CheXpert, CheXbert, and VisualCheXbert labelers on the task of extracting accurate chest X-ray image labels from radiology reports, reporting that the VisualCheXbert labeler outperforms the CheXpert and CheXbert labelers. Next, after training image classification models using labels generated from the different radiology report labelers on the CheXpert dataset, they show that an image classification model trained on labels from the VisualCheXbert labeler outperforms image classification models trained on labels from the CheXpert and CheXbert labelers. TL;DR: Radiology report labelers have improved significantly over the last couple of years, and these improvements do translate to better chest X-ray diagnosis models.
Astroinformatics 🌌
CNN Architecture Comparison for Radio Galaxy Classification
Abstract: The morphological classification of radio sources is important to gain a full understanding of galaxy evolution processes and their relation with local environmental properties. Furthermore, the complex nature of the problem, its appeal for citizen scientists and the large data rates generated by existing and upcoming radio telescopes combine to make the morphological classification of radio sources an ideal test case for the application of machine learning techniques. One approach that has shown great promise recently is Convolutional Neural Networks (CNNs). The literature, however, lacks two major things when it comes to CNNs and radio galaxy morphological classification. Firstly, a proper analysis is needed of whether overfitting occurs when training CNNs to perform radio galaxy morphological classification using a small curated training set. Secondly, a good comparative study of the practical applicability of the CNN architectures in the literature is required. Both of these shortcomings are addressed in this paper.
Abstract: Observations suggest that satellite quenching plays a major role in the build-up of passive, low-mass galaxies at late cosmic times. Studies of low-mass satellites, however, are limited by the ability to robustly characterize the local environment and star-formation activity of faint systems. In an effort to overcome the limitations of existing data sets, the authors utilize deep photometry in Stripe 82 of the Sloan Digital Sky Survey, in conjunction with a neural network classification scheme, to study the suppression of star formation in low-mass satellite galaxies in the local Universe. Similar to the results of previous studies of the Local Group, the observed increase in the quenched fraction at low satellite masses may correspond to an increase in the efficacy of ram-pressure stripping as a quenching mechanism in groups.
Explainable, Interpretable, Bias and Ethics in AI
The Chinese Approach to AI: An Analysis of Policy, Ethics, and Regulation by Dr. Marianna Ganapini
Teaching Data Ethics: Foundations and Possibilities from Engineering and Computer Science Ethics Education by Anna Lauren Hoffmann and Katherine Cross
Interesting Events and News 📰
Data Preparation and Feature Engineering in ML | Google Developers Course
Artificial Intelligence and Healthcare: From Sci-Fi to Reality by Marcel Hedman
Big Data To Good Data: Andrew Ng Urges ML Community To Be More Data-Centric And Less Model-Centric
DataDev Hackathon 2021 | HACK and customize Tableau! | Deadline: May 31
Articles and Resources 📃 I liked
There’s more to mathematics than rigour and proofs by Terence Tao
A look under the hood: how branches work in Git by Tobias Günther
Why Silicon Valley's most astute critics are all women by John Naughton
Recommended Podcasts 🎧
The Overflow Podcast | Web programming with nothing but Python
The Real Python Podcast | Building a Neural Network and How to Write Tests in Python
DataTalks.Club | Shifting Career from Analytics to Data Science - Andrada Olteanu