Anthony Song

  • ✉ Email: abs343 [at] cornell [dot] edu

I am an undergraduate student at Cornell University, pursuing dual degrees in Electrical Engineering and Computer Science. I have interned at AMD, where I contributed to verification and diagnostics tools for AI Engine compilers. Outside of academics, I am a big fan of basketball, sketching, origami, photography, and cooking. I am also a big foodie, so feel free to reach out if you have any recommendations!

I am fortunate to be advised by Prof. Zhiru Zhang in the Computer Systems Laboratory. Previously, I worked with Prof. Tapomayukh Bhattacharjee at EmPRISE Lab and Prof. Maja Matarić at the Interaction Lab.

My research focuses on building compilers, software systems, and hardware accelerators for efficient computation. I'm especially interested in hardware–software co-design: developing domain-specific languages, runtimes, and toolchains that enable productive programming of heterogeneous hardware through coordinated design across system layers. I’m excited to apply these ideas to machine learning, AI, and robotics, where end-to-end co-design can unlock major gains in performance and efficiency.

News

Education

Cornell University

B.S. in Electrical Engineering & Computer Science

Aug 2023 – May 2027

Work Experience

Zhang Research Group — Cornell University

Research Assistant

Hardware Accelerators for Linear Transformations

Advisor: Zhiru Zhang

Aug 2024 – Present

Advanced Micro Devices

AI Software Engineer Intern

Automated Diagnostics Tools for AI Engine Compilers

Mentors: Keshav Gurushankar, Bin Tu

May 2025 – Aug 2025

EmPRISE Lab — Cornell University

Research Assistant

Robot Systems Design for Assisted Feeding

Advisor: Tapomayukh Bhattacharjee

Dec 2023 – May 2025

Interaction Lab — University of Southern California

Research Intern

Visual Simultaneous Localization and Mapping

Advisor: Maja Matarić

May 2024 – Aug 2024

Publications

FEAST: A Flexible Mealtime-Assistance System Towards In-the-Wild Personalization

Rajat Kumar Jenamani, Tom Silver, Ben Dodson, Shiqin Tong, Anthony Song, Yuting Yang, Ziang Liu, Benjamin Howe, Aimee Whitneck, Tapomayukh Bhattacharjee

RSS, 2025

(Best Paper Award, Best Systems Paper Finalist)

Mealtime assistance is a critical activity of daily living (ADL) for individuals with motor impairments. Existing robotic systems typically require extensive customization and fine-tuning for each user, limiting their real-world deployment. We present FEAST, a flexible mealtime-assistance system that enables care recipients to personalize their feeding experience in-the-wild with minimal researcher intervention. FEAST incorporates adaptive learning mechanisms that allow the system to learn from user preferences and feedback across diverse in-home scenarios. Our system demonstrates significant improvements in user satisfaction and eating independence compared to traditional approaches, enabling more widespread adoption of assistive feeding technology.

Projects

NanoForge

Coming Soon...

ABAX

ABAX: ASIC Backend for Allo in XLS

Allo-to-XLS Compiler Backend to Support ASIC Flow

HLS Compilers · MLIR · LLVM · ASIC · FPGA

With the end of Dennard scaling, there has been a significant increase in the development of special-purpose hardware accelerators designed to meet the ever-growing demand for compute. While these accelerators, typically implemented on Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs), offer superior performance and energy efficiency, their design process is arduous and complex. High-Level Synthesis (HLS) frameworks aim to mitigate this challenge by automating the translation of algorithmic descriptions into hardware, significantly lowering the barrier to entry for hardware designers. Allo is one such composable HLS framework targeting spatial accelerator design, allowing designs written in Python to be lowered to FPGA RTL. In this work, we introduce ABAX, an extension that expands Allo's backend support to target Google's Accelerated Hardware Synthesis (XLS) toolchain. ABAX acts as a bridge, transpiling Allo designs to XLS's two frontends, DSLX and XLS[cc]. To validate this flow, we implement five representative kernels (scalar add, vector-vector add (vvadd), matrix-vector multiplication (GEMV), matrix multiplication (GEMM), and systolic arrays) and successfully lower them to ASIC RTL. This work demonstrates Allo's novel capability to target ASIC flows, unifying its composable programming model across both FPGA and ASIC targets.
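To illustrate the dataflow of the last kernel, here is a minimal plain-Python simulation of an output-stationary systolic GEMM. This is a sketch written for this page, not ABAX or Allo code, and the name `systolic_gemm` is invented for the example:

```python
def systolic_gemm(A, B):
    """Simulate an output-stationary systolic array computing C = A @ B.

    PE (i, j) holds accumulator C[i][j]. Row i of A enters from the left
    skewed by i cycles; column j of B enters from the top skewed by j
    cycles, so the pair (A[i][s], B[s][j]) meets at PE (i, j) at time
    s + i + j.
    """
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]
    for t in range(k + n + m - 2):      # enough cycles to drain the array
        for i in range(n):
            for j in range(m):
                s = t - i - j           # operand index arriving at PE (i, j)
                if 0 <= s < k:
                    C[i][j] += A[i][s] * B[s][j]
    return C
```

In an actual spatial design, the inner two loops become a grid of physical PEs and the time loop becomes the clock; the skewed indexing here mimics the staggered injection of operands at the array's edges.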
Superscalar Processor

A Quad-Issue Superscalar, 5-Stage Pipelined Processor in SystemVerilog

Computer Architecture · Out-of-Order Execution · Multi-Core

In Lab 4, we design and evaluate two systems: a single-core baseline and a multi-core alternative with private instruction caches and a shared data cache connected via a ring network. We also design and evaluate software for both. The baseline pairs a single 5-stage pipelined processor, connected to instruction and data caches, with a single-threaded sorting algorithm. The alternative instantiates four pipelined processors, each with a private instruction cache, sharing a banked data cache over a ring network and running a corresponding multi-threaded sorting algorithm. This assignment connects to a fundamental theme in computer architecture: the multi-core era. With the end of Dennard scaling around 2005, which exposed fundamental limits on single-core performance, computer architects turned to exploiting parallelism. In this lab, we quantify both the benefits and the drawbacks of such multi-core designs.
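The software side of the alternative design can be sketched in a few lines. The Python below uses threads to mimic the four cores, each sorting a private chunk before a final merge; it is illustrative only (the lab uses SystemVerilog hardware and its own software), and `parallel_sort` is a name made up for this example:

```python
import threading
from heapq import merge

def parallel_sort(data, num_workers=4):
    """Sort by splitting work across workers, mirroring the 4-core design:
    each 'core' sorts a private chunk, then the chunks are merged."""
    chunk = (len(data) + num_workers - 1) // num_workers
    parts = [data[i * chunk:(i + 1) * chunk] for i in range(num_workers)]
    threads = [threading.Thread(target=part.sort) for part in parts]
    for t in threads:
        t.start()
    for t in threads:       # wait for every 'core' to finish its chunk
        t.join()
    return list(merge(*parts))  # k-way merge of the sorted chunks
```

The final merge is the sequential portion of the algorithm; just as in the lab, it bounds the speedup achievable from adding cores (Amdahl's law).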
BlossomNav

Visual Odometry Algorithm + Software System for Mobile Socially Assistive Robots

Embedded OS · SLAM · Visual Odometry · Robotics

Efficient Lane Detector for Autonomous Model Cars

ISLPED 2022 Design Contest Submission — Pruned ResNet18 Lane Detector + Model Autonomous Car

Computer Vision · Neural Networks · Embedded Systems · FPGA

In recent years, autonomous driving has attracted significant attention. Creating such automation poses many technical challenges, such as fast recognition of objects in the environment. A lane-detection system is a crucial component of connected and autonomous vehicles. Although deep learning models are the state of the art for detecting lanes from camera footage collected by vehicle sensors, they require extensive computation and memory to detect and track lanes, which restricts their applicability. We apply a model compression technique to prune the weights of a well-trained ResNet18 lane detector. The compressed model can run on a power-efficient TX2 or an FPGA mounted on an F1Tenth model car. We demonstrate this system using our assembled model car and report detection accuracy and efficiency on benchmark datasets.
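Magnitude-based pruning, a common form of the kind of model compression described above, can be sketched in a few lines of plain Python. This is a simplified illustration, not the contest submission, and `magnitude_prune` is a name invented for the example:

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of a layer's weights.

    weights:  flat list of floats (e.g. one conv layer's kernel, flattened)
    sparsity: fraction in [0, 1] of weights to remove
    """
    n_prune = int(len(weights) * sparsity)
    if n_prune == 0:
        return list(weights)
    # Threshold = magnitude of the n_prune-th smallest |w|; everything at
    # or below it is zeroed (ties may prune slightly more than requested).
    threshold = sorted(abs(w) for w in weights)[n_prune - 1]
    return [0.0 if abs(w) <= threshold else w for w in weights]
```

In practice the zeroed weights are kept in a sparse format or masked during retraining, which is what reduces the compute and memory footprint on the embedded target.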

Teaching