I am a Data Scientist at Dataminr, where I analyze the full Twitter firehose. Before Dataminr, I was an intern at Oracle Labs East, where I contributed to large-scale machine learning inference using GPU programming. (github) (resume)
I recently completed a master's degree at the University of Massachusetts Amherst, where I worked in the IESL lab with Andrew McCallum.
I completed my undergraduate education in the Computer Science department at Queens College, City University of New York. There I worked in a number of labs, mostly the Blender Lab (now at RPI) with Heng Ji.
My interests lie in machine learning methods, most recently with applications in natural language processing. I am most fascinated by methods that prevent automatically learned models from making mistakes that humans would not make, and by methods that help those models recover when they do.
My most recent work at UMass involves automatic field extraction from citation strings in research papers. Machine learning methods for this task often use only local information about the labeling of the citation string. In contrast, humans can use the fact that a field such as the volume number is unlikely to appear twice in a citation. I am investigating constrained inference of this kind in order to bring the performance of models on such tasks up to human levels.
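To make the idea of constrained inference concrete, here is a minimal toy sketch in Python. It is not the method from the publications below: the label names, the scores, the penalty value, and the brute-force search are all assumptions chosen for illustration. It compares a purely local labeling with one that pays a soft penalty each time a field such as volume appears in more than one segment.

```python
from itertools import product


def count_segments(labels, field):
    """Count maximal contiguous runs of `field` in a label sequence."""
    runs = 0
    prev = None
    for lab in labels:
        if lab == field and prev != field:
            runs += 1
        prev = lab
    return runs


def local_best(scores):
    """Pick the highest-scoring label for each token independently."""
    return [max(tok, key=tok.get) for tok in scores]


def constrained_best(scores, unique_fields, penalty):
    """Exhaustively search all labelings (toy sizes only).

    The objective is the sum of local scores minus `penalty` for every
    extra segment of a field that is expected to appear at most once.
    """
    label_set = sorted(scores[0])
    best, best_val = None, float("-inf")
    for cand in product(label_set, repeat=len(scores)):
        val = sum(tok[lab] for tok, lab in zip(scores, cand))
        for field in unique_fields:
            val -= penalty * max(0, count_segments(cand, field) - 1)
        if val > best_val:
            best, best_val = list(cand), val
    return best


# Hypothetical per-token scores for the three tokens "12", ",", "34".
EXAMPLE_SCORES = [
    {"volume": 2.0, "pages": 0.5, "other": 0.1},
    {"volume": 0.1, "pages": 0.2, "other": 2.0},
    {"volume": 1.0, "pages": 0.8, "other": 0.1},
]

if __name__ == "__main__":
    print(local_best(EXAMPLE_SCORES))
    print(constrained_best(EXAMPLE_SCORES, {"volume"}, penalty=1.0))
```

On these made-up scores, the local labeling marks both numbers as volume, while the penalized labeling keeps the first as volume and relabels the second as pages. Exhaustive search is only feasible for a handful of tokens; a real system would need a more scalable inference procedure.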
As an undergrad, I worked on improving relation extraction by performing joint inference over a large number of automatically collected relations. We built a system that can detect unlikely collections of relations in a network. For example, the system can learn that a company is unlikely to have been founded by somebody who does not live in the country where the company is located. The system then finds the best configuration of relations that satisfies these learned constraints (joint work with Qi Li).
I’ve also done undergrad research at the SimStudent Lab at CMU HCII (NSF REU), on the MetroBotics project at the Agents Lab at Brooklyn College (NSF REU), and at the Brain Networks Lab at Texas A&M (NSF REU).
I also enjoy creating useful and functional user-facing systems that interact with complex machinery. In the past I’ve created a web-based system for browsing networks of entities and relations automatically extracted by machine learning systems. I’ve also produced a web-based browser of 3D brain scans that researchers can annotate collaboratively.
Publications
- Learning Soft Linear Constraints with Application to Citation Field Extraction. Sam Anzaroot, Alexandre Passos, David Belanger, Andrew McCallum. Proc. of the 52nd Annual Meeting of the Association for Computational Linguistics (ACL 2014), 2014.
- A New Dataset for Fine-Grained Citation Field Extraction. Sam Anzaroot, Andrew McCallum. ICML Workshop on Peer Reviewing and Publishing Models (PEER), 2013.
- Joint Inference for Cross-document Information Extraction. Qi Li, Sam Anzaroot, Wen-Pin Lin, Xiang Li and Heng Ji. Proc. 20th ACM Conference on Information and Knowledge Management (CIKM 2011). 2011.
- Cross-lingual Slot Filling from Comparable Corpora. Matthew Snover, Xiang Li, Wen-Pin Lin, Zheng Chen, Suzanne Tamang, Mingmin Ge, Adam Lee, Qi Li, Hao Li, Sam Anzaroot, Heng Ji. Proc. ACL 2011 Workshop on Building and Using Comparable Corpora. 2011.
- Developing a Framework for Team-based Robotics Research. Elizabeth Sklar, Simon Parsons, Susan Epstein, Arif T. Ozgelen, George Rabanca, Sam Anzaroot, Joel Gonzalez, Jesse Lopez, Mitch Lustig, Linda Ma, Mark Manashirov, J. Pablo Munoz, S. Bruno Salazar, Miriam Schwartz. AAAI 2010 Robotics Exhibition and Workshop. 2011
- Search, Mining and Browsing Self-Boosting Multi-Dimensional Text-Rich Information Networks. Sam Anzaroot, Javier Artiles, Hao Li, Qi Li, Zheng Chen, Suzanne Tamang, Heng Ji, Hongbo Deng, Jiawei Han. The Network Science Collaborative Technology Alliance Annual Meeting. 2011 (presentation)