Assemblage

A dataset of, and a tool for building, corpuses of binary executables.


Assemblage is both a dataset (of x86-64 ELF and Windows PE executables) and a cloud-based distributed system for building large, diverse corpuses of binaries. Assemblage runs continuously on AWS: it crawls GitHub for available repositories (C and C++ code, for now), then configures each one, diversifies the build across compiler and flag variants, and builds binary artifacts. To date, Assemblage has built over 890k Windows PE binaries and 428k Linux ELF binaries.

Assemblage’s high-level design looks like this:

Assemblage's high-level system design
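To make the diversification step concrete, here is a minimal sketch of how one repository could be expanded into several build tasks, one per compiler and flag combination. The compiler names, optimization levels, task fields, and the `build_variants` helper are all hypothetical and chosen for illustration; Assemblage's actual configuration may differ.

```python
from itertools import product

# Hypothetical diversification axes; not taken from Assemblage's real configuration.
COMPILERS = ["gcc", "clang"]
OPT_LEVELS = ["-O0", "-O1", "-O2", "-O3"]


def build_variants(repo_url, compilers=COMPILERS, opt_levels=OPT_LEVELS):
    """Return one build-task dict per (compiler, optimization level) pair."""
    return [
        {"repo": repo_url, "compiler": compiler, "flags": [opt]}
        for compiler, opt in product(compilers, opt_levels)
    ]


# Example repository URL is a placeholder.
tasks = build_variants("https://github.com/example/project")
print(len(tasks))  # 2 compilers x 4 optimization levels = 8 tasks
```

Building the full cross product keeps every variant independent, so each task can be handed to a separate worker.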

March ’24 Dataset Snapshot

As of March ’24, our dataset looks roughly like this (see our datasheet for more information):

Source   Platform   License    Total binaries   Repositories   Functions   Functions (w/ source code)
GitHub   Windows    Mixed      890k             172k           298M        20M
GitHub   Windows    Licensed   62k              12k            38M         3M
GitHub   Linux      Mixed      428k             48k            316M        N/A
GitHub   Linux      Licensed   211k             13k            186M        N/A
vcpkg    Windows    Licensed   29k              1k             48M         N/A

Publicly-Hosted Snapshots

Here we include only the subset of binaries for which permissive licenses can be ascertained. Please contact us if you would like recipes for unlicensed repositories. PDB files are too large to be included in our publicly-hosted repositories; datasets with PDB files are also available upon request.

Each dataset is split into (a) an SQLite file, which holds metadata, and (b) a dump of the binaries themselves. Please see our datasheet for a description of the database organization. We are working on additional tutorials now; please reach out if you are interested in specific types of queries.
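Until the datasheet is at hand, a quick way to get oriented is to ask the SQLite file itself which tables and columns it contains. The sketch below assumes nothing about Assemblage's schema: it only reads SQLite's built-in `sqlite_master` catalog and `PRAGMA table_info`. The helper name `describe_database` is ours, not part of Assemblage.

```python
import sqlite3


def describe_database(db_path):
    """Map each table name in an SQLite file to its list of column names."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info yields one row per column; index 1 is the column name.
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [column[1] for column in columns]
        return schema
    finally:
        conn.close()
```

Calling `describe_database` with the path to a downloaded SQLite file returns a dictionary you can print to see which tables to query. Note that `sqlite3.connect` creates an empty file if the path does not exist, so double-check the path first.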

We are currently hosting on AWS; please contact us if you plan to consume large amounts of bandwidth.

  1. 62k Windows PE Binaries (Processed to SQLite database, last updated: Apr 14th 2024):

  2. Windows vcpkg dataset (Processed to SQLite database, 29k):

  3. Linux GitHub dataset (Processed to SQLite database, 211k):

GitHub Repo / Bug Reports

Assemblage’s public source is kept here.

Please report bugs via GitHub.

Contact Us / Citations

Assemblage is primarily developed at Syracuse University, by a team that includes:

  • Chang Liu, cliu57@syr.edu, Syracuse University (PhD student)
  • Yihao Sun, ysun67@syr.edu, Syracuse University (PhD student)
  • Kristopher Micinski, kkmicins@syr.edu, Syracuse University (Asst. Prof.)

Please reach out if you are using the Assemblage dataset in your work, or if you would like to chat about using it for binary analysis.

Please cite our arXiv draft (link forthcoming).