Assemblage is both a dataset (of x86-64 ELF and Windows PE executables) and a cloud-based distributed system for building large, diverse corpora of binaries. Assemblage runs continuously on AWS, crawling GitHub for available repositories (of C and C++ code, for now) and then configuring, diversifying (across compiler/flag variants), and building binary artifacts. To date, Assemblage has built over 890k Windows PE binaries, along with 428k Linux ELF binaries.
Assemblage’s high-level design looks like this:
March ’24 Dataset Snapshot
As of March ’24, our dataset looks roughly like this (see our datasheet for more information):
Source | Platform | License | Binaries | Repositories | Functions | Functions (w/ source code) |
---|---|---|---|---|---|---|
GitHub | Windows | Mixed | 890k | 172k | 298M | 20M |
GitHub | Windows | Licensed | 62k | 12k | 38M | 3M |
GitHub | Linux | Mixed | 428k | 48k | 316M | N/A |
GitHub | Linux | Licensed | 211k | 13k | 186M | N/A |
vcpkg | Windows | Licensed | 29k | 1k | 48M | N/A |
Publicly-Hosted Snapshots
Here we include only the subset of binaries for which permissive licenses can be ascertained. Please contact us if you would like recipes for unlicensed repositories. PDB files are too large to be included in our publicly-hosted repositories; datasets with PDB files are also available upon request.
Each dataset is split into (a) an SQLite file, which holds the metadata, and (b) a dump of the binaries themselves. Please see our datasheet for a description of the database organization. We are working on additional tutorials now; please reach out if you are interested in specific types of queries.
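Until those tutorials are available, one way to get oriented in a downloaded SQLite file is to ask SQLite itself which tables and columns it contains. The following is a minimal sketch using only Python's standard `sqlite3` module. It deliberately assumes nothing about Assemblage's actual table or column names (the datasheet is the authoritative description); the helper names `describe_database` and `count_rows` are hypothetical and chosen only for this illustration.

```python
import sqlite3
from pathlib import Path


def _open_read_only(db_path):
    """Open a SQLite file read-only so the downloaded database is never modified."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def describe_database(db_path):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite file."""
    conn = _open_read_only(db_path)
    try:
        # sqlite_master lists every schema object stored in the file
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        schema = {}
        for table in tables:
            quoted = table.replace('"', '""')
            # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
            columns = conn.execute(f'PRAGMA table_info("{quoted}")').fetchall()
            schema[table] = [(col[1], col[2]) for col in columns]
        return schema
    finally:
        conn.close()


def count_rows(db_path, table):
    """Return the number of rows in one table (the name is quoted as an identifier)."""
    conn = _open_read_only(db_path)
    try:
        quoted = table.replace('"', '""')
        return conn.execute(f'SELECT COUNT(*) FROM "{quoted}"').fetchone()[0]
    finally:
        conn.close()
```

Calling `describe_database("path/to/downloaded.sqlite")` (a placeholder path) returns a dictionary that can be compared against the datasheet before writing more specific queries.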
We are currently hosting on AWS; please contact us if you plan to consume large amounts of bandwidth.
1. Windows GitHub dataset (62k Windows PE binaries, processed to SQLite database, last updated: Apr 14th 2024):
   - SQLite database (12 GB):
   - Binary dataset (7 GB):
2. Windows vcpkg dataset (processed to SQLite database, 29k):
   - SQLite database (3.3 GB):
   - Binary dataset (18 GB):
3. Linux GitHub dataset (processed to SQLite database, 211k):
   - SQLite database (23 MB):
   - Binary dataset (72 GB):
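After downloading and unpacking a binary dump, a quick sanity check is to confirm that the files really are PE or ELF images. The sketch below classifies files by their magic bytes: ELF files begin with `\x7fELF`, and PE files begin with an `MZ` DOS header whose 4-byte field at offset `0x3C` points to a `PE\0\0` signature. It assumes the dump unpacks to an ordinary directory tree, which is an assumption and not something the dataset description states; the function names `classify_binary` and `summarize_dump` are hypothetical.

```python
import struct
from collections import Counter
from pathlib import Path


def classify_binary(path):
    """Return 'ELF', 'PE', or 'unknown' based on the file's magic bytes."""
    with open(path, "rb") as f:
        header = f.read(0x40)
        if header[:4] == b"\x7fELF":
            return "ELF"
        if header[:2] == b"MZ" and len(header) >= 0x40:
            # e_lfanew (little-endian uint32 at 0x3C) holds the PE signature offset
            (pe_offset,) = struct.unpack_from("<I", header, 0x3C)
            f.seek(pe_offset)
            if f.read(4) == b"PE\x00\x00":
                return "PE"
    return "unknown"


def summarize_dump(root):
    """Count files of each kind under a directory tree (layout is assumed)."""
    counts = Counter()
    for entry in Path(root).rglob("*"):
        if entry.is_file():
            counts[classify_binary(entry)] += 1
    return counts
```

A large `unknown` count would suggest that an archive was only partially extracted or that the directory also holds non-binary files.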
GitHub Repo / Bug Reports
Assemblage’s public source is kept here.
Please report bugs via GitHub.
Contact Us / Citations
Assemblage is primarily developed at Syracuse University, by a team that includes:
- Chang Liu, cliu57@syr.edu, Syracuse University (PhD student)
- Yihao Sun, ysun67@syr.edu, Syracuse University (PhD student)
- Kristopher Micinski, kkmicins@syr.edu, Syracuse University (Asst. Prof.)
Please reach out if you are using the Assemblage dataset in your work or would like to discuss how you use it for binary analysis.
Please cite our arXiv draft (link forthcoming).