Datasets
Instant access to datasets — no pipeline, no API key negotiation, no rate limits. Query with SQL, pay fractions of a cent per call.
NVD — National Vulnerability Database
nvdEvery CVE published by NIST, including CVSS scores, severity ratings, CWE weaknesses, and advisory references. Updated periodically from the NVD API v2.0.
cve| Column | Type | Description |
|---|---|---|
| id | string | CVE identifier (e.g. CVE-2024-12345) |
| published | string | Publication date |
| lastModified | string | Last modification date |
| vulnStatus | string | Analyzed, Modified, Awaiting Analysis |
| descriptions[0].value | string | English description |
| metrics.cvssMetricV31[0] .cvssData.baseScore | float | CVSS v3.1 base score (0–10) |
| metrics.cvssMetricV31[0] .cvssData.baseSeverity | string | LOW / MEDIUM / HIGH / CRITICAL |
| metrics.cvssMetricV31[0] .cvssData.vectorString | string | Full CVSS vector string |
| weaknesses[0].description[0].value | string | Primary CWE identifier |
| references | array | Advisory URLs and sources |
Note: Timestamp columns require backtick-quoted ISO 8601 literals for range comparisons
— single-quoted strings return zero rows. Example: `2024-01-01T00:00:00Z`
-- Critical CVEs published in 2024 SELECT id, published, metrics.cvssMetricV31[0].cvssData.baseScore AS score, descriptions[0].value AS description FROM cve WHERE metrics.cvssMetricV31[0].cvssData.baseSeverity = 'CRITICAL' AND published >= `2024-01-01T00:00:00Z` ORDER BY score DESC LIMIT 20
OSV — Open Source Vulnerabilities
osvOSV.dev advisories flattened to one row per (advisory, affected package). Covers PyPI, npm, Go, Maven, Rust, Debian, Alpine, and more. Superset of NVD — includes ecosystem-specific advisories (GHSA, PYSEC, RUSTSEC) with precise package-version ranges for direct dependency scanning.
advisories| Column | Type | Description |
|---|---|---|
| id | string | Advisory ID: CVE-..., GHSA-..., PYSEC-... |
| published | string | First published date |
| modified | string | Last modified date |
| withdrawn | string | Withdrawal date if retracted (optional) |
| aliases | array | Related IDs (e.g. CVE alias for a GHSA) |
| summary | string | Short description |
| severity_type | string | CVSS type (e.g. CVSS_V3) |
| severity_score | string | Full CVSS vector string |
| cwe_ids | array | CWE weakness identifiers |
| ecosystem | string | PyPI, npm, Go, Maven, crates.io, ... |
| package_name | string | Package name within the ecosystem |
| purl | string | Package URL (optional) |
| introduced | string | First affected version (0 = all versions before fixed) |
| fixed | string | First fixed version; absent if not yet fixed |
-- All advisories for a specific npm package SELECT id, introduced, fixed, summary FROM advisories WHERE ecosystem = 'npm' AND package_name = 'lodash' -- Unfixed critical advisories in PyPI (CVSS network-reachable, high impact) SELECT id, package_name, summary FROM advisories WHERE ecosystem = 'PyPI' AND severity_score ~ 'AV:N.*C:H' AND fixed IS MISSING LIMIT 50 -- Advisory counts by ecosystem SELECT ecosystem, COUNT(*) AS n FROM advisories GROUP BY ecosystem ORDER BY n DESC
SEC EDGAR — Financial Facts
secStructured XBRL financial facts from SEC filings (10-K, 10-Q, 8-K). One row per reported fact — every line item a public company files: revenue, net income, assets, shares outstanding, and thousands of other GAAP/IFRS concepts.
Note: The end column (period end date) is a reserved keyword in Sneller SQL.
Do not use it in SELECT or ORDER BY — filter by filed or fy instead.
edgar| Column | Type | Description |
|---|---|---|
| cik | string | SEC Central Index Key |
| ticker | string | Stock ticker symbol |
| company | string | Company name |
| taxonomy | string | us-gaap, ifrs-full, dei, ... |
| concept | string | XBRL concept (e.g. Revenues, NetIncomeLoss) |
| label | string | Human-readable label for the concept |
| unit | string | USD, shares, pure, ... |
| end | string | Period end date (reserved keyword — avoid in SELECT) |
| val | float | Reported value |
| form | string | Filing form: 10-K, 10-Q, 8-K, ... |
| fy | string | Fiscal year |
| fp | string | Fiscal period: FY, Q1, Q2, Q3 |
| filed | string | Filing date |
| accn | string | SEC accession number |
-- Annual revenue for a company across fiscal years SELECT ticker, fy, fp, val AS revenue, filed FROM edgar WHERE ticker = 'AAPL' AND concept = 'RevenueFromContractWithCustomerExcludingAssessedTax' AND form = '10-K' ORDER BY fy DESC LIMIT 10 -- Compare net income across S&P 500 for latest fiscal year SELECT ticker, company, val AS net_income, fy FROM edgar WHERE concept = 'NetIncomeLoss' AND form = '10-K' AND fy = 2024 AND unit = 'USD' ORDER BY net_income DESC LIMIT 20 -- Balance sheet: total assets vs total liabilities for one company SELECT ticker, fy, fp, concept, val, filed FROM edgar WHERE ticker = 'MSFT' AND concept IN ('Assets', 'Liabilities') AND form = '10-K' ORDER BY fy DESC LIMIT 10 -- Discover which concepts a company has reported (useful before writing queries) SELECT concept, label, unit, COUNT(*) AS filings FROM edgar WHERE ticker = 'TSLA' GROUP BY concept, label, unit ORDER BY filings DESC LIMIT 50 -- Quarterly earnings per share trend SELECT ticker, fy, fp, val AS eps, filed FROM edgar WHERE ticker = 'NVDA' AND concept = 'EarningsPerShareBasic' AND form = '10-Q' ORDER BY filed DESC LIMIT 12
Ethereum — On-chain Data
eth
On-chain Ethereum data sourced from the AWS Public Blockchain dataset.
All tables are partitioned by date with native
block_timestamp indexing — queries target only the relevant
daily partitions; no full-table scans needed for time-range queries.
transfers — decoded ERC-20 / ERC-721 Transfer events · ~70 GB scanned (1 year ingested)| Column | Type | Description |
|---|---|---|
| block_number | int | Block height |
| block_timestamp | string | Block timestamp (ISO 8601) |
| transaction_hash | string | Transaction hash |
| log_index | int | Log position within the transaction |
| token_address | string | Contract address of the token |
| from_address | string | Sender address |
| to_address | string | Recipient address |
| value | string | Transfer amount (raw; token decimals vary) |
-- Recent large USDC transfers (USDC = 6 decimals, $1M = 1e12) SELECT transaction_hash, from_address, to_address, "value" FROM transfers WHERE token_address = '0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48' AND CAST("value" AS FLOAT) > 1000000000000 AND block_timestamp >= '2025-01-01' LIMIT 20
blocks — one row per block (header, gas, base fee) · ~770 MB scanned (365 days ingested)| Column | Type | Description |
|---|---|---|
| number | int | Block height |
| timestamp | string | Block timestamp (ISO 8601) |
| hash | string | Block hash |
| miner | string | Validator / miner address |
| gas_used | int | Gas consumed by all transactions |
| gas_limit | int | Block gas limit |
| base_fee_per_gas | int | EIP-1559 base fee (wei); null pre-London |
| transaction_count | int | Number of transactions in the block |
| difficulty | int | Block difficulty (0 post-Merge) |
| size | int | Block size in bytes |
-- Average gas price and block utilisation over the last 7 days SELECT DATE(timestamp) AS day, AVG(base_fee_per_gas) / 1e9 AS avg_base_fee_gwei, AVG(gas_used * 100.0 / gas_limit) AS avg_utilisation_pct FROM blocks WHERE timestamp >= '2026-04-06' GROUP BY day ORDER BY day
contracts — one row per contract deployment · full history 2015 – present, ~9.6 GB scanned| Column | Type | Description |
|---|---|---|
| address | string | Deployed contract address |
| deployer_address | string | Address that deployed the contract |
| block_timestamp | string | Deployment timestamp (ISO 8601) |
| block_number | int | Block height of deployment |
| transaction_hash | string | Deployment transaction hash |
| is_erc20 | bool | Detected ERC-20 token contract |
| is_erc721 | bool | Detected ERC-721 NFT contract |
| function_sighashes | array | 4-byte function selectors present in bytecode |
-- Lookup a contract: who deployed it and when? SELECT deployer_address, block_timestamp, transaction_hash, is_erc20, is_erc721 FROM contracts WHERE address = '0xdac17f958d2ee523a2206206994597c13d831ec7'
transactions — one row per transaction (native ETH transfers, gas spend, contract calls) · ~136 GB, last 365 days ingested| Column | Type | Description |
|---|---|---|
| transaction_hash | string | Transaction hash (unique identifier) |
| block_number | int | Block height |
| block_timestamp | string | Block timestamp (ISO 8601) |
| transaction_index | int | Position of tx within the block |
| from_address | string | Sender address |
| to_address | string | Recipient address (null for contract creation) |
| value | int | Native ETH transferred (wei; 1 ETH = 1e18) |
| gas | int | Gas limit set by sender |
| gas_price | int | Gas price (wei/gas; for legacy type 0/1 txns) |
| nonce | int | Sender nonce |
| transaction_type | int | 0=legacy, 1=access-list, 2=EIP-1559 |
| max_fee_per_gas | int | EIP-1559 max total fee (wei/gas) |
| max_priority_fee_per_gas | int | EIP-1559 tip to validator (wei/gas) |
| receipt_status | int | 1=success, 0=reverted |
| receipt_gas_used | int | Actual gas consumed |
| receipt_effective_gas_price | int | Price actually paid (base fee + tip, wei/gas) |
| receipt_contract_address | string | Deployed contract address; null if not a creation |
| receipt_cumulative_gas_used | int | Cumulative gas used in block up to this tx |
-- Top 10 gas spenders yesterday (total ETH paid in fees) SELECT from_address, COUNT(*) AS tx_count, SUM(receipt_gas_used * receipt_effective_gas_price) / 1e18 AS eth_fees FROM transactions WHERE block_timestamp >= '2026-04-12' AND block_timestamp < '2026-04-13' GROUP BY from_address ORDER BY eth_fees DESC LIMIT 10
dex_swaps — Uniswap v2 & v3 Swap events · 365 daily partitions| Column | Type | Description |
|---|---|---|
| pool_address | string | Uniswap pool contract address |
| transaction_hash | string | Transaction hash |
| log_index | int | Log position within the transaction |
| block_number | int | Block height |
| block_timestamp | string | Block timestamp (ISO 8601) |
| protocol | string | v2 or v3 |
| sender | string | Swap initiator address |
| recipient | string | Swap recipient address |
| amount0 | int | v3: signed token0 delta (negative = token0 left pool) |
| amount1 | int | v3: signed token1 delta; null for v2 |
| amount0_in / amount0_out | int | v2: unsigned token0 in/out; null for v3 |
| amount1_in / amount1_out | int | v2: unsigned token1 in/out; null for v3 |
| sqrt_price_x96 | int | v3: post-swap sqrt(price) × 296; null for v2 |
| liquidity | int | v3: active liquidity at time of swap; null for v2 |
| tick | int | v3: current tick after swap; null for v2 |
-- Daily swap count and net USDC flow on the USDC/WETH 0.05% pool -- amount0 is in USDC raw units (6 dec); positive = USDC into pool (ETH bought) SELECT partition_date, COUNT(*) AS swap_count, SUM(amount0) / 1e6 AS net_usdc_flow FROM dex_swaps WHERE pool_address = '0x88e6a0c2ddd26feeb64f039a2c41296fcb3f5640' AND block_timestamp >= '2026-04-07' AND block_timestamp < '2026-04-14' GROUP BY partition_date ORDER BY partition_date LIMIT 10
lending — Aave v2/v3 and Compound v3 lending events · 365 daily partitions| Column | Type | Description |
|---|---|---|
| contract_address | string | Lending pool contract address |
| transaction_hash | string | Transaction hash |
| log_index | int | Log position within the transaction |
| block_number | int | Block height |
| block_timestamp | string | Block timestamp (ISO 8601) |
| protocol | string | aave_v2, aave_v3, or compound_v3 |
| event_type | string | supply, withdraw, borrow, repay, liquidation, supply_collateral, withdraw_collateral |
| asset | string | Token contract address (reserve asset) |
| user | string | User address |
| on_behalf_of | string | Beneficiary address (may differ from user) |
| amount | int | Raw token amount (apply asset decimals) |
| borrow_rate_mode | int | Aave borrow rate mode: 1=stable, 2=variable; null for other events |
| collateral_asset | string | Collateral token address; non-null for liquidation events only |
| collateral_amount | int | Collateral seized; non-null for liquidation events only |
-- Daily USDC supply volume by protocol (Aave v3, Compound v3) SELECT partition_date, protocol, COUNT(*) AS supply_events, SUM(amount) / 1e6 AS volume_usdc FROM lending WHERE event_type = 'supply' AND asset = '0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48' AND block_timestamp >= '2026-04-07' AND block_timestamp < '2026-04-14' GROUP BY partition_date, protocol ORDER BY partition_date LIMIT 20
lp_events — Uniswap v3 Mint and Burn (liquidity provision) events · 365 daily partitions| Column | Type | Description |
|---|---|---|
| pool_address | string | Uniswap v3 pool contract address |
| owner | string | Address that owns the position |
| tick_lower | int | Lower bound of the price range (tick) |
| tick_upper | int | Upper bound of the price range (tick) |
| transaction_hash | string | Transaction hash |
| log_index | int | Log position within the transaction |
| block_number | int | Block height |
| block_timestamp | string | Block timestamp (ISO 8601) |
| event_type | string | mint (add liquidity) or burn (remove liquidity) |
| sender | string | Caller address for mint; null for burn |
| amount | int | Liquidity units added or removed |
| amount0 | int | Token0 deposited (mint) or withdrawn (burn) |
| amount1 | int | Token1 deposited (mint) or withdrawn (burn) |
-- Net liquidity minted vs burned in USDC/WETH 0.05% pool (last 7 days) SELECT event_type, COUNT(*) AS events, SUM(amount0) / 1e6 AS usdc_total, SUM(amount1) / 1e18 AS weth_total FROM lp_events WHERE pool_address = '0x88e6a0c2ddd26feeb64f039a2c41296fcb3f5640' AND block_timestamp >= '2026-04-07' GROUP BY event_type
mev — MEV-boost relay payloads (block auction data) · 365 daily partitions · ~2.2M rows · ~0.6 GB scanned
Note: Sourced from the ultra_sound relay API. The date partition field is stored as a datetime internally — use backtick ISO 8601 literals for date range filters.
| Column | Type | Description |
|---|---|---|
| block_number | int | Block height |
| block_hash | string | Block hash |
| slot | int | Beacon chain slot number |
| timestamp | int | Unix timestamp of the block |
| date | date | Partition key (stored as datetime internally) |
| relay | string | Relay name (e.g. ultra_sound, flashbots, bloxroute_max) |
| proposer_fee_recipient | string | Validator fee recipient address |
| builder_pubkey | string | Block builder BLS public key |
| gas_used | int | Gas consumed by the block |
| gas_limit | int | Block gas limit |
| num_txs | int | Number of transactions in the block |
| value_wei | int | Block value paid to proposer (wei) |
| value_eth | float | Block value paid to proposer (ETH) |
-- Relay market share — block count and MEV earned (last 30 days) SELECT relay, COUNT(*) AS blocks, TRUNC(AVG(value_eth)*1000000)/1000000 AS avg_mev_eth, TRUNC(SUM(value_eth)*10000)/10000 AS total_mev_eth FROM mev WHERE date >= `2026-03-16T00:00:00Z` GROUP BY relay ORDER BY blocks DESC LIMIT 10 -- Top 10 highest-value MEV blocks all time SELECT date, block_number, relay, value_eth, num_txs FROM mev ORDER BY value_eth DESC LIMIT 10
Base — On-chain Data
base
On-chain data for Base, Coinbase's OP-Stack L2 rollup, sourced from the AWS
Public Blockchain dataset. All tables are partitioned by
partition_date (one partition per calendar day). Source
data has a ~7-day publishing lag. Transactions include L2-specific fee fields
(l1_fee, l2_fee, l1_gas_price) not
present on Ethereum mainnet.
transfers — decoded ERC-20 Transfer events · ~70 GB / 30 days| Column | Type | Description |
|---|---|---|
| block_number | int | Block height |
| block_timestamp | string | Block timestamp (ISO 8601) |
| transaction_hash | string | Transaction hash |
| log_index | int | Log position within the transaction |
| token_address | string | Contract address of the token |
| from_address | string | Sender address (42-char, 0x-prefixed) |
| to_address | string | Recipient address |
| value | int | Transfer amount (raw, token decimals vary) |
| partition_date | string | Partition key (YYYY-MM-DD) |
-- Recent USDC transfers on Base over $50 (USDC = 6 decimals) -- Update partition_date to a recent date within the last 30 days SELECT transaction_hash, from_address, to_address, "value" FROM transfers WHERE token_address = '0x833589fcd6edb6e08f4c7c32d4f71b54bda02913' AND CAST("value" AS FLOAT) > 50000000 AND partition_date IN ('2026-04-12', '2026-04-13') LIMIT 20
blocks — one row per block · ~0.6 GB / 30 days| Column | Type | Description |
|---|---|---|
| number | int | Block height |
| timestamp | string | Block timestamp (ISO 8601) |
| hash | string | Block hash |
| miner | string | Sequencer / fee recipient address |
| gas_used | int | Gas consumed by all transactions |
| gas_limit | int | Block gas limit |
| base_fee_per_gas | int | EIP-1559 base fee (wei) |
| transaction_count | int | Number of transactions in the block |
| size | int | Block size in bytes |
| partition_date | string | Partition key (YYYY-MM-DD) |
-- Average base fee and block utilisation over the last 7 days on Base SELECT DATE(timestamp) AS day, AVG(base_fee_per_gas) / 1e9 AS avg_base_fee_gwei, AVG(gas_used * 100.0 / gas_limit) AS avg_utilisation_pct FROM blocks WHERE partition_date >= '2026-04-06' GROUP BY day ORDER BY day
transactions — one row per transaction, with L2 fee fields · ~67 GB / 30 days| Column | Type | Description |
|---|---|---|
| transaction_hash | string | Transaction hash |
| block_number | int | Block height |
| block_timestamp | string | Block timestamp (ISO 8601) |
| from_address | string | Sender address |
| to_address | string | Recipient address (null for contract creation) |
| value | int | Native ETH transferred (wei) |
| gas | int | Gas limit set by sender |
| receipt_status | int | 1=success, 0=reverted |
| receipt_gas_used | int | Actual gas consumed |
| receipt_effective_gas_price | int | Price actually paid (wei/gas) |
| l1_fee | int | L1 data fee paid to Ethereum (wei) |
| l1_gas_price | int | L1 gas price at inclusion (wei/gas) |
| l1_fee_scalar | string | Scalar used to compute L1 fee |
| l2_fee | int | L2 execution fee (wei) |
| mint | int | ETH minted by L1-to-L2 deposit; null for regular txns |
| source_hash | string | Deposit source hash; non-null for L1-to-L2 deposits |
| partition_date | string | Partition key (YYYY-MM-DD) |
-- Top gas spenders on Base by total L1 + L2 fees paid SELECT from_address, COUNT(*) AS tx_count, SUM(l2_fee + COALESCE(l1_fee, 0)) / 1e18 AS total_eth_fees FROM transactions WHERE partition_date >= '2026-04-06' GROUP BY from_address ORDER BY total_eth_fees DESC LIMIT 10
Bitcoin
btcblocks — one row per block · ~6.6 GB scanned, full history 2009 – present| Column | Type | Description |
|---|---|---|
| number | int | Block height |
| timestamp | string | Block timestamp (ISO 8601); natively indexed |
| mediantime | string | Median time of last 11 blocks (ISO 8601); natively indexed |
| hash | string | Block hash |
| transaction_count | int | Number of transactions in the block |
| difficulty | float | Mining difficulty at this block |
| size | int | Block size in bytes |
| weight | int | Block weight (SegWit units) |
| miner | string | Coinbase nonce (miner tag) |
| previousblockhash | string | Parent block hash |
-- Bitcoin difficulty trend over the last 90 days SELECT partition_date, AVG(difficulty) AS avg_difficulty, SUM(transaction_count) AS daily_txns FROM blocks WHERE timestamp >= `2026-01-13T00:00:00Z` GROUP BY partition_date ORDER BY partition_date
outputs — one row per transaction output · 730 daily partitions (2024 – present) · ~115 GB scanned · partition key: block_timestamp
Note: "value" must be quoted in SQL — it is a reserved word in Sneller.
This table contains outputs only; inputs are not included so UTXO spent/unspent state is not available.
| Column | Type | Description |
|---|---|---|
| txid | string | Transaction ID |
| block_number | int | Block height containing this transaction |
| block_timestamp | string | Block timestamp (ISO 8601); natively indexed |
| output_index | int | Output index within the transaction (vout) |
| address | string | Recipient address |
| "value" | int | Output value in satoshis (1 BTC = 100,000,000 satoshis) |
| type | string | Output script type: P2PKH, P2SH, P2WPKH, P2WSH, P2TR, etc. |
| is_coinbase | bool | True if this output is a miner block reward |
-- Total BTC received by an address over the last 2 years SELECT SUM("value") / 1e8 AS total_btc, COUNT(*) AS output_count FROM outputs WHERE address = 'bc1qxy2kgdygjrsqtzq2n0yrf2493p83kkfjhx0wlh'
DeFi — Protocol Data
defi
DeFi protocol data sourced from the DeFiLlama API. The protocols
table is a current snapshot of all tracked protocols; tvl contains
daily total-value-locked history for protocols with TVL ≥ $10M at fetch time.
Note: The date partition
field in tvl is stored as a datetime internally — use backtick
ISO 8601 literals: `2026-04-15T00:00:00Z`
protocols — snapshot of all DeFiLlama protocols · ~7,300 rows · ~2 MB scanned| Column | Type | Description |
|---|---|---|
| slug | string | DeFiLlama protocol slug (unique identifier) |
| name | string | Protocol display name |
| category | string | Protocol category (e.g. Lending, DEX, Liquid Staking) |
| chain | string | Primary chain |
| chains | array | All chains where the protocol is deployed |
| tvl | float | Total value locked (USD) |
| change_1h | float | TVL change over last 1 hour (%) |
| change_1d | float | TVL change over last 24 hours (%) |
| change_7d | float | TVL change over last 7 days (%) |
| mcap | float | Market capitalisation (USD; optional) |
| fdv | float | Fully diluted valuation (USD; optional) |
| description | string | Short protocol description |
| url | string | Protocol website URL |
| string | Twitter handle (optional) | |
| gecko_id | string | CoinGecko identifier (optional) |
-- Top 20 protocols by TVL SELECT name, category, chain, TRUNC(tvl/1e9*1000)/1000 AS tvl_billions FROM protocols ORDER BY tvl DESC LIMIT 20 -- TVL by category (all protocols with TVL > 0) SELECT category, COUNT(*) AS protocols, TRUNC(SUM(tvl)/1e9*100)/100 AS total_tvl_billions FROM protocols WHERE tvl > 0 GROUP BY category ORDER BY total_tvl_billions DESC LIMIT 20
tvl — daily TVL history per protocol · 365 daily partitions · ~0.4 GB scanned| Column | Type | Description |
|---|---|---|
| date | date | Partition key (stored as datetime; use backtick literals) |
| protocol | string | Protocol display name |
| slug | string | DeFiLlama protocol slug (joins to protocols) |
| category | string | Protocol category |
| chain | string | Chain for this TVL record |
| chains | array | All chains where the protocol is deployed |
| tvl | float | Total value locked on this date (USD) |
-- Total DeFi TVL trend — last 30 days SELECT date, TRUNC(SUM(tvl)/1e9*100)/100 AS total_tvl_billions, COUNT(*) AS protocol_count FROM tvl WHERE date >= `2026-03-16T00:00:00Z` GROUP BY date ORDER BY date LIMIT 30 -- Aave V3 TVL history (last 90 days; each version has its own slug) SELECT date, TRUNC(SUM(tvl)/1e9*1000)/1000 AS tvl_billions FROM tvl WHERE slug = 'aave-v3' AND date >= `2026-01-15T00:00:00Z` GROUP BY date ORDER BY date LIMIT 90
PubMed — Biomedical Literature
pubmed
Full PubMed baseline: titles, abstracts, authors, journals, MeSH terms, and publication
metadata for biomedical literature. Data originates from NLM XML; nested fields preserve
the original XML element names. Filter by pub_year (integer) to scan only
the relevant year partition instead of the full 138 GB.
baseline (selected fields)| Column | Type | Description |
|---|---|---|
| MedlineCitation.PMID.text | string | PubMed identifier (PMID is an object; use .text for the plain ID) |
| MedlineCitation.Article .ArticleTitle | string | Article title |
| MedlineCitation.Article .Abstract.AbstractText | string / object | Abstract text (89% coverage, polymorphic): string for plain abstracts; {text, Label} per section for structured abstracts (e.g. Label: “BACKGROUND”) |
| MedlineCitation.Article .Journal.Title | string | Journal name |
| MedlineCitation.Article .AuthorList.Author | array | Authors (polymorphic elements): {LastName, ForeName, Affiliation} for individuals; {CollectiveName} for groups; 57% coverage |
| MedlineCitation .MeshHeadingList.MeshHeading | array | MeSH controlled vocabulary terms; 57% coverage |
| MedlineCitation .KeywordList.Keyword | array | Author-supplied keywords; 60% coverage |
| MedlineCitation .ChemicalList.Chemical | array | Chemical substances (NameOfSubstance, RegistryNumber); 28% coverage |
| MedlineCitation .DateCompleted.Year | string | Year NLM indexing was completed |
| PubmedData.ArticleIdList .ArticleId | array | DOI, PMC ID, and other identifiers |
-- Recent articles matching a topic (regex title search) SELECT MedlineCitation.PMID.text AS pmid, MedlineCitation.Article.ArticleTitle AS title, MedlineCitation.Article.Journal.Title AS journal, MedlineCitation.DateCompleted.Year AS year FROM baseline WHERE pub_year = 2023 AND MedlineCitation.Article.ArticleTitle ~ '(?i)CRISPR' LIMIT 20
-- All papers by a specific author (any position in author list) -- Author is an array: use comma in FROM to expand one row per author SELECT DISTINCT p.MedlineCitation.PMID.text AS pmid, p.MedlineCitation.Article.ArticleTitle AS title, p.MedlineCitation.Article.Journal.Title AS journal, p.pub_year FROM baseline AS p, p.MedlineCitation.Article.AuthorList.Author AS a WHERE p.pub_year = 2023 AND a.LastName = 'Smith' LIMIT 20
-- Papers co-authored by ALL of the listed authors (AND semantics) -- IN() matches any name; HAVING matched = N enforces all-must-match SELECT p.MedlineCitation.PMID.text AS pmid, COUNT(DISTINCT a.LastName) AS matched FROM baseline AS p, p.MedlineCitation.Article.AuthorList.Author AS a WHERE p.pub_year = 2023 AND a.LastName IN ('Smith', 'Jones', 'Brown') GROUP BY pmid HAVING matched = 3
ClinicalTrials.gov — Clinical Studies
clinicaltrials
All registered clinical studies from ClinicalTrials.gov, sourced via the v2 API.
One row per study. Pairs with PubMed via nct_id. Useful for trial
matching by condition or drug, drug pipeline analysis, sponsor activity, and
eligibility criteria queries.
studies (selected fields)| Column | Type | Description |
|---|---|---|
| nct_id | string | ClinicalTrials.gov identifier (e.g. NCT00123456) |
| brief_title | string | Short study title |
| official_title | string | Full protocol title |
| overall_status | string | RECRUITING, COMPLETED, ACTIVE_NOT_RECRUITING, etc. |
| start_date | string | Study start date (YYYY-MM-DD) |
| completion_date | string | Primary completion date (YYYY-MM-DD) |
| study_first_submitted_date | string | Date first submitted to registry |
| study_type | string | INTERVENTIONAL or OBSERVATIONAL |
| phases | array | Trial phase(s): PHASE1, PHASE2, PHASE3, PHASE4, NA |
| enrollment | int | Target or actual enrollment count |
| lead_sponsor | string | Name of the lead sponsor |
| sponsor_class | string | INDUSTRY, NIH, OTHER_GOV, OTHER |
| conditions | array | Disease/condition names |
| keywords | array | Free-text keywords |
| interventions | array | Objects with type (DRUG, DEVICE, …) and name |
| brief_summary | string | Plain-language study summary (truncated at 2000 chars) |
| sex | string | Eligibility: ALL, FEMALE, MALE |
| minimum_age | string | Minimum participant age (e.g. "18 Years") |
| maximum_age | string | Maximum participant age |
| has_results | bool | True if results have been posted |
-- Trials mentioning GLP-1 (regex search on title) SELECT nct_id, brief_title, overall_status FROM studies WHERE brief_title ~ '(?i)GLP-?1' LIMIT 20
FDA FAERS — Adverse Event Reports
fdaFDA Adverse Event Reporting System (FAERS) — voluntary reports of suspected drug side effects submitted by patients, healthcare providers, and manufacturers. One row per (case, drug): reactions and outcomes are stored as arrays so a single WHERE clause covers all events in a case. Drug names are verbatim reporter text — use regex to match across brand names, generics, and dosage variants. Pairs with PubMed and ClinicalTrials.gov for pharmacovigilance workflows.
Note: Pre-2012 LAERS-era data is not
included. Reports are deduplicated to the highest caseversion per
case. Voluntary reporting introduces bias — counts reflect reporting volume, not
true incidence rates.
faers (selected fields)| Column | Type | Description |
|---|---|---|
| primaryid | int | Unique case-version identifier |
| caseid | int | Case identifier (stable across versions) |
| event_dt | string | Date of adverse event (YYYY-MM-DD; may be partial) |
| year | int | Year of adverse event; partition key |
| age_years | float | Patient age in years (normalised from age + age_cod) |
| sex | string | M / F / UNK |
| occr_country | string | Country where event occurred (ISO 2-letter) |
| drug_seq | int | Drug sequence number within the case |
| drugname | string | Verbatim drug name as reported |
| drug_role | string | PS = primary suspect, SS = secondary, C = concomitant, I = interacting |
| route | string | Route of administration (Oral, Intravenous, etc.) |
| reactions | array | MedDRA preferred terms for all reactions in this case |
| outcomes | array | DE=death, HO=hospitalisation, LT=life-threatening, DS=disability, CA=congenital anomaly, RI=required intervention, OT=other |
| indications | array | MedDRA preferred terms for the drug's indication(s) |
-- Most common reactions for a drug (regex matches brand + generic names) SELECT reaction, COUNT(*) AS n FROM faers, UNNEST(reactions) AS reaction WHERE drugname ~ '(?i)ozempic|semaglutide' AND drug_role = 'PS' GROUP BY reaction ORDER BY n DESC LIMIT 20
-- Fatal cases by drug class (regex on drug name, filter outcome = death) SELECT drugname, COUNT(*) AS fatal_reports FROM faers WHERE drugname ~ '(?i)fentanyl|oxycodone|hydrocodone' AND ARRAY_CONTAINS(outcomes, 'DE') GROUP BY drugname ORDER BY fatal_reports DESC LIMIT 15
FEC Campaign Finance
fec
Individual campaign contributions reported to the FEC, covering every US federal election
cycle from 1980 to 2026. One row per contribution, enriched with committee name, type,
and party affiliation (from the FEC committee master file) and candidate name where
applicable. Filter by cycle (4-digit even year) to scan only the cycles
relevant to your query. Source: FEC Bulk Data — US Government public domain.
contributions| Column | Type | Description |
|---|---|---|
| cycle | int | Election cycle year (1980, 1982, …, 2026) — partition key |
| sub_id | string | Unique FEC record identifier |
| cmte_id | string | FEC committee ID receiving the contribution |
| cmte_name | string | Committee name (enriched from committee master) |
| cmte_type | string | Committee type code (H=House, S=Senate, P=Presidential, Q/N/W=PAC, …) |
| cmte_party | string | Party affiliation (DEM, REP, …) |
| cand_id | string | FEC candidate ID (when contribution targets a specific candidate) |
| cand_name | string | Candidate name (enriched from candidate master) |
| entity_tp | string | Contributor entity type (IND=individual, PAC, ORG, …) |
| name | string | Contributor name |
| city / state / zip_code | string | Contributor location |
| employer | string | Contributor employer |
| occupation | string | Contributor occupation |
| transaction_dt | string | Contribution date (YYYY-MM-DD) |
| transaction_amt | int | Amount in dollars |
| transaction_tp | string | Transaction type code |
| other_id | string | Other committee ID (set for transfers between committees) |
| tran_id | string | Transaction identifier within the filing |
| memo_text | string | Optional memo / description |
-- Top employers of donors to a committee (2024 cycle) SELECT employer, COUNT(*) AS n, SUM(transaction_amt) AS total_usd FROM contributions WHERE cycle = 2024 AND cmte_name ~ '(?i)actblue|winred' AND employer != '' GROUP BY employer ORDER BY total_usd DESC LIMIT 20
-- Political exposure check: donations by employees of a company (2024 cycle) SELECT name, cycle, cmte_name, cmte_party, transaction_amt, transaction_dt FROM contributions WHERE employer ~ '(?i)goldman sachs' AND cycle = 2024 AND transaction_amt >= 2500 ORDER BY transaction_amt DESC LIMIT 25
-- Top recipients of Google employee donations, 2024 cycle SELECT cmte_name, cmte_party, SUM(transaction_amt) AS total_usd, COUNT(*) AS n_contributions FROM contributions WHERE cycle = 2024 AND employer ~ '(?i)google' GROUP BY cmte_name, cmte_party ORDER BY total_usd DESC LIMIT 10
FDA Orange Book — Approved Drug Products
fdaFDA-approved drug products from the Orange Book via the openFDA bulk export. One row per product strength/form. Includes therapeutic equivalence (TE) codes that determine generic substitutability. Essential for pharma agents querying approval status, generics entry, and formulary decisions.
orangebook| Column | Type | Description |
|---|---|---|
| application_number | string | NDA or ANDA number (e.g. NDA009700) |
| appl_type | string | N = NDA (brand), A = ANDA (generic) |
| sponsor_name | string | Applicant / sponsor company name |
| product_number | string | Product sequence within the application |
| brand_name | string | Trade / brand name |
| active_ingredients | array | [{name, strength}] — active ingredient names and strengths |
| dosage_form | string | e.g. TABLET, CAPSULE |
| route | string | e.g. ORAL, INTRAVENOUS |
| marketing_status | string | e.g. Prescription, OTC |
| te_code | string | Therapeutic equivalence code (AB = substitutable; may be null) |
-- All approved generics for a drug (AB-rated = substitutable) SELECT application_number, sponsor_name, brand_name, te_code FROM orangebook WHERE ARRAY_CONTAINS(active_ingredients[0].name, 'METFORMIN') AND te_code = 'AB' ORDER BY sponsor_name LIMIT 20
-- All products by a given sponsor SELECT application_number, brand_name, dosage_form, route, te_code FROM orangebook WHERE sponsor_name ~ '(?i)pfizer' AND appl_type = 'N' ORDER BY brand_name LIMIT 25
FRED — Federal Reserve Economic Data
fredCurated macroeconomic time series from the St. Louis Fed FRED database: GDP, inflation, employment, interest rates, money supply, housing, fiscal, markets, and trade. One row per (series, date) observation. Series with redistribution restrictions are excluded.
This product uses the FRED® API but is not endorsed or certified by the Federal Reserve Bank of St. Louis.
series| Column | Type | Description |
|---|---|---|
| series_id | string | FRED series identifier (e.g. UNRATE, GDP) |
| category | string | Thematic group: gdp, inflation, employment, rates, money, housing, production, trade, fiscal, markets, credit, income |
| title | string | Full series name (e.g. Unemployment Rate) |
| units | string | Unit of measure (e.g. %, Bil. of $) |
| frequency | string | Observation frequency (D=daily, M=monthly, Q=quarterly, A=annual) |
| seasonal_adjustment | string | SA = seasonally adjusted; NSA = not seasonally adjusted |
| obs_date | timestamp | Observation date |
| obs_year | int | Observation year (partition key; use for fast year-range queries) |
| value | float | Observation value in the series units (reserved word — quote as "value") |
-- US unemployment rate since 2020 SELECT obs_date, "value" FROM series WHERE series_id = 'UNRATE' AND obs_year >= 2020 ORDER BY obs_date LIMIT 100
-- Compare CPI and core PCE year-over-year SELECT series_id, title, obs_date, "value" FROM series WHERE series_id IN ('CPIAUCSL', 'PCEPILFE') AND obs_year >= 2021 ORDER BY obs_date, series_id LIMIT 200
-- All available series IDs with titles SELECT DISTINCT series_id, category, title, units, frequency FROM series ORDER BY category, series_id LIMIT 200
End of Life — Runtime & Framework Lifecycle Dates
eolEnd-of-life and support status for 450+ software products — languages, runtimes, frameworks, databases, operating systems, and cloud services. One row per release cycle. Source: endoflife.date. Ideal for agents checking dependency support status or auditing infrastructure.
cycles| Column | Type | Description |
|---|---|---|
| product | string | Product slug (e.g. python, nodejs, ubuntu) |
| cycle | string | Release cycle label (e.g. 3.11, 20.04) |
| release_date | string | General availability date (YYYY-MM-DD) |
| eol | string | End-of-life date, or false (still supported), or true (EOL, no exact date) |
| latest | string | Latest patch release in this cycle |
| latest_date | string | Release date of the latest version |
| lts | string | LTS status: true, false, or LTS end date |
| support | string | Active support end date (before EOL), or true/false |
| link | string | URL for the release notes / changelog |
-- Is Python 3.8 still supported? SELECT product, cycle, "eol", lts, "latest" FROM cycles WHERE product = 'python' ORDER BY cycle DESC LIMIT 10
-- All products reaching EOL in 2025 SELECT product, cycle, "eol", "latest" FROM cycles WHERE "eol" LIKE '2025%' ORDER BY "eol" LIMIT 50
arXiv — Preprint Metadata
arxivMetadata for all arXiv preprints — titles, abstracts, authors, categories, and submission dates. Covers CS, math, physics, quantitative finance, statistics, biology, and more. Papers appear on arXiv months before journal publication, making this the primary source for the latest research. Partitioned by submission month.
papers| Column | Type | Description |
|---|---|---|
| id | string | arXiv ID (e.g. 2301.00001) |
| title | string | Paper title (whitespace normalised) |
| abstract | string | Full abstract text |
| categories | array | arXiv category codes (e.g. ["cs.LG", "stat.ML"]) |
| primary_cat | string | Primary category (first in list) |
| authors | array | [{name, affiliation}] |
| submitted | string | Original submission date (YYYY-MM-DD); partition key |
| updated | string | Last-updated date |
| doi | string | DOI if published; null otherwise |
| journal_ref | string | Journal reference if published; null otherwise |
| comments | string | Author comments (page count, code links, etc.) |
| license | string | License URL (e.g. CC BY 4.0) |
-- Recent ML papers mentioning transformers SELECT id, submitted, title, primary_cat FROM papers WHERE primary_cat = 'cs.LG' AND submitted >= '2024-01-01' AND title ~ '(?i)transformer' ORDER BY submitted DESC LIMIT 25
-- Papers by an author across all categories SELECT id, submitted, title, categories FROM papers, UNNEST(authors) AS a WHERE a.name ~ '(?i)lecun' ORDER BY submitted DESC LIMIT 20
-- Category submission volume by year (quantitative finance) SELECT DATE_TRUNC('year', CAST(submitted AS TIMESTAMP)) AS yr, COUNT(*) AS papers FROM papers WHERE primary_cat = 'q-fin.TR' GROUP BY yr ORDER BY yr LIMIT 50
GWAS Catalog — Genome-Wide Association Studies
gwas
NHGRI-EBI GWAS Catalog — curated genome-wide association study results.
One row per variant–trait association. Joins ClinVar on rsid / dbsnp_id
and PubMed on pubmed_id. Source: EBI FTP.
Partition key: date_added.
associations| Column | Type | Description |
|---|---|---|
| rsid | string | dbSNP rsID (e.g. rs1801131) |
| snp_risk_allele | string | Risk allele in rsID-allele format (e.g. rs1801131-C) |
| risk_allele_freq | float | Risk allele frequency in discovery sample |
| p_value | float | Association p-value |
| pvalue_mlog | number | −log&sub1;&sub0;(p-value) |
| or_beta | float | Odds ratio (binary traits) or beta (quantitative traits) |
| ci_95 | string | 95% confidence interval |
| disease_trait | string | Disease or trait name as reported by the study |
| mapped_trait | string | EFO-mapped trait label |
| mapped_gene | string | Nearest gene(s) mapped by ENSEMBL |
| reported_genes | array | Genes reported by the study authors |
| chromosome | string | Chromosome |
| chr_pos | int | Chromosomal position (GRCh38) |
| context | string | Variant functional context (e.g. intron_variant, missense_variant) |
| pubmed_id | int | PubMed ID of the source study |
| pub_date | string | Publication date |
| first_author | string | First author surname |
| study_accession | string | GWAS Catalog study accession (e.g. GCST000001) |
| date_added | string | Date added to the GWAS Catalog (partition key) |
| initial_sample | string | Discovery sample description (ancestry, size) |
-- Top associations for Type 2 Diabetes SELECT rsid, snp_risk_allele, mapped_gene, p_value, or_beta, risk_allele_freq FROM associations WHERE disease_trait ~ '(?i)type 2 diabetes' ORDER BY pvalue_mlog DESC LIMIT 25
-- Join GWAS hits with ClinVar pathogenic variants SELECT g.rsid, g.disease_trait, g.mapped_gene, g.p_value, c.significance, c.gene_symbol FROM gwas.associations AS g JOIN clinvar.variants AS c ON g.rsid = c.dbsnp_id AND c.assembly = 'GRCh38' AND c.sig_simple = 1 WHERE g.pvalue_mlog > 50 ORDER BY g.pvalue_mlog DESC LIMIT 20
-- Most-studied traits by association count SELECT disease_trait, COUNT(*) AS n_associations FROM associations GROUP BY disease_trait ORDER BY n_associations DESC LIMIT 20
ClinVar — Genetic Variant Classifications
clinvarNCBI ClinVar variant classifications — pathogenicity, associated phenotypes, chromosomal coordinates, and review status for millions of human genetic variants. One row per variant × assembly (GRCh37 and GRCh38 each have their own row). Source: NCBI FTP, updated weekly.
variants| Column | Type | Description |
|---|---|---|
| allele_id | int | Stable ClinVar AlleleID |
| variation_id | int | ClinVar VariationID |
| type | string | Variant type (SNV, Indel, Deletion, Duplication, …) |
| name | string | HGVS-like variant name |
| gene_symbol | string | Gene symbol (e.g. BRCA1) |
| gene_id | int | NCBI GeneID; null if intergenic |
| significance | string | Clinical significance (Pathogenic, Likely pathogenic, Benign, …) |
| sig_simple | int | -1 conflicting, 0 uncertain/other, 1 pathogenic, 2 benign, 3 risk-factor |
| last_evaluated | string | Date last evaluated by submitter |
| dbsnp_id | string | dbSNP rsID (e.g. rs80357713); null if none |
| phenotypes | array | Associated disease/phenotype names |
| phenotype_ids | array | MedGen/OMIM/Orphanet IDs for phenotypes |
| rcv_accessions | array | RCV accession numbers |
| origin | string | Allele origin (germline, somatic, …) |
| assembly | string | Reference assembly (GRCh38, GRCh37, NCBI36) |
| chromosome | string | Chromosome |
| start | int | Start position (1-based) |
| stop | int | Stop position (1-based) |
| ref_allele | string | Reference allele (or null) |
| alt_allele | string | Alternate allele (or null) |
| review_status | string | Review status (e.g. "criteria provided, multiple submitters") |
| n_submitters | int | Number of submitters |
| pos_vcf | int | VCF position |
| ref_vcf | string | VCF reference allele |
| alt_vcf | string | VCF alternate allele |
| other_ids | string | Other identifiers (comma-sep "type:value" pairs) |
| oncogenicity | string | Oncogenicity classification; null if not assessed |
-- Pathogenic variants in BRCA1 (GRCh38) SELECT allele_id, name, significance, chromosome, start, dbsnp_id FROM variants WHERE gene_symbol = 'BRCA1' AND assembly = 'GRCh38' AND sig_simple = 1 ORDER BY start LIMIT 50
-- Variants associated with a phenotype (by name) SELECT gene_symbol, significance, name, review_status, n_submitters FROM variants, UNNEST(phenotypes) AS ph WHERE assembly = 'GRCh38' AND ph ~ '(?i)breast cancer' AND sig_simple = 1 ORDER BY n_submitters DESC LIMIT 25
-- Pathogenic variant count per gene (well-reviewed only) SELECT gene_symbol, COUNT(*) AS pathogenic_variants FROM variants WHERE assembly = 'GRCh38' AND sig_simple = 1 AND n_submitters >= 2 GROUP BY gene_symbol ORDER BY pathogenic_variants DESC LIMIT 20
ALFA — Allele Frequency Aggregator
alfa
NCBI Allele Frequency Aggregator — allele frequencies from 1M+
dbGaP subjects across 12 population groups. GRCh38 coordinates.
Partitioned by chromosome. The key advantage over gnomAD is finer
African ancestry resolution: afr_am (African American),
afr (continental African), and afr_oth (other
African) are reported separately rather than collapsed into a single
afr bucket. This split is clinically significant for
variant interpretation in underrepresented populations.
Source: NCBI FTP,
open access.
Note: chromosome uses no chr prefix
(“1” not “chr1”) — same as ClinVar,
opposite of gnomAD.
variants · partitioned by chrom| Column | Type | Description |
|---|---|---|
| chrom | string | Chromosome (1–22, X, Y, MT — no chr prefix) |
| pos | int | GRCh38 position (1-based) |
| rsid | string | dbSNP RS identifier (e.g. rs12345); null if none |
| ref | string | Reference allele |
| alt | string | Alternate allele (first for multi-allelic sites) |
| af_tot | float | Total allele frequency (all subjects) |
| ac_tot / an_tot | int | Total alternate allele count / total allele count |
| af_eur | float | European — closest to gnomAD af_nfe |
| af_afr_am | float | African American (US) — not available in gnomAD separately |
| af_afr | float | Continental African (non-US) — not available in gnomAD separately |
| af_afr_oth | float | Other African ancestry — not available in gnomAD separately |
| af_eas | float | East Asian — comparable to gnomAD af_eas |
| af_oth_asn | float | Other Asian — broader than gnomAD coverage |
| af_asn | float | Asian (broad grouping, includes EAS + other) |
| af_sas | float | South Asian |
| af_lat1 / af_lat2 | float | Two Latino sub-populations — gnomAD reports a single af_amr |
| af_oth | float | Other / unassigned |
| ac_* / an_* | int | Per-population allele count / allele number for each population above |
-- Population frequency comparison for a known variant SELECT rsid, ref, alt, af_tot, af_eur, af_afr_am, af_afr, af_eas, af_sas FROM variants WHERE chrom = '17' AND rsid = 'rs80357713'
-- Variants with large African American vs European frequency gap -- (population-enriched variants, clinically under-studied) SELECT rsid, chrom, pos, ref, alt, af_afr_am, af_eur, af_afr_am - af_eur AS delta FROM variants WHERE chrom = '13' AND an_afr_am >= 100 AND an_eur >= 100 AND af_afr_am - af_eur > 0.3 ORDER BY delta DESC LIMIT 20
DGIdb — Drug–Gene Interaction Database
dgidbDrug-Gene Interaction Database (DGIdb) — ~100K curated drug–gene interactions aggregated from 40+ sources including ChEMBL, CIViC, and FDA drug labels. MIT licensed; commercial use permitted. Useful for target druggability checks: given a gene of interest, find what drugs already modulate it (approved or investigational), enabling competitive landscape and prior art queries without a manual ChEMBL or DrugBank search. Source: dgidb.org.
Usage notes:
interaction_type is NULL in most records from the DTC source —
filter on approved, anti_neoplastic, and
interaction_score instead.
drug_concept_id uses ChEMBL/NCI-T namespaces and will not
directly join FAERS or NADAC; use drug_name (plain text) as the
practical join key.
interactions (98,239 rows)| Column | Type | Description |
|---|---|---|
| gene_name | string | HGNC gene symbol (e.g. EGFR) |
| gene_concept_id | string | Normalized gene concept ID (HGNC namespace) |
| drug_name | string | Drug name (normalized); practical join key against FAERS / NADAC |
| drug_concept_id | string | Normalized drug concept ID (ChEMBL or NCI-T namespace) |
| approved | bool | FDA-approved drug |
| immunotherapy | bool | Immunotherapy agent |
| anti_neoplastic | bool | Anti-neoplastic (cancer) drug |
| interaction_type | string | Interaction type (inhibitor, activator, …); NULL in most DTC-source rows |
| interaction_score | float | DGIdb composite score; null if not computed |
| source_db_name | string | Source database (e.g. ChEMBL, CIViC, DTC) |
| source_db_version | string | Source database version |
genes (80,234 rows)| Column | Type | Description |
|---|---|---|
| gene_name | string | HGNC gene symbol |
| concept_id | string | Normalized concept ID |
| nomenclature | string | Gene nomenclature source |
| source_db_name | string | Source database |
| source_db_version | string | Source database version |
drugs (81,572 rows)| Column | Type | Description |
|---|---|---|
| drug_name | string | Drug name (normalized) |
| concept_id | string | Normalized concept ID (ChEMBL or NCI-T) |
| nomenclature | string | Drug nomenclature source |
| approved | bool | FDA-approved |
| immunotherapy | bool | Immunotherapy agent |
| anti_neoplastic | bool | Anti-neoplastic (cancer) drug |
| source_db_name | string | Source database |
| source_db_version | string | Source database version |
-- Approved drugs targeting a gene (prior art / competitor check) SELECT drug_name, interaction_type, interaction_score, source_db_name FROM interactions WHERE gene_name = 'EGFR' AND approved = TRUE ORDER BY interaction_score DESC LIMIT 20
-- Anti-cancer drugs for a gene, with adverse-event cross-check via FAERS -- Step 1: get approved anti-neoplastic drugs targeting the gene SELECT drug_name FROM dgidb.interactions WHERE gene_name = 'BRAF' AND approved = TRUE AND anti_neoplastic = TRUE -- Step 2: feed drug_name list into FAERS for adverse event profile SELECT reaction_meddra_pt, COUNT(*) AS reports FROM fda.faers WHERE drug_name IN ('VEMURAFENIB', 'DABRAFENIB', 'ENCORAFENIB') GROUP BY reaction_meddra_pt ORDER BY reports DESC LIMIT 15
Sanctions — U.S. Consolidated Screening List
sanctionsThe U.S. government Consolidated Screening List (CSL) — a unified export of OFAC SDN (Treasury), BIS Entity List (Commerce), State Dept Nonproliferation, ITAR Debarred, and several other sanctions and export-control lists. One row per sanctioned individual, entity, vessel, or aircraft. Source: trade.gov, updated daily.
entities| Column | Type | Description |
|---|---|---|
| id | string | Stable hash ID from trade.gov |
| source | string | Source list (e.g. "Specially Designated Nationals (SDN) - Treasury Department") |
| name | string | Primary name |
| type | string | Individual, Entity, Vessel, or Aircraft (may be empty) |
| programs | array | Sanctions program codes (e.g. ["UKRAINE-EO13685", "RUSSIA-EO14024"]) |
| alt_names | array | Aliases and alternate name spellings |
| addresses | array | Known addresses |
| ids | array | Identity documents: passport, tax ID, registration (format: "type, value, country") |
| nationalities | array | Nationality countries |
| dobs | array | Dates of birth (YYYY or YYYY-MM-DD) |
| pobs | array | Places of birth |
| start_date | string | Listing date (YYYY-MM-DD) |
| license_requirement | string | BIS license requirement (BIS entries only) |
| remarks | string | Additional remarks |
| source_list_url | string | URL to the source list page |
-- Look up an entity by name (partial match) SELECT name, type, source, programs, addresses FROM entities WHERE name ~ '(?i)vtb bank' LIMIT 10
-- All individuals sanctioned under Russia programs SELECT name, type, start_date, alt_names FROM entities, UNNEST(programs) AS prog WHERE prog ~ '(?i)russia' AND type = 'Individual' ORDER BY start_date DESC LIMIT 25
-- Count entities per source list SELECT source, COUNT(*) AS total FROM entities GROUP BY source ORDER BY total DESC
MalwareBazaar — Malware Sample Repository
malwarebazaar
abuse.ch MalwareBazaar — crowdsourced malware sample metadata. One row per
submitted sample. Includes file hashes, family signatures, VirusTotal detection
rates, and fuzzy hashes for similarity matching. License: CC0.
Partition key: first_seen_utc.
samples| Column | Type | Description |
|---|---|---|
| sha256_hash | string | SHA-256 hash (primary identifier) |
| md5_hash | string | MD5 hash |
| sha1_hash | string | SHA-1 hash |
| file_name | string | Original submitted file name |
| file_type_guess | string | Guessed file type (e.g. Win32 EXE, PDF Document, ZIP) |
| mime_type | string | MIME type |
| first_seen_utc | timestamp | Date first submitted to MalwareBazaar (partition key) |
| reporter | string | Reporting researcher or automated feed |
| signature | string | Malware family name / AV detection name |
| vtpercent | number | VirusTotal detection rate 0–100 (null if not scanned) |
| clamav | string | ClamAV signature name |
| imphash | string | PE import hash (Windows executables) |
| ssdeep | string | ssdeep fuzzy hash for similarity clustering |
| tlsh | string | TLSH fuzzy hash |
-- All samples for a malware family SELECT sha256_hash, file_name, file_type_guess, first_seen_utc, vtpercent FROM samples WHERE signature ~ '(?i)emotet' ORDER BY first_seen_utc DESC LIMIT 25
-- Most prevalent malware families (last 6 months) SELECT signature, COUNT(*) AS samples, AVG(vtpercent) AS avg_vt FROM samples WHERE first_seen_utc >= '2025-03-01' AND signature != '' GROUP BY signature ORDER BY samples DESC LIMIT 20
-- Lookup by hash (any of the three types) SELECT sha256_hash, file_name, signature, first_seen_utc, vtpercent FROM samples WHERE md5_hash = 'd41d8cd98f00b204e9800998ecf8427e' LIMIT 5
Open Food Facts — Global Food Product Database
openfoodfacts
Open Food Facts 2019 Kaggle snapshot (DbCL 1.0) — crowdsourced nutritional data
for food products sold globally. One row per product. Includes ingredients, allergens,
Nutri-Score grades, and 12 key nutrients per 100g. Partition key: created.
products| Column | Type | Description |
|---|---|---|
| code | string | Barcode (EAN/UPC) |
| product_name | string | Product name |
| brands | string | Brand names (comma-separated) |
| categories | string | Product categories (comma-separated) |
| main_category | string | Primary category |
| countries | string | Countries where sold (comma-separated) |
| ingredients | string | Ingredients text |
| allergens | string | Declared allergens |
| traces | string | May-contain-traces declarations |
| additives_n | int | Count of food additives |
| nutrition_grade | string | Nutri-Score grade (a–e; a = healthiest) |
| energy_100g | number | Energy per 100g (kcal) |
| fat_100g | number | Total fat per 100g (g) |
| sat_fat_100g | number | Saturated fat per 100g (g) |
| carbs_100g | number | Carbohydrates per 100g (g) |
| sugars_100g | number | Sugars per 100g (g) |
| fiber_100g | number | Dietary fiber per 100g (g) |
| proteins_100g | number | Protein per 100g (g) |
| salt_100g | number | Salt per 100g (g) |
| created | timestamp | Date product was added (partition key) |
-- Lowest-sugar breakfast cereals with Nutri-Score A or B SELECT product_name, brands, sugars_100g, nutrition_grade, countries FROM products WHERE categories ~ '(?i)cereal' AND nutrition_grade IN ('a', 'b') AND sugars_100g IS NOT MISSING ORDER BY sugars_100g LIMIT 20
-- Average macros and Nutri-Score by food category SELECT main_category, COUNT(*) AS products, AVG(energy_100g) AS avg_kcal, AVG(sugars_100g) AS avg_sugars, AVG(proteins_100g) AS avg_protein FROM products WHERE main_category != '' GROUP BY main_category ORDER BY products DESC LIMIT 20
Hacker News — Y Combinator Tech Community
hackernews
Every story, Ask HN, Show HN, and job post from Hacker News since 2007. One row per item
with score, author, linked URL, and descendant comment count. Partitioned by year.
Useful for tech trend analysis, startup signal detection, and tracking how developer
attention has shifted over time. Source: BigQuery
bigquery-public-data.hacker_news (CC0).
items| Column | Type | Description |
|---|---|---|
| id | string | Item ID |
| by | string | Author username |
| title | string | Story or post title |
| url | string | Linked URL (null for Ask/Show HN text posts) |
| score | int | Upvote score |
| time | int | Unix timestamp of submission |
| descendants | int | Total comment count |
| year | int | Submission year (partition key) |
-- Top stories of 2024 by score SELECT title, url, score, descendants FROM items WHERE year = 2024 ORDER BY score DESC LIMIT 20
-- Show HN posts per year — measure of builder activity over time SELECT year, COUNT(*) AS show_hn_posts, AVG(score) AS avg_score FROM items WHERE title ~ '(?i)^show hn' GROUP BY year ORDER BY year DESC
NADAC — National Average Drug Acquisition Cost
nadac
CMS weekly drug acquisition cost survey — the price pharmacies actually pay for drugs.
One row per NDC code per weekly report. Covers generic and brand-name drugs, retail and
non-retail pharmacies. Pairs with FDA FAERS and Orange Book. Source: data.medicaid.gov,
public domain. Partition key: effective_date.
prices| Column | Type | Description |
|---|---|---|
| ndc | string | National Drug Code (11-digit) |
| ndc_description | string | Drug name, strength, and dosage form |
| nadac_per_unit | string | Acquisition cost per unit (tablet, mL, gram, etc.) |
| pricing_unit | string | Unit type: TAB, ML, GM, EA |
| classification | string | Generic or Brand Name |
| otc | bool | Over-the-counter (true) or prescription (false) |
| pharmacy_type | string | Retail or Non-Retail |
| as_of_date | string | Date of the weekly survey report |
| effective_date | string | Price effective date (partition key) |
| explanation_code | string | Reason for price change or data source code |
| corresponding_generic_nadac | string | For brand drugs: comparable generic NADAC price |
-- Current acquisition cost for metformin generics SELECT ndc_description, nadac_per_unit, pricing_unit, as_of_date FROM prices WHERE ndc_description ~ '(?i)metformin' AND classification = 'Generic' AND as_of_date >= '2025-01-01' ORDER BY as_of_date DESC, ndc_description LIMIT 20
-- Brand vs generic price gap for a drug class SELECT classification, AVG(CAST(nadac_per_unit AS float)) AS avg_cost, COUNT(DISTINCT ndc) AS ndcs FROM prices WHERE ndc_description ~ '(?i)atorvastatin' AND as_of_date >= '2025-01-01' GROUP BY classification
World Bank — Global Commodity Prices
worldbank
World Bank Pink Sheet — monthly prices for ~70 global commodities: energy,
metals, agriculture, and fertilizers. One row per month, one column per commodity.
All prices are nominal USD. Source: World Bank, CC BY 4.0.
Partition key: year.
commodity_prices (selected columns)| Column | Type | Description |
|---|---|---|
| period | string | Month label in YYYY-Mmm format (e.g. 2024-Jan) |
| year | int | Year (partition key) |
| month | int | Month number (1–12) |
| crude_oil_brent | number | Brent crude oil ($/bbl) |
| crude_oil_wti | number | WTI crude oil ($/bbl) |
| natural_gas_us | number | US natural gas ($/mmbtu) |
| natural_gas_europe | number | European natural gas ($/mmbtu) |
| coal_australian | number | Australian thermal coal ($/mt) |
| gold | number | Gold ($/troy oz) |
| copper | number | Copper ($/mt) |
| aluminum | number | Aluminum ($/mt) |
| maize | number | Maize / corn ($/mt) |
| wheat_us_hrw | number | US hard red winter wheat ($/mt) |
| soybeans | number | Soybeans ($/mt) |
| sugar_world | number | World sugar (¢/kg) |
| coffee_arabica | number | Arabica coffee (¢/kg) |
| cocoa | number | Cocoa ($/mt) |
Plus ~50 additional commodity columns. Use GET /v1/databases for the full list.
-- Brent crude and natural gas price history since 2020 SELECT period, crude_oil_brent, crude_oil_wti, natural_gas_us FROM commodity_prices WHERE year >= 2020 ORDER BY year, month
-- Gold vs copper ratio over time (risk-off indicator) SELECT period, gold, copper, gold / NULLIF(copper, 0) AS gold_copper_ratio FROM commodity_prices WHERE year >= 2010 ORDER BY year, month
FHFA — House Price Index
fhfa
Federal Housing Finance Agency (FHFA) House Price Index — quarterly and monthly
home price indices for the U.S., all 50 states, and ~400 metropolitan statistical areas
(MSAs). Covers purchase-only and all-transactions variants, seasonally adjusted and not.
Source: FHFA, public domain. Partition key: yr.
hpi| Column | Type | Description |
|---|---|---|
| yr | int | Year (partition key) |
| period | int | Quarter (1–4) or month (1–12) depending on frequency |
| frequency | string | quarterly or monthly |
| hpi_type | string | purchase-only or all-transactions |
| hpi_flavor | string | traditional or expanded-data |
| level | string | Geographic level: USA, state, MSA |
| place_id | string | FIPS code or state abbreviation |
| place_name | string | State or MSA name (e.g. San Francisco-Oakland-Fremont, CA) |
| index_nsa | number | Not-seasonally-adjusted HPI (base = 100 at 1991 Q1) |
| index_sa | number | Seasonally-adjusted HPI (null for MSA-level series) |
-- National quarterly HPI since 2000 (purchase-only, seasonally adjusted) SELECT yr, period, index_sa, index_nsa FROM hpi WHERE level = 'USA' AND hpi_type = 'purchase-only' AND frequency = 'quarterly' AND yr >= 2000 ORDER BY yr, period
-- State-level HPI comparison (most recent quarter) SELECT place_name, index_nsa, index_sa FROM hpi WHERE level = 'state' AND hpi_type = 'purchase-only' AND frequency = 'quarterly' AND yr = 2025 AND period = 4 ORDER BY index_nsa DESC
ClinPGx — Pharmacogenomics Knowledge Base
clinpgxCurated pharmacogenomics data from ClinPGx (formerly PharmGKB) — gene–drug interactions, clinical annotations, dosing guidelines (CPIC, DPWG), and FDA/EMA drug label PGx requirements. Four tables: clinical variants, guidelines, drug labels, and relationships.
clinical_variants (~5,200 rows)| Column | Type | Description |
|---|---|---|
| variant | string | Variant name or star allele (e.g. CYP2C9*3, rs1801131) |
| gene | string | Gene symbol (e.g. CYP2C9) |
| type | string | Annotation type (Metabolism/PK, Toxicity, Efficacy, Dosage) |
| level_of_evidence | string | Evidence level (1A, 1B, 2A, 2B, 3, 4) |
| chemicals | array | Associated drug names |
| phenotypes | array | Associated phenotype names |
-- Highest-evidence (1A) variant–drug pairs SELECT variant, gene, type, chemicals, phenotypes FROM clinical_variants WHERE level_of_evidence = '1A' ORDER BY gene
guidelines (~217 rows)| Column | Type | Description |
|---|---|---|
| id | string | PharmGKB accession ID |
| name | string | Guideline annotation name |
| source | string | Issuing body (CPIC, DPWG, FDA, HCSC, EMA) |
| recommendation | bool | Includes a dosing recommendation |
| dosing_information | bool | Includes dosing information |
| has_testing_info | bool | Includes genetic testing guidance |
| alternate_drug_available | bool | Alternative drug available |
| chemicals | array | Involved drug names |
| genes | array | Involved gene symbols |
| alleles | array | Specific alleles referenced |
| pmids | array | Supporting PubMed IDs |
-- All CPIC guidelines with dosing recommendations SELECT name, chemicals, genes, alleles FROM guidelines WHERE source = 'CPIC' AND recommendation = true ORDER BY name
drug_labels (~1,400 rows)| Column | Type | Description |
|---|---|---|
| id | string | PharmGKB ID |
| name | string | Annotation name |
| source | string | Regulatory body (FDA, EMA, HCSC) |
| testing_level | string | Testing Required, Actionable PGx, Informative PGx |
| has_prescribing_info | bool | Label includes prescribing guidance |
| has_dosing_info | bool | Label includes dosing guidance |
| has_alternate_drug | bool | Alternate drug recommended |
| chemicals | array | Drug names |
| genes | array | Gene symbols |
| variants | array | Relevant variants or haplotypes |
| latest_history_date | string | Most recent annotation update (YYYY-MM-DD) |
-- FDA labels requiring genetic testing SELECT name, chemicals, genes, testing_level FROM drug_labels WHERE source = 'FDA' AND testing_level = 'Testing Required' ORDER BY latest_history_date DESC LIMIT 20
relationships (~127,700 rows)| Column | Type | Description |
|---|---|---|
| entity1_id | string | PharmGKB ID of first entity |
| entity1_name | string | Name of first entity |
| entity1_type | string | Gene, Chemical, Disease, Variant, ... |
| entity2_id | string | PharmGKB ID of second entity |
| entity2_name | string | Name of second entity |
| entity2_type | string | Type of second entity |
| evidence | array | Evidence sources (ClinicalAnnotation, VariantAnnotation, ...) |
| association | string | associated, not associated, ambiguous |
| pk | bool | Pharmacokinetic relationship |
| pd | bool | Pharmacodynamic relationship |
| pmids | array | Supporting PubMed IDs |
-- All confirmed gene–drug associations for CYP2D6 SELECT entity2_name AS drug, association, evidence, pmids FROM relationships WHERE entity1_name = 'CYP2D6' AND entity1_type = 'Gene' AND entity2_type = 'Chemical' AND association = 'associated' ORDER BY entity2_name
Reference tables: genes (~25,000 rows) — NCBI Gene ID, Ensembl ID, GRCh37/GRCh38 coordinates, VIP and CPIC flags. drugs (~3,700 rows) — SMILES, InChI, RxNorm/ATC identifiers, annotation counts. variants (~7,600 rows) — chromosomal location, synonyms, annotation counts. phenotypes (~1,600 rows) — alternate names, MeSH/OMIM cross-references.
Illumina — SNP Array Manifests
illuminaProbe-level manifests for two Illumina SNP genotyping arrays. One row per assay locus — variant coordinates, probe sequence, alleles, strand orientation, and source annotation. Useful for pharmacogenomics variant lookup, array coverage comparison, and probe-sequence retrieval.
| Table | Array | Loci | Decompressed |
|---|---|---|---|
| gda_pgx | GDA+ePGx v1 | 1,933,117 | 800 MiB |
| gsa_pgx | GSA-PGx v4 | 683,054 | 274 MiB |
Both tables share the same schema.
gda_pgx / gsa_pgx| Column | Type | Description |
|---|---|---|
| ilmn_id | string | Illumina internal probe ID |
| name | string | Variant name (position-based or rsID) |
| ilmn_strand | string | Probe strand relative to TOP/BOT designation |
| alleles | array | Two alleles, e.g. ["A","C"] |
| address_a | string | Bead address for allele A |
| probe_seq_a | string | 50-mer probe sequence for allele A |
| address_b | string | Bead address for allele B (two-colour probes only) |
| probe_seq_b | string | 50-mer probe sequence for allele B (optional) |
| chr | string | Chromosome |
| position | int | GRCh38 genomic position (1-based) |
| source | string | Variant source (PAGE, 1000genomes, CNV, ...) |
| source_strand | string | Source strand (TOP / BOT) |
| source_seq | string | ~120 bp flanking sequence with allele in brackets |
| top_genomic_seq | string | TOP-strand genomic context with allele in brackets |
| ref_strand | string | Strand relative to reference (+ / -) |
| bead_set_id | int | Bead set identifier |
| exp_clusters | int | Expected cluster count (2 = mono, 3 = biallelic) |
| intensity_only | bool | Intensity-only probe (no genotype call) |
-- All probes on chromosome 22 in gda_pgx SELECT name, position, alleles, source, ref_strand FROM gda_pgx WHERE chr = '22' ORDER BY position LIMIT 20
-- CNV probes in gsa_pgx SELECT name, chr, position, probe_seq_a FROM gsa_pgx WHERE source = 'CNV' ORDER BY chr, position LIMIT 20