Datasets — SEC EDGAR, PubMed, NVD/CVE, Ethereum, Bitcoin, GWAS, arXiv, FRED, ClinPGx

NVD — National Vulnerability Database

nvd

Table: cve Rows: ~333,000 Size: ~720 MB scanned

Every CVE published by NIST, including CVSS scores, severity ratings, CWE weaknesses, and advisory references. Updated periodically from the NVD API v2.0.

Schema — table cve

Column	Type	Description
id	string	CVE identifier (e.g. `CVE-2024-12345`)
published	string	Publication date
lastModified	string	Last modification date
vulnStatus	string	`Analyzed`, `Modified`, `Awaiting Analysis`
descriptions[0].value	string	English description
metrics.cvssMetricV31[0] .cvssData.baseScore	float	CVSS v3.1 base score (0–10)
metrics.cvssMetricV31[0] .cvssData.baseSeverity	string	`LOW` / `MEDIUM` / `HIGH` / `CRITICAL`
metrics.cvssMetricV31[0] .cvssData.vectorString	string	Full CVSS vector string
weaknesses[0].description[0].value	string	Primary CWE identifier
references	array	Advisory URLs and sources

Note: Timestamp columns require backtick-quoted ISO 8601 literals for range comparisons — single-quoted strings return zero rows. Example: `2024-01-01T00:00:00Z`

-- Critical CVEs published in 2024
SELECT id, published,
       metrics.cvssMetricV31[0].cvssData.baseScore AS score,
       descriptions[0].value AS description
FROM   cve
WHERE  metrics.cvssMetricV31[0].cvssData.baseSeverity = 'CRITICAL'
  AND  published >= `2024-01-01T00:00:00Z`
ORDER BY score DESC
LIMIT 20

OSV — Open Source Vulnerabilities

osv

Table: advisories Rows: ~4,140,000 Size: ~420 MB scanned

OSV.dev advisories flattened to one row per (advisory, affected package). Covers PyPI, npm, Go, Maven, Rust, Debian, Alpine, and more. Superset of NVD — includes ecosystem-specific advisories (GHSA, PYSEC, RUSTSEC) with precise package-version ranges for direct dependency scanning.

Schema — table advisories

Column	Type	Description
id	string	Advisory ID: `CVE-...`, `GHSA-...`, `PYSEC-...`
published	string	First published date
modified	string	Last modified date
withdrawn	string	Withdrawal date if retracted (optional)
aliases	array	Related IDs (e.g. CVE alias for a GHSA)
summary	string	Short description
severity_type	string	CVSS type (e.g. `CVSS_V3`)
severity_score	string	Full CVSS vector string
cwe_ids	array	CWE weakness identifiers
ecosystem	string	`PyPI`, `npm`, `Go`, `Maven`, `crates.io`, ...
package_name	string	Package name within the ecosystem
purl	string	Package URL (optional)
introduced	string	First affected version (`0` = all versions before fixed)
fixed	string	First fixed version; absent if not yet fixed

-- All advisories for a specific npm package
SELECT id, introduced, fixed, summary
FROM   advisories
WHERE  ecosystem = 'npm' AND package_name = 'lodash'

-- Unfixed critical advisories in PyPI (CVSS network-reachable, high impact)
SELECT id, package_name, summary
FROM   advisories
WHERE  ecosystem = 'PyPI'
  AND  severity_score ~ 'AV:N.*C:H'
  AND  fixed IS MISSING
LIMIT 50

-- Advisory counts by ecosystem
SELECT ecosystem, COUNT(*) AS n
FROM   advisories
GROUP BY ecosystem
ORDER BY n DESC

SEC EDGAR — Financial Facts

sec

Table: edgar Rows: ~97,800,000 Companies: ~8,900 Size: ~6.2 GB scanned

Structured XBRL financial facts from SEC filings (10-K, 10-Q, 8-K). One row per reported fact — every line item a public company files: revenue, net income, assets, shares outstanding, and thousands of other GAAP/IFRS concepts.

Note: The end column (period end date) is a reserved keyword in Sneller SQL. Do not use it in SELECT or ORDER BY — filter by filed or fy instead.

Schema — table edgar

Column	Type	Description
cik	string	SEC Central Index Key
ticker	string	Stock ticker symbol
company	string	Company name
taxonomy	string	`us-gaap`, `ifrs-full`, `dei`, ...
concept	string	XBRL concept (e.g. `Revenues`, `NetIncomeLoss`)
label	string	Human-readable label for the concept
unit	string	`USD`, `shares`, `pure`, ...
end	string	Period end date (reserved keyword — avoid in SELECT)
val	float	Reported value
form	string	Filing form: `10-K`, `10-Q`, `8-K`, ...
fy	string	Fiscal year
fp	string	Fiscal period: `FY`, `Q1`, `Q2`, `Q3`
filed	string	Filing date
accn	string	SEC accession number

-- Annual revenue for a company across fiscal years
SELECT ticker, fy, fp, val AS revenue, filed
FROM   edgar
WHERE  ticker = 'AAPL'
  AND  concept = 'RevenueFromContractWithCustomerExcludingAssessedTax'
  AND  form = '10-K'
ORDER BY fy DESC
LIMIT 10

-- Compare net income across S&P 500 for latest fiscal year
SELECT ticker, company, val AS net_income, fy
FROM   edgar
WHERE  concept = 'NetIncomeLoss'
  AND  form = '10-K' AND fy = 2024 AND unit = 'USD'
ORDER BY net_income DESC
LIMIT 20

-- Balance sheet: total assets vs total liabilities for one company
SELECT ticker, fy, fp, concept, val, filed
FROM   edgar
WHERE  ticker = 'MSFT'
  AND  concept IN ('Assets', 'Liabilities')
  AND  form = '10-K'
ORDER BY fy DESC
LIMIT 10

-- Discover which concepts a company has reported (useful before writing queries)
SELECT concept, label, unit, COUNT(*) AS filings
FROM   edgar
WHERE  ticker = 'TSLA'
GROUP BY concept, label, unit
ORDER BY filings DESC
LIMIT 50

-- Quarterly earnings per share trend
SELECT ticker, fy, fp, val AS eps, filed
FROM   edgar
WHERE  ticker = 'NVDA'
  AND  concept = 'EarningsPerShareBasic'
  AND  form = '10-Q'
ORDER BY filed DESC
LIMIT 12

Ethereum — On-chain Data

eth

Tables: transfers, blocks, contracts, transactions, dex_swaps, lending, lp_events, mev History: 2015 – present Size: transfers ~70 GB · blocks ~770 MB · contracts ~9.6 GB · transactions ~136 GB · dex_swaps ~17 GB

On-chain Ethereum data sourced from the AWS Public Blockchain dataset. All tables are partitioned by date with native block_timestamp indexing — queries target only the relevant daily partitions; no full-table scans needed for time-range queries.

Table transfers — decoded ERC-20 / ERC-721 Transfer events · ~70 GB scanned (1 year ingested)

Column	Type	Description
block_number	int	Block height
block_timestamp	string	Block timestamp (ISO 8601)
transaction_hash	string	Transaction hash
log_index	int	Log position within the transaction
token_address	string	Contract address of the token
from_address	string	Sender address
to_address	string	Recipient address
value	string	Transfer amount (raw; token decimals vary)

-- Recent large USDC transfers (USDC = 6 decimals, $1M = 1e12)
SELECT transaction_hash, from_address, to_address, "value"
FROM   transfers
WHERE  token_address = '0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48'
  AND  CAST("value" AS FLOAT) > 1000000000000
  AND  block_timestamp >= '2025-01-01'
LIMIT 20

Table blocks — one row per block (header, gas, base fee) · ~770 MB scanned (365 days ingested)

Column	Type	Description
number	int	Block height
timestamp	string	Block timestamp (ISO 8601)
hash	string	Block hash
miner	string	Validator / miner address
gas_used	int	Gas consumed by all transactions
gas_limit	int	Block gas limit
base_fee_per_gas	int	EIP-1559 base fee (wei); null pre-London
transaction_count	int	Number of transactions in the block
difficulty	int	Block difficulty (0 post-Merge)
size	int	Block size in bytes

-- Average gas price and block utilisation over the last 7 days
SELECT DATE(timestamp) AS day,
       AVG(base_fee_per_gas) / 1e9 AS avg_base_fee_gwei,
       AVG(gas_used * 100.0 / gas_limit) AS avg_utilisation_pct
FROM   blocks
WHERE  timestamp >= '2026-04-06'
GROUP BY day ORDER BY day

Table contracts — one row per contract deployment · full history 2015 – present, ~9.6 GB scanned

Column	Type	Description
address	string	Deployed contract address
deployer_address	string	Address that deployed the contract
block_timestamp	string	Deployment timestamp (ISO 8601)
block_number	int	Block height of deployment
transaction_hash	string	Deployment transaction hash
is_erc20	bool	Detected ERC-20 token contract
is_erc721	bool	Detected ERC-721 NFT contract
function_sighashes	array	4-byte function selectors present in bytecode

-- Lookup a contract: who deployed it and when?
SELECT deployer_address, block_timestamp, transaction_hash,
       is_erc20, is_erc721
FROM   contracts
WHERE  address = '0xdac17f958d2ee523a2206206994597c13d831ec7'

Table transactions — one row per transaction (native ETH transfers, gas spend, contract calls) · ~136 GB, last 365 days ingested

Column	Type	Description
transaction_hash	string	Transaction hash (unique identifier)
block_number	int	Block height
block_timestamp	string	Block timestamp (ISO 8601)
transaction_index	int	Position of tx within the block
from_address	string	Sender address
to_address	string	Recipient address (null for contract creation)
value	int	Native ETH transferred (wei; 1 ETH = 1e18)
gas	int	Gas limit set by sender
gas_price	int	Gas price (wei/gas; for legacy type 0/1 txns)
nonce	int	Sender nonce
transaction_type	int	0=legacy, 1=access-list, 2=EIP-1559
max_fee_per_gas	int	EIP-1559 max total fee (wei/gas)
max_priority_fee_per_gas	int	EIP-1559 tip to validator (wei/gas)
receipt_status	int	1=success, 0=reverted
receipt_gas_used	int	Actual gas consumed
receipt_effective_gas_price	int	Price actually paid (base fee + tip, wei/gas)
receipt_contract_address	string	Deployed contract address; null if not a creation
receipt_cumulative_gas_used	int	Cumulative gas used in block up to this tx

-- Top 10 gas spenders yesterday (total ETH paid in fees)
SELECT from_address,
       COUNT(*) AS tx_count,
       SUM(receipt_gas_used * receipt_effective_gas_price) / 1e18 AS eth_fees
FROM   transactions
WHERE  block_timestamp >= '2026-04-12'
  AND  block_timestamp <  '2026-04-13'
GROUP BY from_address
ORDER BY eth_fees DESC
LIMIT 10

Table dex_swaps — Uniswap v2 & v3 Swap events · 365 daily partitions

Column	Type	Description
pool_address	string	Uniswap pool contract address
transaction_hash	string	Transaction hash
log_index	int	Log position within the transaction
block_number	int	Block height
block_timestamp	string	Block timestamp (ISO 8601)
protocol	string	`v2` or `v3`
sender	string	Swap initiator address
recipient	string	Swap recipient address
amount0	int	v3: signed token0 delta (negative = token0 left pool)
amount1	int	v3: signed token1 delta; null for v2
amount0_in / amount0_out	int	v2: unsigned token0 in/out; null for v3
amount1_in / amount1_out	int	v2: unsigned token1 in/out; null for v3
sqrt_price_x96	int	v3: post-swap sqrt(price) × 2⁹⁶; null for v2
liquidity	int	v3: active liquidity at time of swap; null for v2
tick	int	v3: current tick after swap; null for v2

-- Daily swap count and net USDC flow on the USDC/WETH 0.05% pool
-- amount0 is in USDC raw units (6 dec); positive = USDC into pool (ETH bought)
SELECT partition_date,
       COUNT(*) AS swap_count,
       SUM(amount0) / 1e6 AS net_usdc_flow
FROM   dex_swaps
WHERE  pool_address = '0x88e6a0c2ddd26feeb64f039a2c41296fcb3f5640'
  AND  block_timestamp >= '2026-04-07'
  AND  block_timestamp <  '2026-04-14'
GROUP BY partition_date ORDER BY partition_date LIMIT 10

Table lending — Aave v2/v3 and Compound v3 lending events · 365 daily partitions

Column	Type	Description
contract_address	string	Lending pool contract address
transaction_hash	string	Transaction hash
log_index	int	Log position within the transaction
block_number	int	Block height
block_timestamp	string	Block timestamp (ISO 8601)
protocol	string	`aave_v2`, `aave_v3`, or `compound_v3`
event_type	string	`supply`, `withdraw`, `borrow`, `repay`, `liquidation`, `supply_collateral`, `withdraw_collateral`
asset	string	Token contract address (reserve asset)
user	string	User address
on_behalf_of	string	Beneficiary address (may differ from user)
amount	int	Raw token amount (apply asset decimals)
borrow_rate_mode	int	Aave borrow rate mode: 1=stable, 2=variable; null for other events
collateral_asset	string	Collateral token address; non-null for liquidation events only
collateral_amount	int	Collateral seized; non-null for liquidation events only

-- Daily USDC supply volume by protocol (Aave v3, Compound v3)
SELECT partition_date,
       protocol,
       COUNT(*) AS supply_events,
       SUM(amount) / 1e6 AS volume_usdc
FROM   lending
WHERE  event_type = 'supply'
  AND  asset = '0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48'
  AND  block_timestamp >= '2026-04-07'
  AND  block_timestamp <  '2026-04-14'
GROUP BY partition_date, protocol ORDER BY partition_date LIMIT 20

Table lp_events — Uniswap v3 Mint and Burn (liquidity provision) events · 365 daily partitions

Column	Type	Description
pool_address	string	Uniswap v3 pool contract address
owner	string	Address that owns the position
tick_lower	int	Lower bound of the price range (tick)
tick_upper	int	Upper bound of the price range (tick)
transaction_hash	string	Transaction hash
log_index	int	Log position within the transaction
block_number	int	Block height
block_timestamp	string	Block timestamp (ISO 8601)
event_type	string	`mint` (add liquidity) or `burn` (remove liquidity)
sender	string	Caller address for mint; null for burn
amount	int	Liquidity units added or removed
amount0	int	Token0 deposited (mint) or withdrawn (burn)
amount1	int	Token1 deposited (mint) or withdrawn (burn)

-- Net liquidity minted vs burned in USDC/WETH 0.05% pool (last 7 days)
SELECT event_type,
       COUNT(*) AS events,
       SUM(amount0) / 1e6  AS usdc_total,
       SUM(amount1) / 1e18 AS weth_total
FROM   lp_events
WHERE  pool_address = '0x88e6a0c2ddd26feeb64f039a2c41296fcb3f5640'
  AND  block_timestamp >= '2026-04-07'
GROUP BY event_type

Table mev — MEV-boost relay payloads (block auction data) · 365 daily partitions · ~2.2M rows · ~0.6 GB scanned

Note: Sourced from the ultra_sound relay API. The date partition field is stored as a datetime internally — use backtick ISO 8601 literals for date range filters.

Column	Type	Description
block_number	int	Block height
block_hash	string	Block hash
slot	int	Beacon chain slot number
timestamp	int	Unix timestamp of the block
date	date	Partition key (stored as datetime internally)
relay	string	Relay name (e.g. `ultra_sound`, `flashbots`, `bloxroute_max`)
proposer_fee_recipient	string	Validator fee recipient address
builder_pubkey	string	Block builder BLS public key
gas_used	int	Gas consumed by the block
gas_limit	int	Block gas limit
num_txs	int	Number of transactions in the block
value_wei	int	Block value paid to proposer (wei)
value_eth	float	Block value paid to proposer (ETH)

-- Relay market share — block count and MEV earned (last 30 days)
SELECT relay,
       COUNT(*) AS blocks,
       TRUNC(AVG(value_eth)*1000000)/1000000  AS avg_mev_eth,
       TRUNC(SUM(value_eth)*10000)/10000      AS total_mev_eth
FROM   mev
WHERE  date >= `2026-03-16T00:00:00Z`
GROUP BY relay
ORDER BY blocks DESC
LIMIT 10

-- Top 10 highest-value MEV blocks all time
SELECT date, block_number, relay, value_eth, num_txs
FROM   mev
ORDER BY value_eth DESC
LIMIT 10

Base — On-chain Data

base

Tables: transfers, blocks, transactions History: 30 days (~7-day publishing lag) Size: transfers ~70 GB · blocks ~0.6 GB · transactions ~67 GB

On-chain data for Base, Coinbase's OP-Stack L2 rollup, sourced from the AWS Public Blockchain dataset. All tables are partitioned by partition_date (one partition per calendar day). Source data has a ~7-day publishing lag. Transactions include L2-specific fee fields (l1_fee, l2_fee, l1_gas_price) not present on Ethereum mainnet.

Table transfers — decoded ERC-20 Transfer events · ~70 GB / 30 days

Column	Type	Description
block_number	int	Block height
block_timestamp	string	Block timestamp (ISO 8601)
transaction_hash	string	Transaction hash
log_index	int	Log position within the transaction
token_address	string	Contract address of the token
from_address	string	Sender address (42-char, 0x-prefixed)
to_address	string	Recipient address
value	int	Transfer amount (raw, token decimals vary)
partition_date	string	Partition key (YYYY-MM-DD)

-- Recent USDC transfers on Base over $50 (USDC = 6 decimals)
-- Update partition_date to a recent date within the last 30 days
SELECT transaction_hash, from_address, to_address, "value"
FROM   transfers
WHERE  token_address = '0x833589fcd6edb6e08f4c7c32d4f71b54bda02913'
  AND  CAST("value" AS FLOAT) > 50000000
  AND  partition_date IN ('2026-04-12', '2026-04-13')
LIMIT 20

Table blocks — one row per block · ~0.6 GB / 30 days

Column	Type	Description
number	int	Block height
timestamp	string	Block timestamp (ISO 8601)
hash	string	Block hash
miner	string	Sequencer / fee recipient address
gas_used	int	Gas consumed by all transactions
gas_limit	int	Block gas limit
base_fee_per_gas	int	EIP-1559 base fee (wei)
transaction_count	int	Number of transactions in the block
size	int	Block size in bytes
partition_date	string	Partition key (YYYY-MM-DD)

-- Average base fee and block utilisation over the last 7 days on Base
SELECT DATE(timestamp) AS day,
       AVG(base_fee_per_gas) / 1e9 AS avg_base_fee_gwei,
       AVG(gas_used * 100.0 / gas_limit) AS avg_utilisation_pct
FROM   blocks
WHERE  partition_date >= '2026-04-06'
GROUP BY day ORDER BY day

Table transactions — one row per transaction, with L2 fee fields · ~67 GB / 30 days

Column	Type	Description
transaction_hash	string	Transaction hash
block_number	int	Block height
block_timestamp	string	Block timestamp (ISO 8601)
from_address	string	Sender address
to_address	string	Recipient address (null for contract creation)
value	int	Native ETH transferred (wei)
gas	int	Gas limit set by sender
receipt_status	int	1=success, 0=reverted
receipt_gas_used	int	Actual gas consumed
receipt_effective_gas_price	int	Price actually paid (wei/gas)
l1_fee	int	L1 data fee paid to Ethereum (wei)
l1_gas_price	int	L1 gas price at inclusion (wei/gas)
l1_fee_scalar	string	Scalar used to compute L1 fee
l2_fee	int	L2 execution fee (wei)
mint	int	ETH minted by L1-to-L2 deposit; null for regular txns
source_hash	string	Deposit source hash; non-null for L1-to-L2 deposits
partition_date	string	Partition key (YYYY-MM-DD)

-- Top gas spenders on Base by total L1 + L2 fees paid
SELECT from_address,
       COUNT(*) AS tx_count,
       SUM(l2_fee + COALESCE(l1_fee, 0)) / 1e18 AS total_eth_fees
FROM   transactions
WHERE  partition_date >= '2026-04-06'
GROUP BY from_address
ORDER BY total_eth_fees DESC
LIMIT 10

Bitcoin

btc

Tables: blocks, outputs Source: AWS Public Blockchain

Table blocks — one row per block · ~6.6 GB scanned, full history 2009 – present

Column	Type	Description
number	int	Block height
timestamp	string	Block timestamp (ISO 8601); natively indexed
mediantime	string	Median time of last 11 blocks (ISO 8601); natively indexed
hash	string	Block hash
transaction_count	int	Number of transactions in the block
difficulty	float	Mining difficulty at this block
size	int	Block size in bytes
weight	int	Block weight (SegWit units)
miner	string	Coinbase nonce (miner tag)
previousblockhash	string	Parent block hash

-- Bitcoin difficulty trend over the last 90 days
SELECT partition_date,
       AVG(difficulty) AS avg_difficulty,
       SUM(transaction_count) AS daily_txns
FROM   blocks
WHERE  timestamp >= `2026-01-13T00:00:00Z`
GROUP BY partition_date ORDER BY partition_date

Table outputs — one row per transaction output · 730 daily partitions (2024 – present) · ~115 GB scanned · partition key: block_timestamp

Note: "value" must be quoted in SQL — it is a reserved word in Sneller. This table contains outputs only; inputs are not included so UTXO spent/unspent state is not available.

Column	Type	Description
txid	string	Transaction ID
block_number	int	Block height containing this transaction
block_timestamp	string	Block timestamp (ISO 8601); natively indexed
output_index	int	Output index within the transaction (vout)
address	string	Recipient address
"value"	int	Output value in satoshis (1 BTC = 100,000,000 satoshis)
type	string	Output script type: P2PKH, P2SH, P2WPKH, P2WSH, P2TR, etc.
is_coinbase	bool	True if this output is a miner block reward

-- Total BTC received by an address over the last 2 years
SELECT SUM("value") / 1e8    AS total_btc,
       COUNT(*)               AS output_count
FROM   outputs
WHERE  address = 'bc1qxy2kgdygjrsqtzq2n0yrf2493p83kkfjhx0wlh'

DeFi — Protocol Data

defi

Tables: protocols, tvl Source: DeFiLlama API

DeFi protocol data sourced from the DeFiLlama API. The protocols table is a current snapshot of all tracked protocols; tvl contains daily total-value-locked history for protocols with TVL ≥ $10M at fetch time.

Note: The date partition field in tvl is stored as a datetime internally — use backtick ISO 8601 literals: `2026-04-15T00:00:00Z`

Table protocols — snapshot of all DeFiLlama protocols · ~7,300 rows · ~2 MB scanned

Column	Type	Description
slug	string	DeFiLlama protocol slug (unique identifier)
name	string	Protocol display name
category	string	Protocol category (e.g. `Lending`, `DEX`, `Liquid Staking`)
chain	string	Primary chain
chains	array	All chains where the protocol is deployed
tvl	float	Total value locked (USD)
change_1h	float	TVL change over last 1 hour (%)
change_1d	float	TVL change over last 24 hours (%)
change_7d	float	TVL change over last 7 days (%)
mcap	float	Market capitalisation (USD; optional)
fdv	float	Fully diluted valuation (USD; optional)
description	string	Short protocol description
url	string	Protocol website URL
twitter	string	Twitter handle (optional)
gecko_id	string	CoinGecko identifier (optional)

-- Top 20 protocols by TVL
SELECT name, category, chain,
       TRUNC(tvl/1e9*1000)/1000  AS tvl_billions
FROM   protocols
ORDER BY tvl DESC
LIMIT 20

-- TVL by category (all protocols with TVL > 0)
SELECT category,
       COUNT(*) AS protocols,
       TRUNC(SUM(tvl)/1e9*100)/100  AS total_tvl_billions
FROM   protocols
WHERE  tvl > 0
GROUP BY category
ORDER BY total_tvl_billions DESC
LIMIT 20

Table tvl — daily TVL history per protocol · 365 daily partitions · ~0.4 GB scanned

Column	Type	Description
date	date	Partition key (stored as datetime; use backtick literals)
protocol	string	Protocol display name
slug	string	DeFiLlama protocol slug (joins to `protocols`)
category	string	Protocol category
chain	string	Chain for this TVL record
chains	array	All chains where the protocol is deployed
tvl	float	Total value locked on this date (USD)

-- Total DeFi TVL trend — last 30 days
SELECT date,
       TRUNC(SUM(tvl)/1e9*100)/100  AS total_tvl_billions,
       COUNT(*)                        AS protocol_count
FROM   tvl
WHERE  date >= `2026-03-16T00:00:00Z`
GROUP BY date
ORDER BY date
LIMIT 30

-- Aave V3 TVL history (last 90 days; each version has its own slug)
SELECT date,
       TRUNC(SUM(tvl)/1e9*1000)/1000  AS tvl_billions
FROM   tvl
WHERE  slug = 'aave-v3'
  AND  date >= `2026-01-15T00:00:00Z`
GROUP BY date
ORDER BY date
LIMIT 90

PubMed — Biomedical Literature

pubmed

Table: baseline Rows: ~40,000,000 Partitions: 247 year partitions (1781–2027) · partition key: pub_year Size: ~138 GB scanned

Full PubMed baseline: titles, abstracts, authors, journals, MeSH terms, and publication metadata for biomedical literature. Data originates from NLM XML; nested fields preserve the original XML element names. Filter by pub_year (integer) to scan only the relevant year partition instead of the full 138 GB.

Schema — table baseline (selected fields)

Column	Type	Description
MedlineCitation.PMID.text	string	PubMed identifier (PMID is an object; use `.text` for the plain ID)
MedlineCitation.Article .ArticleTitle	string	Article title
MedlineCitation.Article .Abstract.AbstractText	string / object	Abstract text (89% coverage, polymorphic): `string` for plain abstracts; `{text, Label}` per section for structured abstracts (e.g. Label: “BACKGROUND”)
MedlineCitation.Article .Journal.Title	string	Journal name
MedlineCitation.Article .AuthorList.Author	array	Authors (polymorphic elements): `{LastName, ForeName, Affiliation}` for individuals; `{CollectiveName}` for groups; 57% coverage
MedlineCitation .MeshHeadingList.MeshHeading	array	MeSH controlled vocabulary terms; 57% coverage
MedlineCitation .KeywordList.Keyword	array	Author-supplied keywords; 60% coverage
MedlineCitation .ChemicalList.Chemical	array	Chemical substances (NameOfSubstance, RegistryNumber); 28% coverage
MedlineCitation .DateCompleted.Year	string	Year NLM indexing was completed
PubmedData.ArticleIdList .ArticleId	array	DOI, PMC ID, and other identifiers

-- Recent articles matching a topic (regex title search)
SELECT MedlineCitation.PMID.text                  AS pmid,
       MedlineCitation.Article.ArticleTitle        AS title,
       MedlineCitation.Article.Journal.Title        AS journal,
       MedlineCitation.DateCompleted.Year           AS year
FROM   baseline
WHERE  pub_year = 2023
  AND  MedlineCitation.Article.ArticleTitle ~ '(?i)CRISPR'
LIMIT 20

-- All papers by a specific author (any position in author list)
-- Author is an array: use comma in FROM to expand one row per author
SELECT DISTINCT
       p.MedlineCitation.PMID.text                  AS pmid,
       p.MedlineCitation.Article.ArticleTitle       AS title,
       p.MedlineCitation.Article.Journal.Title      AS journal,
       p.pub_year
FROM   baseline AS p,
       p.MedlineCitation.Article.AuthorList.Author AS a
WHERE  p.pub_year = 2023
  AND  a.LastName = 'Smith'
LIMIT 20

-- Papers co-authored by ALL of the listed authors (AND semantics)
-- IN() matches any name; HAVING matched = N enforces all-must-match
SELECT p.MedlineCitation.PMID.text             AS pmid,
       COUNT(DISTINCT a.LastName)              AS matched
FROM   baseline AS p,
       p.MedlineCitation.Article.AuthorList.Author AS a
WHERE  p.pub_year = 2023
  AND  a.LastName IN ('Smith', 'Jones', 'Brown')
GROUP BY pmid
HAVING matched = 3

ClinicalTrials.gov — Clinical Studies

clinicaltrials

Table: studies Rows: ~580,000 Size: ~663 MB scanned

All registered clinical studies from ClinicalTrials.gov, sourced via the v2 API. One row per study. Pairs with PubMed via nct_id. Useful for trial matching by condition or drug, drug pipeline analysis, sponsor activity, and eligibility criteria queries.

Schema — table studies (selected fields)

Column	Type	Description
nct_id	string	ClinicalTrials.gov identifier (e.g. NCT00123456)
brief_title	string	Short study title
official_title	string	Full protocol title
overall_status	string	RECRUITING, COMPLETED, ACTIVE_NOT_RECRUITING, etc.
start_date	string	Study start date (YYYY-MM-DD)
completion_date	string	Primary completion date (YYYY-MM-DD)
study_first_submitted_date	string	Date first submitted to registry
study_type	string	INTERVENTIONAL or OBSERVATIONAL
phases	array	Trial phase(s): PHASE1, PHASE2, PHASE3, PHASE4, NA
enrollment	int	Target or actual enrollment count
lead_sponsor	string	Name of the lead sponsor
sponsor_class	string	INDUSTRY, NIH, OTHER_GOV, OTHER
conditions	array	Disease/condition names
keywords	array	Free-text keywords
interventions	array	Objects with `type` (DRUG, DEVICE, …) and `name`
brief_summary	string	Plain-language study summary (truncated at 2000 chars)
sex	string	Eligibility: ALL, FEMALE, MALE
minimum_age	string	Minimum participant age (e.g. "18 Years")
maximum_age	string	Maximum participant age
has_results	bool	True if results have been posted

-- Trials mentioning GLP-1 (regex search on title)
SELECT nct_id, brief_title, overall_status
FROM   studies
WHERE  brief_title ~ '(?i)GLP-?1'
LIMIT 20

FDA FAERS — Adverse Event Reports

fda

Table: faers Rows: ~63,000,000 Coverage: 2012Q4 – 2025Q4 (53 quarters) Size: ~6.5 GB scanned

FDA Adverse Event Reporting System (FAERS) — voluntary reports of suspected drug side effects submitted by patients, healthcare providers, and manufacturers. One row per (case, drug): reactions and outcomes are stored as arrays so a single WHERE clause covers all events in a case. Drug names are verbatim reporter text — use regex to match across brand names, generics, and dosage variants. Pairs with PubMed and ClinicalTrials.gov for pharmacovigilance workflows.

Note: Pre-2012 LAERS-era data is not included. Reports are deduplicated to the highest caseversion per case. Voluntary reporting introduces bias — counts reflect reporting volume, not true incidence rates.

Schema — table faers (selected fields)

Column	Type	Description
primaryid	int	Unique case-version identifier
caseid	int	Case identifier (stable across versions)
event_dt	string	Date of adverse event (YYYY-MM-DD; may be partial)
year	int	Year of adverse event; partition key
age_years	float	Patient age in years (normalised from age + age_cod)
sex	string	M / F / UNK
occr_country	string	Country where event occurred (ISO 2-letter)
drug_seq	int	Drug sequence number within the case
drugname	string	Verbatim drug name as reported
drug_role	string	PS = primary suspect, SS = secondary, C = concomitant, I = interacting
route	string	Route of administration (Oral, Intravenous, etc.)
reactions	array	MedDRA preferred terms for all reactions in this case
outcomes	array	DE=death, HO=hospitalisation, LT=life-threatening, DS=disability, CA=congenital anomaly, RI=required intervention, OT=other
indications	array	MedDRA preferred terms for the drug's indication(s)

-- Most common reactions for a drug (regex matches brand + generic names)
SELECT reaction, COUNT(*) AS n
FROM   faers, UNNEST(reactions) AS reaction
WHERE  drugname ~ '(?i)ozempic|semaglutide'
  AND  drug_role = 'PS'
GROUP BY reaction
ORDER BY n DESC
LIMIT 20

-- Fatal cases by drug class (regex on drug name, filter outcome = death)
SELECT drugname,
       COUNT(*) AS fatal_reports
FROM   faers
WHERE  drugname ~ '(?i)fentanyl|oxycodone|hydrocodone'
  AND  ARRAY_CONTAINS(outcomes, 'DE')
GROUP BY drugname
ORDER BY fatal_reports DESC
LIMIT 15

FEC Campaign Finance

fec

Table: contributions Rows: 279,415,061 Partitions: 24 cycle partitions (1980–2026) · partition key: cycle Size: ~37 GB scanned

Individual campaign contributions reported to the FEC, covering every US federal election cycle from 1980 to 2026. One row per contribution, enriched with committee name, type, and party affiliation (from the FEC committee master file) and candidate name where applicable. Filter by cycle (4-digit even year) to scan only the cycles relevant to your query. Source: FEC Bulk Data — US Government public domain.

Schema — table contributions

Column	Type	Description
cycle	int	Election cycle year (1980, 1982, …, 2026) — partition key
sub_id	string	Unique FEC record identifier
cmte_id	string	FEC committee ID receiving the contribution
cmte_name	string	Committee name (enriched from committee master)
cmte_type	string	Committee type code (H=House, S=Senate, P=Presidential, Q/N/W=PAC, …)
cmte_party	string	Party affiliation (DEM, REP, …)
cand_id	string	FEC candidate ID (when contribution targets a specific candidate)
cand_name	string	Candidate name (enriched from candidate master)
entity_tp	string	Contributor entity type (IND=individual, PAC, ORG, …)
name	string	Contributor name
city / state / zip_code	string	Contributor location
employer	string	Contributor employer
occupation	string	Contributor occupation
transaction_dt	string	Contribution date (YYYY-MM-DD)
transaction_amt	int	Amount in dollars
transaction_tp	string	Transaction type code
other_id	string	Other committee ID (set for transfers between committees)
tran_id	string	Transaction identifier within the filing
memo_text	string	Optional memo / description

Example queries

-- Top employers of donors to a committee (2024 cycle)
SELECT   employer, COUNT(*) AS n, SUM(transaction_amt) AS total_usd
FROM     contributions
WHERE    cycle = 2024
  AND    cmte_name ~ '(?i)actblue|winred'
  AND    employer != ''
GROUP BY employer
ORDER BY total_usd DESC
LIMIT    20

-- Political exposure check: donations by employees of a company (2024 cycle)
SELECT   name, cycle, cmte_name, cmte_party, transaction_amt, transaction_dt
FROM     contributions
WHERE    employer ~ '(?i)goldman sachs'
  AND    cycle = 2024
  AND    transaction_amt >= 2500
ORDER BY transaction_amt DESC
LIMIT    25

-- Top recipients of Google employee donations, 2024 cycle
SELECT   cmte_name, cmte_party,
         SUM(transaction_amt) AS total_usd,
         COUNT(*) AS n_contributions
FROM     contributions
WHERE    cycle = 2024
  AND    employer ~ '(?i)google'
GROUP BY cmte_name, cmte_party
ORDER BY total_usd DESC
LIMIT    10

FDA Orange Book — Approved Drug Products

fda

Table: orangebook Rows: 51,063 Size: 5 MB scanned

FDA-approved drug products from the Orange Book via the openFDA bulk export. One row per product strength/form. Includes therapeutic equivalence (TE) codes that determine generic substitutability. Essential for pharma agents querying approval status, generics entry, and formulary decisions.

Schema — table orangebook

Column	Type	Description
application_number	string	NDA or ANDA number (e.g. `NDA009700`)
appl_type	string	`N` = NDA (brand), `A` = ANDA (generic)
sponsor_name	string	Applicant / sponsor company name
product_number	string	Product sequence within the application
brand_name	string	Trade / brand name
active_ingredients	array	`[{name, strength}]` — active ingredient names and strengths
dosage_form	string	e.g. `TABLET`, `CAPSULE`
route	string	e.g. `ORAL`, `INTRAVENOUS`
marketing_status	string	e.g. `Prescription`, `OTC`
te_code	string	Therapeutic equivalence code (AB = substitutable; may be null)

-- All approved generics for a drug (AB-rated = substitutable)
SELECT application_number, sponsor_name, brand_name, te_code
FROM   orangebook
WHERE  ARRAY_CONTAINS(active_ingredients[0].name, 'METFORMIN')
  AND  te_code = 'AB'
ORDER BY sponsor_name
LIMIT 20

-- All products by a given sponsor
SELECT application_number, brand_name, dosage_form,
       route, te_code
FROM   orangebook
WHERE  sponsor_name ~ '(?i)pfizer'
  AND  appl_type = 'N'
ORDER BY brand_name
LIMIT 25

FRED — Federal Reserve Economic Data

fred

Table: series Rows: 357,442 Series: 115 curated Size: 16 MB scanned

Curated macroeconomic time series from the St. Louis Fed FRED database: GDP, inflation, employment, interest rates, money supply, housing, fiscal, markets, and trade. One row per (series, date) observation. Series with redistribution restrictions are excluded.

This product uses the FRED® API but is not endorsed or certified by the Federal Reserve Bank of St. Louis.

Schema — table series

Column	Type	Description
series_id	string	FRED series identifier (e.g. `UNRATE`, `GDP`)
category	string	Thematic group: gdp, inflation, employment, rates, money, housing, production, trade, fiscal, markets, credit, income
title	string	Full series name (e.g. `Unemployment Rate`)
units	string	Unit of measure (e.g. `%`, `Bil. of $`)
frequency	string	Observation frequency (D=daily, M=monthly, Q=quarterly, A=annual)
seasonal_adjustment	string	SA = seasonally adjusted; NSA = not seasonally adjusted
obs_date	timestamp	Observation date
obs_year	int	Observation year (partition key; use for fast year-range queries)
value	float	Observation value in the series units (reserved word — quote as `"value"`)

-- US unemployment rate since 2020
SELECT obs_date, "value"
FROM   series
WHERE  series_id = 'UNRATE'
  AND  obs_year >= 2020
ORDER BY obs_date
LIMIT 100

-- Compare CPI and core PCE year-over-year
SELECT series_id, title, obs_date, "value"
FROM   series
WHERE  series_id IN ('CPIAUCSL', 'PCEPILFE')
  AND  obs_year >= 2021
ORDER BY obs_date, series_id
LIMIT 200

-- All available series IDs with titles
SELECT DISTINCT series_id, category, title, units, frequency
FROM   series
ORDER BY category, series_id
LIMIT 200

End of Life — Runtime & Framework Lifecycle Dates

eol

Table: cycles Rows: 7,884 Products: 451 Size: 1 MB scanned

End-of-life and support status for 450+ software products — languages, runtimes, frameworks, databases, operating systems, and cloud services. One row per release cycle. Source: endoflife.date. Ideal for agents checking dependency support status or auditing infrastructure.

Schema — table cycles

Column	Type	Description
product	string	Product slug (e.g. `python`, `nodejs`, `ubuntu`)
cycle	string	Release cycle label (e.g. `3.11`, `20.04`)
release_date	string	General availability date (YYYY-MM-DD)
eol	string	End-of-life date, or `false` (still supported), or `true` (EOL, no exact date)
latest	string	Latest patch release in this cycle
latest_date	string	Release date of the latest version
lts	string	LTS status: `true`, `false`, or LTS end date
support	string	Active support end date (before EOL), or `true`/`false`
link	string	URL for the release notes / changelog

-- Is Python 3.8 still supported?
SELECT product, cycle, "eol", lts, "latest"
FROM   cycles
WHERE  product = 'python'
ORDER BY cycle DESC
LIMIT 10

-- All products reaching EOL in 2025
SELECT product, cycle, "eol", "latest"
FROM   cycles
WHERE  "eol" LIKE '2025%'
ORDER BY "eol"
LIMIT 50

arXiv — Preprint Metadata

arxiv

Table: papers Rows: ~2,400,000 Coverage: 1991 – present Size: ~4 GB scanned

Metadata for all arXiv preprints — titles, abstracts, authors, categories, and submission dates. Covers CS, math, physics, quantitative finance, statistics, biology, and more. Papers appear on arXiv months before journal publication, making this the primary source for the latest research. Partitioned by submission month.

Schema — table papers

Column	Type	Description
id	string	arXiv ID (e.g. `2301.00001`)
title	string	Paper title (whitespace normalised)
abstract	string	Full abstract text
categories	array	arXiv category codes (e.g. `["cs.LG", "stat.ML"]`)
primary_cat	string	Primary category (first in list)
authors	array	`[{name, affiliation}]`
submitted	string	Original submission date (YYYY-MM-DD); partition key
updated	string	Last-updated date
doi	string	DOI if published; null otherwise
journal_ref	string	Journal reference if published; null otherwise
comments	string	Author comments (page count, code links, etc.)
license	string	License URL (e.g. CC BY 4.0)

-- Recent ML papers mentioning transformers
SELECT id, submitted, title, primary_cat
FROM   papers
WHERE  primary_cat = 'cs.LG'
  AND  submitted >= '2024-01-01'
  AND  title ~ '(?i)transformer'
ORDER BY submitted DESC
LIMIT 25

-- Papers by an author across all categories
SELECT id, submitted, title, categories
FROM   papers, UNNEST(authors) AS a
WHERE  a.name ~ '(?i)lecun'
ORDER BY submitted DESC
LIMIT 20

-- Category submission volume by year (quantitative finance)
SELECT DATE_TRUNC('year', CAST(submitted AS TIMESTAMP)) AS yr,
       COUNT(*) AS papers
FROM   papers
WHERE  primary_cat = 'q-fin.TR'
GROUP BY yr
ORDER BY yr
LIMIT 50

GWAS Catalog — Genome-Wide Association Studies

gwas

Table: associations Rows: 1,095,212 Partitions: 19 years (2008–2026) Size: ~0.3 GB scanned

NHGRI-EBI GWAS Catalog — curated genome-wide association study results. One row per variant–trait association. Joins ClinVar on rsid / dbsnp_id and PubMed on pubmed_id. Source: EBI FTP. Partition key: date_added.

Schema — table associations

Column	Type	Description
rsid	string	dbSNP rsID (e.g. `rs1801131`)
snp_risk_allele	string	Risk allele in `rsID-allele` format (e.g. `rs1801131-C`)
risk_allele_freq	float	Risk allele frequency in discovery sample
p_value	float	Association p-value
pvalue_mlog	number	−log&sub1;&sub0;(p-value)
or_beta	float	Odds ratio (binary traits) or beta (quantitative traits)
ci_95	string	95% confidence interval
disease_trait	string	Disease or trait name as reported by the study
mapped_trait	string	EFO-mapped trait label
mapped_gene	string	Nearest gene(s) mapped by ENSEMBL
reported_genes	array	Genes reported by the study authors
chromosome	string	Chromosome
chr_pos	int	Chromosomal position (GRCh38)
context	string	Variant functional context (e.g. `intron_variant`, `missense_variant`)
pubmed_id	int	PubMed ID of the source study
pub_date	string	Publication date
first_author	string	First author surname
study_accession	string	GWAS Catalog study accession (e.g. `GCST000001`)
date_added	string	Date added to the GWAS Catalog (partition key)
initial_sample	string	Discovery sample description (ancestry, size)

-- Top associations for Type 2 Diabetes
SELECT rsid, snp_risk_allele, mapped_gene, p_value, or_beta, risk_allele_freq
FROM   associations
WHERE  disease_trait ~ '(?i)type 2 diabetes'
ORDER BY pvalue_mlog DESC
LIMIT 25

-- Join GWAS hits with ClinVar pathogenic variants
SELECT g.rsid, g.disease_trait, g.mapped_gene, g.p_value,
       c.significance, c.gene_symbol
FROM   gwas.associations AS g
JOIN   clinvar.variants   AS c
  ON   g.rsid = c.dbsnp_id
  AND  c.assembly = 'GRCh38' AND c.sig_simple = 1
WHERE  g.pvalue_mlog > 50
ORDER BY g.pvalue_mlog DESC
LIMIT 20

-- Most-studied traits by association count
SELECT disease_trait, COUNT(*) AS n_associations
FROM   associations
GROUP BY disease_trait
ORDER BY n_associations DESC
LIMIT 20

ClinVar — Genetic Variant Classifications

clinvar

Table: variants Rows: 8,918,806 Assemblies: GRCh37 + GRCh38 Size: ~2.1 GB scanned

NCBI ClinVar variant classifications — pathogenicity, associated phenotypes, chromosomal coordinates, and review status for millions of human genetic variants. One row per variant × assembly (GRCh37 and GRCh38 each have their own row). Source: NCBI FTP, updated weekly.

Schema — table variants

Column	Type	Description
allele_id	int	Stable ClinVar AlleleID
variation_id	int	ClinVar VariationID
type	string	Variant type (SNV, Indel, Deletion, Duplication, …)
name	string	HGVS-like variant name
gene_symbol	string	Gene symbol (e.g. `BRCA1`)
gene_id	int	NCBI GeneID; null if intergenic
significance	string	Clinical significance (Pathogenic, Likely pathogenic, Benign, …)
sig_simple	int	-1 conflicting, 0 uncertain/other, 1 pathogenic, 2 benign, 3 risk-factor
last_evaluated	string	Date last evaluated by submitter
dbsnp_id	string	dbSNP rsID (e.g. `rs80357713`); null if none
phenotypes	array	Associated disease/phenotype names
phenotype_ids	array	MedGen/OMIM/Orphanet IDs for phenotypes
rcv_accessions	array	RCV accession numbers
origin	string	Allele origin (germline, somatic, …)
assembly	string	Reference assembly (GRCh38, GRCh37, NCBI36)
chromosome	string	Chromosome
start	int	Start position (1-based)
stop	int	Stop position (1-based)
ref_allele	string	Reference allele (or null)
alt_allele	string	Alternate allele (or null)
review_status	string	Review status (e.g. "criteria provided, multiple submitters")
n_submitters	int	Number of submitters
pos_vcf	int	VCF position
ref_vcf	string	VCF reference allele
alt_vcf	string	VCF alternate allele
other_ids	string	Other identifiers (comma-sep "type:value" pairs)
oncogenicity	string	Oncogenicity classification; null if not assessed

-- Pathogenic variants in BRCA1 (GRCh38)
SELECT allele_id, name, significance, chromosome, start, dbsnp_id
FROM   variants
WHERE  gene_symbol = 'BRCA1'
  AND  assembly    = 'GRCh38'
  AND  sig_simple  = 1
ORDER BY start
LIMIT 50

-- Variants associated with a phenotype (by name)
SELECT gene_symbol, significance, name, review_status, n_submitters
FROM   variants, UNNEST(phenotypes) AS ph
WHERE  assembly = 'GRCh38'
  AND  ph ~ '(?i)breast cancer'
  AND  sig_simple = 1
ORDER BY n_submitters DESC
LIMIT 25

-- Pathogenic variant count per gene (well-reviewed only)
SELECT gene_symbol, COUNT(*) AS pathogenic_variants
FROM   variants
WHERE  assembly      = 'GRCh38'
  AND  sig_simple    = 1
  AND  n_submitters >= 2
GROUP BY gene_symbol
ORDER BY pathogenic_variants DESC
LIMIT 20

ALFA — Allele Frequency Aggregator

alfa

Table: variants Rows: ~200M Populations: 12 Size: ~100.8 GB scanned

NCBI Allele Frequency Aggregator — allele frequencies from 1M+ dbGaP subjects across 12 population groups. GRCh38 coordinates. Partitioned by chromosome. The key advantage over gnomAD is finer African ancestry resolution: afr_am (African American), afr (continental African), and afr_oth (other African) are reported separately rather than collapsed into a single afr bucket. This split is clinically significant for variant interpretation in underrepresented populations. Source: NCBI FTP, open access. Note: chromosome uses no chr prefix (“1” not “chr1”) — same as ClinVar, opposite of gnomAD.

Schema — table variants · partitioned by chrom

Column	Type	Description
chrom	string	Chromosome (1–22, X, Y, MT — no chr prefix)
pos	int	GRCh38 position (1-based)
rsid	string	dbSNP RS identifier (e.g. `rs12345`); null if none
ref	string	Reference allele
alt	string	Alternate allele (first for multi-allelic sites)
af_tot	float	Total allele frequency (all subjects)
ac_tot / an_tot	int	Total alternate allele count / total allele count
af_eur	float	European — closest to gnomAD `af_nfe`
af_afr_am	float	African American (US) — not available in gnomAD separately
af_afr	float	Continental African (non-US) — not available in gnomAD separately
af_afr_oth	float	Other African ancestry — not available in gnomAD separately
af_eas	float	East Asian — comparable to gnomAD `af_eas`
af_oth_asn	float	Other Asian — broader than gnomAD coverage
af_asn	float	Asian (broad grouping, includes EAS + other)
af_sas	float	South Asian
af_lat1 / af_lat2	float	Two Latino sub-populations — gnomAD reports a single `af_amr`
af_oth	float	Other / unassigned
ac_* / an_*	int	Per-population allele count / allele number for each population above

-- Population frequency comparison for a known variant
SELECT rsid, ref, alt,
       af_tot, af_eur, af_afr_am, af_afr, af_eas, af_sas
FROM   variants
WHERE  chrom = '17'
  AND  rsid  = 'rs80357713'

-- Variants with large African American vs European frequency gap
-- (population-enriched variants, clinically under-studied)
SELECT rsid, chrom, pos, ref, alt,
       af_afr_am, af_eur,
       af_afr_am - af_eur AS delta
FROM   variants
WHERE  chrom     = '13'
  AND  an_afr_am >= 100
  AND  an_eur    >= 100
  AND  af_afr_am - af_eur > 0.3
ORDER BY delta DESC
LIMIT 20

DGIdb — Drug–Gene Interaction Database

dgidb

Tables: interactions, genes, drugs Interactions: 98,239 Genes: 80,234 Drugs: 81,572

Drug-Gene Interaction Database (DGIdb) — ~100K curated drug–gene interactions aggregated from 40+ sources including ChEMBL, CIViC, and FDA drug labels. MIT licensed; commercial use permitted. Useful for target druggability checks: given a gene of interest, find what drugs already modulate it (approved or investigational), enabling competitive landscape and prior art queries without a manual ChEMBL or DrugBank search. Source: dgidb.org.

Usage notes: interaction_type is NULL in most records from the DTC source — filter on approved, anti_neoplastic, and interaction_score instead. drug_concept_id uses ChEMBL/NCI-T namespaces and will not directly join FAERS or NADAC; use drug_name (plain text) as the practical join key.

Schema — table interactions (98,239 rows)

Column	Type	Description
gene_name	string	HGNC gene symbol (e.g. `EGFR`)
gene_concept_id	string	Normalized gene concept ID (HGNC namespace)
drug_name	string	Drug name (normalized); practical join key against FAERS / NADAC
drug_concept_id	string	Normalized drug concept ID (ChEMBL or NCI-T namespace)
approved	bool	FDA-approved drug
immunotherapy	bool	Immunotherapy agent
anti_neoplastic	bool	Anti-neoplastic (cancer) drug
interaction_type	string	Interaction type (inhibitor, activator, …); NULL in most DTC-source rows
interaction_score	float	DGIdb composite score; null if not computed
source_db_name	string	Source database (e.g. `ChEMBL`, `CIViC`, `DTC`)
source_db_version	string	Source database version

Schema — table genes (80,234 rows)

Column	Type	Description
gene_name	string	HGNC gene symbol
concept_id	string	Normalized concept ID
nomenclature	string	Gene nomenclature source
source_db_name	string	Source database
source_db_version	string	Source database version

Schema — table drugs (81,572 rows)

Column	Type	Description
drug_name	string	Drug name (normalized)
concept_id	string	Normalized concept ID (ChEMBL or NCI-T)
nomenclature	string	Drug nomenclature source
approved	bool	FDA-approved
immunotherapy	bool	Immunotherapy agent
anti_neoplastic	bool	Anti-neoplastic (cancer) drug
source_db_name	string	Source database
source_db_version	string	Source database version

-- Approved drugs targeting a gene (prior art / competitor check)
SELECT drug_name, interaction_type, interaction_score, source_db_name
FROM   interactions
WHERE  gene_name = 'EGFR'
  AND  approved  = TRUE
ORDER BY interaction_score DESC
LIMIT 20

-- Anti-cancer drugs for a gene, with adverse-event cross-check via FAERS
-- Step 1: get approved anti-neoplastic drugs targeting the gene
SELECT drug_name
FROM   dgidb.interactions
WHERE  gene_name       = 'BRAF'
  AND  approved        = TRUE
  AND  anti_neoplastic = TRUE

-- Step 2: feed drug_name list into FAERS for adverse event profile
SELECT reaction_meddra_pt, COUNT(*) AS reports
FROM   fda.faers
WHERE  drug_name IN ('VEMURAFENIB', 'DABRAFENIB', 'ENCORAFENIB')
GROUP BY reaction_meddra_pt
ORDER BY reports DESC
LIMIT 15

Sanctions — U.S. Consolidated Screening List

sanctions

Table: entities Rows: 25,447 Lists: OFAC SDN + BIS + State Dept Size: ~8 MB scanned

The U.S. government Consolidated Screening List (CSL) — a unified export of OFAC SDN (Treasury), BIS Entity List (Commerce), State Dept Nonproliferation, ITAR Debarred, and several other sanctions and export-control lists. One row per sanctioned individual, entity, vessel, or aircraft. Source: trade.gov, updated daily.

Schema — table entities

Column	Type	Description
id	string	Stable hash ID from trade.gov
source	string	Source list (e.g. "Specially Designated Nationals (SDN) - Treasury Department")
name	string	Primary name
type	string	Individual, Entity, Vessel, or Aircraft (may be empty)
programs	array	Sanctions program codes (e.g. `["UKRAINE-EO13685", "RUSSIA-EO14024"]`)
alt_names	array	Aliases and alternate name spellings
addresses	array	Known addresses
ids	array	Identity documents: passport, tax ID, registration (format: "type, value, country")
nationalities	array	Nationality countries
dobs	array	Dates of birth (YYYY or YYYY-MM-DD)
pobs	array	Places of birth
start_date	string	Listing date (YYYY-MM-DD)
license_requirement	string	BIS license requirement (BIS entries only)
remarks	string	Additional remarks
source_list_url	string	URL to the source list page

-- Look up an entity by name (partial match)
SELECT name, type, source, programs, addresses
FROM   entities
WHERE  name ~ '(?i)vtb bank'
LIMIT 10

-- All individuals sanctioned under Russia programs
SELECT name, type, start_date, alt_names
FROM   entities, UNNEST(programs) AS prog
WHERE  prog  ~ '(?i)russia'
  AND  type = 'Individual'
ORDER BY start_date DESC
LIMIT 25

-- Count entities per source list
SELECT source, COUNT(*) AS total
FROM   entities
GROUP BY source
ORDER BY total DESC

MalwareBazaar — Malware Sample Repository

malwarebazaar

Table: samples Rows: 984,697 Partitions: 68 months (2020-02–2025-09) Size: ~0.4 GB scanned

abuse.ch MalwareBazaar — crowdsourced malware sample metadata. One row per submitted sample. Includes file hashes, family signatures, VirusTotal detection rates, and fuzzy hashes for similarity matching. License: CC0. Partition key: first_seen_utc.

Schema — table samples

Column	Type	Description
sha256_hash	string	SHA-256 hash (primary identifier)
md5_hash	string	MD5 hash
sha1_hash	string	SHA-1 hash
file_name	string	Original submitted file name
file_type_guess	string	Guessed file type (e.g. `Win32 EXE`, `PDF Document`, `ZIP`)
mime_type	string	MIME type
first_seen_utc	timestamp	Date first submitted to MalwareBazaar (partition key)
reporter	string	Reporting researcher or automated feed
signature	string	Malware family name / AV detection name
vtpercent	number	VirusTotal detection rate 0–100 (null if not scanned)
clamav	string	ClamAV signature name
imphash	string	PE import hash (Windows executables)
ssdeep	string	ssdeep fuzzy hash for similarity clustering
tlsh	string	TLSH fuzzy hash

-- All samples for a malware family
SELECT sha256_hash, file_name, file_type_guess, first_seen_utc, vtpercent
FROM   samples
WHERE  signature ~ '(?i)emotet'
ORDER BY first_seen_utc DESC
LIMIT 25

-- Most prevalent malware families (last 6 months)
SELECT signature, COUNT(*) AS samples, AVG(vtpercent) AS avg_vt
FROM   samples
WHERE  first_seen_utc >= '2025-03-01'
  AND  signature != ''
GROUP BY signature
ORDER BY samples DESC
LIMIT 20

-- Lookup by hash (any of the three types)
SELECT sha256_hash, file_name, signature, first_seen_utc, vtpercent
FROM   samples
WHERE  md5_hash = 'd41d8cd98f00b204e9800998ecf8427e'
LIMIT 5

Open Food Facts — Global Food Product Database

openfoodfacts

Table: products Rows: 556,027 Partitions: 7 years (2012–2017) Size: ~0.4 GB scanned

Open Food Facts 2019 Kaggle snapshot (DbCL 1.0) — crowdsourced nutritional data for food products sold globally. One row per product. Includes ingredients, allergens, Nutri-Score grades, and 12 key nutrients per 100g. Partition key: created.

Schema — table products

Column	Type	Description
code	string	Barcode (EAN/UPC)
product_name	string	Product name
brands	string	Brand names (comma-separated)
categories	string	Product categories (comma-separated)
main_category	string	Primary category
countries	string	Countries where sold (comma-separated)
ingredients	string	Ingredients text
allergens	string	Declared allergens
traces	string	May-contain-traces declarations
additives_n	int	Count of food additives
nutrition_grade	string	Nutri-Score grade (a–e; a = healthiest)
energy_100g	number	Energy per 100g (kcal)
fat_100g	number	Total fat per 100g (g)
sat_fat_100g	number	Saturated fat per 100g (g)
carbs_100g	number	Carbohydrates per 100g (g)
sugars_100g	number	Sugars per 100g (g)
fiber_100g	number	Dietary fiber per 100g (g)
proteins_100g	number	Protein per 100g (g)
salt_100g	number	Salt per 100g (g)
created	timestamp	Date product was added (partition key)

-- Lowest-sugar breakfast cereals with Nutri-Score A or B
SELECT product_name, brands, sugars_100g, nutrition_grade, countries
FROM   products
WHERE  categories ~ '(?i)cereal'
  AND  nutrition_grade IN ('a', 'b')
  AND  sugars_100g IS NOT MISSING
ORDER BY sugars_100g
LIMIT 20

-- Average macros and Nutri-Score by food category
SELECT main_category,
       COUNT(*) AS products,
       AVG(energy_100g) AS avg_kcal,
       AVG(sugars_100g) AS avg_sugars,
       AVG(proteins_100g) AS avg_protein
FROM   products
WHERE  main_category != ''
GROUP BY main_category
ORDER BY products DESC
LIMIT 20

Hacker News — Y Combinator Tech Community

hackernews

Table: items Rows: 4.6M Partitions: 20 years (2007–2026) License: CC0 Size: ~0.7 GB scanned

Every story, Ask HN, Show HN, and job post from Hacker News since 2007. One row per item with score, author, linked URL, and descendant comment count. Partitioned by year. Useful for tech trend analysis, startup signal detection, and tracking how developer attention has shifted over time. Source: BigQuery bigquery-public-data.hacker_news (CC0).

Schema — table items

Column	Type	Description
id	string	Item ID
by	string	Author username
title	string	Story or post title
url	string	Linked URL (null for Ask/Show HN text posts)
score	int	Upvote score
time	int	Unix timestamp of submission
descendants	int	Total comment count
year	int	Submission year (partition key)

-- Top stories of 2024 by score
SELECT title, url, score, descendants
FROM   items
WHERE  year = 2024
ORDER BY score DESC
LIMIT 20

-- Show HN posts per year — measure of builder activity over time
SELECT year,
       COUNT(*) AS show_hn_posts,
       AVG(score) AS avg_score
FROM   items
WHERE  title ~ '(?i)^show hn'
GROUP BY year
ORDER BY year DESC

NADAC — National Average Drug Acquisition Cost

nadac

Table: prices Rows: 18,172,629 Coverage: 2013–2026 (weekly) Size: ~0.8 GB scanned

CMS weekly drug acquisition cost survey — the price pharmacies actually pay for drugs. One row per NDC code per weekly report. Covers generic and brand-name drugs, retail and non-retail pharmacies. Pairs with FDA FAERS and Orange Book. Source: data.medicaid.gov, public domain. Partition key: effective_date.

Schema — table prices

Column	Type	Description
ndc	string	National Drug Code (11-digit)
ndc_description	string	Drug name, strength, and dosage form
nadac_per_unit	string	Acquisition cost per unit (tablet, mL, gram, etc.)
pricing_unit	string	Unit type: `TAB`, `ML`, `GM`, `EA`
classification	string	`Generic` or `Brand Name`
otc	bool	Over-the-counter (`true`) or prescription (`false`)
pharmacy_type	string	`Retail` or `Non-Retail`
as_of_date	string	Date of the weekly survey report
effective_date	string	Price effective date (partition key)
explanation_code	string	Reason for price change or data source code
corresponding_generic_nadac	string	For brand drugs: comparable generic NADAC price

-- Current acquisition cost for metformin generics
SELECT ndc_description, nadac_per_unit, pricing_unit, as_of_date
FROM   prices
WHERE  ndc_description ~ '(?i)metformin'
  AND  classification = 'Generic'
  AND  as_of_date >= '2025-01-01'
ORDER BY as_of_date DESC, ndc_description
LIMIT 20

-- Brand vs generic price gap for a drug class
SELECT classification,
       AVG(CAST(nadac_per_unit AS float)) AS avg_cost,
       COUNT(DISTINCT ndc) AS ndcs
FROM   prices
WHERE  ndc_description ~ '(?i)atorvastatin'
  AND  as_of_date >= '2025-01-01'
GROUP BY classification

World Bank — Global Commodity Prices

worldbank

Table: commodity_prices Rows: 795 Coverage: 1960–2026 (monthly) License: CC BY 4.0

World Bank Pink Sheet — monthly prices for ~70 global commodities: energy, metals, agriculture, and fertilizers. One row per month, one column per commodity. All prices are nominal USD. Source: World Bank, CC BY 4.0. Partition key: year.

Schema — table commodity_prices (selected columns)

Column	Type	Description
period	string	Month label in `YYYY-Mmm` format (e.g. `2024-Jan`)
year	int	Year (partition key)
month	int	Month number (1–12)
crude_oil_brent	number	Brent crude oil ($/bbl)
crude_oil_wti	number	WTI crude oil ($/bbl)
natural_gas_us	number	US natural gas ($/mmbtu)
natural_gas_europe	number	European natural gas ($/mmbtu)
coal_australian	number	Australian thermal coal ($/mt)
gold	number	Gold ($/troy oz)
copper	number	Copper ($/mt)
aluminum	number	Aluminum ($/mt)
maize	number	Maize / corn ($/mt)
wheat_us_hrw	number	US hard red winter wheat ($/mt)
soybeans	number	Soybeans ($/mt)
sugar_world	number	World sugar (¢/kg)
coffee_arabica	number	Arabica coffee (¢/kg)
cocoa	number	Cocoa ($/mt)

Plus ~50 additional commodity columns. Use GET /v1/databases for the full list.

-- Brent crude and natural gas price history since 2020
SELECT period, crude_oil_brent, crude_oil_wti, natural_gas_us
FROM   commodity_prices
WHERE  year >= 2020
ORDER BY year, month

-- Gold vs copper ratio over time (risk-off indicator)
SELECT period, gold, copper,
       gold / NULLIF(copper, 0) AS gold_copper_ratio
FROM   commodity_prices
WHERE  year >= 2010
ORDER BY year, month

FHFA — House Price Index

fhfa

Table: hpi Rows: 133,216 Coverage: 1975–2026 Levels: USA · state · MSA

Federal Housing Finance Agency (FHFA) House Price Index — quarterly and monthly home price indices for the U.S., all 50 states, and ~400 metropolitan statistical areas (MSAs). Covers purchase-only and all-transactions variants, seasonally adjusted and not. Source: FHFA, public domain. Partition key: yr.

Schema — table hpi

Column	Type	Description
yr	int	Year (partition key)
period	int	Quarter (1–4) or month (1–12) depending on `frequency`
frequency	string	`quarterly` or `monthly`
hpi_type	string	`purchase-only` or `all-transactions`
hpi_flavor	string	`traditional` or `expanded-data`
level	string	Geographic level: `USA`, `state`, `MSA`
place_id	string	FIPS code or state abbreviation
place_name	string	State or MSA name (e.g. `San Francisco-Oakland-Fremont, CA`)
index_nsa	number	Not-seasonally-adjusted HPI (base = 100 at 1991 Q1)
index_sa	number	Seasonally-adjusted HPI (null for MSA-level series)

-- National quarterly HPI since 2000 (purchase-only, seasonally adjusted)
SELECT yr, period, index_sa, index_nsa
FROM   hpi
WHERE  level      = 'USA'
  AND  hpi_type  = 'purchase-only'
  AND  frequency = 'quarterly'
  AND  yr >= 2000
ORDER BY yr, period

-- State-level HPI comparison (most recent quarter)
SELECT place_name, index_nsa, index_sa
FROM   hpi
WHERE  level      = 'state'
  AND  hpi_type  = 'purchase-only'
  AND  frequency = 'quarterly'
  AND  yr = 2025 AND period = 4
ORDER BY index_nsa DESC

ClinPGx — Pharmacogenomics Knowledge Base

clinpgx

Tables: 4 License: CC BY-SA 4.0

Curated pharmacogenomics data from ClinPGx (formerly PharmGKB) — gene–drug interactions, clinical annotations, dosing guidelines (CPIC, DPWG), and FDA/EMA drug label PGx requirements. Four tables: clinical variants, guidelines, drug labels, and relationships.

Schema — table clinical_variants (~5,200 rows)

Column	Type	Description
variant	string	Variant name or star allele (e.g. `CYP2C9*3`, `rs1801131`)
gene	string	Gene symbol (e.g. `CYP2C9`)
type	string	Annotation type (`Metabolism/PK`, `Toxicity`, `Efficacy`, `Dosage`)
level_of_evidence	string	Evidence level (`1A`, `1B`, `2A`, `2B`, `3`, `4`)
chemicals	array	Associated drug names
phenotypes	array	Associated phenotype names

-- Highest-evidence (1A) variant–drug pairs
SELECT variant, gene, type, chemicals, phenotypes
FROM   clinical_variants
WHERE  level_of_evidence = '1A'
ORDER BY gene

Schema — table guidelines (~217 rows)

Column	Type	Description
id	string	PharmGKB accession ID
name	string	Guideline annotation name
source	string	Issuing body (`CPIC`, `DPWG`, `FDA`, `HCSC`, `EMA`)
recommendation	bool	Includes a dosing recommendation
dosing_information	bool	Includes dosing information
has_testing_info	bool	Includes genetic testing guidance
alternate_drug_available	bool	Alternative drug available
chemicals	array	Involved drug names
genes	array	Involved gene symbols
alleles	array	Specific alleles referenced
pmids	array	Supporting PubMed IDs

-- All CPIC guidelines with dosing recommendations
SELECT name, chemicals, genes, alleles
FROM   guidelines
WHERE  source = 'CPIC'
  AND  recommendation = true
ORDER BY name

Schema — table drug_labels (~1,400 rows)

Column	Type	Description
id	string	PharmGKB ID
name	string	Annotation name
source	string	Regulatory body (`FDA`, `EMA`, `HCSC`)
testing_level	string	`Testing Required`, `Actionable PGx`, `Informative PGx`
has_prescribing_info	bool	Label includes prescribing guidance
has_dosing_info	bool	Label includes dosing guidance
has_alternate_drug	bool	Alternate drug recommended
chemicals	array	Drug names
genes	array	Gene symbols
variants	array	Relevant variants or haplotypes
latest_history_date	string	Most recent annotation update (YYYY-MM-DD)

-- FDA labels requiring genetic testing
SELECT name, chemicals, genes, testing_level
FROM   drug_labels
WHERE  source = 'FDA'
  AND  testing_level = 'Testing Required'
ORDER BY latest_history_date DESC
LIMIT 20

Schema — table relationships (~127,700 rows)

Column	Type	Description
entity1_id	string	PharmGKB ID of first entity
entity1_name	string	Name of first entity
entity1_type	string	`Gene`, `Chemical`, `Disease`, `Variant`, ...
entity2_id	string	PharmGKB ID of second entity
entity2_name	string	Name of second entity
entity2_type	string	Type of second entity
evidence	array	Evidence sources (`ClinicalAnnotation`, `VariantAnnotation`, ...)
association	string	`associated`, `not associated`, `ambiguous`
pk	bool	Pharmacokinetic relationship
pd	bool	Pharmacodynamic relationship
pmids	array	Supporting PubMed IDs

-- All confirmed gene–drug associations for CYP2D6
SELECT entity2_name AS drug, association, evidence, pmids
FROM   relationships
WHERE  entity1_name = 'CYP2D6'
  AND  entity1_type = 'Gene'
  AND  entity2_type = 'Chemical'
  AND  association  = 'associated'
ORDER BY entity2_name

Reference tables: genes (~25,000 rows) — NCBI Gene ID, Ensembl ID, GRCh37/GRCh38 coordinates, VIP and CPIC flags. drugs (~3,700 rows) — SMILES, InChI, RxNorm/ATC identifiers, annotation counts. variants (~7,600 rows) — chromosomal location, synonyms, annotation counts. phenotypes (~1,600 rows) — alternate names, MeSH/OMIM cross-references.

Illumina — SNP Array Manifests

illumina

Tables: 2 Build: GRCh38

Probe-level manifests for two Illumina SNP genotyping arrays. One row per assay locus — variant coordinates, probe sequence, alleles, strand orientation, and source annotation. Useful for pharmacogenomics variant lookup, array coverage comparison, and probe-sequence retrieval.

Table	Array	Loci	Decompressed
gda_pgx	GDA+ePGx v1	1,933,117	800 MiB
gsa_pgx	GSA-PGx v4	683,054	274 MiB

Both tables share the same schema.

Schema — tables gda_pgx / gsa_pgx

Column	Type	Description
ilmn_id	string	Illumina internal probe ID
name	string	Variant name (position-based or rsID)
ilmn_strand	string	Probe strand relative to TOP/BOT designation
alleles	array	Two alleles, e.g. `["A","C"]`
address_a	string	Bead address for allele A
probe_seq_a	string	50-mer probe sequence for allele A
address_b	string	Bead address for allele B (two-colour probes only)
probe_seq_b	string	50-mer probe sequence for allele B (optional)
chr	string	Chromosome
position	int	GRCh38 genomic position (1-based)
source	string	Variant source (`PAGE`, `1000genomes`, `CNV`, ...)
source_strand	string	Source strand (`TOP` / `BOT`)
source_seq	string	~120 bp flanking sequence with allele in brackets
top_genomic_seq	string	TOP-strand genomic context with allele in brackets
ref_strand	string	Strand relative to reference (`+` / `-`)
bead_set_id	int	Bead set identifier
exp_clusters	int	Expected cluster count (2 = mono, 3 = biallelic)
intensity_only	bool	Intensity-only probe (no genotype call)

-- All probes on chromosome 22 in gda_pgx
SELECT name, position, alleles, source, ref_strand
FROM   gda_pgx
WHERE  chr = '22'
ORDER BY position
LIMIT 20

-- CNV probes in gsa_pgx
SELECT name, chr, position, probe_seq_a
FROM   gsa_pgx
WHERE  source = 'CNV'
ORDER BY chr, position
LIMIT 20