Datasets

Instant access to datasets — no pipeline, no API key negotiation, no rate limits. Query with SQL, pay fractions of a cent per call.

NVD — National Vulnerability Database

nvd
Table: cve Rows: ~333,000 Size: ~720 MB scanned

Every CVE published by NIST, including CVSS scores, severity ratings, CWE weaknesses, and advisory references. Updated periodically from the NVD API v2.0.

Schema — table cve
ColumnTypeDescription
idstringCVE identifier (e.g. CVE-2024-12345)
publishedstringPublication date
lastModifiedstringLast modification date
vulnStatusstringAnalyzed, Modified, Awaiting Analysis
descriptions[0].valuestringEnglish description
metrics.cvssMetricV31[0]
.cvssData.baseScore
floatCVSS v3.1 base score (0–10)
metrics.cvssMetricV31[0]
.cvssData.baseSeverity
stringLOW / MEDIUM / HIGH / CRITICAL
metrics.cvssMetricV31[0]
.cvssData.vectorString
stringFull CVSS vector string
weaknesses[0].description[0].valuestringPrimary CWE identifier
referencesarrayAdvisory URLs and sources

Note: Timestamp columns require backtick-quoted ISO 8601 literals for range comparisons — single-quoted strings return zero rows. Example: `2024-01-01T00:00:00Z`

-- Critical CVEs published in 2024
SELECT id, published,
       metrics.cvssMetricV31[0].cvssData.baseScore AS score,
       descriptions[0].value AS description
FROM   cve
WHERE  metrics.cvssMetricV31[0].cvssData.baseSeverity = 'CRITICAL'
  AND  published >= `2024-01-01T00:00:00Z`
ORDER BY score DESC
LIMIT 20

OSV — Open Source Vulnerabilities

osv
Table: advisories Rows: ~4,140,000 Size: ~420 MB scanned

OSV.dev advisories flattened to one row per (advisory, affected package). Covers PyPI, npm, Go, Maven, Rust, Debian, Alpine, and more. Superset of NVD — includes ecosystem-specific advisories (GHSA, PYSEC, RUSTSEC) with precise package-version ranges for direct dependency scanning.

Schema — table advisories
ColumnTypeDescription
idstringAdvisory ID: CVE-..., GHSA-..., PYSEC-...
publishedstringFirst published date
modifiedstringLast modified date
withdrawnstringWithdrawal date if retracted (optional)
aliasesarrayRelated IDs (e.g. CVE alias for a GHSA)
summarystringShort description
severity_typestringCVSS type (e.g. CVSS_V3)
severity_scorestringFull CVSS vector string
cwe_idsarrayCWE weakness identifiers
ecosystemstringPyPI, npm, Go, Maven, crates.io, ...
package_namestringPackage name within the ecosystem
purlstringPackage URL (optional)
introducedstringFirst affected version (0 = all versions before fixed)
fixedstringFirst fixed version; absent if not yet fixed
-- All advisories for a specific npm package
SELECT id, introduced, fixed, summary
FROM   advisories
WHERE  ecosystem = 'npm' AND package_name = 'lodash'

-- Unfixed critical advisories in PyPI (CVSS network-reachable, high impact)
SELECT id, package_name, summary
FROM   advisories
WHERE  ecosystem = 'PyPI'
  AND  severity_score ~ 'AV:N.*C:H'
  AND  fixed IS MISSING
LIMIT 50

-- Advisory counts by ecosystem
SELECT ecosystem, COUNT(*) AS n
FROM   advisories
GROUP BY ecosystem
ORDER BY n DESC

SEC EDGAR — Financial Facts

sec
Table: edgar Rows: ~97,800,000 Companies: ~8,900 Size: ~6.2 GB scanned

Structured XBRL financial facts from SEC filings (10-K, 10-Q, 8-K). One row per reported fact — every line item a public company files: revenue, net income, assets, shares outstanding, and thousands of other GAAP/IFRS concepts.

Note: The end column (period end date) is a reserved keyword in Sneller SQL. Do not use it in SELECT or ORDER BY — filter by filed or fy instead.

Schema — table edgar
ColumnTypeDescription
cikstringSEC Central Index Key
tickerstringStock ticker symbol
companystringCompany name
taxonomystringus-gaap, ifrs-full, dei, ...
conceptstringXBRL concept (e.g. Revenues, NetIncomeLoss)
labelstringHuman-readable label for the concept
unitstringUSD, shares, pure, ...
endstringPeriod end date (reserved keyword — avoid in SELECT)
valfloatReported value
formstringFiling form: 10-K, 10-Q, 8-K, ...
fystringFiscal year
fpstringFiscal period: FY, Q1, Q2, Q3
filedstringFiling date
accnstringSEC accession number
-- Annual revenue for a company across fiscal years
SELECT ticker, fy, fp, val AS revenue, filed
FROM   edgar
WHERE  ticker = 'AAPL'
  AND  concept = 'RevenueFromContractWithCustomerExcludingAssessedTax'
  AND  form = '10-K'
ORDER BY fy DESC
LIMIT 10

-- Compare net income across S&P 500 for latest fiscal year
SELECT ticker, company, val AS net_income, fy
FROM   edgar
WHERE  concept = 'NetIncomeLoss'
  AND  form = '10-K' AND fy = 2024 AND unit = 'USD'
ORDER BY net_income DESC
LIMIT 20

-- Balance sheet: total assets vs total liabilities for one company
SELECT ticker, fy, fp, concept, val, filed
FROM   edgar
WHERE  ticker = 'MSFT'
  AND  concept IN ('Assets', 'Liabilities')
  AND  form = '10-K'
ORDER BY fy DESC
LIMIT 10

-- Discover which concepts a company has reported (useful before writing queries)
SELECT concept, label, unit, COUNT(*) AS filings
FROM   edgar
WHERE  ticker = 'TSLA'
GROUP BY concept, label, unit
ORDER BY filings DESC
LIMIT 50

-- Quarterly earnings per share trend
SELECT ticker, fy, fp, val AS eps, filed
FROM   edgar
WHERE  ticker = 'NVDA'
  AND  concept = 'EarningsPerShareBasic'
  AND  form = '10-Q'
ORDER BY filed DESC
LIMIT 12

Ethereum — On-chain Data

eth
Tables: transfers, blocks, contracts, transactions, dex_swaps, lending, lp_events, mev History: 2015 – present Size: transfers ~70 GB · blocks ~770 MB · contracts ~9.6 GB · transactions ~136 GB · dex_swaps ~17 GB

On-chain Ethereum data sourced from the AWS Public Blockchain dataset. All tables are partitioned by date with native block_timestamp indexing — queries target only the relevant daily partitions; no full-table scans needed for time-range queries.

Table transfers — decoded ERC-20 / ERC-721 Transfer events  ·  ~70 GB scanned (1 year ingested)
ColumnTypeDescription
block_numberintBlock height
block_timestampstringBlock timestamp (ISO 8601)
transaction_hashstringTransaction hash
log_indexintLog position within the transaction
token_addressstringContract address of the token
from_addressstringSender address
to_addressstringRecipient address
valuestringTransfer amount (raw; token decimals vary)
-- Recent large USDC transfers (USDC = 6 decimals, $1M = 1e12)
SELECT transaction_hash, from_address, to_address, "value"
FROM   transfers
WHERE  token_address = '0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48'
  AND  CAST("value" AS FLOAT) > 1000000000000
  AND  block_timestamp >= '2025-01-01'
LIMIT 20
Table blocks — one row per block (header, gas, base fee)  ·  ~770 MB scanned (365 days ingested)
ColumnTypeDescription
numberintBlock height
timestampstringBlock timestamp (ISO 8601)
hashstringBlock hash
minerstringValidator / miner address
gas_usedintGas consumed by all transactions
gas_limitintBlock gas limit
base_fee_per_gasintEIP-1559 base fee (wei); null pre-London
transaction_countintNumber of transactions in the block
difficultyintBlock difficulty (0 post-Merge)
sizeintBlock size in bytes
-- Average gas price and block utilisation over the last 7 days
SELECT DATE(timestamp) AS day,
       AVG(base_fee_per_gas) / 1e9 AS avg_base_fee_gwei,
       AVG(gas_used * 100.0 / gas_limit) AS avg_utilisation_pct
FROM   blocks
WHERE  timestamp >= '2026-04-06'
GROUP BY day ORDER BY day
Table contracts — one row per contract deployment  ·  full history 2015 – present, ~9.6 GB scanned
ColumnTypeDescription
addressstringDeployed contract address
deployer_addressstringAddress that deployed the contract
block_timestampstringDeployment timestamp (ISO 8601)
block_numberintBlock height of deployment
transaction_hashstringDeployment transaction hash
is_erc20boolDetected ERC-20 token contract
is_erc721boolDetected ERC-721 NFT contract
function_sighashesarray4-byte function selectors present in bytecode
-- Lookup a contract: who deployed it and when?
SELECT deployer_address, block_timestamp, transaction_hash,
       is_erc20, is_erc721
FROM   contracts
WHERE  address = '0xdac17f958d2ee523a2206206994597c13d831ec7'
Table transactions — one row per transaction (native ETH transfers, gas spend, contract calls)  ·  ~136 GB, last 365 days ingested
ColumnTypeDescription
transaction_hashstringTransaction hash (unique identifier)
block_numberintBlock height
block_timestampstringBlock timestamp (ISO 8601)
transaction_indexintPosition of tx within the block
from_addressstringSender address
to_addressstringRecipient address (null for contract creation)
valueintNative ETH transferred (wei; 1 ETH = 1e18)
gasintGas limit set by sender
gas_priceintGas price (wei/gas; for legacy type 0/1 txns)
nonceintSender nonce
transaction_typeint0=legacy, 1=access-list, 2=EIP-1559
max_fee_per_gasintEIP-1559 max total fee (wei/gas)
max_priority_fee_per_gasintEIP-1559 tip to validator (wei/gas)
receipt_statusint1=success, 0=reverted
receipt_gas_usedintActual gas consumed
receipt_effective_gas_priceintPrice actually paid (base fee + tip, wei/gas)
receipt_contract_addressstringDeployed contract address; null if not a creation
receipt_cumulative_gas_usedintCumulative gas used in block up to this tx
-- Top 10 gas spenders yesterday (total ETH paid in fees)
SELECT from_address,
       COUNT(*) AS tx_count,
       SUM(receipt_gas_used * receipt_effective_gas_price) / 1e18 AS eth_fees
FROM   transactions
WHERE  block_timestamp >= '2026-04-12'
  AND  block_timestamp <  '2026-04-13'
GROUP BY from_address
ORDER BY eth_fees DESC
LIMIT 10
Table dex_swaps — Uniswap v2 & v3 Swap events  ·  365 daily partitions
ColumnTypeDescription
pool_addressstringUniswap pool contract address
transaction_hashstringTransaction hash
log_indexintLog position within the transaction
block_numberintBlock height
block_timestampstringBlock timestamp (ISO 8601)
protocolstringv2 or v3
senderstringSwap initiator address
recipientstringSwap recipient address
amount0intv3: signed token0 delta (negative = token0 left pool)
amount1intv3: signed token1 delta; null for v2
amount0_in / amount0_outintv2: unsigned token0 in/out; null for v3
amount1_in / amount1_outintv2: unsigned token1 in/out; null for v3
sqrt_price_x96intv3: post-swap sqrt(price) × 296; null for v2
liquidityintv3: active liquidity at time of swap; null for v2
tickintv3: current tick after swap; null for v2
-- Daily swap count and net USDC flow on the USDC/WETH 0.05% pool
-- amount0 is in USDC raw units (6 dec); positive = USDC into pool (ETH bought)
SELECT partition_date,
       COUNT(*) AS swap_count,
       SUM(amount0) / 1e6 AS net_usdc_flow
FROM   dex_swaps
WHERE  pool_address = '0x88e6a0c2ddd26feeb64f039a2c41296fcb3f5640'
  AND  block_timestamp >= '2026-04-07'
  AND  block_timestamp <  '2026-04-14'
GROUP BY partition_date ORDER BY partition_date LIMIT 10
Table lending — Aave v2/v3 and Compound v3 lending events  ·  365 daily partitions
ColumnTypeDescription
contract_addressstringLending pool contract address
transaction_hashstringTransaction hash
log_indexintLog position within the transaction
block_numberintBlock height
block_timestampstringBlock timestamp (ISO 8601)
protocolstringaave_v2, aave_v3, or compound_v3
event_typestringsupply, withdraw, borrow, repay, liquidation, supply_collateral, withdraw_collateral
assetstringToken contract address (reserve asset)
userstringUser address
on_behalf_ofstringBeneficiary address (may differ from user)
amountintRaw token amount (apply asset decimals)
borrow_rate_modeintAave borrow rate mode: 1=stable, 2=variable; null for other events
collateral_assetstringCollateral token address; non-null for liquidation events only
collateral_amountintCollateral seized; non-null for liquidation events only
-- Daily USDC supply volume by protocol (Aave v3, Compound v3)
SELECT partition_date,
       protocol,
       COUNT(*) AS supply_events,
       SUM(amount) / 1e6 AS volume_usdc
FROM   lending
WHERE  event_type = 'supply'
  AND  asset = '0xa0b86991c6218b36c1d19d4a2e9eb0ce3606eb48'
  AND  block_timestamp >= '2026-04-07'
  AND  block_timestamp <  '2026-04-14'
GROUP BY partition_date, protocol ORDER BY partition_date LIMIT 20
Table lp_events — Uniswap v3 Mint and Burn (liquidity provision) events  ·  365 daily partitions
ColumnTypeDescription
pool_addressstringUniswap v3 pool contract address
ownerstringAddress that owns the position
tick_lowerintLower bound of the price range (tick)
tick_upperintUpper bound of the price range (tick)
transaction_hashstringTransaction hash
log_indexintLog position within the transaction
block_numberintBlock height
block_timestampstringBlock timestamp (ISO 8601)
event_typestringmint (add liquidity) or burn (remove liquidity)
senderstringCaller address for mint; null for burn
amountintLiquidity units added or removed
amount0intToken0 deposited (mint) or withdrawn (burn)
amount1intToken1 deposited (mint) or withdrawn (burn)
-- Net liquidity minted vs burned in USDC/WETH 0.05% pool (last 7 days)
SELECT event_type,
       COUNT(*) AS events,
       SUM(amount0) / 1e6  AS usdc_total,
       SUM(amount1) / 1e18 AS weth_total
FROM   lp_events
WHERE  pool_address = '0x88e6a0c2ddd26feeb64f039a2c41296fcb3f5640'
  AND  block_timestamp >= '2026-04-07'
GROUP BY event_type
Table mev — MEV-boost relay payloads (block auction data)  ·  365 daily partitions  ·  ~2.2M rows  ·  ~0.6 GB scanned

Note: Sourced from the ultra_sound relay API. The date partition field is stored as a datetime internally — use backtick ISO 8601 literals for date range filters.

ColumnTypeDescription
block_numberintBlock height
block_hashstringBlock hash
slotintBeacon chain slot number
timestampintUnix timestamp of the block
datedatePartition key (stored as datetime internally)
relaystringRelay name (e.g. ultra_sound, flashbots, bloxroute_max)
proposer_fee_recipientstringValidator fee recipient address
builder_pubkeystringBlock builder BLS public key
gas_usedintGas consumed by the block
gas_limitintBlock gas limit
num_txsintNumber of transactions in the block
value_weiintBlock value paid to proposer (wei)
value_ethfloatBlock value paid to proposer (ETH)
-- Relay market share — block count and MEV earned (last 30 days)
SELECT relay,
       COUNT(*) AS blocks,
       TRUNC(AVG(value_eth)*1000000)/1000000  AS avg_mev_eth,
       TRUNC(SUM(value_eth)*10000)/10000      AS total_mev_eth
FROM   mev
WHERE  date >= `2026-03-16T00:00:00Z`
GROUP BY relay
ORDER BY blocks DESC
LIMIT 10

-- Top 10 highest-value MEV blocks all time
SELECT date, block_number, relay, value_eth, num_txs
FROM   mev
ORDER BY value_eth DESC
LIMIT 10

Base — On-chain Data

base
Tables: transfers, blocks, transactions History: 30 days (~7-day publishing lag) Size: transfers ~70 GB · blocks ~0.6 GB · transactions ~67 GB

On-chain data for Base, Coinbase's OP-Stack L2 rollup, sourced from the AWS Public Blockchain dataset. All tables are partitioned by partition_date (one partition per calendar day). Source data has a ~7-day publishing lag. Transactions include L2-specific fee fields (l1_fee, l2_fee, l1_gas_price) not present on Ethereum mainnet.

Table transfers — decoded ERC-20 Transfer events  ·  ~70 GB / 30 days
ColumnTypeDescription
block_numberintBlock height
block_timestampstringBlock timestamp (ISO 8601)
transaction_hashstringTransaction hash
log_indexintLog position within the transaction
token_addressstringContract address of the token
from_addressstringSender address (42-char, 0x-prefixed)
to_addressstringRecipient address
valueintTransfer amount (raw, token decimals vary)
partition_datestringPartition key (YYYY-MM-DD)
-- Recent USDC transfers on Base over $50 (USDC = 6 decimals)
-- Update partition_date to a recent date within the last 30 days
SELECT transaction_hash, from_address, to_address, "value"
FROM   transfers
WHERE  token_address = '0x833589fcd6edb6e08f4c7c32d4f71b54bda02913'
  AND  CAST("value" AS FLOAT) > 50000000
  AND  partition_date IN ('2026-04-12', '2026-04-13')
LIMIT 20
Table blocks — one row per block  ·  ~0.6 GB / 30 days
ColumnTypeDescription
numberintBlock height
timestampstringBlock timestamp (ISO 8601)
hashstringBlock hash
minerstringSequencer / fee recipient address
gas_usedintGas consumed by all transactions
gas_limitintBlock gas limit
base_fee_per_gasintEIP-1559 base fee (wei)
transaction_countintNumber of transactions in the block
sizeintBlock size in bytes
partition_datestringPartition key (YYYY-MM-DD)
-- Average base fee and block utilisation over the last 7 days on Base
SELECT DATE(timestamp) AS day,
       AVG(base_fee_per_gas) / 1e9 AS avg_base_fee_gwei,
       AVG(gas_used * 100.0 / gas_limit) AS avg_utilisation_pct
FROM   blocks
WHERE  partition_date >= '2026-04-06'
GROUP BY day ORDER BY day
Table transactions — one row per transaction, with L2 fee fields  ·  ~67 GB / 30 days
ColumnTypeDescription
transaction_hashstringTransaction hash
block_numberintBlock height
block_timestampstringBlock timestamp (ISO 8601)
from_addressstringSender address
to_addressstringRecipient address (null for contract creation)
valueintNative ETH transferred (wei)
gasintGas limit set by sender
receipt_statusint1=success, 0=reverted
receipt_gas_usedintActual gas consumed
receipt_effective_gas_priceintPrice actually paid (wei/gas)
l1_feeintL1 data fee paid to Ethereum (wei)
l1_gas_priceintL1 gas price at inclusion (wei/gas)
l1_fee_scalarstringScalar used to compute L1 fee
l2_feeintL2 execution fee (wei)
mintintETH minted by L1-to-L2 deposit; null for regular txns
source_hashstringDeposit source hash; non-null for L1-to-L2 deposits
partition_datestringPartition key (YYYY-MM-DD)
-- Top gas spenders on Base by total L1 + L2 fees paid
SELECT from_address,
       COUNT(*) AS tx_count,
       SUM(l2_fee + COALESCE(l1_fee, 0)) / 1e18 AS total_eth_fees
FROM   transactions
WHERE  partition_date >= '2026-04-06'
GROUP BY from_address
ORDER BY total_eth_fees DESC
LIMIT 10

Bitcoin

btc
Tables: blocks, outputs Source: AWS Public Blockchain
Table blocks — one row per block  ·  ~6.6 GB scanned, full history 2009 – present
ColumnTypeDescription
numberintBlock height
timestampstringBlock timestamp (ISO 8601); natively indexed
mediantimestringMedian time of last 11 blocks (ISO 8601); natively indexed
hashstringBlock hash
transaction_countintNumber of transactions in the block
difficultyfloatMining difficulty at this block
sizeintBlock size in bytes
weightintBlock weight (SegWit units)
minerstringCoinbase nonce (miner tag)
previousblockhashstringParent block hash
-- Bitcoin difficulty trend over the last 90 days
SELECT partition_date,
       AVG(difficulty) AS avg_difficulty,
       SUM(transaction_count) AS daily_txns
FROM   blocks
WHERE  timestamp >= `2026-01-13T00:00:00Z`
GROUP BY partition_date ORDER BY partition_date
Table outputs — one row per transaction output  ·  730 daily partitions (2024 – present)  ·  ~115 GB scanned  ·  partition key: block_timestamp

Note: "value" must be quoted in SQL — it is a reserved word in Sneller. This table contains outputs only; inputs are not included so UTXO spent/unspent state is not available.

ColumnTypeDescription
txidstringTransaction ID
block_numberintBlock height containing this transaction
block_timestampstringBlock timestamp (ISO 8601); natively indexed
output_indexintOutput index within the transaction (vout)
addressstringRecipient address
"value"intOutput value in satoshis (1 BTC = 100,000,000 satoshis)
typestringOutput script type: P2PKH, P2SH, P2WPKH, P2WSH, P2TR, etc.
is_coinbaseboolTrue if this output is a miner block reward
-- Total BTC received by an address over the last 2 years
SELECT SUM("value") / 1e8    AS total_btc,
       COUNT(*)               AS output_count
FROM   outputs
WHERE  address = 'bc1qxy2kgdygjrsqtzq2n0yrf2493p83kkfjhx0wlh'

DeFi — Protocol Data

defi
Tables: protocols, tvl Source: DeFiLlama API

DeFi protocol data sourced from the DeFiLlama API. The protocols table is a current snapshot of all tracked protocols; tvl contains daily total-value-locked history for protocols with TVL ≥ $10M at fetch time.

Note: The date partition field in tvl is stored as a datetime internally — use backtick ISO 8601 literals: `2026-04-15T00:00:00Z`

Table protocols — snapshot of all DeFiLlama protocols  ·  ~7,300 rows  ·  ~2 MB scanned
ColumnTypeDescription
slugstringDeFiLlama protocol slug (unique identifier)
namestringProtocol display name
categorystringProtocol category (e.g. Lending, DEX, Liquid Staking)
chainstringPrimary chain
chainsarrayAll chains where the protocol is deployed
tvlfloatTotal value locked (USD)
change_1hfloatTVL change over last 1 hour (%)
change_1dfloatTVL change over last 24 hours (%)
change_7dfloatTVL change over last 7 days (%)
mcapfloatMarket capitalisation (USD; optional)
fdvfloatFully diluted valuation (USD; optional)
descriptionstringShort protocol description
urlstringProtocol website URL
twitterstringTwitter handle (optional)
gecko_idstringCoinGecko identifier (optional)
-- Top 20 protocols by TVL
SELECT name, category, chain,
       TRUNC(tvl/1e9*1000)/1000  AS tvl_billions
FROM   protocols
ORDER BY tvl DESC
LIMIT 20

-- TVL by category (all protocols with TVL > 0)
SELECT category,
       COUNT(*) AS protocols,
       TRUNC(SUM(tvl)/1e9*100)/100  AS total_tvl_billions
FROM   protocols
WHERE  tvl > 0
GROUP BY category
ORDER BY total_tvl_billions DESC
LIMIT 20
Table tvl — daily TVL history per protocol  ·  365 daily partitions  ·  ~0.4 GB scanned
ColumnTypeDescription
datedatePartition key (stored as datetime; use backtick literals)
protocolstringProtocol display name
slugstringDeFiLlama protocol slug (joins to protocols)
categorystringProtocol category
chainstringChain for this TVL record
chainsarrayAll chains where the protocol is deployed
tvlfloatTotal value locked on this date (USD)
-- Total DeFi TVL trend — last 30 days
SELECT date,
       TRUNC(SUM(tvl)/1e9*100)/100  AS total_tvl_billions,
       COUNT(*)                        AS protocol_count
FROM   tvl
WHERE  date >= `2026-03-16T00:00:00Z`
GROUP BY date
ORDER BY date
LIMIT 30

-- Aave V3 TVL history (last 90 days; each version has its own slug)
SELECT date,
       TRUNC(SUM(tvl)/1e9*1000)/1000  AS tvl_billions
FROM   tvl
WHERE  slug = 'aave-v3'
  AND  date >= `2026-01-15T00:00:00Z`
GROUP BY date
ORDER BY date
LIMIT 90

PubMed — Biomedical Literature

pubmed
Table: baseline Rows: ~40,000,000 Partitions: 247 year partitions (1781–2027) · partition key: pub_year Size: ~138 GB scanned

Full PubMed baseline: titles, abstracts, authors, journals, MeSH terms, and publication metadata for biomedical literature. Data originates from NLM XML; nested fields preserve the original XML element names. Filter by pub_year (integer) to scan only the relevant year partition instead of the full 138 GB.

Schema — table baseline (selected fields)
ColumnTypeDescription
MedlineCitation.PMID.textstringPubMed identifier (PMID is an object; use .text for the plain ID)
MedlineCitation.Article
.ArticleTitle
stringArticle title
MedlineCitation.Article
.Abstract.AbstractText
string / objectAbstract text (89% coverage, polymorphic): string for plain abstracts; {text, Label} per section for structured abstracts (e.g. Label: “BACKGROUND”)
MedlineCitation.Article
.Journal.Title
stringJournal name
MedlineCitation.Article
.AuthorList.Author
arrayAuthors (polymorphic elements): {LastName, ForeName, Affiliation} for individuals; {CollectiveName} for groups; 57% coverage
MedlineCitation
.MeshHeadingList.MeshHeading
arrayMeSH controlled vocabulary terms; 57% coverage
MedlineCitation
.KeywordList.Keyword
arrayAuthor-supplied keywords; 60% coverage
MedlineCitation
.ChemicalList.Chemical
arrayChemical substances (NameOfSubstance, RegistryNumber); 28% coverage
MedlineCitation
.DateCompleted.Year
stringYear NLM indexing was completed
PubmedData.ArticleIdList
.ArticleId
arrayDOI, PMC ID, and other identifiers
-- Recent articles matching a topic (regex title search)
SELECT MedlineCitation.PMID.text                  AS pmid,
       MedlineCitation.Article.ArticleTitle        AS title,
       MedlineCitation.Article.Journal.Title        AS journal,
       MedlineCitation.DateCompleted.Year           AS year
FROM   baseline
WHERE  pub_year = 2023
  AND  MedlineCitation.Article.ArticleTitle ~ '(?i)CRISPR'
LIMIT 20
-- All papers by a specific author (any position in author list)
-- Author is an array: use comma in FROM to expand one row per author
SELECT DISTINCT
       p.MedlineCitation.PMID.text                  AS pmid,
       p.MedlineCitation.Article.ArticleTitle       AS title,
       p.MedlineCitation.Article.Journal.Title      AS journal,
       p.pub_year
FROM   baseline AS p,
       p.MedlineCitation.Article.AuthorList.Author AS a
WHERE  p.pub_year = 2023
  AND  a.LastName = 'Smith'
LIMIT 20
-- Papers co-authored by ALL of the listed authors (AND semantics)
-- IN() matches any name; HAVING matched = N enforces all-must-match
SELECT p.MedlineCitation.PMID.text             AS pmid,
       COUNT(DISTINCT a.LastName)              AS matched
FROM   baseline AS p,
       p.MedlineCitation.Article.AuthorList.Author AS a
WHERE  p.pub_year = 2023
  AND  a.LastName IN ('Smith', 'Jones', 'Brown')
GROUP BY pmid
HAVING matched = 3

ClinicalTrials.gov — Clinical Studies

clinicaltrials
Table: studies Rows: ~580,000 Size: ~663 MB scanned

All registered clinical studies from ClinicalTrials.gov, sourced via the v2 API. One row per study. Pairs with PubMed via nct_id. Useful for trial matching by condition or drug, drug pipeline analysis, sponsor activity, and eligibility criteria queries.

Schema — table studies (selected fields)
ColumnTypeDescription
nct_idstringClinicalTrials.gov identifier (e.g. NCT00123456)
brief_titlestringShort study title
official_titlestringFull protocol title
overall_statusstringRECRUITING, COMPLETED, ACTIVE_NOT_RECRUITING, etc.
start_datestringStudy start date (YYYY-MM-DD)
completion_datestringPrimary completion date (YYYY-MM-DD)
study_first_submitted_datestringDate first submitted to registry
study_typestringINTERVENTIONAL or OBSERVATIONAL
phasesarrayTrial phase(s): PHASE1, PHASE2, PHASE3, PHASE4, NA
enrollmentintTarget or actual enrollment count
lead_sponsorstringName of the lead sponsor
sponsor_classstringINDUSTRY, NIH, OTHER_GOV, OTHER
conditionsarrayDisease/condition names
keywordsarrayFree-text keywords
interventionsarrayObjects with type (DRUG, DEVICE, …) and name
brief_summarystringPlain-language study summary (truncated at 2000 chars)
sexstringEligibility: ALL, FEMALE, MALE
minimum_agestringMinimum participant age (e.g. "18 Years")
maximum_agestringMaximum participant age
has_resultsboolTrue if results have been posted
-- Trials mentioning GLP-1 (regex search on title)
SELECT nct_id, brief_title, overall_status
FROM   studies
WHERE  brief_title ~ '(?i)GLP-?1'
LIMIT 20

FDA FAERS — Adverse Event Reports

fda
Table: faers Rows: ~63,000,000 Coverage: 2012Q4 – 2025Q4 (53 quarters) Size: ~6.5 GB scanned

FDA Adverse Event Reporting System (FAERS) — voluntary reports of suspected drug side effects submitted by patients, healthcare providers, and manufacturers. One row per (case, drug): reactions and outcomes are stored as arrays so a single WHERE clause covers all events in a case. Drug names are verbatim reporter text — use regex to match across brand names, generics, and dosage variants. Pairs with PubMed and ClinicalTrials.gov for pharmacovigilance workflows.

Note: Pre-2012 LAERS-era data is not included. Reports are deduplicated to the highest caseversion per case. Voluntary reporting introduces bias — counts reflect reporting volume, not true incidence rates.

Schema — table faers (selected fields)
ColumnTypeDescription
primaryidintUnique case-version identifier
caseidintCase identifier (stable across versions)
event_dtstringDate of adverse event (YYYY-MM-DD; may be partial)
yearintYear of adverse event; partition key
age_yearsfloatPatient age in years (normalised from age + age_cod)
sexstringM / F / UNK
occr_countrystringCountry where event occurred (ISO 2-letter)
drug_seqintDrug sequence number within the case
drugnamestringVerbatim drug name as reported
drug_rolestringPS = primary suspect, SS = secondary, C = concomitant, I = interacting
routestringRoute of administration (Oral, Intravenous, etc.)
reactionsarrayMedDRA preferred terms for all reactions in this case
outcomesarrayDE=death, HO=hospitalisation, LT=life-threatening, DS=disability, CA=congenital anomaly, RI=required intervention, OT=other
indicationsarrayMedDRA preferred terms for the drug's indication(s)
-- Most common reactions for a drug (regex matches brand + generic names)
SELECT reaction, COUNT(*) AS n
FROM   faers, UNNEST(reactions) AS reaction
WHERE  drugname ~ '(?i)ozempic|semaglutide'
  AND  drug_role = 'PS'
GROUP BY reaction
ORDER BY n DESC
LIMIT 20
-- Fatal cases by drug class (regex on drug name, filter outcome = death)
SELECT drugname,
       COUNT(*) AS fatal_reports
FROM   faers
WHERE  drugname ~ '(?i)fentanyl|oxycodone|hydrocodone'
  AND  ARRAY_CONTAINS(outcomes, 'DE')
GROUP BY drugname
ORDER BY fatal_reports DESC
LIMIT 15

FEC Campaign Finance

fec
Table: contributions Rows: 279,415,061 Partitions: 24 cycle partitions (1980–2026) · partition key: cycle Size: ~37 GB scanned

Individual campaign contributions reported to the FEC, covering every US federal election cycle from 1980 to 2026. One row per contribution, enriched with committee name, type, and party affiliation (from the FEC committee master file) and candidate name where applicable. Filter by cycle (4-digit even year) to scan only the cycles relevant to your query. Source: FEC Bulk Data — US Government public domain.

Schema — table contributions
ColumnTypeDescription
cycleintElection cycle year (1980, 1982, …, 2026) — partition key
sub_idstringUnique FEC record identifier
cmte_idstringFEC committee ID receiving the contribution
cmte_namestringCommittee name (enriched from committee master)
cmte_typestringCommittee type code (H=House, S=Senate, P=Presidential, Q/N/W=PAC, …)
cmte_partystringParty affiliation (DEM, REP, …)
cand_idstringFEC candidate ID (when contribution targets a specific candidate)
cand_namestringCandidate name (enriched from candidate master)
entity_tpstringContributor entity type (IND=individual, PAC, ORG, …)
namestringContributor name
city / state / zip_codestringContributor location
employerstringContributor employer
occupationstringContributor occupation
transaction_dtstringContribution date (YYYY-MM-DD)
transaction_amtintAmount in dollars
transaction_tpstringTransaction type code
other_idstringOther committee ID (set for transfers between committees)
tran_idstringTransaction identifier within the filing
memo_textstringOptional memo / description
Example queries
-- Top employers of donors to a committee (2024 cycle)
SELECT   employer, COUNT(*) AS n, SUM(transaction_amt) AS total_usd
FROM     contributions
WHERE    cycle = 2024
  AND    cmte_name ~ '(?i)actblue|winred'
  AND    employer != ''
GROUP BY employer
ORDER BY total_usd DESC
LIMIT    20
-- Political exposure check: donations by employees of a company (2024 cycle)
SELECT   name, cycle, cmte_name, cmte_party, transaction_amt, transaction_dt
FROM     contributions
WHERE    employer ~ '(?i)goldman sachs'
  AND    cycle = 2024
  AND    transaction_amt >= 2500
ORDER BY transaction_amt DESC
LIMIT    25
-- Top recipients of Google employee donations, 2024 cycle
SELECT   cmte_name, cmte_party,
         SUM(transaction_amt) AS total_usd,
         COUNT(*) AS n_contributions
FROM     contributions
WHERE    cycle = 2024
  AND    employer ~ '(?i)google'
GROUP BY cmte_name, cmte_party
ORDER BY total_usd DESC
LIMIT    10

FDA Orange Book — Approved Drug Products

fda
Table: orangebook Rows: 51,063 Size: 5 MB scanned

FDA-approved drug products from the Orange Book via the openFDA bulk export. One row per product strength/form. Includes therapeutic equivalence (TE) codes that determine generic substitutability. Essential for pharma agents querying approval status, generics entry, and formulary decisions.

Schema — table orangebook
ColumnTypeDescription
application_numberstringNDA or ANDA number (e.g. NDA009700)
appl_typestringN = NDA (brand), A = ANDA (generic)
sponsor_namestringApplicant / sponsor company name
product_numberstringProduct sequence within the application
brand_namestringTrade / brand name
active_ingredientsarray[{name, strength}] — active ingredient names and strengths
dosage_formstringe.g. TABLET, CAPSULE
routestringe.g. ORAL, INTRAVENOUS
marketing_statusstringe.g. Prescription, OTC
te_codestringTherapeutic equivalence code (AB = substitutable; may be null)
-- All approved generics for a drug (AB-rated = substitutable)
SELECT application_number, sponsor_name, brand_name, te_code
FROM   orangebook
WHERE  ARRAY_CONTAINS(active_ingredients[0].name, 'METFORMIN')
  AND  te_code = 'AB'
ORDER BY sponsor_name
LIMIT 20
-- All products by a given sponsor
SELECT application_number, brand_name, dosage_form,
       route, te_code
FROM   orangebook
WHERE  sponsor_name ~ '(?i)pfizer'
  AND  appl_type = 'N'
ORDER BY brand_name
LIMIT 25

FRED — Federal Reserve Economic Data

fred
Table: series Rows: 357,442 Series: 115 curated Size: 16 MB scanned

Curated macroeconomic time series from the St. Louis Fed FRED database: GDP, inflation, employment, interest rates, money supply, housing, fiscal, markets, and trade. One row per (series, date) observation. Series with redistribution restrictions are excluded.

This product uses the FRED® API but is not endorsed or certified by the Federal Reserve Bank of St. Louis.

Schema — table series
ColumnTypeDescription
series_idstringFRED series identifier (e.g. UNRATE, GDP)
categorystringThematic group: gdp, inflation, employment, rates, money, housing, production, trade, fiscal, markets, credit, income
titlestringFull series name (e.g. Unemployment Rate)
unitsstringUnit of measure (e.g. %, Bil. of $)
frequencystringObservation frequency (D=daily, M=monthly, Q=quarterly, A=annual)
seasonal_adjustmentstringSA = seasonally adjusted; NSA = not seasonally adjusted
obs_datetimestampObservation date
obs_yearintObservation year (partition key; use for fast year-range queries)
valuefloatObservation value in the series units (reserved word — quote as "value")
-- US unemployment rate since 2020
SELECT obs_date, "value"
FROM   series
WHERE  series_id = 'UNRATE'
  AND  obs_year >= 2020
ORDER BY obs_date
LIMIT 100
-- Compare CPI and core PCE year-over-year
SELECT series_id, title, obs_date, "value"
FROM   series
WHERE  series_id IN ('CPIAUCSL', 'PCEPILFE')
  AND  obs_year >= 2021
ORDER BY obs_date, series_id
LIMIT 200
-- All available series IDs with titles
SELECT DISTINCT series_id, category, title, units, frequency
FROM   series
ORDER BY category, series_id
LIMIT 200

End of Life — Runtime & Framework Lifecycle Dates

eol
Table: cycles Rows: 7,884 Products: 451 Size: 1 MB scanned

End-of-life and support status for 450+ software products — languages, runtimes, frameworks, databases, operating systems, and cloud services. One row per release cycle. Source: endoflife.date. Ideal for agents checking dependency support status or auditing infrastructure.

Schema — table cycles
ColumnTypeDescription
productstringProduct slug (e.g. python, nodejs, ubuntu)
cyclestringRelease cycle label (e.g. 3.11, 20.04)
release_datestringGeneral availability date (YYYY-MM-DD)
eolstringEnd-of-life date, or false (still supported), or true (EOL, no exact date)
lateststringLatest patch release in this cycle
latest_datestringRelease date of the latest version
ltsstringLTS status: true, false, or LTS end date
supportstringActive support end date (before EOL), or true/false
linkstringURL for the release notes / changelog
-- Is Python 3.8 still supported?
SELECT product, cycle, "eol", lts, "latest"
FROM   cycles
WHERE  product = 'python'
ORDER BY cycle DESC
LIMIT 10
-- All products reaching EOL in 2025
SELECT product, cycle, "eol", "latest"
FROM   cycles
WHERE  "eol" LIKE '2025%'
ORDER BY "eol"
LIMIT 50

arXiv — Preprint Metadata

arxiv
Table: papers Rows: ~2,400,000 Coverage: 1991 – present Size: ~4 GB scanned

Metadata for all arXiv preprints — titles, abstracts, authors, categories, and submission dates. Covers CS, math, physics, quantitative finance, statistics, biology, and more. Papers appear on arXiv months before journal publication, making this the primary source for the latest research. Partitioned by submission month.

Schema — table papers
ColumnTypeDescription
idstringarXiv ID (e.g. 2301.00001)
titlestringPaper title (whitespace normalised)
abstractstringFull abstract text
categoriesarrayarXiv category codes (e.g. ["cs.LG", "stat.ML"])
primary_catstringPrimary category (first in list)
authorsarray[{name, affiliation}]
submittedstringOriginal submission date (YYYY-MM-DD); partition key
updatedstringLast-updated date
doistringDOI if published; null otherwise
journal_refstringJournal reference if published; null otherwise
commentsstringAuthor comments (page count, code links, etc.)
licensestringLicense URL (e.g. CC BY 4.0)
-- Recent ML papers mentioning transformers
SELECT id, submitted, title, primary_cat
FROM   papers
WHERE  primary_cat = 'cs.LG'
  AND  submitted >= '2024-01-01'
  AND  title ~ '(?i)transformer'
ORDER BY submitted DESC
LIMIT 25
-- Papers by an author across all categories
SELECT id, submitted, title, categories
FROM   papers, UNNEST(authors) AS a
WHERE  a.name ~ '(?i)lecun'
ORDER BY submitted DESC
LIMIT 20
-- Category submission volume by year (quantitative finance)
SELECT DATE_TRUNC('year', CAST(submitted AS TIMESTAMP)) AS yr,
       COUNT(*) AS papers
FROM   papers
WHERE  primary_cat = 'q-fin.TR'
GROUP BY yr
ORDER BY yr
LIMIT 50

GWAS Catalog — Genome-Wide Association Studies

gwas
Table: associations Rows: 1,095,212 Partitions: 19 years (2008–2026) Size: ~0.3 GB scanned

NHGRI-EBI GWAS Catalog — curated genome-wide association study results. One row per variant–trait association. Joins ClinVar on rsid / dbsnp_id and PubMed on pubmed_id. Source: EBI FTP. Partition key: date_added.

Schema — table associations
ColumnTypeDescription
rsidstringdbSNP rsID (e.g. rs1801131)
snp_risk_allelestringRisk allele in rsID-allele format (e.g. rs1801131-C)
risk_allele_freqfloatRisk allele frequency in discovery sample
p_valuefloatAssociation p-value
pvalue_mlognumber−log&sub1;&sub0;(p-value)
or_betafloatOdds ratio (binary traits) or beta (quantitative traits)
ci_95string95% confidence interval
disease_traitstringDisease or trait name as reported by the study
mapped_traitstringEFO-mapped trait label
mapped_genestringNearest gene(s) mapped by ENSEMBL
reported_genesarrayGenes reported by the study authors
chromosomestringChromosome
chr_posintChromosomal position (GRCh38)
contextstringVariant functional context (e.g. intron_variant, missense_variant)
pubmed_idintPubMed ID of the source study
pub_datestringPublication date
first_authorstringFirst author surname
study_accessionstringGWAS Catalog study accession (e.g. GCST000001)
date_addedstringDate added to the GWAS Catalog (partition key)
initial_samplestringDiscovery sample description (ancestry, size)
-- Top associations for Type 2 Diabetes
SELECT rsid, snp_risk_allele, mapped_gene, p_value, or_beta, risk_allele_freq
FROM   associations
WHERE  disease_trait ~ '(?i)type 2 diabetes'
ORDER BY pvalue_mlog DESC
LIMIT 25
-- Join GWAS hits with ClinVar pathogenic variants
SELECT g.rsid, g.disease_trait, g.mapped_gene, g.p_value,
       c.significance, c.gene_symbol
FROM   gwas.associations AS g
JOIN   clinvar.variants   AS c
  ON   g.rsid = c.dbsnp_id
  AND  c.assembly = 'GRCh38' AND c.sig_simple = 1
WHERE  g.pvalue_mlog > 50
ORDER BY g.pvalue_mlog DESC
LIMIT 20
-- Most-studied traits by association count
SELECT disease_trait, COUNT(*) AS n_associations
FROM   associations
GROUP BY disease_trait
ORDER BY n_associations DESC
LIMIT 20

ClinVar — Genetic Variant Classifications

clinvar
Table: variants Rows: 8,918,806 Assemblies: GRCh37 + GRCh38 Size: ~2.1 GB scanned

NCBI ClinVar variant classifications — pathogenicity, associated phenotypes, chromosomal coordinates, and review status for millions of human genetic variants. One row per variant × assembly (GRCh37 and GRCh38 each have their own row). Source: NCBI FTP, updated weekly.

Schema — table variants
ColumnTypeDescription
allele_idintStable ClinVar AlleleID
variation_idintClinVar VariationID
typestringVariant type (SNV, Indel, Deletion, Duplication, …)
namestringHGVS-like variant name
gene_symbolstringGene symbol (e.g. BRCA1)
gene_idintNCBI GeneID; null if intergenic
significancestringClinical significance (Pathogenic, Likely pathogenic, Benign, …)
sig_simpleint-1 conflicting, 0 uncertain/other, 1 pathogenic, 2 benign, 3 risk-factor
last_evaluatedstringDate last evaluated by submitter
dbsnp_idstringdbSNP rsID (e.g. rs80357713); null if none
phenotypesarrayAssociated disease/phenotype names
phenotype_idsarrayMedGen/OMIM/Orphanet IDs for phenotypes
rcv_accessionsarrayRCV accession numbers
originstringAllele origin (germline, somatic, …)
assemblystringReference assembly (GRCh38, GRCh37, NCBI36)
chromosomestringChromosome
startintStart position (1-based)
stopintStop position (1-based)
ref_allelestringReference allele (or null)
alt_allelestringAlternate allele (or null)
review_statusstringReview status (e.g. "criteria provided, multiple submitters")
n_submittersintNumber of submitters
pos_vcfintVCF position
ref_vcfstringVCF reference allele
alt_vcfstringVCF alternate allele
other_idsstringOther identifiers (comma-sep "type:value" pairs)
oncogenicitystringOncogenicity classification; null if not assessed
-- Pathogenic variants in BRCA1 (GRCh38)
SELECT allele_id, name, significance, chromosome, start, dbsnp_id
FROM   variants
WHERE  gene_symbol = 'BRCA1'
  AND  assembly    = 'GRCh38'
  AND  sig_simple  = 1
ORDER BY start
LIMIT 50
-- Variants associated with a phenotype (by name)
SELECT gene_symbol, significance, name, review_status, n_submitters
FROM   variants, UNNEST(phenotypes) AS ph
WHERE  assembly = 'GRCh38'
  AND  ph ~ '(?i)breast cancer'
  AND  sig_simple = 1
ORDER BY n_submitters DESC
LIMIT 25
-- Pathogenic variant count per gene (well-reviewed only)
SELECT gene_symbol, COUNT(*) AS pathogenic_variants
FROM   variants
WHERE  assembly      = 'GRCh38'
  AND  sig_simple    = 1
  AND  n_submitters >= 2
GROUP BY gene_symbol
ORDER BY pathogenic_variants DESC
LIMIT 20

ALFA — Allele Frequency Aggregator

alfa
Table: variants Rows: ~200M Populations: 12 Size: ~100.8 GB scanned

NCBI Allele Frequency Aggregator — allele frequencies from 1M+ dbGaP subjects across 12 population groups. GRCh38 coordinates. Partitioned by chromosome. The key advantage over gnomAD is finer African ancestry resolution: afr_am (African American), afr (continental African), and afr_oth (other African) are reported separately rather than collapsed into a single afr bucket. This split is clinically significant for variant interpretation in underrepresented populations. Source: NCBI FTP, open access. Note: chromosome uses no chr prefix (“1” not “chr1”) — same as ClinVar, opposite of gnomAD.

Schema — table variants  ·  partitioned by chrom
ColumnTypeDescription
chromstringChromosome (1–22, X, Y, MT — no chr prefix)
posintGRCh38 position (1-based)
rsidstringdbSNP RS identifier (e.g. rs12345); null if none
refstringReference allele
altstringAlternate allele (first for multi-allelic sites)
af_totfloatTotal allele frequency (all subjects)
ac_tot / an_totintTotal alternate allele count / total allele count
af_eurfloatEuropean — closest to gnomAD af_nfe
af_afr_amfloatAfrican American (US) — not available in gnomAD separately
af_afrfloatContinental African (non-US) — not available in gnomAD separately
af_afr_othfloatOther African ancestry — not available in gnomAD separately
af_easfloatEast Asian — comparable to gnomAD af_eas
af_oth_asnfloatOther Asian — broader than gnomAD coverage
af_asnfloatAsian (broad grouping, includes EAS + other)
af_sasfloatSouth Asian
af_lat1 / af_lat2floatTwo Latino sub-populations — gnomAD reports a single af_amr
af_othfloatOther / unassigned
ac_* / an_*intPer-population allele count / allele number for each population above
-- Population frequency comparison for a known variant
SELECT rsid, ref, alt,
       af_tot, af_eur, af_afr_am, af_afr, af_eas, af_sas
FROM   variants
WHERE  chrom = '17'
  AND  rsid  = 'rs80357713'
-- Variants with large African American vs European frequency gap
-- (population-enriched variants, clinically under-studied)
SELECT rsid, chrom, pos, ref, alt,
       af_afr_am, af_eur,
       af_afr_am - af_eur AS delta
FROM   variants
WHERE  chrom     = '13'
  AND  an_afr_am >= 100
  AND  an_eur    >= 100
  AND  af_afr_am - af_eur > 0.3
ORDER BY delta DESC
LIMIT 20

DGIdb — Drug–Gene Interaction Database

dgidb
Tables: interactions, genes, drugs Interactions: 98,239 Genes: 80,234 Drugs: 81,572

Drug-Gene Interaction Database (DGIdb) — ~100K curated drug–gene interactions aggregated from 40+ sources including ChEMBL, CIViC, and FDA drug labels. MIT licensed; commercial use permitted. Useful for target druggability checks: given a gene of interest, find what drugs already modulate it (approved or investigational), enabling competitive landscape and prior art queries without a manual ChEMBL or DrugBank search. Source: dgidb.org.

Usage notes: interaction_type is NULL in most records from the DTC source — filter on approved, anti_neoplastic, and interaction_score instead. drug_concept_id uses ChEMBL/NCI-T namespaces and will not directly join FAERS or NADAC; use drug_name (plain text) as the practical join key.

Schema — table interactions (98,239 rows)
ColumnTypeDescription
gene_namestringHGNC gene symbol (e.g. EGFR)
gene_concept_idstringNormalized gene concept ID (HGNC namespace)
drug_namestringDrug name (normalized); practical join key against FAERS / NADAC
drug_concept_idstringNormalized drug concept ID (ChEMBL or NCI-T namespace)
approvedboolFDA-approved drug
immunotherapyboolImmunotherapy agent
anti_neoplasticboolAnti-neoplastic (cancer) drug
interaction_typestringInteraction type (inhibitor, activator, …); NULL in most DTC-source rows
interaction_scorefloatDGIdb composite score; null if not computed
source_db_namestringSource database (e.g. ChEMBL, CIViC, DTC)
source_db_versionstringSource database version
Schema — table genes (80,234 rows)
ColumnTypeDescription
gene_namestringHGNC gene symbol
concept_idstringNormalized concept ID
nomenclaturestringGene nomenclature source
source_db_namestringSource database
source_db_versionstringSource database version
Schema — table drugs (81,572 rows)
ColumnTypeDescription
drug_namestringDrug name (normalized)
concept_idstringNormalized concept ID (ChEMBL or NCI-T)
nomenclaturestringDrug nomenclature source
approvedboolFDA-approved
immunotherapyboolImmunotherapy agent
anti_neoplasticboolAnti-neoplastic (cancer) drug
source_db_namestringSource database
source_db_versionstringSource database version
-- Approved drugs targeting a gene (prior art / competitor check)
SELECT drug_name, interaction_type, interaction_score, source_db_name
FROM   interactions
WHERE  gene_name = 'EGFR'
  AND  approved  = TRUE
ORDER BY interaction_score DESC
LIMIT 20
-- Anti-cancer drugs for a gene, with adverse-event cross-check via FAERS
-- Step 1: get approved anti-neoplastic drugs targeting the gene
SELECT drug_name
FROM   dgidb.interactions
WHERE  gene_name       = 'BRAF'
  AND  approved        = TRUE
  AND  anti_neoplastic = TRUE

-- Step 2: feed drug_name list into FAERS for adverse event profile
SELECT reaction_meddra_pt, COUNT(*) AS reports
FROM   fda.faers
WHERE  drug_name IN ('VEMURAFENIB', 'DABRAFENIB', 'ENCORAFENIB')
GROUP BY reaction_meddra_pt
ORDER BY reports DESC
LIMIT 15

Sanctions — U.S. Consolidated Screening List

sanctions
Table: entities Rows: 25,447 Lists: OFAC SDN + BIS + State Dept Size: ~8 MB scanned

The U.S. government Consolidated Screening List (CSL) — a unified export of OFAC SDN (Treasury), BIS Entity List (Commerce), State Dept Nonproliferation, ITAR Debarred, and several other sanctions and export-control lists. One row per sanctioned individual, entity, vessel, or aircraft. Source: trade.gov, updated daily.

Schema — table entities
ColumnTypeDescription
idstringStable hash ID from trade.gov
sourcestringSource list (e.g. "Specially Designated Nationals (SDN) - Treasury Department")
namestringPrimary name
typestringIndividual, Entity, Vessel, or Aircraft (may be empty)
programsarraySanctions program codes (e.g. ["UKRAINE-EO13685", "RUSSIA-EO14024"])
alt_namesarrayAliases and alternate name spellings
addressesarrayKnown addresses
idsarrayIdentity documents: passport, tax ID, registration (format: "type, value, country")
nationalitiesarrayNationality countries
dobsarrayDates of birth (YYYY or YYYY-MM-DD)
pobsarrayPlaces of birth
start_datestringListing date (YYYY-MM-DD)
license_requirementstringBIS license requirement (BIS entries only)
remarksstringAdditional remarks
source_list_urlstringURL to the source list page
-- Look up an entity by name (partial match)
SELECT name, type, source, programs, addresses
FROM   entities
WHERE  name ~ '(?i)vtb bank'
LIMIT 10
-- All individuals sanctioned under Russia programs
SELECT name, type, start_date, alt_names
FROM   entities, UNNEST(programs) AS prog
WHERE  prog  ~ '(?i)russia'
  AND  type = 'Individual'
ORDER BY start_date DESC
LIMIT 25
-- Count entities per source list
SELECT source, COUNT(*) AS total
FROM   entities
GROUP BY source
ORDER BY total DESC

MalwareBazaar — Malware Sample Repository

malwarebazaar
Table: samples Rows: 984,697 Partitions: 68 months (2020-02–2025-09) Size: ~0.4 GB scanned

abuse.ch MalwareBazaar — crowdsourced malware sample metadata. One row per submitted sample. Includes file hashes, family signatures, VirusTotal detection rates, and fuzzy hashes for similarity matching. License: CC0. Partition key: first_seen_utc.

Schema — table samples
ColumnTypeDescription
sha256_hashstringSHA-256 hash (primary identifier)
md5_hashstringMD5 hash
sha1_hashstringSHA-1 hash
file_namestringOriginal submitted file name
file_type_guessstringGuessed file type (e.g. Win32 EXE, PDF Document, ZIP)
mime_typestringMIME type
first_seen_utctimestampDate first submitted to MalwareBazaar (partition key)
reporterstringReporting researcher or automated feed
signaturestringMalware family name / AV detection name
vtpercentnumberVirusTotal detection rate 0–100 (null if not scanned)
clamavstringClamAV signature name
imphashstringPE import hash (Windows executables)
ssdeepstringssdeep fuzzy hash for similarity clustering
tlshstringTLSH fuzzy hash
-- All samples for a malware family
SELECT sha256_hash, file_name, file_type_guess, first_seen_utc, vtpercent
FROM   samples
WHERE  signature ~ '(?i)emotet'
ORDER BY first_seen_utc DESC
LIMIT 25
-- Most prevalent malware families (last 6 months)
SELECT signature, COUNT(*) AS samples, AVG(vtpercent) AS avg_vt
FROM   samples
WHERE  first_seen_utc >= '2025-03-01'
  AND  signature != ''
GROUP BY signature
ORDER BY samples DESC
LIMIT 20
-- Lookup by hash (any of the three types)
SELECT sha256_hash, file_name, signature, first_seen_utc, vtpercent
FROM   samples
WHERE  md5_hash = 'd41d8cd98f00b204e9800998ecf8427e'
LIMIT 5

Open Food Facts — Global Food Product Database

openfoodfacts
Table: products Rows: 556,027 Partitions: 7 years (2012–2017) Size: ~0.4 GB scanned

Open Food Facts 2019 Kaggle snapshot (DbCL 1.0) — crowdsourced nutritional data for food products sold globally. One row per product. Includes ingredients, allergens, Nutri-Score grades, and 12 key nutrients per 100g. Partition key: created.

Schema — table products
ColumnTypeDescription
codestringBarcode (EAN/UPC)
product_namestringProduct name
brandsstringBrand names (comma-separated)
categoriesstringProduct categories (comma-separated)
main_categorystringPrimary category
countriesstringCountries where sold (comma-separated)
ingredientsstringIngredients text
allergensstringDeclared allergens
tracesstringMay-contain-traces declarations
additives_nintCount of food additives
nutrition_gradestringNutri-Score grade (a–e; a = healthiest)
energy_100gnumberEnergy per 100g (kcal)
fat_100gnumberTotal fat per 100g (g)
sat_fat_100gnumberSaturated fat per 100g (g)
carbs_100gnumberCarbohydrates per 100g (g)
sugars_100gnumberSugars per 100g (g)
fiber_100gnumberDietary fiber per 100g (g)
proteins_100gnumberProtein per 100g (g)
salt_100gnumberSalt per 100g (g)
createdtimestampDate product was added (partition key)
-- Lowest-sugar breakfast cereals with Nutri-Score A or B
SELECT product_name, brands, sugars_100g, nutrition_grade, countries
FROM   products
WHERE  categories ~ '(?i)cereal'
  AND  nutrition_grade IN ('a', 'b')
  AND  sugars_100g IS NOT MISSING
ORDER BY sugars_100g
LIMIT 20
-- Average macros and Nutri-Score by food category
SELECT main_category,
       COUNT(*) AS products,
       AVG(energy_100g) AS avg_kcal,
       AVG(sugars_100g) AS avg_sugars,
       AVG(proteins_100g) AS avg_protein
FROM   products
WHERE  main_category != ''
GROUP BY main_category
ORDER BY products DESC
LIMIT 20

Hacker News — Y Combinator Tech Community

hackernews
Table: items Rows: 4.6M Partitions: 20 years (2007–2026) License: CC0 Size: ~0.7 GB scanned

Every story, Ask HN, Show HN, and job post from Hacker News since 2007. One row per item with score, author, linked URL, and descendant comment count. Partitioned by year. Useful for tech trend analysis, startup signal detection, and tracking how developer attention has shifted over time. Source: BigQuery bigquery-public-data.hacker_news (CC0).

Schema — table items
ColumnTypeDescription
idstringItem ID
bystringAuthor username
titlestringStory or post title
urlstringLinked URL (null for Ask/Show HN text posts)
scoreintUpvote score
timeintUnix timestamp of submission
descendantsintTotal comment count
yearintSubmission year (partition key)
-- Top stories of 2024 by score
SELECT title, url, score, descendants
FROM   items
WHERE  year = 2024
ORDER BY score DESC
LIMIT 20
-- Show HN posts per year — measure of builder activity over time
SELECT year,
       COUNT(*) AS show_hn_posts,
       AVG(score) AS avg_score
FROM   items
WHERE  title ~ '(?i)^show hn'
GROUP BY year
ORDER BY year DESC

NADAC — National Average Drug Acquisition Cost

nadac
Table: prices Rows: 18,172,629 Coverage: 2013–2026 (weekly) Size: ~0.8 GB scanned

CMS weekly drug acquisition cost survey — the price pharmacies actually pay for drugs. One row per NDC code per weekly report. Covers generic and brand-name drugs, retail and non-retail pharmacies. Pairs with FDA FAERS and Orange Book. Source: data.medicaid.gov, public domain. Partition key: effective_date.

Schema — table prices
ColumnTypeDescription
ndcstringNational Drug Code (11-digit)
ndc_descriptionstringDrug name, strength, and dosage form
nadac_per_unitstringAcquisition cost per unit (tablet, mL, gram, etc.)
pricing_unitstringUnit type: TAB, ML, GM, EA
classificationstringGeneric or Brand Name
otcboolOver-the-counter (true) or prescription (false)
pharmacy_typestringRetail or Non-Retail
as_of_datestringDate of the weekly survey report
effective_datestringPrice effective date (partition key)
explanation_codestringReason for price change or data source code
corresponding_generic_nadacstringFor brand drugs: comparable generic NADAC price
-- Current acquisition cost for metformin generics
SELECT ndc_description, nadac_per_unit, pricing_unit, as_of_date
FROM   prices
WHERE  ndc_description ~ '(?i)metformin'
  AND  classification = 'Generic'
  AND  as_of_date >= '2025-01-01'
ORDER BY as_of_date DESC, ndc_description
LIMIT 20
-- Brand vs generic price gap for a drug class
SELECT classification,
       AVG(CAST(nadac_per_unit AS float)) AS avg_cost,
       COUNT(DISTINCT ndc) AS ndcs
FROM   prices
WHERE  ndc_description ~ '(?i)atorvastatin'
  AND  as_of_date >= '2025-01-01'
GROUP BY classification

World Bank — Global Commodity Prices

worldbank
Table: commodity_prices Rows: 795 Coverage: 1960–2026 (monthly) License: CC BY 4.0

World Bank Pink Sheet — monthly prices for ~70 global commodities: energy, metals, agriculture, and fertilizers. One row per month, one column per commodity. All prices are nominal USD. Source: World Bank, CC BY 4.0. Partition key: year.

Schema — table commodity_prices (selected columns)
ColumnTypeDescription
periodstringMonth label in YYYY-Mmm format (e.g. 2024-Jan)
yearintYear (partition key)
monthintMonth number (1–12)
crude_oil_brentnumberBrent crude oil ($/bbl)
crude_oil_wtinumberWTI crude oil ($/bbl)
natural_gas_usnumberUS natural gas ($/mmbtu)
natural_gas_europenumberEuropean natural gas ($/mmbtu)
coal_australiannumberAustralian thermal coal ($/mt)
goldnumberGold ($/troy oz)
coppernumberCopper ($/mt)
aluminumnumberAluminum ($/mt)
maizenumberMaize / corn ($/mt)
wheat_us_hrwnumberUS hard red winter wheat ($/mt)
soybeansnumberSoybeans ($/mt)
sugar_worldnumberWorld sugar (¢/kg)
coffee_arabicanumberArabica coffee (¢/kg)
cocoanumberCocoa ($/mt)

Plus ~50 additional commodity columns. Use GET /v1/databases for the full list.

-- Brent crude and natural gas price history since 2020
SELECT period, crude_oil_brent, crude_oil_wti, natural_gas_us
FROM   commodity_prices
WHERE  year >= 2020
ORDER BY year, month
-- Gold vs copper ratio over time (risk-off indicator)
SELECT period, gold, copper,
       gold / NULLIF(copper, 0) AS gold_copper_ratio
FROM   commodity_prices
WHERE  year >= 2010
ORDER BY year, month

FHFA — House Price Index

fhfa
Table: hpi Rows: 133,216 Coverage: 1975–2026 Levels: USA · state · MSA

Federal Housing Finance Agency (FHFA) House Price Index — quarterly and monthly home price indices for the U.S., all 50 states, and ~400 metropolitan statistical areas (MSAs). Covers purchase-only and all-transactions variants, seasonally adjusted and not. Source: FHFA, public domain. Partition key: yr.

Schema — table hpi
ColumnTypeDescription
yrintYear (partition key)
periodintQuarter (1–4) or month (1–12) depending on frequency
frequencystringquarterly or monthly
hpi_typestringpurchase-only or all-transactions
hpi_flavorstringtraditional or expanded-data
levelstringGeographic level: USA, state, MSA
place_idstringFIPS code or state abbreviation
place_namestringState or MSA name (e.g. San Francisco-Oakland-Fremont, CA)
index_nsanumberNot-seasonally-adjusted HPI (base = 100 at 1991 Q1)
index_sanumberSeasonally-adjusted HPI (null for MSA-level series)
-- National quarterly HPI since 2000 (purchase-only, seasonally adjusted)
SELECT yr, period, index_sa, index_nsa
FROM   hpi
WHERE  level      = 'USA'
  AND  hpi_type  = 'purchase-only'
  AND  frequency = 'quarterly'
  AND  yr >= 2000
ORDER BY yr, period
-- State-level HPI comparison (most recent quarter)
SELECT place_name, index_nsa, index_sa
FROM   hpi
WHERE  level      = 'state'
  AND  hpi_type  = 'purchase-only'
  AND  frequency = 'quarterly'
  AND  yr = 2025 AND period = 4
ORDER BY index_nsa DESC

ClinPGx — Pharmacogenomics Knowledge Base

clinpgx
Tables: 4 License: CC BY-SA 4.0

Curated pharmacogenomics data from ClinPGx (formerly PharmGKB) — gene–drug interactions, clinical annotations, dosing guidelines (CPIC, DPWG), and FDA/EMA drug label PGx requirements. Four tables: clinical variants, guidelines, drug labels, and relationships.

Schema — table clinical_variants (~5,200 rows)
ColumnTypeDescription
variantstringVariant name or star allele (e.g. CYP2C9*3, rs1801131)
genestringGene symbol (e.g. CYP2C9)
typestringAnnotation type (Metabolism/PK, Toxicity, Efficacy, Dosage)
level_of_evidencestringEvidence level (1A, 1B, 2A, 2B, 3, 4)
chemicalsarrayAssociated drug names
phenotypesarrayAssociated phenotype names
-- Highest-evidence (1A) variant–drug pairs
SELECT variant, gene, type, chemicals, phenotypes
FROM   clinical_variants
WHERE  level_of_evidence = '1A'
ORDER BY gene
Schema — table guidelines (~217 rows)
ColumnTypeDescription
idstringPharmGKB accession ID
namestringGuideline annotation name
sourcestringIssuing body (CPIC, DPWG, FDA, HCSC, EMA)
recommendationboolIncludes a dosing recommendation
dosing_informationboolIncludes dosing information
has_testing_infoboolIncludes genetic testing guidance
alternate_drug_availableboolAlternative drug available
chemicalsarrayInvolved drug names
genesarrayInvolved gene symbols
allelesarraySpecific alleles referenced
pmidsarraySupporting PubMed IDs
-- All CPIC guidelines with dosing recommendations
SELECT name, chemicals, genes, alleles
FROM   guidelines
WHERE  source = 'CPIC'
  AND  recommendation = true
ORDER BY name
Schema — table drug_labels (~1,400 rows)
ColumnTypeDescription
idstringPharmGKB ID
namestringAnnotation name
sourcestringRegulatory body (FDA, EMA, HCSC)
testing_levelstringTesting Required, Actionable PGx, Informative PGx
has_prescribing_infoboolLabel includes prescribing guidance
has_dosing_infoboolLabel includes dosing guidance
has_alternate_drugboolAlternate drug recommended
chemicalsarrayDrug names
genesarrayGene symbols
variantsarrayRelevant variants or haplotypes
latest_history_datestringMost recent annotation update (YYYY-MM-DD)
-- FDA labels requiring genetic testing
SELECT name, chemicals, genes, testing_level
FROM   drug_labels
WHERE  source = 'FDA'
  AND  testing_level = 'Testing Required'
ORDER BY latest_history_date DESC
LIMIT 20
Schema — table relationships (~127,700 rows)
ColumnTypeDescription
entity1_idstringPharmGKB ID of first entity
entity1_namestringName of first entity
entity1_typestringGene, Chemical, Disease, Variant, ...
entity2_idstringPharmGKB ID of second entity
entity2_namestringName of second entity
entity2_typestringType of second entity
evidencearrayEvidence sources (ClinicalAnnotation, VariantAnnotation, ...)
associationstringassociated, not associated, ambiguous
pkboolPharmacokinetic relationship
pdboolPharmacodynamic relationship
pmidsarraySupporting PubMed IDs
-- All confirmed gene–drug associations for CYP2D6
SELECT entity2_name AS drug, association, evidence, pmids
FROM   relationships
WHERE  entity1_name = 'CYP2D6'
  AND  entity1_type = 'Gene'
  AND  entity2_type = 'Chemical'
  AND  association  = 'associated'
ORDER BY entity2_name

Reference tables: genes (~25,000 rows) — NCBI Gene ID, Ensembl ID, GRCh37/GRCh38 coordinates, VIP and CPIC flags. drugs (~3,700 rows) — SMILES, InChI, RxNorm/ATC identifiers, annotation counts. variants (~7,600 rows) — chromosomal location, synonyms, annotation counts. phenotypes (~1,600 rows) — alternate names, MeSH/OMIM cross-references.

Illumina — SNP Array Manifests

illumina
Tables: 2 Build: GRCh38

Probe-level manifests for two Illumina SNP genotyping arrays. One row per assay locus — variant coordinates, probe sequence, alleles, strand orientation, and source annotation. Useful for pharmacogenomics variant lookup, array coverage comparison, and probe-sequence retrieval.

TableArrayLociDecompressed
gda_pgxGDA+ePGx v11,933,117800 MiB
gsa_pgxGSA-PGx v4683,054274 MiB

Both tables share the same schema.

Schema — tables gda_pgx / gsa_pgx
ColumnTypeDescription
ilmn_idstringIllumina internal probe ID
namestringVariant name (position-based or rsID)
ilmn_strandstringProbe strand relative to TOP/BOT designation
allelesarrayTwo alleles, e.g. ["A","C"]
address_astringBead address for allele A
probe_seq_astring50-mer probe sequence for allele A
address_bstringBead address for allele B (two-colour probes only)
probe_seq_bstring50-mer probe sequence for allele B (optional)
chrstringChromosome
positionintGRCh38 genomic position (1-based)
sourcestringVariant source (PAGE, 1000genomes, CNV, ...)
source_strandstringSource strand (TOP / BOT)
source_seqstring~120 bp flanking sequence with allele in brackets
top_genomic_seqstringTOP-strand genomic context with allele in brackets
ref_strandstringStrand relative to reference (+ / -)
bead_set_idintBead set identifier
exp_clustersintExpected cluster count (2 = mono, 3 = biallelic)
intensity_onlyboolIntensity-only probe (no genotype call)
-- All probes on chromosome 22 in gda_pgx
SELECT name, position, alleles, source, ref_strand
FROM   gda_pgx
WHERE  chr = '22'
ORDER BY position
LIMIT 20
-- CNV probes in gsa_pgx
SELECT name, chr, position, probe_seq_a
FROM   gsa_pgx
WHERE  source = 'CNV'
ORDER BY chr, position
LIMIT 20