First, a walkthrough of the whole process
tip
Preface: I just wanted to download a few videos
Back to the scene at the time. When I got the request, I figured it was just three steps:
- Manually export the sheets in the Excel file as CSV
- Use nushell/python/jq to convert the CSV to JSON (at that point I had no plan to write straight into SQLite and let a DB be where the data finally converges). This is where data cleaning comes in, because the link field in that Excel file is a mess: some Baidu URLs don't start with https, some carry ?pwd in the URL while others don't and the code has to be fetched separately, and some rows are missing the field entirely (a sketch of this normalization follows the list)
- Use BaiduPCS-Go to download these videos to my local machine
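As a rough illustration of that cleaning step, here is a minimal DuckDB SQL sketch. The table raw_rows and column share_text are made-up names, and the patterns only cover the variants listed above, so treat it as a shape rather than the real thing:

```sql
-- Hypothetical normalization of the messy share-link field.
-- raw_rows / share_text are illustrative names, not the real schema.
SELECT
  share_text,
  regexp_extract(share_text, 'https?://pan\.baidu\.com/[\w\-/?=]+') AS baidu_url,
  regexp_extract(share_text, '(?:提取码|密码)[::\s]*([A-Za-z0-9]{4})', 1) AS pwd
FROM raw_rows;
```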
As you can see, my instinct was to download the videos straight to my machine (in hindsight, wrong from the very start), and I had not thought through what would happen to the data once it was local.
Then I went off doing various feasibility checks, for example:
- whether, via the Baidu URL, the videos could be …
Version 1: the local-download + Cloudflare Tunnel disaster
Looking back at the requirement: what did I actually need?
Version 2: an enterprise-grade "VPS streaming + CDN" architecture
After mailing out the USB drive, I did a retrospective on the whole thing.
At that point, my questions were (still):
tip
- Why did I make these mistakes?
- How do I design so that most problems of this kind get caught up front?
- I find this kind of request hard to fit into the "architecture design framework" above. Is that true, or does the framework just need adjusting?
In short: why did it get so messy? What did I do wrong, and how do I improve next time?
My notes at the time:
Review the job I did for my uncle two days ago: 【turn the Baidu Pan video URLs in an Excel file into a form that is convenient to browse】.
Briefly: I never settled up front whether the videos had to end up on a CDN, so I only handled downloading them locally. I then set up a Cloudflare Tunnel so that others could reach my local copies by URL, only to discover that serving video needs serious bandwidth, which a tunnel into my home machine cannot provide.
Beyond that, I got a second thing wrong: I should have used SQLite to track, per video, whether it had been downloaded and what the concrete error was if not. I assumed nothing could go wrong between obtaining a video URL and finishing its download. The result was that a huge amount of time went into chasing down the URLs of the videos that had failed and dealing with the fallout.
Those were my two main mistakes.
Getting the first question wrong led to two follow-on errors. Had I known from the start that everything would end up on a CDN, I would have run the downloads on a VPS rather than locally, and I would have streamed the uploads (pushing to the CDN while still downloading) instead of waiting until everything was on disk, by which point there was no time left to upload it all.
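Concretely, the status tracking I should have had from day one is about one table's worth of work. A minimal sketch in SQLite SQL; the table and column names are my own illustration, not the actual project's schema:

```sql
-- Hypothetical per-video status table (illustrative names).
CREATE TABLE IF NOT EXISTS video_status (
  id         INTEGER PRIMARY KEY,
  baidu_url  TEXT NOT NULL,
  status     TEXT NOT NULL DEFAULT 'pending',  -- pending / downloaded / failed
  error      TEXT,                             -- concrete failure reason, if any
  updated_at TEXT DEFAULT (datetime('now'))
);
```

With that in place, "which URLs failed and why" becomes one GROUP BY query instead of an afternoon of manual triage.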
Version 3: shortening the chain with OpenList/AList
After that I came back to whether OpenList could simplify the operation (my thinking at this point was still "it is definitely needed").
Final version: the minimum viable solution, BaiduPCS-Go transfer + AList
tip
The longer the chain, the more places it can go wrong, and the more failure scenarios you have to handle.
Using nocodb to clean the data
A few architecture-design habits I took away from these iterations
胡渊鸣 (Yuanming Hu): Meshy AI, Taichi, MIT, Tsinghua Yao Class, graphics, physics simulation, open source, commercialization, courage, wisdom | WhynotTV Podcast #2, bilibili
Closing: small tools are a practice ground for systems design too
Best practice: hands-on
tip
What remains is to actually run the final solution above, following best practices.
The core flow, in 4 steps:
- 1.【Import the data】
- 2.【Clean the data and fill in the extra fields in the table】The core step; verify that the cleaned data still matches the source data in the Excel file (see the sketch after this list)
- 3.【Transfer the resources】Use BaiduPCS-Go to transfer all the resources into my own drive
- 4.【Verify】Check the drive mounted on OpenList and confirm the resources are visible. # Also export every user whose Baidu Pan URL is broken, grouped by failed status
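Both verification points can be expressed as small queries. A hedged sketch in DuckDB SQL: 2025_final.csv is a placeholder filename, v_2025_final is the cleaned table used in the queries below, and video_status is the hypothetical tracking table sketched in the retrospective above:

```sql
-- Step 2 check: row counts should agree between the exported CSV
-- and the cleaned table.
SELECT
  (SELECT COUNT(*) FROM read_csv_auto('2025_final.csv')) AS csv_rows,
  (SELECT COUNT(*) FROM v_2025_final)                    AS table_rows;

-- Step 4 check: users whose Baidu Pan URL failed, grouped by failure reason.
SELECT error, COUNT(*) AS affected
FROM video_status
WHERE status = 'failed'
GROUP BY error
ORDER BY affected DESC;
```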
Import and clean the data
The Database tool built into GoLand can't import Excel into a DB directly; it only supports importing CSV files.
I tried nocodb, …
If I'm going to hit this "clean Excel data into a DB" need regularly, what is a good approach? nocodb feels like a dead end (it only supports import, not export).
tip
So in the end I chose csvkit to import the data and DuckDB to clean it.
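The import path, sketched under assumptions: in2csv and csvsql are real csvkit commands, but data.db, 2025_final.csv, the sheet name, and the view name are placeholders. After csvkit loads the CSV into SQLite, for instance via `in2csv 2025_final.xlsx --sheet Sheet1 > 2025_final.csv` followed by `csvsql --db sqlite:///data.db --insert 2025_final.csv`, DuckDB can attach that SQLite file and clean it in place:

```sql
-- Attach the SQLite DB produced by csvkit, then expose a view to clean against.
-- data.db / 2025_final / v_2025_final are illustrative names.
INSTALL sqlite;
LOAD sqlite;
ATTACH 'data.db' AS src (TYPE sqlite);
CREATE OR REPLACE VIEW v_2025_final AS
SELECT * FROM src."2025_final";
```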
The cleaning query itself, first in SQLite SQL:
WITH RECURSIVE
awards_sort AS (
SELECT '金奖' AS award, 1 AS sort_order UNION ALL
SELECT '银奖', 2 UNION ALL
SELECT '铜奖', 3 UNION ALL
SELECT '未来之星奖', 4 UNION ALL
SELECT '希望之星', 5 UNION ALL
SELECT '未获奖', 6
),
detail_rows AS (
SELECT
id AS detail_id,
group_name,
CASE
WHEN COALESCE(absent_final, 0) = 1 THEN '希望之星'
WHEN award IN ('金奖','银奖','铜奖','未来之星奖') THEN award
ELSE '未获奖'
END AS award_label,
name,
COALESCE(teacher, '') AS teacher,
num,
CASE WHEN group_name = '弦乐团' THEN name ELSE '' END AS ensemble_name,
member_list,
COALESCE(contact, '') AS contact,
COALESCE(address, '') AS address
FROM v_2025_final
),
group_totals AS (
SELECT group_name, award_label, COUNT(*) AS people_count, SUM(num) AS total_num
FROM detail_rows
GROUP BY group_name, award_label
),
member_split AS (
SELECT
dr.detail_id,
dr.group_name,
dr.award_label,
1 AS member_order,
CASE WHEN instr(dr.member_list,'、') > 0
THEN substr(dr.member_list,1,instr(dr.member_list,'、')-1)
ELSE dr.member_list END AS member_detail,
CASE WHEN instr(dr.member_list,'、') > 0
THEN substr(dr.member_list,instr(dr.member_list,'、')+1)
ELSE NULL END AS remainder
FROM detail_rows dr
WHERE dr.member_list IS NOT NULL AND dr.member_list <> ''
UNION ALL
SELECT
detail_id,
group_name,
award_label,
member_order + 1,
CASE WHEN instr(remainder,'、') > 0
THEN substr(remainder,1,instr(remainder,'、')-1)
ELSE remainder END,
CASE WHEN instr(remainder,'、') > 0
THEN substr(remainder,instr(remainder,'、')+1)
ELSE NULL END
FROM member_split
WHERE remainder IS NOT NULL AND remainder <> ''
),
member_lines AS (
SELECT detail_id, group_name, award_label, member_order, member_detail
FROM member_split
)
SELECT
group_name,
award_label AS award,
row_type,
people_count,
total_num,
name,
teacher,
num,
ensemble_name,
member_detail,
contact,
address
FROM (
-- Subtotal rows
SELECT
g.group_name,
g.award_label,
'小计' AS row_type,
g.people_count,
g.total_num,
'' AS name,
'' AS teacher,
NULL AS num,
'' AS ensemble_name,
NULL AS member_detail,
'' AS contact,
'' AS address,
s.sort_order,
-1 AS detail_seq,
0 AS row_order,
0 AS member_order
FROM group_totals g
JOIN awards_sort s ON s.award = g.award_label
UNION ALL
-- Entry detail rows
SELECT
d.group_name,
d.award_label,
'明细' AS row_type,
NULL,
NULL,
d.name,
d.teacher,
d.num,
d.ensemble_name,
NULL,
d.contact,
d.address,
s.sort_order,
d.detail_id,
1 AS row_order,
0 AS member_order
FROM detail_rows d
JOIN awards_sort s ON s.award = d.award_label
UNION ALL
-- Member-name rows
SELECT
m.group_name,
m.award_label,
'名单' AS row_type,
NULL,
NULL,
'' AS name,
'' AS teacher,
NULL AS num,
'' AS ensemble_name,
m.member_detail,
'' AS contact,
'' AS address,
s.sort_order,
m.detail_id,
2 AS row_order,
m.member_order
FROM member_lines m
JOIN awards_sort s ON s.award = m.award_label
)
ORDER BY
group_name,
sort_order,
detail_seq,
row_order,
member_order;
And the same query in DuckDB SQL:
WITH awards_sort(award, sort_order) AS (
VALUES ('金奖',1),('银奖',2),('铜奖',3),('未来之星奖',4),('希望之星',5),('未获奖',6)
),
detail_rows AS (
SELECT
id AS detail_id,
group_name,
CASE
WHEN COALESCE(absent_final,0)=1 THEN '希望之星'
WHEN award IN ('金奖','银奖','铜奖','未来之星奖') THEN award
ELSE '未获奖'
END AS award_label,
name,
COALESCE(teacher,'') AS teacher,
num,
CASE WHEN group_name='弦乐团' THEN name ELSE '' END AS ensemble_name,
member_list,
COALESCE(contact,'') AS contact,
COALESCE(address,'') AS address
FROM v_2025_final
),
group_totals AS (
SELECT group_name, award_label, COUNT(*) AS people_count, SUM(num) AS total_num
FROM detail_rows
GROUP BY 1,2
),
member_lines AS (
SELECT
d.detail_id,
d.group_name,
d.award_label,
gs.generate_series AS member_order,
TRIM(list_element(s.ss, gs.generate_series)) AS member_detail
FROM detail_rows d
CROSS JOIN LATERAL (SELECT string_split(member_list, '、') AS ss) s
CROSS JOIN LATERAL generate_series(1, array_length(s.ss)) gs
WHERE member_list IS NOT NULL AND member_list <> ''
)
SELECT
group_name,
award_label AS award,
row_type,
people_count,
total_num,
name,
teacher,
num,
ensemble_name,
member_detail,
contact,
address
FROM (
SELECT g.group_name, g.award_label, '小计' AS row_type,
g.people_count, g.total_num,
'' AS name, '' AS teacher, NULL AS num, '' AS ensemble_name,
NULL AS member_detail, '' AS contact, '' AS address,
s.sort_order, -1 AS detail_seq, 0 AS row_order, 0 AS member_order
FROM group_totals g JOIN awards_sort s ON s.award = g.award_label
UNION ALL
SELECT d.group_name, d.award_label, '明细',
NULL, NULL,
d.name, d.teacher, d.num, d.ensemble_name,
NULL, d.contact, d.address,
s.sort_order, d.detail_id, 1, 0
FROM detail_rows d JOIN awards_sort s ON s.award = d.award_label
UNION ALL
SELECT m.group_name, m.award_label, '名单',
NULL, NULL,
'' AS name, '' AS teacher, NULL, '' AS ensemble_name,
m.member_detail, '' AS contact, '' AS address,
s.sort_order, m.detail_id, 2, m.member_order
FROM member_lines m JOIN awards_sort s ON s.award = m.award_label
)
ORDER BY group_name, sort_order, detail_seq, row_order, member_order;
The main reason is that DuckDB's built-in functions are stronger, replacing steps that had to be hand-rolled in SQLite with a line or two:
- Award ordering: one VALUES clause lists all 6 rows, instead of six UNION ALLs.
- Splitting the member list: DuckDB's string_split() turns the '、'-delimited list straight into an array, which generate_series + list_element then flatten into rows. SQLite has no equivalent, so the recursive CTE member_split has to loop with instr/substr by hand.
- Row ordering: generate_series(1, array_length(ss)) carries its own increasing index, which removes the member_order + 1 bookkeeping inside the recursion.
- Minor points: the COALESCE/CASE logic is unchanged, but DuckDB allows JOIN LATERAL inside the same CTE to grab the list length and elements, avoiding extra subqueries.
In essence, the line count drops because DuckDB offers "string split + index + explode" as native operations, replacing the recursive-splitting boilerplate SQLite forces on you.
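To make that idiom concrete, here is a minimal self-contained DuckDB sketch of the "split + series + explode" pattern, using a throwaway literal in place of the real member_list column:

```sql
-- Flatten a '、'-delimited name list into ordered rows (illustrative data).
WITH t(member_list) AS (VALUES ('张三、李四、王五'))
SELECT
  gs.generate_series AS member_order,
  TRIM(list_element(s.ss, gs.generate_series)) AS member_detail
FROM t
CROSS JOIN LATERAL (SELECT string_split(member_list, '、') AS ss) s
CROSS JOIN LATERAL generate_series(1, array_length(s.ss)) gs
ORDER BY member_order;
```

Next, the jq side: the script below parses one raw Baidu share text into structured fields (URL, extraction code, an extension hint, plus an issues list for anything missing).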
# Parse a Baidu share text into structured fields.
# Usage: jq -n --arg raw "<string>" -f parse_baidu_link.jq
def trim: gsub("^\\s+"; "") | gsub("\\s+$"; "");
def collapse_ws:
gsub("\\r"; "")
| gsub("\\n"; " ")
| gsub("\\t"; " ")
| gsub(" +"; " ");
def first_match($text; $pattern):
([ $text | match($pattern) ] | first?);
def capture_named($match; $name):
if $match == null then null
else
($match.captures
| map(select(.name == $name))
| if length > 0 then .[0].string else null end)
end;
def sanitize_url($url):
if ($url // "") == "" then null
else
($url
| trim
| gsub("[,,。;;::、…\\)\\]】》〉>』」\"']+$"; "")
| if startswith("https://") then .
elif startswith("http://") then ("https://" + (ltrimstr("http://")))
elif startswith("http") then .
else "https://" + .
end)
end;
def capture_extension($text):
($text // "") as $body
| ($body | first_match(.; "(?i)(\\.)(?<ext>(mp4|mov|mkv|m4v|avi|wmv|flv|mp3|wav|flac|m4a))")) as $hit
| (capture_named($hit; "ext")) as $ext
| if $ext == null then null else ($ext | ascii_downcase) end;
def parse_share($text):
($text // "") as $raw
| ($raw | collapse_ws | trim) as $normalized
| (first_match($normalized; "(?<url>https?://pan\\.baidu\\.com/[\\w\\-_/\\?=]+)")) as $url_match
| (if $url_match == null then first_match($normalized; "(?<url>pan\\.baidu\\.com/[\\w\\-_/\\?=]+)") else $url_match end) as $url_match2
| (capture_named($url_match2; "url")) as $raw_url
| (sanitize_url($raw_url)) as $clean_url
| (first_match($normalized; "(?i)(提取码|密码)[:: ]*(?<code>[A-Za-z0-9]{4,8})")) as $code_match
| (capture_named($code_match; "code")) as $text_code
| (if $clean_url == null then null else first_match($clean_url; "(?i)[?&]pwd[=::]?(?<code>[A-Za-z0-9]{4,8})") end) as $url_code_match
| (capture_named($url_code_match; "code")) as $url_code
| (if ($text_code // "") != "" then ($text_code // "") else ($url_code // "") end) as $final_code
| (capture_extension($raw)) as $ext_hint
| { raw: $raw,
normalized: $normalized,
has_baidu_url: (($clean_url // "") != ""),
baidu_url: ($clean_url // ""),
baidu_code: ($final_code // ""),
url_contains_pwd: (if $clean_url == null then false else (($clean_url | test("(?i)\\?pwd"))) end),
extension_hint: ($ext_hint // ""),
issues: [
(if ($clean_url // "") == "" then "missing_url" else empty end),
(if ($final_code // "") == "" then "missing_code" else empty end)
] | map(select(. != null))
};
parse_share($ARGS.named.raw)
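A second jq program follows: a hand-rolled CSV parser plus a row mapper that turns the exported CSV into the JSON shape needed downstream.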
# Convert a single codepoint number back into a one-character string.
def char_code($c): [$c] | implode;

# Split one CSV line into fields, honoring quoted fields and doubled quotes.
# Codepoints: 34 = '"', 44 = ',', 13 = CR.
def csvsplit:
(. | explode) as $chars |
reduce range(0; $chars|length) as $idx (
{field:"", fields:[], in_quotes:false, skip:false};
if .skip then
.skip = false
else
($chars[$idx]) as $c |
if .in_quotes then
if $c == 34 then
if ($idx + 1 < ($chars|length)) and ($chars[$idx + 1] == 34) then
.field += char_code(34) | .skip = true
else
.in_quotes = false
end
else
.field += char_code($c)
end
else
if $c == 34 then
.in_quotes = true
elif $c == 44 then
.fields += [.field] | .field = ""
elif $c == 13 then
.
else
.field += char_code($c)
end
end
end
)
| .fields + [.field];
# Parse a whole CSV document into an array of {header: value} objects.
def parse_csv:
(split("\n")
| map(gsub("\r$"; ""))
| map(select(length > 0))) as $rows |
if ($rows|length) == 0 then []
else
($rows[0] | csvsplit) as $headers |
$rows[1:]
| map(
(. | csvsplit) as $cols |
reduce range(0; $headers|length) as $i (
{}; . + { ($headers[$i]): ($cols[$i] // "") }
)
)
end;
# Hardcoded exceptions: these two contestants are marked as not advancing.
def should_advance($name):
if $name == "殷茗飏" or $name == "黄堇忱" then
false
else
true
end;
# Map each CSV row to the output shape; the keys 电话/姓名 come from the Excel headers.
def process_csv($data):
$data | parse_csv
| map({
"Contact": .["电话"],
"Group": "独奏组",
"Name": .["姓名"],
"WetherAdvance": should_advance(.["姓名"])
});
# Read the raw CSV from stdin and process it
. as $input | process_csv($input)
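Since process_csv takes the whole file as one string, the script is meant to run with jq's raw slurp flags, e.g. `jq -R -s -f parse_csv.jq members.csv` (both filenames here are placeholders).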