top of page
Vyhledat
  • Obrázek autoraŠtěpán Vácha

Usefull BigQuery SQL snippets for data analytics

Aktualizováno: 6. 12. 2021

Word Count

Usefull for word cloud analysis

SELECT 
lower(fn.replace_cz_chars_regex(trim(word))) as word, --remove diacritics ect
count(*) as cnt 
FROM `my_project.my_table` 
,UNNEST(split(my_column,' ')) word 
where length(word)>2 --or NOT IN list of stop words. characters
group by 1 
having(count(*))>5 -- line of interrest
order by 2 desc

Replace CZ and other diacritics and other OCR chars in BigQuery effective way

Most effective especially for large data processing is use of regexp_replace


CREATE OR REPLACE FUNCTION `steady-vine-203410.fn.replace_cz_chars_regex`(str STRING) AS (
regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace
(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace
(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(
regexp_replace(regexp_replace(regexp_replace(regexp_replace(regexp_replace(
str,
r"Á|À|Â|Ã|Ä|Å|Ā|Ą|Æ|А",'A'),r"à|á|â|ã|ä|å|ā|ą|æ|а",'a'),r"ç|č|ć|¢|с",'c'),r"Ç|Ć|Č|С",'C'),r"đ|ď",'d'),r"Đ|Ď",'D'),r"È|É|Ê|Ë|Ě|Ē|Ę|Е",'E'),r"ė|è|é|ê|ë|ě|ē|ę|е",'e'),r"ì|í|î|ï|Ī|ī|і",'i'),r"Ì|Í|Î|Ï|І",'I'),r"ł|£",'l')
,r"'Ł",'L'),r"ñ|ň|ń",'n'),r"Ñ|Ň|Ń",'N'),r"Ò|Ó|Ô|Õ|Õ|Ö|Ø|Ō",'O'),r"ò|ó|ô|õ|ö|ø|ō",'o'),r"Ř|Ŕ",'R'),r"ř|ŕ",'r'),r"Š|Ś|Ѕ",'S'),r"š|ś|ß|ѕ|Ş",'s'),r"ť",'t'),r"Ť|Т|т",'T'),r"Ù|Ú|Û|Ü|Ů|Ū",'U'),r"ù|ú|û|ü|ů|ū|µ",'u'),r"×|х",'x')
,r"¥|Ÿ|Ý",'Y'),r"ÿ|ý|У|у",'y'),r"Ž|Ż|Ź",'Z'),r"ž|ż|ź",'z'),r"Р",'P'),r"р|Þ",'p'),r"Ԛ",'Q'),r"ԛ",'q'),r"В|в",'B'),r"Ј",'J'),r"ј","j"),r"К",'K'),r"к",'k'),r"М",'M'),r"м",'m'),r"Н|н",'H'),r"О",'O'),r"о",'o'),r"Х",'X'), r"З|з|Э|э",'3')
);


35 zobrazení0 komentářů

Nejnovější příspěvky

Zobrazit vše
bottom of page