📕 subnode [[@karlicoss/bleanser]] in 📚 node [[bleanser]]

GitHub page: karlicoss/bleanser

'Bleanser' stands for 'backup cleanser'.

The idea is figuring out 'redundant' backups and removing them to save space.

This is most relevant to incremental/synthetic-style data exports.

It's not necessarily hard to implement for a specific data source; the challenge is doing it in a data-source-agnostic way,
or at least with as little per-source effort as possible.

This is possible, for example, for JSON: if today's export is a superset of yesterday's, you can safely remove the old one. This actually works surprisingly well as is for many data sources.
For a few I've got slight adjustments that normalise the data before comparing, removing certain fields that change often but aren't very important. For example, Reddit upvote/downvote counts jump around constantly, so I just exclude them from the comparison.
It's similar to extracting the useful fields, except it filters out the useless ones instead. That makes it safer in case the backend adds new fields: I'd rather keep extra data than potentially lose useful information.
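
A minimal sketch of that normalise-then-compare idea, assuming the export is a JSON list of records; the VOLATILE set is a made-up example of noisy fields to ignore:

    import json

    VOLATILE = {'ups', 'downs'}  # hypothetical noisy fields, a la reddit votes

    def normalise(obj):
        """Drop volatile fields everywhere, keep everything else."""
        if isinstance(obj, dict):
            return {k: normalise(v) for k, v in obj.items() if k not in VOLATILE}
        if isinstance(obj, list):
            return [normalise(v) for v in obj]
        return obj

    def redundant(old_path, new_path):
        """True if every record in the old export is also in the new one,
        in which case the old file can be safely removed."""
        def records(path):
            with open(path) as f:
                data = normalise(json.load(f))
            # serialise each record so they can live in a set
            return {json.dumps(r, sort_keys=True) for r in data}
        return records(old_path) <= records(new_path)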

related [[exports]] [[backup]] [[infra]]

* motivation

reddit processing takes quite a bit of time… but I guess bleanser will optimize it [[hpi]] [[bleanser]] [[reddit]]

* ideas

pattern of handling unknown data sources [[toblog]]

  • lower bound: specify data (fields/files etc.) to preserve
    if you only do that, you might miss new useful data or schema changes like renames
  • ideally the two meet in the middle
    warn if we end up there, i.e. dropping isn't converging with picking, but keep the data
  • upper bound: specify data (fields/files etc.) to drop
    if you only do that, you end up with too much garbage

(see the sketch below)
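
A minimal sketch of the pattern, with hypothetical KEEP/DROP field sets standing in for the two bounds:

    KEEP = {'id', 'title', 'created'}   # lower bound: fields we know we want
    DROP = {'score', 'ups', 'downs'}    # upper bound: fields we know are noise

    def clean(record: dict) -> dict:
        kept = {k: v for k, v in record.items() if k not in DROP}
        unclassified = set(kept) - KEEP
        if unclassified:
            # dropping hasn't converged with picking: these fields are neither
            # explicitly preserved nor explicitly dropped; warn, but keep them
            print(f'warning: unclassified fields {unclassified}')
        return kept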

* specific data sources

firefox history? [[promnesia]] [[bleanser]]

github-events – prune via triplet approach?

xml: smscalls

not sure, maybe ignore comment/link karma? it results in lots of differences… [[reddit]]

* bugs

shit, dry run still left turds???

compress backups? [[infra]]

started compressing reddit…

[2019-03-25] also compressing rtm

[2021-01-11] move description to github

-----

generic sqlite databases export [[bleanser]]

  • do not remove; move to killzone
  • get all tables
  • make sure all schemas match
  • maybe convert it to json or something? and then compare jsons…
  • check that entries are dominated? (a sketch of these steps is below)
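
A minimal sketch of those steps, assuming each snapshot is a standalone sqlite file (the killzone move is left out):

    import sqlite3

    def schema(path):
        """Table names and their CREATE statements for a snapshot."""
        with sqlite3.connect(f'file:{path}?mode=ro', uri=True) as conn:
            return dict(conn.execute(
                "SELECT name, sql FROM sqlite_master WHERE type='table'"))

    def dominated(old_path, new_path):
        """True if every row of the old snapshot is also in the new one,
        table by table; schemas must match first."""
        tables = schema(old_path)
        if tables != schema(new_path):
            return False  # schema changed: keep both snapshots to be safe
        old = sqlite3.connect(f'file:{old_path}?mode=ro', uri=True)
        new = sqlite3.connect(f'file:{new_path}?mode=ro', uri=True)
        try:
            for table in tables:
                old_rows = set(old.execute(f'SELECT * FROM "{table}"'))
                new_rows = set(new.execute(f'SELECT * FROM "{table}"'))
                if not old_rows <= new_rows:
                    return False
            return True
        finally:
            old.close()
            new.close()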

hmm. they serve sort of the same purpose??? [[bleanser]] [[backupchecker]]

maybe 'dynamic' optimizer for bleanser? and later can use it to actually delete stuff [[hpi]]

[2021-03-02] I guess HPI could import it as a dependency..

github events via triples would be a good example [[bleanser]]

[2021-03-02] Search results · PyPI [[bleanser]]

could name like this…

[2021-02-27] trailofbits/graphtage: A semantic diff utility and library for tree-like files such as JSON, JSON5, XML, HTML, YAML, and CSV. [[bleanser]]

ok, triples for browser history are def gonna be impactful [[bleanser]]

maybe before comparison explicitly 'cleanup' stuff

[2021-04-07] Memory Filesystem — PyFilesystem 2.4.13 documentation [[bleanser]]

could use it for processing… maybe via an option

warn about being disk/tmp intensive? [[bleanser]]

write about multiprocessing? [[bleanser]]

'extract' query [[bleanser]]

might be useful as a sanity check? to ensure stuff isn't deleted by accident? (like foreign key triggers)
e.g.

  • run extract query first to get a snapshot of data
  • run cleanup query
  • run extract query again to ensure the data we care about is still there? (sketch below)
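
A minimal sketch; EXTRACT and CLEANUP here are hypothetical placeholder queries:

    import sqlite3

    EXTRACT = 'SELECT id, url, visit_date FROM visits'  # data we care about
    CLEANUP = 'DELETE FROM visits_metadata'             # volatile stuff to strip

    def safe_cleanup(path):
        conn = sqlite3.connect(path)
        before = set(conn.execute(EXTRACT))  # snapshot before cleanup
        conn.execute(CLEANUP)
        after = set(conn.execute(EXTRACT))   # same query after cleanup
        assert before == after, 'cleanup dropped data we meant to keep!'
        conn.commit()
        conn.close()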

hmm, for attributes that can change back and forth in json, the sorted strategy isn't the best… ugh [[bleanser]]

sqlite: hmm… not sure about cascades… probably need to disable them somehow? [[bleanser]]
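
For what it's worth, foreign key enforcement (which is what drives ON DELETE CASCADE) is per-connection in sqlite and off by default in Python's sqlite3, so making it explicit before running cleanup deletes may be enough:

    import sqlite3

    conn = sqlite3.connect('snapshot.sqlite')
    # cascades only fire when the foreign_keys pragma is on;
    # keep it off so cleanup deletes don't ripple into kept tables
    conn.execute('PRAGMA foreign_keys = OFF')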

investigating diffs [[bleanser]]

shell globbing is nice for not typing too much

python3 -m bleanser.modules.firefox diff /path/to/database-201906{16,17}*.sqlite | less

for properly impressive demo should prob run in single threaded mode? [[bleanser]]

run tox first? to protect from crashes [[bleanser]] [[setup]]

could artificially map jsons to line-based format (with full path to the entity?) [[bleanser]]

that way might work more reliably… hmm
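
A minimal sketch of that mapping: one 'path: value' line per leaf, so snapshots can be compared with plain line-based tools (sort/diff/comm):

    import json

    def json_to_lines(obj, path='$'):
        """Yield one line per JSON leaf, prefixed with its full path."""
        if isinstance(obj, dict):
            for k, v in sorted(obj.items()):
                yield from json_to_lines(v, f'{path}.{k}')
        elif isinstance(obj, list):
            for i, v in enumerate(obj):
                yield from json_to_lines(v, f'{path}[{i}]')
        else:
            yield f'{path}: {json.dumps(obj)}'

Keeping the list index in the path pins each entity to its position; dropping it would make the representation order-insensitive, which might suit rolling exports better.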

kinds of snapshots [[toblog]] [[bleanser]]

  • append only (e.g. foursquare, hypothesis)
  • rolling (e.g. rescuetime, github, reddit)

either way you can think of a snapshot as a set of strings
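
With that model, a minimal sketch of the triple-based pruning mentioned above for github events and browser history, assuming snapshots come in chronological order as (name, set of lines) pairs:

    def prune_triples(snapshots):
        """Drop any snapshot fully covered by its two neighbours;
        the first and last snapshots are always kept."""
        keep = list(snapshots)
        i = 1
        while i < len(keep) - 1:
            _, prev = keep[i - 1]
            _, cur = keep[i]
            _, nxt = keep[i + 1]
            if cur <= prev | nxt:  # middle snapshot adds nothing new
                del keep[i]        # drop it and re-check the new middle
            else:
                i += 1
        return keep

Append-only sources collapse towards the newest snapshot; rolling sources keep just the files at the window boundaries.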

moving old files – not sure what to do about empty dirs? [[bleanser]]

maybe keep all dirs that were there before – and only remove new empty dirs?
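
A minimal sketch of that idea, assuming the set of pre-existing directories is recorded before any files are moved:

    import os

    def remove_new_empty_dirs(root, preexisting):
        """Remove dirs that became empty, unless they existed before the move;
        walk bottom-up so emptied parents are caught too."""
        for dirpath, _, _ in os.walk(root, topdown=False):
            if dirpath not in preexisting and not os.listdir(dirpath):
                os.rmdir(dirpath)

    # record beforehand: preexisting = {d for d, _, _ in os.walk(root)}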

implement 'extract' mode later… after writing to blog definitely [[bleanser]]

readme: gotcha about group boundaries not being removed (and having an empty diff) [[bleanser]]

document what's happening in which case… with a literate test [[bleanser]]

  • e.g. 'all files are same'
  • only added data
  • rolling data (some fake datetime stuff with 30d retention)
  • error in cleaner script

proper end2end test — could run against firefox? reinstalled around 2020-06, could track by file size changes [[bleanser]]

multiway is a bit more speculative [[bleanser]] [[toblog]]

add for takeouts… I even had some script to compare it somewhere [[bleanser]]

hypothesis, endomondo, pinboard – just use regular json processor [[bleanser]]

old code for 'extract' bit [[bleanser]] [[pinboard]]

 # presumably 'pipe' composes these jq filters into one extraction query
 return pipe(
     '.tags  |= .',
     '.posts |= map({href, description, time, tags})',  # TODO maybe just delete hash?
     '.notes |= {notes: .notes | map({id, title, updated_at}), count}',  # TODO hmm, it keeps length but not content?? odd.
 )

json: sorting stuff can definitely make it more confusing when there is just one volatile attribute flipping between two values [[bleanser]]

e.g. on foursquare with isMayor: true – hmmm

foursquare is a good motivation – lots of randomly changing crap even without changes to the underlying data? [[bleanser]]
