# Datasets, Kernels, Models, and Problems

As we start publishing more datasets and models, it is important to keep in mind why we're doing this.

> We publish datasets because we want to contribute back to the Open Source and Machine Learning communities.

We consider datasets and models to be good when they are:

- discoverable,
- reproducible, and
- reusable.

Keeping all of this in mind, let me propose a way to write documentation for these.

## A Common Vocabulary

The relationship between datasets, kernels, models, and the other concepts discussed here can be expressed with the following graph.

<!--
To rebuild the graph above, run:

$ dot -Tpng -o graph.png

And give the following as input:

digraph G {
    dataset -> kernel [ label = "feeds" ];
    {kernel dataset} -> model [ label = "generates" ];
    model -> problem [ label = "solves" ];
    predictor -> model [ label = "uses" ];
    predictor -> problem [ label = "solves" ];
}
-->

The following sections describe each of these concepts in more detail, but first, a quick introduction to each of them.

### Problems

Everything we do at source{d} revolves around solving problems and making predictions. Problems are the starting motivation and the end point of most of our Machine Learning processes.

Problems have a clear objective, and a measure of success that lets us rank different solutions to any problem objectively. Think about accuracy, recall, etc.

An example problem could be predicting the next key a developer will press, given what they've written so far.
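
As a sketch of such an objective measure, accuracy for this next-key problem could be computed by comparing predictions against the keys actually pressed. The keystrokes and predictions below are made up purely for illustration:

```python
# Made-up keystrokes and predictions; only the measure itself matters.
actual_keys    = list(b"print")  # keys the developer actually pressed
predicted_keys = list(b"paint")  # keys some hypothetical model predicted

# Accuracy: fraction of positions where the prediction matched reality.
matches = sum(a == p for a, p in zip(actual_keys, predicted_keys))
accuracy = matches / len(actual_keys)
print(f"accuracy: {accuracy:.2f}")  # 4 of the 5 keys match, so 0.80
```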

### Models

Problems are solved using Models. Models are trained to solve a specific problem by feeding a Dataset to a Kernel that optimizes a set of parameters. These parameters, once optimized, are what models are made of.

Models can be treated as black boxes, where the only thing we care about is the input and output formats. This makes it possible to reuse a model, either to solve the same problem again or to feed into a different model (through knowledge transfer or other techniques).

Given the previous problem of predicting the next key pressed, a model could take as input the sequence of all keys pressed so far, as ASCII codes, and produce as output a single ASCII code with the prediction.
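
That black-box contract could be sketched as follows. The class name and its placeholder prediction logic are hypothetical, not an existing source{d} API:

```python
class NextKeyModel:
    """Black box: input is the sequence of ASCII codes pressed so far,
    output is a single ASCII code predicting the next key."""

    def predict(self, keys_so_far: list[int]) -> int:
        # A trained model would apply its optimized parameters here;
        # this placeholder simply repeats the last key pressed.
        return keys_so_far[-1] if keys_so_far else ord(" ")

model = NextKeyModel()
next_key = model.predict(list(b"def mai"))
print(chr(next_key))  # the placeholder predicts 'i' again
```

Any real model with the same input and output formats could be swapped in behind this interface, which is exactly what makes the black-box view useful for reuse.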

A secondary goal of models is to be reproducible, meaning that someone could repeat the same process we went through and expect to obtain a similar result. If the kernel that generated the model requires metaparameters (such as the learning rate), these values should also be documented.

This is normally documented in research papers, with references to the datasets and kernels that were used, as well as how much training time it took to obtain the resulting model.

### Kernels

Kernels are algorithms that feed on datasets and generate models. These algorithms are responsible for describing the model architecture chosen to solve a problem (e.g. RNN, CNN) and the metaparameters that were used.
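
As a toy illustration (not one of our real kernels), a kernel can be pictured as a function that feeds on a dataset and generates a model, where the model's optimized 'parameters' are just the most frequent follower of each key:

```python
from collections import Counter

def toy_kernel(dataset: list[int]) -> dict[int, int]:
    """Feeds on a dataset (a sequence of ASCII codes) and generates a
    model: a table mapping each key to its most frequent successor."""
    followers: dict[int, Counter] = {}
    for prev, nxt in zip(dataset, dataset[1:]):
        followers.setdefault(prev, Counter())[nxt] += 1
    return {key: counts.most_common(1)[0][0] for key, counts in followers.items()}

model_params = toy_kernel(list(b"abcabcabd"))
print(chr(model_params[ord("a")]))  # 'b' follows 'a' most often in this data
```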

### Datasets

Datasets contain information retrieved from one or more data sources, then pre-processed so it can easily be used to answer questions, solve problems, train models, or even serve as the data source for another dataset.

The most important aspects of a dataset are its format, how to download it, how to reproduce it, and what exactly each version contains.

Datasets evolve over time, so it's important to have versions that trained models can refer to explicitly.

### Predictors

The last piece of the puzzle is what I call a predictor. A predictor uses a model (sometimes more than one, sometimes no model at all) to predict the answer to a question given some input.

For instance, given a model trained on a large dataset of the keystrokes of thousands of developers, we could write a predictor that uses that trained model to make predictions. That would be a pretty decent predictor.

But we could also use a simple function that outputs random ASCII codes, ignoring any other information available. This predictor would probably have a lower accuracy for the given problem.
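
Such a no-model predictor is easy to sketch, along with a rough estimate of how badly it does; the function below is a made-up baseline, not anything we ship:

```python
import random

def random_predictor(_keys_so_far: list[int]) -> int:
    # Uses no model at all and ignores every signal available.
    return random.randrange(128)

# Against uniformly random targets its expected accuracy is 1/128.
random.seed(0)  # fixed seed so the estimate is repeatable
trials = 10_000
hits = sum(random_predictor([]) == random.randrange(128) for _ in range(trials))
print(f"empirical accuracy: {hits / trials:.4f}")  # close to 1/128, about 0.0078
```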

## Documenting these Artifacts

So far we've documented models and some datasets to a certain extent, but I think it's time to provide a framework for all of these elements to be documented uniformly, improving the discoverability, reproducibility, and reusability of our results.

We will evolve our documentation over time into something that will hopefully delight every one of our engineers and users. But for now, let's keep it realistic and propose a reduced set of measures we can start applying today to evolve towards that perfect solution.

## Current Status

Currently we document only datasets and models, in two different repositories: github.com/src-d/datasets and github.com/src-d/models.

We also have a modelforge tool that is intended to provide a way to discover and download existing models.

### Datasets

We currently have only one public dataset: Public Git Archive. For this dataset we document:

- how to download the current version of the dataset with the `pga` CLI tool
- how to reproduce the dataset with borges and GHTorrent

What are we missing?

- versioning of the resulting dataset: how do we download this and previous versions?
- the format of the dataset
- what other datasets (and versions) were used to generate it
- what models have been trained with this dataset
- a LICENSE (the tools and scripts are licensed, but the datasets are not?)

### Models

Models are already documented following some structure, thanks to the efforts put in place for [modelforge](https://github.com/src-d/modelforge).

Currently models have an ID, which looks like a long random string such as `f64bacd4-67fb-4c64-8382-399a8e7db52a`.

Models are accompanied by an example of how to use them; unfortunately, the examples are a bit simpler than expected. They mostly look like this:

```python
from ast2vec import DocumentFrequencies
df = DocumentFrequencies().load()
print("Number of tokens:", len(df))
```

What are we missing?

- versioned models, corresponding to versioned datasets
- a reference to the code (kernel) that was used to generate the model
- a technical sheet with accuracy, recall, etc. for the given model and dataset
- the format of the input and output of the model
- at least one example using the model to make a prediction

## My Proposal

Since we care about individual versioning of datasets and models, the obvious choice is to use one git repository per dataset and per model.

Problems, predictors, and kernels can, for now, be documented directly with a model. If we start to see too much repetition because many models solve a single problem, we will reassess this decision.

### Dataset Repository

A dataset repository should contain the following information:

- short description
- long description, with links to papers and blog posts
- technical sheet
  - size of the dataset
  - schema(s) of the dataset
  - download link
- using the dataset:
  - downloading the dataset
  - related tools
- reproducing the dataset:
  - link to the original data sources
  - related tools

### Model Repository

A model repository should contain the following information:

- short description
- long description, with links to papers and blog posts
- technical sheet
  - size of the model
  - input/output schemas
  - download link
  - datasets used to train the model (including versions)
- using the model:
  - downloading the model
  - loading the model
  - prerequisites (TensorFlow? Keras?)
  - quick guide: making a prediction
- reproducing the model:
  - link to the original dataset
  - kernel used to train the model
  - training process
    - hardware and time spent
    - metaparameters, if any
    - any other relevant details

### General

As with any other source{d} repository, we need to follow the guidelines in [Documentation at source{d}](https://github.com/src-d/guide/blob/master/engineering/documentation.md). This includes having a CONTRIBUTING.md, a Code of Conduct, etc.

Every time a new version of a dataset or model is released, a new tag and associated release should be created in the repository. The release should link to everything that has changed since the previous release, such as a new version of the datasets or changes in the kernel.
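
The release step could be as simple as an annotated tag whose message summarizes the changes. The commands below are a sketch run against a throwaway repository so they are self-contained; the tag name and message are invented:

```shell
set -e
# Stand-in for a dataset or model repository (throwaway, for illustration).
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.name demo
git config user.email demo@example.com
git commit -q --allow-empty -m "document dataset"

# Tag the new version; the message records what changed since the last release.
git tag -a v1.1.0 -m "dataset regenerated; kernel unchanged"
git tag --list  # prints: v1.1.0
```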

### github.com/src-d/datasets and github.com/src-d/models

These two repositories should simply contain what is common to all datasets, or to all models. They will also provide all the tooling built on top of the documentation for datasets and models.

Since we imagine these tools extracting information from the repositories automatically, it is important to keep formatting in mind.

I'm currently considering whether we should define a `toml` file containing the data common to all datasets and models. For instance, we could have the download size for each dataset and model, as well as the associated schemas. A simple tool could then generate documentation based on these values.
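
For illustration only, such a file might look like this; every field name and value below is a placeholder, since no such schema is defined yet:

```toml
# Hypothetical dataset.toml; nothing here is a defined schema yet.
name = "Public Git Archive"
version = "1.0.0"          # placeholder version
download-size = "unknown"  # placeholder; a tool would render the real value
license = "unknown"        # see the open LICENSE question above

[schema]
description = "placeholder for the dataset's schema definition"
```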