
Commit 8675ab4

[WIP] a proposal to document all datasets and models
Signed-off-by: Francesc Campoy <campoy@golang.org>
# Datasets, Kernels, Models, and Problems

As we start publishing more datasets and models, it is important to keep in mind why we're doing this.

> We publish datasets because we want to contribute back to the Open Source and Machine Learning communities.

We consider datasets and models to be good when they are:

- discoverable,
- reproducible, and
- reusable.

Keeping all of this in mind, let me propose a way to write documentation for these.
## A Common Vocabulary

The relationships between datasets, models, and the other concepts involved are fairly well established, and can be expressed in the following graph.

![dataset graph](graph.png)
<!--
To rebuild the graph above, run:

    $ dot -Tpng -o graph.png

and give the following as input:

    digraph G {
        dataset -> kernel [ label = "feeds" ];
        {kernel dataset} -> model [ label = "generates" ];
        model -> problem [ label = "solves" ];
        predictor -> model [ label = "uses" ];
        predictor -> problem [ label = "solves" ];
    }
-->
The following sections go into more detail on each of these concepts,
but first let me give a quick introduction to each of them.
### Problems

Everything we do at source{d} is around solving problems and
making predictions. Problems are the starting motivation
and ending point of most of our Machine Learning processes.

Problems have a clear objective, and a measure of success that
lets us rank different solutions to any problem in an objective
way. Think about accuracy, recall, etc.

An example problem could be predicting the next key
a developer will press given what they've written so far.
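
To make the measure of success concrete, here is a minimal sketch (the function name and setup are hypothetical, not existing code) of ranking solutions to this problem by top-1 accuracy over a held-out set of keystrokes:

```python
def accuracy(predicted_keys, actual_keys):
    """Fraction of keystrokes predicted correctly (top-1 accuracy)."""
    correct = sum(1 for p, a in zip(predicted_keys, actual_keys) if p == a)
    return correct / len(actual_keys)

# Any two solutions to the problem can now be ranked objectively:
# the one with higher accuracy on the same held-out keystrokes wins.
```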
### Models

Problems are solved using Models. Models are trained
to solve a specific problem by feeding a Dataset to a
Kernel that optimizes a set of parameters.
These parameters, once optimized, are what models are made of.

Models can be considered black boxes, where the only thing
we care about is the input and output formats. This makes it
possible to reuse a model, either to solve the same problem
or to feed into a different model (through knowledge
transfer or other techniques).

Given the previous problem of predicting the next key pressed,
a model could take as input the sequence of all keys pressed
so far, as ASCII codes, and output a single ASCII
code with the prediction.
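
Seen as a black box, such a model boils down to a single function signature. The sketch below is purely hypothetical (the name `predict_next_key` does not exist in our code); its point is that reusers depend only on the input/output format:

```python
from typing import Sequence

def predict_next_key(keys_so_far: Sequence[int]) -> int:
    """Black-box contract: ASCII codes in, one ASCII code out."""
    # Whether an RNN, a CNN, or a lookup table sits behind this call is an
    # internal detail; only the input/output format is part of the contract.
    return 32  # trivial placeholder: always predict a space (ASCII 32)
```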
A secondary goal of models is to be reproducible, meaning that
someone could repeat the same process we went through and
expect to obtain a similar result. If the kernel that generated
the model requires metaparameters (such as learning rate),
these values should also be documented.

This is normally documented in research papers, with references
to what datasets and kernels were used, as well as how much
training time it took to obtain the resulting model.
### Kernels

Kernels are algorithms that feed from datasets and
generate models. These algorithms are responsible for describing
the model architecture chosen to solve a problem (e.g. RNN,
CNN, etc.) and what metaparameters were used.
### Datasets

Datasets contain information retrieved from one or more
data sources, then pre-processed so it can easily be used to
answer questions, solve problems, train models, or even serve as
the data source for another dataset.

The most important aspects of a dataset are its format, how to
download it, how to reproduce it, and exactly what each
version contains.

Datasets evolve over time, and it's important to have versions
that can be explicitly referred to from trained models.
### Predictors

The last piece of the puzzle is what I call a predictor.
A predictor uses a model (sometimes more than one, sometimes no
model at all) to predict the answer to a question given some input.

For instance, given a model trained with a large dataset of
the keystrokes of thousands of developers, we could write a
predictor that uses that trained model to make predictions.
That would be a pretty decent predictor.

But we could also use a simple function that outputs random
ASCII codes, ignoring any other information available. This
predictor would probably have a lower accuracy for the given
problem.
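
That baseline is short enough to write out in full. A minimal sketch (hypothetical name, standard library only):

```python
import random

def random_predictor(keys_so_far):
    """Ignores all available information and returns a random ASCII code."""
    return random.randint(0, 127)

# This function and a model-backed predictor solve the same problem, so the
# problem's measure of success (e.g. accuracy) can rank them objectively.
```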
## Documenting these Artifacts

So far we've documented models and some datasets to a certain
extent, but I think it's time to provide a framework for all
of these elements to be uniformly documented, to improve the
discoverability, reproducibility, and reusability of our
results.

We will evolve our documentation over time into something that
will hopefully delight every one of our engineers and users.
But for now, let's keep it realistic and propose a reduced set
of measures we can start applying today to evolve towards that
perfect solution.
## Current status

Currently we document only datasets and models in two different
repositories: github.com/src-d/datasets and
github.com/src-d/models.

We also have a modelforge tool that is intended to provide a way
to discover and download existing models.
### Datasets

We currently have only one public dataset: Public Git Archive.
For this dataset we document:

- how to download the current version of the dataset with the `pga` CLI tool
- how to reproduce the dataset with borges and GHTorrent

What are we missing?

- versioning of the resulting dataset: how to download this and previous versions?
- the format of the dataset
- what other datasets (and versions) were used to generate this one?
- what models have been trained with this dataset
- a LICENSE (the tools and scripts are licensed, but not the datasets?)
### Models

Models are already documented with some structure, thanks to the
efforts put in place for [modelforge](https://github.com/src-d/modelforge).

Currently models have an ID, which is a long random string such as
`f64bacd4-67fb-4c64-8382-399a8e7db52a`.

Models are accompanied by an example of how to use them; unfortunately, the
examples are a bit simpler than expected. They mostly look like this:

```python
from ast2vec import DocumentFrequencies
df = DocumentFrequencies().load()
print("Number of tokens:", len(df))
```
What are we missing?

- Versioned models, corresponding to versioned datasets.
- A reference to the code (kernel) that was used to generate the model.
- A technical sheet with accuracy, recall, etc. for the given model and dataset.
- The format of the input and output of the model.
- At least one example using the model to make a prediction.
## My Proposal

Since we care about individual versioning of datasets and models,
the obvious choice is to use one git repository per dataset
and per model.

Problems, predictors, and kernels can, for now, be documented directly with
a model. If we start to see too much repetition because we have
many models for a single problem, we will reassess this decision.
### Dataset Repository

A dataset repository should contain the following information:

- short description
- long description and links to papers and blog posts
- technical sheet
- size of dataset
- schema(s) of the dataset
- download link
- using the dataset:
  - downloading the dataset
  - related tools
- reproducing the dataset:
  - link to the original data sources
  - related tools
### Model Repository

A model repository should contain the following information:

- short description
- long description and links to papers and blog posts
- technical sheet
- size of model
- input/output schemas
- download link
- datasets used to train the model (including versions)
- using the model:
  - downloading the model
  - loading the model
  - prerequisites (tensorflow? keras?)
  - quick guide: making a prediction
- reproducing the model:
  - link to the original dataset
  - kernel used to train the model
  - training process
  - hardware and time spent
  - metaparameters, if any
  - any other relevant details
### General

As with any other source{d} repository, we need to follow the guidelines in
[Documentation at source{d}](https://github.com/src-d/guide/blob/master/engineering/documentation.md).
This includes having a CONTRIBUTING.md, a Code of Conduct, etc.

Every time a new version of a dataset or model is released, a new tag and
associated release should be created in the repository.
The release should include links to anything that has changed since the
previous release, such as a new version of the dataset or changes in
the kernel.
### github.com/src-d/datasets and github.com/src-d/models

These two repositories should simply contain what is common to all datasets
or to all models. They will also provide all the tooling built on top of
the documentation for datasets and models.

Since we imagine these tools extracting information from the repositories
automatically, it is important to keep formatting in mind.

I'm currently considering whether a `toml` file should be defined containing
the data common to all the datasets and models.
For instance, we could have the download size for each dataset and model,
as well as the associated schemas. A simple tool could then generate
documentation based on these values.
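
To make this concrete, such a file could look something like the sketch below; every field name is a strawman to illustrate the idea, not a settled format:

```toml
# dataset.toml -- strawman metadata, field names are not final
[dataset]
name        = "public-git-archive"
version     = "v1.0.0"
description = "..."
size        = "..."
license     = "..."

[download]
tool = "pga"
url  = "..."

# one entry per schema the dataset ships with
[[schemas]]
name   = "..."
format = "..."
```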
