Commit 52c873e5 authored by Lars Prehn's avatar Lars Prehn

Pushing Public Repo

THE CRAPL v0 BETA 1
0. Information about the CRAPL
If you have questions or concerns about the CRAPL, or you need more
information about this license, please contact:
Matthew Might
http://matt.might.net/
I. Preamble
Science thrives on openness.
In modern science, it is often infeasible to replicate claims without
access to the software underlying those claims.
Let's all be honest: when scientists write code, aesthetics and
software engineering principles take a back seat to having running,
working code before a deadline.
So, let's release the ugly. And, let's be proud of that.
II. Definitions
1. "This License" refers to version 0 beta 1 of the Community
Research and Academic Programming License (the CRAPL).
2. "The Program" refers to the medley of source code, shell scripts,
executables, objects, libraries and build files supplied to You,
or these files as modified by You.
[Any appearance of design in the Program is purely coincidental and
should not in any way be mistaken for evidence of thoughtful
software construction.]
3. "You" refers to the person or persons brave and daft enough to use
the Program.
4. "The Documentation" refers to the Program.
5. "The Author" probably refers to the caffeine-addled graduate
student that got the Program to work moments before a submission
deadline.
III. Terms
1. By reading this sentence, You have agreed to the terms and
conditions of this License.
2. If the Program shows any evidence of having been properly tested
or verified, You will disregard this evidence.
3. You agree to hold the Author free from shame, embarrassment or
ridicule for any hacks, kludges or leaps of faith found within the
Program.
4. You recognize that any request for support for the Program will be
discarded with extreme prejudice.
5. The Author reserves all rights to the Program, except for any
rights granted under any additional licenses attached to the
Program.
IV. Permissions
1. You are permitted to use the Program to validate published
scientific claims.
2. You are permitted to use the Program to validate scientific claims
submitted for peer review, under the condition that You keep
modifications to the Program confidential until those claims have
been published.
3. You are permitted to use and/or modify the Program for the
validation of novel scientific claims if You make a good-faith
attempt to notify the Author of Your work and Your claims prior to
submission for publication.
4. If You publicly release any claims or data that were supported or
generated by the Program or a modification thereof, in whole or in
part, You will release any inputs supplied to the Program and any
modifications You made to the Program. This License will be in
effect for the modified program.
V. Disclaimer of Warranty
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT
WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT
LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND
PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE
DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR
CORRECTION.
VI. Limitation of Liability
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR
CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES
ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT
NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR
LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM
TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER
PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
This repository contains all scripts necessary to reproduce the major results presented in:
> Prehn, Lars and Feldmann, Anja. "How biased is our Validation (Data) for AS Relationships?" Proceedings of the ACM Internet Measurement Conference. 2021.
You can also find a copy of the publication in this repository; see [here](/imc21_prehn_breval_crv210928.pdf).<br>
All scripts in this repository can be found in scripts/. In particular, the structure within scripts/ looks as follows:
~~~
.
├── geo_cov # Reproduce Results for Figure 1
│   ├── asn2rir.py
│   ├── calc_coverage.py
│   ├── get_links_per_group.py
│   └── README.md
├── perf # Reproduce Results for Table 1-3
│   ├── analyze_correlations_bu.py
│   ├── calc_confusion_matrices.py
│   ├── calc_performance.py
│   └── README.md
├── populate_datasets # Download and populate all the data sources needed for the scripts
│   ├── populate.sh
│   └── README.md
├── topo_cov # Reproduce Results for Figure 2
│   ├── calc_coverage.py
│   ├── get_links_per_group.py
│   ├── produce_topo_classifier.py
│   └── README.md
├── transit_link_imbalance # Reproduce Results for Figure 3
│   ├── calc_transit_link_sizes.py
│   └── README.md
└── utils.py # Functions to read/write data sources easily
~~~
Each sub-directory contains a README.md file that explains the requirements, script calls, and outputs that can be generated. The first directory that needs to be entered is scripts/populate_datasets/.
Please note that this repository is licensed under the CRAPL; please treat it as such.
In case you have any questions, please contact
Lars Prehn \<lprehn@mpi-inf.mpg.de\><br>
Anja Feldmann \<anja@mpi-inf.mpg.de\>
0,Reserved
1-1876,ARIN
1877-1901,RIPENCC
1902-2042,ARIN
2043,RIPENCC
2044-2046,ARIN
2047,RIPENCC
2048-2106,ARIN
2107-2136,RIPENCC
2137-2584,ARIN
2585-2614,RIPENCC
2615-2772,ARIN
2773-2822,RIPENCC
2823-2829,ARIN
2830-2879,RIPENCC
2880-3153,ARIN
3154-3353,RIPENCC
3354-4607,ARIN
4608-4865,APNIC
4866-5376,ARIN
5377-5631,RIPENCC
5632-6655,ARIN
6656-6911,RIPENCC
6912-7466,ARIN
7467-7722,APNIC
7723-8191,ARIN
8192-9215,RIPENCC
9216-10239,APNIC
10240-12287,ARIN
12288-13311,RIPENCC
13312-15359,ARIN
15360-16383,RIPENCC
16384-17407,ARIN
17408-18431,APNIC
18432-20479,ARIN
20480-21503,RIPENCC
21504-23455,ARIN
23456,AS_TRANS
23457-23551,ARIN
23552-24575,APNIC
24576-25599,RIPENCC
25600-26623,ARIN
26624-27647,ARIN
27648-28671,LACNIC
28672-29695,RIPENCC
29696-30719,ARIN
30720-31743,RIPENCC
31744-32767,ARIN
32768-33791,ARIN
33792-34815,RIPENCC
34816-35839,RIPENCC
35840-36863,ARIN
36864-37887,AFRINIC
37888-38911,APNIC
38912-39935,RIPENCC
39936-40959,ARIN
40960-41983,RIPENCC
41984-43007,RIPENCC
43008-44031,RIPENCC
44032-45055,RIPENCC
45056-46079,APNIC
46080-47103,ARIN
47104-48127,RIPENCC
48128-49151,RIPENCC
49152-50175,RIPENCC
50176-51199,RIPENCC
51200-52223,RIPENCC
52224-53247,LACNIC
53248-54271,ARIN
54272-55295,ARIN
55296-56319,APNIC
56320-57343,RIPENCC
57344-58367,RIPENCC
58368-59391,APNIC
59392-60415,RIPENCC
60416-61439,RIPENCC
61440-61951,LACNIC
61952-62463,RIPENCC
62464-63487,ARIN
63488-63999,APNIC
64000-64098,APNIC
64099-64197,LACNIC
64198-64296,ARIN
64297-64395,APNIC
64396-64495,RIPENCC
64496-64511,Reservedforuseindocumentationandsamplecode
64512-65534,ReservedforPrivateUse
65535,Reserved
65536-65551,Reservedforuseindocumentationandsamplecode
65552-131071,Reserved
131072-132095,APNIC
132096-133119,APNIC
133120-133631,APNIC
133632-134556,APNIC
134557-135580,APNIC
135581-136505,APNIC
136506-137529,APNIC
137530-138553,APNIC
138554-139577,APNIC
139578-140601,APNIC
140602-141625,APNIC
141626-142649,APNIC
142650-143673,APNIC
143674-196607,Unallocated
196608-197631,RIPENCC
197632-198655,RIPENCC
198656-199679,RIPENCC
199680-200191,RIPENCC
200192-201215,RIPENCC
201216-202239,RIPENCC
202240-203263,RIPENCC
203264-204287,RIPENCC
204288-205211,RIPENCC
205212-206235,RIPENCC
206236-207259,RIPENCC
207260-208283,RIPENCC
208284-209307,RIPENCC
209308-210331,RIPENCC
210332-211355,RIPENCC
211356-212379,RIPENCC
212380-213403,RIPENCC
213404-262143,Unallocated
262144-263167,LACNIC
263168-263679,LACNIC
263680-264604,LACNIC
264605-265628,LACNIC
265629-266652,LACNIC
266653-267676,LACNIC
267677-268700,LACNIC
268701-269724,LACNIC
269725-270748,LACNIC
270749-271772,LACNIC
271773-272796,LACNIC
272797-327679,Unallocated
327680-328703,AFRINIC
328704-329727,AFRINIC
329728-393215,Unallocated
393216-394239,ARIN
394240-395164,ARIN
395165-396188,ARIN
396189-397212,ARIN
397213-398236,ARIN
398237-399260,ARIN
399261-400284,ARIN
400285-401308,ARIN
401309-4199999999,Unallocated
4200000000-4294967294,ReservedforPrivateUse
4294967295,Reserved
The files in this directory allow you to reproduce the results of Figure 1.
First, we'll read the RIR files and check (i) whether the files are corrupted and (ii) whether the lookup process works properly. Please run:
~~~
python3 asn2rir.py
~~~
which should produce the following output:
~~~
############### Comparing data to summaries ###############
afrinic: received 2302 out of 2302 promised record, 100%
apnic: received 8804 out of 8804 promised record, 100%
arin: received 26873 out of 26873 promised record, 100%
lacnic: received 8496 out of 8496 promised record, 100%
ripencc: received 36376 out of 36376 promised record, 100%
################## Running random tests ###################
I ran 82851 random test lookups out of which 82851 (100%) succeeded.
~~~
Each RIR delegation file carries a summary of its record count; hence, the first part simply compares the amount of data that was read against those summaries. If this does not yield 100% for all RIRs, please ensure that the file for the affected RIR is not corrupted (or just re-download it). For each ASN record that we read, we produce one test case, i.e., we select one random entry from the described range and later use it to test the lookup.
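The lookup itself relies on `find_le`, imported from utils.py (which is not shown here). Assuming it follows the standard predecessor-search recipe from Python's `bisect` documentation — an educated guess based on its name — it looks roughly like:

```python
import bisect

def find_le(a, x):
    """Return the rightmost value in the sorted list a that is <= x; raise ValueError if none exists."""
    i = bisect.bisect_right(a, x)
    if i:
        return a[i - 1]
    raise ValueError('no value <= %r' % (x,))

# With the sorted range starts of one RIR, this returns the candidate
# range start that may cover the queried ASN:
starts = [0, 1024, 4096, 65536]
print(find_le(starts, 5000))  # 4096
```

The caller then still has to check that the queried ASN actually falls inside the range beginning at the returned start, which is exactly what asn2rir.py does before declaring a match.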
Once all values show 100%, the ASN lookup is good to go. Now you can run
~~~
python3 calc_coverage.py
~~~
which should generate a file at ../../results/coverage/rir.csv that includes the number of inferred links and validatable links per RIR class.
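The file uses the `#Group|InferredLinks|ValidatableLinks` header that calc_coverage.py writes; from it, the validatable share per class can be computed in a few lines. The group names and counts below are invented for illustration:

```python
import io

# Stand-in for ../../results/coverage/rir.csv (all values invented).
sample = io.StringIO(
    '#Group|InferredLinks|ValidatableLinks\n'
    'ARIN-ARIN|1000|250\n'
    'ARIN-RIPENCC|800|200\n'
)

coverage = {}
for line in sample:
    if line.startswith('#'):
        continue  # skip the header line
    group, inferred, validatable = line.strip().split('|')
    coverage[group] = int(validatable) / int(inferred)

print(coverage)  # both invented classes have a 25% validatable share
```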
Finally, please run
~~~
python3 get_links_per_group.py
~~~
which generates a file at ../../results/links_per_class/rir_classes.csv that includes the links that are contained in each group. While this is not immediately helpful, it is used by later scripts.
import sys
sys.path.insert(1, '../')
from utils import smartopen, find_le
import random, bisect
from collections import defaultdict

random.seed(12212102)

DATA_DIR = '../../data/rir/'
FILES = ['delegated-afrinic-extended-20180405',
         'delegated-apnic-extended-20180405.gz',
         'delegated-arin-extended-20180405',
         'delegated-lacnic-extended-20180405',
         'delegated-ripencc-extended-20180405.bz2']
IANA = 'iana_initial_assignments.csv'


def yield_lines():
    for file in FILES:
        with smartopen(DATA_DIR + file) as fileinput:
            for line in fileinput:
                yield line


class AsnDelegationLookup(object):
    def __init__(self):
        self.mapping = defaultdict(dict)
        self.starts = defaultdict(list)
        self.summaries = defaultdict(int)

    def add(self, start: str, value: int, rir: str, data: list):
        try:
            start = int(start)
        except ValueError:
            # older files (e.g., from 2010-12-31) have fatfingered entries like:
            #   afrinic|MU|asn|5.1|1|20070504|allocated
            # let's just ignore them ...
            return
        stop = start + value - 1
        self.mapping[rir][start] = (stop, data)
        bisect.insort(self.starts[rir], start)

    def _lookup_rir_entry(self, rir, asn):
        # find_le raises ValueError if no range starts at or below asn
        start = find_le(self.starts[rir], asn)
        stop, data = self.mapping[rir][start]
        if start <= asn <= stop:
            return (start, stop, data)
        raise ValueError

    def lookup(self, query: str):
        query = int(query)
        min_diff, min_rir, min_data = float('inf'), 'NA', []
        self.potential_rirs = []
        for rir in self.mapping.keys():
            try:
                start, stop, data = self._lookup_rir_entry(rir, query)
                diff = stop - start
                self.potential_rirs.append((rir, diff))
                # prefer the most specific (smallest) matching range
                if diff < min_diff:
                    min_diff, min_rir, min_data = diff, rir, data
            except ValueError:
                continue
        return min_rir, min_data

    def add_summary(self, elems):
        registry, _, rtype, _, rcnt, _ = elems
        if rtype != 'asn':
            return
        self.summaries[registry] = int(rcnt)

    def get_potential_matches(self):
        return self.potential_rirs

    def show_comparison_to_summaries(self):
        print('############### Comparing data to summaries ###############')
        for rir in self.summaries:
            n = len(self.mapping[rir].keys())
            m = self.summaries[rir]
            print("%s: received %d out of %d promised record, %d%%" % (rir, n, m, int((n / m) * 100)))
        print()


class ASNDelegationLookupTester:
    def __init__(self, lookup: AsnDelegationLookup):
        self.test_cases = []
        self.lookup = lookup
        for rir in lookup.mapping.keys():
            for start in lookup.mapping[rir].keys():
                stop = lookup.mapping[rir][start][0]
                query = random.randint(start, stop)
                self.test_cases.append((rir, query, start, stop))

    def run_tests(self):
        print('################## Running random tests ###################')
        tests, successes = 0, 0
        for rir, query, start, stop in self.test_cases:
            tests += 1
            lo_rir, _ = self.lookup.lookup(query)
            if rir == lo_rir:
                successes += 1
                continue
            print('Mismatch, ASN: %d (%d-%d), expected RIR: %s, retrieved RIR: %s' % (query, start, stop, rir, lo_rir))
            print('Potential:', self.lookup.get_potential_matches())
        print("I ran %d random test lookups out of which %d (%d%%) succeeded." % (tests, successes, int(successes * 100 / tests)))
        print()


def load_lookup():
    lookup = AsnDelegationLookup()
    for line in yield_lines():
        elems = line.split('|')
        # version line, comment, or empty line -> can be ignored safely
        if len(line.strip()) == 0 or elems[0][0].isdigit() or elems[0].startswith('#'):
            continue
        # this is a summary line
        if len(elems) == 6:
            lookup.add_summary(elems)
            continue
        registry, cc, rtype, start, value, date, status, *extra = elems
        if rtype != 'asn':
            continue
        data = [date, status]
        data.extend(extra)
        lookup.add(start, int(value), registry.rstrip(), data)
    return lookup


def load_iana_lookup():
    lookup = AsnDelegationLookup()
    with smartopen(DATA_DIR + IANA) as fileinput:
        for line in fileinput:
            therange, rir = line.split(',')
            if '-' in therange:
                start, stop = [int(x) for x in therange.split('-')]
            else:
                start = stop = int(therange)
            rir = rir.lower().strip()
            if rir not in ['apnic', 'afrinic', 'arin', 'lacnic', 'ripencc']:
                continue
            lookup.add(start, stop - start + 1, 'base-' + rir, [])
    return lookup


class TwoStageLookup:
    def __init__(self):
        self.AsnLookup = load_lookup()
        self.IanaAsnLookup = load_iana_lookup()

    def lookup(self, asn):
        min_rir, _ = self.AsnLookup.lookup(asn)
        if min_rir == 'NA':
            # fall back to the coarse IANA base assignments
            return self.IanaAsnLookup.lookup(asn)[0]
        else:
            return min_rir

    def get(self, asn, default='NA'):
        rir = self.lookup(asn)
        return (rir, default)[rir == 'NA']


if __name__ == "__main__":
    # loading the tester
    AsnLookup = load_lookup()
    # checking in-file consistency
    AsnLookup.show_comparison_to_summaries()
    # doing a bunch of random checks
    tester = ASNDelegationLookupTester(AsnLookup)
    tester.run_tests()
import sys
sys.path.insert(1, '../')
from utils import read_val_links, group_links_per_class, read_asrank_links
from asn2rir import TwoStageLookup

FN = '../../results/coverage/rir.csv'

TSL = TwoStageLookup()
valids = read_val_links()
infers = read_asrank_links()
vdata = group_links_per_class(valids, TSL)
idata = group_links_per_class(infers, TSL)

with open(FN, 'wt') as out:
    out.write('#Group|InferredLinks|ValidatableLinks\n')
    for comb in set(vdata.keys()).union(set(idata.keys())):
        out.write('|'.join([comb, str(len(idata[comb])), str(len(idata[comb].intersection(vdata[comb])))]) + '\n')
import sys
sys.path.insert(1, '../')
from utils import read_val_links, group_links_per_class
from asn2rir import TwoStageLookup

FN = '../../results/links_per_class/rir_classes.csv'

TSL = TwoStageLookup()
valids = read_val_links()
with open(FN, 'wt') as out:
    data = group_links_per_class(valids, TSL)
    for comb in data.keys():
        out.write(comb + '=' + ','.join(data[comb]) + '\n')
This directory contains the scripts needed to reproduce Tables 1-3.
We assume that before running any script in this directory, you already executed the scripts in the geo_cov and topo_cov directories---these generate the files that are used as inputs.
First, please calculate the confusion matrices by running:
~~~
python3 calc_confusion_matrices.py
~~~
This will generate a single file at ../../results/perf/confusion_matrices.csv that stores one confusion matrix plus extra information per line as:
~~~
algorithm|class|TruePositives|FalsePositives|TrueNegatives|FalseNegatives|DirectionMismatches|NumP2PLinks|NumP2CLinks
~~~
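Given one such line, the standard validation metrics follow directly from the four counts. A hedged sketch — the sample line below is invented, and calc_performance.py's exact metric definitions may differ:

```python
def parse_matrix_line(line: str):
    """Split one confusion-matrix line into its fields (order as in the header above)."""
    algo, cls, tp, fp, tn, fn, mismatches, n_p2p, n_p2c = line.strip().split('|')
    return algo, cls, int(tp), int(fp), int(tn), int(fn)

def precision_recall(tp: int, fp: int, fn: int):
    """Compute precision and recall, guarding against empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Invented example line in the format documented above:
algo, cls, tp, fp, tn, fn = parse_matrix_line('asrank|all-all|90|10|80|20|0|100|100')
p, r = precision_recall(tp, fp, fn)
print('%s/%s: precision=%.2f recall=%.2f' % (algo, cls, p, r))  # precision=0.90 recall=0.82
```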
Once this file is generated, you can generate the validation tables within the paper by running:
~~~
python3 calc_performance.py
~~~
This will print the LaTeX code of each table to stdout. Please note: just copy-pasting the output will not lead to nice formatting right away. Please add the following lines before \begin{document} in your LaTeX file:
~~~
\usepackage{xcolor}
\usepackage{colortbl}
\definecolor{goodgreen}{HTML}{c6ebc9}
\definecolor{badyellow}{HTML}{ffefa1}
\definecolor{badorange}{HTML}{fdc099}
\definecolor{badred}{HTML}{fa9191}
~~~
import sys
sys.path.insert(1, '../')
from utils import *
import bz2
from collections import defaultdict


def read_link_classes():
    DIR = '../../results/links_per_class/'
    links_per_class = defaultdict(set)
    for file, ctype in [('topo_classes.csv', 'as'), ('rir_classes.csv', 'rir')]:
        with open(DIR + file, 'r') as INPUT:
            for line in INPUT:
                # strip the trailing newline so the last link in each line stays clean
                theclass, linksraw = line.strip().split('=', 1)
                links_per_class[ctype + '|' + theclass] = set(linksraw.split(','))
                links_per_class['total|all-all'].update(set(linksraw.split(',')))
    return links_per_class