Discussion:
Usenet map
U.ee
2020-04-07 17:57:04 UTC
Bonjour!

Over the last few days I have tried to make a visual representation of Usenet
peering relations.
The data is collected from my point of view, from Usenet.ee, so there are
missing links and nodes. The source articles were collected over a period of
a few months, so some peerings have probably changed in that time.
The first trouble was with source quality: my filtering was a little too
inclusive. Next time I can filter out some bogus data at the beginning, so
hopefully the next maps will be better.
Another issue is complicated sites showing their inner workings. That
issue is twofold:
1) There are a lot of point-to-point relations, complicating filtering by
peer count.
2) A high node count, making the map noisy.

To lower the node count I tried a few filters; one measure was to show only
nodes with at least 2 peers.


So, a map with some uninteresting leaves removed:
http://usenet.ee/maps/graph3-n-20200402.png

A slightly different cleanup:
http://usenet.ee/maps/graph-n-cleaned-20200402.png

A different representation for better readability:
http://usenet.ee/maps/graph3-d-20200402.png


I hope that readers here find this interesting!

Best regards,
U.ee
Grant Taylor
2020-04-07 23:49:37 UTC
Post by U.ee
Bonjour!
Hi,
Post by U.ee
In last few days I have tried to make visual representation about Usenet
peering relations.
I did something quite similar last week.
Post by U.ee
Data is collected from my point of view, from Usenet.ee. There is
missing links and nodes. Source articles are collected over few month
period, so some peerings are probably changed in that time.
Ya. Every post is inherently a tree that connects to you. Frequently
those trees coalesce into a graph from your system's point of view.
Post by U.ee
First trouble was with source quality: my filtering was little bit too
inclusive. Next time I can filter some bogus data out in beginning, so
hopefully next maps are better.
I'm curious what you used as your source, how you filtered it, and whether
you modified anything.

I took the Path: headers from my data source (10,000 articles from a
newsgroup), derived the connection between each upstream and downstream
host, shifted, and repeated.

I added these, and labels, to a dot (Graphviz) file by way of a sort
uniq filter to remove duplication.

I ended up using mutated node names in the dot file, with labels of
their actual node name. This way I didn't have to worry about dot node
naming conventions.
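
A minimal Python sketch of the "mutated node name plus label" idea, for readers who want to try it. The hashing scheme and the names node_id and to_dot are assumptions made for illustration; the post does not say how the names were actually mutated.

```python
import hashlib


def node_id(host: str) -> str:
    """Return a dot-safe identifier for a host (hypothetical scheme: 'n' + hash prefix)."""
    return "n" + hashlib.sha1(host.encode("utf-8")).hexdigest()[:10]


def to_dot(edges):
    """Render (upstream, downstream) pairs as a Graphviz digraph string."""
    hosts = sorted({host for edge in edges for host in edge})
    lines = ["digraph usenet {"]
    for host in hosts:
        # The real host name only appears inside the quoted label attribute.
        label = host.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'  {node_id(host)} [label="{label}"];')
    # sorted(set(...)) plays the role of the sort | uniq filter.
    for up, down in sorted(set(edges)):
        lines.append(f"  {node_id(up)} -> {node_id(down)};")
    lines.append("}")
    return "\n".join(lines)
```

Because every identifier is a plain alphanumeric token, host names containing dots, dashes, or other characters never have to satisfy dot's identifier rules.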

I also modified a lot of nodes to be their organization name with a sed
script. This way, all of Google would show up as simply "Google".
Post by U.ee
Another issue is complicated sites showing their inner workings. That
1) There is lot of point to point relations, complicating filtering by
peer count.
Yes, Usenet is inherently point to point.

Aside: I've heard of NNTP via multi-cast, but I've never seen it used,
much less between organizations.
Post by U.ee
2) High nodecount, making map noisy.
I think that's where normalizing the node name to organization name made
a HUGE difference.

See this tweet:

https://twitter.com/DrScriptt/status/1245529142738575362

In particular the follow up discussion:

https://twitter.com/revprez/status/1245529633409417217
Post by U.ee
For lowering node count I tried few filters, one measure was only show
nodes with at least 2 peers.
I'm curious how you did your filtering.
Post by U.ee
http://usenet.ee/maps/graph3-n-20200402.png
I wanted to avoid removing nodes.

Or more specifically, I wanted to avoid removing organizations, where
one or more nodes map to an organization.
Post by U.ee
http://usenet.ee/maps/graph-n-cleaned-20200402.png
http://usenet.ee/maps/graph3-d-20200402.png
I saw a crazy looping line in that graph, between
newsfeed.CARNet.hr and feeder.erje.net. The crazy loop is below
newsreader4.netcologne.de.
Post by U.ee
I hope that readers here found this interesting!
I found the idea interesting enough to spend some time doing it last week.

I'm considering a new funnel feed that goes into a filter to extract
Path: headers of all incoming articles to collect more data. (I worked
on a subset of just one group.)
Post by U.ee
Best regards,
Likewise.
--
Grant. . . .
unix || die
Grant Taylor
2020-04-08 00:06:13 UTC
Post by Grant Taylor
I took the Path: headers from my data source (10,000 articles from a
newsgroup), derived connections between the upstream and downstream
hosts, shift, and repeat.
I hacked together a shell script that read Path: header lines from
STDIN, looping across each header.

Each path header was split on the bang and walked from the posting host to
my receiving host, looping across each host.

awk -F\! '{for (i=NF; i>1; i--)printf("%s ",$i); print $1; }'

This allowed me to take action on each host in each path entry.

Since I was starting with the posting / upstream host, I was able to
iterate through the hosts in the path and deduce the upstream to
downstream relation. This directly translated to "$upstream ->
$downstream" entries for dot to work with.

There was a little bit of clean up work, like priming when I don't have
both upstream and downstream, as well as some things like not-for-mail /
mail2news, etc.
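
The walk described above can be sketched in Python roughly as follows. This is not the shell script from the post; the function name path_edges and the SKIP set are assumptions, and only not-for-mail is taken from the text as an example of a pseudo-host to drop.

```python
# Pseudo-hosts that are not real servers (assumed list; extend as needed).
SKIP = {"not-for-mail"}


def path_edges(path_header: str):
    """Turn one Path: header into (upstream, downstream) host pairs."""
    value = path_header.strip()
    if value.lower().startswith("path:"):
        value = value[5:].strip()
    # Split on the bang and drop empty or skipped components.
    hosts = [h for h in value.split("!") if h and h not in SKIP]
    # The header lists the receiving host first, so reverse it to start
    # at the posting side.
    hosts.reverse()
    # Pair each host with the next one: upstream -> downstream.
    return list(zip(hosts, hosts[1:]))
```

For example, path_edges("Path: mine.example!peer.example!origin.example!not-for-mail") yields the pairs (origin.example, peer.example) and (peer.example, mine.example), which map directly onto "$upstream -> $downstream" lines for dot.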

I did my node-to-organization clean up with sed, using rules like the
following before running the loops across the paths:

s/fx01.iad.POSTED!/Google!/
s/peer03.ams1!/Google!/
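
For comparison, a rough Python equivalent of that normalization. The table ORG_BY_HOST simply mirrors the two sed rules above; collapsing consecutive duplicates is an added assumption, made so that two hosts of the same organization in a row do not produce a self-loop.

```python
# Host-to-organization table; the two entries mirror the sed rules above.
ORG_BY_HOST = {
    "fx01.iad.POSTED": "Google",
    "peer03.ams1": "Google",
}


def normalize_path(path: str) -> str:
    """Replace known hosts with their organization name, hop by hop."""
    hosts = [ORG_BY_HOST.get(h, h) for h in path.split("!")]
    # Assumption: merge consecutive identical names (Google!Google -> Google).
    collapsed = [h for i, h in enumerate(hosts) if i == 0 or h != hosts[i - 1]]
    return "!".join(collapsed)
```

Unlike the sed expressions, where an unescaped dot matches any character, the dictionary lookup only matches the exact host name.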
--
Grant. . . .
unix || die
U.ee
2020-04-08 09:29:29 UTC
Post by Grant Taylor
Post by U.ee
Bonjour!
Hi,
Tervitus! :)
Post by Grant Taylor
Post by U.ee
In last few days I have tried to make visual representation about
Usenet peering relations.
I did something quite similar last week.
Nice!
Post by Grant Taylor
I'm curious what you used as your source, how you filtered, and if you
modified.
All articles from all groups on my server since January. I sorted and
counted the unique paths and dropped everything below 10 occurrences. Some
data was simply noise (containing something other than real paths), so I
removed that too.
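
A small Python sketch of the count-and-threshold step, equivalent in spirit to sort | uniq -c followed by a cutoff. The function name frequent_paths and its min_count parameter are illustrative assumptions; only the threshold of 10 comes from the text.

```python
from collections import Counter


def frequent_paths(paths, min_count=10):
    """Count identical paths and keep those seen at least min_count times."""
    counts = Counter(p.strip() for p in paths if p.strip())
    return {path: n for path, n in counts.items() if n >= min_count}
```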
Because I have been playing with it for several days (or, looking back, more
like weeks), I have tried several things, so it is partly lost what exactly
was done for a specific map (some crappier ones are deleted; some more
amusing specimens are still in my archive).
I tried removing those IP-ADDRESS.POSTED components, and likewise
not-for-mail and similar.
Post by Grant Taylor
I took the Path: headers from my data source (10,000 articles from a
newsgroup), derived connections between the upstream and downstream
hosts, shift, and repeat.
I added these, and labels, to a dot (Graphviz) file by way of a sort
uniq filter to remove duplication.
I ended up using mutated node names in the dot file, with labels of
their actual node name.  This way I didn't have to worry about dot node
naming conventions.
I used shell tools (grep, sort, uniq; sed at one point)
and Python with pygraphviz.
For graph3-n-20200402.png I generated the dot file with Python and
"doctored" it manually to remove split clouds, those small clusters
representing the inner workings of some Usenet site.
Then I used neato to generate the PNG file.
Post by Grant Taylor
Post by U.ee
1) There is lot of point to point relations, complicating filtering by
peer count.
Yes, Usenet is inherently point to point.
Aside:  I've heard of NNTP via multi-cast, but I've never seen it used,
much less between organizations.
I probably used bad terminology here, sorry. I felt that PtP describes
it well.
I meant point-to-point in a more general sense: one node having only one
upstream, and the upstream node having only one downstream, similar to,
for example, wireless backhauls.
Post by Grant Taylor
Post by U.ee
2) High nodecount, making map noisy.
I think that's where normalizing the node name to organization name made
a HUGE difference.
https://twitter.com/DrScriptt/status/1245529142738575362
https://twitter.com/revprez/status/1245529633409417217
I am not well versed in Twitter. Your map there looks nice!
Post by Grant Taylor
Post by U.ee
For lowering node count I tried few filters, one measure was only show
nodes with at least 2 peers.
I'm curious how you did your filtering.
Because I used Python and converted all the data to Python data structures
(dictionaries and sets), filtering was fairly easy. More "smartness"
is needed though; for example, at the moment I simply delete specific
keys, using hardcoded hostnames.
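
A minimal sketch of what such dictionary-and-set filtering might look like, assuming the graph is held as a dict mapping each host to the set of its peers. The names build_peers and prune_leaves are hypothetical, and this is a single pass: a node that drops below the threshold only because its neighbours were removed is kept.

```python
from collections import defaultdict


def build_peers(edges):
    """Build an undirected host -> set-of-peers mapping from (up, down) pairs."""
    peers = defaultdict(set)
    for up, down in edges:
        peers[up].add(down)
        peers[down].add(up)
    return peers


def prune_leaves(peers, min_peers=2):
    """Keep only nodes with at least min_peers peers (one pass, not iterated)."""
    keep = {node for node, nbrs in peers.items() if len(nbrs) >= min_peers}
    # Drop removed nodes from the remaining peer sets as well.
    return {node: nbrs & keep for node, nbrs in peers.items() if node in keep}
```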
Post by Grant Taylor
Post by U.ee
http://usenet.ee/maps/graph3-n-20200402.png
I wanted to avoid removing nodes.
Or more specifically, I wanted to avoid removing organizations.  Where
one or more nodes map to an organization.
Yes, I removed nodes, but generally not sites (organizations).
Post by Grant Taylor
Post by U.ee
http://usenet.ee/maps/graph3-d-20200402.png
I saw some a crazy looping line in that graph, between
newsfeed.CARNet.hr and feeder.erje.net.  The crazy loop is below
newsreader4.netcologne.de.
You mean the one to the left of news.tnib.de? Nice catch!
Post by Grant Taylor
Post by U.ee
I hope that readers here found this interesting!
I found the idea interesting enough to spend some time doing it last week.
I'm considering a new funnel feed that goes into a filter to extract
Path: headers of all incoming articles to collect more data.  (I worked
on a subset of just one group.)
Some time ago I looked into newsreader (posting agent) statistics, and
there were noticeable differences between groups. I think the same is true
of paths. Some groups are representative of some other network, for
example Fido. So now, of course, the question is whether you want these
kinds of hosts to show up in your map.

Best wishes
U.ee
Grant Taylor
2020-04-08 15:50:19 UTC
Post by U.ee
Nice!
I'm completely new to dot / Graphviz, so I'm still learning.
Post by U.ee
All articles from all groups since January from my server. Sorted and
counted unique paths. Dropped everything below 10 occurrences.
Hum.

I would be concerned about dropping information. I know that there are
some newsgroups that get very low traffic, as in one post every few
months. But that's group specific and probably not as much of an issue
with servers with additional groups.
Post by U.ee
Some data was simply noise (containing something other than real
paths), removed those too.
I'm curious to see an example of such.
Post by U.ee
Because I have played it several days or looking back more like weeks
I have tried several things, so little bit is lost, what was exactly
done with specific map (some crappier ones are deleted, some more
amusing specimens are still in my archive).
I get it.

I think that's more the "art" part than "science".
Post by U.ee
I tried remove those IP-ADDRESS.POSTED components, same with
not-for-mail and similar.
Did you discard the entire Path? Or just those portions (to the end of
the Path)?

Also, IP-ADDRESS.POSTED is little different than FQDN.POSTED to me.
They are different ways of conveying the same information. The former
didn't have functioning reverse DNS (or it was disabled) and the latter did.

But the posting itself is still a viable article to me.
Post by U.ee
I used shell tools (grep, sort, uniq; sed in one point)
I think that such tools are underappreciated.
Post by U.ee
and python with pygraphviz.
ACK
Post by U.ee
For graph3-n-20200402.png I generated dot file with python and
"doctored" it manually to remove split clouds. Those small clusters
representing some inner working for some usenet site.
That's what I used sed to normalize those nodes to org names for.
Post by U.ee
Then used neato to generate PNG file.
Why neato vs dot itself?
Post by U.ee
I probably used bad terminology here, sorry. I felt that PtP describes
it well.
Yes, point-to-point is a distinct type of connection. I believe that
the vast majority of NNTP servers are point-to-point connected.
Post by U.ee
I meant point-to-point in more general sense, one node having only one
upstream, and upstream node having only one downstream, so similar for
example wireless backhauls.
Ah. I think you're talking about removing things that chain through
each other without branching. E.g. remove n2 & n3 below.

[n1]---[n2]---[n3]---[n4]

You wanted "significant nodes" (which interconnect three or more other
nodes). E.g. remove n2 below.

[n5]
|
[n1]---[n2]---[n3]---[n4]
|
[n6]

Where remove can mean collapse into the larger organization.
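
The chain removal sketched in the two diagrams above could look roughly like this in Python. The function name collapse_chains is an assumption, as is the input format: a symmetric dict mapping each node to the set of its neighbours, with no self-loops. Nodes with exactly two neighbours are spliced out unless their neighbours are already linked, which leaves cycles intact.

```python
def collapse_chains(peers):
    """Remove pass-through nodes (exactly two neighbours) and join their neighbours."""
    adj = {node: set(nbrs) for node, nbrs in peers.items()}
    changed = True
    while changed:
        changed = False
        for node in list(adj):
            nbrs = adj[node]
            if len(nbrs) != 2:
                continue
            a, b = nbrs
            if b in adj[a]:
                # Neighbours are already linked; keep the node so the
                # cycle stays visible.
                continue
            # Splice the node out and connect its two neighbours directly.
            adj[a].discard(node)
            adj[b].discard(node)
            adj[a].add(b)
            adj[b].add(a)
            del adj[node]
            changed = True
    return adj
```

On the first diagram this leaves only n1 and n4, joined directly; on the second it removes n2 and keeps n3, which interconnects four other nodes.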
Post by U.ee
I am not well versed with twitter. Your map there looks nice!
Thank you.
Post by U.ee
Because I used python and converted all data to python data structures
(dictionaries and sets), filtering was somewhat easy.
Hum.... The old unix admin in me has concerns about loading all of that
data into memory. Conversely, other than sort and dot, much of what I
did was based on streaming data through and using much less memory at
any given time. Though, such optimizations are not necessarily as
important these days.
Post by U.ee
More "smartness" is needed though, for example in this time I simply
delete specific keys, using hardcoded hostnames.
Please elaborate on what you are deleting. What does it represent in
the Path: header? Why are you deleting it?

Admittedly, the sort / uniq I was doing would remove data. But it was
data that was already represented in my data.
Post by U.ee
Yes, I removed nodes, but generally not sites (organizations).
ACK
Post by U.ee
You mean that left from news.tnib.de? Nice catch!
:-)

It was luck. My viewer happened to zoom and show it in the center of the
zoomed view.

Tracking it was more difficult.
Post by U.ee
Some time ago I looked into newsreaders (posting agent) statistics and
between groups there was noticeable differences. I think same is true
with paths. Some groups are representative for some other network, for
example fido.
I agree that there is quite likely — what I'm going to call — clustering
of groups & paths to form message flows.

Though remember Usenet's flooding nature.
Post by U.ee
So, now of course question is, do you want this kind hosts show up
in your map.
I would think so.

I'll counter with why would you not want these hosts to show up?

They are articles that flow across Usenet.

I guess it could be that you're mapping a specific part / subset of Usenet.
Post by U.ee
Best wishes
Likewise.
--
Grant. . . .
unix || die
U.ee
2020-04-08 17:53:32 UTC
Grant,

First, thank you very much for asking those questions and explaining
your process!
Hum.
I would be concerned about dropping information.  I know that there are
some newsgroups that get very low traffic, as in one post every few
months.  But that's group specific and probably not as much of an issue
with servers with additional groups.
I drop them because there is too much data.
Before dropping, I had 86246 lines in my unique paths file.
Post by U.ee
Some data was simply noise (containing something other than real
paths), removed those too.
I'm curious to see an example of such.
Well, I didn't limit Path: occurrences per article, so if the body contained
a line matching (^Path: ), it got included.
For example:
Path: /Applications/HouseCall.app/Contents/MacOS/HouseCall
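
One way to avoid this kind of noise is to stop reading at the blank line that separates headers from body. A minimal Python sketch, assuming each article is available as one text string and the Path header is not folded across lines; the function name header_path is hypothetical.

```python
def header_path(article_text: str):
    """Return the Path header value of one article, ignoring the body."""
    for line in article_text.splitlines():
        if line == "":
            # The first empty line ends the header block.
            break
        if line.lower().startswith("path:"):
            return line[5:].strip()
    return None
```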
Post by U.ee
I tried remove those IP-ADDRESS.POSTED components, same with
not-for-mail and similar.
Did you discard the entire Path?  Or just those portions (to the end of
the Path)?
No, only that specific node. The path itself remains, in a shorter form.
Post by U.ee
For graph3-n-20200402.png I generated dot file with python and
"doctored" it manually to remove split clouds. Those small clusters
representing some inner working for some usenet site.
That's what I used sed to normalize those nodes to org names for.
Post by U.ee
Then used neato to generate PNG file.
Why neato vs dot itself?
I have used both, in different map files.
graph3-d-20200402.png was generated by pygraphviz using the dot backend.
I prefer neato though. This isn't very scientific reasoning, but it looks
more compact and gives a feel for which paths are more traveled. Those
middle nodes look like stars in the middle of a galaxy.
Ah.  I think you're talking about removing things that chain through
each other without branching.  E.g. remove n2 & n3 below.
[n1]---[n2]---[n3]---[n4]
You wanted "significant nodes" (which interconnect three or more other
nodes).  E.g. remove n2 below.
              [n5]
               |
[n1]---[n2]---[n3]---[n4]
               |
              [n6]
Where remove can mean collapse into the larger organization.
Indeed, that is a good description.
Post by U.ee
Because I used python and converted all data to python data structures
(dictionaries and sets), filtering was somewhat easy.
Hum....  The old unix admin in me has concerns about loading all of that
data into memory.  Conversely, other than sort and dot, much of what I
did was based on streaming data through and using much less memory at
any given time.  Though, such optimizations are not necessarily as
important these days.
Actually, dot/neato at the end are the heavy part; everything before that
is fast and doesn't take many resources.
Several times I had neato crash because of memory starvation.
Dot seemed more stable, but extremely slow with a bigger .dot file.
Post by U.ee
More "smartness" is needed though, for example in this time I simply
delete specific keys, using hardcoded hostnames.
Please elaborate on what you are deleting.  What does it represent in
the Path: header?  Why are you deleting it?
A single server/node. The end result is sometimes removing something
generic and uninformative from the end (like not-for-mail), or something
from the middle (again those "same organization, different load balancer"
deals). The paths themselves are still there, but shorter.
Post by U.ee
Some time ago I looked into newsreaders (posting agent) statistics and
between groups there was noticeable differences. I think same is true
with paths. Some groups are representative for some other network, for
example fido.
I agree that there is quite likely — what I'm going to call — clustering
of groups & paths to form message flows.
Though remember Usenet's flooding nature.
Those articles are flooded to every site, but are found in specific groups.
So, if you use a limited set of groups to collect path info, you lose some
of the more exotic hosts.
Post by U.ee
So, now of course question is, do you want this kind hosts show up in
your map.
I would think so.
I'll counter with why would you not want these hosts to show up?
Because they don't represent NNTP peering relations and sometimes
(list servers behind mail-to-Usenet gateways) aren't even part of Usenet.
Then again, when articles flow in both directions, showing them has some merit.
I suppose that when you map something, you need to think critically about
what exactly you are trying to represent. The same goes for data collection:
what and where you are collecting, what is missing, and what is excessive.
They are articles that flow across Usenet.
I guess it could be that you're mapping a specific part / subset of Usenet.
Exactly; if you want to map how the articles themselves move, then you can
include them.


Best regards,
U.ee
Julien ÉLIE
2020-04-08 21:33:41 UTC
Hi,
Post by U.ee
Tervitus! :)
Oh, it reminds me of my trip to Tallinn/Lahemaa/Tartu last December.
Estonia is such a very nice country! I really enjoyed it!
Post by U.ee
Post by Grant Taylor
Post by U.ee
In last few days I have tried to make visual representation about
Usenet peering relations.
I did something quite similar last week.
In case you have not already taken a look at inpath2dot or inflow:
https://cord.de/news-stuff
https://ftp.isc.org/isc/inn/unoff-contrib/inflow
you might find useful parsing tricks or enhancements in these two scripts.
--
Julien ÉLIE

« Les soucis d'aujourd'hui sont les plaisanteries de demain. Rions-en
donc tout de suite. » (Henri Béraud)
U.ee
2020-04-09 12:15:04 UTC
Post by Julien ÉLIE
Hi,
Post by U.ee
Tervitus! :)
Oh, it reminds me of my trip to Tallinn/Lahemaa/Tartu last December.
Estonia is such a very nice country!  I really enjoyed it!
That is nice to hear! I hope that you visited Viru bog in Lahemaa.
Post by Julien ÉLIE
Post by U.ee
Post by Grant Taylor
Post by U.ee
In last few days I have tried to make visual representation about
Usenet peering relations.
I did something quite similar last week.
  https://cord.de/news-stuff
  https://ftp.isc.org/isc/inn/unoff-contrib/inflow
you might find in these two scripts useful parsing tricks or enhancements.
Thanks for those links!

Best regards
U.ee
Julien ÉLIE
2020-04-19 07:28:21 UTC
Hi,
Post by U.ee
Post by Julien ÉLIE
Oh, it reminds me of my trip to Tallinn/Lahemaa/Tartu last December.
Estonia is such a very nice country!  I really enjoyed it!
That is nice to hear! I hope that you visited Viru bog in Lahemaa.
Yes! I visited Viru bog. So marvellous! I remember well that 3.5 km
hike following a very narrow path surrounded by bogs. I had the chance
to see the sunset when reaching the little tower near the middle of the
hike. Furthermore, it snowed in the morning (while driving to Lahemaa) and
the weather was perfect during the afternoon. The hike in such snowy
landscapes was fantastic for the eyes.

This was a guided tour to Viru bog with traveller.ee; on that day, we
also went to an old manor house and Jägala waterfall.
--
Julien ÉLIE

« C'est la goutte qui fait déborder l'amphore ! »
(Assurancetourix)
ptomblin+ (Paul Tomblin)
2020-04-08 12:56:09 UTC
Post by U.ee
In last few days I have tried to make visual representation about Usenet
peering relations.
Data is collected from my point of view, from Usenet.ee. There is
missing links and nodes. Source articles are collected over few month
Years ago there was a script that many sites ran on a regular basis which sent
their data to a central location that collated all that information. That had
obvious advantages over just collecting the info at one node.
--
Paul Tomblin <***@xcski.com> http://blog.xcski.com/
...I'm not one of those who think Bill Gates is the devil. I simply
suspect that if Microsoft ever met up with the devil, it wouldn't need an
interpreter. -- Nick Petreley
Grant Taylor
2020-04-08 15:53:14 UTC
Post by ptomblin+ (Paul Tomblin)
Years ago there was a script that many sites ran on a regular basis
which sent their data to a central location that collated all that
information. That had obvious advantages over just collecting the
info at one node.
Are you referring to Top1000 or something else?

Top1000 is still a thing.

Aside: My main server is in the top quarter. :-)
--
Grant. . . .
unix || die
Karl Kleinpaste
2020-04-08 16:10:38 UTC
Post by Grant Taylor
Are you referring to Top1000 or something else?
It was Brian Reid's (then of DECWRL) Network Measurement Project.
He once described an article flowing through the (empirically observed)
core NNTP servers as a flare fired into a munitions dump.
Grant Taylor
2020-04-08 20:40:00 UTC
Post by Karl Kleinpaste
It was Brian Reid's (then of DECWRL) Network Measurement Project.
Hum. I'm not familiar. I'm guessing that's because "was" implies past
tense and I'm mostly current tense.
Post by Karl Kleinpaste
He once described an article flowing through the (empirically observed)
core NNTP servers as a flare fired into a munitions dump.
LOL

That seems accurate.
--
Grant. . . .
unix || die
Karl Kleinpaste
2020-04-08 21:24:26 UTC
that's because was implies past tense
1987-1989 or thereabouts. Not sure if it continued into the '90s.
ptomblin+ (Paul Tomblin)
2020-04-09 13:30:20 UTC
Post by Grant Taylor
Post by ptomblin+ (Paul Tomblin)
Years ago there was a script that many sites ran on a regular basis
which sent their data to a central location that collated all that
information. That had obvious advantages over just collecting the
info at one node.
Are you referring to Top1000 or something else?
Top1000 is still a thing.
Except the "how to participate" section still is just a bunch of "XXX add a
link to this".
Post by Grant Taylor
Aside: My main server is in the top quarter. :-)
I'm 153rd.
--
Paul Tomblin <***@xcski.com> http://blog.xcski.com/
"Always try to do things in chronological order; it's less confusing
that way."