4 Oct 2012

How to Extract Domain Names from VeriSign's .COM Zone File

Posted by Matt Mazur (@mhmazur)

For those of you not familiar with it, the zone file is a file generated daily that lists all of the active domain names and corresponding name servers for a particular Top Level Domain (TLD) like .COM, .NET, etc.

VeriSign manages the .COM zone file, and in the last post I explained how to obtain access to download it. In this tutorial I'll explain how to parse the zone file to extract just the domain names and strip out everything else.

Working with the zone file

The size of the uncompressed zone file varies by day, but you can figure it will be somewhere around 8.5 GB.

Any attempt you make to open the entire zone file in a text editor will likely crash your computer, so you have to resort to working with it via the command line (the rest of this tutorial assumes you are using a Unix-like machine such as Mac OS X or Linux).

For this tutorial we're going to work with only the first 100 lines of the zone file because it makes it easy to show you what's going on step by step. These steps would also work on the entire zone file, though it would take significantly longer to execute each of the commands.
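If you'd rather cut your own 100-line sample from the full file, something like the sketch below should do it. The name com.zone is only an assumption about what your downloaded file is called, and the seq line just generates placeholder content so the snippet can run without the real zone file.

```shell
# Stand-in for the real zone file: "com.zone" is a hypothetical filename,
# and seq writes 250 numbered lines to act as placeholder content.
workdir=$(mktemp -d)
cd "$workdir"
seq 1 250 > com.zone

# Keep only the first 100 lines. head stops reading once it has them,
# so this stays quick even on a multi-gigabyte file.
head -n 100 com.zone > com.zone.example.txt

# Print the line count of the sample as a sanity check.
wc -l < com.zone.example.txt
```

Because head never reads past line 100, you don't pay for the other 8 GB or so.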

1. Navigate to the location of the zone file

If you'd like to follow along, you can download the truncated zone file here.

After you're in the folder with the zone file (or zone file example) you should be able to run ls -l and see the zone file:

$ ls -l
-rw-r--r-- 1 matt staff 4973 Oct 4 12:06 com.zone.example.txt

You can confirm the number of lines in the zone file by running wc -l:

$ wc -l com.zone.example.txt
100 com.zone.example.txt

You can view the contents of the example zone file by opening it up in a text editor (again, don't try this on the full zone file) or by running cat:

$ cat com.zone.example.txt

Here's what the contents of the example zone file look like:

; The use of the Data contained in Verisign Inc.'s aggregated
; .com, and .net top-level domain zone files (including the checksum
; files) is subject to the restrictions described in the access Agreement
; with Verisign Inc.

$ORIGIN COM.
$TTL 900
@ IN SOA a.gtld-servers.net. nstld.verisign-grs.com. (
1349069930 ;serial
1800 ;refresh every 30 min
900 ;retry every 15 min
604800 ;expire after a week
86400 ;minimum of 15 min
)
$TTL 172800
NS A.GTLD-SERVERS.NET.
NS G.GTLD-SERVERS.NET.
NS H.GTLD-SERVERS.NET.
NS C.GTLD-SERVERS.NET.
NS I.GTLD-SERVERS.NET.
NS B.GTLD-SERVERS.NET.
NS D.GTLD-SERVERS.NET.
NS L.GTLD-SERVERS.NET.
NS F.GTLD-SERVERS.NET.
NS J.GTLD-SERVERS.NET.
NS K.GTLD-SERVERS.NET.
NS E.GTLD-SERVERS.NET.
NS M.GTLD-SERVERS.NET.
COM. 86400 DNSKEY 256 3 8 AQOl6m3T1q7fQhwJYFfWJO/IvVmtfI2Eg2wX4UR9jcl/qaTiMp+7Kx7baGOsPvZwX4lVGYWif955l4lLh/VnnNJvjDxBWVQcDrH3cHzFAaq9QXZPcEk7UyTOBL1piVpB2dqJzbO2bH9XGFiOXPUj3nhQ7mxvW0bgRiKv9Qah/7NH2w==
COM. 86400 DNSKEY 257 3 8 AQPDzldNmMvZFX4NcNJ0uEnKDg7tmv/F3MyQR0lpBmVcNcsIszxNFxsBfKNW9JYCYqpik8366LE7VbIcNRzfp2h9OO8HRl+H+E08zauK8k7evWEmu/6od+2boggPoiEfGNyvNPaSI7FOIroDsnw/taggzHRX1Z7SOiOiPWPNIwSUyWOZ79VmcQ1GLkC6NlYvG3HwYmynQv6oFwGv/KELSw7ZSdrbTQ0HXvZbqMUI7BaMskmvgm1G7oKZ1YiF7O9ioVNc0+7ASbqmZN7Z98EGU/Qh2K/BgUe8Hs0XVcdPKrtyYnoQHd2ynKPcMMlTEih2/2HDHjRPJ2aywIpKNnv4oPo/
COM. 86400 NSEC3PARAM 1 0 0 -
COM. 86400 RRSIG NSEC3PARAM 8 1 86400 20121006041758 20120929030758 47783 COM. chypT55+F8iGvkLS2TVSiqonms7mRNjDe2g560bqSulngD7z4y+4qsz13ZOGY4yrM9lBOfGYtNxpai3Q9TAT9n0X2L/3cJDY/xyA5LFSC0ilRvm+zr41d5TRwuf/GMdj7pfN2w6IoSxYckgPczqWHG2yOpX6EPXuIPW5E2L8GZU=
COM. RRSIG NS 8 1 172800 20121006041758 20120929030758 47783 COM. hfdySh/hHeA0zNLcbLQMNtRsXcOVKzH7vGED0t8IbkdaOTeuSFi0E8vXMVUJDjK9hlVYsCa4bE5wh5X61pIKkI9SjyCDjUK92ZpG/2+rtHeYWRbREAMpgcZ4FAySSknskHOnkUa4c/0tA9ZOJ0AkNzxztUr+KinlC+Co8rp5aGg=
COM. 900 RRSIG SOA 8 1 900 20121008053850 20121001042850 47783 COM. odeDdoJS/JVKOMNcdDd4Oh8MnY2DoKobagNU44AKjYE9GuQ3sBgbXmyH3JOrS6a7iBmFexN6UAdLSNcCozOO0Ta51WQFcuJhbvZwhXNrjOH50pkcG7Xw9pzwlOrftj9R7pHwCDEagZp20GGtbGATf946D6CCUJSBmtZ8pqoEu7s=
COM. 86400 RRSIG DNSKEY 8 1 86400 20121005182533 20120928182033 30909 COM. nPBzPp1A3EBgwjf3IrrYVgVh0YcVqdd6YKQ4CeraP5vK8nIyUMqGMnLc2ykA/BWb8AtAdg6KiOVsXl+4dkkqijccbt8mEzUZ6aD3Gd1IT13K5uDq4tjhxaQTRkloZU1TC4FfRhe5DHQSHzTmOWn9ClqonMa2FeNaf9rlsaNCaWq4fctndbPhuhuN0m9EKSh0So8WhM/5wZqjsie9+S2yBPsxakXWTA3zwxR7y9sqfabfmH+KmrQRF2lCXxhFof4zp3VLpG9UK1kS/4mQTdm8kNRzfgNgCKo1ejS4uMj5g0rS6n5aZvk8PfeVbBlhnVb3oDRImz/RIhZJ1x0w3kzA==
ENERCONTECHNOLOGIES NS NS1.BIZ.RR
ENERCONTECHNOLOGIES NS NS2.BIZ.RR
SELF-DRIVE-CAR-RENTAL NS NS9.IZP
SELF-DRIVE-CAR-RENTAL NS IZA.HOSTING.DIGIWEB.IE.
SELF-DRIVE-CAR-RENTAL NS NS8.FOR-SALE-IF-THE-PRICE-IS-RIGHT
NANCYVRAINE NS NS1.IMCONLINE.NET.
NANCYVRAINE NS NS2.IMCONLINE.NET.
SELFDRIVECARRENTAL NS NS9.IZP
SELFDRIVECARRENTAL NS IZA.HOSTING.DIGIWEB.IE.
SELFDRIVECARRENTAL NS NS8.FOR-SALE-IF-THE-PRICE-IS-RIGHT
WORLDDATASOURCE NS NS01.DOMAINCONTROL
WORLDDATASOURCE NS NS02.DOMAINCONTROL
SAUDIPHOTOGRAPHERS NS NS1.R4L
SAUDIPHOTOGRAPHERS NS NS2.R4L
MERCKCHOICE NS CBRU.BR.NS.ELS-GMS.ATT.NET.
MERCKCHOICE NS CMTU.MT.NS.ELS-GMS.ATT.NET.
ENVIRONMENTALSCHOOLS NS PSNS01.PAULSMITHS.EDU.
ENVIRONMENTALSCHOOLS NS PSNS02.PAULSMITHS.EDU.
EASTHAMPTONHOMES NS BUY.INTERNETTRAFFIC
EASTHAMPTONHOMES NS SELL.INTERNETTRAFFIC
AMERICASHOMEBUILDER NS BUY.INTERNETTRAFFIC
AMERICASHOMEBUILDER NS SELL.INTERNETTRAFFIC
BOVINUS NS C3P0.CBFENTERPRISES
BOVINUS NS R2D2.CBFENTERPRISES
CONSTELLATIONCOLLEGE NS NS1.SEDOPARKING
CONSTELLATIONCOLLEGE NS NS2.SEDOPARKING
DOCHERTYCONSULTING NS NS1.VERINOTE.NET.
DOCHERTYCONSULTING NS NS3.VERINOTE.NET.
SONOMETRICS NS NS35.WORLDNIC
SONOMETRICS NS NS36.WORLDNIC
UNLIMITEDDISCOUNTPHONECALLS NS DNS1.NAME-SERVICES
UNLIMITEDDISCOUNTPHONECALLS NS DNS2.NAME-SERVICES
UNLIMITEDDISCOUNTPHONECALLS NS DNS3.NAME-SERVICES
UNLIMITEDDISCOUNTPHONECALLS NS DNS4.NAME-SERVICES
UNLIMITEDDISCOUNTPHONECALLS NS DNS5.NAME-SERVICES
FREILAND NS NS1.FABULOUS
FREILAND NS NS2.FABULOUS
KUMA-NET NS UNS01.LOLIPOP.JP.
KUMA-NET NS UNS02.LOLIPOP.JP.
SANGYOSHIEN NS NS55.WORLDNIC
SANGYOSHIEN NS NS56.WORLDNIC
JONATHANCHARLESNOVAK NS DNS077.A.REGISTER
JONATHANCHARLESNOVAK NS DNS030.B.REGISTER
JONATHANCHARLESNOVAK NS DNS030.C.REGISTER
JONATHANCHARLESNOVAK NS DNS010.D.REGISTER
HQSINGAPORE NS NS41.DOMAINCONTROL
HQSINGAPORE NS NS42.DOMAINCONTROL
PANASOURCE NS F1G1NS1.DNSPOD.NET.
PANASOURCE NS F1G1NS2.DNSPOD.NET.
PRIVATESAUNAS NS NS.BUYDOMAINS
PRIVATESAUNAS NS THIS-DOMAIN-FOR-SALE
BARBARA-STREISAND NS NS1.LAMEDELEGATION.NET.
BARBARA-STREISAND NS NS2.LAMEDELEGATION.NET.
MONICAMAGNETTI NS NS21.DOMAINCONTROL
MONICAMAGNETTI NS NS22.DOMAINCONTROL
IGUANA-WORLD NS DNS1.TNIB.DE.
IGUANA-WORLD NS DNS2.TNIB.DE.
IGUANA-WORLD NS DNS3.TNIB.DE.
PERFECTDAYSTUDIOS NS NS2.DYNADOT
PERFECTDAYSTUDIOS NS NS1.DYNADOT
SVCROSS NS NS1.ZONEEDIT
SVCROSS NS NS5.ZONEEDIT
EBEIJING NS NS1.PEER1.NET.
EBEIJING NS NS2.PEER1.NET.
NASHSATTERFIELD NS HOME.GIS.NET.

For the purposes of this tutorial you can ignore the first 35 lines or so. The first domain name the file contains is EnerconTechnologies:

ENERCONTECHNOLOGIES NS NS1.BIZ.RR
ENERCONTECHNOLOGIES NS NS2.BIZ.RR
SELF-DRIVE-CAR-RENTAL NS NS9.IZP
SELF-DRIVE-CAR-RENTAL NS IZA.HOSTING.DIGIWEB.IE.
SELF-DRIVE-CAR-RENTAL NS NS8.FOR-SALE-IF-THE-PRICE-IS-RIGHT

Notice that there's one entry for each name server associated with the domain name.
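To see the one-line-per-name-server layout in numbers, here's a rough sketch that counts the NS records for each domain. It uses the five records shown above, written to a temporary file so the snippet stands on its own. The variable names are mine.

```shell
# The five records from the excerpt above, written to a temp file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
ENERCONTECHNOLOGIES NS NS1.BIZ.RR
ENERCONTECHNOLOGIES NS NS2.BIZ.RR
SELF-DRIVE-CAR-RENTAL NS NS9.IZP
SELF-DRIVE-CAR-RENTAL NS IZA.HOSTING.DIGIWEB.IE.
SELF-DRIVE-CAR-RENTAL NS NS8.FOR-SALE-IF-THE-PRICE-IS-RIGHT
EOF

# Tally the NS records per domain (field 1), then sort the tallies by name.
ns_counts=$(awk '$2 == "NS" { count[$1]++ }
                 END { for (d in count) print d, count[d] }' "$sample" \
  | LC_ALL=C sort)
echo "$ns_counts"
```

Each domain is printed once, followed by the number of name servers listed for it.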

You can confirm the name servers are correct by running whois on a domain name and checking whether the name servers are identical:

$ whois ENERCONTECHNOLOGIES.COM

Whois Server Version 2.0

Domain names in the .com and .net domains can now be registered
with many different competing registrars. Go to www.internic.net for detailed information.

Domain Name: ENERCONTECHNOLOGIES.COM
Registrar: NETWORK SOLUTIONS, LLC.
Whois Server: whois.networksolutions.com
Referral URL: www.networksolutions.com/en_US
Name Server: NS1.BIZ.RR.COM
Name Server: NS2.BIZ.RR.COM
Status: clientTransferProhibited
Updated Date: 03-mar-2012
Creation Date: 03-mar-1999
Expiration Date: 03-mar-2022
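Notice that whois reports NS1.BIZ.RR.COM while the zone file says NS1.BIZ.RR. In the zone file format, a name without a trailing dot is relative to the $ORIGIN (COM. here), while a name ending in a dot is already complete. If you want the full names, a sketch along these lines should work. The fqdn helper is a made-up name for this example, and the .COM suffix is hardcoded rather than read from the $ORIGIN line.

```shell
# Two records from the example zone file: one relative name server,
# one absolute (note the trailing dot).
sample=$(mktemp)
cat > "$sample" <<'EOF'
ENERCONTECHNOLOGIES NS NS1.BIZ.RR
SELF-DRIVE-CAR-RENTAL NS IZA.HOSTING.DIGIWEB.IE.
EOF

# fqdn() (hypothetical helper) strips the trailing dot from absolute
# names and appends the origin, hardcoded as .COM, to relative ones.
expanded=$(awk '
  function fqdn(name) {
    if (name ~ /\.$/) { sub(/\.$/, "", name); return name }
    return name ".COM"
  }
  $2 == "NS" { print fqdn($1), fqdn($3) }
' "$sample")
echo "$expanded"
```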

Extracting just the domain names

In order to extract just the domain names we've got to run a series of commands so that all that's left when we're done is a list of the domain names. We'll do this step by step, though you could easily pipe (|) these commands together to achieve the same result.
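For reference, the piped version might look something like the sketch below. It chains the same awk, sort and egrep steps covered one at a time in the rest of this section, run here on a handful of lines from the example file. I've written grep -E instead of egrep and [A-Z0-9-] for the bracket expression, and I've set LC_ALL=C so the ordering and character ranges don't depend on your locale. Treat those as my choices, not requirements.

```shell
# A few lines from the example zone file, written to a temp file.
sample=$(mktemp)
cat > "$sample" <<'EOF'
$ORIGIN COM.
$TTL 900
COM. 86400 NSEC3PARAM 1 0 0 -
ENERCONTECHNOLOGIES NS NS1.BIZ.RR
ENERCONTECHNOLOGIES NS NS2.BIZ.RR
BOVINUS NS C3P0.CBFENTERPRISES
BOVINUS NS R2D2.CBFENTERPRISES
EOF

# 1) take the first field, 2) sort and de-duplicate,
# 3) keep only strings shaped like a domain label.
domains=$(awk '{print $1}' "$sample" \
  | LC_ALL=C sort -u \
  | LC_ALL=C grep -E '^[A-Z0-9]([A-Z0-9-]{0,61}[A-Z0-9])?$')
echo "$domains"
```

The upside of piping is that no intermediate files get written, which matters when each one would be several gigabytes.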

Notice that each line with a domain name and name server is formatted consistently: it's the domain name, then a space, then the record type (NS), then another space and the name server. If we want to extract just the domain names, we can run a command that extracts everything before the first space. To do this, we use the awk command, which splits each line on whitespace, print the first field ($1), and send the output to first.com.zone.example.txt:

$ awk '{print $1}' com.zone.example.txt > first.com.zone.example.txt

If you examine the resulting first.com.zone.example.txt you'll notice a much cleaner output:

;
;
;
;

$ORIGIN
$TTL
@
1349069930
1800
900
604800
86400
)
$TTL
NS
NS
NS
NS
NS
NS
NS
NS
NS
NS
NS
NS
NS
COM.
COM.
COM.
COM.
COM.
COM.
COM.
ENERCONTECHNOLOGIES
ENERCONTECHNOLOGIES
SELF-DRIVE-CAR-RENTAL
SELF-DRIVE-CAR-RENTAL
SELF-DRIVE-CAR-RENTAL
NANCYVRAINE
NANCYVRAINE
SELFDRIVECARRENTAL
SELFDRIVECARRENTAL
SELFDRIVECARRENTAL
WORLDDATASOURCE
WORLDDATASOURCE
SAUDIPHOTOGRAPHERS
SAUDIPHOTOGRAPHERS
MERCKCHOICE
MERCKCHOICE
ENVIRONMENTALSCHOOLS
ENVIRONMENTALSCHOOLS
EASTHAMPTONHOMES
EASTHAMPTONHOMES
AMERICASHOMEBUILDER
AMERICASHOMEBUILDER
BOVINUS
BOVINUS
CONSTELLATIONCOLLEGE
CONSTELLATIONCOLLEGE
DOCHERTYCONSULTING
DOCHERTYCONSULTING
SONOMETRICS
SONOMETRICS
UNLIMITEDDISCOUNTPHONECALLS
UNLIMITEDDISCOUNTPHONECALLS
UNLIMITEDDISCOUNTPHONECALLS
UNLIMITEDDISCOUNTPHONECALLS
UNLIMITEDDISCOUNTPHONECALLS
FREILAND
FREILAND
KUMA-NET
KUMA-NET
SANGYOSHIEN
SANGYOSHIEN
JONATHANCHARLESNOVAK
JONATHANCHARLESNOVAK
JONATHANCHARLESNOVAK
JONATHANCHARLESNOVAK
HQSINGAPORE
HQSINGAPORE
PANASOURCE
PANASOURCE
PRIVATESAUNAS
PRIVATESAUNAS
BARBARA-STREISAND
BARBARA-STREISAND
MONICAMAGNETTI
MONICAMAGNETTI
IGUANA-WORLD
IGUANA-WORLD
IGUANA-WORLD
PERFECTDAYSTUDIOS
PERFECTDAYSTUDIOS
SVCROSS
SVCROSS
EBEIJING
EBEIJING
NASHSATTERFIELD

Not bad, but there are a lot of duplicates because domain names are listed once for each name server, so let's clean the file up a bit by removing duplicates and sorting the results alphabetically:

$ sort -u first.com.zone.example.txt > sorted_and_unique.com.zone.example.txt

Here we use the sort command with the -u switch to sort the file and remove the duplicates.
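One thing to be aware of is that the order sort produces depends on your locale settings. If you want plain byte order (symbols, then digits, then uppercase letters), you can pin the locale with LC_ALL=C, which is often faster on big files too. Here's a small sketch on a few first-column values. The file is a throwaway temp file, not part of the tutorial's files.

```shell
# A few first-column values, including one duplicate (BOVINUS).
first=$(mktemp)
printf '%s\n' 'NS' 'COM.' 'BOVINUS' '900' '$TTL' 'BOVINUS' '@' > "$first"

# LC_ALL=C makes sort compare raw bytes; -u drops repeated lines.
sorted=$(LC_ALL=C sort -u "$first")
echo "$sorted"
```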

The new sorted_and_unique.com.zone.example.txt is looking pretty good:


$ORIGIN
$TTL
)
1349069930
1800
604800
86400
900
;
@
AMERICASHOMEBUILDER
BARBARA-STREISAND
BOVINUS
COM.
CONSTELLATIONCOLLEGE
DOCHERTYCONSULTING
EASTHAMPTONHOMES
EBEIJING
ENERCONTECHNOLOGIES
ENVIRONMENTALSCHOOLS
FREILAND
HQSINGAPORE
IGUANA-WORLD
JONATHANCHARLESNOVAK
KUMA-NET
MERCKCHOICE
MONICAMAGNETTI
NANCYVRAINE
NASHSATTERFIELD
NS
PANASOURCE
PERFECTDAYSTUDIOS
PRIVATESAUNAS
SANGYOSHIEN
SAUDIPHOTOGRAPHERS
SELF-DRIVE-CAR-RENTAL
SELFDRIVECARRENTAL
SONOMETRICS
SVCROSS
UNLIMITEDDISCOUNTPHONECALLS
WORLDDATASOURCE

The last problem is that there are a number of lines left over that can't possibly be domain names because they contain invalid characters, as in $ORIGIN and COM. (note the trailing dot). We'll use egrep (or grep -E) to extract only the lines that are valid domain names:

$ egrep '^[A-Z0-9]([A-Z0-9-]{0,61}[A-Z0-9])?$' \
sorted_and_unique.com.zone.example.txt > domains.com.zone.example.txt

In case you're curious, that regular expression finds strings that:

  1. Start and end with a letter or a number (we can just look at uppercase letters because that's how all the domain names in the zone file are formatted)
  2. Contain letters, numbers, or dashes in between
  3. Are 1 to 63 characters in length
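If you want to convince yourself the pattern behaves as described, here's a quick sketch that runs it against a few hand-picked strings. The is_valid_label helper and the test strings are made up for this example, and the pattern writes the dash without a backslash, since a backslash inside a bracket expression is just another literal character.

```shell
pattern='^[A-Z0-9]([A-Z0-9-]{0,61}[A-Z0-9])?$'

# is_valid_label (hypothetical helper) succeeds when $1 matches the pattern.
is_valid_label() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eq "$pattern"
}

# Build a 63-character and a 64-character string of A's to test the length limit.
label63=$(head -c 63 /dev/zero | tr '\0' 'A')
label64=$(head -c 64 /dev/zero | tr '\0' 'A')

for name in KUMA-NET A -BAD BAD- COM. "$label63" "$label64"; do
  if is_valid_label "$name"; then verdict=valid; else verdict=rejected; fi
  echo "${#name} chars: $verdict"
done
```

KUMA-NET, A and the 63-character string should come back as valid. The names with a leading or trailing dash, the one with a trailing dot, and the 64-character string should be rejected.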

At last, we have a file we can work with:

1349069930
1800
604800
86400
900
AMERICASHOMEBUILDER
BARBARA-STREISAND
BOVINUS
CONSTELLATIONCOLLEGE
DOCHERTYCONSULTING
EASTHAMPTONHOMES
EBEIJING
ENERCONTECHNOLOGIES
ENVIRONMENTALSCHOOLS
FREILAND
HQSINGAPORE
IGUANA-WORLD
JONATHANCHARLESNOVAK
KUMA-NET
MERCKCHOICE
MONICAMAGNETTI
NANCYVRAINE
NASHSATTERFIELD
NS
PANASOURCE
PERFECTDAYSTUDIOS
PRIVATESAUNAS
SANGYOSHIEN
SAUDIPHOTOGRAPHERS
SELF-DRIVE-CAR-RENTAL
SELFDRIVECARRENTAL
SONOMETRICS
SVCROSS
UNLIMITEDDISCOUNTPHONECALLS
WORLDDATASOURCE

The one inaccuracy in this list is the numbers at the top, which come from the SOA record at the start of the zone file rather than from actual domain name entries; the NS entry likewise comes from the zone's own name server records. You could remove these, though the resulting list is more than accurate enough to analyze and use at this point (and all of those numbers do have corresponding .coms except for 1349069930 anyway).
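If you do want to drop them, one option is to keep only the lines whose second field is NS before taking the first field. This is a sketch that assumes every delegation line has the "name NS nameserver" layout seen in the example file; I haven't checked it against the full zone file. The RRSIG line below has its signature replaced with a placeholder to keep the sample short.

```shell
# Header-style lines plus a few delegations from the example zone file.
# SIGNATURE-OMITTED stands in for the long base64 signature.
sample=$(mktemp)
cat > "$sample" <<'EOF'
@ IN SOA a.gtld-servers.net. nstld.verisign-grs.com. (
1349069930 ;serial
1800 ;refresh every 30 min
)
NS A.GTLD-SERVERS.NET.
COM. RRSIG NS 8 1 172800 20121006041758 20120929030758 47783 COM. SIGNATURE-OMITTED
KUMA-NET NS UNS01.LOLIPOP.JP.
KUMA-NET NS UNS02.LOLIPOP.JP.
EBEIJING NS NS1.PEER1.NET.
EOF

# Only delegation lines have NS as their second field, so the SOA
# numbers, the zone's own NS lines and the RRSIG line all fall away.
clean=$(awk '$2 == "NS" { print $1 }' "$sample" | LC_ALL=C sort -u)
echo "$clean"
```

Filtering on the record type sidesteps the numbers entirely, because the serial and timer lines never have NS in the second column.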

Hope you found this tutorial useful. If you have any questions or notice anything that can be done more efficiently, please drop me a note at matt@leandomainsearch.com.

Thanks!
