从bind及dig看DNS工作原理

一、DNS的基本功能

在互联网中,从域名到IP地址的转换是一个基础功能,之前一直想结合流行的DNS服务器bind来看下服务器侧的配置,所以最近有时间就总结一下。对于应用(或者说客户端)来说,对域名服务的使用主要基于C库的gethostbyname函数,该函数实现比较复杂,事实上,在glibc的根目录下有一个专门的resolv文件夹用来完成该解析功能,而这其中最为重要的就是gethostbyname及其相关的函数簇。

二、DNS server的工作模式

稍微了解过DNS工作原理的应该都听说过DNS查询的时候是可能要多次递归查询的,以www.gnu.org这个域名为例,正则的查询是先到根目录服务器上查找到org域名服务器的地址,然后到org域名服务器上查找到gnu.org域名服务器,然后再到gnu.org的域名服务器上查找www.gnu.org这个机器的IP地址。但是这里的描述忽略了一些重要的细节,这个递归查询/迭代到底是谁来查询的,是在客户端gethostbyname函数完成,还是由DNS server完成?如果是在域名服务器中完成,那么到底是在哪个域名服务器完成?同样是域名服务器,为什么有的只提供域名服务器的间接地址,而有的却不厌其烦的刨根问底,直到找到最终的目的IP地址?如果有gethostbyname在客户端完成,客户端抓包是不是可以看到很多往返来往的报文。

1、谁来完成迭代

这里可以大致考虑下:如果在客户端多次查询,这个地方应该是有浪费的,因为这些查询比较分散,并且DNSserver不能知道最终的查询结果,从而不利于进行查询结果的缓存。最好的策略是让DNS服务器完成迭代查询并且将查询结果缓存起来,从而可以被其它客户端复用。

2、谁不能迭代

下面文章中,其中说到了重要的一点
STEP 3: As the answer for the query is not available with the DNS server 172.16.200.30, this server sends a query to one of the DNS root server,for the answer. Now an important fact to note here is that root server‘s are always iterative servers.
也即是根服务器总是迭代的,也就是对于www.gnu.org,根服务器只会给出org这个域名的DNS查询服务器,来这个服务器查询的客户端需要再次到返回的服务器上执行新的查询。

3、递归服务器的问题

当需要进行递归查询时,其实可以看到在DNS服务器山存在异步的流程:也就是当一个客户端请求过来的时候,DNS服务器要记录下这个客户端的请求信息(包括地址、端口、请求的域名等),同时不断的到其它DNS上进行迭代查询,这个对服务器来说其实是一个比较大的开销,因为异步必须记录状态,记录状态必然会占用内存。
现在反观根服务器,它们始终只是提供迭代服务,这意味着它们没有异步状态:当一个查询过来的时候,它只需要回复去哪个服务器查找,这个查询只查询本地数据即可。考虑到根服务器的请求量,这种设计和实现是比较合理的。

三、协议的报文格式

报文格式的说明可以在这里找到

1、请求报文

对于是否迭代,在请求报文中有一个专门的RD标志位(Recurse Demain),如果客户端设置了这个标志位,表示要求DNS server进行递归,也就是说要求DNS给出该域名的最终IP地址,但是这个只是客户端的请求(demand)。具体服务器是不是满足这个请求还是要看服务器的配置,对于当前的bind服务器来说,在配置文件的option的recursion可以确定是否进行递归,如果配置了禁止递归,那么即使请求报文置位了RD标志位,DNS还是不会迭代。但是反过来说,如果客户端的RD没有置位,那么server一定不能进行迭代。
options {
query-source address 9.9.9.9;
port 53;
pid-file "named.pid";
listen-on {9.9.9.9;};
listen-on-v6 {none;};
recursion yes;
notify yes;
};

2、应答报文

从这个报文格式中可以看到,在每个回报结构中有专门的“Authority count” 和 “Authority”字段,这个字段表示了指定的权威服务器的位置,其实也就是给出的客户端去再次查询的服务器的位置。假设向一个DNS查询并且禁用了递归(RD清零),那么此时回报中可能就只有这个Authority section中有效,也就是这个字段告诉了请求客户端需要到这个地址再次进行查询。
Response section is empty in general,
Authority section contains NS records describing "a better zone server" for next iteration. (Worst case is root zone)
Additional section contains if known addresses for servers described in authority section.

四、单机测试

可以在本机上启动一个named服务器,配合dig命令可以看到不同情况下DNS对于请求的处理报文

0、基础配置文件

/etc/named.con
options {
query-source address 9.9.9.9;
port 53;
pid-file "named.pid";
listen-on {127.0.0.1;};
listen-on-v6 {none;};
recursion yes;
notify yes;
};

view "internal" {
match-clients { 10.53.0.2;
10.53.0.3; };

zone "." {
type hint;
file "root.hint";
};

zone "example" {
type master;
file "internal.db";
allow-update { any; };
};
};

view "external" {
match-clients { any; };

zone "." {
type hint;
file "root.hint";
};

zone "example" {
type master;
file "example.db";
};
};

/etc/root.hint
; Copyright (C) 2000 Internet Software Consortium.
;
; Permission to use, copy, modify, and distribute this software for any
; purpose with or without fee is hereby granted, provided that the above
; copyright notice and this permission notice appear in all copies.
;
; THE SOFTWARE IS PROVIDED "AS IS" AND INTERNET SOFTWARE CONSORTIUM DISCLAIMS
; ALL WARRANTIES WITH REGARD TO THIS SOFTWARE INCLUDING ALL IMPLIED WARRANTIES
; OF MERCHANTABILITY AND FITNESS. IN NO EVENT SHALL INTERNET SOFTWARE
; CONSORTIUM BE LIABLE FOR ANY SPECIAL, DIRECT, INDIRECT, OR CONSEQUENTIAL
; DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR
; PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS
; ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS
; SOFTWARE.

; $Id: root.hint,v 1.3 2000/06/22 21:52:55 tale Exp $

$TTL 999999
. IN NS a.root-servers.nil.
a.root-servers.nil. IN A 10.12.216.180

1、服务器启用recursion、客户端启用RD

执行dig命令
: dig @127.0.0.1 www.no.exist.test

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.2 <<>> @127.0.0.1 www.no.exist.test
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 64638
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 2048
;; QUESTION SECTION:
;www.no.exist.test. IN A

;; Query time: 38 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jul 03 15:14:57 CST 2020
;; MSG SIZE rcvd: 46

抓包输出
tcpdump: listening on any, link-type LINUX_SLL (Linux cooked), capture size 65535 bytes
15:14:57.735988 IP (tos 0x0, ttl 64, id 34993, offset 0, flags [none], proto UDP (17), length 74)
VM_15_187_centos.52918 > VM_15_187_centos.domain: [bad udp cksum 0xfe49 -> 0x0b9e!] 64638+ [1au] A? www.no.exist.test. ar: . OPT UDPsize=4096 (46)
15:14:57.736249 IP (tos 0x0, ttl 64, id 55832, offset 0, flags [DF], proto UDP (17), length 74)
9.9.9.9.55234 > 10.12.216.180.domain: [udp sum ok] 52155% [1au] A? www.no.exist.test. ar: . OPT UDPsize=2048 (46)

可以看到,本地服务器向bind中配置的10.12.216.180进行了查询。

2、服务器启用recursion、客户端禁用RD

可以看到,在回报的”AUTHORITY SECTION“和”ADDITIONAL SECTION:“中包含了权威服务器的域名和IP地址
: dig +norecurse @127.0.0.1 www.no.exist.testt

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.2 <<>> +norecurse @127.0.0.1 www.no.exist.test
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 20342
;; flags: qr ra ad; QUERY: 1, ANSWER: 0, AUTHORITY: 13, ADDITIONAL: 14

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 2048
;; QUESTION SECTION:
;www.no.exist.test. IN A

;; AUTHORITY SECTION:
. 60095 IN NS k.root-servers.net.
. 60095 IN NS h.root-servers.net.
. 60095 IN NS j.root-servers.net.
. 60095 IN NS i.root-servers.net.
. 60095 IN NS e.root-servers.net.
. 60095 IN NS c.root-servers.net.
. 60095 IN NS a.root-servers.net.
. 60095 IN NS m.root-servers.net.
. 60095 IN NS l.root-servers.net.
. 60095 IN NS b.root-servers.net.
. 60095 IN NS g.root-servers.net.
. 60095 IN NS d.root-servers.net.
. 60095 IN NS f.root-servers.net.

;; ADDITIONAL SECTION:
f.root-servers.net. 31823 IN A 192.5.5.241
k.root-servers.net. 28825 IN A 193.0.14.129
h.root-servers.net. 29235 IN A 198.97.190.53
j.root-servers.net. 33803 IN A 192.58.128.30
i.root-servers.net. 39413 IN A 192.36.148.17
e.root-servers.net. 55941 IN A 192.203.230.10
c.root-servers.net. 38867 IN A 192.33.4.12
a.root-servers.net. 63886 IN A 198.41.0.4
m.root-servers.net. 28815 IN A 202.12.27.33
l.root-servers.net. 59789 IN A 199.7.83.42
b.root-servers.net. 36953 IN A 199.9.14.201
g.root-servers.net. 60093 IN A 192.112.36.4
d.root-servers.net. 71961 IN A 199.7.91.13

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jul 03 15:19:22 CST 2020
;; MSG SIZE rcvd: 465

:

3、服务器禁用recursion、客户端启用RD

named.conf中option设置recursion no;
: dig +recurse @127.0.0.1 www.no.exist.test

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.2 <<>> +recurse @127.0.0.1 www.no.exist.test
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 42574
;; flags: qr rd ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 2048
;; QUESTION SECTION:
;www.no.exist.test. IN A

;; AUTHORITY SECTION:
. 999999 IN NS a.root-servers.nil.

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jul 03 15:22:53 CST 2020
;; MSG SIZE rcvd: 77

:

4、服务器禁用recursion、客户端禁用RD

: dig +norecurse @127.0.0.1 www.no.exist.test

; <<>> DiG 9.9.4-RedHat-9.9.4-29.el7_2.2 <<>> +norecurse @127.0.0.1 www.no.exist.test
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 2502
;; flags: qr ad; QUERY: 1, ANSWER: 0, AUTHORITY: 1, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 2048
;; QUESTION SECTION:
;www.no.exist.test. IN A

;; AUTHORITY SECTION:
. 999999 IN NS a.root-servers.nil.

;; Query time: 0 msec
;; SERVER: 127.0.0.1#53(127.0.0.1)
;; WHEN: Fri Jul 03 15:23:23 CST 2020
;; MSG SIZE rcvd: 77

:

四、bind工程中代码备忘

该部分代码属于大致调试了下关键流程,只是简单做个备忘,具体的内容不做进一步详细分析。

1、一个表示异步的客户端

为了处理异步,当一个客户端查询请求过来的时候,需要创建一个对应的ns_client_t对象
bind9-9_0\bin\named\client.c
/*
* Handle an incoming request event from the dispatch (UDP case)
* or tcpmsg (TCP case).
*/
static void
client_request(isc_task_t *task, isc_event_t *event) {
……
}

2、客户端请求相关逻辑

对DNS来说,客户端过来的请求叫做“query”
bind9-9_0\bin\named\query.c
static void
query_find(ns_client_t *client, dns_fetchevent_t *event) {
……
/*
* Now look for an answer in the database.
*/
result = dns_db_find(db, client->query.qname, version, type,
client->query.dboptions, client->now,
&node, fname, rdataset, sigrdataset);
……
}

3、DNS服务器去远端服务器查询

对DNS来说,去其它DNS服务器查询的操作为fetch,一个调用链为
(gdb) bt
#0 dns_resolver_createfetch (res=0x7ffff7ed90e0, name=0x7ffff7eafb00, type=1, domain=0x7ffff7eafab0, nameservers=0x7fffe80cc310, forwarders=0x0, options=0, task=0x7ffff7ea1c90,
action=0x40f8db <query_resume>, arg=0x7ffff7ec93f0, rdataset=0x7fffe80cc1b0, sigrdataset=0x7ffff7ea7f88, fetchp=0x7ffff7ec95e8) at resolver.c:4720
#1 0x000000000040fe98 in query_recurse (client=0x7ffff7ec93f0, qtype=1, qdomain=0x7ffff7eafab0, nameservers=0x7fffe80cc310) at query.c:1919
#2 0x0000000000410bf3 in query_find (client=0x7ffff7ec93f0, event=0x0) at query.c:2408
#3 0x0000000000411f38 in ns_query_start (client=0x7ffff7ec93f0) at query.c:3055
#4 0x0000000000405bba in client_request (task=0x7ffff7ea1c90, event=0x7fffe80c41b0) at client.c:1106
#5 0x000000000056555d in run (uap=0x7ffff7e9f010) at task.c:799
#6 0x00007ffff79ade25 in start_thread () from /lib64/libpthread.so.0
#7 0x00007ffff76da35d in clone () from /lib64/libc.so.6
(gdb)

4、bind如何自后向前进行域名匹配

域名的一个特殊性在于它是反相匹配的,所以对于客户端的请求和本地数据库内容要从后向前匹配,这个在服务器的实现中并不是真正转换,而只是提前统计了点分位置,然后根据统计信息进行转换。从下面代码中可以看到,它是从后向前进行拆分之后的domain逐个匹配的
bind9-9_0\lib\dns\name.c
dns_namereln_t
dns_name_fullcompare(const dns_name_t *name1, const dns_name_t *name2,
int *orderp,
unsigned int *nlabelsp, unsigned int *nbitsp)
{
……
l1 = name1->labels;
l2 = name2->labels;
ldiff = (int)l1 - (int)l2;
if (ldiff < 0)
l = l1;
else
l = l2;

while (l > 0) {
l--;
l1--;
l2--;
label1 = &name1->ndata[offsets1[l1]];
label2 = &name2->ndata[offsets2[l2]];
count1 = *label1++;
count2 = *label2++;
……
}
……
}

5、递归模式下的迭代

a、对其它DNS服务器回报处理,可以看到如果权威服务器section中有字段,则返回DNS_R_DELEGATION
static isc_result_t
noanswer_response(fetchctx_t *fctx, dns_name_t *oqname) {
……
/*
* If the current qname is not a subdomain of the query
* domain, there‘s no point in looking at the authority
* section without doing DNSSEC validation.
*
* Until we do that validation, we‘ll just return success
* in this case.
*/
if (!dns_name_issubdomain(qname, &fctx->domain))
return (ISC_R_SUCCESS);
……

/*
* Set the current query domain to the referral name.
*
* XXXRTH We should check if we‘re in forward-only mode, and
* if so we should bail out.
*/
INSIST(dns_name_countlabels(&fctx->domain) > 0);
dns_name_free(&fctx->domain, fctx->res->mctx);
if (dns_rdataset_isassociated(&fctx->nameservers))
dns_rdataset_disassociate(&fctx->nameservers);
dns_name_init(&fctx->domain, NULL);
result = dns_name_dup(ns_name, fctx->res->mctx, &fctx->domain);
if (result != ISC_R_SUCCESS)
return (result);
fctx->attributes |= FCTX_ATTR_WANTCACHE;
return (DNS_R_DELEGATION);
}
……
}

b、迭代过程
其中的fctx_try会触发前面的回报处理,从而形成一个间接的递归调用
static void
resquery_response(isc_task_t *task, isc_event_t *event) {
……
result = noanswer_response(fctx, NULL);
if (result == DNS_R_DELEGATION) {
/*
* We don‘t have the answer, but we know a better
* place to look.
*/
get_nameservers = ISC_TRUE;
keep_trying = ISC_TRUE;
result = ISC_R_SUCCESS;
}
……

if (keep_trying) {
……
/*
* Try again.
*/
fctx_try(fctx);
……
}

五、glibc中相关代码

1、如何查看C库当前使用的__res_state

: cat gethost.cpp
#include <netdb.h>
#include <resolv.h>
//extern "C" struct *__res_state _res;

struct __res_state * pstste = &_res;
int main()
{

gethostbyname("www.ridicu.org.not.exit");
//__res_state ();
}
: g++ gethost.cpp
: ./a.out
(gdb) p *pstste
$2 = {retrans = 5, retry = 2, options = 524993, nscount = 3, nsaddr_list = {{sin_family = 2, sin_port = 13568, sin_addr = {s_addr = 1651997450}, sin_zero = "\000\000\000\000\000\000\000"}, {
sin_family = 2, sin_port = 13568, sin_addr = {s_addr = 1844972554}, sin_zero = "\000\000\000\000\000\000\000"}, {sin_family = 2, sin_port = 13568, sin_addr = {s_addr = 1853389578},
sin_zero = "\000\000\000\000\000\000\000"}}, id = 13280, dnsrch = {0x7ffff75b8ac0 <_2.2.5+128> "not.exist", 0x0, 0x0, 0x0, 0x0, 0x0, 0x0},
defdname = "not.exist", ‘\000‘ <repeats 246 times>, pfcode = 0, ndots = 1, nsort = 0, ipv6_unavail = 0, unused = 0, sort_list = {{addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0},
mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {
s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}, {addr = {s_addr = 0}, mask = 0}}, qhook = 0x0, rhook = 0x0, res_h_errno = 0, _vcsock = -1, _flags = 0, _u = {
pad = "\000\000\003\000\003\000\003", ‘\000‘ <repeats 44 times>, _ext = {nscount = 0, nsmap = {3, 3, 3}, nssocks = {0, 0, 0}, nscount6 = 0, nsinit = 0, nsaddrs = {0x0, 0x0, 0x0},
_initstamp = {0, 0}}}}
(gdb)

2、/etc/resolv.conf中search的作用

然后在/etc/resolv.conf文件中添加一个
search not.exit.foo
可以看到抓包的内容中多了一个" www.ridicu.org.not.exit.foo.",也就是在这个基础上追加了search后的字符串然后再次查询
20:28:29.840008 IP 9.9.9.9.51252 > 10.123.119.98.domain: 30130+ A? www.ridicu.org.not.exit. (41)
20:28:29.932262 IP 10.123.119.98.domain > 9.9.9.9.51252: 30130 NXDomain 0/1/0 (116)
20:28:29.932375 IP 9.9.9.9.59114 > 10.123.119.98.domain: 6007+ A? www.ridicu.org.not.exit.foo. (45)
20:28:29.980252 IP 10.123.119.98.domain > 9.9.9.9.59114: 6007 NXDomain 0/1/0 (143)

六、一个小问题

在抓包的协议中,可以看到地址后面有一个多于的点符号,这个其实是tcmdump对于这些包做了特殊打印的处理,其实包中发送的内容并不是原本的文本格式的原始域名,而是做了转换,下面的位置描述了转换方法。
For example, “www.xyzindustries.com” would be encoded as:
“[3] w w w [13] x y z i n d u s t r i e s [3] c o m [0]”
I have shown the label lengths in square brackets to distinguish them. Remember that these label lengths are binary encoded numbers, so a single byte can hold a value from 0 to 255; that “[13]” is one byte and not two, as you can see in Figure 252. Labels are actually limited to a maximum of 63 characters, and we‘ll see shortly why this is significant.
也就是在每个部分开头都是一个长度字段开始的,最后结束的时候有一个0,通常的展示都是将这个结尾的0展示成了‘.‘。该格式在RFC1035的“4.1.2. Question section format”中对于QNAME也有描述。
由于这种编码格式,其实domainname的长度是有限制的,在这个帖子中讨论了这些限制:

253 characters is the maximum length of full domain name, including dots: e.g. www.example.com = 15 characters.
63 characters in the maximum length of a "label" (part of domain name separated by dot). Labels for www.example.com are com, example and www.

This is an example of the domain with longest possible label (a fully working website BTW): http://www.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com/. The domain name length = 71 characters.

This will be an example of longest domain name: abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcde.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.abcdefghijklmnopqrstuvwxyzabcdefghijklmnopqrstuvwxyzabcdefghijk.com