Kafka Client Code Walkthrough

Reposted from: http://backend.blog.163.com/blog/static/202294126201431724652597/

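You can use the Kafka binary package and its dependencies downloaded from the server side, or fetch them through Maven (note: in practice, the artifact obtained this way turned out to be built with JDK 7):

    <dependency>
        <groupId>com.sksamuel.kafka</groupId>
        <artifactId>kafka_2.10</artifactId>
        <version>0.8.0-beta1</version>
    </dependency>

Producer

Below is an example of producer code:

    import java.util.Properties;
    import kafka.javaapi.producer.Producer;
    import kafka.producer.KeyedMessage;
    import kafka.producer.ProducerConfig;

    Properties props = new Properties();
    // Kafka brokers to bootstrap from. You do not need to list every broker in the
    // cluster; a subset is enough, because the client fetches the metadata and then
    // connects to the right brokers by itself.
    props.put("metadata.broker.list", "172.17.1.163:9093");
    // Serializer used when sending messages to the broker. A serializer can also be
    // given when sending a message; otherwise this one is the default.
    props.put("serializer.class", "kafka.serializer.StringEncoder");
    // How messages are mapped to partitions. If unset, the default partitioner hashes
    // the key, and a message without a key goes to a random partition.
    props.put("partitioner.class", "example.producer.SimplePartitioner");
    // Require the broker to acknowledge each message once it has been received,
    // so that data is not lost.
    props.put("request.required.acks", "1");
    ProducerConfig config = new ProducerConfig(props);
    // Declare the producer: type parameter 1 is the partition key type, type parameter 2 the message type.
    Producer<String, String> producer = new Producer<String, String>(config);
    // Send a KeyedMessage: argument 1 is the topic, argument 2 the partition key
    // (if null, the message goes to a random partition), argument 3 the message.
    producer.send(new KeyedMessage<String, String>("topic", "partitionKey1", "msg1"));
    producer.close();

A custom partitioner:

    import kafka.producer.Partitioner;
    import kafka.utils.VerifiableProperties;

    public class SimplePartitioner implements Partitioner<String> {
        // Kafka instantiates the partitioner with the producer properties.
        public SimplePartitioner(VerifiableProperties props) {
        }

        public int partition(String key, int a_numPartitions) {
            return key.length() % a_numPartitions;
        }
    }

Consumer

The consumer API comes in a high-level and a low-level flavor. Below is a consumer built on the high-level API, for the case where you do not care about message offsets and only want the data. Notes:

1. The high-level API persists the offset of the last message read from each partition. The data is stored in ZooKeeper under the consumer group name (for example /consumers/id1/offsets/test2/2, where id1 is the consumer group, test2 is the topic, and the final 2 is partition 2, that is, the third partition). The offset is updated at a short, fixed interval, so a restarted consumer may receive messages it has already seen. Duplicates are also possible when the leader of a partition changes. For this reason it is best to wait a while (10 s) before calling shutdown() when stopping a consumer.
2. The consumer group name is global, so make sure old consumers are shut down before new ones start. If a new process starts with the same group name, Kafka adds its threads to the set of threads available to consume the topic and triggers a rebalance, so messages from one partition may end up going to a different process.
3. If there are more consumer threads than partitions, some threads will never see a message.
4. If there are more partitions than threads, some threads will receive messages from several partitions.
5. If one thread handles several partitions, the order of the messages it receives is not guaranteed across those partitions.

Remark: you can check the assignment with a ZooKeeper command such as get /consumers/id1/owners/test3/2, where id1 is the consumer group, test3 is the topic, and 2 is the third partition. A value such as id1_163-PC-1382409386474-1091aef2-1 identifies the thread that owns this partition.

Example:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Properties;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import kafka.consumer.Consumer;
    import kafka.consumer.ConsumerConfig;
    import kafka.consumer.ConsumerIterator;
    import kafka.consumer.KafkaStream;
    import kafka.javaapi.consumer.ConsumerConnector;

    Properties props = new Properties();
    // ZooKeeper server address.
    props.put("zookeeper.connect", "172.17.1.163:2181");
    // Consumer group (created automatically if it does not exist).
    props.put("group.id", "id1");
    // How long (ms) Kafka waits for ZooKeeper to respond before giving up and continuing to consume.
    props.put("zookeeper.session.timeout.ms", "4000");
    // How far (ms) a ZooKeeper follower may lag behind before an error is raised.
    props.put("zookeeper.sync.time.ms", "2000");
    // How often the consumer writes offsets to ZooKeeper. Offsets are committed on a timer,
    // not per message, so after a failure and restart you may receive messages again.
    props.put("auto.commit.interval.ms", "1000");
    ConsumerConnector consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
    // Tell Kafka how many threads this process uses for each topic.
    Map<String, Integer> topicCountMap = new HashMap<String, Integer>();
    int a_numThreads = 3;
    // Use 3 threads for topic test2.
    topicCountMap.put("test2", a_numThreads);
    // Get the streams for each topic.
    Map<String, List<KafkaStream<byte[], byte[]>>> consumerMap = consumer.createMessageStreams(topicCountMap);
    List<KafkaStream<byte[], byte[]>> streams = consumerMap.get("test2");
    // Process the topic with a thread pool, one task per stream.
    ExecutorService executor = Executors.newFixedThreadPool(a_numThreads);
    for (final KafkaStream<byte[], byte[]> stream : streams) {
        executor.submit(new Runnable() {
            public void run() {
                ConsumerIterator<byte[], byte[]> it = stream.iterator();
                while (it.hasNext()) {
                    System.out.println(Thread.currentThread() + ":" + new String(it.next().message()));
                }
            }
        });
    }
    System.in.read();
    // Shut down.
    if (consumer != null) consumer.shutdown();
    if (executor != null) executor.shutdown();

Below is a consumer built on the low-level API, for cases where you need to read a message several times, read from only some of the partitions, or use transactions to make sure each message is processed exactly once. Notes:

1. You must persist the offsets yourself when you stop consuming.
2. You must find out yourself which broker is the leader for your topic and partition.
3. You must handle leader changes yourself.

Steps for using it:

1. Find the broker that is the leader for the topic and partition you want to read.
2. Determine which brokers hold the replicas of your topic and partition.
3. Build the request that describes the data you are interested in.
4. Fetch the data.
5. Detect leader changes and recover from them.

Example: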

    import java.nio.ByteBuffer;
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import kafka.api.FetchRequest;
    import kafka.api.FetchRequestBuilder;
    import kafka.api.PartitionOffsetRequestInfo;
    import kafka.cluster.Broker;
    import kafka.common.TopicAndPartition;
    import kafka.javaapi.*;
    import kafka.javaapi.consumer.SimpleConsumer;
    import kafka.message.MessageAndOffset;

    String topic = "test2";
    int partition = 1;
    String brokers = "172.17.1.163:9093";
    int maxReads = 100; // how many messages to read

    // 1. Find the leader.
    PartitionMetadata metadata = null;
    for (String ipPort : brokers.split(",")) {
        // The list does not need every broker; it is only used to fetch the metadata,
        // so one reachable broker is enough.
        SimpleConsumer consumer = null;
        try {
            String[] ipPortArray = ipPort.split(":");
            consumer = new SimpleConsumer(ipPortArray[0], Integer.parseInt(ipPortArray[1]),
                    100000, 64 * 1024, "leaderLookup");
            List<String> topics = new ArrayList<String>();
            topics.add(topic);
            TopicMetadataRequest req = new TopicMetadataRequest(topics);
            // Fetch the metadata.
            TopicMetadataResponse resp = consumer.send(req);
            // The response carries one TopicMetadata per requested topic, so there is a single entry here.
            List<TopicMetadata> metaData = resp.topicsMetadata();
            for (TopicMetadata item : metaData) {
                for (PartitionMetadata part : item.partitionsMetadata()) {
                    // Walk the partitions and keep only the metadata of the one we care about.
                    System.out.println("----" + part.partitionId());
                    if (part.partitionId() == partition) {
                        metadata = part;
                        break;
                    }
                }
            }
        } catch (Exception e) {
            System.out.println("Error communicating with Broker [" + ipPort + "] to find Leader for ["
                    + topic + ", " + partition + "] Reason: " + e);
        } finally {
            if (consumer != null) consumer.close();
        }
    }
    if (metadata == null || metadata.leader() == null) {
        System.out.println("meta data or leader not found, exit.");
        return;
    }
    // The leader.
    Broker leadBroker = metadata.leader();
    // All replicas.
    System.out.println(metadata.replicas());

    // 2. Get the starting offset. Two options: start from the earliest message or
    // from the latest one. This one starts from the earliest.
    long whichTime = kafka.api.OffsetRequest.EarliestTime();
    // This one starts from the latest.
    // long whichTime = kafka.api.OffsetRequest.LatestTime();
    System.out.println("lastTime:" + whichTime);
    String clientName = "Client_" + topic + "_" + partition;
    SimpleConsumer consumer = new SimpleConsumer(leadBroker.host(), leadBroker.port(),
            100000, 64 * 1024, clientName);
    TopicAndPartition topicAndPartition = new TopicAndPartition(topic, partition);
    Map<TopicAndPartition, PartitionOffsetRequestInfo> requestInfo =
            new HashMap<TopicAndPartition, PartitionOffsetRequestInfo>();
    requestInfo.put(topicAndPartition, new PartitionOffsetRequestInfo(whichTime, 1));
    OffsetRequest request = new OffsetRequest(requestInfo,
            kafka.api.OffsetRequest.CurrentVersion(), clientName);
    // Get the list of valid offsets before the given time.
    OffsetResponse response = consumer.getOffsetsBefore(request);
    if (response.hasError()) {
        System.out.println("Error fetching offset data from the Broker. Reason: "
                + response.errorCode(topic, partition));
        return;
    }
    // Never assume that offsets start at 0.
    long[] offsets = response.offsets(topic, partition);
    System.out.println("offset list:" + Arrays.toString(offsets));
    long offset = offsets[0];

    // 3. Read the data.
    while (maxReads > 0) {
        // Do not call replicaId() on the builder; it is for internal use only.
        FetchRequest req = new FetchRequestBuilder().clientId(clientName)
                .addFetch(topic, partition, offset, 100000).build();
        FetchResponse fetchResponse = consumer.fetch(req);
        if (fetchResponse.hasError()) {
            // Error handling: this example simply returns. In practice, inspect the code,
            // e.g. code == ErrorMapping.OffsetOutOfRangeCode() means the offset is invalid.
            // Typical recovery is to fetch the offset again, or to look up the leader again
            // and create a new consumer. Wrap the steps above in methods and call them from
            // this loop. While looking up the leader you can also refresh the broker list
            // from metadata.replicas(). ZooKeeper may not notice a dead broker immediately,
            // so if the new leader equals the old one, either the leader never died or
            // ZooKeeper has not detected it yet; in both cases, wait a little and retry.
            short code = fetchResponse.errorCode(topic, partition);
            System.out.println("Error fetching data from the Broker:" + leadBroker + " Reason: " + code);
            return;
        }
        // Read one batch of messages.
        boolean empty = true;
        for (MessageAndOffset messageAndOffset : fetchResponse.messageSet(topic, partition)) {
            empty = false;
            long curOffset = messageAndOffset.offset();
            // This check is needed: with compressed messages, a fetch returns a whole block,
            // and the first message in the decompressed block is not necessarily the one at
            // the requested offset, so messages that were already read can show up again.
            if (curOffset < offset) {
                System.out.println("Found an old offset: " + curOffset + " Expecting: " + offset);
                continue;
            }
            // The current message tells us the offset of the next one.
            offset = messageAndOffset.nextOffset();
            ByteBuffer payload = messageAndOffset.message().payload();
            byte[] bytes = new byte[payload.limit()];
            payload.get(bytes);
            System.out.println(String.valueOf(messageAndOffset.offset()) + ": " + new String(bytes, "UTF-8"));
            maxReads--; // count down, otherwise the loop never ends
        }
        // Nothing arrived: wait a moment before fetching the next batch.
        if (empty) {
            Thread.sleep(1000);
        }
    }
    // Clean up (shown here for completeness).
    if (consumer != null) consumer.close();
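The offset check inside the fetch loop can be tried out without a broker. The following is a minimal sketch, not Kafka code: OffsetFilter and FetchedMessage are hypothetical names standing in for the loop body and for MessageAndOffset, the batch is built by hand, and nextOffset() is simplified to offset + 1. It assumes a compressed block that begins two messages before the requested offset.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class OffsetFilter {
    /** Hypothetical stand-in for kafka.message.MessageAndOffset. */
    public static final class FetchedMessage {
        final long offset;
        final String payload;

        public FetchedMessage(long offset, String payload) {
            this.offset = offset;
            this.payload = payload;
        }

        // Simplifying assumption for this sketch: offsets are consecutive.
        long nextOffset() {
            return offset + 1;
        }
    }

    // Offset of the next message we still want to see.
    private long expected;

    public OffsetFilter(long startOffset) {
        this.expected = startOffset;
    }

    /** Returns the payloads at or after the expected offset and advances it. */
    public List<String> accept(List<FetchedMessage> batch) {
        List<String> fresh = new ArrayList<String>();
        for (FetchedMessage m : batch) {
            if (m.offset < expected) {
                continue; // already consumed; delivered again as part of a block
            }
            expected = m.nextOffset();
            fresh.add(m.payload);
        }
        return fresh;
    }

    public long nextOffset() {
        return expected;
    }

    public static void main(String[] args) {
        // We asked for offset 12, but the block starts at offset 10.
        OffsetFilter filter = new OffsetFilter(12);
        List<FetchedMessage> batch = Arrays.asList(
                new FetchedMessage(10, "a"),
                new FetchedMessage(11, "b"),
                new FetchedMessage(12, "c"),
                new FetchedMessage(13, "d"));
        System.out.println(filter.accept(batch) + " next=" + filter.nextOffset());
    }
}
```

Running main prints [c, d] next=14: the two messages below the requested offset are dropped, and the next fetch would start at offset 14.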

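There is also a Hadoop-specific API for saving messages into Hadoop (omitted here).

Consumer configuration

Settings for the consumer, or for consumer.properties:

zookeeper.connect: address of the ZooKeeper server to connect to.
zookeeper.session.timeout.ms: ZooKeeper session timeout, default 6000 ms. It is used to detect dead consumers: when a consumer dies, the other consumers only notice after this time has passed, and then a rebalance is triggered.
group.id: the consumer group.
auto.commit.enable: whether to commit automatically, meaning the client periodically writes its offsets to ZooKeeper. Default true.
auto.commit.interval.ms: interval between automatic commits. Default 60 * 1000.
auto.offset.reset: what to do when ZooKeeper has no offset or the offset is out of range. smallest resets to the smallest current offset, largest resets to the largest current offset, and any other value throws an exception. Default largest.
consumer.timeout.ms: throw an exception if no message has arrived within this time. Default -1 (wait forever).
fetch.message.max.bytes: size of the chunk fetched per request (default 1024*1024). Messages are sent to the consumer in batches, chunk by chunk, so the chunk size determines how many messages can be fetched at once. Note that a single message larger than this size cannot be fetched.
queued.max.message.chunks: maximum number of chunks buffered in the consumer (default 10).

More settings can be found in the ConsumerConfig class.

Producer configuration

Settings for the producer, or for producer.properties:

metadata.broker.list: list of Kafka brokers used to fetch the metadata; it does not have to contain all of them.
request.required.acks: when a produce request counts as complete. 0 (default) means the producer does not wait for an ack from the broker. 1 means the ack is sent once the leader replica has received the message (a message is lost if the leader dies before the followers have copied it). -1 means the ack is sent once all in-sync replicas have received the message (no data is lost as long as one in-sync replica survives).
producer.type: synchronous or asynchronous sending; sync (default) or async. Async increases throughput but may lose messages that have not been sent yet.
queue.buffering.max.ms: in async mode, the maximum time messages are buffered before being sent.
queue.buffering.max.messages: in async mode, the maximum number of messages held in the buffer.
serializer.class: serializer class. Default kafka.serializer.DefaultEncoder, which passes byte[] through.
key.serializer.class: separate serializer class for keys. Defaults to the value of serializer.class.
partitioner.class: partitioner class. Default kafka.producer.DefaultPartitioner, which maps a key to a partition by hashing it.
message.send.max.retries: number of send retries. Default 3.
retry.backoff.ms: time to wait between send retries.
compression.codec: compression. Default 0 means none, 1 means gzip, 2 means snappy. Compressed messages carry a header naming the codec, so decompression on the consumer side is transparent and needs no configuration.
compressed.topics: if compression is enabled, the topics whose messages are compressed. Default empty, meaning all topics are compressed.

More settings can be found in the ProducerConfig class.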
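As an illustration of how the producer settings combine, here is a minimal sketch that only builds a java.util.Properties object and does not need the Kafka jar. The class name ProducerSettings, the method asyncDurable, and the buffering and backoff values are assumptions made for this example. The async value for producer.type follows the Kafka 0.8 configuration documentation; the other keys and values come from the list above.

```java
import java.util.Properties;

public class ProducerSettings {
    /**
     * Builds settings for an async producer that waits for all in-sync replicas.
     * The numeric buffering and backoff values are illustrative, not recommendations.
     */
    public static Properties asyncDurable(String brokerList) {
        Properties props = new Properties();
        props.setProperty("metadata.broker.list", brokerList);
        props.setProperty("serializer.class", "kafka.serializer.StringEncoder");
        // -1: ack only after all in-sync replicas have the message.
        props.setProperty("request.required.acks", "-1");
        props.setProperty("producer.type", "async");
        // Async buffering limits (assumed values for this sketch).
        props.setProperty("queue.buffering.max.ms", "500");
        props.setProperty("queue.buffering.max.messages", "10000");
        props.setProperty("message.send.max.retries", "3");
        props.setProperty("retry.backoff.ms", "100"); // assumed value
        // 1 selects gzip according to the list above.
        props.setProperty("compression.codec", "1");
        return props;
    }

    public static void main(String[] args) {
        Properties props = asyncDurable("172.17.1.163:9093");
        System.out.println(props.getProperty("producer.type")
                + " acks=" + props.getProperty("request.required.acks"));
    }
}
```

The resulting object would then be passed to new ProducerConfig(props), as in the producer example at the top of the article.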