
How do I extract data from cells merged across two or more rows?

1#
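How should I extract data from a table like the one below, where a cell is merged across two or more rows? The main difficulty is locating the cells with relative XPath and then extracting the data.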

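Screenshot of the page to be scraped

In the collection workflow, no merged cells: the extraction template displays correctly

No merged cells: the data preview displays correctly

With merged cells: the extraction template displays incorrectly

With merged cells: the data preview displays incorrectly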
2#

Here is the page source. Some blocks that repeat the same format have not been expanded.

  <document>    
   <center>
    <font size="3" color="#ff00ff" face="5">
     <div class="style2" align="center">
      <br />
      <font size="2"> <p> <font size="2" color="#8000FF">各次填报最高分最低分情况</font> </p>
       <table cellspacing="0" border="1" align="center">
        <tbody>
         <tr>
          <td class="style1"> <strong> <p align="CENTER">填报次序</p> </strong> </td>
          <td class="style1"> <strong> <p align="CENTER">最高分</p> </strong> </td>
          <td class="style1"> <strong> <p align="CENTER">最低分</p> </strong> </td>
          <td class="style1"> <strong> <p align="CENTER">录取人数</p> </strong> </td>
         </tr>
         <tr>
          <td class="style1"> <p align="CENTER">第1次填报</p> </td>
          <td class="style1"> <p align="CENTER">383</p> </td>
          <td class="style1"> <p align="CENTER">161</p> </td>
          <td class="style1"> <p align="CENTER">970</p> </td>
         </tr>
         <tr>
          <td class="style1"> <p align="CENTER">第2次填报</p> </td>
          <td class="style1"> <p align="CENTER">318</p> </td>
          <td class="style1"> <p align="CENTER">161</p> </td>
          <td class="style1"> <p align="CENTER">17</p> </td>
         </tr>
         <tr>
          <td class="style1"> <p align="CENTER">第3次填报</p> </td>
          <td class="style1"> <p align="CENTER">258</p> </td>
          <td class="style1"> <p align="CENTER">165</p> </td>
          <td class="style1"> <p align="CENTER">4</p> </td>
         </tr>
        </tbody>
       </table> <p></p> <p> <font size="2" color="#8000FF">各专业最高分最低分情况</font> </p>
       <table cellspacing="0" border="1" align="center">
        <tbody>
         <tr>
          <td class="style1"> <strong> <p align="CENTER">专业代号</p> </strong> </td>
          <td class="style1"> <strong> <p align="CENTER">专业名称</p> </strong> </td>
          <td class="style1"> <strong> <p align="CENTER">填报次序</p> </strong> </td>
          <td class="style1"> <strong> <p align="CENTER">最高分</p> </strong> </td>
          <td class="style1"> <strong> <p align="CENTER">最低分</p> </strong> </td>
          <td class="style1"> <strong> <p align="CENTER">最低分位次</p> </strong> </td>
          <td class="style1"> <strong> <p align="CENTER">录取人数</p> </strong> </td>
         </tr>
         <tr>
          <td class="style1" rowspan="2"> <p align="CENTER">01</p> </td>
          <td class="style1" rowspan="2"> <p align="left"> <a href="lqmaxmin_2.jsp?pcdm=7&amp;kldm=A&amp;yxdh=204&amp;zydh=01">电力系统自动化技术</a> </p> </td>
          <td class="style1"> <p align="CENTER">第1次填报</p> </td>
          <td class="style1"> <p align="CENTER">364</p> </td>
          <td class="style1"> <p align="CENTER">161</p> </td>
          <td class="style1"> <p align="CENTER">40654</p> </td>
          <td class="style1"> <p align="CENTER">59</p> </td>
         </tr>
         <tr>
          <td class="style1"> <p align="CENTER">第2次填报</p> </td>
          <td class="style1"> <p align="CENTER">216</p> </td>
          <td class="style1"> <p align="CENTER">214</p> </td>
          <td class="style1"> <p align="CENTER">39616</p> </td>
          <td class="style1"> <p align="CENTER">2</p> </td>
         </tr>
         <tr>
          <td class="style1" rowspan="3"> <p align="CENTER">02</p> </td>
          <td class="style1" rowspan="3"> <p align="left"> <a href="lqmaxmin_2.jsp?pcdm=7&amp;kldm=A&amp;yxdh=204&amp;zydh=02">会计</a> </p> </td>
          <td class="style1"> <p align="CENTER">第1次填报</p> </td>
          <td class="style1"> <p align="CENTER">383</p> </td>
          <td class="style1"> <p align="CENTER">291</p> </td>
          <td class="style1"> <p align="CENTER">34415</p> </td>
          <td class="style1"> <p align="CENTER">91</p> </td>
         </tr>
         <tr>
          <td class="style1"> <p align="CENTER">第2次填报</p> </td>
          <td class="style1"> <p align="CENTER">318</p> </td>
          <td class="style1"> <p align="CENTER">161</p> </td>
          <td class="style1"> <p align="CENTER">40654</p> </td>
          <td class="style1"> <p align="CENTER">6</p> </td>
         </tr>
         <tr>
          <td class="style1"> <p align="CENTER">第3次填报</p> </td>
          <td class="style1"> <p align="CENTER">207</p> </td>
          <td class="style1"> <p align="CENTER">207</p> </td>
          <td class="style1"> <p align="CENTER">39849</p> </td>
          <td class="style1"> <p align="CENTER">1</p> </td>
         </tr>
         <tr>
          <td class="style1" rowspan="1"> <p align="CENTER">07</p> </td>
          <td class="style1" rowspan="1"> <p align="left"> <a href="lqmaxmin_2.jsp?pcdm=7&amp;kldm=A&amp;yxdh=204&amp;zydh=07">建筑工程技术</a> </p> </td>
          <td class="style1"> <p align="CENTER">第1次填报</p> </td>
          <td class="style1"> <p align="CENTER">292</p> </td>
          <td class="style1"> <p align="CENTER">171</p> </td>
          <td class="style1"> <p align="CENTER">40580</p> </td>
          <td class="style1"> <p align="CENTER">10</p> </td>
         </tr>
         <tr>
          <td class="style1" rowspan="1"> <p align="CENTER">08</p> </td>
          <td class="style1" rowspan="1"> <p align="left"> <a href="lqmaxmin_2.jsp?pcdm=7&amp;kldm=A&amp;yxdh=204&amp;zydh=08">计算机网络技术</a> </p> </td>
          <td class="style1"> <p align="CENTER">第1次填报</p> </td>
          <td class="style1"> <p align="CENTER">356</p> </td>
          <td class="style1"> <p align="CENTER">258</p> </td>
          <td class="style1"> <p align="CENTER">37307</p> </td>
          <td class="style1"> <p align="CENTER">60</p> </td>
         </tr>
         <tr>
         </tr>
         .         .
         .
         .
         .
         <tr>
         </tr>
        </tbody>
       </table> <p></p>
       <hr /> <br /> <br /> <font size="2" color="#00aaff" face="5">教育招生考试中心版权所有,未经授权,不得转载或链接。</font> <br /> <br /> </font>
     </div></font>
   </center>
   <font size="3" color="#ff00ff" face="5"><font size="2">
     <div class="firebugResetStyles firebugBlockBackgroundColor" style="left: 8px !important; top: -68.3167px !important; width: 1392.75px !important; height: 1062px !important; border-radius: 0px !important; box-shadow: 0px 0px 2px 2px highlight !important;">  
     </div></font></font>
  </document>
Last edited by ne**p1 on 2019-07-03 00:47:34
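
For readers who can post-process the HTML outside the scraping tool, here is a minimal sketch of one common way to handle `rowspan`: carry each merged cell forward into the rows it spans, so every output row has the same number of columns. It assumes the markup is well-formed enough for Python's standard `xml.etree.ElementTree`. `SAMPLE` is a hypothetical, trimmed-down copy of the second table above, and `expand_rowspans` is a name made up for this illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, shortened version of the "per major" table from the post.
SAMPLE = """
<table>
 <tr>
  <td rowspan="2"><p>01</p></td>
  <td rowspan="2"><p><a href="#">电力系统自动化技术</a></p></td>
  <td><p>第1次填报</p></td><td><p>364</p></td>
 </tr>
 <tr><td><p>第2次填报</p></td><td><p>216</p></td></tr>
 <tr>
  <td rowspan="1"><p>07</p></td>
  <td rowspan="1"><p><a href="#">建筑工程技术</a></p></td>
  <td><p>第1次填报</p></td><td><p>292</p></td>
 </tr>
</table>
"""


def expand_rowspans(table):
    """Return rows in which each merged cell is repeated in every row it spans."""
    pending = {}  # column index -> [rows still to fill, cell text]
    grid = []
    for tr in table.iter("tr"):
        tds = tr.findall("td")
        cells = ["".join(td.itertext()).strip() for td in tds]
        spans = [int(td.get("rowspan", "1")) for td in tds]
        if not cells and not pending:
            continue  # empty placeholder row such as <tr></tr>
        row, col, i = [], 0, 0
        while i < len(cells) or col in pending:
            if col in pending:
                # This column is covered by a cell merged down from an earlier row.
                remaining, text = pending[col]
                row.append(text)
                if remaining == 1:
                    del pending[col]
                else:
                    pending[col][0] = remaining - 1
            else:
                row.append(cells[i])
                if spans[i] > 1:
                    pending[col] = [spans[i] - 1, cells[i]]
                i += 1
            col += 1
        grid.append(row)
    return grid


grid = expand_rowspans(ET.fromstring(SAMPLE))
for row in grid:
    print(row)
```

The second printed row starts with `01` and the major name even though its `<tr>` contains only two `<td>` elements, which is the behaviour the merged-cell preview in the first post is missing.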
3#

Reply to ne**p1's post (#1)

The two branches of the conditional must have the same number of fields and the same field names. For details, see the tutorial: https://www.bazhuayu.com/tutorial/judge
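
The same idea can be sketched outside the tool in plain Python. This is only an illustration of the two-branch logic, not the tool's own mechanism: one branch handles rows that carry the merged cells (7 cells in the table from #2), the other handles continuation rows (5 cells), and both emit records with identical field names. The English names in `FIELDS` and the function name `rows_to_records` are assumptions made up for this example, and the header row is assumed to have been dropped beforehand.

```python
# Hypothetical English field names for the seven columns of the table in #2.
FIELDS = ["code", "name", "round", "max", "min", "min_rank", "admitted"]


def rows_to_records(rows):
    """rows: one list of cell texts per data <tr>, exactly as found in the HTML."""
    records = []
    last_code = last_name = None
    for cells in rows:
        if len(cells) == 7:
            # Branch 1: the row carries the merged code/name cells; remember them.
            last_code, last_name = cells[0], cells[1]
            rest = cells[2:]
        elif len(cells) == 5:
            # Branch 2: continuation row; reuse the remembered code/name.
            rest = cells
        else:
            continue  # e.g. an empty <tr></tr> placeholder
        # Both branches produce the same fields in the same order.
        records.append(dict(zip(FIELDS, [last_code, last_name, *rest])))
    return records


rows = [
    ["01", "电力系统自动化技术", "第1次填报", "364", "161", "40654", "59"],
    ["第2次填报", "216", "214", "39616", "2"],
]
for record in rows_to_records(rows):
    print(record)
```

Because both branches fill the same seven fields, the continuation row no longer shifts its values into the wrong columns.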
4#

Did you ever get this solved, buddy?