I am having difficulty building a regex. Suppose there is an HTML clip like the one below. I want to use JavaScript to cut out each <tbody> block that contains the link "apple" (where the <a> is inside the <td class="by">). I constructed the following expression:

    /<tbody.*?text[\s\S]*?<td class="by"[\s\S]*?<a.*?>apple<\/a>[\s\S]*?<\/tbody>/g

But the result is different from what I wanted: each match contains more than one <tbody> block. How should it be written? Regards! (I tested with https://regex101.com/ and got the unexpected selection. Please forgive me, I can't figure out the problem :( )

    <tbody id="text_0"><td class="by">...lots of other tags<a href="xxx">cat</a> ...lots of other tags</td></tbody>
    <tbody id="text_1"> ...lots of other tags<td class="by"><a href="xxx">apple</a></td> ...lots of other tags</tbody>
    <tbody id="text_2"> ...lots of other tags<td class="by"><a href="xxx">cat</a></td> ...lots of other tags</tbody>
    <tbody id="text_3"> ...lots of other tags<td class="by"> ...lots of other tags<a href="xxx">tiger</a></td> ...lots of other tags</tbody>
    <tbody id="text_4"><td class="by"><a href="xxx">banana</a></td></tbody>
    <tbody id="text_5"><td class="by"><a href="xxx">peach</a></td></tbody>
    <tbody id="text_6"><td class="by"><a href="xxx">apple</a></td></tbody>
    <tbody id="text_7"><td class="by"><a href="xxx">banana</a></td></tbody>

And this is what I expect to get:

    <tbody id="text_1"><td class="by"><a href="xxx">apple</a></td></tbody>
    <tbody id="text_6"><td class="by"><a href="xxx">apple</a></td></tbody>


Try putting it on regex101.com to see what is going wrong. For starters, the text[\s\S] doesn't make sense.

2019年04月22日04分00秒

Oh, sorry, the condition also selects only the <tbody> elements whose id begins with "text". There are lots of other <tbody> elements with other id series, but I didn't put them in the question.

2019年04月22日04分00秒

Before I posted the question, I tested with regex101.com and got the unexpected selection. I have no idea how to figure it out.

2019年04月22日04分00秒

Include the link to regex101.com in your question.

2019年04月22日04分00秒

See this question on SO for more information about why regex won't work: stackoverflow.com/questions/590747/…

2019年04月22日04分00秒

The real HTML I am working on is correctly structured, but it is huge, so I simplified it for the question. The difficulty for me is that I can't get the right selection when testing on regex101.com.

2019年04月22日04分00秒

I tried with the DOM and it works well. The only problem is that it is very slow, which gives my boss a bad feeling, and me too. When I tried with regex it was much, much faster, but it did not give the right result :( The real HTML I work on could be a few hundred KB.

2019年04月22日04分00秒

Thanks, it works like magic. I will try to learn from your answer and make it work with my real "html clip". Thank you very much again!! You saved my day!

2019年04月22日04分00秒

Sorry, it does not work if there are other arbitrary tags inside the <tbody>. For example, if the second block looks like this, it will not be selected: <tbody id="text_1"> <sas></saa> <td class="by"> <a href="xxx">apple</a> </td> </tbody>

2019年04月22日04分00秒

Yes, that is correct. Now you see why everyone is saying not to use regex. Regular expressions only work for regular languages, and HTML is not a regular language. This is like trying to drive a nail into a board using a screwdriver instead of a hammer. You're using the wrong tool for the job.

2019年04月22日04分00秒

Agreed. In the end I used XPath to solve the problem. In terms of speed, it is a little slower than the regex but much faster than DOM/JQuerySelector.

2019年04月22日04分00秒
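
The XPath that was actually used is not shown in the thread. As a rough sketch only, an expression along these lines would select every <tbody> whose id starts with "text" and which has an "apple" link inside a <td class="by">. The helper names buildTbodyXPath and selectTbodyHtml are hypothetical, the link text is assumed to contain no double quotes, and doc is assumed to be an already parsed Document in a browser (for example the page's own document), with the <tbody> elements inside a <table>.

```javascript
// Hypothetical helper: builds the XPath expression for a given link text.
// Assumption: linkText contains no double quotes (XPath 1.0 has no escaping).
function buildTbodyXPath(linkText) {
  if (linkText.includes('"')) {
    throw new Error("link text with double quotes is not handled in this sketch");
  }
  return (
    '//tbody[starts-with(@id, "text")]' +
    '[.//td[@class="by"]//a[normalize-space(.) = "' + linkText + '"]]'
  );
}

// Hypothetical helper: returns the outer HTML of each matching <tbody>.
// Needs a browser-like environment that provides Document.evaluate
// and the XPathResult constants.
function selectTbodyHtml(doc, linkText) {
  const result = doc.evaluate(
    buildTbodyXPath(linkText),
    doc,  // context node: search the whole document
    null, // no namespace resolver needed for HTML
    XPathResult.ORDERED_NODE_SNAPSHOT_TYPE,
    null
  );
  const blocks = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    blocks.push(result.snapshotItem(i).outerHTML);
  }
  return blocks;
}
```

In a browser, selectTbodyHtml(document, "apple") would be the call. Because the condition is expressed on the parsed tree, extra tags before or after the link inside the <tbody> do not affect the selection.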

Yes, but I want to get the whole block from <tbody> ... </tbody> as the return value of string.match(reg).

2019年04月22日04分00秒

Well, then add it to the regular expression, as in putting <tbody.*?>\s*<td.*?>\s* in front of the starting regular expression I gave you. The point is that you have to build these up, starting with something that works.

2019年04月22日04分00秒

This is a simplified HTML clip. The target I am working on has lots of other tags inside the <tbody>, and I would like to select the whole <tbody> block that has <a>apple</a> inside.

2019年04月22日04分00秒

I think I must put something like id="text.*? after <tbody, and since there are "lots of other tags" inside the <tbody> (before and after the <a>), I need [\s\S]*? to match across line breaks.

2019年04月22日04分00秒
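
For what it is worth, the over-long matches come from the lazy [\s\S]*? itself: lazy only means "as short as possible from this starting point", so when the first <tbody> has no "apple" link, the engine keeps extending past its </tbody> into the next block. One common workaround, sketched here under the assumption that <tbody> elements are never nested and that the attributes are written exactly as in the sample, is to forbid each filler from crossing a closing tag with a negative lookahead. The names escapeRegExp, buildTbodyRegex, and sampleHtml are made up for this illustration, and this is not the regex from the answer mentioned above.

```javascript
// Escape characters that have a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: builds a regex that matches one whole <tbody> block
// whose id starts with "text" and whose <td class="by"> contains the link.
function buildTbodyRegex(linkText) {
  // Any characters, but never step over a closing </tbody> or </td>.
  const inTbody = "(?:(?!</tbody>)[\\s\\S])*?";
  const inTd = "(?:(?!</td>)[\\s\\S])*?";
  return new RegExp(
    '<tbody\\b[^>]*\\bid="text[^"]*"[^>]*>' +
      inTbody +
      '<td\\b[^>]*\\bclass="by"[^>]*>' +
      inTd +
      "<a\\b[^>]*>" + escapeRegExp(linkText) + "</a>" +
      inTbody +
      "</tbody>",
    "g"
  );
}

// Sample data modelled on the clip in the question.
const sampleHtml = [
  '<tbody id="text_0"><td class="by">...lots of other tags<a href="xxx">cat</a> ...lots of other tags</td></tbody>',
  '<tbody id="text_1"> ...lots of other tags<td class="by"><a href="xxx">apple</a></td> ...lots of other tags</tbody>',
  '<tbody id="text_2"> ...lots of other tags<td class="by"><a href="xxx">cat</a></td> ...lots of other tags</tbody>',
  '<tbody id="text_3"> ...lots of other tags<td class="by"> ...lots of other tags<a href="xxx">tiger</a></td> ...lots of other tags</tbody>',
  '<tbody id="text_4"><td class="by"><a href="xxx">banana</a></td></tbody>',
  '<tbody id="text_5"><td class="by"><a href="xxx">peach</a></td></tbody>',
  '<tbody id="text_6"><td class="by"><a href="xxx">apple</a></td></tbody>',
  '<tbody id="text_7"><td class="by"><a href="xxx">banana</a></td></tbody>',
].join("\n");

// With the g flag, match() returns all full matches, or null if there are none.
const matches = sampleHtml.match(buildTbodyRegex("apple")) || [];
```

On this sample, matches holds the text_1 and text_6 blocks, each as a single <tbody> ... </tbody> string. The sketch still depends on the exact attribute spelling (double quotes, class="by" as the whole class value), so it remains fragile on real-world markup, which is the point the other comments are making.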

As I mentioned, regular expressions are not ideal for this.

2019年04月22日04分00秒