Duplicate page.url as url.*, to make it searchable #3827

hmdhk · 2020-05-04T07:15:58Z

For RUM page load transactions and errors, index the page URL into ECS url fields so it is full-text searchable. For backwards compatibility (up until 8.0), we would need to continue indexing into page.url as well.

Original description

The current Transaction.page.url is not indexed and therefore not searchable (it's only used in the UI). Since searching over the page url has been requested before we should consider indexing this field.

We should, also, clean up the url for indexing. e.g. removing query strings. another possiblity is to add a fullUrl field, that contains the url without any change and set the clean up url on page.url

The text was updated successfully, but these errors were encountered:

vigneshshanmugam · 2020-05-04T08:24:00Z

What about using the request.url which has the full structure of URL defined properly in the spec.

apm-server/docs/spec/request.json

Lines 51 to 96 in 5d93733

 "url": { 

 "description": "A complete Url, with scheme, host and path.", 

 "type": "object", 

 "properties": { 

 "raw": { 

 "type": ["string", "null"], 

 "description": "The raw, unparsed URL of the HTTP request line, e.g https://example.com:443/search?q=elasticsearch. This URL may be absolute or relative. For more details, see https://www.w3.org/Protocols/rfc2616/rfc2616-sec5.html#sec5.1.2.", 

 "maxLength": 1024 

 }, 

 "protocol": { 

 "type": ["string", "null"], 

 "description": "The protocol of the request, e.g. 'https:'.", 

 "maxLength": 1024 

 }, 

 "full": { 

 "type": ["string", "null"], 

 "description": "The full, possibly agent-assembled URL of the request, e.g https://example.com:443/search?q=elasticsearch#top.", 

 "maxLength": 1024 

 }, 

 "hostname": { 

 "type": ["string", "null"], 

 "description": "The hostname of the request, e.g. 'example.com'.", 

 "maxLength": 1024 

 }, 

 "port": { 

 "type": ["string", "integer","null"], 

 "description": "The port of the request, e.g. '443'", 

 "maxLength": 1024 

 }, 

 "pathname": { 

 "type": ["string", "null"], 

 "description": "The path of the request, e.g. '/search'", 

 "maxLength": 1024 

 }, 

 "search": { 

 "description": "The search describes the query string of the request. It is expected to have values delimited by ampersands.", 

 "type": ["string", "null"], 

 "maxLength": 1024 

 }, 

 "hash": { 

 "type": ["string", "null"], 

 "description": "The hash of the request URL, e.g. 'top'", 

 "maxLength": 1024 

 } 

 } 

 },

Not sure if its searchable as well. /cc @elastic/apm-server Please let us know your thoughts.

moix · 2020-05-21T09:01:10Z

Hi, I was looking for something similar and found this ticket. I'm quite new to elastic apm and rum and are investigating now documents structure, I also read about ecs recommendations and was wondering if the url shouldn't be under url? url.registered_domain looks more appropiate? (https://www.elastic.co/guide/en/ecs/current/ecs-url.html) or am I misunderstanding something?

Thanks!

axw · 2020-05-27T13:03:21Z

I tend to agree, @moix. If we record the page URL as url.full in Elasticsearch, then it will be searchable (using url.full.text) as it's a multi-field.

@vigneshshanmugam In theory you could send request.url, but if you send request.url you also have to send request.method. It's meant for HTTP requests, not browser page loads (which might be stored locally, and have no corresponding HTTP request...)

We can alternatively update the server to store context.page.url as url.full in Elasticsearch. I suppose we would need to duplicate the URL in page.url for backwards compatibility, at least until 8.0. @jahtalab @vigneshshanmugam what do you think?

vigneshshanmugam · 2020-05-27T13:19:16Z

We can alternatively update the server to store context.page.url as url.full in Elasticsearch. I suppose we would need to duplicate the URL in page.url for backwards compatibility, at least until 8.0.

This solution sound good to me. Also I am in favour of the URL duplication work to be done in the server as it would reduce the payload size in case if we need to duplicate from RUM side.

hmdhk · 2020-05-27T14:21:19Z

Thanks @axw, duplicating the context.page.url as url.full sounds good to me, the only concern here is that page.url contains all parts of the url (including the query string and fragment) which in practice could be very long. So the server needs to shorten the url, instead of rejecting the payload if the length exceeds the allowed limit.

Also, if we plan to remove context.page.url from 8.0 forward then this requires a change in APM UI to use the new field!

axw · 2020-05-28T01:16:03Z

Cool. I'm going to transfer this to apm-server then, and we'll take it from there.

hmdhk · 2020-05-28T07:18:20Z

@axw, two questions:

Do you think we can get this on 7.9 milestone?
Is there a possibility to reindex existing data to duplicate the url when users upgrade? This is not a requirement but only nice to have, if the complexity of reindexing is not huge.

axw · 2020-05-28T08:06:51Z

@jahtalab

Do you think we can get this on 7.9 milestone?

I think we can probably manage that.

Is there a possibility to reindex existing data to duplicate the url when users upgrade? This is not a requirement but only nice to have, if the complexity of reindexing is not huge.

It's theoretically possible, but unlikely that we would do that - at least not for 7.9, and likely not in the foreseeable future. We don't have any infrastructure in place for things like this.

We could suggest an _update_by_query script to run manually though.

hmdhk · 2020-05-28T12:10:53Z

Thanks @axw , I think it's fine to duplicate url only on new data for now.

simitt · 2020-05-29T07:29:27Z

What is the usecase for collecting page.url.*? Depending on the usecase it might make sense to not only duplicate the field in url.full but split it into it's parts and fill as many of the ECS defined url.* fields as applicable.

Since the url.full is indexed as multifield, I'd also not trim it to 1024 chars but leave it unchanged.

hmdhk · 2020-05-29T08:26:13Z

Good idea @simitt , for now the use-cases have been around trying to search for a particular url. but I think if we start splitting the url into its fields (as well as keeping the full url), it can help address many more use-cases (e.g. ad-hoc search queries based on scheme).

cc @andrewvc @shahzad31 @justinkambic @drewpost

hmdhk · 2020-05-29T14:31:19Z

I've updated the description to apply the same solution for both Transactions and Errors

drewpost · 2020-06-01T11:20:21Z

I like @simitt suggest to break it down as much as possible. Question - would you propose normalising the query strings or storing them as received?

simitt · 2020-06-02T10:58:50Z

I was not thinking to normalize them; we could reuse the logic we have for jaeger in place https:/elastic/apm-server/blob/master/processor/otel/consumer.go#L508.

simitt · 2020-06-03T13:21:44Z

While on it, we could also map page.referer to the ECS field http.request.referer.

jalvz · 2020-06-08T09:37:26Z

where should the duplicated field be stored?
url is used already for context.request.url

axw · 2020-06-08T09:45:46Z

@jalvz the (previously unwritten) assumption here is that you only ever pass context.page or context.request. I would suggest we codify that by making the JSON Schema more strict.

In either case they would be recorded in the url.* fields.

jalvz · 2020-06-09T08:50:55Z

@jahtalab any reasons to not go with #3827 (comment)? (This was expected at least by 1 customer)

hmdhk · 2020-06-09T09:01:52Z

I don't see any reason why not. Other than the fact that ECS field is referrer and the actual http header name is referer wrong spelling but it is the standard 😄

axw transferred this issue from elastic/apm-agent-rum-js May 28, 2020

axw changed the title ~~Consider indexing page url~~ Duplicate page.url as url.*, to make it searchable May 28, 2020

axw added the enhancement label May 28, 2020

axw added the [zube]: Ready label Jun 3, 2020

jalvz self-assigned this Jun 8, 2020

jalvz mentioned this issue Jun 8, 2020

Index page URL as ECS field #3857

Merged

10 tasks

jalvz closed this as completed in #3857 Jun 10, 2020

zube bot added [zube]: Done and removed [zube]: Ready labels Jun 10, 2020

axw added ecs and removed [zube]: Done labels Jun 10, 2020

andrewvc added the v7.10.0 label Oct 13, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate page.url as url.*, to make it searchable #3827

Duplicate page.url as url.*, to make it searchable #3827

hmdhk commented May 4, 2020 •

edited

Loading

vigneshshanmugam commented May 4, 2020

moix commented May 21, 2020

axw commented May 27, 2020

vigneshshanmugam commented May 27, 2020

hmdhk commented May 27, 2020

axw commented May 28, 2020

hmdhk commented May 28, 2020

axw commented May 28, 2020

hmdhk commented May 28, 2020

simitt commented May 29, 2020

hmdhk commented May 29, 2020

hmdhk commented May 29, 2020

drewpost commented Jun 1, 2020

simitt commented Jun 2, 2020

simitt commented Jun 3, 2020

jalvz commented Jun 8, 2020

axw commented Jun 8, 2020 •

edited

Loading

jalvz commented Jun 9, 2020

hmdhk commented Jun 9, 2020

Duplicate page.url as url.*, to make it searchable #3827

Duplicate page.url as url.*, to make it searchable #3827

Comments

hmdhk commented May 4, 2020 • edited Loading

vigneshshanmugam commented May 4, 2020

moix commented May 21, 2020

axw commented May 27, 2020

vigneshshanmugam commented May 27, 2020

hmdhk commented May 27, 2020

axw commented May 28, 2020

hmdhk commented May 28, 2020

axw commented May 28, 2020

hmdhk commented May 28, 2020

simitt commented May 29, 2020

hmdhk commented May 29, 2020

hmdhk commented May 29, 2020

drewpost commented Jun 1, 2020

simitt commented Jun 2, 2020

simitt commented Jun 3, 2020

jalvz commented Jun 8, 2020

axw commented Jun 8, 2020 • edited Loading

jalvz commented Jun 9, 2020

hmdhk commented Jun 9, 2020

hmdhk commented May 4, 2020 •

edited

Loading

axw commented Jun 8, 2020 •

edited

Loading